Deep Reinforcement Learning for Demand Driven Services in Logistics and Transportation Systems: A Survey

08/10/2021 · by Zefang Zong et al., Tsinghua University

Recent technological development has brought numerous new Demand-Driven Services (DDS) into urban life, including ridesharing, on-demand delivery, express systems, and warehousing. In DDS, a service loop is the elemental structure, comprising a service worker, the service providers, and the corresponding service targets. The service worker transports either humans or parcels from the providers to the target locations. Various planning tasks within DDS can thus be classified into two individual stages: 1) Dispatching, which forms service loops from demand/supply distributions, and 2) Routing, which decides specific serving orders within the constructed loops. Generating high-quality strategies in both stages is important for developing DDS but faces several challenges. Meanwhile, deep reinforcement learning (DRL) has developed rapidly in recent years. It is a powerful tool for these problems, since DRL can learn a parametric model without relying on too many problem-based assumptions and can optimize long-term effects by learning sequential decisions. In this survey, we first define DDS, then highlight common applications and important decision/control problems within them. For each problem, we comprehensively introduce the existing DRL solutions and further summarize them in comparison tables. We also introduce open simulation environments for the development and evaluation of DDS applications. Finally, we analyze remaining challenges and discuss further research opportunities in DRL solutions for DDS.







1 Introduction

The continuous urbanization and the development of mobile communication have brought many new application demands into urban daily life. Among them, services that transport either humans or parcels to given destinations following individual demands or system requirements are critical in both urban logistics and transportation. We define such services as Demand-Driven Services (DDS). For example, on-demand food delivery, a typical DDS, is widely used since it significantly improves dining convenience. More than 30 million orders are generated every day on the Meituan-Dianping platform, one of the world's largest on-demand delivery service providers [1]. As another example, large-scale online ridesharing services such as Uber and DiDi have substantially transformed the transportation landscape, offering huge opportunities for boosting transportation efficiency. These DDS applications bring striking efficiency to city operations in both logistics and transportation, as well as many opportunities to related research fields. Intelligent control with minimal manual intervention in DDS systems is critical to guarantee their effectiveness and has drawn much research interest.

In a typical DDS task, there are several roles that an implemented system should consider, including the service workers, service providers, and corresponding targets. For example, in an on-demand delivery system, a customer who orders food can be seen as a service target, while the restaurant from which the food is ordered is the service provider. A group of such delivery tasks is then assigned to and accomplished by a courier, i.e., the worker. These core DDS elements form a DDS loop; such an example is illustrated in Figure 1. The same formulation can also be constructed in the ridesharing scenario: each customer who calls for a ride has his/her destination as the target, and the driver serves as the service worker. The DDS platforms that support either delivery or ridesharing services are supposed to provide corresponding algorithms to 1) construct reasonable service loops and 2) guide workers to complete assignments within loops. We show DDS loop formulations for several typical scenarios in Table I, including on-demand delivery, ridesharing, express systems, and warehousing.


DDS Scenario        | Service Provider      | Service Target        | Service Worker
On-demand Delivery  | Restaurant            | Customer              | Courier
Ridesharing         | Passenger Origin      | Destination           | Driver
Express (Sending)   | Consignor             | Depot                 | Courier
Express (Delivery)  | Depot                 | Consignee             | Courier
Warehousing         | Shelf, Entry, Station | Shelf, Entry, Station | AGV


TABLE I: Elements in a DDS loop of several typical DDS scenarios, including on-demand delivery, ridesharing, express systems and warehousing. AGV is the abbreviation for Autonomous Guiding Vehicle.
Fig. 1: The visualization of two independent service loops using instant delivery as an example. The restaurant, customer, and courier serve as the service provider, service target, and service worker respectively.

With the fundamental DDS elements defined, how to manage different service demand pairs (providers and targets), schedule available service workers, and control the entire service system become the major objectives of a centralized intelligent DDS platform. Major research problems can be classified into two aspects. First, forming DDS loops from demand pairs and workers, also named Dispatching, is the first-hand challenge. The loop-forming process, i.e., the Matching between demands and workers, originates from the traditional bipartite graph matching problem, while the dynamic features of the environment bring much more complexity. A good dispatching mechanism should not only consider the current states of workers and scattered demands but also take future distributions into account for long-term optimization. Furthermore, even if a worker is not matched with service demands at present, there still remains a large action space for arranging idling workers into other areas, which forms the Fleet Management problem. The loop-forming stage can be seen as the first stage of the complete DDS.

Second, after numerous demands are assigned, how to execute the formed loops, i.e., to schedule the detailed Routing strategies, including planning the visiting orders of the demand set and route selection on real-world road maps, is also critical to the entire system's efficiency. The routing problem derives from the conventional Traveling Salesman Problem [2], where a salesman is supposed to visit all cities without revisiting any of them. The Vehicle Routing Problem (VRP) [3] and its variants further provide valuable mathematical formulations of most real-world routing scenarios [4, 5, 6, 7]. A high-quality routing strategy should minimize the total traveling distance to decrease workers' expenses. The routing stage can be seen as the second stage, after dispatching. Robust and stable routing strategy generation is also important to feed decision information back to the dispatching stage. We illustrate the relationship between the two stages in Figure 2.
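As a concrete illustration of the routing stage, the classic nearest-neighbor heuristic below greedily visits the closest unserved stop. This is a generic TSP baseline sketch, not a method from the surveyed literature; the helper name and coordinate format are our own:

```python
import math

def nearest_neighbor_route(start, stops):
    """Greedy TSP baseline: repeatedly travel to the closest unvisited stop.
    `start` and each stop are (x, y) coordinates."""
    route, current = [], start
    remaining = list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route
```

Such greedy constructions are exactly the kind of handcrafted heuristic that the learning-based routing methods surveyed later aim to outperform.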

Fig. 2: The overview of DDS problems, including the dispatching stage and the routing stage. We demonstrate the transformation from the original mathematical formulations to industrial applications on the vertical axis, and distinguish the two planning stages on the horizontal axis. Note that the two stages are not rigidly separated, but such a classification is necessary to concentrate on the primary challenges in different practical scenarios. A low demand/worker ratio implies that the primary challenge is to determine how workers and demands should be matched, while a large one indicates that the major optimization space lies in the routing stage. We discuss this relationship in detail in Sec. 2.3.

The solutions to the mathematical formulations of both stages were widely studied previously. For instance, the Kuhn-Munkres (KM) algorithm for bipartite graph matching and Branch-and-Bound for TSP and VRP can provide exact solutions for simple static problems of limited scale [8, 9]. Considering multiple real-world constraints and additional factors, more complicated dispatching and routing problems have also been investigated extensively in operations research, applied mathematics, etc. [10, 11, 12, 13]. In complicated scenarios with larger problem scales, exact optimization is almost impossible to obtain. Meanwhile, heuristics and meta-heuristics are widely accepted as alternatives that generate approximate solutions within a much more reasonable time in both stages of DDS [14, 15]. These heuristics-based methods can generate satisfactory solutions in online scenarios and are thus practical in many real-world DDS systems. However, there is still much potential for solutions with higher quality and higher efficiency at larger scales.
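As a tiny worked example of the matching problem underlying dispatching, the sketch below finds the minimum-cost worker-demand assignment by brute-force enumeration. It is illustrative only and assumes a square cost matrix; the Kuhn-Munkres algorithm mentioned above obtains the same result in O(n^3) time:

```python
from itertools import permutations

def min_cost_matching(cost):
    """Exhaustively search assignments: worker i serves demand perm[i].
    cost[i][j] is the cost of worker i serving demand j."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_perm, best_cost
```

For the cost matrix [[4, 1, 3], [2, 0, 5], [3, 2, 2]], the optimal assignment has total cost 5. Enumeration is exponential in n, which is precisely why polynomial-time exact methods and, at larger dynamic scales, learned policies are needed.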

As machine learning has shown astonishing performance in recent years, learning-based techniques hold great potential for further developing DDS systems. Reinforcement Learning (RL) methods have been developed and applied in many planning tasks. RL generates strategies by modeling a decision process as a Markov Decision Process (MDP). A predefined reward from a long-term perspective serves as the feedback signal for any action attempt, so RL can optimize sequential decisions. The trial-and-error process trains the agent to select the best action for different internal states and external environments. As deep neural networks provide much stronger capabilities in feature representation and pattern recognition, combining neural networks with RL shows great performance [17]. Many deep RL (DRL) algorithms have been proposed and have become state-of-the-art frameworks for control and scheduling tasks. DRL does not have to rely on manually designed assumptions and features, since it trains a parameterized model to learn the optimal control. It is thus natural to consider it as the framework for solving the series of planning tasks in DDS.

In this survey, we focus on how DRL can benefit the development of DDS systems in both the dispatching stage and the routing stage. We first introduce major DRL algorithms and four typical DDS in urban operations. Then we summarize existing DRL-based solutions along the following dimensions:

  • Problem. We classify the research problems in both the dispatching and routing stages into more precise sub-problems. In the dispatching stage, order dispatching and fleet management are investigated. As for the numerous DRL solutions for the routing stage, we first introduce those solving the typical Capacitated VRP (CVRP) as mathematical solutions, and then discuss more practical routing solutions for VRP variants. We consider four variant problems with additional constraints in this survey: the dynamic VRP (DVRP), the Electric VRP (EVRP), the VRP with Time Windows (VRPTW), and the VRP with Pickup and Delivery (VRPPD).

  • Scenario. The aforementioned research problems exist in several applicable scenarios, and four common DDS scenarios are included in this survey. In transportation systems, we introduce ridesharing services, where vehicles are assigned to transport passengers to their destinations. Specifically, ridesharing can be further classified into ride-hailing, where each driver serves only one passenger in a loop, and ride-pooling, where multiple passengers can share a ride at the same time. As for logistics systems, where parcels are transported from providers to targets, we summarize solutions for both on-demand delivery systems that fulfill people's instant demands and traditional express systems with longer service durations. We also introduce modern warehousing systems, where Autonomous Guiding Vehicles (AGVs) transport parcels from one location to another. Note that some important literature providing solutions within a purely mathematical formulation is also included [18, 19].

  • Algorithm. We distinguish the detailed RL algorithm used during model training. Most commonly used ones in existing works belong to model-free RL methods, including DQN [17], PPO [20], REINFORCE [21], etc. We also discuss whether the DDS task is constructed as a single agent MDP or a multi-agent one.

  • Network Structure. We also distinguish the neural network design in each work. Commonly used networks include Convolutional Neural Networks (CNN), Graph Neural Networks (GNN) and their variants (including GCN and others), and attention (ATT) based networks and their variants (including single- and multi-head attention).

  • Data Type and Data Scheme. We indicate the data type used in each work, either real-world or generated from pre-defined random seeds with a given distribution. Meanwhile, spatial locations are simplified in several ways to different extents. Generally, there are four data schemes derived from real road networks: 4-way connectivity with cardinal directions, 8-way connectivity with ordinal directions, 6-way connectivity based on hexagon grids, and the original discrete graph-based structure. Note that the first two can also be summarized as square grids. The different data schemes are shown in Figure 3.

  • Data and Code Availability.

    To present the extent of reproducibility of the investigated literature, we report the data availability of the proposed methods. A checkmark means that the data is released by the researchers or could be easily found via a direct web search. We also report the availability of the code. Both original open-sourced codes and re-implementation from the third party are considered.
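The grid-based data schemes listed under Data Type and Data Scheme can be sketched as neighbor-offset tables. This is an illustrative sketch of our own; the hexagon scheme assumes axial coordinates:

```python
# Offsets defining one-step reachability in each grid scheme.
CARDINAL_4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]                # square grid, 4-way
ORDINAL_8 = CARDINAL_4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # square grid, 8-way
HEX_6 = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]   # hexagon grid (axial)

def neighbors(cell, offsets, width, height):
    """Cells reachable in one step under the given scheme, clipped to the grid."""
    x, y = cell
    return [(x + dx, y + dy) for dx, dy in offsets
            if 0 <= x + dx < width and 0 <= y + dy < height]
```

The fourth scheme, a discrete road graph, is instead stored as an explicit adjacency structure rather than implicit offsets.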

Besides, we also introduce the available simulation environments for DDS, which are critical for simulating real-world scenarios at much lower expense. Finally, several challenges of using DRL to solve DDS and remaining open research problems are summarized.

Previous literature investigating related problems includes surveys by Haydari et al. [22] and Qin et al. [23], and several reviews on VRP [24]. However, Haydari et al. [22] focused on general planning problems in Intelligent Transportation Systems, emphasizing Traffic Signal Control (TSC) and Autonomous Driving. Qin et al. [23] only investigated dispatching problems in ridesharing scenarios, and Mazyavkina et al. [24] introduced DRL solutions for mathematical VRPs as part of more general combinatorial optimization. In contrast, we are the first to define DDS from a practical system level and to classify specific research problems in several scenarios with DRL-based solutions. We investigate how DRL can benefit DDS development. The two stages of DDS are discussed, including dispatching, which forms service loops, and routing, which executes service loops. The related literature is summarized in Table II and Table III.

Overall, this paper presents a comprehensive survey on DRL techniques for solving planning problems in DDS systems. Our contributions can be summarized as follows:

  • To the best of our knowledge, this is the first comprehensive survey that thoroughly defines and investigates DDS systems and up-to-date DRL techniques as solutions.

  • We classify different stages within a complete DDS system, including the dispatching stage and the routing stage. We also investigate the common applications corresponding to the two stages, introduce the theoretical background of DRL from a broad perspective and explain several important algorithms.

  • We investigate existing works that utilize DRL for DDS systems. We summarize these works in several dimensions and discuss the individual approaches.

  • We illustrate the challenges and several open problems in DDS problems using DRL. We believe the summarized research directions will benefit relevant research and help to direct future work.

The remainder of this survey is organized as follows. We first introduce the background of this survey, including DRL and four common DDS scenarios, in Sec. 2. The stage definitions and the specific problems with corresponding solutions for dispatching and routing are summarized in Sec. 3 and Sec. 4, respectively. The commonly used simulation environments for both stages are introduced in Sec. 5. Then we summarize several challenges of DRL for DDS design and open research problems in Sec. 6 and Sec. 7. Finally, we conclude this survey in Sec. 8.

Fig. 3: A sample of different grid-based navigation and partitioning schemes: (a) 4-way connectivity through cardinal directions, (b) 8-way connectivity with ordinal directions, (c) 6-way connectivity using hexagon-based representations. (d) Full connectivity, can also be modeled as a graph structure.

2 Background

Fig. 4: Reinforcement learning control loop.
Fig. 5: Classification and development of DRL algorithms. The solid arrow indicates the category attribution, and the dashed arrow indicates the development of the method.

2.1 Reinforcement Learning

RL is a kind of learning that maps environmental states to actions. The goal is to enable the agent to obtain the largest cumulative reward in the process of interacting with the environment [25]. Usually, a Markov Decision Process (MDP) is used to model RL problems. There are several core elements within RL under an MDP setting, including the agent, the environment, the state, the action, the reward, and the transition. Figure 4 represents the reinforcement learning control loop, and the detailed descriptions are as follows:

  • Environment. The environment of DRL is the fundamental setting that provides basic information from exogenous dynamics.

  • Agent. The agent in RL is supposed to provide actions and interact with the entire environment. There can be more than one agent, which forms the multi-agent RL setting.

  • State. S is the set of all environmental states. With the planning task modeled as an MDP, the state of the agent, s_t ∈ S, at decision step t describes the latest situation. The state of the agent serves as the endogenous feature that influences decision making.

  • Action. A is the set of executable actions of the agent. The action a_t ∈ A is the way that the agent interacts with the environment at decision step t. Any action can influence the current state of the agent.

  • Reward. R is the reward function. By continuously carrying out actions to change states, the agent obtains the corresponding reward r_t = R(s_t, a_t), given by performing action a_t in state s_t at decision step t. With r_t as the task signal, the entire training process of RL aims at a high reward, which represents how successful the agent is in completing the given task.

  • Transition. P is the state transition probability distribution function. P(s_{t+1} | s_t, a_t) represents the probability that the agent performs action a_t in state s_t and transits to the next state s_{t+1}.

In RL, a policy π is a mapping from the state space to the action space: the agent selects an action a_t given state s_t, executes it, transits to the next state s_{t+1} with probability P(s_{t+1} | s_t, a_t), and receives the reward r_t as environmental feedback at the same time. The immediate reward obtained at each future time step is multiplied by a discount factor γ ∈ [0, 1]. From time t to the end of the episode at time T, the cumulative reward is defined as G_t = Σ_{k=0}^{T−t} γ^k r_{t+k}, where γ weighs the impact of future rewards.
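The discounted return G_t defined above can be computed in a single backward pass over the episode's rewards (a generic sketch):

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, evaluated from the episode's end backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards [1, 1, 1] with gamma = 0.5 give G_0 = 1 + 0.5 + 0.25 = 1.75.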

The state-action value function Q^π(s, a) refers to the cumulative reward obtained by the agent when executing action a in the current state s and following policy π until the end of the episode, which can be expressed as Q^π(s, a) = E[G_t | s_t = s, a_t = a]. For all state-action pairs, if the expected return of one policy is greater than or equal to the expected return of all other policies, that policy π* is called the optimal policy. There may be more than one optimal policy, but they share the same state-action value function Q*(s, a) = max_π Q^π(s, a), which is called the optimal state-action value function. Such a function follows the Bellman optimality equation Q*(s, a) = E[r + γ max_{a'} Q*(s', a') | s, a].

In traditional RL, the value function is generally solved by iterating the Bellman equation. Through continuous iteration, the state-action value function eventually converges, thereby obtaining the optimal policy π*(s) = argmax_a Q*(s, a). However, for practical problems, such a search for the optimal policy is not feasible, since the computational cost of iterating the Bellman equation grows rapidly with the large state space. To tackle this problem, deep learning (DL) is introduced into RL to form deep reinforcement learning (DRL), which utilizes deep neural networks for function approximation in the traditional RL model and significantly improves performance in many challenging applications [26, 17, 27]. In general, an RL agent can act in two ways: (1) by knowing/modeling the state transition, which is called model-based RL, and (2) by interacting with the environment without modeling a transition model, which is called model-free RL. Model-free RL algorithms fall into two categories: value-based methods and policy-based methods. In value-based RL, an agent learns the value function of a state-action pair and then selects actions based on that value function [25], while in policy-based RL, the action is determined directly by a policy network trained by policy gradient [25]. We first introduce value-based and policy-based methods, and then discuss their combinations. Figure 5 shows the classification and development of these methods. Besides, we also introduce multi-agent RL as a special category.
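The Bellman iteration just described can be sketched on a toy tabular MDP with known, deterministic transitions. This is an illustrative instance of classic value iteration, not a method from the surveyed works:

```python
def q_value_iteration(P, R, gamma=0.9, iters=100):
    """Repeatedly apply the Bellman optimality update
    Q(s, a) <- R(s, a) + gamma * max_a' Q(P(s, a), a')
    on a tabular MDP where P[s][a] is the (deterministic) next state."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * max(Q[P[s][a]])
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q
```

With two states where action 1 in state 1 yields reward 1 and stays put, Q[1][1] converges to 1 / (1 − γ) = 10, and the greedy policy argmax_a Q(s, a) recovers the optimal strategy. It is this table, infeasible to enumerate for large state spaces, that DRL replaces with a neural network.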

Value-based RL. Mnih et al. [26, 17, 27] first combined a convolutional neural network with the Q-learning [28] algorithm from traditional RL and proposed the Deep Q-Network (DQN) framework. The model was first used to process visual perception and is a pioneering, representative work in value-based RL. DQN uses an experience replay mechanism [29] during training. At each time step t, the transition sample (s_t, a_t, r_t, s_{t+1}) obtained from the interaction between the agent and the environment is stored in a replay buffer D. During training, a small minibatch of transitions is randomly sampled from D each time, and stochastic gradient descent (SGD) with the TD error [30] is used to update the network parameters θ. Training samples are usually required to be independent of each other; such random sampling greatly reduces the correlation between samples, thereby improving the stability of the algorithm. In addition to using a deep convolutional network with parameters θ to approximate the current value function, DQN uses another network to generate the target value. Specifically, Q(s, a; θ) is the output of the current value network, used to evaluate the value of the state-action pair in the current state, while the target value network with parameters θ⁻ produces the target y = r + γ max_{a'} Q(s', a'; θ⁻), which approximates the value function. The parameters θ of the current value network are updated in real time. Every N iterations, the target network parameters are updated by θ⁻ ← θ and kept frozen for another N iterations. The entire network is trained by minimizing the mean squared error between the current value Q(s, a; θ) and the target y. Such a frozen-target mechanism reduces the correlation between the current value and the target value and thus improves the stability of the training process.
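The two DQN mechanisms just described, uniform replay sampling and frozen-network targets, reduce to a few lines. This is a sketch with our own helper names; q_target stands in for Q(·, ·; θ⁻):

```python
import random

def sample_minibatch(replay_buffer, batch_size, rng=random):
    """Uniform sampling from the replay buffer breaks temporal correlation."""
    return rng.sample(replay_buffer, batch_size)

def dqn_targets(batch, q_target, gamma=0.99):
    """Frozen-network targets y = r + gamma * max_a' Q(s', a'; theta^-).
    Each transition is (s, a, r, s_next, done); q_target(s) returns a
    list of Q values, one per action."""
    return [r if done else r + gamma * max(q_target(s2))
            for (s, a, r, s2, done) in batch]
```

The mean squared error between Q(s, a; θ) and these targets is then minimized by SGD, with θ⁻ periodically copied from θ.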

In DQN, both the selection and the evaluation of actions are based on the target value network, which easily overestimates the Q value during learning. To tackle this problem, researchers have proposed a series of methods based on DQN. Hasselt et al. [31] proposed the Deep Double Q-Network (DDQN) algorithm based on double Q-learning [32]. There are two different sets of parameters in double Q-learning, θ and θ⁻: θ is used to select the action corresponding to the maximum Q value, and θ⁻ is used to evaluate the value of that optimal action. Such a parameter separation decouples action selection from evaluation so as to reduce the risk of overestimating the Q value. Experiments show that DDQN estimates the Q value more accurately than DQN. The success of DDQN shows that reducing the evaluation error of the Q value improves performance. Inspired by this, Bellemare et al. [33] defined a new operator based on advantage learning (AL) [34] in the Bellman equation to increase the gap between the optimal and sub-optimal action values, in order to alleviate the evaluation error caused by the action corresponding to the largest Q value. Experiments show that the AL error term effectively reduces the bias in Q-value evaluation and thus promotes learning quality. In addition, Wang et al. [35] improved the network architecture of DQN and proposed Dueling DQN, which greatly accelerates learning.
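The DDQN separation of selection and evaluation can be sketched for a single transition. Helper names are ours; q_online and q_target return per-action Q values under θ and θ⁻ respectively:

```python
def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.99):
    """Online network selects the action; target network evaluates it.
    This decoupling reduces the overestimation caused by the max operator."""
    if done:
        return r
    q_on = q_online(s_next)
    a_star = max(range(len(q_on)), key=lambda a: q_on[a])  # selection by theta
    return r + gamma * q_target(s_next)[a_star]            # evaluation by theta^-
```

A plain DQN target would instead use max(q_target(s_next)), coupling both roles to θ⁻.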

Value-based RL methods are suitable for low-dimensional discrete action spaces. However, they cannot solve the decision-making problems in the continuous action space, such as autonomous driving, robot movement, etc. Therefore, we further introduce the policy-based RL methods that are capable of solving continuous decision-making problems.

Policy-based RL. Policy-based RL [36] updates the policy parameters θ directly by computing the gradient of the cumulative reward with respect to θ, and finally converges to the optimal policy. Let R(τ) denote the sum of rewards received in an episode. The most common idea of the policy gradient is to increase the probability of trajectories with higher reward. Assume the state, action, and reward trajectory of a complete episode is τ = (s_0, a_0, r_0, ..., s_T, a_T, r_T). Then the policy gradient is expressed as ∇_θ J(θ) = E_τ [ R(τ) Σ_t ∇_θ log π_θ(a_t | s_t) ]. This gradient is used to adjust the policy parameters by θ ← θ + α ∇_θ J(θ), where α is the learning rate, which controls the rate of parameter updates. The gradient term Σ_t ∇_θ log π_θ(a_t | s_t) represents the direction that increases the probability of occurrence of trajectory τ. After multiplying by the score R(τ), it makes the probability density of trajectories with higher reward greater. As trajectories with different total rewards are collected, the training process shifts probability density toward trajectories with higher total rewards and maximizes their appearance probability.

However, the above method cannot distinguish trajectories of different quality, which leads to slow and unstable training. To solve this, Williams et al. [21] proposed the REINFORCE algorithm with a baseline as a relative standard for the reward: ∇_θ J(θ) = E_τ [ (R(τ) − b) Σ_t ∇_θ log π_θ(a_t | s_t) ], where b is a baseline related to the current trajectory, usually set as an expected estimate of R(τ) in order to reduce the variance of the gradient. The more R(τ) exceeds the baseline b, the greater the probability that the corresponding trajectory will be selected. Therefore, in DRL tasks with large-scale states, the policy can be parameterized by a deep neural network, and the traditional policy gradient method can be used to solve for the optimal policy.
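REINFORCE with a running-mean baseline fits in a few lines on a two-armed bandit. This is a toy illustration with deterministic rewards; the softmax parameterization and all names are ours:

```python
import math
import random

def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Softmax-policy REINFORCE: theta_k += lr * (R - b) * d log pi / d theta_k,
    with a running mean of past rewards as the baseline b."""
    rng = random.Random(seed)
    theta, baseline = [0.0, 0.0], 0.0
    for t in range(steps):
        z = [math.exp(x) for x in theta]
        probs = [x / sum(z) for x in z]
        a = 0 if rng.random() < probs[0] else 1
        r = arm_rewards[a]
        baseline += (r - baseline) / (t + 1)  # running-mean baseline b
        for k in range(2):  # grad of log pi(a) w.r.t. theta_k is 1[a == k] - probs[k]
            theta[k] += lr * (r - baseline) * ((1.0 if a == k else 0.0) - probs[k])
    return theta
```

With arm rewards (0, 1), the probability mass shifts toward the better arm; without the baseline, both parameters would receive same-signed updates and training would be noisier.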

However, policy-based RL methods are very unstable during training due to inaccurate baseline estimation, and they are inefficient because complete episodes are required for parameter updates. To solve these problems, researchers proposed actor-critic methods, which combine value-based and policy-based RL methods.

Actor-Critic RL. V. R. Konda et al. [37] first proposed the actor-critic (AC) methods, leveraging advantages from both value-based and policy-based methods. AC methods include two estimators: an actor that plays the role of the policy-based method by interacting with the environment and generating actions according to the current policy, and a critic that plays the role of the value-based method by estimating the value of the current state during training. In AC methods, the critic's estimate of the current state's value makes the RL training process more stable. In addition, some actor-critic RL methods introduce gradient restrictions or replay buffers so that collected data can be reused, thereby improving training efficiency.

R. S. Sutton et al. [25] proposed the Advantage Actor-Critic (A2C) method, which subtracts a baseline from the Q value so that the feedback can be either positive or negative. V. Mnih et al. [38] introduced distributed machine learning into A2C and derived the Asynchronous Advantage Actor-Critic (A3C) algorithm, which greatly improves the efficiency of A2C. Wang et al. combined the AC method with experience replay and proposed Actor-Critic with Experience Replay (ACER) [39]. This method enables the AC framework to train in an off-policy way to improve data utilization efficiency. Lillicrap et al. [40] leveraged the ideas of DQN to extend the Deterministic Policy Gradient (DPG) [41] method and proposed the Deep Deterministic Policy Gradient (DDPG) method based on the actor-critic framework, which can solve decision-making problems in continuous action spaces. Moreover, DDPG also introduces a replay buffer so that collected data can be reused to improve training efficiency. Although DDPG can sometimes achieve good performance, it remains fragile with respect to hyperparameters. A common failure mode of DDPG is overestimating the real Q value, which makes the learned policy worse. To solve this problem, Fujimoto et al. [42] proposed Twin Delayed DDPG (TD3), which introduces three techniques on top of DDPG: clipped double-Q learning to reduce the bias of the Q-value estimate, together with delayed policy updates and target policy smoothing to reduce the impact of estimation bias on policy training. Furthermore, Schulman et al. [44] employed the importance sampling [43] method and constrained the gradient update of reinforcement learning to make the training process more robust, proposing Trust Region Policy Optimization (TRPO). The core idea of TRPO is to bound the difference between the prediction distributions of the old and new policies on the same batch of data so as to avoid excessive gradient updates and ensure the stability of the training process. However, TRPO employs the conjugate gradient algorithm to solve the constrained optimization problem, which greatly reduces computational efficiency and increases implementation cost. Therefore, Schulman et al. [20] proposed the Proximal Policy Optimization (PPO) algorithm, which avoids the computation required by constrained optimization by introducing a clipped surrogate objective.
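The clipped surrogate at the heart of PPO can be written per sample as follows (a sketch; ratio denotes π_new(a|s) / π_old(a|s) and A the advantage estimate):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    Clipping removes the incentive to push the policy ratio outside
    [1 - eps, 1 + eps], replacing TRPO's explicit trust-region constraint."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage, a ratio of 1.5 is clipped to 1.2; for a negative advantage, the min picks the more pessimistic of the two terms, so overly large policy steps are never rewarded.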

Multi-Agent RL. Many real-world problems require modeling interactions among different agents, so multi-agent RL algorithms are needed. A common approach is to assign each agent a separate training mechanism. Such a distributed learning architecture reduces learning difficulty and computational complexity. For DRL problems with large-scale state spaces, using the DQN algorithm instead of the Q-learning algorithm to train each agent individually yields a simple multi-agent DRL system. Tampuu et al. [45] dynamically adjusted the reward model according to different goals and proposed a DRL model in which multiple agents can cooperate and compete with each other. When faced with reasoning tasks that require multiple agents to communicate with each other, DQN models usually fail to learn an effective strategy. To solve this problem, Foerster et al. [46] proposed the Deep Distributed Recurrent Q-Networks (DDRQN) model for multi-agent communication and cooperation with partially observable states. Besides distributed learning, other mechanisms including cooperative learning, competitive learning, and direct parameter sharing are also used in different multi-agent scenarios [47].

2.2 Application Overview

As defined above, a DDS system transports either humans or parcels to provided destinations following individual demands or system requirements. We briefly introduce several urban DDS applications that have significant importance in our daily lives, as illustrated in Figure 6.

Fig. 6: An overview of several urban DDS applications that have significant importance in our daily lives.

2.2.1 Ridesharing

Compared to traditional taxi-hailing services in which passengers are offered rides by chance, a ridesharing service matches passengers with drivers according to their demands from mobile apps, such as DiDi [48], Uber [49], etc. When a potential passenger submits a request from the apps to the centralized platform, the platform will first estimate the trip price and send it back. If the passenger accepts it, a matching module will attempt to match the passenger to a nearby available driver. The matching process may take time due to real-time vehicle availability and thus pre-matching cancellation may exist. After a successful match, the driver will drive to the passenger and transport him/her to the destination. A trip fare will be obtained by the driver after arrival. To reduce the average waiting time for a successful match, the platforms usually utilize a fleet management module in the backend to rebalance idling vehicles continuously by guiding vehicles to places with a higher possibility of new requests. The decisions from matching and fleet management are finally executed within the routing stage. Vehicles are navigated to serve passengers or repositioned to new areas following these strategies. In the ridesharing scenario, the service worker of a loop refers to the vehicle, while the provider and the target refer to the passengers’ pickup locations and their destinations.

A ridesharing service can be further classified into ride-hailing, where a driver is assigned only one passenger at a time, and ride-pooling (also known as carpooling), where several passengers share a vehicle at a time. Note that in some literature the scenario of multiple passengers is also named ridesharing. In this survey, we use ride-pooling specifically for disambiguation, following [23].

2.2.2 On-demand Delivery

Many platforms around the world provide food delivery services, such as PrimeNow [50], UberEats [51], MeiTuan [1], and Eleme [52]. Beyond delivering food, the newly rising instant delivery services can also deliver small parcels from one customer to another, or purchase daily merchandise such as medicines directly from local shops or pharmacies. Both food and instant delivery can be seen as types of on-demand delivery. Compared with traditional delivery platforms, e.g., FedEx and UPS, orders on on-demand delivery platforms are expected to be fulfilled within a relatively short time, e.g., 30 minutes to 1 hour. A typical on-demand delivery process involves four parties: a customer as the service target, a merchant as the service provider, a courier as the worker, and the centralized platform. The customer first places an order in the platform's smartphone app; the merchant then starts to prepare the order while the platform assigns a courier to pick it up. Finally, the courier delivers the order to the customer.

2.2.3 Express Systems

As a long-existing DDS system, an express system is required both to pick up parcels from consignors and bring them to fixed depots, and to deliver parcels loaded at the depots to consignees. In practical express systems such as FedEx [53] and Cainiao [54], pickup and delivery are usually considered simultaneously and can be handled within the same service loop. A courier loads parcels at the depot and then delivers them to their destinations one by one via a delivery van. Meanwhile, new pickup requests may come from local customers during the delivery process, each associated with a service location; the courier should also visit these places to fulfill the pickup requests. Couriers are required to depart from and return to the depots by a specific time, to fit the schedule of the trucks that regularly transport packages to and from the stations.

2.2.4 Warehousing

Beyond the DDS applications that involve direct interactions with humans, rising autonomous technologies enable unmanned management in local warehousing. Shipment requests for cargoes, usually of large size and weight, are common within a repository or among several repositories. Cargoes are continuously moved onto a target shelf and moved out to accommodate the global shipping requirements. To reduce expenses and improve efficiency, automated guided vehicles (AGVs) are commonly used in the modern warehousing scenario. In a warehousing service loop, the service provider refers to the original shelf and the service target refers to the corresponding destination, while AGVs serve as the workers in the entire process. An intelligent centralized platform is responsible for controlling all AGVs for efficient operations.

(a) Illustration of On-demand Delivery.
(b) Illustration of Ridesharing.
(c) Illustration of Express Systems.
(d) Illustration of Warehousing.
Fig. 7: Illustration of the four typical DDS applications.

2.3 Relationship between Two Stages

Generally, the research problems within practical DDS systems can be classified into two stages, i.e., dispatching and routing. The dispatching stage mainly handles the relationship between service workers and demand pairs and thus constructs service loops, while the routing stage focuses on how to execute the services within each established loop. We hereby note that the two stages are not rigidly separated. A reasonable dispatching algorithm should consider future in-loop routing strategies as a measurement proxy: whether a better routing solution can be generated is a direct criterion for judging different dispatching strategies. For example, a courier should not be assigned a demand request that is far away from him, since the routing distance within such a loop would be too long. On the other hand, in practical routing scenarios where a fleet of workers is on duty, the cooperation among different workers needs consideration, and dispatching is thus implicitly included.

However, such a classification is necessary to concentrate on the primary challenges in different practical scenarios. An important reference metric for this classification, which we demonstrate in this survey, is the Demand/Worker Ratio. A low ratio means that the numbers of workers and demand pairs are balanced in each constructed loop, and thus the major space for optimization lies in determining how different requests should be assigned. For instance, a driver can only take one passenger in ride-hailing and no more than two passengers in ride-pooling. How to match drivers with customer requests is critical to global efficiency, while computing in-loop routing strategies is not computationally expensive. Meanwhile, a large ratio implies that a worker has to serve many demand requests within its loop. The routing stage, i.e., how to execute the loops, thus has high problem complexity and requires an intensive optimization process. For instance, a courier in express systems may be assigned hundreds of parcels, and generating the optimal routing strategy becomes the primary challenge due to its NP-hard nature.
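A hypothetical helper makes the ratio-based classification concrete; the threshold value is purely illustrative and is not a number proposed by the survey:

```python
def primary_stage(num_demands, num_workers, threshold=5.0):
    """Classify which stage dominates the optimization, following the
    demand/worker-ratio argument above. The threshold is an illustrative
    assumption, not a value from the literature."""
    ratio = num_demands / max(num_workers, 1)
    # Low ratio: assignment (dispatching) is the bottleneck.
    # High ratio: in-loop serving order (routing) is the bottleneck.
    return "routing" if ratio > threshold else "dispatching"
```

For example, ride-hailing with roughly one request per driver falls on the dispatching side, while an express courier carrying hundreds of parcels falls on the routing side.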

In the following sections, we focus on the dispatching and routing stages respectively. We discuss the subproblems within each stage and introduce the existing solutions.

3 Stage 1: Dispatching

Given the information of available workers and continuously updated service demand pairs, the first stage of DDS is to coordinate the relationship between demands and available workers, and thus establish service loops both effectively and efficiently. We name this loop-forming process 'Dispatching'. Generally, the dispatching stage consists of two aspects: 1) order matching, which aims to find the best matching strategy between workers and demands, and 2) fleet management, which repositions idling workers to balance the local demand-supply ratio so that better order matching can be obtained in the future. Figure 8 shows an overview of the dispatching stage in DDS.

Formulated as an optimization problem, both tasks in the dispatching stage are complicated due to three challenges. First, the continuously changing demand distributions and worker states bring high dynamics to the underlying Markov Decision Process; it is non-trivial to accurately evaluate the returns of different decision attempts. Second, a successful matching strategy should consider long-term returns [55]. A myopic solution that only considers the current service distribution may result in a long-term loss. For example, assigning all vehicles to serve every current demand may be a local maximum in ridesharing, but may decrease the profit in the next time window, since some vehicles are assigned to areas where barely any new demands appear. Third, a centralized platform should consider many, even an enormous number of, workers simultaneously. Effectively modeling the cooperation, and sometimes competition, among them is critical to improving system efficiency.

Concerning the given challenges, DRL has a natural advantage over conventional methods and other learning-based methods in solving the order matching problem. Many online reinforcement learning methods have been developed to handle the non-stationarity in MDP modeling. Taking expected returns as learning signals, DRL is a proper framework for optimizing sequential decision tasks, including dispatching tasks. Besides, modeling workers as agents is a natural way to handle the decision problem, either by homogeneously modeling all workers with the same policy, or by explicitly considering the interactions among multiple agents.

Fig. 8: Overall dispatching architecture proposed by [56].

In this section, we introduce both the order matching and fleet management problems. For each problem, we first introduce the problem definition and common metrics, along with several conventional methods. Then we thoroughly discuss detailed applications for transportation and logistics. The DRL-based literature for the dispatching stage is summarized in Table II.

3.1 Order Matching

An order matching process assigns currently unserved service demands to available workers. It is also known under other names, such as order-driver assignment in ridesharing services. The mathematical formulation originates from the online bipartite graph matching problem, where both the supplies (the workers) and the demands are dynamic. It is an important module in real-time online DDS applications with high dynamics, such as ridesharing and on-demand delivery [14, 11, 57]. Information including unserved demands, travel costs, and worker availability is updated continuously, which brings complexity to the problem.

Beyond purely assigning demands to workers, practical DDS systems also consider additional action choices. For instance, vehicles in a ridesharing system can be designated to idle when no proper demand can be assigned to them. As electric vehicles become widely deployed, the choice of whether to recharge or continue accepting new demands forms a new decision problem [58]. Furthermore, controlling the number of demands assigned to the same worker also expands the action space, for example when considering ride-hailing and ride-pooling scenarios simultaneously [58]. When each driver can serve more than one customer in a loop, the action is to determine how many customers, and which ones, to pick up.

As for the goal of order matching, there are generally two aspects to consider: optimizing profit for the platform and optimizing experience on the demands' side:

  • Maximize the Gross Merchandise Volume (GMV) [56]. With each service loop priced, a core evaluation metric of an effective order matching system is the total revenue of all services over time. In ride-hailing services specifically, it is also called Accumulated Driver Income (ADI) in some literature [59]. Generally, the profit perspective stands for the interest of both the workers and the entire platform.

  • Maximize the Order Response Rate (ORR) [59]. Since not all demands can be fulfilled in real-world scenarios, another goal is to maximize the ORR, which evaluates satisfaction from the demands' side. Based on the intuition that more demands are answered as ORR increases, it is also an alternative way to represent the interest of customers. Note that ORR is highly correlated with GMV, since the more demands are fulfilled, the more revenue the platform can obtain within a certain period.
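The two metrics above can be computed from a log of served orders as in this illustrative sketch; the `(price, fulfilled)` record format is an assumption made here for demonstration:

```python
def gmv_and_orr(orders):
    """Compute GMV and ORR from a list of orders, where each order is a
    (price, fulfilled) pair. GMV sums the revenue of fulfilled orders;
    ORR is the fraction of all orders that were fulfilled."""
    fulfilled_prices = [price for price, fulfilled in orders if fulfilled]
    gmv = sum(fulfilled_prices)
    orr = len(fulfilled_prices) / len(orders) if orders else 0.0
    return gmv, orr
```

The positive correlation noted above is visible directly in the code: every additional fulfilled order raises both the GMV sum and the ORR fraction.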


Reference Year Problem Scenario Algorithm Network Structure Dscheme Dtype Davail Code


Li et al.[59] 2019 Order Matching Ridesharing MFRL[60] MLP Hexagon-Grid real x x
Zhou et al.[61] 2019 Order Matching Ridesharing Double-DQN[31] MLP Hexagon-Grid real, sim x
Xu et al.[56] 2018 Order Matching Ridesharing TD[30] - Square-Grid real, sim x x
Wang et al.[62] 2018 Order Matching Ridesharing DQN[17] MLP, CNN Hexagon-Grid real x x
Tang et al.[63] 2019 Order Matching Ridesharing double-DQN[31] MLP Hexagon-Grid real x x
Jindal et al.[58] 2018 Order Matching Ride-pooling DQN[17] MLP Square-Grid real x
He et al.[64] 2019 Order Matching Ridesharing Double DQN[31] MLP, CNN Square-Grid real x x
Al-Abbasi et al.[65] 2019 Order Matching Ridesharing DQN[17] CNN Square-Grid real x x
Qin et al.[66] 2021 Order Matching Ridesharing AC [37], ACER [39] MLP Square-Grid real x x
Wang et al.[67] 2019 Order Matching Ridesharing Q-Learning [28] - Graph Based real, sim x
Ke et al.[68] 2020 Order Matching Ridesharing DQN [17], A2C [25], ACER [39], PPO [20] MLP Square/Hexagon-Grid real, sim x
Yang et al.[69] 2021 Order Matching Ridesharing TD [30] MLP Square-Grid real x x
Chen et al.[70] 2019 Order Matching On-demand Delivery PPO[20] MLP Square-Grid real, sim x x
Li et al.[59] 2019 Order Matching Express DQN [17] MLP, CNN Square-Grid real x x
Li et al.[71] 2020 Order Matching Express DQN [17] MLP, CNN Square-Grid real x x
Hu et al.[72] 2020 Order Matching Warehousing DQN [17] MLP Graph-based real x x
Lin et al.[73] 2018 Fleet Management Ridesharing A2C [25], DQN[17] MLP Hexagon-Grid real
Zhang et al.[74] 2020 Fleet Management Ridesharing Dueling DQN [35] MLP Hexagon-Grid real
Wen et al.[75] 2017 Fleet Management Ridesharing DQN[17] MLP Square-Grid real, sim x
Oda et al.[76] 2018 Fleet Management Ridesharing DQN[17] CNN Square-Grid real x x
Liu et al.[77] 2020 Fleet Management Ridesharing DQN [17] GCN [78] Square-Grid real
Shou et al.[79] 2020 Fleet Management Ridesharing DQN [17] AC [37] Square-Grid real
Jin et al.[80] 2019 Matching+Fleet Management Ridesharing DDPG[40] MLP, RNN Hexagon-Grid real
Holler et al.[81] 2019 Matching+Fleet Management Ridesharing DQN[17], PPO[20] MLP Square-Grid real, sim x x
Guo et al.[82] 2020 Matching+Fleet Management Ridesharing Double DQN[31] CNN Square-Grid sim x x
Liang et al.[83] 2021 Matching+Fleet Management Ridesharing DQN[17], A2C [25] MLP Graph-based real x x


TABLE II: Applications using DRL to solve DDS dispatching problems. The information of each literature reference consists of the publishing year, the problem solved, the application scenario, the DRL algorithm and network structure used, the discretization scheme (Dscheme), the data type (Dtype) used, whether the data is available (Davail), and whether the code is released.

3.1.1 Conventional Methods for Order Matching

The order matching problem and its many variants have been widely studied in the field of Operations Research (OR). Given deterministic information of both workers and demands, the problem reduces to bipartite matching and can be solved via the classical Kuhn-Munkres (KM) algorithm [8]. Early methods used greedy algorithms to assign the nearest available vehicle to a ride request [10]. These methods ignore the global demand and supply, and thus cannot achieve optimal performance in the long run. With new demands and worker states updating continuously, stochastic modeling becomes a major challenge, and researchers have developed heuristics to deal with it efficiently [11, 84, 85, 14]. Based on historical data and the predictable pattern of demands, Sungur et al. [84] use stochastic programming to model uncertain demands in the courier delivery scenario. Lowalekar et al. [85] tackle the problem via stochastic optimization with Benders decomposition and propose a matching framework for on-demand ride-hailing. Hu and Zhou [14] also formulate it as a dynamic problem and use heuristic policies to explore the structural space of the optimal solution.
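The gap between greedy nearest-assignment and the bipartite-matching optimum can be seen on a tiny cost matrix. In this sketch, brute-force enumeration stands in for the KM algorithm, which returns the same optimum in polynomial time; the helper names are illustrative:

```python
from itertools import permutations

def greedy_match(cost):
    """Assign each worker the cheapest remaining demand, in worker order,
    as the early greedy methods do."""
    n, taken, total = len(cost), set(), 0.0
    for w in range(n):
        j = min((j for j in range(n) if j not in taken),
                key=lambda j: cost[w][j])
        taken.add(j)
        total += cost[w][j]
    return total

def optimal_match(cost):
    """Exact minimum-cost perfect matching by brute force over all
    assignments (KM computes the same optimum efficiently)."""
    n = len(cost)
    return min(sum(cost[w][p[w]] for w in range(n))
               for p in permutations(range(n)))
```

On the cost matrix `[[1, 2], [1, 100]]`, greedy assigns worker 0 its cheapest demand and forces worker 1 into the expensive one (total 101), while the optimal matching costs only 3, illustrating why myopic assignment hurts global efficiency.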

3.1.2 DRL for order matching in Transportation Systems

Order matching is an essential decision and optimization problem in transportation systems, such as ridesharing services. Modern taxis and in-service vehicles can share their real-time coordinates and states with the centralized platform via mobile networks. On the other side, each customer can generate a new request, including the provided pick-up location and the destination, as a demand pair. The platform receives emerging demands and executes online matching policies. In transportation DDS, where the demand/worker ratio is relatively low, the coordination between demands and supplies is the principal issue, and the order matching among them is critical to improving system operation efficiency. From the agents' perspective, an intuitive way to formulate the MDP of order matching is to model all drivers in the system as different agents, leveraging multi-agent RL (MARL) techniques [59, 61]. However, a direct multi-agent formulation of real-world scenarios without any simplification suffers from the enormous joint action space of thousands of agents. As a solution, Li et al. [59] used Mean Field Reinforcement Learning (MFRL) [60], which models the interaction of each agent with the average of its neighbors. Zhou et al. [61] argue that no explicit cooperation or communication is needed in a large-scale scenario; they propose a decentralized execution method that dispatches orders following a joint evaluation.

As a comparison, another simplified and commonly accepted way to model the cooperation is to train a single policy and deploy it to all workers online [56, 62, 63, 86]. In this formulation, all workers are defined with homogeneous state and action spaces and reward definitions. Even though the system is still multi-agent from the global perspective, the training stage only considers a single agent. Specifically, Xu et al. [56] model order matching as a sequential decision-making problem and develop a joint learning-and-planning approach. They use Temporal Difference (TD) learning [30] to approximate the driver value function in the learning stage, and then use the KM algorithm to solve the bipartite matching problem based on the learned values during planning. Wang et al. [62] propose a transfer learning method to increase learning adaptability and efficiency, where the learned order matching model can be transferred to other cities; they use the DQN algorithm to estimate the value network. Tang et al. [63] further utilize the double-DQN framework to obtain a more stable learning process. Since the online dynamic order matching scenario requires comprehensive consideration of spatial-temporal features, they develop a special network structure using hierarchical coarse coding and cerebellar embedding memories for better representations. Also leveraging ST-features, He et al. [64] develop a capsule-based network for better representations. Jindal et al. [58] concentrate on the ride-pooling task and design their agent to decide whether a vehicle should take a single passenger or multiple passengers, leaving the detailed matching to low-level algorithms. The homogeneous-agent formulation avoids common challenges of multi-agent RL, including the exponential decision space over different agents. Besides, complicated communication is also avoided since all agents share the same state.
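The learn-then-plan pattern of Xu et al. can be sketched as a tabular TD(0) backup over (grid cell, time slot) states, plus an advantage-style weight for scoring candidate driver-order edges in the downstream bipartite matching; the function names and hyperparameters here are illustrative assumptions, not the exact design of the cited work:

```python
def td0_update(V, state, reward, next_state, alpha=0.05, gamma=0.99):
    """One TD(0) backup on a tabular value function keyed by
    (grid_cell, time_slot) states, learned from completed trips."""
    v = V.get(state, 0.0)
    target = reward + gamma * V.get(next_state, 0.0)
    V[state] = v + alpha * (target - v)

def match_score(V, trip_fare, origin_state, dest_state, gamma=0.99):
    """Advantage-style edge weight for the planning step: immediate fare
    plus discounted value of the destination, minus the value given up
    by leaving the origin. KM matching then maximizes the total score."""
    return trip_fare + gamma * V.get(dest_state, 0.0) - V.get(origin_state, 0.0)
```

The separation is the key idea: values are learned offline or online via TD, while each matching round remains an ordinary bipartite matching problem over the scored edges.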

Instead of treating workers as agents, some works treat a request, or the complete request list, as the agent. Yang et al. [69] model each demand as an agent and train a value network to estimate the values of demands instead of workers; a separate many-to-many matching process is then executed based on the learned values. Since online order matching involves non-stationarity from high dynamics, some literature also attempts to transform it into a static problem by concentrating on each time window [68, 67], following such agent modeling. Ke et al. [68] model each request as an agent, with all agents sharing the same policy; the action of each agent is whether to delay the current request to the next time window for further matching decisions. Wang et al. [67] train a single agent that represents the entire request list and decides how long the current window lasts. In both formulations, the eventual matching results are generated by static bipartite graph matching.

3.1.3 DRL for Order Matching in Logistic Systems

Order matching is essential not only in practical transportation applications but also in modern logistic systems. As pickup requests arrive in real time while many couriers are picking up packages, managing the couriers to ensure cooperation among them and to complete more pickup tasks over the long term is important but challenging.

With the requirement of fast response to on-demand delivery customers, modern on-demand delivery systems need effective matching strategies to assign new demands to couriers. Chen et al. [70] proposed a framework that utilizes multi-layer images of spatial-temporal maps to capture real-time representations of the service areas; they model different couriers as multiple agents and use Proximal Policy Optimization (PPO) [20] to train the corresponding policy. As for the more common express systems, researchers also focus on developing effective and efficient intelligent express systems by optimizing the order matching problem. Zhang et al. [87] first systematically study the large-scale dynamic city express problem, and adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period rather than dealing with each request individually. Rather than using heuristic-based methods, Li et al. [59] proposed a soft-label clustering algorithm named BDSB to dispatch parcels to couriers in each region, and further proposed a Contextual Cooperative Reinforcement Learning (CCRL) model to guide where each courier should deliver and serve in each short period. Considering courier dispatching rather than both pickup and delivery tasks, Li et al. [71] further proposed a Cooperative Multi-Agent Reinforcement Learning model to learn courier dispatching policies.

3.2 Fleet Management

When a service worker has no assigned demands and is idling temporarily, a well-considered repositioning strategy can increase the possibility of future service chances and thus increase the entire platform's revenue. Such a repositioning process forms the important fleet management problem, also referred to as vehicle repositioning or taxi dispatching [77]. The straightforward intuition is that reasonable management helps balance the demands and supplies across different regions, and thus improves the demand matching rate. We present the commonly accepted MDP modeling for the fleet management problem and investigate the related DRL applications.

3.2.1 Conventional Methods for Fleet Management

Balancing the distributions of DDS workers and demands has been extensively studied, especially for transportation systems; for instance, the balance of taxis and customers is essential to an efficient transportation system [88]. Traditional methods are mostly data-driven approaches that heavily exploit historical records of the supply and demand distributions. Miao et al. [89] capture uncertainty sets of random demand probability distributions via spatial-temporal features. Yuan et al. and Qu et al. construct recommendation systems that provide repositioning options for vehicles [90, 91]. Various techniques, including mixed-integer programming and combinatorial optimization, are utilized to model and solve the fleet management problem [92, 93].

3.2.2 DRL for Fleet Management

Following the idea of partitioning the city area into local grids to reduce computation cost, the MDP modeling of fleet management is also constructed on a discretized space. Given the spatial-temporal states of the workers in the fleet as individual agents and the information of dynamically updated customers, the intuition is to reposition available workers to locations with a larger demand/supply ratio than their current ones. For computational efficiency, the agents within the same grid at the same period are often treated as identical agents [73]. The goal of the platform is to maximize the long-term revenue of all agents or the total response rate, as in order matching. Since the measurement includes detailed matching between demands and workers, an intuitive assumption is that a worker can only be matched with demand providers from its current neighboring grids. The action of each agent is defined on the grid map and contains discrete choices: moving to one of its neighbors in the connected grids, or staying where it is.
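The grid-based action space described above can be sketched as follows; the 4-connected square-grid assumption and the greedy repositioning baseline are illustrative (a hexagonal grid would yield six neighbors instead):

```python
def grid_actions(cell, rows, cols):
    """Discrete fleet-management actions for an idle worker on a square
    grid: stay in place, or move to one of the 4-connected neighbors
    that lie inside the grid."""
    r, c = cell
    moves = [(r, c)]  # staying put is always a valid action
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            moves.append((nr, nc))
    return moves

def reposition(cell, demand_supply_ratio, rows, cols):
    """Greedy baseline: move toward the reachable grid with the highest
    demand/supply ratio. DRL methods replace this one-step greedy rule
    with a learned, long-term value estimate."""
    return max(grid_actions(cell, rows, cols),
               key=lambda g: demand_supply_ratio.get(g, 0.0))
```

In the DRL formulations surveyed below, the same action set is kept, but the greedy score is replaced by a learned Q-value over (grid, time) states.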

Following such a formulation, many DRL-based methods have been proposed to address the fleet management problem [73, 76, 75, 94, 95, 74, 77, 79] in recent years. Lin et al. [73] model the cooperation within the fleet as a multi-agent environment and propose a MARL-based solution for fleet management. Zhang et al. [74] develop a DDQN-based [35] framework that learns to rewrite the current repositioning policy. Wen et al. [75] explore the fleet management problem from a new taxi-driver perspective: they focus on increasing the individual incomes of drivers and demonstrate that higher revenues for drivers can help bring more drivers into the platform, thus improving service availability for customers. Shou et al. [79] further address the suboptimal equilibrium caused by the competition among different drivers; they propose a reward design scheme and establish multi-agent modeling of different drivers. In these works, as the action space for city-scale fleet management can be extremely large, deep Q-networks [17] are commonly adopted by state-of-the-art approaches to accelerate policy learning; the agents can quickly interact with the environment based on the learned Q-values and decide their next movements accordingly.

3.3 Joint Scheduling of Order Matching and Fleet Management

Besides individual studies of order matching and fleet management, researchers also attempt to develop algorithms for both problems together, considering them as an integrated dispatching stage [80, 81, 82].

Since the action spaces of the two problems are heterogeneous, Jin et al. [80] proposed a hierarchical DRL-based structure covering both tasks. Specifically, they design a unified action as a ranking weight vector that ranks and selects the specific order for matching or the destination for fleet management. Holler et al. [81] separate the two phases of the joint platform: they first treat the drivers as individual agents for order matching and then establish a central fleet management agent that is responsible for all individual drivers. Guo et al. [82] use a double-DQN-based framework to solve the fleet management problem first, leaving the detailed order matching to the traditional Kuhn-Munkres (KM) algorithm [8]. Liang et al. [83] preserve the topology of the original graph-based supply-demand distribution instead of discretizing it into a grid view, and develop a special centralized programming planning module to dispatch thousands of taxis on a real-time basis.

A major challenge in integrating the entire dispatching stage is modeling the heterogeneous actions of the two individual phases. A well-designed unified latent representation of agent states is essential to augment the policy exploration ability and the training robustness of a joint DRL framework. Further joint research on the two phases remains an open opportunity for effective DDS.

4 Stage 2: Routing

With tasks assigned to different workers and the relationship between workers and demands balanced, the service loops are constructed. The second stage of DDS scheduling is to determine how to serve each demand pair within the constructed service loops. For example, in the ride-pooling situation, a driver may have several customers in the car at a time and should decide their individual service priority. Routing is a more prominent problem in logistic systems, where the demand/worker ratio is much larger. For example, an express van may be assigned more than a hundred delivery demands within its current service loop, and a well-considered visiting strategy to execute the loop is critical to reducing expenses.
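As a reference point for the in-loop serving order, a nearest-neighbor heuristic is the natural baseline that DRL routing methods aim to beat; this sketch is illustrative and is not one of the surveyed methods:

```python
import math

def nearest_neighbor_route(depot, stops):
    """Simple in-loop routing baseline: starting at the depot, always
    serve the nearest unvisited stop next. Stops are (x, y) coordinates;
    distances are Euclidean."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    route, current, remaining = [], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(current, s))
        remaining.remove(nxt)
        route.append(nxt)   # serve this stop next
        current = nxt
    return route
```

The heuristic runs in quadratic time but can be far from optimal on clustered instances, which is exactly the gap that learned routing policies try to close.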

Generally, routing problems can be derived from the conventional VRP. For convenience, we first provide a mathematical formulation of the typical Capacitated VRP (CVRP), then discuss the recent DRL-based solutions for routing problems.

4.1 Formulation of Typical CVRP

The basic requirement of VRP is to design a routing strategy with a minimum cost for a fleet of vehicles, given the demands of a set of known customers. Every customer must be assigned to exactly one vehicle to have their parcels either picked up or delivered. All vehicles have limited capacities and should originate and terminate at a given depot, which also offers reloading service.

We denote a fleet of $m$ vehicles by $K = \{1, \dots, m\}$ and a set of $n$ known customers by $C = \{1, \dots, n\}$, which together formulate a directed graph $G = (V, A)$. The graph includes $n+2$ vertices, where the depot is doubly represented by vertex $0$ and vertex $n+1$. The set of arcs $A$ represents the traveling costs between customers and the depot and among customers; we associate a spatial distance $c_{ij}$ and a temporal cost $t_{ij}$ with each arc $(i, j) \in A$, $i \neq j$. $G$ includes $m$ subgraphs, and each connected subgraph $G_k$ represents a single route by vehicle $k$, which has to start from vertex $0$ and end at vertex $n+1$ with several customers in between. Each vehicle $k$ has a capacity $Q_k$, and each customer $i$ has a demand $d_i$ and a service time window $[a_i, b_i]$; the real-time shipment of vehicle $k$ should not exceed $Q_k$.

We further denote two decision variables $x_{ijk}$ and $s_{ik}$, and define $x_{ijk} = 1$ if and only if the arc $(i, j)$ is included in route $G_k$ ($x_{ijk} = 0$ otherwise), while $s_{ik}$ represents the time stamp at which vehicle $k$ serves customer $i$. With these notations, we formulate the VRP mathematically as follows:

$$\min \sum_{k \in K} \sum_{(i,j) \in A} c_{ij}\, x_{ijk} \quad (1)$$

subject to

$$\sum_{k \in K} \sum_{j \in V} x_{ijk} = 1, \quad \forall i \in C, \quad (2)$$
$$\sum_{j \in V} x_{0jk} = 1, \quad \forall k \in K, \quad (3)$$
$$\sum_{i \in V} x_{ihk} - \sum_{j \in V} x_{hjk} = 0, \quad \forall h \in C,\ \forall k \in K, \quad (4)$$
$$\sum_{i \in C} d_i \sum_{j \in V} x_{ijk} \le Q_k, \quad \forall k \in K, \quad (5)$$
$$a_i \le s_{ik} \le b_i, \quad \forall i \in C,\ \forall k \in K, \quad (6)$$

where (1) represents the routing objective. Constraints (2), (3), and (4) make all customers visited and only visited once, by routes that depart from the depot and are flow-consistent. (5) indicates that a vehicle should always yield to its capacity limit, and (6) requires all services to be made within the individual time windows.
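A candidate CVRP solution can be checked against the objective and the capacity and visiting constraints with a small evaluator; the data layout here (`routes` as per-vehicle customer lists, Euclidean distances, a shared capacity) is an assumption for illustration:

```python
import math

def evaluate_cvrp(routes, demands, capacity, coords, depot=0):
    """Check a candidate CVRP solution against the constraints above and
    return its total travel cost (the routing objective). `routes` is a
    list of customer-index lists, one per vehicle; `demands` maps each
    customer to its demand; `coords` maps vertices to (x, y) points."""
    def dist(i, j):
        (x1, y1), (x2, y2) = coords[i], coords[j]
        return math.hypot(x1 - x2, y1 - y2)

    visited = [c for route in routes for c in route]
    # Every customer must be served exactly once across all routes.
    assert sorted(visited) == sorted(demands), "each customer visited once"

    total = 0.0
    for route in routes:
        # The route load must not exceed the vehicle capacity.
        assert sum(demands[c] for c in route) <= capacity, "capacity exceeded"
        path = [depot] + route + [depot]   # start and end at the depot
        total += sum(dist(a, b) for a, b in zip(path, path[1:]))
    return total
```

Such an evaluator is also the typical reward backbone in DRL routing work: the negative total cost of a feasible solution serves as the episode return.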

Fig. 9: A summary illustration of the typical VRP and its common variants.


Reference | Year | Problem | Scenario | Algorithm | Network | Dscheme | Dtype | Davail | Code
Nazari et al. [18] | 2018 | Typical VRP | Mathematical | REINFORCE [21], A3C [38] | RNN | Graph-based | sim | |
Kool et al. [19] | 2019 | Typical VRP | Mathematical | REINFORCE [21] | ATT | Graph-based | sim | |
Chen et al. [96] | 2019 | Typical VRP | Mathematical | A2C [25] | MLP | Graph-based | sim | |
Lu et al. [97] | 2019 | Typical VRP | Mathematical | REINFORCE [21] | MLP, ATT | Graph-based | sim | |
Duan et al. [98] | 2020 | Typical VRP | Logistics | REINFORCE [21] | GCN, ATT | Graph-based | real | x | x
Delarue et al. [99] | 2020 | Typical VRP | Mathematical | Model-based | MLP | Graph-based | sim | x | x
Xin et al. [100] | 2020 | Typical VRP | Mathematical | REINFORCE [21] | ATT | Graph-based | sim | x | x
Joe et al. [101] | 2020 | Dynamic VRP | Logistics | DQN [17] | MLP | Graph-based | real | x | x
Ottoni et al. [102] | 2021 | TSP with Refueling (as EVRP) | Mathematical | Q-Learning [28], SARSA [16] | - | Graph-based | sim | x | x
Qin et al. [103] | 2021 | Heterogeneous VRP | Mathematical | Double-DQN [31] | MLP, CNN | Graph-based | sim | x | x
Bogyrbayeva et al. [104] | 2021 | Electric VRP | Ridesharing | REINFORCE [21] | RNN | Graph-based | sim | x | x
Shi et al. [105] | 2020 | Dynamic Electric VRP | Ridesharing | TD [30] | MLP | Graph-based | sim | x | x
James et al. [106] | 2019 | Electric VRP with Time Windows | Logistics | REINFORCE [21] | RNN | Graph-based | real | x |
Lin et al. [107] | 2020 | Electric VRP with Time Windows | Logistics | REINFORCE [21] | ATT, RNN | Graph-based | sim | x | x
Zhang et al. [108] | 2020 | VRP with Time Windows | Logistics | REINFORCE [21] | ATT | Graph-based | sim | x | x
Falkner et al. [109] | 2020 | VRP with Time Windows | Mathematical | REINFORCE [21] | MLP, ATT | Graph-based | sim | x |
Zhao et al. [110] | 2020 | VRP with Time Windows | Mathematical | AC [37] | ATT | Graph-based | sim | x |
Li et al. [111] | 2021 | VRP with Pickup and Delivery | Mathematical | REINFORCE [21] | ATT | Graph-based | sim | x | x
Li et al. [112] | 2021 | VRP with Pickup and Delivery | Logistics | Double-DQN [31] | MLP, ATT | Graph-based | real | x | x
Lee et al. [113] | 2021 | VRP with Pickup and Delivery | Warehousing | Q-Learning [28] | - | Square-grid | sim | x | x


TABLE III: Applications of deep reinforcement learning to DDS routing problems. The information for each reference includes the publication year, the problem solved, the application scenario, the DRL algorithm and network structure used, the data scheme (Dscheme) and data type (Dtype), whether the data is available (Davail), and whether the code is released.

4.2 Realistic Routing Problems

Besides the typical VRP setting, real-world routing problems often involve additional, more realistic constraints and objectives. Many VRP variants tackling these practical constraints are thus closer to industrial applications and are also widely studied. We briefly introduce several important variants, including dynamic VRP (DVRP), electric VRP (EVRP), VRP with time windows (VRPTW), and VRP with pickup and deliveries (VRPPD). An overview of the typical VRP and these variants is illustrated in Figure 9.

4.2.1 Dynamic VRP (DVRP)

Service demands may not be known to the platform in advance in real-world scenarios, so newly arriving demands require dynamic assignment to workers [6]. This is the same common challenge as discussed in the dispatching stage. However, rather than simply matching demands with workers, the routing stage also requires a specific routing strategy that orders the visits of the matched demands for each worker. Joe et al. [101] utilize DQN to estimate the Q-values of individual vehicle states and insert new demands into the existing solution sequence. DRL has the potential to estimate the future reward of possible actions and is thus well suited to solving dynamic VRPs.
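The learned value estimation in [101] is more sophisticated, but the insertion step it builds on can be illustrated by a greedy cheapest-insertion baseline: place the newly arrived demand at the position that increases route cost the least. All names here are hypothetical.

```python
def cheapest_insertion(route, new_customer, dist):
    """Insert a newly arrived customer into an existing route at the
    position that increases total travel cost the least.

    route: node sequence starting and ending at the depot (node 0).
    dist: callable (a, b) -> travel cost between nodes a and b.
    Returns the updated route and the cost increase of the insertion.
    """
    best_pos, best_delta = None, float("inf")
    for i in range(len(route) - 1):
        a, b = route[i], route[i + 1]
        # cost change of inserting new_customer between a and b
        delta = dist(a, new_customer) + dist(new_customer, b) - dist(a, b)
        if delta < best_delta:
            best_pos, best_delta = i + 1, delta
    return route[:best_pos] + [new_customer] + route[best_pos:], best_delta
```

A DRL agent can replace the myopic cost delta with an estimated long-term value of each insertion position.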

4.2.2 Electric VRP (EVRP)

As electric vehicles (EVs) have become widely accepted in recent years, researchers have gradually turned their interest to routing EV fleets, forming the special Electric VRP (EVRP) [4]. They focus on the application potential of EVs in both ride-hailing and express systems [105, 106]. Since current EVs have shorter ranges per charge than traditional vehicles, EVRP considers charging as an additional, essential action of the EV agents, and the environment often contains information on the locations of charging stations.

For EV usage in ride-hailing services, Shi et al. model the EV fleet operation as a dynamic EVRP [105]. At each decision step, an EV agent can either remain idle, charge at a local station, or serve customer demands; the detailed order dispatching of customer assignment is executed by the Kuhn-Munkres (KM) algorithm [8]. For EV usage in delivery and express systems, James et al. consider both the charging requirements of EVs and the possibility that not all customers can be visited within the given time [106]. The optimization goal of their framework is to maximize the number of delivered logistics requests while minimizing the total driving distance of all EVs; the two objectives are considered simultaneously via a weighted sum. Lin et al. consider the EVRP modeling along with individual time window limits of different customers, which is discussed in the following subsection [107].

4.2.3 VRP with Time Windows (VRPTW)

A service demand may come with a corresponding time window that the worker should satisfy, meaning the service must arrive at the target location within the given window [5, 114]. In practice, a customer who orders food from a restaurant may expect the food to be delivered before it cools down. Detailed consideration of time window limits is thus essential in practical routing scenarios.

Zhang et al. [108] propose a multi-agent framework that models the time window constraint as an additional penalty and generates the routing solutions of different vehicles one after another. James et al. [106] also consider the same constraint in the online electric vehicle routing problem, but do not force the vehicles to visit all given demands. Falkner et al. [109] propose a joint attention mechanism to balance the coordination between vehicles and demands. Zhao et al. [110] design a hybrid structure of DRL and local search to solve both the typical VRP and VRPTW.

4.2.4 VRP with Pickup and Deliveries (VRPPD)

In contrast to the simplified situation where the service provider and the service destination share the same location, VRPPD is a more common problem setting in practice [7]. For example, a ride-sharing driver is supposed to first pick up the customer at the origin and then deliver him to his destination. How to handle the relationships between different service provider-target pairs, i.e., pickup-delivery pairs, poses a great challenge. Li et al. [111] propose an attention-based structure with specially designed heterogeneous attention. They design several heterogeneous attention mechanisms to leverage the different relations between customers within the static graph, including the pickup with its paired delivery, the pickup with other deliveries, the pickup with other pickups, and the symmetric counterparts with the roles of pickups and deliveries switched.

4.3 Conventional Methods for Routing

When VRP was first defined [3], researchers attempted to find exact methods that explore the exact optimal strategies. The branch-and-bound method, a common approach for combinatorial optimization, was used as a solution [9]. Lagrangian relaxation based methods were proposed [115, 116], by which the problem could be reduced to a minimum degree-constrained K-tree problem. Besides, Desrochers et al. first used column generation to solve VRP [117]: column generation based methods initialize the problem with a small subset of variables, compute a corresponding solution, and gradually improve the results based on linear programming. However, due to the NP-hard nature of VRP, exact approaches are computationally expensive; they generate results slowly and only on small-sized datasets.

To complement the limitations of exact methods, many heuristic-based methods were developed to find near-optimal results instead. As a compromise to the complexity of VRP and its variants, an acceptable loss in solution quality can earn a great improvement in efficiency. For instance, tabu search and local search were proposed as conventional metaheuristics to solve VRP [118, 119]: new solutions in the neighborhood of the current one are continuously generated and evaluated. Genetic algorithms, on the contrary, operate on a population of solutions instead of a single one [120, 116]. Following the idea of genetics, child solutions are generated from the best parent solutions of the previous generation, and such iteration helps to approach the optimum. Instead of optimizing all objectives together, ant colony optimization utilizes several ant colonies to optimize different functions: the number of vehicles, the total distance, and others [12].

Even though these heuristics outperform the exact methods in finding better solutions, they are limited in real-time decision-making. A recreate-based search method, for example, takes hours to generate solutions for ten thousand instances with 100 customers each, which is unsuitable for real-time applications. As another drawback, the optimality of heuristic methods relies heavily on manually defined rules and expert knowledge, which is far from sufficient given the enormous search space. New technical mechanisms are needed to further improve solution quality.

4.4 DRL for Routing

Fig. 10: The illustration of sequence generation methods for generating VRP solutions via RL. In each decision step, the agent selects the next provider/target location to visit.

In recent years, many researchers have attempted to utilize deep reinforcement learning (DRL) to solve VRP and other combinatorial optimization problems, owing to its ability to improve solution quality via a self-driven mechanism and its potential for an efficient solution generation process. With solution quality guaranteed, DRL-based methods benefit from the separation of offline training and online inference. Even though it may take hours or even days to train a fully converged policy offline, inference on new problem instances in industrial online applications may take only a second, compared to metaheuristics that take minutes or hours [19]. Generally, current works using DRL to solve VRP and its variants can be classified into sequence generation based methods and rewriting based methods.

4.4.1 Sequence Generation Methods

A common approach to generating VRP solutions is to extend a partial sequence gradually until a complete solution is obtained. In such an MDP modeling, the transition between states is the inclusion of a new unvisited node into the current solution, which naturally forms the action of the sequence generation agent. In each decision step, the action space is the set of all unvisited nodes, and the agent selects one of them as the next to visit. Notably, in more realistic VRP variants, practical constraints may limit the selection space since infeasible solutions could otherwise be generated. The agent should therefore consider these constraints, usually via a masking scheme that filters out infeasible choices. The sequence generation method is illustrated in Figure 10.
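A minimal sketch of such a masked decoding loop is shown below. The `scores` matrix stands in for a trained policy network, and the capacity-based mask is one example of the feasibility filtering described above; all names are illustrative assumptions.

```python
import numpy as np

def greedy_decode(scores, demands, capacity):
    """Sequence generation with a feasibility mask (a simplified sketch).

    scores: (n, n) array; scores[i, j] is the model's preference for
    moving from node i to node j (a stand-in for a trained policy).
    Node 0 is the depot; the vehicle returns there when nothing fits.
    """
    n = len(demands)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    tour, cur, load = [0], 0, 0.0
    while not visited[1:].all():
        mask = visited.copy()
        # mask out customers whose demand exceeds remaining capacity
        for j in range(1, n):
            if demands[j] > capacity - load:
                mask[j] = True
        if mask[1:].all():            # nothing feasible: return to depot
            tour.append(0)
            cur, load = 0, 0.0
            continue
        feats = np.where(mask, -np.inf, scores[cur])
        nxt = int(np.argmax(feats))   # greedy; sampling is also common
        tour.append(nxt)
        visited[nxt] = True
        load += demands[nxt]
        cur = nxt
    tour.append(0)
    return tour
```

During training, the same mask is applied to the policy logits before sampling, so the agent never selects an infeasible node.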

A special pointer network (PN) was first proposed under such a mechanism. Following a classic encoder-decoder structure, the PN is independent of the input length, and the output sequence is a subset of the input in a generated order [121]. Even though the original PN was trained using supervised learning, it initiated the subsequent research exploration of DRL for more advanced VRP solutions, and its basic structure is commonly utilized in later research on routing problems. Bello et al. [122] first developed a neural combinatorial optimization (NCO) framework via DRL, which showed its effectiveness in both performance and generation efficiency on TSP and the knapsack problem. Even though the typical VRP was not studied, NCO serves as an important benchmark that utilizes DRL to explore more effective combinatorial optimization structures. Nazari et al. [18] followed the NCO structure and first applied it to VRP. Kool et al. [19] augmented the structure with an attention mechanism and obtained performance improvements. They investigated several routing formulations, including the TSP, typical CVRP, Split Delivery VRP (SDVRP), Orienteering Problem (OP), Prize Collecting TSP (PCTSP), and Stochastic PCTSP (SPCTSP). The attention-based structure was further developed by later researchers. Rather than relying on a single decoder for sequence generation, Xin et al. [100] proposed a multi-decoder mechanism to generate several partial solutions simultaneously and combine them using tree search to expand the search space. Duan et al. [98], on the other hand, focused on the feature representation ability of the network itself: they augmented the structure with a GCN and developed a joint learning approach using both DRL and supervised learning. Delarue et al. [99] model the action selection from each state as a mixed-integer program (MIP) and combine the combinatorial structure of the action space with pre-trained neural value functions by adapting the branch-and-cut algorithm [123].

Owing to the separation of training and inference, sequence generation methods benefit from high inference speed. For instance, generating routing solutions for 100 customers takes only 8 seconds, while the state-of-the-art heuristic solver LKH3 [124] takes more than 13 hours in comparison. Sequence generation based methods thus have great potential for handling demand changes on online platforms: when new DDS demands arrive or existing ones are modified or canceled, such a fast framework can respond in real time.

4.4.2 Rewriting based Methods

Besides generating partial solutions until completion, researchers have also explored other MDP formulations for solving VRPs. One intuition originates from the continuous modification of current solutions in operations research, which is the core idea of many practical VRP heuristics. Following this idea, researchers attempted to parameterize such a modification process so as to continuously improve solution quality; such methods are called rewriting based methods [96]. The rewriting based method is illustrated in Figure 11.

Fig. 11: The illustration of rewriting based methods for generating VRP solutions via RL. An initial solution is pre-established. In each decision step, the agent utilizes one of the pre-defined rules to modify the current solution and thus improve the overall solution quality.

Chen et al. [96] propose a framework in which a complete solution is constructed in a first stage and gradually improved under the guidance of an RL agent; a local rewriting rule keeps updating the current solution. Lu et al. [97] further propose a Learn-to-Iterate (L2I) framework which not only improves the current solution but also creates perturbations for more exploration. These rewriting based methods borrow ideas from operations research to keep rewriting the current solution; they are relatively slower than sequence generation methods but have more potential for better performance due to their extended exploration ability.
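To make the rewriting idea concrete, the following sketch greedily applies a classic 2-opt move (reversing a tour segment) whenever it shortens the tour. In DRL-based rewriters the choice of move is learned rather than rule-based, so this is only an illustrative hand-coded baseline.

```python
def two_opt_rewrite(tour, dist, max_steps=1000):
    """Rewriting-style improvement loop: repeatedly apply a 2-opt move
    whenever it shortens the tour.

    tour: node sequence starting and ending at the depot.
    dist: callable (a, b) -> travel cost between nodes a and b.
    """
    def length(t):
        return sum(dist(t[i], t[i + 1]) for i in range(len(t) - 1))

    improved, steps = True, 0
    while improved and steps < max_steps:
        improved = False
        for i in range(1, len(tour) - 2):
            for j in range(i + 1, len(tour) - 1):
                # candidate: reverse the segment tour[i..j]
                cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if length(cand) + 1e-9 < length(tour):
                    tour, improved = cand, True
                    steps += 1
    return tour
```

A learned rewriter replaces the exhaustive scan with a policy that picks the segment to rewrite, trading rule design for training.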

5 Open Simulators and Datasets for DDS

Since existing RL methods for DDS problems are mostly model-free, a large amount of training data generated by interaction with the environment is required. However, direct interaction with the real environment implies high costs and high risks. Simulating DDS scenarios is therefore necessary, and a reliable simulator is of great practical significance. Below we introduce existing open simulators and datasets.

Simulator and dataset for Dispatching.

Simulators for dispatching learn order generation and state transitions from real data [63]. There are many public dispatching-related datasets. The most commonly used one is provided by the New York City TLC (Taxi and Limousine Commission) [125], which contains travel records for various services (Yellow taxis, Green taxis, and For-Hire Vehicles) from 2009 to 2020. [126] provides a subset of the NYC FHV data containing GPS coordinates of pickup locations. Travel time data between OD pairs can be obtained through Uber Movement [127]. In addition, Didi Chuxing has released travel records (regular and aggregated) and vehicle trajectory datasets of Chengdu, China through [128], from which a simulator was also developed to model the dispatching state. Such a simulator usually consists of two parts: an order generation model, which learns how orders are generated and distributed, and a driver movement and transition model, which learns the state transitions from the dataset. The effectiveness of a dispatching simulator is often verified by comparing the gross merchandise volume (GMV) generated by its simulation with the GMV of real data [73]. To make the simulator more realistic, climate and traffic conditions are sometimes modeled in more detail [129]. Recently, DiDi [130] has developed a dispatching simulation platform based on the research of existing dispatching simulators, which serves as an open-source ride-hailing environment for training dispatching agents.

Simulator and dataset for Routing.

Simulators for routing learn order and vehicle generation and state transitions from historical data [112]. When the simulation starts, it first initializes the orders' and vehicles' states. The RL agent observes the state and dispatches a chosen vehicle to serve an order. After the RL agent executes the learned policy, the generation model and the vehicle state transition model are utilized to update the selected vehicle's information. This process repeats until the end of the simulation. Recently, Huawei [131] has developed a simulation platform based on the research of existing routing simulators, which serves as an open-source vehicle routing environment for training agents on a logistics scenario based on dynamic VRP with pickup and delivery. As for open routing datasets, [132] summarizes several open CVRP instance datasets of varying scales. VRPTW, a special variant of the routing problem, has public instances in the Solomon dataset [133] and the Homberger dataset [134]. Meanwhile, the Li&Lim benchmark [135] specifically targets VRPPD with time windows.
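The simulation loop described above (initialize states, observe, assign, update, repeat) can be sketched as a toy environment. All dynamics and names here are illustrative placeholders, not the API of any released simulator.

```python
import random

class MiniRoutingSim:
    """A toy routing simulator skeleton: the agent picks a vehicle for
    each incoming order, the state transition model updates vehicle
    availability, and the loop repeats until the horizon."""

    def __init__(self, n_vehicles=3, horizon=20, seed=0):
        self.rng = random.Random(seed)
        self.n_vehicles, self.horizon = n_vehicles, horizon

    def reset(self):
        self.t = 0
        self.busy_until = [0] * self.n_vehicles  # vehicle availability
        self.served = 0
        return self._obs()

    def _obs(self):
        return {"t": self.t, "busy_until": tuple(self.busy_until)}

    def step(self, vehicle_id):
        """Assign the current order to vehicle_id; reward 1 if it is free."""
        reward = 0
        if self.busy_until[vehicle_id] <= self.t:
            # placeholder transition: the vehicle is busy for 1-4 steps
            self.busy_until[vehicle_id] = self.t + self.rng.randint(1, 4)
            self.served += 1
            reward = 1
        self.t += 1
        return self._obs(), reward, self.t >= self.horizon

# a trivial policy: always pick the earliest-free vehicle
sim = MiniRoutingSim()
obs = sim.reset()
done, total = False, 0
while not done:
    action = min(range(sim.n_vehicles), key=lambda v: sim.busy_until[v])
    obs, r, done = sim.step(action)
    total += r
```

A real simulator replaces the placeholder transition with models learned from historical data and feeds richer observations to the DRL agent.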

6 Challenges

Owing to the advantages of DRL, many practical frameworks tackling the two DDS stages have been developed and can generate high-quality solutions efficiently at scale. However, challenges remain in building more practical DDS applications. We briefly summarize the major challenges in developing DDS solutions.

6.1 Coupled Spatial-Temporal Representations

Capturing the dynamic changes of distributions within different service loops is essential for an effective DDS application. A good spatial-temporal (ST) representation can well reflect the spatial relationships among entities and the temporal cost of accomplishing the services. For instance, He et al. [64] develop a capsule-based network to capture the representations of both new demands from passengers and available drivers in the order matching task. A well-learned representation can enhance the performance of the overall framework.

Even though ST representation is not a new research topic and has been addressed as an important task in the urban computing literature [136, 137, 138], the coupled ST structure forms a new challenge. In most DDS applications, a service target is bound to its provider, so the ST representation should reflect such a paired relationship. Some literature proposes special designs for this coupling challenge. For instance, Li et al. [111] propose a special attention-based structure to leverage different relationships among all customer nodes in VRP with pickup and delivery; six different attention mechanisms in total are computed as a thorough measurement over all nodes. However, this solution exhaustively enumerates all possible relations based on the given network structure. Developing more eligible and flexible representation methods, learning mechanisms, and overall algorithms remains a challenge for DDS development.

6.2 Fleet Heterogeneity

In both real-world transportation and logistics systems, it is common for workers in a given fleet to be heterogeneous. For example, vehicles with different capacities can be deployed for parcel delivery, and electric taxis with different battery volumes can be arranged together for ride-hailing services. Under such circumstances, how to account for the heterogeneity of the entire fleet forms another challenge for real-world DDS. Most current solutions in both the dispatching and routing stages assume that all workers in the system are homogeneous to greatly reduce computation costs in large-scale settings. However, heterogeneity remains an inevitable challenge.

For a DRL-based approach, an intuitive idea is to model such a problem as a multi-agent RL (MARL) problem, where several types of agents cooperate to accomplish the service tasks given by the platform. However, the state and action spaces grow rapidly with the number of agents. How to solve such a problem using MARL, or to find an alternative centralized modeling, remains a valuable question for the fleet heterogeneity challenge.

6.3 Variant Constraints in MDP Modeling

Considering multiple constraints in DRL design is an important research problem that has been widely studied [139, 140]. Meanwhile, the practical constraints in DDS design are especially significant. For instance, a practical routing problem is much more complicated than the mathematical VRP due to numerous constraints, including time windows, charging requirements, and structural limitations between pickups and deliveries. Effective DRL training requires corresponding treatment of these additional constraints.

A commonly used solution is to develop ad-hoc designs for these constraints. However, adopting such a solution for every constraint may result in complicated structure design as the number of constraints increases, making the entire model much more difficult to train. Another solution is to relax the limitations into soft constraints. For instance, to deal with VRP with time windows, Zhang et al. [108] measure possible violations of time windows as a penalty on the total reward, so the agent can learn to minimize how far it exceeds the constraint. However, such a solution is unsuitable in scenarios where limits are strict and cannot be violated even slightly. How to model these practical constraints well remains a critical challenge.
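A soft time-window constraint of the kind used in [108] can be sketched as a penalized reward. The penalty weight `alpha` and the function names are our assumptions for illustration, not the paper's exact formulation.

```python
def reward_with_tw_penalty(travel_cost, arrival_times, time_windows, alpha=10.0):
    """Soft time-window handling via a penalty term:
    reward = -(travel cost) - alpha * total lateness.

    arrival_times: dict customer -> arrival time.
    time_windows: dict customer -> (earliest, latest).
    alpha trades off route length against window violations.
    """
    lateness = sum(max(0.0, arrival_times[c] - time_windows[c][1])
                   for c in arrival_times)
    return -travel_cost - alpha * lateness
```

With a large `alpha` the agent learns to avoid violations almost entirely; for hard constraints that must never be violated, masking infeasible actions is the safer alternative.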

6.4 Large-Scale Deployment

Implementing algorithms at large scale is a necessary step from pure research to industrial usage. However, training models directly in large-scale scenarios requires enormous computational resources and time. To tackle this challenge, a commonly used method is to abandon the natural multi-agent formulation over different service workers: either directly providing centralized control or modeling workers as homogeneous agents with shared parameters can simplify the training process [56, 62, 63]. Another approach to reducing the computational burden is to divide and conquer, reducing a city-wide planning task into multiple regional ones [59]. Such an idea is widely used in real-world on-demand delivery systems, where the entire delivery scope is divided into regions and couriers are assigned to accomplish "last kilometer deliveries" [141].

However, current solutions are still far from sufficient for large-scale problems, especially in the routing stage. With an NP-hard nature, the complexity of generating an optimal solution grows exponentially with the problem scale. As a result, most existing literature on routing problems limits experiment scales to no more than one hundred demand nodes within a graph-based data scheme [19, 97]. Developing new training frameworks, via either more agile formulations or more advanced lightweight training algorithms, can help to fit large-scale environments and promote deployability.

6.5 Dynamics and Real-time Scheduling

Real-world DDS scenarios involve high dynamics from the environment. New demands arrive continuously and existing ones may also change. For instance, a passenger who calls for a ride-hailing service may change the destination or even cancel the current request. Such dynamics are critical to real-time scheduling in both stages.

Much of the existing literature using DRL for dispatching captures the dynamic features explicitly with specially designed network structures. For instance, Tang et al. [63] represent the spatial-temporal features using hierarchical coarse coding, and He et al. [64] develop a special capsule-based network accordingly. As for the routing stage, DVRP is specifically constructed to leverage the dynamics in real-time scheduling: changing demands update the current service loops and thus add complexity. However, only a limited number of DRL solutions for such dynamic routing scenarios have been proposed currently [101, 105]. Real-time scheduling under such dynamics remains challenging for DDS development.

7 Open Problems

Beyond these challenges, many open problems with future research opportunities remain in developing more effective DDS systems. In this section, we briefly discuss some research directions that we believe hold potential in this area.

7.1 Advanced DRL Methods for DDS

As DRL theory and methods develop rapidly, new advanced DRL algorithms hold great potential for developing more robust, effective, and efficient DDS applications. For example, for dispatching problems, most literature we investigate in this survey utilizes DQN as the training algorithm. However, such an off-policy framework suffers from limited interaction with the environment even when multi-sourced data is used. Consequently, reproducing results in other environments faces great difficulty, which causes fairness issues in performance comparison. Recent developments in offline RL [142] provide new opportunities to tackle these challenges: a complete offline learning paradigm based on large-scale agent experience data may help to improve training robustness and solve the reproducibility problem. Besides offline RL, other advanced techniques, such as causal RL [143] and RL leveraging multi-objective optimization, may also bring new research opportunities to DDS development.

7.2 Joint Optimization of Two Stages

Currently, even though both the dispatching stage and the routing stage are well studied, works that consider both stages within a single DRL paradigm are still missing. This forms a major problem, especially for the reaction speed to new changes in deployed systems. For example, current learning-based dispatching systems are still computationally intensive, since conventional VRP solvers rather than DRL-based ones are adopted to predict future income and vehicle states [144, 145]. A joint consideration of the two stages can help to improve overall performance, including both planning quality and inference speed. A major challenge of such modeling is the even more complicated state space and the heterogeneity of the two action spaces. Research potential lies in cross-stage representations for both states and actions, and the planning quality will depend heavily on the hierarchical framework design.

7.3 Fairness from Workers’ Perspective

In current solutions for DDS problems, the scheduling objective is almost always set by default to maximize the profit of the entire platform. Even when new objectives are proposed, such as the Order Response Rate (ORR) in Sec 3.1.1, they still conform to the overall centralized profit. In contrast, few works stand from the perspective of the service workers. As both DDS and AI ethics develop, social science researchers have gradually focused on how service workers perceive their roles in DDS. While keeping centralized profit as the prime goal, how to consider the individual differences among service workers and their initiatives is a research question with potential. On the one hand, fairness between workers is an essential problem: maximizing overall profit alone might result in extreme differences in individual incomes, and a well-designed DDS system should guarantee fairness across service workers. On the other hand, individual workers might have their own preferences over dispatching or routing strategies. Personal historical patterns recorded without algorithmic intervention may help in identifying these preferences, which may further be considered as a factor when intelligent algorithms guide workers.

7.4 Partial Compliance Consideration

In the dispatching stage, current algorithms usually assume full compliance from the service workers as a simplification. However, in real-world applications, workers may reject recommendations from the centralized platform and operate based on individual preference. Such non-compliance may result in inaccurate overall performance prediction and thus requires additional investigation.

Besides considering partial compliance as a factor in the system, the reasons behind such non-compliance and the corresponding countermeasures form another important research task. For example, couriers on rainy days are usually reluctant to accept distant delivery tasks; one solution is to provide an extra allowance to the couriers. Jointly generating order matching decisions and determining specific allowance quotas for different couriers and tasks is a sequential decision problem, which is suitable for DRL modeling.

7.5 Pricing Problems

Other than directly scheduling how the different roles in human-engaged DDS loops should operate to improve efficiency as a whole, another important research question lies in how to price the service provided by the workers. Dynamic pricing problems exist in DDS applications with short service durations and human engagement, such as ridesharing and instant/food delivery. The pricing module influences both the supply distribution of service workers and new service demands from the customers' side.

RL-based approaches for dynamic pricing have been studied and used in one-sided retail markets [146, 147]. In these scenarios, pricing changes only the demand pattern from the customers' perspective. However, in much more complicated DDS systems, a good pricing strategy should consider both customers and workers, as well as other dynamic spatial-temporal information.

It is not trivial to develop a high-quality pricing module within human-engaged DDS applications due to two major challenges. First, pricing optimization is closely coupled with the other DDS scheduling tasks discussed in this survey: the routing estimation made in advance is a decisive factor in pricing, and further influences the quality of the dispatching stage. How to optimize several modules jointly is a critical research problem. Second, designing metrics to evaluate pricing quality is another challenge. A good pricing strategy should be reasonable and explainable to customers, and should improve the overall income of the platform and all service workers while maintaining fairness. Multi-sided problem formulations and explainable RL designs could thus be a core research focus in the future.

7.6 Simulation Environments

For scenarios that require heavy interaction with the environment, such as ridesharing or on-demand delivery with frequently updated demand and worker distributions, it is too expensive to evaluate algorithms by directly interacting with real-world systems. It is thus essential to deploy and evaluate algorithms in simulation environments. However, current simulators are rather simple in their environment settings and insufficient for completely modeling either dispatching or realistic routing in DDS. For the dispatching stage, many real-world events should be considered in simulation to evaluate the robustness of proposed algorithms, including stochastic cancellation of requests, intelligent pricing strategies for different matching results, and the individual preferences of different drivers. For the routing stage, real-world execution can seldom be as exact as the algorithm predicts; simulating environment changes, real-time traffic congestion, and demand updates is essential for dynamic decision-making.

To the best of our knowledge, few released simulators consider the factors above. Great research potential lies in developing simulators that allow agents to interact fully with realistic environments and are thus robust enough for deployment. A well-designed simulator can play a critical role as the offline training environment for advanced DRL algorithms for DDS.

7.7 Large-Scale Online Scheduling System

Building on the challenges discussed in Sec. 6, the ultimate benchmark for applying DRL to DDS is a large-scale online scheduling system that handles real-world DDS tasks. Developing such a system requires jointly addressing coupled and dynamic features, modeling heterogeneity within fleets, maintaining high efficiency at large scales, and adapting to practical constraints. Both generally applicable, robust algorithms and ad-hoc designs for specific scenarios are needed to construct a centralized DDS platform. A large-scale online scheduling DDS system built via DRL would have a strong impact on both related research and industry.

8 Conclusion

Demand-driven services (DDS), such as ride-hailing and express systems, are of great importance in urban life today. The planning and scheduling processes within these applications must be both effective and efficient. In this survey, we focus on DDS problems and decompose DDS into two stages: the dispatching stage and the routing stage. The dispatching stage coordinates unassigned service demands with available service workers, while the routing stage generates serving strategies within each service loop. We review recent works that apply deep reinforcement learning (DRL) to DDS problems in both stages, and discuss remaining challenges and open problems in using DRL to build high-quality DDS systems.


Acknowledgments

This work was supported in part by The National Key Research and Development Program of China under grant 2018YFB1800804, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, and 61861136003, Beijing Natural Science Foundation under L182038, Beijing National Research Center for Information Science and Technology under 20031887521, and the research fund of the Tsinghua University - Tencent Joint Laboratory for Internet Innovation Technology.


  • [1] Meituan, “Meituan homepage,”
  • [2] M. M. Flood, “The traveling-salesman problem,” Operations research, vol. 4, no. 1, pp. 61–75, 1956.
  • [3] G. B. Dantzig and J. H. Ramser, “The truck dispatching problem,” Management science, vol. 6, no. 1, pp. 80–91, 1959.
  • [4] M. Schneider, A. Stenger, and D. Goeke, “The electric vehicle-routing problem with time windows and recharging stations,” Transportation science, vol. 48, no. 4, pp. 500–520, 2014.
  • [5] J. Desrosiers, F. Soumis, and M. Desrochers, “Routing with time windows by column generation,” Networks, vol. 14, no. 4, pp. 545–565, 1984.
  • [6] H. N. Psaraftis, “Dynamic vehicle routing problems,” Vehicle routing: Methods and studies, vol. 16, pp. 223–248, 1988.
  • [7] H. Min, “The multiple vehicle routing problem with simultaneous delivery and pick-up points,” Transportation Research Part A: General, vol. 23, no. 5, pp. 377–386, 1989.
  • [8] J. Munkres, “Algorithms for the assignment and transportation problems,” Journal of the society for industrial and applied mathematics, vol. 5, no. 1, pp. 32–38, 1957.
  • [9] P. Toth and D. Vigo, “Branch-and-bound algorithms for the capacitated vrp,” in The vehicle routing problem.   SIAM, 2002, pp. 29–51.
  • [10] Z. Liao, “Real-time taxi dispatching using global positioning systems,” Communications of the ACM, vol. 46, no. 5, pp. 81–83, 2003.
  • [11] E. Özkan and A. R. Ward, “Dynamic matching for real-time ride sharing,” Stochastic Systems, vol. 10, no. 1, pp. 29–70, 2020.
  • [12] L. M. Gambardella, Éric Taillard, and G. Agazzi, “Macs-vrptw: A multiple colony system for vehicle routing problems with time windows,” in New Ideas in Optimization.   McGraw-Hill, 1999, pp. 63–76.
  • [13] G. Schrimpf, J. Schneider, H. Stamm-Wilbrandt, and G. Dueck, “Record breaking optimization results using the ruin and recreate principle,” Journal of Computational Physics, vol. 159, no. 2, pp. 139 – 171, 2000. [Online]. Available:
  • [14] M. Hu and Y. Zhou, “Dynamic type matching,” Manufacturing & Service Operations Management, 2021.
  • [15] D. Favaretto, E. Moretti, and P. Pellegrini, “Ant colony system for a vrp with multiple time windows and multiple visits,” Journal of Interdisciplinary Mathematics, vol. 10, no. 2, pp. 263–284, 2007.
  • [16] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [18] M. Nazari, A. Oroojlooy, L. V. Snyder, and M. Takáč, “Reinforcement learning for solving the vehicle routing problem,” 2018.
  • [19] W. Kool, H. van Hoof, and M. Welling, “Attention, learn to solve routing problems!” 2018.
  • [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [21] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
  • [22] A. Haydari and Y. Yilmaz, “Deep reinforcement learning for intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [23] Z. Qin, H. Zhu, and J. Ye, “Reinforcement learning for ridesharing: A survey,” arXiv preprint arXiv:2105.01099, 2021.
  • [24] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev, “Reinforcement learning for combinatorial optimization: A survey,” Computers & Operations Research, p. 105400, 2021.
  • [25] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” 2011.
  • [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [28] C. J. C. H. Watkins, “Learning from delayed rewards,” 1989.
  • [29] L.-J. Lin, Reinforcement learning for robots using neural networks.   Carnegie Mellon University, 1992.
  • [30] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [31] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [32] H. Hasselt, “Double q-learning,” Advances in neural information processing systems, vol. 23, pp. 2613–2621, 2010.
  • [33] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [34] L. C. Baird III, “Reinforcement learning through gradient descent,” CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE, Tech. Rep., 1999.
  • [35] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2016, pp. 1995–2003.
  • [36] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation,” in NIPS, vol. 99.   Citeseer, 1999, pp. 1057–1063.
  • [37] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.
  • [38] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning.   PMLR, 2016, pp. 1928–1937.
  • [39] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” arXiv preprint arXiv:1611.01224, 2016.
  • [40] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [41] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning.   PMLR, 2014, pp. 387–395.
  • [42] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1587–1596.
  • [43] C. R. Shelton, “Importance sampling for reinforcement learning with multiple objectives,” 2001.
  • [44] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning.   PMLR, 2015, pp. 1889–1897.
  • [45] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS one, vol. 12, no. 4, p. e0172395, 2017.
  • [46] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent q-networks,” arXiv preprint arXiv:1602.02672, 2016.
  • [47] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in multi-agent systems and applications-1, pp. 183–221, 2010.
  • [48] DiDi, “Didi homepage,”
  • [49] Uber, “Uber homepage,”
  • [50] PrimeNow, “Primenow homepage,”
  • [51] UberEats, “Ubereats homepage,”
  • [52] Eleme, “Eleme homepage,”
  • [53] FedEx, “Fedex homepage,”, 2021.
  • [54] Cainiao, “Cainiao homepage,”, 2021.
  • [55] L. Zhang, T. Hu, Y. Min, G. Wu, J. Zhang, P. Feng, P. Gong, and J. Ye, “A taxi order dispatch model based on combinatorial optimization,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 2151–2159.
  • [56] Z. Xu, Z. Li, Q. Guan, D. Zhang, Q. Li, J. Nan, C. Liu, W. Bian, and J. Ye, “Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 905–913.
  • [57] C. Yan, H. Zhu, N. Korolko, and D. Woodard, “Dynamic pricing and matching in ride-hailing platforms,” Naval Research Logistics (NRL), vol. 67, no. 8, pp. 705–724, 2020.
  • [58] I. Jindal, Z. T. Qin, X. Chen, M. Nokleby, and J. Ye, “Optimizing taxi carpool policies via reinforcement learning and spatio-temporal mining,” in 2018 IEEE International Conference on Big Data (Big Data).   IEEE, 2018, pp. 1417–1426.
  • [59] Y. Li, Y. Zheng, and Q. Yang, “Efficient and effective express via contextual cooperative reinforcement learning,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 510–519.
  • [60] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multi-agent reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5571–5580.
  • [61] M. Zhou, J. Jin, W. Zhang, Z. Qin, Y. Jiao, C. Wang, G. Wu, Y. Yu, and J. Ye, “Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2645–2653.
  • [62] Z. Wang, Z. Qin, X. Tang, J. Ye, and H. Zhu, “Deep reinforcement learning with knowledge transfer for online rides order dispatching,” in 2018 IEEE International Conference on Data Mining (ICDM).   IEEE, 2018, pp. 617–626.
  • [63] X. Tang, Z. Qin, F. Zhang, Z. Wang, Z. Xu, Y. Ma, H. Zhu, and J. Ye, “A deep value-network based approach for multi-driver order dispatching,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 1780–1790.
  • [64] S. He and K. G. Shin, “Spatio-temporal capsule-based reinforcement learning for mobility-on-demand network coordination,” in The World Wide Web Conference, 2019, pp. 2806–2813.
  • [65] A. O. Al-Abbasi, A. Ghosh, and V. Aggarwal, “Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 12, pp. 4714–4727, 2019.
  • [66] G. Qin, Q. Luo, Y. Yin, J. Sun, and J. Ye, “Optimizing matching time intervals for ride-hailing services using reinforcement learning,” Transportation Research Part C: Emerging Technologies, vol. 129, p. 103239, 2021.
  • [67] Y. Wang, Y. Tong, C. Long, P. Xu, K. Xu, and W. Lv, “Adaptive dynamic bipartite graph matching: A reinforcement learning approach,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE).   IEEE, 2019, pp. 1478–1489.
  • [68] K. Jintao, H. Yang, J. Ye et al., “Learning to delay in ride-sourcing systems: a multi-agent deep reinforcement learning framework,” IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [69] L. Yang, X. Yu, J. Cao, X. Liu, and P. Zhou, “Exploring deep reinforcement learning for task dispatching in autonomous on-demand services,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 15, no. 3, pp. 1–23, 2021.
  • [70] Y. Chen, Y. Qian, Y. Yao, Z. Wu, R. Li, Y. Zhou, H. Hu, and Y. Xu, “Can sophisticated dispatching strategy acquired by reinforcement learning?” in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 1395–1403.
  • [71] Y. Li, Y. Zheng, and Q. Yang, “Cooperative multi-agent reinforcement learning in express system,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 805–814.
  • [72] H. Hu, X. Jia, Q. He, S. Fu, and K. Liu, “Deep reinforcement learning based agvs real-time scheduling with mixed rule for flexible shop floor in industry 4.0,” Computers & Industrial Engineering, vol. 149, p. 106749, 2020.
  • [73] K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1774–1783.
  • [74] W. Zhang, Q. Wang, J. Li, and C. Xu, “Dynamic fleet management with rewriting deep reinforcement learning,” IEEE Access, vol. 8, pp. 143 333–143 341, 2020.
  • [75] J. Wen, J. Zhao, and P. Jaillet, “Rebalancing shared mobility-on-demand systems: A reinforcement learning approach,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC).   Ieee, 2017, pp. 220–225.
  • [76] T. Oda and C. Joe-Wong, “Movi: A model-free approach to dynamic fleet management,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications.   IEEE, 2018, pp. 2708–2716.
  • [77] Z. Liu, J. Li, and K. Wu, “Context-aware taxi dispatching at city-scale using deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [78] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [79] Z. Shou and X. Di, “Reward design for driver repositioning using multi-agent reinforcement learning,” Transportation research part C: emerging technologies, vol. 119, p. 102738, 2020.
  • [80] J. Jin, M. Zhou, W. Zhang, M. Li, Z. Guo, Z. Qin, Y. Jiao, X. Tang, C. Wang, J. Wang et al., “Coride: joint order dispatching and fleet management for multi-scale ride-hailing platforms,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1983–1992.
  • [81] J. Holler, R. Vuorio, Z. Qin, X. Tang, Y. Jiao, T. Jin, S. Singh, C. Wang, and J. Ye, “Deep reinforcement learning for multi-driver vehicle dispatching and repositioning problem,” in 2019 IEEE International Conference on Data Mining (ICDM).   IEEE, 2019, pp. 1090–1095.
  • [82] G. Guo and Y. Xu, “A deep reinforcement learning approach to ride-sharing vehicles dispatching in autonomous mobility-on-demand systems,” IEEE Intelligent Transportation Systems Magazine, 2020.
  • [83] E. Liang, K. Wen, W. H. Lam, A. Sumalee, and R. Zhong, “An integrated reinforcement learning and centralized programming approach for online taxi dispatching,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [84] I. Sungur, Y. Ren, F. Ordóñez, M. Dessouky, and H. Zhong, “A model and algorithm for the courier delivery problem with uncertainty,” Transportation science, vol. 44, no. 2, pp. 193–205, 2010.
  • [85] M. Lowalekar, P. Varakantham, and P. Jaillet, “Online spatio-temporal matching in stochastic and dynamic domains,” Artificial Intelligence, vol. 261, pp. 71–112, 2018.
  • [86] Z. Qin, X. Tang, Y. Jiao, F. Zhang, Z. Xu, H. Zhu, and J. Ye, “Ride-hailing order dispatching at didi via reinforcement learning,” INFORMS Journal on Applied Analytics, vol. 50, no. 5, pp. 272–286, 2020.
  • [87] S. Zhang, L. Qin, Y. Zheng, and H. Cheng, “Effective and efficient: Large-scale dynamic city express,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3203–3217, 2016.
  • [88] M. Nourinejad and M. Ramezani, “Developing a large-scale taxi dispatching system for urban networks,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2016, pp. 441–446.
  • [89] F. Miao, S. Han, A. M. Hendawi, M. E. Khalefa, J. A. Stankovic, and G. J. Pappas, “Data-driven distributionally robust vehicle balancing using dynamic region partitions,” in Proceedings of the 8th International Conference on Cyber-Physical Systems, 2017, pp. 261–271.
  • [90] M. Qu, H. Zhu, J. Liu, G. Liu, and H. Xiong, “A cost-effective recommender system for taxi drivers,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 45–54.
  • [91] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, “T-finder: A recommender system for finding passengers and vacant taxis,” IEEE Transactions on knowledge and data engineering, vol. 25, no. 10, pp. 2390–2403, 2012.
  • [92] J. Xu, R. Rahmatizadeh, L. Bölöni, and D. Turgut, “Taxi dispatch planning via demand and destination modeling,” in 2018 IEEE 43rd Conference on Local Computer Networks (LCN).   IEEE, 2018, pp. 377–384.
  • [93] X. Xie, F. Zhang, and D. Zhang, “Privatehunt: Multi-source data-driven dispatching in for-hire vehicle systems,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 1, pp. 1–26, 2018.
  • [94] T. Verma, P. Varakantham, S. Kraus, and H. C. Lau, “Augmenting decisions of taxi drivers through reinforcement learning for improving revenues,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 27, no. 1, 2017.
  • [95] Y. Gao, D. Jiang, and Y. Xu, “Optimize taxi driving strategies based on reinforcement learning,” International Journal of Geographical Information Science, vol. 32, no. 8, pp. 1677–1696, 2018.
  • [96] X. Chen and Y. Tian, “Learning to perform local rewriting for combinatorial optimization,” in Advances in Neural Information Processing Systems, 2019, pp. 6278–6289.
  • [97] H. Lu, X. Zhang, and S. Yang, “A learning-based iterative method for solving vehicle routing problems,” in International Conference on Learning Representations, 2020.
  • [98] L. Duan, Y. Zhan, H. Hu, Y. Gong, J. Wei, X. Zhang, and Y. Xu, “Efficiently solving the practical vehicle routing problem: A novel joint learning approach,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3054–3063.
  • [99] A. Delarue, R. Anderson, and C. Tjandraatmadja, “Reinforcement learning with combinatorial actions: An application to vehicle routing,” arXiv preprint arXiv:2010.12001, 2020.
  • [100] L. Xin, W. Song, Z. Cao, and J. Zhang, “Multi-decoder attention model with embedding glimpse for solving vehicle routing problems,” arXiv preprint arXiv:2012.10638, 2020.
  • [101] W. Joe and H. C. Lau, “Deep reinforcement learning approach to solve dynamic vehicle routing problem with stochastic customers,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, 2020, pp. 394–402.
  • [102] A. L. Ottoni, E. G. Nepomuceno, M. S. de Oliveira, and D. C. de Oliveira, “Reinforcement learning for the traveling salesman problem with refueling,” Complex & Intelligent Systems, pp. 1–15, 2021.
  • [103] W. Qin, Z. Zhuang, Z. Huang, and H. Huang, “A novel reinforcement learning-based hyper-heuristic for heterogeneous vehicle routing problem,” Computers & Industrial Engineering, vol. 156, p. 107252, 2021.
  • [104] A. Bogyrbayeva, S. Jang, A. Shah, Y. J. Jang, and C. Kwon, “A reinforcement learning approach for rebalancing electric vehicle sharing systems,” IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [105] J. Shi, Y. Gao, W. Wang, N. Yu, and P. A. Ioannou, “Operating electric vehicle fleet for ride-hailing services with reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 11, pp. 4822–4834, 2019.
  • [106] J. James, W. Yu, and J. Gu, “Online vehicle routing with neural combinatorial optimization and deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3806–3817, 2019.
  • [107] B. Lin, B. Ghaddar, and J. Nathwani, “Deep reinforcement learning for electric vehicle routing problem with time windows,” arXiv preprint arXiv:2010.02068, 2020.
  • [108] K. Zhang, F. He, Z. Zhang, X. Lin, and M. Li, “Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach,” Transportation Research Part C: Emerging Technologies, vol. 121, p. 102861, 2020.
  • [109] J. K. Falkner and L. Schmidt-Thieme, “Learning to solve vehicle routing problems with time windows through joint attention,” arXiv preprint arXiv:2006.09100, 2020.
  • [110] J. Zhao, M. Mao, X. Zhao, and J. Zou, “A hybrid of deep reinforcement learning and local search for the vehicle routing problems,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [111] J. Li, L. Xin, Z. Cao, A. Lim, W. Song, and J. Zhang, “Heterogeneous attentions for solving pickup and delivery problem via deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [112] X. Li, W. Luo, M. Yuan, J. Wang, J. Lu, J. Wang, J. Lü, and J. Zeng, “Learning to optimize industry-scale dynamic pickup and delivery problems,” in 2021 IEEE 37th International Conference on Data Engineering (ICDE).   IEEE, 2021, pp. 2511–2522.
  • [113] H. Lee and J. Jeong, “Mobile robot path optimization technique based on reinforcement learning algorithm in warehouse environment,” Applied Sciences, vol. 11, no. 3, p. 1209, 2021.
  • [114] M. M. Solomon, “Algorithms for the vehicle routing and scheduling problems with time window constraints,” Operations research, vol. 35, no. 2, pp. 254–265, 1987.
  • [115] O. B. G. Madsen, M. L. Fisher, and K. O. Jornsten, Vehicle routing with time windows: Two optimization algorithms, 1997.
  • [116] J. H. Holland, Adaptation in Natural and Artificial System, 1992.
  • [117] M. Desrochers, J. Desrosiers, and M. Solomon, “A new optimization algorithm for the vehicle routing problem with time windows,” Operations Research, vol. 40, pp. 342–354, 04 1992.
  • [118] F. Glover, “Tabu search - part i,” INFORMS Journal on Computing, vol. 2, pp. 4–32, 01 1990.
  • [119] C. Groër, B. Golden, and E. Wasil, “A library of local search heuristics for the vehicle routing problem,” Mathematical Programming Computation, vol. 2, no. 2, pp. 79–101, 2010.
  • [120] D. E. Goldberg, “Genetic algorithms in search,” Optimization, and MachineLearning, 1989.
  • [121] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in neural information processing systems, 2015, pp. 2692–2700.
  • [122] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural combinatorial optimization with reinforcement learning,” 2016.
  • [123] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma, “Strong mixed-integer programming formulations for trained neural networks,” Mathematical Programming, pp. 1–37, 2020.
  • [124] K. Helsgaun, “An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems,” Roskilde: Roskilde University, 2017.
  • [125] TLC, “Nyc taxi & limousine commission trip record data. 2020,”, 2020.
  • [126] Kaggle, “Uber pickups in new york city - trip data for over 20 million uber (and other for-hire vehicle) trips in nyc. 2017.”, 2017.
  • [127] Uber, “Uber movement. 2021.”, 2021.
  • [128] GAIA, “Didi gaia open data set: Kdd cup 2020. 2020.”, 2020.
  • [129] Y. Li, Y. Zheng, and Q. Yang, “Dynamic bike reposition: A spatio-temporal reinforcement learning approach,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1724–1733.
  • [130] GAIA, “Kdd cup 2020: Learning to dispatch and reposition on a mobility-on-demand platform. 2020.”, 2020.
  • [131] Huawei, “ICAPS 2021: The dynamic pickup and delivery problem,”, 2021.
  • [132] CVRPLib, “Cvrplib homepage,”
  • [133] Solomon-benchmark, “Solomon-benchmark. 2008.”, 2008.
  • [134] homberger benchmark, “homberger-benchmark. 2008.”, 2008.
  • [135] Li&Lim-benchmark, “Li&lim-benchmark. 2008.”, 2008.
  • [136] Z. Zong, J. Feng, K. Liu, H. Shi, and Y. Li, “Deepdpm: Dynamic population mapping via deep neural network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 1294–1301.
  • [137] J. Feng, Y. Li, C. Zhang, F. Sun, F. Meng, A. Guo, and D. Jin, “Deepmove: Predicting human mobility with attentional recurrent networks,” in Proceedings of the 2018 world wide web conference, 2018, pp. 1459–1468.
  • [138] Z. Lin, J. Feng, Z. Lu, Y. Li, and D. Jin, “Deepstn+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 1020–1027.
  • [139] P. Geibel, “Reinforcement learning for mdps with constraints,” in European Conference on Machine Learning.   Springer, 2006, pp. 646–653.
  • [140] S. Miryoosefi, K. Brantley, H. Daumé III, M. Dudík, and R. Schapire, “Reinforcement learning with convex constraints,” arXiv preprint arXiv:1906.09323, 2019.
  • [141] E. Taniguchi and R. G. Thompson, City logistics: Mapping the future.   CRC Press, 2014.
  • [142] R. Agarwal, D. Schuurmans, and M. Norouzi, “An optimistic perspective on offline reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2020, pp. 104–114.
  • [143] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, “Meta-reinforcement learning of structured exploration strategies,” arXiv preprint arXiv:1802.07245, 2018.
  • [144] X. Yu and S. Shen, “An integrated decomposition and approximate dynamic programming approach for on-demand ride pooling,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3811–3820, 2019.
  • [145] S. Shah, M. Lowalekar, and P. Varakantham, “Neural approximate dynamic programming for on-demand ride-pooling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 507–515.
  • [146] C. Raju, Y. Narahari, and K. Ravikumar, “Reinforcement learning applications in dynamic pricing of retail markets,” in EEE International Conference on E-Commerce, 2003. CEC 2003.   IEEE, 2003, pp. 339–346.
  • [147] D. Bertsimas and G. Perakis, “Dynamic pricing: A learning approach,” in Mathematical and computational models for congestion charging.   Springer, 2006, pp. 45–79.