1 Introduction
Continuous urbanization and the development of mobile communication have brought many new application demands into urban daily life. Among them, services that transport either humans or parcels to designated destinations, following individual demands or system requirements, are now critical in both urban logistics and transportation. We define such services as Demand-Driven Services (DDS). For example, on-demand food delivery, a typical DDS, is widely used since it significantly improves diet convenience. More than 30 million orders are generated every day on the Meituan-Dianping platform, one of the world's largest on-demand delivery service providers [1]. As another example, large-scale online ridesharing services such as Uber and DiDi have substantially transformed the transportation landscape, offering huge opportunities for boosting current transportation efficiency. These DDS applications bring striking efficiency to city operations in both logistics and transportation, as well as many opportunities to related research fields. Intelligent control of DDS systems with minimal manual intervention is critical to guarantee their effectiveness and has drawn much research interest.
In a typical DDS task, several roles are involved that an implemented system should consider, including the service workers, the service providers, and the corresponding targets. For example, in an on-demand delivery system, a customer who orders food can be seen as the service target, while the restaurant from which the food is ordered is the service provider. A group of such delivery tasks is then assigned to and accomplished by a courier, i.e., the worker. These core DDS elements form a DDS loop, an example of which is illustrated in Figure 1. The same formulation can also be constructed in the ridesharing scenario: each customer who calls for a ride has his/her destination as the target, and the driver serves as the service worker. The DDS platforms that support either delivery or ridesharing services are supposed to provide corresponding algorithms to 1) construct reasonable service loops and 2) guide workers to complete the assignments within these loops. We show DDS loop formulations for several typical scenarios in Table I, including on-demand delivery, ridesharing, express systems, and warehousing.


DDS Scenario       | Service Provider       | Service Target         | Service Worker
-------------------|------------------------|------------------------|---------------
On-demand Delivery | Restaurant             | Customer               | Courier
Ridesharing        | Passenger Origin       | Destination            | Driver
Express (Sending)  | Consignor              | Depot                  | Courier
Express (Delivery) | Depot                  | Consignee              | Courier
Warehousing        | Shelf, Entry, Station  | Shelf, Entry, Station  | AGV

With the fundamental DDS elements defined, managing different service demand pairs (providers and targets), scheduling available service workers, and controlling the entire service system become the major objectives of developing a centralized intelligent DDS platform. The major research problems can be classified into two aspects. First, forming DDS loops over demand pairs and workers, which is also named Dispatching, is the first-hand challenge. The loop-forming process, i.e., the matching between demands and workers, originates from the traditional bipartite graph matching problem, while the dynamic features of the entire environment bring much more complexity. A good dispatching mechanism should not only consider the current states of workers and scattered demands but also take future distributions into account for long-term optimization. Furthermore, even if a worker is not matched with any service demand at present, there remains a large action space for repositioning idling workers to other areas, which forms the Fleet Management problem. The loop-forming stage can be seen as the first stage of a complete DDS.
Second, after workers are assigned numerous demands to satisfy, how to execute the formed loops, i.e., how to schedule detailed Routing strategies, including planning the visiting order of the demand set and selecting paths on real-world road maps, is also critical to the entire system's efficiency. The routing problem originates from the conventional Traveling Salesman Problem (TSP) [2], where a salesman is supposed to visit all cities without revisiting any of them. The Vehicle Routing Problem (VRP) [3] and its variants are valuable in the mathematical formulation of most real-world routing scenarios [4, 5, 6, 7]. A high-quality routing strategy should minimize the total traveling distance to decrease the workers' expenses. The routing stage can be seen as the second stage after dispatching. A robust and stable routing strategy generator is also important for feeding decision information back to the dispatching stage. We illustrate the relationship between the two stages in Figure 2.
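To make the routing stage concrete, the following toy sketch (ours, not drawn from any surveyed work) shows a nearest-neighbor heuristic, one of the simplest constructive baselines for deciding a visiting order over a set of demands:

```python
# Toy nearest-neighbor heuristic for the routing stage (illustrative only):
# greedily visit the closest unvisited demand under Manhattan distance.
def nearest_neighbor_route(start, points):
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    route, current, remaining = [], start, list(points)
    while remaining:
        nxt = min(remaining, key=lambda p: dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

# A courier at (0, 0) with three demand locations.
route = nearest_neighbor_route((0, 0), [(5, 5), (1, 0), (2, 0)])
```

Such greedy constructions are fast but can be far from optimal, which is precisely the gap that the learning-based routing methods surveyed later aim to close.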
Solutions to the mathematical formulations of both stages have been widely studied. For instance, the Kuhn-Munkres (KM) algorithm for bipartite graph matching and branch-and-bound for TSP and VRP can provide exact solutions for simple static problems of limited scale [8, 9]. Considering multiple real-world constraints and additional factors, more complicated dispatching and routing problems have also been investigated extensively in the fields of operations research, applied mathematics, etc. [10, 11, 12, 13]
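For intuition, exact matching on a tiny static instance can be sketched as follows (a brute-force toy we add for illustration; real systems use the KM algorithm or dedicated solvers instead):

```python
# Toy illustration (not from the surveyed literature): exact minimum-cost
# bipartite matching between workers and demands by brute force, feasible
# only for tiny static instances -- the regime where exact methods apply.
from itertools import permutations

def min_cost_matching(cost):
    """cost[i][j]: travel cost of assigning worker i to demand j."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):  # enumerate all n! assignments
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_perm, best_cost

# Three workers, three demands.
costs = [[4, 1, 3],
         [2, 0, 5],
         [3, 2, 2]]
assignment, total = min_cost_matching(costs)
```

The factorial blow-up of this enumeration is exactly why polynomial-time exact algorithms (KM) and, at larger dynamic scales, heuristics and learned policies are needed.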
. In complicated scenarios with larger problem scales, exact optimization is almost impossible. Heuristics and metaheuristics have therefore been widely accepted as alternatives that generate approximate solutions within a much more reasonable time in both stages of DDS [14, 15]. These heuristics-based methods can generate satisfactory solutions in online scenarios and are thus practical in many real-world DDS systems. However, there is still much potential for exploring solutions with higher quality and higher efficiency at larger scales.
With machine learning showing astonishing performance in recent years, it is of great potential to utilize learning-based techniques to further develop DDS systems. Reinforcement Learning (RL) methods have been developed and applied in many planning tasks
[16]. RL generates strategies by modeling a decision process as a Markov Decision Process (MDP). A predefined reward from a long-term perspective serves as the feedback signal for each action attempt, so that RL can optimize sequential decisions. The trial-and-error process trains the agent to learn to select the best action corresponding to different inner states and outside environments. With deep neural networks providing much stronger capability in feature representation and pattern recognition, combining neural networks with RL shows great performance [17]. Many deep RL (DRL) algorithms have been proposed and have become state-of-the-art frameworks for control and scheduling tasks. DRL does not rely on manually designed assumptions and features; instead, it trains a parameterized model to learn the optimal control. It is therefore natural to consider it as the framework for solving the series of planning tasks in DDS.
In this survey, we focus on how DRL can benefit the development of DDS systems in both the dispatching stage and the routing stage. We first introduce major DRL algorithms and four typical DDS in urban operations. Then we summarize existing DRL-based solutions along the following dimensions:

Problem. We classify the research problems in the dispatching and routing stages into more precise subproblems. Order dispatching and fleet management are investigated within the dispatching stage. As for the numerous DRL solutions for the routing stage, we first introduce those solving the typical Capacitated VRP (CVRP) as mathematical solutions, and then discuss more practical routing solutions for VRP variants. We consider four variant problems with additional constraints in this survey, including the Dynamic VRP (DVRP), the Electric VRP (EVRP), the VRP with Time Windows (VRPTW), and the VRP with Pickup and Delivery (VRPPD).

Scenario. The aforementioned research problems exist in several applicable scenarios, and four common DDS scenarios are included in this survey. In transportation systems, we introduce ridesharing services, where vehicles are assigned to transport passengers to their destinations. Specifically, ridesharing can be further classified into ridehailing, where each driver serves only one passenger in a loop, and ridepooling, where multiple passengers can share a ride at the same time. As for logistics systems, where parcels are transported from providers to targets, we summarize solutions for both on-demand delivery systems, which fulfill people's instant demands, and traditional express systems with longer service durations. We also introduce modern warehousing systems, where Autonomous Guided Vehicles (AGVs) transport parcels from one location to another. Note that some important literature providing solutions within a mathematical formulation is also included [18, 19].

Algorithm. We identify the detailed RL algorithm used during model training. The most commonly used ones in existing works belong to model-free RL methods, including DQN [17], PPO [20], REINFORCE [21], etc. We also discuss whether the DDS task is constructed as a single-agent MDP or a multi-agent one.

Network Structure. We also distinguish the neural network design in each work. Commonly used networks include Convolutional Neural Networks (CNN), Graph Neural Networks (GNN) and their variants (including GCN and others), and attention-based (ATT) networks and their variants (including single- and multi-head attention).
Data Type and Data Scheme. We indicate the data type used in each work: either real-world data or data generated from predefined random seeds with a given distribution. Meanwhile, spatial locations are simplified in several ways and to different extents. Generally, there are four data schemes derived from real road networks: 4-way connectivity with cardinal directions, 8-way connectivity with ordinal directions, 6-way connectivity based on hexagonal grids, and the original discrete graph-based structure. Note that the first two can also be summarized as square grids. The different data schemes are shown in Figure 3.

Data and Code Availability. To present the extent of reproducibility of the investigated literature, we report the data availability of the proposed methods. A checkmark means that the data is released by the researchers or can easily be found via a direct web search. We also report the availability of the code; both original open-sourced code and third-party reimplementations are considered.
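As a side illustration of the square-grid data schemes listed above, the 4-way and 8-way connectivities can be encoded as neighbor offsets (a hypothetical helper of ours, not code from any surveyed work):

```python
# Hypothetical helper encoding the square-grid data schemes:
# 4-way connectivity uses cardinal offsets; 8-way adds the ordinal ones.
FOUR_WAY = [(0, 1), (0, -1), (1, 0), (-1, 0)]
EIGHT_WAY = FOUR_WAY + [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def neighbors(cell, offsets, rows, cols):
    """In-bounds neighbor cells of `cell` on a rows x cols grid."""
    r, c = cell
    return [(r + dr, c + dc) for dr, dc in offsets
            if 0 <= r + dr < rows and 0 <= c + dc < cols]
```

For example, the corner cell (0, 0) of a 3x3 grid has two 4-way neighbors but three under 8-way connectivity, which is why the choice of scheme changes the action space of grid-based dispatching agents.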
Besides, we also introduce the available simulation environments for DDS, which are critical for simulating real-world scenarios at much lower expense. Finally, several challenges of using DRL to solve DDS and the remaining open research problems are summarized.
Previous literature investigating related problems includes surveys by Haydari et al. [22] and Qin et al. [23], and several reviews on VRPs [24]. However, Haydari et al. [22] focused on general planning problems in Intelligent Transportation Systems, where Traffic Signal Control (TSC) and Autonomous Driving are emphasized. Qin et al. [23] only investigated the dispatching problems in ridesharing scenarios, and Mazyavkina et al. [24] introduced DRL solutions for mathematical VRPs as part of more general combinatorial optimization. In contrast, we are the first to define DDS at a practical system level and to classify specific research problems in several scenarios with DRL-based solutions, investigating how DRL can benefit their development. The two stages of DDS are discussed, including dispatching, which forms service loops, and routing, which executes service loops. The related literature is summarized in Table II and Table III.
Overall, this paper presents a comprehensive survey on DRL techniques for solving planning problems in DDS systems. Our contributions can be summarized as follows:

To the best of our knowledge, this is the first comprehensive survey that thoroughly defines DDS systems and investigates up-to-date DRL techniques as their solutions.

We classify different stages within a complete DDS system, including the dispatching stage and the routing stage. We also investigate the common applications corresponding to the two stages, introduce the theoretical background of DRL from a broad perspective and explain several important algorithms.

We investigate existing works that utilize DRL for DDS systems. We summarize these works in several dimensions and discuss the individual approaches.

We illustrate the challenges and several open problems in solving DDS problems with DRL. We believe the summarized research directions will benefit relevant research and help direct future work.
The remainder of this survey is organized as follows. We first introduce the background of this survey, including DRL and four common DDS scenarios, in Sec. 2. The stage definitions and the more specific problems with corresponding solutions for dispatching and routing are summarized in Sec. 3 and Sec. 4, respectively. The commonly used simulation environments for both stages are introduced in Sec. 5. We then summarize several challenges of DRL for DDS design and open research problems in Sec. 6 and Sec. 7. Finally, we conclude this survey in Sec. 8.
2 Background
2.1 Reinforcement Learning
RL is a kind of learning that maps environmental states to actions, with the goal of enabling the agent to obtain the largest cumulative reward in the process of interacting with the environment [25]. Usually, a Markov Decision Process (MDP) is used to model RL problems. There are several core elements within RL under an MDP setting, including the agent, the environment, the state, the action, the reward, and the transition. We illustrate the RL control loop in Figure 4, and the detailed descriptions are as follows:

Environment. The environment is the fundamental setting that provides basic information about the exogenous dynamics with which the agent interacts.

Agent. The agent in RL provides actions and interacts with the entire environment. There can be more than one agent, which further forms the multi-agent RL setting.

State. $\mathcal{S}$ is the set of all environmental states. With the planning task modeled as an MDP, the state of the agent, $s_t \in \mathcal{S}$, at decision step $t$ describes the latest situation. The state of the agent serves as the endogenous feature that influences decision making.

Action. $\mathcal{A}$ is the set of executable actions of the agent. The action $a_t \in \mathcal{A}$ is the way the agent interacts with the environment at decision step $t$. Any action can influence the current state of the agent.

Reward. $R$ is the reward function. By continuously carrying out actions that change states, the agent obtains the corresponding task-related reward $r_t = R(s_t, a_t)$, produced by performing action $a_t$ in state $s_t$ at decision step $t$. With $r_t$ as the task signal, the entire training process of RL aims to obtain a high reward, which represents how successfully the agent completes the given task.

Transition. $P$ is the state transition probability distribution function. $P(s_{t+1} \mid s_t, a_t)$ represents the probability that the agent, performing action $a_t$ in state $s_t$, transits to the next state $s_{t+1}$.
In RL, a policy $\pi$ is a mapping from the state space to the action space. The agent selects an action $a_t = \pi(s_t)$ in state $s_t$, executes the action and transits to the next state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$, and receives the reward $r_t$ from environmental feedback at the same time. The immediate reward obtained at each future time step is multiplied by a discount factor $\gamma \in [0, 1]$. From time $t$ to the end of the episode at time $T$, the cumulative reward is defined as $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$, where $\gamma$ is used to weigh the impact of future rewards.
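The discounted return defined above can be computed with a short backward recursion (our illustrative sketch, not code from the surveyed literature):

```python
# Illustrative sketch: discounted return G_t = sum_k gamma^k * r_{t+k},
# accumulated backwards from the end of the episode.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g
```

For example, three unit rewards with $\gamma = 0.5$ yield $1 + 0.5 + 0.25 = 1.75$, showing how the discount factor shrinks the weight of distant rewards.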
The state-action value function $Q^{\pi}(s, a)$ refers to the cumulative reward obtained by the agent when executing action $a$ in the current state $s$ and following policy $\pi$ until the end of the episode, which can be expressed as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]$. For all state-action pairs, if the expected return of one policy is greater than or equal to the expected return of all other policies, then that policy $\pi^*$ is called the optimal policy. There may be more than one optimal policy, but they share the same state-action value function $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$, which is called the optimal state-action value function. Such a function follows the Bellman optimality equation, $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$.
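As a toy illustration (ours, not from the surveyed literature), the Bellman optimality backup can be applied repeatedly on a two-state deterministic MDP until the Q-table converges:

```python
# Toy sketch: tabular Q-value iteration on a two-state deterministic MDP.
# State 1 is absorbing with zero reward; in state 0, action 1 moves to
# state 1 with reward 1, while action 0 stays in state 0 with reward 0.
GAMMA = 0.9
transition = {0: {0: (0, 0.0), 1: (1, 1.0)},
              1: {0: (1, 0.0), 1: (1, 0.0)}}  # state -> action -> (s', r)

Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
for _ in range(100):  # repeatedly apply the Bellman optimality backup
    for s in Q:
        for a in Q[s]:
            s2, r = transition[s][a]
            Q[s][a] = r + GAMMA * max(Q[s2].values())

greedy_action = max(Q[0], key=Q[0].get)  # optimal action in state 0
```

Here the table converges to $Q^*(0, 1) = 1$ and $Q^*(0, 0) = 0.9$, so the greedy policy correctly chooses action 1 in state 0; the exponential growth of such tables with the state space is exactly what motivates the function approximation discussed next.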
In traditional RL, the value function is generally solved by iterating the Bellman equation, $Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$. Through continuous iteration, the state-action value function eventually converges, thereby yielding the optimal policy $\pi^*(s) = \arg\max_a Q^*(s, a)$. However, for practical problems, such a search for an optimal policy is not feasible, since the computational cost of iterating the Bellman equation grows rapidly with the large state space. To tackle this problem, deep learning (DL) is introduced into RL to form deep reinforcement learning (DRL), which utilizes deep neural networks for function approximation in the traditional RL model and significantly improves performance on many challenging applications
[26, 17, 27]. In general, an RL agent can act in two ways: (1) by knowing or modeling the state transition, which is called model-based RL, and (2) by interacting with the environment without modeling a transition model, which is called model-free RL. Model-free RL algorithms fall into two categories: value-based methods and policy-based methods. In value-based RL, an agent learns the value function of a state-action pair and then selects actions based on this value function [25], while in policy-based RL, the action is determined directly by a policy network, which is trained by policy gradient [25]. We will first introduce value-based and policy-based methods, and then discuss their combinations. We illustrate the classification and development of these methods in Figure 5. Besides, we also introduce multi-agent RL as a special category.

Value-based RL. Mnih et al. [26, 17, 27] first combined a convolutional neural network with the Q-learning [28] algorithm from traditional RL and proposed the Deep Q-Network (DQN) framework. The model was first used to process visual perception and is a pioneering, representative work in the field of value-based RL. DQN uses an experience replay mechanism [29] in the training process and trains on the stored transition samples. At each time step $t$, the transition sample $(s_t, a_t, r_t, s_{t+1})$ obtained from the interaction between the agent and the environment is stored in a replay buffer $\mathcal{D}$. During training, a small batch of transition samples is randomly selected from $\mathcal{D}$ each time, and stochastic gradient descent (SGD) with the TD error [30] is used to update the network parameters $\theta$. Training samples are usually required to be independent of each other; such random sampling greatly reduces the correlation between samples and thereby improves the stability of the algorithm. In addition to using a deep convolutional network with parameters $\theta$ to approximate the current value function, DQN uses another network to generate the target value. Specifically, $Q(s, a; \theta)$ represents the output of the current value network, which is used to evaluate the value of the state-action pair in the current state, while $Q(s, a; \theta^-)$ represents the output of the target value network, which is used to approximate the target value $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$. The parameters $\theta$ of the current value network are updated in real time. After every $C$ iterations, the parameters of the target value network are updated by $\theta^- \leftarrow \theta$ and kept frozen for another $C$ iterations. The entire network is trained by minimizing the mean square error between the current value $Q(s, a; \theta)$ and the target $y$. Such a frozen-target mechanism reduces the correlation between the current value and the target value and thus improves the stability of the training process.

Since the selection and evaluation of actions are both based on the target value network $Q(s, a; \theta^-)$, DQN easily overestimates the $Q$ value during learning. To tackle this problem, researchers have proposed a series of methods based on DQN. Hasselt et al. [31] proposed the Deep Double Q-Network (DDQN) algorithm based on the double Q-learning algorithm [32]. There are two different sets of parameters in double Q-learning, $\theta$ and $\theta^-$: $\theta$ is used to select the action corresponding to the maximum $Q$ value, while $\theta^-$ is used to evaluate the value of that optimal action. Such parameter separation decouples action selection from action evaluation so as to reduce the risk of overestimating the $Q$ value. Experiments show that DDQN estimates the $Q$ value more accurately than DQN. The success of DDQN shows that reducing the evaluation error of the $Q$ value improves performance. Inspired by this, Bellemare et al. [33] defined a new operator based on advantage learning (AL) [34] in the Bellman equation to increase the gap between the optimal and suboptimal action values, in order to alleviate the evaluation error caused by the action with the largest $Q$ value. Experiments show that the AL error term can effectively reduce the bias in $Q$-value evaluation and thus promote learning quality. In addition, Wang et al. [35] improved the network architecture of DQN and proposed Dueling DQN, which greatly accelerates task learning.

Value-based RL methods are suitable for low-dimensional discrete action spaces. However, they cannot solve decision-making problems in continuous action spaces, such as autonomous driving or robot movement. Therefore, we further introduce policy-based RL methods, which are capable of solving continuous decision-making problems.
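The difference between the DQN and Double DQN targets described above can be sketched with plain lists standing in for the networks (an illustrative toy of ours, not an implementation from the literature):

```python
# Illustrative toy: DQN vs. Double DQN target computation, with lists
# standing in for the online (theta) and target (theta^-) networks'
# Q-values over the next state's actions.
def dqn_target(r, q_target_next, gamma=0.99):
    # DQN: the target network both selects and evaluates the next action.
    return r + gamma * max(q_target_next)

def ddqn_target(r, q_online_next, q_target_next, gamma=0.99):
    # DDQN: the online network selects, the target network evaluates.
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[a_star]

# The target net overestimates action 1, while the online net prefers 0.
q_online, q_targ = [1.0, 0.9], [0.5, 2.0]
y_dqn = dqn_target(0.0, q_targ, gamma=0.9)              # follows the spike
y_ddqn = ddqn_target(0.0, q_online, q_targ, gamma=0.9)  # avoids it
```

In this contrived case the DQN target inherits the target network's overestimated value, while the decoupled DDQN target does not, which is the intuition behind DDQN's more accurate value estimates.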
Policy-based RL. Policy-based RL [36] updates the policy parameters $\theta$ directly by computing the gradient of the cumulative reward of the policy with respect to the policy parameters, finally converging to the optimal policy, where $R(\tau)$ represents the sum of rewards received in an episode $\tau$. The most common idea of the policy gradient is to increase the probability of the trajectories with higher reward. Assume the state, action, and reward trajectory of a complete episode is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$. Then the policy gradient is expressed as $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) R(\tau)\right]$. Such a gradient can be used to adjust the policy parameters by $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$, where $\alpha$ is the learning rate, which controls the rate of policy parameter updates. The gradient term $\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$ represents the direction that increases the probability of occurrence of trajectory $\tau$. After multiplying by the score $R(\tau)$, it makes the probability density of the trajectories with higher reward greater. As trajectories with different total rewards are collected, the above training process shifts the probability density toward the trajectories with higher total rewards and maximizes their appearance probability.
However, the above method lacks the ability to distinguish trajectories of different quality, which leads to a slow and unstable training process. To solve these problems, Williams et al. [21] proposed the REINFORCE algorithm with a baseline as a relative standard for the reward: $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \left(R(\tau) - b\right)\right]$, where $b$ is a baseline related to the current trajectory, usually set as an expected estimate of $R(\tau)$ in order to reduce the variance of the gradient. The more $R(\tau)$ exceeds the baseline $b$, the greater the probability that the corresponding trajectory will be selected. Therefore, in DRL tasks with large-scale states, the policy can be parameterized by a deep neural network, and the traditional policy gradient method can be used to solve for the optimal policy.

However, policy-based RL methods are very unstable during training due to the inaccurate estimation of the baseline, and are inefficient because complete episodes are required for parameter updates. To solve these problems, researchers proposed actor-critic methods, which combine value-based and policy-based RL methods.
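Before moving on, the baseline idea above can be sketched in a few lines (our simplified illustration, using a mean-return baseline):

```python
# Simplified illustration of the REINFORCE baseline: trajectories whose
# return exceeds the mean get positive weight in the policy gradient,
# those below it get negative weight.
def reinforce_weights(returns):
    b = sum(returns) / len(returns)  # baseline: mean episode return
    return [R - b for R in returns]
```

With episode returns [1.0, 2.0, 3.0], the weights become [-1.0, 0.0, 1.0]: the best trajectory is reinforced, the worst is suppressed, and centering the weights around zero reduces gradient variance without changing its expectation.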
Actor-Critic RL. V. R. Konda et al. [37] first proposed actor-critic (AC) methods, which leverage the advantages of both value-based and policy-based methods. AC methods include two estimators: an actor that plays the role of the policy-based method by interacting with the environment and generating actions according to the current policy, and a critic that plays the role of the value-based method by estimating the value of the current state during training. In AC methods, the critic's estimation of the value of the current state makes the RL training process more stable. In addition, some actor-critic RL methods introduce gradient restrictions or replay buffers so that the collected data can be reused, thereby improving training efficiency.
R. S. Sutton et al. [25] proposed the Advantage Actor-Critic (A2C) method, which adds a baseline to the value so that the feedback can be either positive or negative. V. Mnih et al. [38] introduced distributed machine learning into A2C, resulting in the Asynchronous Advantage Actor-Critic (A3C) algorithm, which greatly improved the efficiency of A2C. Wang et al. combined the AC method with experience replay and proposed Actor-Critic with Experience Replay (ACER) [39]; this method enables the AC framework to train in an off-policy way to improve data utilization efficiency. Lillicrap et al. [40] leveraged the ideas of DQN to extend the Deterministic Policy Gradient [41] (DPG) method and proposed the Deep Deterministic Policy Gradient (DDPG) method based on the actor-critic framework, which can be used to solve decision-making problems in continuous action spaces. Moreover, it also introduced the replay buffer so that collected data can be reused to improve training efficiency. Although DDPG can sometimes achieve good performance, it is still fragile with respect to hyperparameters. A common failure mode of DDPG is overestimating the real $Q$ value, which makes the learned policy worse. To solve this problem, Fujimoto et al. [42] proposed Twin Delayed DDPG (TD3), which introduces three techniques on top of DDPG: clipped double-Q learning to reduce the bias of the $Q$-value estimation, along with delayed policy updates and target policy smoothing to reduce the impact of the $Q$-value estimation bias on policy training. Furthermore, Schulman et al. [44] proposed Trust Region Policy Optimization (TRPO). The core idea of TRPO is to constrain the difference between the prediction distributions of the old and new policies on the same batch of data, so as to avoid excessive gradient updates and ensure the stability of the training process. However, TRPO employs the conjugate gradient algorithm to solve the constrained optimization problem, which greatly reduces computational efficiency and increases implementation cost. Therefore, Schulman et al. [20] proposed the Proximal Policy Optimization (PPO) algorithm, which employs importance sampling [43] and a clipped surrogate objective function to avoid the calculations required by constrained optimization and make the training process more robust.

Multi-Agent RL. Many real-world problems require modeling the interactions among different agents, and multi-agent RL algorithms are thus needed. A common approach is to assign each agent a separate training mechanism. Such a distributed learning architecture reduces the learning difficulty and computational complexity. For DRL problems with large-scale state spaces, using the DQN algorithm instead of Q-learning to train each agent individually yields a simple multi-agent DRL system. Tampuu et al. [45] dynamically adjusted the reward model according to different goals and proposed a DRL model in which multiple agents can cooperate and compete with each other. When faced with reasoning tasks that require multiple agents to communicate with each other, DQN models usually fail to learn an effective strategy. To solve this problem, Foerster et al. [46] proposed the Deep Distributed Recurrent Q-Networks (DDRQN) model for multi-agent communication and cooperation with partially observable states. Besides distributed learning, other mechanisms, including cooperative learning, competitive learning, and direct parameter sharing, are also used in different multi-agent scenarios [47].
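To close this subsection, PPO's clipped surrogate objective mentioned above can be sketched for a single sample (our simplified illustration with assumed variable names; eps is the clipping range):

```python
# Simplified single-sample sketch of PPO's clipped surrogate objective.
# ratio = pi_new(a|s) / pi_old(a|s) is the importance-sampling ratio;
# advantage estimates how much better the action was than average.
def ppo_clip_objective(ratio, advantage, eps=0.2):
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip to [1-eps, 1+eps]
    # take the pessimistic (lower) of the two surrogate terms
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum removes any incentive to push the ratio far outside the clipping range: a large ratio with positive advantage is capped at (1 + eps), while a small ratio with negative advantage is floored at (1 - eps), keeping each update close to the old policy.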
2.2 Application Overview
As defined above, a DDS system transports either humans or parcels to provided destinations following individual demands or systematic requirements. We briefly introduce several urban DDS applications that have significant importance in our daily lives, as illustrated in Figure 6.
2.2.1 Ridesharing
Compared to traditional taxi-hailing services, in which passengers are offered rides by chance, a ridesharing service matches passengers with drivers according to their demands submitted via mobile apps, such as DiDi [48] and Uber [49]. When a potential passenger submits a request from the app to the centralized platform, the platform first estimates the trip price and sends it back. If the passenger accepts it, a matching module attempts to match the passenger to a nearby available driver. The matching process may take time due to real-time vehicle availability, so pre-matching cancellations may occur. After a successful match, the driver drives to the passenger and transports him/her to the destination, obtaining the trip fare upon arrival. To reduce the average waiting time for a successful match, the platforms usually utilize a fleet management module in the backend to continuously rebalance idling vehicles by guiding them to places with a higher likelihood of new requests. The decisions from matching and fleet management are finally executed within the routing stage: vehicles are navigated to serve passengers or repositioned to new areas following these strategies. In the ridesharing scenario, the service worker of a loop refers to the vehicle, while the provider and the target refer to the passenger's pickup location and destination, respectively.
A ridesharing service can be further classified into ridehailing, where a driver is assigned only one passenger at a time, and ridepooling (also known as carpooling), where several passengers share a vehicle at a time. Note that in some literature the multi-passenger scenario is also named ridesharing; in this survey, we use ridepooling specifically for disambiguation, following [23].
2.2.2 On-demand Delivery
Many platforms around the world provide food delivery services, such as PrimeNow [50], UberEats [51], MeiTuan [1], and Eleme [52]. Beyond delivering food, the newly rising instant delivery services can also deliver small parcels from one customer to another or help purchase other daily merchandise, such as medicines, directly from local shops or pharmacies. Both food and instant delivery can be seen as types of on-demand delivery. Compared with traditional delivery platforms, e.g., FedEx and UPS, the orders on on-demand delivery platforms are expected to be fulfilled in a relatively short time, e.g., 30 minutes to 1 hour. A typical on-demand delivery process involves four parties: a customer as the service target, a merchant as the service provider, a courier as the worker, and the centralized platform. The customer first places an order in a platform's smartphone app; the merchant then starts to prepare the order while the platform assigns a courier to pick it up. Finally, the courier delivers the order to the customer.
2.2.3 Express Systems
As a long-existing DDS system, an express system is required both to pick up parcels from consignors and bring them to fixed depots, and to deliver parcels loaded from the depots to consignees. In practical express systems such as FedEx [53] and Cainiao [54], pickup and delivery are usually considered simultaneously and can be handled within the same service loop. A courier loads parcels at the depot and then delivers them to their destinations one by one via a delivery van. Meanwhile, new pickup requests may come from local customers during the delivery process, each of which is associated with a service location. The courier should also visit these places to fulfill the pickup requests. Couriers are required to depart from and return to the depots by a specific time, to fit the schedule of the trucks that regularly transport packages to and from the stations.
2.2.4 Warehousing
Apart from the DDS applications that interact directly with humans, rising autonomous technologies enable unmanned management in local warehousing. Shipment requests for cargo, usually of large size and weight, are common within a repository or among several repositories. Cargo is continuously moved onto a target shelf and moved out to accommodate global shipping requirements. To reduce expenses and improve efficiency, automated guided vehicles (AGVs) are commonly used in modern warehousing. In a warehousing service loop, the service provider refers to the original shelf and the service target refers to the corresponding destination, while AGVs serve as the workers throughout the process. An intelligent centralized platform is responsible for controlling all AGVs for efficient operations.
2.3 Relationship between Two Stages
Generally, the research problems within practical DDS systems can be classified into two stages, i.e., dispatching and routing. The dispatching stage mainly handles the relationship between service workers and demand pairs and thus constructs service loops, while the routing stage focuses on how to execute the services within each established loop. We hereby note that the two stages are not rigidly separated. A reasonable dispatching algorithm should consider future in-loop routing strategies as a measurement proxy: whether a better routing solution can be generated is a direct criterion for judging different dispatching strategies. For example, a courier should not be assigned a demand request that is far away from him, since the routing distance within such a loop would be too long. On the other hand, practical routing scenarios where a fleet of workers is on duty imply that cooperation among different workers needs consideration, and thus dispatching is involved.
However, such a classification is necessary to concentrate on the primary challenges in different practical scenarios. An important reference metric for this classification, which we demonstrate in this survey, is the Demand/Worker Ratio. A low ratio means that the numbers of workers and demand pairs are balanced in each constructed loop, and thus the major space for optimization is determining how different requests should be assigned. For instance, a driver can only take one passenger in ride-hailing and no more than two passengers in ride-pooling. How to match drivers with customer requests is critical to global efficiency, while computing in-loop routing strategies is not computationally expensive. Meanwhile, a large ratio implies that a worker has to serve many demand requests within its loop. The routing stage, i.e., how to execute the loops, thus has high problem complexity and requires an intensive optimization process. For instance, a courier in express systems may be assigned hundreds of parcels, and generating his optimal routing strategy becomes the primary challenge due to its NP-hard nature.
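This ratio-based classification can be sketched as a one-line dispatcher-side rule; the threshold value below is purely illustrative, not a number from the survey:

```python
def stage_focus(num_demands: int, num_workers: int, threshold: float = 5.0) -> str:
    """Decide which DDS stage dominates the optimization effort.

    A low demand/worker ratio leaves most of the optimization space in
    assignment (dispatching); a high ratio makes per-loop routing the hard
    part. The threshold of 5 is a hypothetical illustration.
    """
    ratio = num_demands / num_workers
    return "routing" if ratio > threshold else "dispatching"
```

For example, a ride-hailing fleet with roughly as many drivers as requests falls on the dispatching side, while an express courier with hundreds of parcels falls on the routing side.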
In the following sections, we focus on the dispatching and routing stages, discussing the subproblems within each stage and introducing existing solutions respectively.
3 Stage 1: Dispatching
Given the information of available workers and continuously updated service demand pairs, the first stage of DDS is to coordinate the relationship between demands and the available workers, and thus establish service loops both effectively and efficiently. We name this loop-forming process 'Dispatching'. Generally, the dispatching stage consists of two aspects: 1) order matching, which aims to find the best matching strategy between workers and demands, and 2) fleet management, which repositions idle workers to balance the local demand-supply ratio so that better order matching can be obtained in the future. Figure 8 shows an overview of the dispatching phase in DDS.
Formulated as an optimization problem, both tasks in the dispatching scenario are complicated due to three challenges. First, the continuously changing demand distributions and worker states bring high dynamics to the entire Markov Decision Process (MDP); it is non-trivial to accurately evaluate the returns of different decision attempts. Second, a successful matching strategy should consider long-term returns [55]: a myopic solution that only considers the current demand distribution may result in a long-term loss. For example, assigning all vehicles to serve every current demand may be a local maximum in ridesharing, but may decrease the profit in the next time window, since some vehicles are sent to areas where barely any new demands appear. Third, a centralized platform should consider multiple, even a very large number of, workers simultaneously. Effectively modeling the cooperation, and sometimes competition, among them is critical to improving system efficiency.
Given these challenges, DRL has a natural advantage for solving the order matching problem compared to conventional methods and other learning-based methods. Many online reinforcement learning methods have been developed to handle the non-stationarity in MDP modeling. Taking expected returns as learning signals, DRL is a proper framework for optimizing sequential decision tasks, including dispatching tasks. Besides, modeling workers as agents is a natural way to handle the decision problem, either by modeling all workers homogeneously with the same policy, or by explicitly considering the interactions among multiple agents.
In this section, we introduce both the order matching and fleet management problems. For each problem, we first introduce the problem definition, common metrics, and several conventional methods, and then thoroughly discuss detailed applications in transportation and logistics. The DRL-based literature for the dispatching stage is summarized in Table II.
3.1 Order Matching
An order matching process assigns currently unserved service demands to available workers. It is also known by other names, such as order-driver assignment in ridesharing services. The mathematical formulation originates from the online bipartite graph matching problem, where both the supplies (the workers) and the demands are dynamic. It is an important module in highly dynamic real-time online DDS applications, such as ridesharing and on-demand delivery [14, 11, 57]. Information including unserved demands, travel costs, and worker availability is updated continuously, which adds to the problem's complexity.
Beyond purely assigning demands to workers, practical DDS systems also consider additional action choices. For instance, vehicles in a ridesharing system can be designated to idle when no proper demand can be assigned to them. As electric vehicles are widely deployed, whether to recharge or continue accepting new demands forms a new decision problem [58]. Furthermore, controlling the number of demands assigned to the same worker also expands the action space, e.g., when considering ride-hailing and ride-pooling scenarios simultaneously [58]. When each driver can serve more than one customer in a loop, the action is to determine how many customers, and which ones, to pick up.
As for the goal of order matching, there are generally two aspects to consider: optimizing profit for the platform and experience on the demands' side:

- Maximize the Gross Merchandise Volume (GMV) [56]. With each service loop priced, a core evaluation metric of an effective order matching system is the total revenue of all services over time. In ride-hailing services specifically, it is also called Accumulated Driver Income (ADI) in some literature [59]. Generally, the profit perspective stands for the interests of both the workers and the entire platform.
- Maximize the Order Response Rate (ORR) [59]. Since not all demands are fulfilled in real-world scenarios, another goal is to maximize the ORR, which evaluates satisfaction on the demands' side. Based on the intuition that response time is closely tied to ORR, it is also an alternative proxy for the customers' interest. Note that ORR is highly correlated with GMV, since the more demands are fulfilled, the more revenue the platform can obtain within a certain period.
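Both metrics can be computed directly from a service log; a minimal sketch (the field name `price` and the log layout are illustrative, not from any cited platform):

```python
def gmv(completed_orders):
    """Gross Merchandise Volume: total price of all fulfilled service loops."""
    return sum(order["price"] for order in completed_orders)

def order_response_rate(num_fulfilled, num_total):
    """ORR: fraction of incoming demands that were actually served."""
    return num_fulfilled / num_total if num_total else 0.0
```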


Table II: Summary of DRL-based literature for the dispatching stage.

| Reference | Year | Problem | Scenario | Algorithm | Network Structure | D-scheme | D-type | D-avail | Code |
|---|---|---|---|---|---|---|---|---|---|
| Li et al. [59] | 2019 | Order Matching | Ridesharing | MFRL [60] | MLP | Hexagon-Grid | real | x | x |
| Zhou et al. [61] | 2019 | Order Matching | Ridesharing | Double DQN [31] | MLP | Hexagon-Grid | real, sim | ✓ | x |
| Xu et al. [56] | 2018 | Order Matching | Ridesharing | TD [30] | - | Square-Grid | real, sim | x | x |
| Wang et al. [62] | 2018 | Order Matching | Ridesharing | DQN [17] | MLP, CNN | Hexagon-Grid | real | x | x |
| Tang et al. [63] | 2019 | Order Matching | Ridesharing | Double DQN [31] | MLP | Hexagon-Grid | real | x | x |
| Jindal et al. [58] | 2018 | Order Matching | Ride-pooling | DQN [17] | MLP | Square-Grid | real | ✓ | x |
| He et al. [64] | 2019 | Order Matching | Ridesharing | Double DQN [31] | MLP, CNN | Square-Grid | real | x | x |
| Al-Abbasi et al. [65] | 2019 | Order Matching | Ridesharing | DQN [17] | CNN | Square-Grid | real | x | x |
| Qin et al. [66] | 2021 | Order Matching | Ridesharing | AC [37], ACER [39] | MLP | Square-Grid | real | x | x |
| Wang et al. [67] | 2019 | Order Matching | Ridesharing | Q-Learning [28] | - | Graph-based | real, sim | ✓ | x |
| Ke et al. [68] | 2020 | Order Matching | Ridesharing | DQN [17], A2C [25], ACER [39], PPO [20] | MLP | Square/Hexagon-Grid | real, sim | ✓ | x |
| Yang et al. [69] | 2021 | Order Matching | Ridesharing | TD [30] | MLP | Square-Grid | real | x | x |
| Chen et al. [70] | 2019 | Order Matching | On-demand Delivery | PPO [20] | MLP | Square-Grid | real, sim | x | x |
| Li et al. [59] | 2019 | Order Matching | Express | DQN [17] | MLP, CNN | Square-Grid | real | x | x |
| Li et al. [71] | 2020 | Order Matching | Express | DQN [17] | MLP, CNN | Square-Grid | real | x | x |
| Hu et al. [72] | 2020 | Order Matching | Warehousing | DQN [17] | MLP | Graph-based | real | x | x |
| Lin et al. [73] | 2018 | Fleet Management | Ridesharing | A2C [25], DQN [17] | MLP | Hexagon-Grid | real | ✓ | ✓ |
| Zhang et al. [74] | 2020 | Fleet Management | Ridesharing | Dueling DQN [35] | MLP | Hexagon-Grid | real | ✓ | ✓ |
| Wen et al. [75] | 2017 | Fleet Management | Ridesharing | DQN [17] | MLP | Square-Grid | real, sim | x | ✓ |
| Oda et al. [76] | 2018 | Fleet Management | Ridesharing | DQN [17] | CNN | Square-Grid | real | x | x |
| Liu et al. [77] | 2020 | Fleet Management | Ridesharing | DQN [17] | GCN [78] | Square-Grid | real | ✓ | ✓ |
| Shou et al. [79] | 2020 | Fleet Management | Ridesharing | DQN [17] | AC [37] | Square-Grid | real | ✓ | ✓ |
| Jin et al. [80] | 2019 | Matching + Fleet Management | Ridesharing | DDPG [40] | MLP, RNN | Hexagon-Grid | real | ✓ | ✓ |
| Holler et al. [81] | 2019 | Matching + Fleet Management | Ridesharing | DQN [17], PPO [20] | MLP | Square-Grid | real, sim | x | x |
| Guo et al. [82] | 2020 | Matching + Fleet Management | Ridesharing | Double DQN [31] | CNN | Square-Grid | sim | x | x |
| Liang et al. [83] | 2021 | Matching + Fleet Management | Ridesharing | DQN [17], A2C [25] | MLP | Graph-based | real | x | x |

3.1.1 Conventional Methods for Order Matching
The order matching problem and many of its variants have been widely studied in the field of Operations Research (OR). Given deterministic information about both workers and demands, the problem can be summarized as bipartite matching and solved via the classical Kuhn-Munkres (KM) algorithm [8]. Early methods used greedy algorithms to assign the nearest available vehicle to a ride request [10]. These methods ignore the global demand and supply, and thus cannot achieve optimal performance in the long run. With new demands and worker states updating continuously, stochastic modeling becomes a major challenge, and researchers have developed heuristics to deal with it efficiently [11, 84, 85, 14]. Based on historical data and the predictable pattern of demands, Sungur et al. [84] use stochastic programming to model uncertain demands in the courier delivery scenario. Lowalekar et al. [85] tackle the problem with stochastic optimization using Benders decomposition and propose a matching framework for on-demand ride-hailing. Hu and Zhou [14] also formulate it as a dynamic problem and use heuristic policies to explore the structure of the optimal solution.
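The early greedy baseline can be sketched in a few lines; the data layouts below are illustrative. Its myopia with respect to global supply is precisely the weakness that motivates the stochastic and learning-based methods:

```python
import math

def greedy_nearest_matching(workers, demands):
    """Greedy heuristic: serve demands in arrival order, each by the
    nearest still-available worker.

    workers: dict worker_id -> (x, y) position.
    demands: list of (demand_id, (x, y)) pickup locations, in arrival order.
    Returns a dict demand_id -> worker_id.
    """
    available = dict(workers)
    assignment = {}
    for demand_id, pickup in demands:
        if not available:
            break  # no free workers left; remaining demands go unserved
        nearest = min(available, key=lambda w: math.dist(available[w], pickup))
        assignment[demand_id] = nearest
        del available[nearest]
    return assignment
```

Note how the first demand can "steal" a worker that would serve a later demand far better, which is why bipartite matching over a whole batch (e.g., via KM) generally dominates this rule.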
3.1.2 DRL for Order Matching in Transportation Systems
Order matching is an essential decision and optimization problem in transportation systems, such as ridesharing services. Modern taxis and in-service vehicles can share their real-time coordinates and states with the centralized platform via mobile networks. On the other hand, each customer can generate a new request, including the provided pickup location and the destination, as a demand pair. The platform receives emerging demands and executes online matching policies accordingly. In transportation DDS, where the demand/worker ratio is relatively low, the coordination between demands and supplies is the principal issue, and order matching between them is thus critical to improving system operation efficiency. From the agents' perspective, an intuitive way to formulate the MDP of order matching is to model all drivers in the system as different agents, leveraging multi-agent RL (MARL) techniques [59, 61]. However, a direct multi-agent formulation in real-world scenarios without any simplification may suffer from the enormous joint action space of thousands of agents. As a solution, Li et al. [59] used Mean Field Reinforcement Learning (MFRL) [60], which approximates the interactions of each agent with the average of its neighbors. Zhou et al. [61] argue that no explicit cooperation or communication is needed in a large-scale scenario; they propose a decentralized execution method to dispatch orders following a joint evaluation.
In comparison, another simplified and commonly accepted way to model the cooperation is to train a single policy and apply it to all workers online [56, 62, 63, 86]. In this formulation, all workers share homogeneous state spaces, action spaces, and reward definitions. Even though the system is still multi-agent from the global perspective, the training stage only considers a single agent. Specifically, Xu et al. [56] model order matching as a sequential decision-making problem and develop a joint learning-and-planning approach: they use Temporal Difference (TD) learning [30] to approximate the driver value function in the learning stage, and then use the KM algorithm to solve the bipartite matching problem based on the learned values during planning. Wang et al. [62] propose a transfer learning method to increase learning adaptability and efficiency, where the learned order matching model can be transferred to other cities; they use the DQN algorithm to estimate the value network. Tang et al. [63] further utilize the double-DQN framework to obtain a more stable learning process. Since the online dynamic order matching scenario requires comprehensive consideration of spatial-temporal features, they develop a special network structure using hierarchical coarse coding and cerebellar embedding memories for better representations. Also leveraging spatial-temporal features, He et al. [64] develop a capsule-based network for better representations. Jindal et al. [58] concentrate on the ride-pooling task and design their agent to decide whether a vehicle should take a single passenger or multiple passengers, leaving the detailed matching to low-level algorithms. The homogeneous agent formulation avoids common challenges of multi-agent RL, including the exponential decision space of multiple agents; complicated communication is also avoided since all agents share the same state.

Instead of treating different workers as agents, a request or the complete request list can also be treated as the agent. Yang et al. [69] model each demand as an agent and train a value network to estimate the values of demands instead of workers; a separate many-to-many matching process is then executed based on the learned values. Since online order matching suffers from non-stationarity due to high dynamics, some literature also attempts to transform it into a static problem by concentrating on each time window [68, 67], following such an agent modeling. Ke et al. [68] model each request as an agent, where all agents share the same policy, and the action of each agent is whether to delay the current request to the next time window for further matching decisions. Wang et al. [67] train a single agent that represents the entire request list and decides how long the current window lasts. In both formulations, the eventual matching results are generated by static bipartite graph matching.
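The learning-and-planning decomposition of [56] can be sketched with a learned grid-value table: score each (worker, order) edge by a TD-style advantage, then match. The greedy matcher below is a simplified stand-in for the KM step, and the value table `V` is hypothetical:

```python
def edge_score(V, worker_grid, dest_grid, price, gamma=0.9):
    """TD-style advantage of serving an order: immediate price plus the
    discounted value of the destination grid, minus the worker's current
    grid value. V maps grid ids to learned state values."""
    return price + gamma * V.get(dest_grid, 0.0) - V.get(worker_grid, 0.0)

def plan_matching(V, workers, orders, gamma=0.9):
    """Greedy stand-in for the bipartite-matching (KM) planning step:
    repeatedly commit the highest-scoring remaining (worker, order) pair.

    workers: dict worker_id -> grid_id; orders: list of dicts with
    'id', 'dest', 'price'. Returns a list of (worker_id, order_id) pairs.
    """
    pairs = []
    free_w, free_o = set(workers), {o["id"]: o for o in orders}
    while free_w and free_o:
        w, o = max(
            ((w, o) for w in free_w for o in free_o.values()),
            key=lambda wo: edge_score(V, workers[wo[0]], wo[1]["dest"],
                                      wo[1]["price"], gamma),
        )
        pairs.append((w, o["id"]))
        free_w.discard(w)
        free_o.pop(o["id"])
    return pairs
```

With a learned `V`, an order whose destination is a high-value grid beats an equally priced order heading into a dead zone, which is exactly the long-term effect the TD learning stage is meant to capture.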
3.1.3 DRL for Order Matching in Logistic Systems
Order matching is essential not only in transportation applications but also in modern logistic systems. As pickup requests arrive in real time and many couriers pick up packages concurrently, managing couriers so that they cooperate and complete more pickup tasks over the long run is important yet challenging.
To meet the fast-response requirement of on-demand delivery customers, modern on-demand delivery systems need effective matching strategies to assign new demands to couriers. Chen et al. [70] propose a framework that utilizes multi-layer images of spatial-temporal maps to capture real-time representations of the service areas; they model different couriers as multiple agents and use Proximal Policy Optimization (PPO) [20] to train the corresponding policy. As for the more common express systems, researchers also focus on developing effective and efficient intelligent express systems by optimizing the order matching problem. Zhang et al. [87] first systematically study the large-scale dynamic city express problem and adopt a batch assignment strategy that computes the pickup-delivery routes for a group of requests received in a short period, rather than dealing with each request individually. Moving beyond heuristic-based methods, Li et al. [59] propose a soft-label clustering algorithm named BDSB to dispatch parcels to couriers in each region, and further propose a novel Contextual Cooperative Reinforcement Learning (CCRL) model to guide where each courier should deliver and serve in each short period. Rather than considering both pickup and delivery tasks, Li et al. [71] further propose a Cooperative Multi-Agent Reinforcement Learning model to learn courier dispatching policies.
3.2 Fleet Management
When a service worker has no assigned demands and is temporarily idle, a well-considered repositioning strategy can increase the possibility of future service chances and thus the entire platform's revenue. This repositioning process forms the important fleet management problem, also referred to as vehicle repositioning or taxi dispatching [77]. A straightforward intuition is that reasonable management helps balance demands and supplies across regions, and thus helps improve the demand matching rate. We present the commonly accepted MDP modeling for the fleet management problem and investigate the related DRL applications.
3.2.1 Conventional Methods for Fleet Management
Balancing the distributions of DDS workers and demands has been extensively studied, especially in transportation systems; for instance, the balance between taxis and customers is essential to an efficient transportation system [88]. Traditional methods were mostly data-driven, relying heavily on historical records of supply and demand distributions. Miao et al. [89] capture uncertainty sets of random demand probability distributions via spatial-temporal features. Yuan et al. and Qu et al. construct recommendation systems that provide vehicles with repositioning options [90, 91]. Various techniques, including mixed-integer programming and combinatorial optimization, have been utilized to model and solve the fleet management problem [92, 93].
3.2.2 DRL for Fleet Management
Following the idea of partitioning the city area into local grids to reduce computation cost, the MDP modeling of fleet management is also constructed on a discretized spatial space. Given the spatial-temporal states of the workers in the fleet as individual agents, together with the information of dynamically updated customers, an intuitive policy is to reposition available workers to locations with a larger demand/supply ratio than their current ones. For computational efficiency, the agents within the same grid during the same period are often treated as identical agents [73]. As in order matching, the goal of the platform is to maximize the long-term revenue of the entire platform over all agents, or the total response rate. Since the measurement includes detailed matching between demands and workers, an intuitive assumption is that a worker can only be matched with demand providers from its current neighboring grids. The action of each agent is defined on the grid map and contains discrete choices: moving to one of its connected neighboring grids, or staying where it is.
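The grid action space and the demand/supply intuition can be sketched as a single repositioning step. The square-grid neighborhood below is an illustrative choice (many of the cited works use hexagonal grids), and the rule is a heuristic baseline, not any paper's learned policy:

```python
def reposition(grid_pos, demand, supply):
    """One-step repositioning rule for an idle worker on a square grid:
    move to the neighboring cell (or stay) with the highest demand/supply
    ratio. demand/supply map (x, y) cells to counts; missing cells are 0.
    """
    x, y = grid_pos
    candidates = [(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    def ratio(cell):
        # avoid division by zero when a cell currently has no supply
        return demand.get(cell, 0) / max(supply.get(cell, 0), 1)
    return max(candidates, key=ratio)
```

A DRL agent replaces `ratio` with a learned Q-value over the same discrete action set, which is what lets it trade current imbalance against long-term return.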
Following this formulation, many DRL-based methods have recently been proposed to address the fleet management problem [73, 76, 75, 94, 95, 74, 77, 79]. Lin et al. [73] model the cooperation within the fleet as a multi-agent environment and propose a MARL-based solution for fleet management. Zhang et al. [74] develop a DDQN-based [35] framework that learns to rewrite the current repositioning policy. Wen et al. [75] explore the fleet management problem from a new taxi-driver perspective: they focus on increasing the individual incomes of drivers and demonstrate that higher revenues for drivers can attract more drivers to the platform and thus improve service availability for customers. Shou et al. [79] further address the sub-optimal equilibrium caused by competition among drivers: they propose a reward design scheme and establish multi-agent modeling of different drivers. In these works, since the action space for city-scale fleet management can be extremely large, deep Q-network learning [17] has been commonly adopted by state-of-the-art approaches to accelerate policy learning; the agents can quickly interact with the environment based on the learned Q-values and decide their next movements accordingly.
3.3 Joint Scheduling of Order Matching and Fleet Management
Besides individual studies of order matching and fleet management, researchers also attempt to develop algorithms that treat both problems as an integrated dispatching stage [80, 81, 82].
Since the action spaces of the two problems are heterogeneous, Jin et al. [80] propose a hierarchical DRL-based structure to handle the two phases. Specifically, they design a unified action as a ranking weight vector to rank and select either the specific order for matching or the destination for fleet management. Holler et al. [81] separate the two phases of the joint platform: they first treat the drivers as individual agents for order matching and then establish a central fleet management agent responsible for all individual drivers. Guo et al. [82] use a double-DQN-based framework to solve the fleet management problem first and leave the detailed order matching to the traditional Kuhn-Munkres (KM) algorithm [8]. Liang et al. [83] preserve the topology of the original graph-based supply-demand distribution instead of discretizing it into a grid view; a special centralized programming-planning module is developed to dispatch thousands of taxis in real time.

A major challenge of integrating the entire dispatching stage is modeling the heterogeneous actions of the two individual phases. A well-designed unified latent representation of agent states is essential to augment the policy exploration ability and training robustness in a joint DRL framework. Further joint research on the two phases remains an opportunity for effective DDS.
4 Stage 2: Routing
With tasks assigned to different workers and the relationship between workers and demands balanced, the service loops are constructed. The second stage of DDS scheduling is to determine how to serve each demand pair within the constructed service loops. For example, in the ride-pooling situation, a driver may have several customers in the car at a time and should decide the individual service priority. Routing is an even more prominent problem in logistic systems, where the demand/worker ratio is much larger. For example, an express van may be assigned more than a hundred delivery demands within its current service loop, and a well-considered visiting strategy to execute the loop is critical to reducing expenses.
Generally, routing problems can be derived from the conventional Vehicle Routing Problem (VRP). For convenience, we first provide a mathematical formulation of the typical Capacitated VRP (CVRP), and then discuss recent DRL-based solutions to routing problems.
4.1 Formulation of Typical CVRP
The basic requirement of VRP is to design a routing strategy with minimum cost for a fleet of vehicles, given the demands of a set of known customers. Every customer must be assigned to exactly one vehicle to have their parcels either picked up or delivered. All vehicles have limited capacities and should originate and terminate at a given depot, which also offers reloading service.

We denote the fleet of vehicles by $K$ and the set of known customers by $C = \{1, \dots, n\}$, which together formulate a directed graph $G = (V, A)$. The graph includes $n + 2$ vertices, where the depot is doubly represented by vertex $0$ and vertex $n+1$. The set of arcs $A$ represents the travel costs between customers and the depot and among customers; we associate a spatial distance cost $c_{ij}$ and a temporal distance cost $t_{ij}$ with each arc $(i, j) \in A$ when $i \neq j$. A solution consists of $|K|$ connected subgraphs, each representing a single route $R_k$ executed by vehicle $k$ that starts from vertex $0$ and ends at vertex $n+1$ with several customers in between. Each vehicle $k$ has a capacity $Q_k$, each customer $i$ has a demand $q_i$, and the real-time shipment of a vehicle should never exceed its capacity. Each customer $i$ is additionally associated with a service time window $[e_i, l_i]$.

We further define two decision variables $x_{ijk}$ and $s_{ik}$: $x_{ijk} = 1$ if and only if arc $(i, j)$ is included in route $R_k$, where $x_{ijk} \in \{0, 1\}$, while $s_{ik}$ represents the time stamp when vehicle $k$ serves customer $i$. With these notations, we formulate the VRP mathematically as follows:

$$\min \sum_{k \in K} \sum_{(i,j) \in A} c_{ij} x_{ijk} \quad (1)$$

$$\text{s.t.} \quad \sum_{k \in K} \sum_{j \in V} x_{ijk} = 1, \quad \forall i \in C \quad (2)$$

$$\sum_{j \in V} x_{0jk} = 1, \quad \forall k \in K \quad (3)$$

$$\sum_{i \in V} x_{ihk} - \sum_{j \in V} x_{hjk} = 0, \quad \forall h \in C, \ \forall k \in K \quad (4)$$

$$\sum_{i \in C} q_i \sum_{j \in V} x_{ijk} \leq Q_k, \quad \forall k \in K \quad (5)$$

$$e_i \leq s_{ik} \leq l_i, \quad \forall i \in C, \ \forall k \in K \quad (6)$$

where (1) represents the routing objective. Constraints (2), (3), and (4) ensure that every customer is visited exactly once, that each route departs from the depot, and that flow is conserved through every visited customer. (5) indicates that a vehicle should always respect its capacity limit, and (6) requires all services to be performed within the individual time windows.
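To make the objective (1) and the capacity constraint (5) concrete, here is a minimal single-route evaluator, assuming Euclidean arc costs and a depot at the origin (both illustrative choices, not part of the formal model above):

```python
import math

def route_cost(route, coords, depot=(0.0, 0.0)):
    """Objective contribution of one route: Euclidean length of
    depot -> customer_1 -> ... -> customer_m -> depot."""
    path = [depot] + [coords[c] for c in route] + [depot]
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def feasible(route, demand, capacity):
    """Capacity constraint (5): total demand loaded on the route
    must not exceed the vehicle's capacity."""
    return sum(demand[c] for c in route) <= capacity
```

A full CVRP solver searches over partitions of customers into routes and orderings within each route, minimizing the summed `route_cost` subject to `feasible` on every route.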


| Reference | Year | Problem | Scenario | Algorithm | Network | D-scheme | D-type | D-avail | Code |
|---|---|---|---|---|---|---|---|---|---|
| Nazari et al. [18] | 2018 | Typical VRP | Mathematical | REINFORCE [21], A3C [38] | RNN | Graph-based | sim | ✓ | ✓ |
| Kool et al. [19] | 2019 | Typical VRP | Mathematical | REINFORCE [21] | ATT | Graph-based | sim | ✓ | ✓ |
| Chen et al. [96] | 2019 | Typical VRP | Mathematical | A2C [25] | MLP | Graph-based | sim | ✓ | ✓ |
| Lu et al. [97] | 2019 | Typical VRP | Mathematical | REINFORCE [21] | MLP, ATT | Graph-based | sim | ✓ | ✓ |
| Duan et al. [98] | 2020 | Typical VRP | Logistics | REINFORCE [21] | GCN, ATT | Graph-based | real | x | x |
| Delarue et al. [99] | 2020 | Typical VRP | Mathematical | Model-based | MLP | Graph-based | sim | x | x |
| Xin et al. [100] | 2020 | Typical VRP | Mathematical | REINFORCE [21] | ATT | Graph-based | sim | x | x |
| Joe et al. [101] | 2020 | Dynamic VRP | Logistics | DQN [17] | MLP | Graph-based | real | x | x |
| Ottoni et al. [102] | 2021 | TSP with Refueling (as EVRP) | Mathematical | Q-Learning [28], SARSA [16] | - | Graph-based | sim | x | x |
| Qin et al. [103] | 2021 | Heterogeneous VRP | Mathematical | Double DQN [31] | MLP, CNN | Graph-based | sim | x | x |
| Bogyrbayeva et al. [104] | 2021 | Electric VRP | Ridesharing | REINFORCE [21] | RNN | Graph-based | sim | x | x |
| Shi et al. [105] | 2020 | Dynamic Electric VRP | Ridesharing | TD [30] | MLP | Graph-based | sim | x | x |
| James et al. [106] | 2019 | Electric VRP with Time Windows | Logistics | REINFORCE [21] | RNN | Graph-based | real | x | ✓ |
| Lin et al. [107] | 2020 | Electric VRP with Time Windows | Logistics | REINFORCE [21] | ATT, RNN | Graph-based | sim | x | x |
| Zhang et al. [108] | 2020 | VRP with Time Windows | Logistics | REINFORCE [21] | ATT | Graph-based | sim | x | x |
| Falkner et al. [109] | 2020 | VRP with Time Windows | Mathematical | REINFORCE [21] | MLP, ATT | Graph-based | sim | ✓ | x |
| Zhao et al. [110] | 2020 | VRP with Time Windows | Mathematical | AC [37] | ATT | Graph-based | sim | ✓ | x |
| Li et al. [111] | 2021 | VRP with Pickup and Delivery | Mathematical | REINFORCE [21] | ATT | Graph-based | sim | x | x |
| Li et al. [112] | 2021 | VRP with Pickup and Delivery | Logistics | Double DQN [31] | MLP, ATT | Graph-based | real | x | x |
| Lee et al. [113] | 2021 | VRP with Pickup and Delivery | Warehousing | Q-Learning [28] | - | Square-Grid | sim | x | x |

4.2 Realistic Routing Problems
Beyond the typical VRP setting, real-world routing problems often require additional considerations with more realistic constraints and objectives. Many VRP variants that tackle these practical constraints are thus closer to industrial applications and are also widely studied. We briefly introduce several important variants, including the dynamic VRP (DVRP), electric VRP (EVRP), VRP with time windows (VRPTW), and VRP with pickup and delivery (VRPPD). An overview of the typical VRP and these variants is illustrated in Figure 9.
4.2.1 Dynamic VRP (DVRP)
Service demands may not be known to the platform in advance in real-world scenarios, so newly arriving demands must be assigned to workers dynamically [6]. This is the same challenge as discussed in the dispatching stage. However, rather than simply coordinating demands with customers, the routing stage also requires a specific routing strategy with the visiting order of the matched demands for each worker. Joe et al. [101] utilize DQN to estimate the Q-values of individual vehicle states and insert new demands into the existing solution sequence. DRL has the potential to estimate the future reward of possible action attempts and is thus well suited to dynamic VRPs.
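Inserting a newly arrived demand into an existing route is often done by cheapest insertion; a minimal sketch (the `coords` layout and the `"depot"` key are illustrative, and this is the classical heuristic rather than the cited DQN-guided variant):

```python
import math

def cheapest_insertion(route, coords, new_id):
    """Insert a newly arrived demand at the position in the route that
    adds the least extra travel distance.

    route: list of customer ids (depot implicit at both ends).
    coords: dict mapping "depot" and all customer ids to (x, y).
    """
    def leg(a, b):
        return math.dist(coords[a], coords[b])
    best_pos, best_delta = 0, float("inf")
    for i in range(len(route) + 1):
        prev = route[i - 1] if i > 0 else "depot"
        nxt = route[i] if i < len(route) else "depot"
        # extra distance caused by detouring through the new customer here
        delta = leg(prev, new_id) + leg(new_id, nxt) - leg(prev, nxt)
        if delta < best_delta:
            best_pos, best_delta = i, delta
    return route[:best_pos] + [new_id] + route[best_pos:]
```

A learned value function can replace the pure distance delta, which is how DRL couples this local re-planning step with long-term reward.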
4.2.2 Electric VRP (EVRP)
As electric vehicles (EVs) have become widely accepted in recent years, researchers have gradually developed interest in how to route EV fleets, forming the special Electric VRP (EVRP) [4]. Studies focus on the application potential of EVs in both ride-hailing and express systems [105, 106]. Since current EVs have shorter battery ranges than traditional vehicles, EVRP considers charging as an additional, essential action of the EV agents. Furthermore, the environment often contains information on the locations of charging stations.
For EV usage in ride-hailing services, Shi et al. [105] model EV fleet operation as a dynamic EVRP. At each decision step, an EV agent can either pass to keep idling, charge at a local station, or serve customer demands; the detailed dispatching of customer assignments is executed by the KM algorithm [8]. For EV usage in delivery and express systems, James et al. [106] consider both the charging requirements of EVs and the possibility that not all customers can be visited within the given time. The optimization goal of their framework is to both maximize the number of delivered logistic requests and minimize the total driving distance of all EVs, with the two objectives combined as a weighted sum. Lin et al. [107] consider EVRP modeling along with the individual time-window limits of different customers, which is discussed next.
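The serve-or-charge decision can be illustrated with a simple range-feasibility rule. This is not the learned policy of any cited work, and the safety reserve is a hypothetical parameter:

```python
def ev_action(battery_km, trip_km, to_charger_km, reserve_km=5.0):
    """Accept a trip only if the remaining range covers the trip plus
    reaching a charging station afterwards, with a small safety reserve;
    otherwise go charge. All quantities are in kilometers of range."""
    return "serve" if battery_km >= trip_km + to_charger_km + reserve_km else "charge"
```

An EVRP agent learns when to charge pre-emptively (e.g., during low demand) rather than applying such a threshold reactively, which is where the DRL formulations add value over this rule.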
4.2.3 VRP with Time Windows (VRPTW)
A service demand may come with a corresponding time window that the worker should satisfy, i.e., the worker must arrive at the service target location within the given time window [5, 114]. In practice, a customer who orders food from a restaurant may expect the food to be delivered before it cools down. Detailed consideration of time-window limits is thus essential in practical routing scenarios.
Zhang et al. [108] proposed a multi-agent framework that encodes the time-window constraint as an additional penalty and generates the routing solutions of different vehicles one after another. James et al. [106] consider the same constraint in the online electric vehicle routing problem, but do not force the vehicles to visit all given demands. Falkner et al. [109] proposed a joint attention mechanism to balance the coordination between vehicles and demands. Zhao et al. [110] designed a hybrid structure combining DRL and local search to solve both the typical VRP and VRPTW.
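A common way to soften a time window, as in the penalty formulation of Zhang et al. [108], is to subtract a violation term from the step reward. The sketch below is one plausible shaping under an assumed linear penalty; the weight, the earliness term, and the function itself are illustrative, not the exact design of [108].

```python
# One possible step-reward shaping when time windows are treated as
# soft constraints, in the spirit of [108]: the agent maximizes
# negative travel distance minus a penalty for arriving outside the
# customer's [earliest, latest] window. Weights are illustrative.

def step_reward(travel_dist, arrival, window, penalty_weight=10.0):
    earliest, latest = window
    lateness = max(0.0, arrival - latest)      # arrived too late
    earliness = max(0.0, earliest - arrival)   # arrived too early (waiting)
    return -travel_dist - penalty_weight * (lateness + earliness)

r_ok = step_reward(3.0, arrival=12.0, window=(10.0, 15.0))    # inside window
r_late = step_reward(3.0, arrival=17.0, window=(10.0, 15.0))  # 2 units late
```

With a soft penalty the agent learns to minimize how far it exceeds the window; as discussed in Sec 6.3, this shaping is unsuitable when the window is a hard constraint.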
4.2.4 VRP with Pickup and Deliveries (VRPPD)
Beyond the simplified situation where the service provider and the service destination share the same location, VRPPD is a more common problem setting in practice [7]. For example, a ride-sharing driver must first pick up the customer at the origin and then deliver him or her to the destination. Handling the relationship between different provider-target pairs, i.e., pickup-delivery pairs, is a substantial challenge. Li et al. [111] propose an attention-based structure with specially designed heterogeneous attention. They design several heterogeneous attention mechanisms to leverage the different relations between customers within the static graph, including pickup with paired delivery, pickup with other deliveries, pickup with other pickups, and the counterparts obtained by switching the roles of pickups and deliveries.
4.3 Conventional Methods for Routing
When VRP was first defined [3], researchers attempted to develop exact methods that find provably optimal solutions. The branch-and-bound method, a common approach for combinatorial optimization, was applied as one such solution [9]. Lagrangian-relaxation-based methods were also proposed [115, 116], by which the problem reduces to a minimum degree-constrained K-tree problem. Besides, Desrochers et al. first used column generation to solve VRP [117]. Column-generation-based methods initialize the problem with a small subset of variables, compute a corresponding solution, and gradually improve the result via linear programming. However, due to the NP-hard nature of VRP, exact approaches are computationally expensive and scale poorly: they can generate results, slowly, only on small-sized instances.
To compensate for these limitations, many heuristic-based methods were developed to find near-optimal results instead. Given the complexity of VRP and its variants, accepting a small loss in solution quality can yield great efficiency improvements. For instance, tabu search and local search were proposed as conventional metaheuristics for VRP [118, 119]: new solutions in the neighborhood of the current one are continuously constructed and evaluated. Genetic algorithms, by contrast, operate on a population of solutions rather than a single one [120, 116]. Following the idea of genetics, child solutions are generated from the best parent solutions of the previous generation, and this iteration helps approximate the optimum. Instead of treating all objectives jointly, ant colony optimization uses several ant colonies to optimize different objectives, such as the number of vehicles and the total distance [12]. Even though these heuristics outperform exact methods in finding better solutions, they are limited in real-time decision-making. A ruin-and-recreate search method, for example, takes hours to generate solutions for ten thousand instances with 100 customers each, which is unsuitable for real-time applications. As another drawback, the quality of heuristic approximations relies heavily on manually defined rules and expert knowledge, which is far from sufficient given the enormous search space. New technical mechanisms are needed to further improve solution quality.
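As a concrete example of the neighborhood moves these local-search metaheuristics rely on, the following is a minimal 2-opt local search on a toy four-city tour. This is an illustrative sketch, not a production solver; the distance matrix is hypothetical.

```python
# Minimal 2-opt local search for a single closed tour: repeatedly
# reverse a segment of the tour whenever doing so shortens it.
# This is the kind of neighborhood move used by local-search
# heuristics for VRP; the instance below is a toy example.

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def two_opt(tour, dist):
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour

# 4 cities at the corners of a unit square; the self-crossing tour
# 0-2-1-3 should be "uncrossed" into the perimeter tour 0-1-2-3.
dist = [[0, 1, 1.414, 1],
        [1, 0, 1, 1.414],
        [1.414, 1, 0, 1],
        [1, 1.414, 1, 0]]
best = two_opt([0, 2, 1, 3], dist)
```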
4.4 DRL for Routing
In recent years, many researchers have attempted to use deep reinforcement learning (DRL) to solve VRP and other combinatorial optimization problems, owing to its ability to improve solution quality via a self-driven mechanism and its potential for an efficient solution generation process. With solution quality guaranteed, DRL-based methods benefit from the separation of offline training and online inference. Even though it may take hours or days to train a fully converged policy offline, inference on new problem instances in industrial online applications may take only a second, whereas metaheuristics take minutes or hours [19]. Generally, current works using DRL to solve VRP and its variants can be classified into sequence generation based methods and rewriting based methods.
4.4.1 Sequence Generation Methods
A common approach to generating VRP solutions is to extend a partial sequence step by step until a complete solution is obtained. In this MDP modeling, a state transition adds a new unvisited node to the current solution, which naturally forms the action of the sequence generation agent. At each decision step, the action space is the set of all unvisited nodes, from which the agent selects the next one to visit. Notably, in more realistic VRP variants, practical constraints may restrict the selection space, since infeasible solutions could otherwise be generated. The agent must therefore account for these constraints, usually via a mask scheme that filters out infeasible choices. The sequence generation method is illustrated in Figure 10.
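The mask scheme described above can be sketched for a capacity constraint as follows. In an actual model the boolean mask would be applied to the decoder's logits (e.g., by setting masked entries to negative infinity) before the softmax; the function and data here are illustrative.

```python
# Sketch of a feasibility mask for a capacitated VRP decoder: at each
# step, nodes already visited and nodes whose demand exceeds the
# vehicle's remaining capacity are removed from the action space.

def feasible_mask(demands, visited, remaining_capacity):
    """True where node i is still a legal next action."""
    return [
        (not visited[i]) and demands[i] <= remaining_capacity
        for i in range(len(demands))
    ]

demands = [3, 5, 2, 7]
visited = [True, False, False, False]   # node 0 already served
mask = feasible_mask(demands, visited, remaining_capacity=4)
```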
The Pointer Network (PN) was first proposed under such a mechanism. Following a classic encoder-decoder structure, the PN is independent of the input length, and its output sequence is a permuted subset of the input [121]. Even though the original PN was trained with supervised learning, it initiated a line of research exploring DRL for more advanced VRP solutions, and its basic structure is commonly reused in subsequent work on routing problems. Bello et al. [122] first developed a neural combinatorial optimization (NCO) framework via DRL, demonstrating its effectiveness in both performance and generation efficiency on TSP and the knapsack problem. Even though the typical VRP was not studied, NCO serves as an important benchmark that uses DRL to explore more effective solution structures for combinatorial optimization. Nazari et al. [18] followed the NCO structure and first applied it to VRP. Kool et al. [19] augmented the structure with an attention mechanism and obtained performance improvements. They investigated several routing formulations, including TSP, the typical CVRP, Split Delivery VRP (SDVRP), the Orienteering Problem (OP), Prize Collecting TSP (PCTSP), and Stochastic PCTSP (SPCTSP). The attention-based structure was further developed by subsequent researchers. Rather than relying on a single decoder for sequence generation, Xin et al. [100] proposed a multi-decoder mechanism that generates several partial solutions simultaneously and combines them with tree search to expand the search space. Duan et al. [98], on the other hand, focused on the feature representation ability of the network itself: they augmented the structure with a GCN and developed a joint learning approach using both DRL and supervised learning. Delarue et al. [99] model the action selection at each state as a mixed-integer program (MIP) and combine the combinatorial structure of the action space with pre-trained neural value functions by adapting the branch-and-cut algorithm [123]. Owing to the separation of training and inference, sequence generation methods benefit from fast inference: generating routing solutions for 100 customers takes only 8 seconds, whereas the state-of-the-art heuristic solver LKH3 [124] takes more than 13 hours.
Sequence-generation-based methods have great potential for handling demand changes on online platforms. When new DDS demands arrive, or existing ones are modified or canceled, such a fast framework can respond to these changes in real time.
4.4.2 Rewriting-based Methods
Besides generating partial solutions until completion, researchers have also explored other MDP formulations for solving VRPs. One intuition originates from operations research, where continuously modifying the current solution is the core idea of many practical VRP heuristics. Following this idea, researchers attempted to parameterize the modification process so as to continuously improve solution quality; such approaches are called rewriting-based methods [96]. The rewriting-based method is illustrated in Figure 11.
Chen et al. [96] propose a framework in which a complete solution is constructed in the first stage and then improved gradually under the guidance of an RL agent, using a local rewriting rule that keeps updating the current solution. Lu et al. [97] further propose a Learn-to-Iterate (L2I) framework that not only improves the current solution but also applies perturbations to widen exploration. These methods borrow ideas from operations research to keep rewriting the current solution. Rewriting-based methods are relatively slower than sequence generation ones, but their extended exploration ability gives them more potential for better performance.
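The improve-and-perturb loop behind rewriting methods such as L2I [97] can be caricatured with hand-crafted operators standing in for the learned rewriting policy. Everything below, including the toy inversion-count cost, is an illustrative assumption, not the published algorithm.

```python
# Skeleton of an improve-and-perturb rewriting loop in the spirit of
# L2I [97]: alternate a local improvement operator with a random
# perturbation, always remembering the best solution seen so far.
# The operators (adjacent swap, slice shuffle) are simple placeholders
# for learned rewriting policies.
import random

def rewrite(solution, cost, steps=200, seed=0):
    rng = random.Random(seed)
    best = current = solution[:]
    for _ in range(steps):
        improved = False
        # improvement operator: accept any improving adjacent swap
        for i in range(len(current) - 1):
            cand = current[:]
            cand[i], cand[i + 1] = cand[i + 1], cand[i]
            if cost(cand) < cost(current):
                current, improved = cand, True
        if cost(current) < cost(best):
            best = current[:]
        if not improved:  # stuck in a local optimum: perturb a slice
            i = rng.randrange(len(current) - 1)
            j = rng.randrange(i + 1, len(current))
            segment = current[i:j + 1]
            rng.shuffle(segment)
            current = current[:i] + segment + current[j + 1:]
    return best

# toy cost: number of inversions in a permutation (0 when sorted)
cost = lambda s: sum(s[i] > s[j]
                     for i in range(len(s)) for j in range(i + 1, len(s)))
result = rewrite([3, 1, 2, 0], cost)
```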
5 Open Simulators and Datasets for DDS
Since existing RL methods for DDS problems are model-free, a large amount of training data generated by interaction with the environment is required. However, direct interaction with the real environment entails high costs and high risks; simulating DDS scenarios is therefore necessary, and a reliable simulator is of great practical significance. Several DDS simulators already exist, and we introduce the open simulators and datasets below.
Simulator and dataset for Dispatching.
Simulators for dispatching learn order generation and state transitions from real data [63]. There are many public dispatching-related datasets. The most commonly used one is provided by the New York City TLC (Taxi and Limousine Commission) [125], which contains travel records of various services (Yellow taxis, Green taxis, and FHVs, i.e., For-Hire Vehicles) from 2009 to 2020. [126] provides a subset of the NYC FHV data that contains GPS coordinates of pickup locations. Travel time data between OD pairs can be obtained through Uber Movement [127]. In addition, DiDi Chuxing has released travel records (regular and aggregated) and vehicle trajectory datasets for Chengdu, China through [128], from which a simulator was also developed to model the dispatching state. Such a simulator usually consists of two parts: an order generation model, which learns order generation and distribution, and a driver movement and transition model, which learns state transitions from the dataset. Dispatching simulators are often validated by comparing the gross merchandise volume (GMV) they produce against the GMV of real data [73]. To make the simulation more realistic, climate and traffic conditions are sometimes modeled in more detail [129]. Recently, DiDi [130] developed a dispatching simulation platform based on existing dispatching simulator research, which serves as an open-source ride-hailing environment for training dispatching agents.
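The two components of a dispatching simulator (order generation; driver movement and transition) can be sketched as follows. The fixed arrival rate, 1-D geometry, greedy nearest-driver matching, flat fare, and all class and method names are illustrative assumptions, not the design of any cited platform.

```python
# Minimal skeleton of a dispatching simulator with stubbed dynamics:
# an order-generation model emits pickup locations each tick, and a
# driver movement/transition model relocates the matched driver.
# GMV is accumulated as a flat fare per served order, mirroring the
# GMV-based validation described above.
import random

class DispatchSimulator:
    def __init__(self, n_drivers, seed=0):
        self.rng = random.Random(seed)
        self.drivers = [self.rng.random() for _ in range(n_drivers)]  # 1-D positions
        self.gmv = 0.0

    def generate_orders(self, rate=2):
        """Order-generation model: a fixed number of random pickups per tick."""
        return [self.rng.random() for _ in range(rate)]

    def step(self):
        """One tick: generate orders, match each to the nearest driver,
        then apply the transition model (driver moves to the pickup)."""
        for order in self.generate_orders():
            i = min(range(len(self.drivers)),
                    key=lambda k: abs(self.drivers[k] - order))
            self.gmv += 1.0          # flat fare per served order
            self.drivers[i] = order  # transition model: relocate driver

sim = DispatchSimulator(n_drivers=5)
for _ in range(10):
    sim.step()
```

A learned dispatching policy would replace the greedy `min(...)` matching rule, while the generation and transition models would be fit to real data.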
Simulator and dataset for Routing.
A simulator for routing learns order and vehicle generation and state transitions from historical data [112]. When the simulation starts, it first initializes the order and vehicle states. The RL agent observes the state and dispatches a chosen vehicle to serve an order. After the RL agent executes the learned policy, the generation model and the vehicle state transition model update the selected vehicle's information. This process repeats until the end of the simulation. Recently, Huawei [131] developed a simulation platform based on existing routing simulator research, which serves as an open-source vehicle routing environment for training agents on a dynamic VRP with pickup and delivery in logistics scenarios. As for open routing datasets, [132] summarizes several open CVRP instance datasets of different scales. VRPTW, as a special routing variant, has public instances in the Solomon dataset [133] and the Homberger dataset [134], while the Li&Lim benchmark [135] specifically targets VRPPD with time windows.
6 Challenges
Owing to the advantages of DRL, many practical frameworks have been developed to tackle the two DDS stages and can generate high-quality solutions efficiently at scale. However, challenges remain in building more practical DDS applications; we briefly summarize the major ones below.
6.1 Coupled Spatial-Temporal Representations
Capturing the dynamic changes of demand and supply distributions within different service loops is essential for an effective DDS application. A good spatial-temporal (ST) representation can reflect both the spatial relationships and the temporal cost of accomplishing the services. For instance, He et al. [64] develop a capsule-based network to capture representations of both new passenger demands and available drivers in the order matching task. A well-learned representation can enhance the performance of the overall framework.
Even though ST representation is not a new research topic and has been treated as an important task in the urban computing literature [136, 137, 138], the coupled ST setting poses a new challenge. In most DDS applications, a service target is bound to its provider, so the ST representation should reflect this paired relationship. Some works propose special designs for this coupling. For instance, Li et al. [111] propose an attention-based structure to leverage different relationships among customer nodes in VRP with pickup and delivery, computing six different attention mechanisms in total as a thorough measurement over all nodes. However, this solution exhaustively enumerates all possible relations under the given network structure. Developing more capable and flexible representation methods, learning mechanisms, and overall algorithms remains a challenge for DDS development.
6.2 Fleet Heterogeneity
In both real-world transportation and logistics systems, the workers in a fleet are commonly heterogeneous. For example, vehicles with different capacities can be deployed for parcel delivery, and electric taxis with different battery capacities can be operated together for ride-hailing services. Under such circumstances, accounting for the heterogeneity of the entire fleet forms another challenge for real-world DDS. Most current solutions in both the dispatching and routing stages assume that all workers in the system are homogeneous, which greatly reduces computation costs in large-scale settings; heterogeneity nonetheless remains an unavoidable challenge.
For a DRL-based approach, an intuitive idea is to model the problem as a MARL system in which several types of agents cooperate to accomplish the service tasks given by the platform. However, the state and action spaces grow rapidly with the number of agents. How to solve this with MARL, or to find an alternative centralized formulation, remains a valuable question for the fleet heterogeneity challenge.
6.3 Variant Constraints in MDP Modeling
Handling multiple constraints in DRL design is an important and widely studied research problem [139, 140], and the practical constraints in DDS design are especially significant. For instance, a practical routing problem is far more complicated than the mathematical VRP due to numerous constraints, including time windows, charging requirements, and structural dependencies between pickups and deliveries. Effective DRL training requires corresponding treatment of these additional constraints.
A commonly used solution is to develop ad-hoc designs for these constraints. However, doing so for every constraint may yield complicated structural designs as the number of constraints grows, making the entire model much harder to train. Another solution is to relax the limitations into soft constraints. For instance, to handle VRP with time windows, Zhang et al. [108] treat possible time-window violations as a penalty on the total reward, so the agent learns to minimize how far the constraint is exceeded. However, this is unsuitable in scenarios where limits are strict and cannot be violated even slightly. Modeling these practical requirements well remains a critical challenge.
6.4 Large-Scale Deployment
Implementing algorithms at large scale is a necessary step from pure research to industrial usage. However, training models directly in large-scale scenarios requires enormous computational resources and time. To tackle this challenge, a commonly used method is to abandon the natural multi-agent formulation over individual service workers: either applying fully centralized control or modeling workers as homogeneous agents with shared parameters can simplify the training process [56, 62, 63]. Another approach to reducing the computational burden is divide and conquer, decomposing a city-wide planning task into multiple regional ones [59]. This idea is widely used in real-world on-demand delivery systems, where the entire delivery scope is divided into regions and couriers are assigned to accomplish "last-kilometer deliveries" [141].
However, current solutions are still far from sufficient for the large-scale problem, especially in the routing stage. Given its NP-hard nature, the complexity of generating an optimal solution grows exponentially with problem scale. As a result, most existing literature on routing problems limits experiments to no more than one hundred demand nodes within a graph-based data scheme [19, 97]. Developing new training frameworks, via either more agile formulations or more lightweight training algorithms, would help fit large-scale environments and improve deployability.
6.5 Dynamics and Real-time Scheduling
Real-world DDS scenarios exhibit high environmental dynamics. New demands arrive continuously, and existing ones may change; for instance, a passenger who calls for a ride-hailing service may change the destination or cancel the request outright. Such dynamics are critical to real-time scheduling in both stages.
Much of the existing literature using DRL for dispatching captures dynamic features explicitly with specially designed network structures. For instance, Tang et al. [63] represent spatial-temporal features using hierarchical coarse coding, and He et al. [64] develop a special capsule-based network. As for the routing stage, DVRP is specifically constructed to capture the dynamics of real-time scheduling: changing demands update the current service loops and thus add complexity. However, only a limited number of DRL solutions for such dynamic routing scenarios have been proposed so far [101, 105]. Real-time scheduling under these dynamics remains challenging for DDS development.
7 Open Problems
Alongside these challenges, many open problems with future research opportunities remain in developing more effective DDS systems. In this section, we briefly discuss several research directions that we believe hold potential in this area.
7.1 Advanced DRL Methods for DDS
As DRL theory and methods develop rapidly, new advanced DRL algorithms have great potential for building more robust, effective, and efficient DDS applications. For example, for dispatching problems, most of the literature covered in this survey uses DQN as the training algorithm. However, such a framework still depends on interaction with the environment during training, even when multi-sourced data is used; reproducing results in other environments therefore faces great difficulty and raises fairness issues in performance comparison. Recent developments in offline RL [142] provide new opportunities to tackle these challenges: a fully offline learning paradigm based on large-scale agent experience data may improve training robustness and mitigate the reproducibility problem. Besides offline RL, other advanced techniques such as causal RL [143] and multi-objective optimization may also bring new research opportunities to DDS development.
7.2 Joint Optimization of Two Stages
Currently, although both the dispatching and routing stages are well studied individually, works that consider both stages within a single DRL paradigm are still missing. This is a major problem, especially for reaction speed to new changes in deployed systems. For example, current learning-based dispatching systems remain computationally intensive because conventional VRP solvers, rather than DRL-based ones, are adopted to predict future income and vehicle states [144, 145]. Jointly considering the two stages could improve overall performance in both planning quality and inference speed. A major challenge of such modeling is the even more complicated state space and the heterogeneity of the two action spaces; research potential lies in cross-stage representations for both states and actions, and planning quality will depend heavily on the hierarchical framework design.
7.3 Fairness from Workers’ Perspective
Current solutions for DDS problems almost always set the scheduling objective to maximize the profit of the entire platform. Even when new objectives such as the Order Response Rate (ORR) in Sec 3.1.1 are proposed, they still conform to the overall centralized profit; few works take the perspective of the service workers. As both DDS and AI ethics develop, social science researchers are increasingly examining how service workers perceive their roles in DDS. While keeping centralized profit as the prime goal, how to account for individual differences among service workers and their initiative is a promising research question. On the one hand, fairness between workers is an essential problem: maximizing overall profit alone might cause extreme differences in individual incomes, and a well-designed DDS system should guarantee fairness across service workers. On the other hand, individual workers may have their own preferences over dispatching or routing strategies. Personal historical patterns, recorded without intelligent algorithm intervention, may help uncover these preferences, which could then be factored into the algorithms that guide workers.
7.4 Partial Compliance Consideration
In the dispatching stage, current algorithms usually make the simplifying assumption of full compliance from service workers. In real-world applications, however, workers may reject recommendations from the centralized platform and operate according to individual preference. Such non-compliance can make overall performance predictions inaccurate and thus requires additional investigation.
Besides treating partial compliance as a factor within the system, understanding its causes and designing corresponding remedies form another important research task. For example, couriers on rainy days are often reluctant to accept distant delivery tasks; one solution is to offer them extra allowances. Jointly generating order matching decisions and determining specific allowance quotas for different couriers per task is a sequential decision problem well suited to DRL modeling.
7.5 Pricing Problems
Beyond directly scheduling how the different roles in human-engaged DDS loops should operate to improve overall efficiency, another important research question is how to price the service provided by the workers. Dynamic pricing problems arise in DDS applications with short service durations and human engagement, such as ride-sharing and instant/food delivery. The pricing module influences both the supply distribution of service workers and new service demands from the customer side.
RL-based approaches for dynamic pricing have been studied and applied in one-sided retail markets [146, 147], where pricing changes only the demand pattern from the customers' perspective. In far more complicated DDS systems, however, a good pricing strategy must consider both customers and workers, as well as other dynamic spatial-temporal information.
Developing a high-quality pricing module within human-engaged DDS applications is non-trivial due to two major challenges. First, pricing optimization is tightly coupled with the other DDS scheduling tasks discussed in this survey: the advance routing estimation is a decisive factor in pricing and further influences the quality of the dispatching stage, so jointly optimizing several modules is a critical research problem. Second, designing metrics to evaluate pricing quality is another challenge: a good pricing strategy should be reasonable and explainable to customers while improving the overall income of the platform and all service workers and maintaining fairness. Multi-sided problem formulations and explainable RL designs could thus become a core research focus in the future.
7.6 Simulation Environments
For scenarios requiring heavy interaction with the environment, such as ride-sharing or on-demand delivery with frequently updated demand and worker distributions, evaluating algorithms by direct interaction with the real world is too expensive. It is therefore essential to deploy and evaluate algorithms in simulation environments. However, current simulators have rather simple environment settings and do not fully model either dispatching or realistic routing in DDS. For the dispatching stage, many real-world issues should be included in simulation tests to evaluate algorithm robustness, including stochastic request cancellations, intelligent pricing strategies for different matching results, and individual driver preferences. For the routing stage, real-world execution can seldom match algorithmic predictions exactly, so simulating environment changes, real-time traffic congestion, and demand updates is essential for dynamic decision-making.
To the best of our knowledge, few released simulators consider the factors above. Great research potential lies in developing simulators that allow agents to interact fully with highly realistic environments and thus provide enough robustness for deployment. A well-designed simulator can play a critical role as the offline training environment for advanced DRL algorithms for DDS.
7.7 Large-Scale Online Scheduling System
Building on the challenges discussed in Sec 6, the ultimate benchmark for applying DRL to DDS is a large-scale online scheduling system handling real-world DDS tasks. Complete system development requires thorough treatment of coupled and dynamic features, modeling of fleet heterogeneity, high efficiency at large scales, and adaptation to practical constraints. Both generally high-quality, robust algorithms and ad-hoc considerations for specific scenarios are needed to construct a centralized DDS platform. Developing such a large-scale online scheduling system via DRL would strongly impact both the related research and industrial areas.
8 Conclusion
Demand-driven services (DDS), such as ride-hailing and express systems, are of great importance in urban life today, and the planning and scheduling processes within these applications require both effectiveness and efficiency. In this survey, we focused on DDS problems and decomposed DDS into two stages: the dispatching stage and the routing stage. The dispatching stage coordinates unassigned service demands with available service workers, while the routing stage generates strategies within each service loop. We reviewed recent works using deep reinforcement learning (DRL) to solve DDS problems in these two stages, and discussed the remaining challenges and open problems in using DRL to build high-quality DDS systems.
Acknowledgments
This work was supported in part by The National Key Research and Development Program of China under grant 2018YFB1800804, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, and 61861136003, Beijing Natural Science Foundation under L182038, Beijing National Research Center for Information Science and Technology under 20031887521, and the research fund of the Tsinghua University - Tencent Joint Laboratory for Internet Innovation Technology.
References
 [1] Meituan, “Meituan homepage,” https://about.meituan.com/en.
 [2] M. M. Flood, “The traveling-salesman problem,” Operations research, vol. 4, no. 1, pp. 61–75, 1956.
 [3] G. B. Dantzig and J. H. Ramser, “The truck dispatching problem,” Management science, vol. 6, no. 1, pp. 80–91, 1959.
 [4] M. Schneider, A. Stenger, and D. Goeke, “The electric vehicle-routing problem with time windows and recharging stations,” Transportation science, vol. 48, no. 4, pp. 500–520, 2014.
 [5] J. Desrosiers, F. Soumis, and M. Desrochers, “Routing with time windows by column generation,” Networks, vol. 14, no. 4, pp. 545–565, 1984.
 [6] H. N. Psaraftis, “Dynamic vehicle routing problems,” Vehicle routing: Methods and studies, vol. 16, pp. 223–248, 1988.
 [7] H. Min, “The multiple vehicle routing problem with simultaneous delivery and pickup points,” Transportation Research Part A: General, vol. 23, no. 5, pp. 377–386, 1989.
 [8] J. Munkres, “Algorithms for the assignment and transportation problems,” Journal of the society for industrial and applied mathematics, vol. 5, no. 1, pp. 32–38, 1957.
 [9] P. Toth and D. Vigo, “Branch-and-bound algorithms for the capacitated VRP,” in The vehicle routing problem. SIAM, 2002, pp. 29–51.
 [10] Z. Liao, “Real-time taxi dispatching using global positioning systems,” Communications of the ACM, vol. 46, no. 5, pp. 81–83, 2003.
 [11] E. Özkan and A. R. Ward, “Dynamic matching for real-time ride sharing,” Stochastic Systems, vol. 10, no. 1, pp. 29–70, 2020.
 [12] L. M. Gambardella, É. Taillard, and G. Agazzi, “MACS-VRPTW: A multiple ant colony system for vehicle routing problems with time windows,” in New Ideas in Optimization. McGraw-Hill, 1999, pp. 63–76.
 [13] G. Schrimpf, J. Schneider, H. Stamm-Wilbrandt, and G. Dueck, “Record breaking optimization results using the ruin and recreate principle,” Journal of Computational Physics, vol. 159, no. 2, pp. 139–171, 2000. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0021999199964136
 [14] M. Hu and Y. Zhou, “Dynamic type matching,” Manufacturing & Service Operations Management, 2021.
 [15] D. Favaretto, E. Moretti, and P. Pellegrini, “Ant colony system for a vrp with multiple time windows and multiple visits,” Journal of Interdisciplinary Mathematics, vol. 10, no. 2, pp. 263–284, 2007.
 [16] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [18] M. Nazari, A. Oroojlooy, L. V. Snyder, and M. Takáč, “Reinforcement learning for solving the vehicle routing problem,” in Advances in Neural Information Processing Systems, 2018.
 [19] W. Kool, H. van Hoof, and M. Welling, “Attention, learn to solve routing problems!” arXiv preprint arXiv:1803.08475, 2018.
 [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [21] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
 [22] A. Haydari and Y. Yilmaz, “Deep reinforcement learning for intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, 2020.
 [23] Z. Qin, H. Zhu, and J. Ye, “Reinforcement learning for ridesharing: A survey,” arXiv preprint arXiv:2105.01099, 2021.
 [24] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev, “Reinforcement learning for combinatorial optimization: A survey,” Computers & Operations Research, p. 105400, 2021.
 [25] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” 2011.
 [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 [27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [28] C. J. C. H. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, King’s College, Cambridge, 1989.
 [29] L.J. Lin, Reinforcement learning for robots using neural networks. Carnegie Mellon University, 1992.
 [30] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, no. 1, pp. 9–44, 1988.

 [31] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
 [32] H. Hasselt, “Double Q-learning,” Advances in neural information processing systems, vol. 23, pp. 2613–2621, 2010.
 [33] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
 [34] L. C. Baird III, “Reinforcement learning through gradient descent,” Carnegie Mellon University, Dept. of Computer Science, Pittsburgh, PA, Tech. Rep., 1999.
 [35] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1995–2003.

 [36] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation,” in NIPS, vol. 99. Citeseer, 1999, pp. 1057–1063.
 [37] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.
 [38] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1928–1937.
 [39] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” arXiv preprint arXiv:1611.01224, 2016.
 [40] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [41] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International conference on machine learning. PMLR, 2014, pp. 387–395.
 [42] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
 [43] C. R. Shelton, “Importance sampling for reinforcement learning with multiple objectives,” 2001.
 [44] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897.
 [45] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS one, vol. 12, no. 4, p. e0172395, 2017.
 [46] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent Q-networks,” arXiv preprint arXiv:1602.02672, 2016.
 [47] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in Multi-Agent Systems and Applications - 1, pp. 183–221, 2010.
 [48] DiDi, “Didi homepage,” https://www.didiglobal.com.
 [49] Uber, “Uber homepage,” https://www.uber.com/.
 [50] PrimeNow, “Primenow homepage,” https://primenow.amazon.com/onboard?forceOnboard=1&sourceUrl=%2Fhome.
 [51] UberEats, “Ubereats homepage,” https://www.ubereats.com.
 [52] Eleme, “Eleme homepage,” https://www.ele.me.
 [53] FedEx, “Fedex homepage,” https://www.fedex.com/en-us/home.html, 2021.
 [54] Cainiao, “Cainiao homepage,” https://global.cainiao.com, 2021.
 [55] L. Zhang, T. Hu, Y. Min, G. Wu, J. Zhang, P. Feng, P. Gong, and J. Ye, “A taxi order dispatch model based on combinatorial optimization,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 2151–2159.
 [56] Z. Xu, Z. Li, Q. Guan, D. Zhang, Q. Li, J. Nan, C. Liu, W. Bian, and J. Ye, “Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 905–913.
 [57] C. Yan, H. Zhu, N. Korolko, and D. Woodard, “Dynamic pricing and matching in ride-hailing platforms,” Naval Research Logistics (NRL), vol. 67, no. 8, pp. 705–724, 2020.
 [58] I. Jindal, Z. T. Qin, X. Chen, M. Nokleby, and J. Ye, “Optimizing taxi carpool policies via reinforcement learning and spatiotemporal mining,” in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 1417–1426.
 [59] Y. Li, Y. Zheng, and Q. Yang, “Efficient and effective express via contextual cooperative reinforcement learning,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 510–519.
 [60] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang, “Mean field multiagent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 5571–5580.
 [61] M. Zhou, J. Jin, W. Zhang, Z. Qin, Y. Jiao, C. Wang, G. Wu, Y. Yu, and J. Ye, “Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2645–2653.
 [62] Z. Wang, Z. Qin, X. Tang, J. Ye, and H. Zhu, “Deep reinforcement learning with knowledge transfer for online rides order dispatching,” in 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018, pp. 617–626.
 [63] X. Tang, Z. Qin, F. Zhang, Z. Wang, Z. Xu, Y. Ma, H. Zhu, and J. Ye, “A deep value-network based approach for multi-driver order dispatching,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 1780–1790.
 [64] S. He and K. G. Shin, “Spatiotemporal capsule-based reinforcement learning for mobility-on-demand network coordination,” in The World Wide Web Conference, 2019, pp. 2806–2813.
 [65] A. O. Al-Abbasi, A. Ghosh, and V. Aggarwal, “DeepPool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 12, pp. 4714–4727, 2019.
 [66] G. Qin, Q. Luo, Y. Yin, J. Sun, and J. Ye, “Optimizing matching time intervals for ride-hailing services using reinforcement learning,” Transportation Research Part C: Emerging Technologies, vol. 129, p. 103239, 2021.
 [67] Y. Wang, Y. Tong, C. Long, P. Xu, K. Xu, and W. Lv, “Adaptive dynamic bipartite graph matching: A reinforcement learning approach,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019, pp. 1478–1489.
 [68] J. Ke, H. Yang, and J. Ye, “Learning to delay in ridesourcing systems: A multi-agent deep reinforcement learning framework,” IEEE Transactions on Knowledge and Data Engineering, 2020.
 [69] L. Yang, X. Yu, J. Cao, X. Liu, and P. Zhou, “Exploring deep reinforcement learning for task dispatching in autonomous on-demand services,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 15, no. 3, pp. 1–23, 2021.
 [70] Y. Chen, Y. Qian, Y. Yao, Z. Wu, R. Li, Y. Zhou, H. Hu, and Y. Xu, “Can sophisticated dispatching strategy acquired by reinforcement learning?” in Proceedings of the 18th International Conference on Autonomous Agents and Multi-Agent Systems, 2019, pp. 1395–1403.
 [71] Y. Li, Y. Zheng, and Q. Yang, “Cooperative multiagent reinforcement learning in express system,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 805–814.
 [72] H. Hu, X. Jia, Q. He, S. Fu, and K. Liu, “Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in industry 4.0,” Computers & Industrial Engineering, vol. 149, p. 106749, 2020.
 [73] K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1774–1783.
 [74] W. Zhang, Q. Wang, J. Li, and C. Xu, “Dynamic fleet management with rewriting deep reinforcement learning,” IEEE Access, vol. 8, pp. 143 333–143 341, 2020.
 [75] J. Wen, J. Zhao, and P. Jaillet, “Rebalancing shared mobility-on-demand systems: A reinforcement learning approach,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 220–225.
 [76] T. Oda and C. Joe-Wong, “MOVI: A model-free approach to dynamic fleet management,” in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE, 2018, pp. 2708–2716.
 [77] Z. Liu, J. Li, and K. Wu, “Context-aware taxi dispatching at city-scale using deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, 2020.
 [78] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [79] Z. Shou and X. Di, “Reward design for driver repositioning using multiagent reinforcement learning,” Transportation research part C: emerging technologies, vol. 119, p. 102738, 2020.
 [80] J. Jin, M. Zhou, W. Zhang, M. Li, Z. Guo, Z. Qin, Y. Jiao, X. Tang, C. Wang, J. Wang et al., “CoRide: joint order dispatching and fleet management for multi-scale ride-hailing platforms,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1983–1992.
 [81] J. Holler, R. Vuorio, Z. Qin, X. Tang, Y. Jiao, T. Jin, S. Singh, C. Wang, and J. Ye, “Deep reinforcement learning for multi-driver vehicle dispatching and repositioning problem,” in 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019, pp. 1090–1095.
 [82] G. Guo and Y. Xu, “A deep reinforcement learning approach to ride-sharing vehicles dispatching in autonomous mobility-on-demand systems,” IEEE Intelligent Transportation Systems Magazine, 2020.
 [83] E. Liang, K. Wen, W. H. Lam, A. Sumalee, and R. Zhong, “An integrated reinforcement learning and centralized programming approach for online taxi dispatching,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
 [84] I. Sungur, Y. Ren, F. Ordóñez, M. Dessouky, and H. Zhong, “A model and algorithm for the courier delivery problem with uncertainty,” Transportation science, vol. 44, no. 2, pp. 193–205, 2010.
 [85] M. Lowalekar, P. Varakantham, and P. Jaillet, “Online spatiotemporal matching in stochastic and dynamic domains,” Artificial Intelligence, vol. 261, pp. 71–112, 2018.
 [86] Z. Qin, X. Tang, Y. Jiao, F. Zhang, Z. Xu, H. Zhu, and J. Ye, “Ride-hailing order dispatching at DiDi via reinforcement learning,” INFORMS Journal on Applied Analytics, vol. 50, no. 5, pp. 272–286, 2020.
 [87] S. Zhang, L. Qin, Y. Zheng, and H. Cheng, “Effective and efficient: Large-scale dynamic city express,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3203–3217, 2016.
 [88] M. Nourinejad and M. Ramezani, “Developing a large-scale taxi dispatching system for urban networks,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2016, pp. 441–446.
 [89] F. Miao, S. Han, A. M. Hendawi, M. E. Khalefa, J. A. Stankovic, and G. J. Pappas, “Data-driven distributionally robust vehicle balancing using dynamic region partitions,” in Proceedings of the 8th International Conference on Cyber-Physical Systems, 2017, pp. 261–271.
 [90] M. Qu, H. Zhu, J. Liu, G. Liu, and H. Xiong, “A cost-effective recommender system for taxi drivers,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 45–54.
 [91] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie, “T-finder: A recommender system for finding passengers and vacant taxis,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2390–2403, 2012.
 [92] J. Xu, R. Rahmatizadeh, L. Bölöni, and D. Turgut, “Taxi dispatch planning via demand and destination modeling,” in 2018 IEEE 43rd Conference on Local Computer Networks (LCN). IEEE, 2018, pp. 377–384.
 [93] X. Xie, F. Zhang, and D. Zhang, “PrivateHunt: Multi-source data-driven dispatching in for-hire vehicle systems,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 1, pp. 1–26, 2018.
 [94] T. Verma, P. Varakantham, S. Kraus, and H. C. Lau, “Augmenting decisions of taxi drivers through reinforcement learning for improving revenues,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 27, no. 1, 2017.
 [95] Y. Gao, D. Jiang, and Y. Xu, “Optimize taxi driving strategies based on reinforcement learning,” International Journal of Geographical Information Science, vol. 32, no. 8, pp. 1677–1696, 2018.
 [96] X. Chen and Y. Tian, “Learning to perform local rewriting for combinatorial optimization,” in Advances in Neural Information Processing Systems, 2019, pp. 6278–6289.
 [97] H. Lu, X. Zhang, and S. Yang, “A learning-based iterative method for solving vehicle routing problems,” in International Conference on Learning Representations, 2020.
 [98] L. Duan, Y. Zhan, H. Hu, Y. Gong, J. Wei, X. Zhang, and Y. Xu, “Efficiently solving the practical vehicle routing problem: A novel joint learning approach,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3054–3063.
 [99] A. Delarue, R. Anderson, and C. Tjandraatmadja, “Reinforcement learning with combinatorial actions: An application to vehicle routing,” arXiv preprint arXiv:2010.12001, 2020.
 [100] L. Xin, W. Song, Z. Cao, and J. Zhang, “Multi-decoder attention model with embedding glimpse for solving vehicle routing problems,” arXiv preprint arXiv:2012.10638, 2020.
 [101] W. Joe and H. C. Lau, “Deep reinforcement learning approach to solve dynamic vehicle routing problem with stochastic customers,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, 2020, pp. 394–402.
 [102] A. L. Ottoni, E. G. Nepomuceno, M. S. de Oliveira, and D. C. de Oliveira, “Reinforcement learning for the traveling salesman problem with refueling,” Complex & Intelligent Systems, pp. 1–15, 2021.
 [103] W. Qin, Z. Zhuang, Z. Huang, and H. Huang, “A novel reinforcement learning-based hyper-heuristic for heterogeneous vehicle routing problem,” Computers & Industrial Engineering, vol. 156, p. 107252, 2021.
 [104] A. Bogyrbayeva, S. Jang, A. Shah, Y. J. Jang, and C. Kwon, “A reinforcement learning approach for rebalancing electric vehicle sharing systems,” IEEE Transactions on Intelligent Transportation Systems, 2021.
 [105] J. Shi, Y. Gao, W. Wang, N. Yu, and P. A. Ioannou, “Operating electric vehicle fleet for ride-hailing services with reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 11, pp. 4822–4834, 2019.
 [106] J. James, W. Yu, and J. Gu, “Online vehicle routing with neural combinatorial optimization and deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3806–3817, 2019.
 [107] B. Lin, B. Ghaddar, and J. Nathwani, “Deep reinforcement learning for electric vehicle routing problem with time windows,” arXiv preprint arXiv:2010.02068, 2020.
 [108] K. Zhang, F. He, Z. Zhang, X. Lin, and M. Li, “Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach,” Transportation Research Part C: Emerging Technologies, vol. 121, p. 102861, 2020.
 [109] J. K. Falkner and L. Schmidt-Thieme, “Learning to solve vehicle routing problems with time windows through joint attention,” arXiv preprint arXiv:2006.09100, 2020.
 [110] J. Zhao, M. Mao, X. Zhao, and J. Zou, “A hybrid of deep reinforcement learning and local search for the vehicle routing problems,” IEEE Transactions on Intelligent Transportation Systems, 2020.
 [111] J. Li, L. Xin, Z. Cao, A. Lim, W. Song, and J. Zhang, “Heterogeneous attentions for solving pickup and delivery problem via deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, 2021.
 [112] X. Li, W. Luo, M. Yuan, J. Wang, J. Lu, J. Wang, J. Lü, and J. Zeng, “Learning to optimize industry-scale dynamic pickup and delivery problems,” in 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021, pp. 2511–2522.
 [113] H. Lee and J. Jeong, “Mobile robot path optimization technique based on reinforcement learning algorithm in warehouse environment,” Applied Sciences, vol. 11, no. 3, p. 1209, 2021.
 [114] M. M. Solomon, “Algorithms for the vehicle routing and scheduling problems with time window constraints,” Operations Research, vol. 35, no. 2, pp. 254–265, 1987.
 [115] O. B. G. Madsen, M. L. Fisher, and K. O. Jornsten, Vehicle routing with time windows: Two optimization algorithms, 1997.
 [116] J. H. Holland, Adaptation in Natural and Artificial Systems, 1992.
 [117] M. Desrochers, J. Desrosiers, and M. Solomon, “A new optimization algorithm for the vehicle routing problem with time windows,” Operations Research, vol. 40, pp. 342–354, 04 1992.
 [118] F. Glover, “Tabu search - Part I,” INFORMS Journal on Computing, vol. 2, pp. 4–32, 1990.
 [119] C. Groër, B. Golden, and E. Wasil, “A library of local search heuristics for the vehicle routing problem,” Mathematical Programming Computation, vol. 2, no. 2, pp. 79–101, 2010.
 [120] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
 [121] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in neural information processing systems, 2015, pp. 2692–2700.
 [122] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural combinatorial optimization with reinforcement learning,” arXiv preprint arXiv:1611.09940, 2016.
 [123] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma, “Strong mixedinteger programming formulations for trained neural networks,” Mathematical Programming, pp. 1–37, 2020.
 [124] K. Helsgaun, “An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems,” Roskilde: Roskilde University, 2017.
 [125] TLC, “NYC Taxi & Limousine Commission trip record data,” https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, 2020.
 [126] Kaggle, “Uber pickups in New York City - trip data for over 20 million Uber (and other for-hire vehicle) trips in NYC,” https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city, 2017.
 [127] Uber, “Uber Movement,” https://movement.uber.com/?lang=en-US, 2021.
 [128] GAIA, “DiDi GAIA open data set: KDD Cup 2020,” https://outreach.didichuxing.com/appEnvue/KDD_CUP_2020?id=1005, 2020.
 [129] Y. Li, Y. Zheng, and Q. Yang, “Dynamic bike reposition: A spatiotemporal reinforcement learning approach,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1724–1733.
 [130] GAIA, “KDD Cup 2020: Learning to dispatch and reposition on a mobility-on-demand platform,” https://www.biendata.xyz/competition/kdd_didi/, 2020.
 [131] Huawei, “ICAPS 2021: The dynamic pickup and delivery problem,” https://competition.huaweicloud.com/information/1000041411/introduction, 2021.
 [132] CVRPLIB, “CVRPLIB homepage,” http://vrp.atd-lab.inf.puc-rio.br/index.php/en/.
 [133] “Solomon benchmark,” https://www.sintef.no/Projectweb/TOP/VRPTW/Solomon-benchmark/, 2008.
 [134] “Homberger benchmark,” https://www.sintef.no/projectweb/top/vrptw/homberger-benchmark/, 2008.
 [135] “Li & Lim benchmark,” https://www.sintef.no/projectweb/top/pdptw/, 2008.
 [136] Z. Zong, J. Feng, K. Liu, H. Shi, and Y. Li, “DeepDPM: Dynamic population mapping via deep neural network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 1294–1301.
 [137] J. Feng, Y. Li, C. Zhang, F. Sun, F. Meng, A. Guo, and D. Jin, “DeepMove: Predicting human mobility with attentional recurrent networks,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1459–1468.
 [138] Z. Lin, J. Feng, Z. Lu, Y. Li, and D. Jin, “DeepSTN+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 1020–1027.
 [139] P. Geibel, “Reinforcement learning for MDPs with constraints,” in European Conference on Machine Learning. Springer, 2006, pp. 646–653.
 [140] S. Miryoosefi, K. Brantley, H. Daumé III, M. Dudík, and R. Schapire, “Reinforcement learning with convex constraints,” arXiv preprint arXiv:1906.09323, 2019.
 [141] E. Taniguchi and R. G. Thompson, City logistics: Mapping the future. CRC Press, 2014.
 [142] R. Agarwal, D. Schuurmans, and M. Norouzi, “An optimistic perspective on offline reinforcement learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 104–114.
 [143] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, “Meta-reinforcement learning of structured exploration strategies,” arXiv preprint arXiv:1802.07245, 2018.
 [144] X. Yu and S. Shen, “An integrated decomposition and approximate dynamic programming approach for on-demand ride pooling,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3811–3820, 2019.
 [145] S. Shah, M. Lowalekar, and P. Varakantham, “Neural approximate dynamic programming for on-demand ride-pooling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 507–515.
 [146] C. Raju, Y. Narahari, and K. Ravikumar, “Reinforcement learning applications in dynamic pricing of retail markets,” in IEEE International Conference on E-Commerce, 2003. CEC 2003. IEEE, 2003, pp. 339–346.
 [147] D. Bertsimas and G. Perakis, “Dynamic pricing: A learning approach,” in Mathematical and computational models for congestion charging. Springer, 2006, pp. 45–79.