1. Introduction
Real-time ridesharing refers to the task of arranging one-time shared rides on very short notice (Amey et al., 2011; Furuhata et al., 2013). This technique, embedded in popular platforms including Uber, Lyft, and DiDi Chuxing, has greatly transformed the way people travel. By exploiting individual trajectory data in both space and time, it makes traffic management more efficient and can help alleviate traffic congestion (Li et al., 2016).
One of the critical problems in large-scale real-time ridesharing systems is how to dispatch orders, i.e., how to assign orders to a set of active drivers in real time. Because the quality of order dispatching directly affects the utilization of transportation capacity, the amount of service income, and the level of customer satisfaction, solving the order dispatching problem is key to any successful ridesharing platform. In this paper, our goal is to develop an intelligent decision system that maximizes the gross merchandise volume (GMV), i.e., the value of all the orders served in a single day, while scaling up to a large number of drivers and remaining robust to potential hardware or connectivity failures.
The challenge of order dispatching is to find an optimal trade-off between short-term and long-term rewards. When the number of available orders is larger than the number of active drivers within the order broadcasting area (shown as the grey shadow area in the center of Fig. 0(b)), the problem becomes one of finding an optimal order choice for each driver. Taking an order with a higher price contributes to the immediate income; however, it might also harm the GMV in the long run if the order takes the driver to a sparsely populated area. As illustrated in Fig. 0(a), consider two orders that start from the same area but go to two different destinations: a neighboring central business district (CBD) and a distant suburb. A driver taking the latter may earn a higher one-off order price due to the longer travel distance, but the suburban destination, with little demand, could prevent the driver from sustaining further income. Dispatching too many such orders will therefore reduce the number of orders taken and harm the long-term GMV. The problem becomes more serious during peak hours, when the imbalance between vehicle supply and order demand worsens. As such, an intelligent order dispatching system should be designed not only to assign high-priced orders to drivers, but also to anticipate the future demand-supply gap and distribute the imbalance among different destinations. In the meantime, the pickup distance should also be minimized, as drivers are not paid during the pickup process and long waiting times harm the customer experience.
One direction to tackle the order dispatching challenge has been to apply hand-crafted features to either centralized dispatching authorities (e.g., combinatorial optimization algorithms (Papadimitriou and Steiglitz, 1982; Zhang et al., 2017)) or distributed multi-agent scheduling systems (Wooldridge, 2009; Alshamsi and Abdallah, 2009), in which a group of autonomous agents sharing a common environment interact with each other. However, the system performance relies heavily on the specially designed weighting scheme. For centralized approaches, another critical issue is the potential "single point of failure" (Lynch, 2009), i.e., a failure of the centralized authority brings down the whole system. Although the multi-agent formulation provides a distributed perspective by allowing each driver to choose their order preference independently, existing solutions require rounds of direct communication between agents during execution (Seow et al., 2010), and are thus limited to a local area with a small number of agents.
Recent attempts have been made to formulate this problem with centralized authority control and model-free reinforcement learning (RL) (Sutton and Barto, 1998), which learns a policy by interacting with a complex environment. However, existing approaches (Xu et al., 2018) formulate the order dispatching problem in a single-agent setting, which cannot model the complex interactions between drivers and orders and thus oversimplifies the stochastic demand-supply dynamics of large-scale ridesharing scenarios. Also, executing an order dispatching system in a centralized manner still suffers from the reliability issue mentioned above, which is inherent to the centralized architecture.
In contrast to these approaches, in this work we model the order dispatching problem with multi-agent reinforcement learning (MARL) (Buşoniu et al., 2010), where agents share a centralized judge (the critic) that rates their decisions (actions) and updates their strategies (policies). The centralized critic is no longer needed during execution, as agents can follow their learned policies independently, which makes the order dispatching system more robust to potential hardware or connectivity failures. With the recent development of the Internet of Things (IoT) (Zanella et al., 2014; Gubbi et al., 2013; Atzori et al., 2010) and the Internet of Vehicles (IoV) (Lu et al., 2014; Yang et al., 2014), fully distributed execution could be deployed in practice by distributing the centrally trained policy to each vehicle through Vehicle-to-Network (V2N) communication (Abboud et al., 2016). By allowing each driver to learn by maximizing its cumulative reward over time, the reinforcement learning based approach relieves us from designing a sophisticated weighting scheme for the matching algorithms. Also, the multi-agent setting follows the distributed nature of the peer-to-peer ridesharing problem, giving the dispatching system the ability to capture the stochastic demand-supply dynamics of large-scale ridesharing scenarios. Meanwhile, such fully distributed execution also enables us to scale up to much larger scenarios with many more agents, i.e., a scalable real-time order dispatching system for ridesharing services with millions of drivers.
Nonetheless, the major challenge in applying MARL to order dispatching lies in the changing dynamics of two components: the size of the action set and the size of the population. As illustrated in Fig. 0(b), the action set for each agent is defined as the set of neighboring active orders within a given radius of each active driver (shown as the shadow area around each black dot). As active orders are taken and new orders keep arriving, the size and content of this set change constantly over time. The action set also changes when the agent moves to another location and arrives in a new neighborhood. On the other hand, drivers can switch between online and offline in the real-world scenario, so the population size for the order dispatching task also changes over time.
In this paper, we address the two problems of variable action sets and variable population size by extending actor-critic policy gradient methods. Our methods tackle the order dispatching problem within the framework of centralized training with decentralized execution. The critic is provided with information from other agents to incorporate peer information, while each actor behaves independently with local information only. To handle the variable population size, we adopt the mean field approximation to transform the interactions between agents into the pairwise interaction between an agent and the average response from a subpopulation in the neighborhood. We provide a convergence proof of mean field reinforcement learning algorithms with function approximation to justify our algorithm in theory. To handle the changing action set, we use the vectorized features of each order as the network input to generate a set of ranking values, which are fed into a Boltzmann softmax selector to choose an action. Experiments on the large-scale simulator show that, compared with a variety of multi-agent learning benchmarks, the mean field multi-agent reinforcement learning algorithm gives the best performance in terms of the GMV, the order response rate, and the average pickup distance. Besides its state-of-the-art performance, our solution also enjoys the advantage of distributed execution, which has lower latency and is easy to deploy in real-world applications.
2. Method
In this section, we first present our definition of order dispatching as a Markov game and discuss two challenges that arise when applying MARL to this game. We then propose a MARL approach with independent learning, namely the independent order dispatching algorithm (IOD), to solve this game. By extending IOD with mean field approximations, which capture dynamic demand-supply variations by propagating many local interactions between agents and the environment, we finally propose the cooperative order dispatching algorithm (COD).
2.1. Order Dispatching as a Markov Game
2.1.1. Game Settings
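We model the order dispatching task by a Partially Observable Markov Decision Process (POMDP) (Littman, 1994) in a fully cooperative setting, defined by a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{A}, \mathcal{R}, \mathcal{O}, N, \gamma \rangle$, where $\mathcal{S}, \mathcal{P}, \mathcal{A}, \mathcal{R}, \mathcal{O}, N, \gamma$ are the sets of states, transition probability functions, sets of joint actions, reward functions, sets of private observations, number of agents, and a discount factor, respectively. Given two sets $\mathcal{X}$ and $\mathcal{Y}$, we use $\mathcal{X} \times \mathcal{Y}$ to denote the Cartesian product of $\mathcal{X}$ and $\mathcal{Y}$, i.e., $\mathcal{X} \times \mathcal{Y} = \{(x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y}\}$. The definitions are given as follows: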
Agents: The homogeneous agents, each identified by an index $i$, are the active drivers in the environment. As drivers can switch between online and offline via a random process, the number of agents can change over time.

States and observations: At each time step $t$, agent $i$ draws a private observation correlated with the true environment state according to the observation function. The initial state of the environment is determined by an initial state distribution. A typical environment state includes the order and driver distribution, the global timestamp, and other environment dynamics (traffic congestion, weather conditions, etc.). In this work, we define the observation for each agent with three components: agent $i$'s location, the timestamp, and an on-trip flag showing whether agent $i$ is available to take new orders.

Actions and state transition: In the order dispatching task, each driver can only take active orders within its neighborhood (inside a given radius, as illustrated in Fig. 0(b)). Hence, agent $i$'s action set is defined as its own active order pool, based on its observation. Each action candidate is parameterized by the normalized vector representation of the corresponding order's origin and destination. At time step $t$, each agent takes an action, forming a set of joint driver-order pairs, which induces a transition in the environment according to the state transition function (Eq. 1). For simplicity, we assume no order cancellations or changes during the trip, i.e., the assigned order remains unchanged while agent $i$ is on its way to the destination.

Rewards: Each agent obtains rewards by a reward function (Eq. 2). As described in Section 1, we want to maximize the total income by considering both the charge of each order and the potential opportunity of the destination. The reward is therefore defined as the combination of driver $i$'s own income from its order choice and the order destination potential, which is determined by the behaviors of all agents in the environment. The credit-assignment problem (Agogino and Tumer, 2008) arises in MARL with many agents, i.e., the contribution of one agent's behavior is drowned out by the noise of all the other agents' impact on the reward function. For this reason, we use each driver's own income, instead of the total income of all drivers, as the reward.
Figure 2. Overview of our COD approach. (a) The information flow between RL agents and the environment (shown with two agents). The centralized authority is only needed during the training stage (dashed arrows) to gather the average response; agents can behave independently during the execution stage (solid arrows), thus being robust to the "single point of failure". (b) and (c) Architectures of the critic and the actor.

To encourage cooperation between agents and avoid selfish, greedy behavior, we use the demand-supply gap at the order destination as a constraint on the behavior of agents. Precisely, we compare the demand-supply status of the order origin and destination, and encourage the driver to choose the order whose destination has a larger demand-supply gap. The order destination potential (DP) is defined as
$DP = \#DD - \#DS$ (3)
where #DD and #DS are the demand and the supply of the destination, respectively. We consider the DP only if the number of orders is larger than the number of drivers at the origin. If the order destination has more drivers than orders, we penalize this order by the demand-supply gap at the destination, and vice versa.
To provide a better customer experience, we also add the pickup distance as a regularizer. The weights of the DP and of the pickup distance relative to the order price are set by two regularization ratios. We typically choose the regularization ratios so that the different reward terms are scaled into approximately the same range, although in practice a grid search could be used to get better performance. The effectiveness of our reward function setting is empirically verified in Section 3.1.3.
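The reward described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DP is taken as destination demand minus destination supply, the pickup distance is assumed to enter as a subtracted penalty, and the names `dp_ratio` and `pickup_ratio` are hypothetical stand-ins for the two regularization ratios.

```python
def destination_potential(dest_demand, dest_supply, origin_orders, origin_drivers):
    """Demand-supply gap at the destination (assumed form: demand minus supply).

    The DP is only considered when the origin has more orders than drivers;
    otherwise it contributes nothing to the reward.
    """
    if origin_orders <= origin_drivers:
        return 0.0
    return float(dest_demand - dest_supply)


def order_reward(price, dp, pickup_distance, dp_ratio, pickup_ratio):
    """Combine the driver's own income with the two regularization terms.

    Assumption: the pickup distance is subtracted as a penalty, and both
    ratios are chosen so that the terms fall in a similar numeric range.
    """
    return price + dp_ratio * dp - pickup_ratio * pickup_distance
```

For example, an order priced at 20 with a destination gap of 6 and a pickup distance of 2 yields a reward of 21 when `dp_ratio = 0.5` and `pickup_ratio = 1.0`; a destination with more drivers than orders produces a negative DP and lowers the reward.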

Discounted return: Each agent aims to maximize its total discounted reward
$G_t^i = \sum_{k=t}^{\infty} \gamma^{k-t} r_k^i$ (4)
from time step $t$ onwards, where $\gamma \in [0, 1]$ is the discount factor.
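As a small worked example, the total discounted reward over a finite episode can be computed with a backward recursion. The function name and the finite-horizon truncation are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward of a finite reward sequence.

    rewards[0] is the reward at the current time step; later rewards are
    discounted by successive powers of gamma.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With rewards `[1.0, 1.0, 1.0]` and `gamma = 0.5`, the result is 1 + 0.5 + 0.25 = 1.75.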
We denote joint quantities over agents in bold, and joint quantities over agents other than a given agent $i$ with the subscript $-i$. To stabilize the training process, we maintain an experience replay buffer containing transition tuples, as described in (Mnih et al., 2015).
2.1.2. Dynamic of the Action Set Elements
In the order dispatching problem, an action is defined as an active order within a given radius of the agent. Hence, the content and size of the active order pool for each driver change with both location and time, i.e., the action set for an agent in the order dispatching MDP changes throughout the training and execution process. This aspect of order dispatching prevents us from using a table to store the Q-values, because the action set is potentially infinite in size. On the other hand, a typical policy network for stochastic actions uses a softmax output layer to produce the probabilities of choosing each action among a fixed set of action candidates, and is thus unable to fit the order dispatching problem.
2.1.3. Dynamic of the Population Size
To overcome the non-stationarity of the multi-agent environment, Lowe et al. (2017) use Q-learning to approximate the discounted reward and rewrite the gradient of the expected return for agent $i$ following a deterministic policy as
(5) 
Here the critic is a centralized action-value function that takes as input the observation of agent $i$ and the joint actions of all agents, and outputs the Q-value for agent $i$. However, as offline drivers cannot participate in the order dispatching procedure, the number of agents in the environment changes over time. Also, in a typical order dispatching task, which involves thousands of agents, modeling the highly dynamic interactions between such a large number of agents is intractable. Thus, a naive concatenation of all other agents' actions cannot form a valid input for the value network and is not applicable to the order dispatching task.
2.2. Independent Order Dispatching
To solve the order dispatching MDP, we first propose the independent order dispatching algorithm (IOD), a straightforward MARL approach with independent learning. We provide each learner with the actor-critic model, which is a popular form of policy gradient (PG) method. For each agent $i$, PG works by directly adjusting the parameters of the policy to maximize the objective, taking steps in the direction of its gradient. In MARL, independent actor-critic uses the action-value function to approximate the discounted reward by Q-learning (Watkins and Dayan, 1992). Here we use temporal-difference learning (Sutton, 1988) to approximate the true action-value function, leading to a variety of actor-critic algorithms in which the action-value function is called the critic and the policy is called the actor.
To solve the problem of variable action sets, we use a policy network that takes both observation and action embeddings as input, derived from in-action approximation methods (shown in Fig. 1(b)). As illustrated in Fig. 1(c), we use a deterministic policy (denoted by $\mu_{\theta_i}$, abbreviated as $\mu_i$) to generate a ranking value for each observation-action pair, one for each candidate within agent $i$'s action set. To choose an action $a_j$, these values are then fed into a Boltzmann softmax selector
$\pi(a_j \mid o) = \frac{\exp(\mu_i(o, a_j) / \tau)}{\sum_{m} \exp(\mu_i(o, a_m) / \tau)}$ (6)
where $\tau$ is the temperature that controls the exploration rate. Note that a typical policy network with out-action approximation is equivalent to this approach if a one-hot vector, with one dimension per action, is fed into the policy network as the action embedding. The main difference between the two approaches is execution efficiency, as our approach needs one forward pass per action candidate in a single execution step. Meanwhile, using order features naturally provides an informative form of embedding: as each order is parameterized by the concatenation of the normalized vector representations of its origin and destination, similar orders are close to each other in the vector space and produce similar outputs from the policy network, which improves the generalization ability of the algorithm.
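A Boltzmann softmax selector over a variable-size candidate set can be sketched as below. The ranking values are assumed to come from the policy network (one per candidate order), the temperature is assumed to divide the ranking values, and the function names are hypothetical.

```python
import math
import random


def boltzmann_probabilities(scores, temperature):
    """Turn ranking values into selection probabilities.

    Works for any number of candidates, so the action set may change
    from one decision to the next.
    """
    if not scores:
        raise ValueError("at least one candidate order is required")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def select_order(scores, temperature, rng=random):
    """Sample the index of one candidate order from the Boltzmann distribution."""
    probs = boltzmann_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

A high temperature spreads the probability mass across candidates (more exploration), while a temperature close to zero makes the selector nearly greedy with respect to the ranking values.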
In IOD, each critic takes as input the observation embedding, which combines agent $i$'s location and the timestamp. The action embedding is built from the vector representation of the order destination and the distance between the driver location and the order origin. The critic is a DQN (Mnih et al., 2015) that uses neural network function approximation to learn the action-value function for each agent $i$ by minimizing the loss
(7)
where the target is computed with a target network for the action-value function and a target network for the deterministic policy. These earlier snapshots of the parameters are periodically updated with the most recent network weights and help increase learning stability by decorrelating the predicted and target Q-values and deterministic policy values.
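The periodic refresh of a target network can be illustrated with a small helper. This is a sketch under assumptions: parameters are represented as a flat dictionary of floats, the refresh is a hard copy every `period` updates, and the class name `TargetSync` is hypothetical.

```python
class TargetSync:
    """Keep a lagged snapshot of online parameters, refreshed every `period` steps."""

    def __init__(self, online_params, period):
        if period <= 0:
            raise ValueError("period must be positive")
        self.period = period
        self.steps = 0
        self.target = dict(online_params)  # initial snapshot

    def step(self, online_params):
        """Count one training update and refresh the snapshot when due."""
        self.steps += 1
        if self.steps % self.period == 0:
            self.target = dict(online_params)
        return self.target
```

Between refreshes the target values are computed from the older snapshot, so the regression target does not move with every gradient step.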
Following Silver et al. (2014), the gradient of the expected return for agent following a deterministic policy is
(8) 
Here the critic is an action-value function that takes as input the observation and action of agent $i$, and outputs the Q-value for agent $i$.
2.3. Cooperative Order Dispatching with Mean Field Approximation
To fully condition on other agents' policies in an environment with variable population size, we propose to integrate our IOD algorithm with mean field approximations, following Mean Field Reinforcement Learning (MFRL) (Yang et al., 2018). MFRL addresses the scalability issue in multi-agent reinforcement learning with a large number of agents by approximating the interactions pairwise, as the interaction between an agent and the average response from a subpopulation in the neighborhood. As this pairwise approximation hides the exact number of interacting counterparts, the mean field approximation lets us model other agents' policies directly in environments with variable population sizes.
In the order dispatching task, agents interact with each other by choosing order destinations with high demand to optimize the demand-supply gap. As illustrated in Fig. 0(b), the range of the neighborhood is defined as twice the length of the order receiving radius, because agents within this area have overlapping action sets and interact with each other. The average response is therefore defined as the number of drivers arriving at the same neighborhood as agent $i$, divided by the number of available orders for agent $i$. For example, when agent $i$ finishes its order and arrives at the neighborhood in Fig. 0(b) (the central agent), the average response is 2/3, as there are two agents within the neighborhood area and three available orders.
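The average response can be computed from two counts, as in the 2/3 example above. The sketch below assumes planar coordinates and Euclidean distance; the helper names and the convention of returning zero when no orders are available are assumptions.

```python
import math


def count_within(center, points, radius):
    """Count the points that lie within `radius` of `center` (Euclidean distance)."""
    return sum(1 for p in points if math.dist(center, p) <= radius)


def average_response(num_arriving_drivers, num_available_orders):
    """Drivers arriving in the neighborhood divided by the agent's available orders."""
    if num_available_orders <= 0:
        return 0.0  # assumption: no available orders means no response signal
    return num_arriving_drivers / num_available_orders
```

For instance, `average_response(2, 3)` reproduces the value 2/3, and `count_within` can be called with a radius of twice the order receiving radius to obtain the driver count for the neighborhood.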
The introduction of mean field approximations enables agents to learn with an awareness of their interacting counterparts, which helps to improve both the training stability and the robustness of the trained agents in the order dispatching task. Note that the average response is only used for the model update; thus the centralized authority is only needed during the training stage. During the execution stage, agents can behave in a fully distributed manner, thus being robust to the "single point of failure".
We propose the cooperative order dispatching algorithm (COD) as illustrated in Fig. 1(a), and present the pseudo code for COD in Algorithm 1. Each critic is trained by minimizing the loss
(9) 
where the mean field value function for the target network is computed with the Boltzmann selector from Eq. 6,
(10)
and the average response is taken within agent $i$'s neighborhood. The actor of COD learns the optimal policy by using the policy gradient:
(11) 
In the current decision process, active agents with overlapping order receiving areas (e.g., the central and upper agents in Fig. 0(b)) might select the same order when following their own strategies. Such collisions could lead to invalid order assignments and force both drivers and customers to wait for a certain period, equal to the time interval between dispatching iterations. Note that the time interval of decision making also influences the dispatching performance: an interval that is too long harms the passenger experience, while an interval that is too short leaves too few order candidates for a good decision. To solve this problem, our approach works in a fully distributed manner with an asynchronous dispatching strategy that allows agents to have different decision time intervals for individual states, i.e., agents assigned invalid orders can immediately choose a new order from the updated candidate pool.
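The re-choosing behavior after a collision can be illustrated with a simplified, sequential sketch. It is not the asynchronous mechanism itself: drivers are processed one after another in dictionary order, each driver's preference list is assumed to be ranked best-first, and the function name is hypothetical.

```python
def resolve_collisions(preferences):
    """Assign each driver its best order that has not been taken yet.

    preferences maps a driver id to its candidate order ids, ranked best-first.
    A driver whose preferred order is already taken immediately falls back to
    its next candidate; a driver with no remaining candidates stays unassigned.
    """
    assigned = {}
    taken = set()
    for driver, ranked_orders in preferences.items():
        for order in ranked_orders:
            if order not in taken:
                assigned[driver] = order
                taken.add(order)
                break
    return assigned
```

With `{"d1": ["o1", "o2"], "d2": ["o1", "o3"], "d3": ["o1"]}`, driver `d1` keeps `o1`, driver `d2` re-chooses `o3`, and driver `d3` remains unassigned until new orders arrive.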
To theoretically support the efficacy of our proposed COD algorithm, we provide the convergence proof of MFRL with function approximations as shown below.
2.4. Convergence of Mean Field Reinforcement Learning with Function Approximations
Inspired by the previous proof of MFRL convergence in the tabular setting (Yang et al., 2018), we extend the proof to the case where the Q-function is represented by other function approximators. In addition to the Markov game setting in Section 2.1.1, let $\mathcal{Q}$ be a family of real-valued functions defined on $\mathcal{S} \times \mathcal{A} \times \bar{\mathcal{A}}$, where $\bar{\mathcal{A}}$ is the action space for the mean actions computed from the neighbors. For simplicity, we assume the environment is a fully observable MDP, i.e., each agent can observe the global state instead of the local observation.
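Assuming that the function class is linearly parameterized, for each agent $i$ the Q-function can be expressed as the linear span of a fixed set of $P$ linearly independent functions $\phi_p$. Given the parameter vector $\theta \in \mathbb{R}^P$, the function $Q_\theta$ is thus defined as
$Q_\theta(s, a, \bar{a}) = \sum_{p=1}^{P} \phi_p(s, a, \bar{a}) \, \theta_p = \phi(s, a, \bar{a})^\top \theta$ (12)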
In the function approximation setting, we apply the update rules:
(13) 
where is the temporal difference:
(14) 
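A temporal-difference update for a linearly parameterized Q-function typically takes the following form. This sketch shows the generic linear TD step, which is assumed here to match the structure of the update rule; the value of the next state is passed in as `next_value`, and all names are hypothetical.

```python
def q_value(theta, features):
    """Linear Q-function: inner product of parameters and basis features."""
    return sum(t * f for t, f in zip(theta, features))


def td_update(theta, features, reward, next_value, gamma, learning_rate):
    """One TD step: move theta along the features, scaled by the TD error."""
    delta = reward + gamma * next_value - q_value(theta, features)
    new_theta = [t + learning_rate * delta * f for t, f in zip(theta, features)]
    return new_theta, delta
```

Starting from `theta = [0.0, 0.0]` with features `[1.0, 2.0]`, reward 1, and a next-state value of 0, the TD error is 1 and a learning rate of 0.1 moves the parameters to approximately `[0.1, 0.2]`.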
Our goal is to derive the parameter vector such that the linear Q-function approximates the (local) Nash Q-values. Under the main assumptions and the lemma introduced below, Yang et al. (2018) proved that the policy is Lipschitz continuous with respect to the Q-function, with a constant that depends on the upper bound of the observed reward.
Assumption 1.
Each state-action pair is visited infinitely often, and the reward is bounded by some constant.
Assumption 2.
Each agent's policy is Greedy in the Limit with Infinite Exploration (GLIE). In the case of the Boltzmann policy, the policy becomes greedy w.r.t. the Q-function in the limit as the temperature decays asymptotically to zero.
Assumption 3.
For each stage game at time $t$ and in state $s$ during training, the Nash equilibrium is recognized either as 1) the global optimum or 2) a saddle point.
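Lemma 2.1.
The random process $\{\Delta_t\}$ defined in $\mathbb{R}$ as
$\Delta_{t+1}(x) = (1 - \alpha_t(x)) \Delta_t(x) + \alpha_t(x) F_t(x)$
converges to zero with probability 1 (w.p.1) when

1) $0 \le \alpha_t(x) \le 1$, $\sum_t \alpha_t(x) = \infty$, $\sum_t \alpha_t^2(x) < \infty$;

2) $x \in \mathcal{X}$, the set of possible states, and $|\mathcal{X}| < \infty$;

3) $\| \mathbb{E}[F_t(x) \mid \mathcal{F}_t] \|_W \le \gamma \|\Delta_t\|_W + c_t$, where $\gamma \in [0, 1)$ and $c_t$ converges to zero w.p.1;

4) $\mathrm{var}[F_t(x) \mid \mathcal{F}_t] \le K (1 + \|\Delta_t\|_W^2)$ with constant $K > 0$.
Here $\mathcal{F}_t$ denotes the filtration of an increasing sequence of $\sigma$-fields including the history of processes; $\alpha_t, \Delta_t, F_t \in \mathcal{F}_t$ and $\|\cdot\|_W$ is a weighted maximum norm (Bertsekas, 2012).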
Proof.
In contrast to the previous work of Yang et al. (2018), we establish the convergence of Eq. (13) by adopting an ordinary differential equation (ODE) with a globally asymptotically stable equilibrium point that the trajectories closely follow, following the framework of the convergence proof of single-agent Q-learning with function approximation (Melo et al., 2008).
Theorem 2.2.
Proof.
We first rewrite Eq. (13) as an ODE:
(15) 
Notice that we use a vector that stacks the update rules for the Q-function of each agent. A necessary condition for an equilibrium is that the right-hand side of the ODE vanishes. The existence of such an equilibrium is restricted to scenarios that meet Assumption 3. Yang et al. (2018) proved that under Assumption 3, the existing equilibrium, either in the form of a global equilibrium or in the form of a saddle-point equilibrium, is unique.
Let , we have:
(16) 
As we know that the policy is Lipschitz continuous w.r.t , this implies that and are also Lipschitz continuous w.r.t to . In other words, if is sufficiently small and close to zero, then the norm term of goes to zero. Considering near the equilibrium point , is a negative definite matrix, the Eq. (16) tends to be negative definite as well, so the ODE in Eq.(15) is globally asymptotically stable and the conclusion of the theorem follows. ∎
While in practice, we might break the linear condition by the use of nonlinear activation functions, the Lipschitz continuity will still hold as long as the nonlinear addon is limited to a small scale.
3. Experiment
To support the training and evaluation of our MARL algorithm, we adopt two simulators, one with a grid-based map and one with a coordinate-based map. The main difference between these two simulators is the design of the driver pickup module, i.e., the process of a driver reaching the origin of the assigned order. In the grid-based simulator (introduced by Lin et al. (2018)), the location state of each driver and order is represented by a grid ID. Hence, the exact coordinates of each instance inside a grid are hidden by the state representation. This simplified setting ensures that there is no difference in pickup distance (or arrival time) inside a grid; it also implies the assumption of no cancellations before driver pickup. In the coordinate-based simulator, by contrast, the location of each driver and order instance is represented by a two-value vector from the Geographic Coordinate System, and cancellation before pickup is also taken into account. In this setting, taking an order within an appropriate pickup distance is crucial to each driver, as the order may be canceled if the driver takes too long to arrive. We present the experiment details for the grid-based simulator in Section 3.1 and for the coordinate-based simulator in Section 3.2.
3.1. Grid-based Experiment
3.1.1. Environment Setting
In the grid-based simulator, the city is covered by a hexagonal grid world as illustrated in Fig. 3. At each simulation time step $t$, the simulator provides an observation with a set of active drivers and a set of available orders. Each order feature includes the origin grid ID and the destination grid ID, while each driver has the grid ID as its location feature. Drivers are regarded as homogeneous and can switch between online (active) and offline via a random process learned from historical data. Given the travel distance between neighboring grids and the length of the time step interval, we assume that drivers will not move to other grids before taking a new order, and define both the order receiving area and the neighborhood as the grid where the agent stays. The order dispatching algorithm then generates a list of driver-order pairs that is optimal under the current policy, where each order is selected from the driver's order candidate pool. In the grid-based setting, the origin of each order is already embedded as the location feature of the observation, so each action is parameterized by the destination grid ID. After receiving the driver-order pairs from the algorithm, the simulator returns a new observation and a list of order fees. Based on this new observation, the order dispatching algorithm calculates a reward for each agent, stores the record in the replay buffer, and updates the network parameters with respect to a batch of samples from the replay buffer.
The data source of this simulator (provided by DiDi Chuxing) includes order information and vehicle trajectories over three weeks. Available orders are generated by bootstrapping from real orders that occurred in the same period of the day, given a bootstrapping ratio. More concretely, at each simulation time step we randomly sample orders with replacement from the real orders that occurred within the corresponding time step interval. Also, drivers are switched between online and offline following a distribution learned from real data using maximum likelihood estimation.
Lin et al. (2018) evaluate the effectiveness of the grid-based simulator by calibrating it against real data on the most important performance measure, the gross merchandise volume (GMV), reporting the coefficient of determination and the Pearson correlation between the simulated GMV and the real GMV.
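The order bootstrapping step can be sketched as sampling with replacement from the real orders in a time window. The sample size is assumed to be the bootstrapping ratio times the number of real orders in the window, and the order representation (a dictionary with a `"time"` key) is hypothetical.

```python
import random


def bootstrap_orders(real_orders, window_start, window_end, ratio, rng=random):
    """Sample orders with replacement from real orders inside one time window.

    Assumption: the number of sampled orders is ratio * (orders in the window),
    rounded to the nearest integer.
    """
    window = [o for o in real_orders if window_start <= o["time"] < window_end]
    sample_size = round(ratio * len(window))
    if not window or sample_size <= 0:
        return []
    return rng.choices(window, k=sample_size)
```

Because sampling is done with replacement, a ratio above 1 produces a denser order stream than the real data, while a ratio below 1 produces a sparser one.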
3.1.2. Model Setting
We use the grid-based simulator to compare the performance of the following methods.

Random (RAN): The random dispatching algorithm considers no additional information. It simply assigns each active driver an available order at each time step.

Response-based (RES): This response-based method aims to achieve a higher order response rate by assigning drivers to short-duration orders. During each time step, all available orders starting from the same grid are sorted by the estimated trip time. Orders with the same expected duration are further sorted by the order price to balance the performance.

Revenue-based (REV): The revenue-based algorithm focuses on a higher GMV. Orders with higher prices are given priority and dispatched first. Following a similar principle to the one described above, orders with shorter estimated trip time are assigned first if multiple orders have the same price.

IOD: The independent order dispatching algorithm as described in Section 2.2. The action-value function approximation (i.e., the Q-network) is parameterized by an MLP with four hidden layers (512, 256, 128, 64) and the policy network is parameterized by an MLP with three hidden layers (256, 128, 64). We use the ReLU (Nair and Hinton, 2010) activation between hidden layers, and transform the final linear output of the Q-network and the policy network with the ReLU and sigmoid functions, respectively. To find an optimal parameter setting, we use the Adam optimizer
(Kingma and Ba, 2014) with a learning rate of for the critic and for the actor. The discounted factor is , and the batch size is . We update the network parameters after every samples are added to the replay buffer (capacity ). We use a Boltzmann softmax selector for all MARL methods and set the initial temperature as , then gradually reduce the temperature until to limit exploration.
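The two rule-based baselines (RES and REV) amount to sorting the orders of a grid with different keys. The sketch below assumes each order is a dictionary with `"price"` and `"trip_time"` keys, and that RES breaks ties in favor of the higher price; both are assumptions made for illustration.

```python
def rank_response_based(orders):
    """RES: shortest estimated trip time first; ties broken by higher price."""
    return sorted(orders, key=lambda o: (o["trip_time"], -o["price"]))


def rank_revenue_based(orders):
    """REV: highest price first; ties broken by shorter estimated trip time."""
    return sorted(orders, key=lambda o: (-o["price"], o["trip_time"]))
```

Drivers in the grid would then be matched to orders in the ranked sequence until either drivers or orders run out.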
As described in the reward setting in Section 2.1.1, we apply the regularization ratio for DP and disable the order waiting time penalty, as we do not consider the pickup distance in the grid-based experiment. Because of our homogeneous agent setting, all agents share the same Q-network and policy network for efficient training. During execution in the real-world environment, each agent can keep its own copy of the policy parameters and receive updates periodically from a parameter server.
100%  50%  10%  

RES  
REV  
IOD  
COD 
3.1.3. Result Analysis
For all learning methods, we run episodes for training, store the trained model periodically, and conduct the evaluation on the stored model with the best training performance. The training set is generated by bootstrapping 50% of the original real orders unless specified otherwise. We use five random seeds for testing and present the averaged result. We compare the performance of different methods by three metrics, including the total income in a day (GMV), the order response rate (ORR), and the average order destination potential (ADP). ORR is the number of orders taken divided by the number of orders generated, and ADP is the sum of destination potential (as described in Section 2.1.1) of all orders divided by the number of orders taken.
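The three metrics defined above can be written out as a small helper. This is a sketch under stated assumptions: each served order is represented by a hypothetical `(price, destination_potential)` pair, "all orders" in the ADP definition is read as all orders taken, and the sample numbers are made up rather than taken from the experiments.

```python
def daily_metrics(num_generated, served):
    """Compute GMV, ORR and ADP for one simulated day.

    num_generated: number of orders generated during the day.
    served: list of (price, destination_potential) pairs, one per order taken.
    """
    num_taken = len(served)
    gmv = sum(price for price, _ in served)
    # ORR: orders taken divided by orders generated.
    orr = num_taken / num_generated if num_generated else 0.0
    # ADP: summed destination potential divided by orders taken.
    adp = sum(dp for _, dp in served) / num_taken if num_taken else 0.0
    return {"GMV": gmv, "ORR": orr, "ADP": adp}


# Made-up day: five orders generated, four of them served.
served = [(10.0, -2.0), (20.0, -1.0), (30.0, 0.0), (40.0, -1.0)]
metrics = daily_metrics(num_generated=5, served=served)
```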
Gross Merchandise Volume
As shown in Table 1, COD substantially outperforms all rulebased methods and IOD in terms of GMV. RES has the lowest GMV among all methods because it prefers short distance trips with a lower average order value. REV, in contrast, picks higher value orders with longer trip times and thus attains a higher GMV. However, neither RES nor REV finds a balance between earning more per order and taking more orders, while RAN settles on a suboptimal tradeoff without favoring either side. Our proposed MARL methods (IOD and COD) instead achieve larger GMV gains by considering each order’s price and destination potential concurrently. Orders with relatively low destination potential are less likely to be picked, which keeps drivers from being trapped in areas with very few future orders and thereby protects GMV. By directly modeling other agents’ policies and capturing the interaction between agents in the environment, the COD algorithm with mean field approximation gives the best performance among all compared methods.
100%  50%  10%  

RES  
REV  
IOD  
COD 
Order Response Rate
In Table 2 we compare the performance of different models in terms of the order response rate (ORR), which is the number of orders taken divided by the number of orders generated. RES has a higher ORR than the random strategy as it focuses on reducing the trip distance to take more orders. REV, on the other hand, picks higher value orders with longer trip times, sacrificing ORR. Although REV has a relatively higher GMV than the other two rulebased methods, its lower ORR indicates a lower customer satisfaction rate, so it fails to meet the requirement of an optimal order dispatching algorithm. By considering both the average order value and the destination potential, IOD and COD achieve a higher ORR as well. Highpriced orders with low destination potential, e.g., a long trip to a suburban area, are less likely to be picked, which avoids harming ORR when pursuing highpriced orders.
100%  50%  10%  

RES  
REV  
IOD  
COD 
Average Order Destination Potential
To better illustrate how our method narrows the demandsupply gap, we list the average order destination potential (ADP) in Table 3. Note that all ADP values are negative, which indicates that supply still cannot fully satisfy demand on average. However, IOD and COD largely alleviate the problem by dispatching orders to places with higher demand. As shown in Fig. 4, COD largely fills the demandsupply gap in the city center during peak hours, while REV fails to assign drivers to orders with highdemand destinations, thus leaving many grids with unserved orders.
Sequential Performance Analysis
To investigate how the performance of our proposed algorithms changes with the time of day, Fig. 5 shows the income of all algorithms normalized with respect to the average hourly revenue of RAN. We exclude RES from this comparison since it does not focus on achieving a higher GMV and has relatively low performance. A positive bar indicates an increase in income compared to RAN, and a negative bar a decrease. As shown in Fig. 5, MARL methods (IOD and COD) outperform RAN in most hours of the day (except for the late night period between 12 a.m. and 4 a.m., when very few orders are generated). During the peak hours in the morning and at night, MARL methods achieve a much higher hourly income than RAN. REV achieves a higher income than MARL methods between late night and early morning (12 a.m. to 8 a.m.) because of its aggressive strategy of taking highpriced orders. However, this strategy ignores the destination quality of orders and sends many drivers to places with low demand, resulting in a significant income drop during the rest of the day (except for the evening peak hours, when the chance of encountering highpriced orders is large enough for REV to offset the influence of its low response rate). IOD and COD, on the other hand, earn significantly more than RAN and REV for the rest of the day, possibly because MARL methods can recognize a better order in terms of both the order fee and the destination potential. Because the order choices of MARL methods steer agents away from destinations with lower potential, agents following MARL methods enjoy more sustainable incomes. Also, COD consistently outperforms IOD, showing the effectiveness of explicitly conditioning on other agents’ policies to capture the interaction between agents.
Method  GMV (%)  ORR (%)  ADP (%)

AVE  
IND  
AVE + DP  
IND + DP 
Effectiveness of Reward Settings
As described in Section 2.1.1, we use the destination potential as a reward function regularizer to encourage cooperation between agents. To show the effectiveness of this reward setting, we compare the GMV, ORR, and ADP obtained when the average income (AVE) or the independent income (IND) is used as each agent’s reward for IOD. We also measure the performance of adding DP as a regularizer for both settings. For this experiment, we bootstrap 10% of the original real orders for the training and test sets separately. As shown in Table 4, AVE performs worse than the IND methods and RAN on all metrics (even with DP added). This is possibly because of the credit assignment problem, where the agent’s behavior is drowned out by the noise of other agents’ impact on the reward function. On the other hand, setting the individual income as the reward helps to distinguish each agent’s contribution to the global objective from that of others, while adding DP as a regularizer further encourages coordination between agents by directing them to places with higher demand.
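The four reward settings compared above (AVE, IND, AVE + DP, IND + DP) can be contrasted with a short sketch. The exact reward formula is given in Section 2.1.1 and is not reproduced here, so the additive form `income + dp_ratio * destination_potential`, the function name, and the sample values below are assumptions for illustration only.

```python
def agent_rewards(incomes, dest_potentials, mode="IND", dp_ratio=0.0):
    """Per-agent rewards under the AVE or IND setting, optionally with DP.

    incomes: income each agent earned in this step.
    dest_potentials: destination potential of each agent's chosen order.
    dp_ratio: weight of the DP regularizer (0.0 disables it); assumed additive.
    """
    if len(incomes) != len(dest_potentials):
        raise ValueError("one destination potential per agent is required")
    if mode == "IND":
        base = list(incomes)                      # each agent sees its own income
    elif mode == "AVE":
        avg = sum(incomes) / len(incomes) if incomes else 0.0
        base = [avg] * len(incomes)               # every agent sees the same average
    else:
        raise ValueError("mode must be 'IND' or 'AVE'")
    return [b + dp_ratio * dp for b, dp in zip(base, dest_potentials)]


# Made-up step with three agents; the second agent served no order.
incomes = [10.0, 0.0, 20.0]
dps = [-1.0, 2.0, -3.0]
ind = agent_rewards(incomes, dps, mode="IND")
ave = agent_rewards(incomes, dps, mode="AVE")
ind_dp = agent_rewards(incomes, dps, mode="IND", dp_ratio=0.5)
ave_dp = agent_rewards(incomes, dps, mode="AVE", dp_ratio=0.5)
```

Under AVE the idle agent receives the same base reward as the others, which is one way to picture the credit assignment problem described above.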
3.2. Coordinatebased Experiment
3.2.1. Environment Setting
As the realworld environment is coordinatebased rather than gridbased, we also conduct experiments on a more complex coordinatebased simulator provided by DiDi Chuxing. At each time step, the coordinatebased simulator provides an observation including a set of active drivers and a set of available orders. Each order feature includes the coordinates of the origin and the destination, while each driver has its coordinate as the location feature. The order dispatching algorithm works the same as described in Section 3.1.1. To better approximate the realworld scenario, this simulator also considers order cancelations, i.e., an order might be canceled during the pickup process. This dynamic is controlled by a random variable which is positively related to the arriving time. The data source of this simulator is based on historical dispatching events, including order generation events, driver logging on/off events and order fee estimation. During the training stage, the simulator will load five weekdays of data with dispatching events and generate drivers. For evaluation, our model is applied on future days which are not used in the training phase.
3.2.2. Model Setting
We evaluate the following MARL based methods: IOD, COD, and a DQN variation of IOD (QIOD), i.e., IOD without the policy network. We also compare these MARL methods with a centralized combinatorial optimization method based on the Hungarian algorithm (HOD). The HOD method focuses on minimizing the average arriving time (AAT) by setting the weight of each driverorder pair to the pickup distance. For all MARL based methods, we apply the same network architecture setting as described in Section 3.1.2, except that we use a minibatch size of 200 because of the shorter simulation gap. The regularization ratio for pickup distance is in this experiment.
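To make the HOD objective concrete, the sketch below finds the driver-to-order matching with the smallest total pickup distance on a tiny instance. It is a brute-force stand-in that enumerates all matchings, not the Hungarian algorithm itself; the Hungarian algorithm returns the same optimum in polynomial time. The function name and the distance matrix are hypothetical.

```python
from itertools import permutations


def min_pickup_assignment(pickup_dist):
    """Exhaustively find the matching with minimum total pickup distance.

    pickup_dist[d][o] is the pickup distance from driver d to order o.
    Returns (orders, cost), where orders[d] is the order given to driver d.
    Runtime is factorial, so this is only usable on very small instances.
    """
    n_drivers = len(pickup_dist)
    n_orders = len(pickup_dist[0]) if pickup_dist else 0
    if n_drivers > n_orders:
        raise ValueError("this sketch assumes at least as many orders as drivers")
    best_cost = float("inf")
    best_match = ()
    for chosen in permutations(range(n_orders), n_drivers):
        cost = sum(pickup_dist[d][o] for d, o in enumerate(chosen))
        if cost < best_cost:
            best_cost, best_match = cost, chosen
    return list(best_match), best_cost


# Made-up distances for three drivers (rows) and three orders (columns).
dist = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
assignment, total = min_pickup_assignment(dist)
```

Note that the greedy choice for the second driver (order 1 at distance 0) is not part of the optimum here, which is the kind of global effect a centralized matcher accounts for. At realistic sizes a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the natural replacement.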
3.2.3. Result Analysis
We train all MARL methods for 400K iterations and apply the trained models to a test set (consisting of three weekdays) for comparison. We compare different algorithms in terms of the total income in a day (GMV) and the average arriving time (AAT). GMV2 considers order cancellation while GMV1 does not. All the above metrics are normalized with respect to the result of HOD.
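One plausible reading of "normalized with respect to the result of HOD" is a signed percentage change relative to the HOD value, which would match the signed percentages used for the table entries. The sketch below assumes that reading; the helper names and the input numbers are hypothetical and are not results from the paper.

```python
def relative_to_baseline(value, baseline):
    """Signed percentage change of value relative to a non-zero baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline


def format_pct(pct):
    """Render with an explicit sign and two decimals, e.g. '+0.32%'."""
    return f"{pct:+.2f}%"


# Made-up raw metric values for a method and for the HOD baseline.
gain = format_pct(relative_to_baseline(1003.2, 1000.0))
loss = format_pct(relative_to_baseline(990.0, 1000.0))
```

Under this reading, a positive entry for GMV is an improvement over HOD, whereas a positive entry for AAT means a longer arriving time.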
Method  GMV1  GMV2  AAT 

QIOD  
IOD  
COD  +0.32%  +0.06% 
As shown in Table 5, COD largely outperforms QIOD and IOD in both GMV1 and GMV2, showing the effectiveness of directly modeling other agents’ policies in MARL. In addition, COD outperforms HOD in both GMV settings as well; this supports the advantage of MARL algorithms that exploit the interaction between agents and the environment to maximize the cumulative reward. The performance improvement in GMV2 is smaller than that in GMV1 for MARL methods. This is possibly because HOD works by minimizing the global pickup distance and therefore has a shorter waiting time, so fewer of its orders are likely to be canceled during pickup. On the other hand, MARL methods only consider the pickup distance as a regularization term, thus performing comparatively worse than HOD regarding AAT. As shown in Table 5, the AAT of all MARL methods is relatively longer than that of the combinatorial optimization method. However, as the absolute values of GMV are orders of magnitude higher than AAT, the increase in AAT is relatively minor and is thus tolerable in the order dispatching task. Also, MARL methods require no centralized control during execution, thus making the order dispatching system more robust to potential hardware or connectivity failures.
4. Related Work
Order Dispatching
Several previous works addressed the order dispatching problem by either centralized or decentralized rulebased approaches. Lee et al. (2004) and Lee et al. (2007) chose the pickup distance (or time) as the basic criterion, and focused on finding the nearest option from a set of homogeneous drivers for each order on a firstcome, firstserved basis. These approaches only focus on the individual order pickup distance; they do not account for the possibility that other orders in the waiting queue are more suitable for this driver. To improve global performance, Zhang et al. (2017) proposed a novel model based on centralized combinatorial optimization that concurrently matches multiple driverorder pairs within a short time window. They considered each driver as heterogeneous by taking the longterm behavior history and shortterm interests into account. The above methods work with centralized control, which is prone to the potential “single point of failure” (Lynch, 2009).
With the decentralized setting, Seow et al. (2010) addressed the problem by grouping neighboring drivers and orders in a small multiagent environment, and then simultaneously assigning orders to drivers within the group. Drivers in a group are considered as agents who conduct negotiations by several rounds of collaborative reasoning to decide whether to exchange current order assignments or not. This approach requires rounds of direct communications between agents, thus being limited to a local area with a small number of agents. Alshamsi and Abdallah (2009) proposed an adaptive approach for the multiagent scheduling system to enable negotiations between agents (drivers) to reschedule allocated orders. They used a cycling transfer algorithm to evaluate each driverorder pair with multiple criteria, requiring a sophisticated design of feature selection and weighting scheme.
Different from rulebased approaches, which require additional handcrafted heuristics, we use a modelfree RL agent to learn an optimal policy given the rewards and observations provided by the environment. A very recent work by Xu et al. (2018) proposed an RLbased dispatching algorithm to optimize resource utilization and user experience in a global and more farsighted view. However, they formulated the problem with the singleagent setting, which is unable to model the complex interactions between drivers and orders. In contrast, our multiagent setting follows the distributed nature of the peertopeer ridesharing problem, providing the dispatching system with the ability to capture the stochastic demandsupply dynamics in largescale ridesharing scenarios. During the execution stage, agents will behave under the learned policy independently, thus being more robust to potential hardware or connectivity failures.
MultiAgent Reinforcement Learning
One of the most straightforward approaches to adapting reinforcement learning to the multiagent environment is to make each agent learn independently regardless of the other agents, as in independent learning (Tan, 1993). Such approaches, however, tend to fail in practice (Matignon et al., 2012) because of the nonstationary nature of the multiagent environment. Several approaches have been attempted to address this problem, including sharing the policy parameters (Gupta et al., 2017), training the Q-function with other agents’ policy parameters (Tesauro, 2004), or using importance sampling to learn from data gathered in a different environment (Foerster et al., 2017b). The idea of centralized training with decentralized execution has recently been investigated by several works (Foerster et al., 2017a; Lowe et al., 2017; Peng et al., 2017) for MARL using policy gradients (Sutton et al., 1999) and deep neural network function approximators, based on the actorcritic framework (Konda and Tsitsiklis, 2000). Agents within this paradigm learn a centralized Q-function augmented with the actions of other agents as the critic during the training stage, and use the learned policy (the actor) with local observations to guide their behaviors during execution. Most of these approaches limit their work to a small number of agents, usually fewer than ten. To address the problem of the increasing input space and accumulated exploratory noise of other agents in largescale MARL, Yang et al. (2018) proposed a novel method that integrates MARL with mean field approximations and proved its convergence in a tabular Q-function setting. In this work, we further develop MFRL and prove its convergence when the Q-function is represented by function approximators.
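The mean field idea referred to above replaces the joint action of all neighbors with a single mean action, computed in Yang et al. (2018) as the average of the neighbors' one-hot encoded actions. The sketch below shows that computation and how it keeps the critic input size fixed. Treating actions as discrete indices and the layout of `critic_input` are assumptions for illustration; the encoding used for order choices in this paper is not given in this section.

```python
def mean_action(neighbor_actions, n_actions):
    """Average of the neighbors' one-hot encoded actions.

    neighbor_actions: discrete action indices chosen by neighboring agents.
    Returns a length-n_actions vector of empirical action frequencies
    (all zeros if there are no neighbors).
    """
    counts = [0] * n_actions
    for action in neighbor_actions:
        if not 0 <= action < n_actions:
            raise ValueError("action index out of range")
        counts[action] += 1
    if not neighbor_actions:
        return [0.0] * n_actions
    total = len(neighbor_actions)
    return [c / total for c in counts]


def critic_input(state, own_action, neighbor_actions, n_actions):
    """Concatenate state, own one-hot action and the neighbors' mean action."""
    own = [1.0 if i == own_action else 0.0 for i in range(n_actions)]
    return list(state) + own + mean_action(neighbor_actions, n_actions)


few = critic_input([0.1, 0.2], 1, [0, 2, 2, 1], n_actions=3)
many = critic_input([0.1, 0.2], 1, [0, 2, 2, 1] * 250, n_actions=3)
```

The input has the same length for four neighbors as for a thousand, which is what lets the approach scale to large agent populations.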
5. Conclusion
In this paper, we proposed a multiagent reinforcement learning solution to the order dispatching problem. Results on two largescale simulation environments have shown that our proposed algorithms (COD and IOD) achieved (1) a higher GMV and ORR than three rulebased methods (RAN, RES, REV); (2) a higher GMV than the combinatorial optimization method (HOD), with the desirable property of fully distributed execution; and (3) a lower supplydemand gap during the rush hours, which indicates the ability to reduce traffic congestion. We also provide a convergence proof for applying mean field theory to MARL with function approximation as the theoretical justification of our proposed algorithms. Furthermore, our MARL approaches could achieve fully decentralized execution by distributing the centrally trained policy to each vehicle through VehicletoNetwork (V2N). For future work, we are working towards controlling AAT while maximizing the GMV with the proposed MARL framework. Another interesting and practical direction is to use a heterogeneous agent setting with individual specific features, such as the personal preference and the distance from its own destination.
References
 Abboud et al. (2016) Khadige Abboud, Hassan Aboubakr Omar, and Weihua Zhuang. 2016. Interworking of DSRC and cellular network technologies for V2X communications: A survey. IEEE transactions on vehicular technology 65, 12 (2016), 9457–9470.
 Agogino and Tumer (2008) Adrian K. Agogino and Kagan Tumer. 2008. Analyzing and Visualizing Multiagent Rewards in Dynamic and Stochastic Environments. Journal of Autonomous Agents and Multiagent Systems (2008), 320–338.
 Alshamsi and Abdallah (2009) Aamena Alshamsi and Sherief Abdallah. 2009. Multiagent selforganization for a taxi dispatch system. In Proceedings of 8th International Conference of Autonomous Agents and Multiagent Systems, 2009. 89–96.
 Amey et al. (2011) Andrew Amey, John Attanucci, and Rabi Mishalani. 2011. Realtime ridesharing: opportunities and challenges in using mobile phone technology to improve rideshare services. Transportation Research Record: Journal of the Transportation Research Board 2217 (2011), 103–110.
 Atzori et al. (2010) Luigi Atzori, Antonio Iera, and Giacomo Morabito. 2010. The Internet of Things: A survey. Computer Networks 54, 15 (2010), 2787 – 2805. https://doi.org/10.1016/j.comnet.2010.05.010
 Bertsekas (2012) Dimitri P Bertsekas. 2012. Weighted supnorm contractions in dynamic programming: A review and some new applications. Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. LIDSP2884 (2012).
 Buşoniu et al. (2010) Lucian Buşoniu, Robert Babuška, and Bart De Schutter. 2010. Multiagent Reinforcement Learning: An Overview. Springer Berlin Heidelberg, Berlin, Heidelberg, 183–221. https://doi.org/10.1007/978-3-642-14435-6_7
 Foerster et al. (2017a) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2017a. Counterfactual MultiAgent Policy Gradients. arXiv preprint arXiv:1705.08926 (2017).

 Foerster et al. (2017b) Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. 2017b. Stabilising Experience Replay for Deep MultiAgent Reinforcement Learning. In International Conference on Machine Learning. 1146–1155.
 Furuhata et al. (2013) Masabumi Furuhata, Maged Dessouky, Fernando Ordóñez, MarcEtienne Brunet, Xiaoqing Wang, and Sven Koenig. 2013. Ridesharing: The stateoftheart and future directions. Transportation Research Part B: Methodological 57 (2013), 28–46.
 Gubbi et al. (2013) Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. 2013. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems 29, 7 (2013), 1645 – 1660. https://doi.org/10.1016/j.future.2013.01.010 Including Special sections: Cyberenabled Distributed Computing for Ubiquitous Cloud and Network Services & Cloud Computing and Scientific Applications — Big Data, Scalable Analytics, and Beyond.
 Gupta et al. (2017) Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative multiagent control using deep reinforcement learning. In AAMAS. Springer, 66–83.
 Jaakkola et al. (1994) Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. 1994. Convergence of stochastic iterative dynamic programming algorithms. In NIPS. 703–710.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Konda and Tsitsiklis (2000) Vijay R. Konda and John N. Tsitsiklis. 2000. ActorCritic Algorithms. In Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller (Eds.). MIT Press, 1008–1014. http://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf
 Lee et al. (2004) DerHorng Lee, Hao Wang, Ruey Cheu, and Siew Teo. 2004. Taxi dispatch system based on current demands and realtime traffic conditions. Transportation Research Record: Journal of the Transportation Research Board 1882 (2004), 193–200.
 Lee et al. (2007) Junghoon Lee, GyungLeen Park, Hanil Kim, YoungKyu Yang, Pankoo Kim, and SangWook Kim. 2007. A Telematics Service System Based on the Linux Cluster. In Computational Science – ICCS 2007, Yong Shi, Geert Dick van Albada, Jack Dongarra, and Peter M. A. Sloot (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 660–667.
 Li et al. (2016) Ziru Li, Yili Hong, and Zhongju Zhang. 2016. An empirical analysis of ondemand ride sharing and traffic congestion. In 2016 International Conference on Information Systems, ICIS 2016. Association for Information Systems.
 Lin et al. (2018) Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. 2018. Efficient LargeScale Fleet Management via MultiAgent Deep Reinforcement Learning. arXiv preprint arXiv:1802.06444 (2018).
 Littman (1994) Michael L. Littman. 1994. Markov Games As a Framework for Multiagent Reinforcement Learning. In Proceedings of the Eleventh International Conference on International Conference on Machine Learning (ICML’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 157–163. http://dl.acm.org/citation.cfm?id=3091574.3091594
 Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. 2017. Multiagent actorcritic for mixed cooperativecompetitive environments. In NIPS. 6382–6393.
 Lu et al. (2014) N. Lu, N. Cheng, N. Zhang, X. Shen, and J. W. Mark. 2014. Connected Vehicles: Solutions and Challenges. IEEE Internet of Things Journal 1, 4 (Aug 2014), 289–299. https://doi.org/10.1109/JIOT.2014.2327587
 Lynch (2009) Gary S Lynch. 2009. Single point of failure: The 10 essential laws of supply chain risk management. John Wiley & Sons.

 Matignon et al. (2012) Laetitia Matignon, Guillaume J Laurent, and Nadine Le FortPiat. 2012. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27, 1 (2012), 1–31.
 Melo et al. (2008) Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. 2008. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning. ACM, 664–671.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
 Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10). 807–814.
 Papadimitriou and Steiglitz (1982) Christos H. Papadimitriou and Kenneth Steiglitz. 1982. Combinatorial Optimization: Algorithms and Complexity. PrenticeHall, Inc., Upper Saddle River, NJ, USA.
 Peng et al. (2017) Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. 2017. Multiagent BidirectionallyCoordinated Nets for Learning to Play StarCraft Combat Games. arXiv preprint arXiv:1703.10069 (2017).
 Seow et al. (2010) Kiam Tian Seow, Nam Hai Dang, and DerHorng Lee. 2010. A collaborative multiagent taxidispatch system. IEEE Transactions on Automation Science and Engineering 7, 3 (2010), 607–616.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In ICML. 387–395.
 Sutton (1988) Richard S Sutton. 1988. Learning to predict by the methods of temporal differences. Machine learning 3, 1 (1988), 9–44.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.
 Sutton et al. (1999) Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. 1999. Policy gradient methods for reinforcement learning with function approximation.. In NIPS, Vol. 99. 1057–1063.
 Szepesvári and Littman (1999) Csaba Szepesvári and Michael L Littman. 1999. A unified analysis of valuefunctionbased reinforcementlearning algorithms. Neural computation 11, 8 (1999), 2017–2060.
 Tan (1993) Ming Tan. 1993. Multiagent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning. 330–337.
 Tesauro (2004) Gerald Tesauro. 2004. Extending Qlearning to general adaptive multiagent systems. In NIPS. 871–878.
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Qlearning. Machine learning 8, 34 (1992), 279–292.
 Wooldridge (2009) Michael Wooldridge. 2009. An Introduction to MultiAgent Systems (2nd ed.). Wiley Publishing.
 Xu et al. (2018) Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan, Chunyang Liu, Wei Bian, and Jieping Ye. 2018. LargeScale Order Dispatch in OnDemand RideHailing Platforms: A Learning and Planning Approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, USA, 905–913. https://doi.org/10.1145/3219819.3219824
 Yang et al. (2014) F. Yang, S. Wang, J. Li, Z. Liu, and Q. Sun. 2014. An overview of Internet of Vehicles. China Communications 11, 10 (Oct 2014), 1–15. https://doi.org/10.1109/CC.2014.6969789
 Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field MultiAgent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 5567–5576.
 Zanella et al. (2014) A. Zanella, N. Bui, A. Castellani, L. Vangelista, and M. Zorzi. 2014. Internet of Things for Smart Cities. IEEE Internet of Things Journal 1, 1 (Feb 2014), 22–32. https://doi.org/10.1109/JIOT.2014.2306328
 Zhang et al. (2017) Lingyu Zhang, Tao Hu, Yue Min, Guobin Wu, Junying Zhang, Pengcheng Feng, Pinghua Gong, and Jieping Ye. 2017. A Taxi Order Dispatch Model Based On Combinatorial Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 2151–2159. https://doi.org/10.1145/3097983.3098138