CoRide: Joint Order Dispatching and Fleet Management for Multi-Scale Ride-Hailing Platforms

by   Jiarui Jin, et al.

How to optimally dispatch orders to vehicles and how to trade off between immediate and future returns are fundamental questions for a typical ride-hailing platform. We model ride-hailing as a large-scale parallel ranking problem and study the joint decision-making task of order dispatching and fleet management in online ride-hailing platforms. This task brings unique challenges in four aspects. First, to enable a huge number of vehicles to act and learn efficiently and robustly, we treat each region cell as an agent and build a multi-agent reinforcement learning framework. Second, to coordinate the agents to achieve long-term benefits, we leverage the geographical hierarchy of the region grids to perform hierarchical reinforcement learning. Third, to deal with the heterogeneous and variant action space for joint order dispatching and fleet management, we design the action as the ranking weight vector to rank and select the specific order or the fleet management destination in a unified formulation. Fourth, to support the multi-scale ride-hailing platform, we conduct the decision-making process in a hierarchical way, where a multi-head attention mechanism is utilized to incorporate the impacts of neighbor agents and capture the key agent in each scale. The whole framework is named CoRide. Extensive experiments based on real-world data from multiple cities as well as analytic synthetic data demonstrate that CoRide provides superior performance in terms of platform revenue and user experience in the task of city-wide hybrid order dispatching and fleet management over strong baselines. This work provides not only a solution for current online ride-hailing platforms, but also an advanced artificial intelligence technique for the future, especially when large-scale unmanned ground vehicles go into service.




1. Introduction

Online ride-hailing platforms such as Uber and Didi Chuxing have substantially transformed our lives by sharing and reallocating transportation resources to greatly improve transportation efficiency. The emergence of Unmanned Ground Vehicles (UGV) will bring not only more convenience, with a large number of intelligent automatic vehicles, but also a more challenging task for building intelligent transportation. In a general view, there are two major decision-making tasks for such ride-hailing platforms, namely (i) order dispatching, i.e., to match the orders and vehicles in real time to directly deliver the service to the users (Zou et al., 2013; Zhang et al., 2017; Seow et al., 2010), and (ii) fleet management, i.e., to reposition the vehicles to certain areas in advance to prepare for the later order dispatching (Lin et al., 2018; Oda and Tachibana, 2018; Simao et al., 2009).

Apparently, deciding to match an order-vehicle pair or to reposition a vehicle to an area requires accounting for the future situation of the vehicle's location and the orders nearby. Thus much work has modeled order dispatching and fleet management as a sequential decision-making problem and solved it with reinforcement learning (RL) (Wang et al., 2018b; Xu et al., 2018; Lin et al., 2018; Tang and Qin, 2018). Most previous work deals with either order dispatching or fleet management without considering the high correlation between these two tasks, especially for large-scale ride-hailing platforms in large cities, which leads to sub-optimal performance. In order to achieve near-optimal performance, inspired by thermodynamics, we view the whole ride-hailing platform as a system driven by dispatch (order dispatching) and reposition (fleet management). As illustrated in Figure 1, we liken vehicles and orders to different molecules and aim to stabilize the system by reducing their numbers through dispatch and reposition. To address this complex criterion, in addition to interconnecting order dispatching and fleet management, jointly considering intra-district (grid-level) and inter-district (district-level) allocation would help substantially, especially during peak hours. With such a practical motivation, we focus on modeling joint order dispatching and fleet management with a multi-scale decision-making system. There are several significant technical challenges in learning a highly efficient allocation policy for the real-time ride-hailing platform:

Large-scale Agents. A fundamental question in any ride-hailing system is how to deal with a large number of orders and vehicles. One alternative is to model each available vehicle as an agent (Oda and Tachibana, 2018; Xu et al., 2018; Wei et al., 2018). However, such a setting needs to maintain thousands of agents interacting with the environment, which brings a huge computational cost. Instead, we utilize the region grid world (as will be further discussed in Figure 2), which regards each region as an agent, and naturally model the ride-hailing system in a hierarchical learning setting. This formulation allows decentralized learning and control with distributed implementation.

Immediate & Future Rewards. A key challenge in seeking an optimal control policy is to find a trade-off between immediate and future rewards in terms of accumulated driver income (ADI). Greedily matching vehicles with long-distance orders can yield a high immediate gain at a single order dispatching stage, but it would harm the order response rate (ORR) and future revenue, especially during rush hour, because of the long drive time and unpopular destination. When we consider order dispatching and fleet management jointly, choosing between serving an order headed to an area with few orders and repositioning the driver to a popular area with zero pay also requires a trade-off between immediate and future gains. Recent attempts (Xu et al., 2018; Wei et al., 2018; Oda and Tachibana, 2018) deployed RL to combine the instant order reward from online planning with the future state-value as the final matching value. However, the coordination between different regions is still far from optimal. Inspired by hierarchical RL (Vezhnevets et al., 2017), we introduce the geographical hierarchical structure of region agents. We treat a large district as a manager agent and a small grid as a worker agent. The manager operates at a lower spatial and temporal resolution and sets abstract goals which are conveyed to its workers. The worker takes specific actions and interacts with the environment, coordinated by the manager-level goal and worker-level messages. This decoupled structure facilitates very long timescale credit assignment (Vezhnevets et al., 2017) and guarantees balance between immediate and future revenue.

Figure 1. Ride-hailing task in thermodynamics view.

Heterogeneous & Variant Action Space. Traditional RL models require a fixed action space (Mnih et al., 2013). If we model picking an order as an RL action, there is no guarantee of a fixed action space as the available orders keep changing. Zhang et al. (2017) proposed to learn a state-action value function to evaluate each valid order-vehicle match, then use a combinatorial optimization method such as the Kuhn-Munkres (KM) algorithm (Munkres, 1957) to filter the matches. However, such a method faces another important challenge that order dispatching and fleet management are different tasks, which results in heterogeneous action spaces. To address this issue, we redefine the action as the weight vector for ranking orders and fleet management, where the fleet controls are regarded as fake orders, and all the orders are ranked and matched with vehicles in each agent. Thus it bypasses the issue of heterogeneous and variant action space as well as high computational costs.

Multi-Scale Ride-Hailing. Xu et al. (2018) introduced a policy-evaluation-based RL method to learn the dynamics for each grid. As its results show, orders and vehicles often concentrate in different districts (e.g. uptown and downtown in Figure 1). How to combine large hotspots in the city (inter-district) with small ones in districts (intra-district) is another challenge that has received little attention so far. In order to take both inter-district and intra-district allocation into consideration, we adopt and extend the attention mechanism in a hierarchical way (as will be further discussed in Figure 3). Compared with learning a value function for each grid homogeneously (Xu et al., 2018), this attention-based structure can not only capture the impacts of neighbor agents, but also learn to distinguish the key grid and district at the worker (grid) and manager (district) scales respectively.

Wrapping all modules together, we propose CoRide, a hierarchical multi-agent reinforcement learning framework to resolve the aforementioned challenges. The main contributions are listed as follows:


  • We propose CoRide, a model that learns to collaborate in a hierarchical multi-agent setting for ride-hailing platforms.

  • We conduct extensive experiments based on real-world data from multiple cities, as well as analytic synthetic data, which demonstrate that CoRide provides superior performance in terms of ADI and ORR in the task of city-wide hybrid order dispatching and fleet management over strong baselines.

  • To the best of our knowledge, CoRide is the first work (i) to apply hierarchical reinforcement learning to ride-hailing platforms; (ii) to address the task of joint order dispatching and fleet management of online ride-hailing platforms; (iii) to introduce and study the multi-scale ride-hailing task.

To sum up, our model employs a large district as the manager module and its small sub-grids as the worker modules. The manager operates at a lower temporal resolution, sets abstract goals for its workers, and collaborates with other managers. The worker generates primitive actions and sends messages to its peers for cooperation. This structure conveys several benefits: (i) In addition to balancing long-term and short-term reward, it also facilitates adaptation to a dynamic real-world situation by assigning different goals to workers. (ii) Instead of considering all of the matches between available orders and vehicles globally, these tasks are distributed to each worker and manager agent and fulfilled in parallel.

2. Related Work

Decision-making for Ride-hailing. Order dispatching and fleet management are two major decision-making tasks for online ride-hailing platforms, which have acquired much attention from academia and industry during the recent few years.

To tackle these challenging transportation problems, rule-based approaches addressed the order dispatching problem with either centralized or decentralized settings. Lee et al. (2004) and Lee et al. (2007) focused on reducing the pick-up distance or waiting time by choosing the nearest orders or following the first-come, first-served principle. These approaches usually cannot reach a high success rate because they ignore many potential orders in the waiting list. To improve global performance, Zhang et al. (2017) proposed a novel model based on centralized combinatorial optimization that concurrently matches multiple vehicle-order pairs within a short time window. However, this approach needs to compute all available vehicle-order matches and requires feature engineering, which makes it infeasible to adopt in large-scale taxi-order dispatching. With the decentralized setting, Seow et al. (2010) addressed this problem with a collaborative multi-agent taxi dispatching system. However, this method requires rounds of direct communications between agents, so it is limited to a local area with a small number of vehicles.

Instead of rule-based approaches, which require additional handcrafted heuristics, the current trending direction is to incorporate reinforcement learning algorithms in complicated traffic management problems.

Xu et al. (2018) proposed a learning and planning method based on reinforcement learning to optimize resource utilization and user experience in a global and more farsighted view. In (Oda and Tachibana, 2018), the authors leveraged the graph structure of the road network and expanded distributed DQN formulation to maximize entropy in the agents’ learning policy with soft Q-learning, to improve performance of fleet management. Wei et al. (2018) introduced a reinforcement learning method, which takes the uncertainty of future requests into account and can make a look-ahead decision to help the operator improve the global level-of-service of a shared-vehicle system through fleet management. To capture the complicated stochastic demand-supply variations in high-dimensional space, Lin et al. (2018) proposed a contextual multi-agent actor-critic framework to achieve explicit coordination among a large number of agents adaptive to different contexts in fleet management system.

Different from all aforementioned methods, our approach is the first, to the best of our knowledge, to consider the joint modeling of order dispatching and fleet management and also the only current work introducing and studying the multi-scale ride-hailing task.

Hierarchical Reinforcement Learning. Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve tasks with long-term dependency or multi-level interaction patterns (Dayan and Hinton, 1993; Dietterich, 2000). Recent works have suggested that several interesting and standout results can be induced by training multi-level hierarchical policy in a multi-task setup (Frans et al., 2017; Sigaud and Stulp, 2018) or implementing hierarchical setting in sparse reward problems (Riedmiller et al., 2018; Vezhnevets et al., 2017).

The options framework (Stolle and Precup, 2002; Precup, 2000; Sutton et al., 1999) formulates the problem with a two-level hierarchy, where the low level, an option, is a sub-policy with a termination condition. Since the traditional options framework relies on prior knowledge for designing options, jointly learning the high-level policy with the low-level policy has been proposed by Bacon et al. (2017). However, this actor-critic HRL approach needs to learn either sub-policies for each time step or one sub-policy for the whole episode. Therefore, the performance of the whole module often depends on learning useful sub-policies. To guarantee effective sub-policies, one alternative direction is to provide auxiliary rewards for the low-level policy: hand-designed rewards based on prior domain knowledge (Kulkarni et al., 2016; Tessler et al., 2017) or mutual information (Florensa et al., 2017; Daniel et al., 2012; Kong et al., [n. d.]). Given that access to one well-designed and suitable reward is often a luxury, Vezhnevets et al. (2017) proposed FeUdal Networks (FuN), where a generic reward is utilized for low-level policy learning, thus avoiding hand-crafted rewards. Several works extend and improve FuN with off-policy training (Nachum et al., 2018a), a form of hindsight (Levy et al., 2018), and representation learning (Nachum et al., 2018b).

Our work is also developed from FuN (Vezhnevets et al., 2017), originally inspired by feudal RL (Dayan and Hinton, 1993). FuN employs only one pair of manager and worker and connects them with a parameterized goal and intrinsic reward. Instead, we model CoRide with multiple managers. Unlike our method, in FuN the manager and worker modules are set one-to-one, share the same observation, and operate at different temporal but the same spatial resolution. In CoRide, there are multiple workers learning to collaborate under one manager while the managers are also coordinating. The manager takes a joint observation of all its workers, and each worker produces its action based on its specific observation and the shared goal. Building on this one-to-many setting, the manager can not only operate with long timescale credit but also act at a lower spatial resolution. Recently, Ahilan and Dayan (2019) introduced a novel architecture named FMH for cooperation in multi-agent RL. Different from that method, CoRide not only extends the scale of the multi-agent environment but also facilitates communication through a multi-head attention mechanism, which computes the influences of interactions and differentiates the impacts on each agent. In other words, FuN (Vezhnevets et al., 2017) is actually a special case of CoRide, where a single manager along with its only worker is employed. Yet, the majority of current HRL methods require careful task-specific design, making them difficult to apply in real-world scenarios (Nachum et al., 2018a). To the best of our knowledge, CoRide is the first work to apply hierarchical reinforcement learning to the ride-hailing problem.

3. Problem Formulation

We formulate the problem of controlling large-scale homogeneous vehicles in online ride-hailing platforms, which combines the order dispatching system with the fleet management system with the goal of maximizing the city-level ADI and ORR. In practice, vehicles are divided into two groups: the order dispatching (OD) group and the fleet management (FM) group. For the OD group, we match these vehicles with available orders pairwise; whereas for the FM group, we need to reposition them to other locations or dispatch orders to them (same as the OD group). An illustration of the problem is shown in Figure 2. We use the hexagonal-grid world to represent the map and take a grid as an agent. Considering that only orders within the pick-up distance can be dispatched to vehicles, we set the distance between grids based on the pick-up distance. In our setting, vehicles in the same spatial-temporal node are homogeneous, i.e., the vehicles located in the same grid share the same setting. As such, we can model order dispatching as a large-scale parallel ranking task, where we rank orders and match them with homogeneous vehicles in each grid. The fleet control for fleet management, i.e. repositioning the vehicle to neighbor grids or staying at the current grid, is treated as fake orders (as will be further discussed in Section 6) and conducted in the same ranking procedure as order dispatching.

Since each agent can only reposition vehicles located in the grid it manages, we propose to formulate the problem using a Partially Observable Markov Decision Process (POMDP) (Spaan, 2012) in a hierarchical multi-agent reinforcement learning setting for both order dispatching and fleet management. Thus we decompose the original complicated tasks into many local ones and transform a high-dimensional problem into multiple low-dimensional problems.

Formally, we model this task as a Markov game $G$ for $N$ agents, which is defined by a tuple $G = (N, S, A, P, R, \gamma)$, where $N$, $S$, $A$, $P$, $R$, $\gamma$ are the number of agents, set of states, set of actions, state transition probability, reward function, and a future reward discount factor, respectively. The definitions of major components are as follows.

Agent. We consider each region cell as an agent identified by $i \in I$, where $I = \{i \mid i = 1, \dots, N\}$. In detail, a single grid represents a worker agent, and a district containing multiple grids represents a manager agent. An example is presented in Figure 2: each individual grid serves as a worker agent, and a grid together with its 6 neighbor grids, shown in a distinct color, composes a manager agent. Note that although the number of vehicles and orders varies over time, the number of agents is fixed.
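The grouping of worker grids under a manager can be sketched as follows. This is a minimal illustration that assumes axial coordinates for the hexagonal grid; the coordinate system and the names `hex_neighbors`, `hex_distance`, and `district_cells` are hypothetical and not taken from the paper.

```python
# Axial-coordinate offsets of the six neighbors of a hexagonal cell
# (the coordinate convention is an assumption made for this sketch).
HEX_NEIGHBOR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_neighbors(cell):
    """Return the six cells adjacent to `cell`, given as (q, r)."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in HEX_NEIGHBOR_OFFSETS]


def hex_distance(a, b):
    """Number of grid steps between two hexagonal cells in axial coordinates."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def district_cells(center):
    """Worker cells grouped under one manager: a center grid plus its six neighbors."""
    return [center] + hex_neighbors(center)
```

Under this assumption each manager covers seven worker grids, and every worker other than the center lies exactly one step from the center.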

State $s_t$, Observation $o_t$. Although there are two different types of agents, manager and worker, their observations differ only in scale. The observation of each manager is a joint observation of its workers. At each timestep $t$, agent $i$ draws private observations $o_t^i \in O$ correlated with the environment state $s_t \in S$. In our setting, the state input used in our method is expressed as $S = \langle N_v, N_o, E, N_f, D_o \rangle$, where the elements represent the number of vehicles, number of orders, entropy, number of vehicles in the FM group, and distribution of order features in the current grid (e.g. price, duration), respectively. Note that both dispatching and repositioning belong to resource allocation, similar to the thermodynamic system (Figure 1), and once order dispatching or fleet management occurs, dispatched or fleeted items slip out of the system. Namely, only idle vehicles and available orders can contribute to the disorder and unevenness of the ride-hailing system. Therefore, we introduce and extend the concept of entropy here, defined as:

$$E = -k_B \sum_{i} \rho_i \log \rho_i, \tag{1}$$

where $k_B$ is the Boltzmann constant, and $\rho_i$ is the probability of each state: $\rho_1$ for dispatched and fleeted, $\rho_0$ elsewhere. We choose to ignore items at state 1 ($\rho_1$), according to the aforementioned analysis, and give the formula of $\rho_0$ as follows:

$$\rho_0 = \frac{N_v \times N_v}{N_v \times N_o} = \frac{N_v}{N_o}, \tag{2}$$

which holds for the situation $N_v < N_o$ and is straightforward to transform to other situations.
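As a rough illustration of the entropy feature, the sketch below computes a per-grid value in which only the idle state contributes, with the idle-state probability taken as the ratio of the scarcer resource to the more plentiful one. The function name `grid_entropy`, the choice of a unit Boltzmann constant, and the symmetric handling of grids with more vehicles than orders are assumptions made for illustration.

```python
import math


def grid_entropy(num_vehicles: int, num_orders: int, k_b: float = 1.0) -> float:
    """Entropy of one grid, counting only the idle state (state 0).

    Assumption: rho_0 = min(N_v, N_o) / max(N_v, N_o), which covers both the
    case with fewer vehicles than orders and its mirror image. k_b = 1.0 is an
    illustrative scaling, not a value from the paper.
    """
    if num_vehicles <= 0 or num_orders <= 0:
        return 0.0  # nothing left to match, so no disorder is recorded
    rho_0 = min(num_vehicles, num_orders) / max(num_vehicles, num_orders)
    return -k_b * rho_0 * math.log(rho_0)
```

With this definition a perfectly balanced grid (equal numbers of idle vehicles and open orders) has zero entropy, and the value grows as supply and demand drift apart before shrinking again when one side nearly vanishes.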

Action $a_t$, State Transition $P$. In our hierarchical RL setting, the manager's action is to generate an abstract, intrinsic goal for its workers, and each worker needs to provide a ranking list of relevant real orders (OD) and fleet controls served as fake orders (FM). Thus, the action of a worker is defined as the weight vector for the ranking features. Changing the action of a worker means changing the weight vector for the ranking features (as will be further discussed in Section 6). At each timestep the whole multi-agent system produces a joint action $a_t \in A$ for the managers and workers, where $A = A_1 \times \dots \times A_N$, which induces a transition in the environment according to the state transition $P(s_{t+1} \mid s_t, a_t)$.

Reward $R$. Like previous hierarchical RL settings (Vezhnevets et al., 2017), only the manager interacts with the environment and receives feedback from it. This extrinsic reward function determines the direction of optimization and is proportional to both immediate profit and potential value, while the intrinsic reward is set to encourage the worker to follow the instruction from the manager.

Figure 2. Illustration of the grid world and problem setting.

More specifically, we give a simple example based on the above problem setting in Figure 2. At time $t$, worker agent 0 ranks the available real orders and potential fake orders for fleet control, and selects the Selected-2 options (as will be further discussed in Eq. (9)): a real order from grid 0 to grid 17 and a fake order from grid 0 to grid 5. After the drivers finish, the manager agent whose sub-workers include worker agent 0 receives the corresponding reward.

Figure 3. Overall architecture of CoRide.

4. Methodologies

4.1. Overall Architecture

As shown in Figure 3, CoRide employs two layers of agents, namely the layer of manager agents and the layer of worker agents. Each agent is associated with a communication component for exchanging messages. As both the agents and the decision-making process are organized hierarchically, the multi-head attention mechanism used for communication is extended to a multi-layer setting.

Compared with traditional one-to-one manager-worker control in hierarchical RL (Vezhnevets et al., 2017), we assign multiple workers to a single manager, and also learn to collaborate on the two layers of agents. The manager internally computes a latent state representation $h^M_t$ as an input to the manager-level attention network, and outputs a goal vector $g_t$. The worker produces its action and the input for worker-level attention conditioned on its private observation $o^W_t$, peer message $m^W_{t-1}$, and the manager's goal $g_t$. The manager-level and worker-level attention networks share the same architecture, introduced in Section 4.4. The details and training procedure for manager and worker are given in the following parts.

Figure 4. Manager Module.

4.2. Manager Module

The architecture of the manager module is presented in Figure 4. The manager network consists of a two-layer perceptron (MLP) and a dilated RNN (Vezhnevets et al., 2017). Note that the structure of CoRide and the formulation of the RNN enable the manager to operate both at a lower spatial resolution, by taking the joint observation of its workers, and at a lower temporal resolution, via the dilated convolutional network (Yu and Koltun, 2015).

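At timestep $t$, the agent receives an observation $o^M_t$ from the environment and feeds it into the dilated RNN together with the peer messages $m^M_{t-1}$. The goal $g_t$ and the input for manager-level attention $h^M_t$ are generated as outputs of the RNN, governed by the following equations (Vezhnevets et al., 2017):

$$h^M_t, \hat{g}_t = \mathrm{RNN}(s^M_t, h^M_{t-1}; \theta^{rnn}); \qquad g_t = \hat{g}_t / \|\hat{g}_t\|, \tag{3}$$

where $s^M_t$ is the manager's latent state computed by the MLP and $\theta^{rnn}$ denotes the parameters of the RNN. The environment responds with a new observation $o^M_{t+1}$ and a scalar reward $r^M_t$. The goal of the agent is to maximize the discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r^M_{t+k+1}$ with $\gamma \in [0, 1]$. Specifically, in the ride-hailing setting, we design our global reward to take both ADI and ORR into account, which can be formulated as:

$$r^M_t = r_{ADI} + r_{ORR}, \tag{4}$$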
where $r_{ADI}$ denotes the accumulated driver income, computed according to the price of each served order, while $r_{ORR}$ encourages ORR and is calculated from several correlated factors as:

$$r_{ORR} = \sum_{grids} (E - \bar{E})^2 + \sum_{areas} D_{KL}(P^o_t \,\|\, P^v_t), \tag{5}$$

where $E$ and $\bar{E}$ are the manager's entropy and the global average entropy respectively, which form the former term of Eq. (5) to optimize ORR at the global level. An area, as distinct from a grid, means a region that needs particular care. In our experiment, we select several grids whose entropy differs largely from the average as the areas. One can also take a district composed of grids, or a subway station located in a grid, as an area, depending on the specific situation. $D_{KL}$ denotes the Kullback-Leibler (KL) divergence, which measures the margin between the vehicle and order distributions of a certain area at timestep $t$. $P^v_t$ and $P^o_t$ are realized with the Poisson distribution, a common distribution for vehicle routing (Ghiani et al., 2003) and arrivals (Lord et al., 2005). In practice, the distribution parameters can be estimated from real trip data by the mean and standard deviation of orders and vehicles in each grid at each timestep. Such a combined ORR reward design at both grid level and area level helps optimization both globally and locally.
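The two ingredients of the ORR reward described above can be sketched numerically. The KL divergence between two Poisson distributions has a standard closed form, which the snippet uses; the function names, the input layout, and the decision to return the grid-level and area-level parts separately (rather than committing to how they are signed or weighted in the reward) are assumptions made for this illustration.

```python
import math


def poisson_kl(rate_p: float, rate_q: float) -> float:
    """Closed-form KL(Poisson(rate_p) || Poisson(rate_q)); both rates must be > 0."""
    return rate_p * math.log(rate_p / rate_q) - rate_p + rate_q


def orr_terms(grid_entropies, mean_entropy, area_rates):
    """Return (entropy_spread, kl_margin) for one manager.

    grid_entropies: entropy of each grid under the manager.
    mean_entropy:   global average entropy.
    area_rates:     (order_rate, vehicle_rate) Poisson rates for each selected area.
    The combination of the two parts into a single reward is left to the caller.
    """
    entropy_spread = sum((e - mean_entropy) ** 2 for e in grid_entropies)
    kl_margin = sum(poisson_kl(order, vehicle) for order, vehicle in area_rates)
    return entropy_spread, kl_margin
```

Both parts vanish when every grid sits at the average entropy and the order and vehicle rates coincide in each area, which matches the intuition of a balanced system.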

4.3. Worker Module

We adopt the goal embedding from Feudal Networks (FuN) (Vezhnevets et al., 2017) in our worker framework (see Figure 5), in which the goal-embedding vector is generated via linear projection. Similar to the manager module, at timestep $t$ the agent receives an observation $o^W_t$ from the environment and feeds it into a regular RNN together with the peer message $m^W_{t-1}$. As Figure 5 shows, the output of the RNN together with the goal embedding generates the primitive action, i.e. the ranking weight vector $\omega_t$.

Figure 5. Worker Module.

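Given that the worker needs to be encouraged to follow the goal generated by its manager, we adopt the intrinsic reward proposed by Vezhnevets et al. (2017), defined as:

$$r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}(s^M_t - s^M_{t-i}, g_{t-i}), \tag{6}$$

where $s^M_t$ is the manager's latent state at timestep $t$ and $d_{\cos}(\alpha, \beta) = \alpha^T \beta / (|\alpha||\beta|)$ is the cosine similarity between two vectors. Notice that $g_t$ now represents an advantageous direction in the latent state space at a horizon $c$ (Vezhnevets et al., 2017). Such an intrinsic reward design provides a directional shift for the workers to follow.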
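A minimal sketch of the FuN-style intrinsic reward follows: it averages the cosine similarity between the recent shift in latent state and the goal that was issued at the start of that shift. The function names, the list-based data layout, and the truncation of the average when fewer than `horizon` past steps exist are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def intrinsic_reward(states, goals, t, horizon):
    """Average alignment between state shifts and past goals over `horizon` steps.

    states[k] and goals[k] are the latent state and goal at timestep k.
    Assumption: early in an episode only the available steps are averaged.
    """
    steps = min(horizon, t)
    if steps == 0:
        return 0.0
    total = 0.0
    for i in range(1, steps + 1):
        shift = [x - y for x, y in zip(states[t], states[t - i])]
        total += cosine_similarity(shift, goals[t - i])
    return total / steps
```

A worker whose latent state moves exactly along the goal direction receives a reward of 1, while moving in the opposite direction yields -1.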

Different from traditional FuN (Vezhnevets et al., 2017), our worker module produces its action in two steps: (i) parameter generating, and (ii) action generating, inspired by Zhao et al. (2017). In the parameter generating step, we utilize a state-specific scoring function $f_{\theta_n}$ to map the current state $o^W_t$ to a list of weight vectors $\omega_t$ as:

$$f_{\theta_n}: o^W_t \rightarrow \omega_t, \tag{7}$$

which is computed by the neural network shown in Figure 5. In the action generating step, noting that it is straightforward to extend linear relations to non-linear ones, we combine the scoring function parameter $\omega_t$ and the ranking feature $e_i$ for order $i$ as:

$$\mathrm{score}_i = \omega_t e_i^T. \tag{8}$$

The detailed formulation of $e_i$ will be discussed in Section 5. Then we add real orders and potential fleet controls (repositioning to neighbor grids and staying at the current grid), the latter as fake orders, into the item space $\mathbf{I}$. After computing scores for all available options in $\mathbf{I}$, instead of directly ranking and selecting the Top-$k$ items for order dispatching and fleet management, we adopt a Boltzmann softmax selector to generate the Selected-$k$ items:

$$\mathrm{score}'_i = \frac{e^{\mathrm{score}_i / \tau}}{\sum_{j=1}^{M} e^{\mathrm{score}_j / \tau}}, \tag{9}$$

where $k = \min(N_v, N_o)$, $\tau$ denotes the temperature hyper-parameter that controls the exploration rate, and $M$ is the number of scored order candidates. In practice, we set the initial temperature to 1.0, then gradually reduce it to 0.01 to limit exploration. Thus this approach not only equips the action selection procedure with controllable exploration but also diversifies the policy's decisions, to avoid sending groups of vehicles to the same destination.
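The ranking and selection steps can be sketched as below: items are scored by a dot product between the weight vector and each item's features, and a temperature-controlled softmax then drives the choice of items. Drawing the items one at a time without replacement is an assumption of this sketch (the text only states that a Boltzmann softmax selector is used), and the function names are hypothetical.

```python
import math
import random


def score_items(weights, features):
    """Linear score per item: dot product of the weight vector and the item features."""
    return [sum(w * f for w, f in zip(weights, feat)) for feat in features]


def boltzmann_probs(scores, temperature):
    """Softmax over scores with temperature; subtracting the max keeps exp() stable."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_k(scores, k, temperature, rng):
    """Sample up to k distinct item indices, renormalising after each draw."""
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        probs = boltzmann_probs([scores[i] for i in remaining], temperature)
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen
```

At a temperature near 1.0 lower-scored items are still picked fairly often, which spreads vehicles across destinations; as the temperature approaches 0.01 the selection becomes nearly identical to taking the top-scored items.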

Require: current observations $o_t$; mutual communication messages $m_{t-1}$.
1:  for each manager in grid world do
2:     Generate $g_t$ according to Eq. (3).
3:     for each worker of the manager do
4:        Generate $\omega_t$ according to Eq. (7).
5:        Add orders and fleet control items to item space $\mathbf{I}$.
6:        Rank items in $\mathbf{I}$ according to Eq. (8).
7:        Generate Selected-$k$ items according to Eq. (9).
8:     end for
9:     Worker-level attention mechanism generates $m^W_t$ according to Eq. (12).
10:     Manager receives extrinsic reward $r^M_t$, and its workers receive intrinsic reward $r^I_t$ according to Eq. (4) and Eq. (6) respectively.
11:  end for
12:  Manager-level attention mechanism generates $m^M_t$ according to Eq. (12).
13:  Update parameters according to Algorithm 2.
Algorithm 1 CoRide for joint multi-scale OD & FM

4.4. Multi-head Attention for Coordination

The cooperating information $h_i$ for the $i$-th agent is generated from the RNN at $t-1$. Since manager and worker share the same setting of the multi-head attention mechanism, an agent in this subsection can represent either of them. We extend the self-attention mechanism to learn to evaluate each available interaction as:

$$h_{ij} = (h_i W_T) \cdot (h_j W_S)^T, \tag{10}$$

where $h_i W_T$ and $h_j W_S$ are the embeddings of messages from the target agent and the source agent respectively. We can model $h_{ij}$ as the value of communication between the $i$-th agent and the $j$-th agent. To retrieve a general attention value between source and target agents, we further normalize this value over the neighborhood scope as:

$$\alpha_{ij} = \mathrm{softmax}(h_{ij}) = \frac{e^{h_{ij} / \tau}}{\sum_{l \in N_i} e^{h_{il} / \tau}}, \tag{11}$$

where $N_i$ is the neighborhood scope, i.e. the set of communications available to the target agent, and $\tau$ denotes the temperature factor. To jointly attend to the neighborhood from different representation subspaces at different grids, we leverage multi-head attention as in previous work (Vaswani et al., 2017; Veličković et al., 2017; Wei et al., 2019; Zhang et al., 2019) to extend the observation as:

$$m_i = \sigma\Big(W_q \cdot \Big(\frac{1}{H} \sum_{h=1}^{H} \sum_{j \in N_i} \alpha^h_{ij} (h_j W^h_C)\Big) + b_q\Big), \tag{12}$$

where $H$ is the number of attention heads, and $W_T$, $W_S$, $W_C$ (together with $W_q$ and $b_q$) are multiple sets of trainable parameters. Thus the peer message $m_i$ is generated and will be fed into the corresponding module to produce the cooperative information $h_i$. We present the overall CoRide for joint order dispatching and fleet management in Algorithm 1.
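The attention-based message aggregation can be sketched in plain Python as follows. The sketch assumes a dot-product compatibility between projected target and source vectors, a temperature softmax over the neighborhood, and a simple average over heads; the final output projection and nonlinearity are omitted for brevity. All names (`attention_message`, `matvec`, the per-head weight triples) are hypothetical.

```python
import math


def matvec(vec, mat):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(v * mat[r][c] for r, v in enumerate(vec)) for c in range(len(mat[0]))]


def softmax(values, temperature=1.0):
    """Temperature softmax, shifted by the max for numerical stability."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_message(target, neighbors, heads, temperature=1.0):
    """Aggregate neighbor vectors into one message for the target agent.

    heads: list of (w_target, w_source, w_value) weight matrices, one triple per head.
    Assumption: the output projection and activation are left out of this sketch.
    """
    head_outputs = []
    for w_target, w_source, w_value in heads:
        query = matvec(target, w_target)
        logits = [sum(q * k for q, k in zip(query, matvec(n, w_source))) for n in neighbors]
        alphas = softmax(logits, temperature)
        values = [matvec(n, w_value) for n in neighbors]
        head_outputs.append(
            [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(len(values[0]))]
        )
    return [sum(out[d] for out in head_outputs) / len(head_outputs)
            for d in range(len(head_outputs[0]))]
```

For example, with identity weight matrices the message is a convex combination of the neighbor vectors that leans toward the neighbors most similar to the target.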

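1:  Randomly initialize critic network $Q(o, m, a \mid \theta^Q)$ and actor $\mu(o, m \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
2:  Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$.
3:  Initialize replay buffer $R$.
4:  for each training episode do
5:     for manager $i = 1$ to $N$ do
6:        $m_0$ = initial message, $t = 0$.
7:        while $t < T$ and $o_t \neq$ terminal do
8:           Select the action $a_t = \mu(o_t, m_{t-1} \mid \theta^\mu)$ for active agent;
9:           Receive reward $r_t$ and new observation $o_{t+1}$;
10:           Generate message $m_t$ = Attention$(h_1, \dots, h_K)$, where $h$ is the latent vector in the RNN and $K$ denotes the number of neighboring agents;
11:        end while
12:        Store episode $\{o_0, m_0, a_0, r_0, o_1, m_1, a_1, r_1, \dots\}$ in $R$.
13:     end for
14:     Sample a random minibatch of transitions from replay buffer $R$.
15:     for each transition $(o_t, m_{t-1}, a_t, r_t, o_{t+1}, m_t)$ do
16:        Set $y_t = r_t + \gamma Q'(o_{t+1}, m_t, \mu'(o_{t+1}, m_t) \mid \theta^{Q'})$;
17:        Update critic by minimizing the loss: $L(\theta^Q) = (y_t - Q(o_t, m_{t-1}, a_t \mid \theta^Q))^2$;
18:        Update actor policy by maximizing the critic: $J(\theta^\mu) = Q(o_t, m_{t-1}, a \mid \theta^Q)\big|_{a = \mu(o_t, m_{t-1} \mid \theta^\mu)}$;
19:        Update communication component.
20:     end for
21:  end for
Algorithm 2 Parameters Training with DDPG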

4.5. Training

As described in Algorithm 1, managers generate specific goals based on their observations and peer messages (line 2). The workers under each manager generate the weight vector according to their private observations and the shared goal (line 4). We then build a general item space for order dispatching and fleet management (line 5), and rank the items in it (line 6). Because the size of our action is bounded by the minimum of the number of vehicles and the number of orders, we take that many top-ranked items as the final action (line 7).
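The ranking and selection steps above can be sketched as follows. This is a minimal illustration, assuming that each item is scored by a dot product between its feature vector and the worker's weight vector and that selection is a deterministic top-k; the function name `select_items` and the toy features are hypothetical.

```python
import numpy as np

def select_items(weight_vector, item_features, num_vehicles, num_orders):
    """Rank candidate items and keep the top k = min(vehicles, orders)."""
    scores = item_features @ weight_vector           # assumed dot-product scoring
    k = min(num_vehicles, num_orders, len(scores))   # action size is bounded by supply and demand
    ranked = np.argsort(-scores, kind="stable")      # best score first
    return ranked[:k].tolist()

# Four candidate items (real orders or fleet management destinations),
# each described by two illustrative ranking features.
features = np.array([
    [1.0, 0.2],
    [0.5, 0.9],
    [0.1, 0.1],
    [0.8, 0.8],
])
weights = np.array([1.0, 2.0])
chosen = select_items(weights, features, num_vehicles=2, num_orders=4)
```

With two idle vehicles and four candidate items, only the two highest-scoring items are kept, so the action size follows the scarcer side of the market.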

We extend the learning approach of FuN (Vezhnevets et al., 2017) and HIRO (Nachum et al., 2018a) to train the manager and worker modules in a similar way. In CoRide, we utilize the DDPG algorithm (Lillicrap et al., 2015) to train the parameters of both the manager and worker modules, for the following reasons. Classically, the critic is designed to leverage an approximator to learn an action-value function and to direct the actor in updating its parameters. The optimal action-value function should follow the Bellman equation (Bellman, 2013) as:


which requires evaluating every candidate action to select the optimal one. This prevents Eq. (13) from being adopted in real-world scenarios, e.g., the ride-hailing setting, with enormous state and action spaces. However, the actor architecture proposed in Section 4.3 generates a deterministic action for the critic. Furthermore, Lillicrap et al. (2015) proposed a flexible and practical method that uses a function approximator to estimate the action-value function. In practice, we follow DQN: a neural network function approximator can be trained by minimizing a sequence of loss functions



where each loss is computed against the target for the current iteration. Based on the above analysis, the general training algorithm for the manager and worker modules is presented in Algorithm 2.
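The critic update discussed above can be sketched with the standard DDPG quantities from Lillicrap et al. (2015): a target of the form y = r + γ Q'(s', μ'(s')), a squared-error critic loss, and a soft update of the target network. The sketch below is an assumption-laden toy, not CoRide's training code: the actor and critic are linear functions, the message input used in Algorithm 2 is omitted, the same functions stand in for the target networks, and all names and hyperparameter values are hypothetical.

```python
import numpy as np

def td_target(reward, next_obs, done, target_actor, target_critic, gamma=0.99):
    """y = r + gamma * Q'(s', mu'(s')), with no bootstrapping on terminal steps."""
    next_action = target_actor(next_obs)   # deterministic action, so no max over actions
    return reward + gamma * (1.0 - done) * target_critic(next_obs, next_action)

def critic_loss(critic, obs, action, y):
    """Mean squared error between the critic estimate and the target y."""
    return float(np.mean((y - critic(obs, action)) ** 2))

def soft_update(target_params, online_params, tau=0.01):
    """Move the target-network parameters slowly toward the online ones."""
    return (1.0 - tau) * target_params + tau * online_params

# Toy linear actor and critic (illustrative only).
theta_mu = np.array([0.5, -0.5])       # actor weights
theta_q = np.array([1.0, 1.0, 2.0])    # critic weights over [obs, action]

def actor(obs):
    return obs @ theta_mu

def critic(obs, act):
    return np.column_stack([obs, act]) @ theta_q

obs = np.array([[1.0, 0.0], [0.0, 1.0]])
action = np.array([0.5, -0.5])
reward = np.array([1.0, 0.0])
next_obs = np.array([[0.0, 1.0], [1.0, 1.0]])
done = np.array([0.0, 1.0])

y = td_target(reward, next_obs, done, actor, critic, gamma=0.9)
loss = critic_loss(critic, obs, action, y)
new_target = soft_update(np.zeros(3), theta_q, tau=0.1)
```

Because the actor outputs a single deterministic action, the target needs one critic evaluation per transition instead of a search over the whole action space, which is the practical advantage the paragraph above points to.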

| Method | City A, Normalized ADI | City A, Normalized ORR | City B, Normalized ADI | City B, Normalized ORR | City C, Normalized ADI | City C, Normalized ORR |
|---|---|---|---|---|---|---|
| DQN | +5.71% ± 0.02% | +2.67% ± 0.01% | +6.30% ± 0.01% | +3.01% ± 0.02% | +6.11% ± 0.02% | +3.04% ± 0.01% |
| MDP | +7.11% ± 0.05% | +2.71% ± 0.03% | +7.89% ± 0.05% | +3.13% ± 0.04% | +7.53% ± 0.03% | +3.19% ± 0.03% |
| DDQN | +6.68% ± 0.04% | +3.19% ± 0.04% | +7.75% ± 0.06% | +4.06% ± 0.05% | +7.62% ± 0.04% | +4.58% ± 0.05% |
| MFOD | +6.62% ± 0.03% | +3.71% ± 0.02% | +7.91% ± 0.04% | +4.01% ± 0.02% | +7.32% ± 0.02% | +4.60% ± 0.01% |
| CoRide- | +9.27% ± 0.04% | +4.23% ± 0.03% | +8.73% ± 0.03% | +4.35% ± 0.02% | +9.06% ± 0.03% | +4.23% ± 0.04% |
| CoRide | +9.80% ± 0.04% | +4.81% ± 0.05% | +8.94% ± 0.06% | +4.89% ± 0.04% | +9.23% ± 0.05% | +5.19% ± 0.04% |

Table 1. Performance comparison of competing methods in terms of ADI and ORR with respect to the performance of RAN. For a fair comparison, the random seeds that control the dynamics of the environment are set to be the same across all methods.

5. Simulator

The trial-and-error nature of reinforcement learning requires a dynamic simulation environment for training and evaluation. Thus, we adopt and extend the grid-based simulator designed by Lin et al. (2018) to joint order dispatching and fleet management.

5.1. Data Description

The real-world data provided by Didi Chuxing (a similar dataset supported by Didi Chuxing can be found via the GAIA open dataset) includes order information and vehicle trajectories in the central areas of three large cities, with millions of orders over four consecutive weeks. Each day's data contains orders at the million level and tens of thousands of vehicles in each city. The order information includes order price, origin, destination, and duration. The trajectories contain the positions (latitude and longitude) and status (on-line, off-line, on-service) of all vehicles every few seconds. As the radius of each grid is approximately 1.3 kilometers, the central area of each city is covered by a hexagonal grid world consisting of 182, 126, and 112 grids in the three cities, respectively. To adapt to the grid-based simulator, we use a unique gridID to represent position information.

5.2. Simulator Design

In the grid-based simulator, the city is covered by a hexagonal grid world, as illustrated in Figure 2. At each timestep $t$, the simulator provides an observation with a set of idle vehicles and a set of available orders, including real orders and the aforementioned fake orders for fleet control. All such fake orders share the same attributes as real orders, except that some attributes, such as price, are held fixed. All real orders are generated by bootstrapping from the real-world dataset introduced above. More specifically, if the current simulator timestep is $t$, we randomly sample real orders occurring in the same period, i.e., between the start of timestep $t$ and the start of timestep $t+1$, which are separated by one timestep interval. In practice, we set the sampling rate to 100%. Like orders, vehicles are set online and offline alternately according to a distribution learned from the real-world dataset via maximum likelihood estimation. Each order feature, i.e., the ranking feature in Eq. (8), includes the origin gridID, the destination gridID, the price, the duration, and the type of order indicating whether it is real or fake; each vehicle takes its gridID as a feature, and vehicles located in the same grid are regarded as homogeneous. Moreover, as the travel distance between neighboring grids is approximately 1.3 kilometers and the timestep interval is 10 minutes, we assume that vehicles will not automatically move to other grids before taking another order. The ride-hailing platform then provides an optimal list of vehicle-order pairs according to the current policy. After receiving the list, the simulator returns a new observation and a list of order fees. Based on this feedback, the reward of each agent is calculated and the corresponding record is stored in a replay buffer. All network parameters are then updated using a batch of samples from the replay buffer.
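The order-generation step described above can be sketched as follows. This is a simplified illustration, assuming that bootstrapping means sampling with replacement from the historical orders whose timestamps fall inside the current 10-minute interval; the field name `minute_of_day`, the function name `sample_orders`, and the toy records are hypothetical, not the simulator's actual interface.

```python
import random

def sample_orders(historical_orders, timestep, interval_minutes=10,
                  sampling_rate=1.0, seed=None):
    """Bootstrap the real orders for one simulator timestep."""
    start = timestep * interval_minutes
    end = start + interval_minutes
    # Historical orders that occurred in the same period of the day.
    pool = [o for o in historical_orders if start <= o["minute_of_day"] < end]
    if not pool:
        return []
    rng = random.Random(seed)
    count = round(sampling_rate * len(pool))
    return rng.choices(pool, k=count)  # sampling with replacement (assumed)

history = [
    {"minute_of_day": 3,  "origin": 5, "destination": 9,  "price": 12.0, "duration": 2},
    {"minute_of_day": 12, "origin": 5, "destination": 7,  "price": 8.5,  "duration": 1},
    {"minute_of_day": 15, "origin": 6, "destination": 2,  "price": 20.0, "duration": 3},
    {"minute_of_day": 19, "origin": 1, "destination": 4,  "price": 9.0,  "duration": 1},
    {"minute_of_day": 27, "origin": 3, "destination": 11, "price": 15.0, "duration": 2},
]
sampled = sample_orders(history, timestep=1, seed=0)
```

A sampling rate below 100% would thin out demand in a grid, which is how a synthetic setting can make some regions less popular than others.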

The effectiveness of the grid-based simulator is evaluated by calibration against the real data in terms of the most important performance measure: accumulated driver income (ADI) (Lin et al., 2018). The coefficient of determination between simulated ADI and real ADI is 0.9331 and the Pearson correlation is 0.9853 with $p$-value .
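Both calibration statistics can be computed from paired series of real and simulated ADI. The sketch below assumes the coefficient of determination is defined as one minus the ratio of residual to total sum of squares, with the simulated series treated as the prediction; the paper does not state which definition was used, and the numbers here are made up purely for illustration.

```python
import numpy as np

def calibration_scores(real, simulated):
    """Return (coefficient of determination, Pearson correlation)."""
    real = np.asarray(real, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    ss_res = np.sum((real - simulated) ** 2)       # error of the simulator
    ss_tot = np.sum((real - real.mean()) ** 2)     # variance of the real series
    r_squared = 1.0 - ss_res / ss_tot
    pearson = np.corrcoef(real, simulated)[0, 1]
    return r_squared, pearson

# Illustrative values only, not data from the paper.
real_adi = [100.0, 120.0, 90.0, 150.0, 130.0]
simulated_adi = [98.0, 125.0, 95.0, 140.0, 128.0]
r2, r = calibration_scores(real_adi, simulated_adi)
```

Under this definition the coefficient of determination penalizes systematic offsets that the Pearson correlation ignores, which is consistent with the reported 0.9331 being lower than the square of 0.9853 (about 0.971).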

6. Experiment

In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed method in the joint order dispatching and fleet management environment. Since no published method fits our task, we first compare our proposed method with other models, either widely used in industry or published as academic papers, in an environment with order dispatching only. Then, we further evaluate our proposed method in the joint setting and compare it with its performance in the single setting.

6.1. Compared Methods

As discussed in (Lin et al., 2018), learning-based methods, currently regarded as the state of the art, usually outperform rule-based methods. Thus we employ 6 learning-based methods and a random method as the benchmarks for comparison in our experiments.

  • RAN: A random dispatching algorithm considering no additional information. It only assigns idle vehicles with available orders randomly at each timestep.

  • DQN: Li et al. (2019) conducted action-value function approximation based on a Q-network. The Q-network is parameterized by an MLP with four hidden layers, and we adopt the ReLU activation between hidden layers and to transform the final linear output of the Q-network.

  • MDP: Xu et al. (2018) implemented dispatching through a learning and planning approach: each vehicle-order pair is valued in consideration of both immediate rewards and future gains in the learning step, and dispatch is solved using a combinatorial optimizing algorithm in planning step.

  • DDQN: Wang et al. (2018b) introduced a double-DQN with spatial-temporal action search and the network architecture is similar to the one described in DQN except a selected action space is utilized and network parameters are updated via double-DQN.

  • MFOD: Li et al. (2019) modeled the order dispatching problem with MFRL (Yang et al., 2018) and simplified the local interactions by taking an average action among neighborhoods.

  • CoRide: Our proposed model as detailed in Section 4.

  • CoRide-: In order to further evaluate performance for hierarchical setting and agent communication, we set CoRide without multi-head attention mechanism as one of the baselines.

6.2. Result Analysis

For all learning methods, following (Li et al., 2019), we run 20 episodes for training, store the trained model periodically, and conduct the evaluation on the stored model with 5 random seeds. We compare the performance of different models regarding two criteria, including ADI, computed as the total income in a day, and ORR, calculated by the number of orders taken divided by the number of orders generated.
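The two criteria can be written down directly from their definitions. In the sketch below, the helper `relative_gain` expresses a result relative to a baseline such as RAN; reading Table 1's normalized values as this kind of relative change is an assumption, and all function names and numbers are illustrative.

```python
def adi(order_prices):
    """Accumulated driver income: total income of the orders served in a day."""
    return sum(order_prices)

def orr(num_taken, num_generated):
    """Order response rate: orders taken divided by orders generated."""
    return num_taken / num_generated if num_generated else 0.0

def relative_gain(value, baseline):
    """Change of a metric relative to a baseline (assumed normalization)."""
    return (value - baseline) / baseline

day_income = adi([12.5, 8.0, 20.0])
response_rate = orr(45, 50)
gain = relative_gain(109.8, 100.0)
```

The two metrics can pull in different directions: serving a few expensive long trips raises ADI, while serving many short trips raises ORR.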

Experimental Results and Analysis. As shown in Table 1, the performance of CoRide surpasses state-of-the-art models such as DDQN and industry-deployed models such as MDP. DDQN and DQN are mainly limited by the lack of interaction and cooperation in the multi-agent environment. MDP mainly focuses on order price but ignores other order features such as duration, which works against finding a balance between earning higher income per order and taking more orders. In contrast, our proposed method achieves higher growth in terms of ADI, not only by considering every feature of each order concurrently but also by learning to collaborate hierarchically. MFOD tries to capture dynamic demand-supply variations by propagating many local interactions between vehicles and the environment through the mean field. Note that the number and information of available grids are relatively stationary, while the number and features of active vehicles are more dynamic. Thus CoRide, which takes the grid as the agent, can more easily learn to cooperate from interactions between agents.

Apart from cooperation, the multi-head attention network also enables CoRide to capture demand-supply dynamics at both the district (manager) and grid (worker) scales (as will be further discussed with Figure 6). Such a combined-scale setting makes CoRide both effective and efficient.

Visualization Analysis. Besides the quantitative results, we also analyze through visualization whether the learned multi-head attention network can capture the demand-supply relation (see Figure 6(b)). As shown in Figure 3, our communication mechanism operates in a hierarchical way: attention among the managers communicates and learns to collaborate abstractly and globally, while peers among the workers operate and determine the key grid locally.

The attention values of several example managers and of a group of workers belonging to the same manager are visualized in Figure 6(a). Taking a closer look at Figure 6, we can observe that the areas with a high demand-supply gap are indeed concentrated in certain places, which is well captured at the manager scale. Such district-level attention values allow precious vehicle resources to be dispatched globally. Apart from the manager-scale attention, the multi-head attention network also provides worker-scale attention values, which focus on local allocation. Building on this multi-scale dispatching design, CoRide can operate like a microscope, where coarse and fine focus work together to obtain a precise action.

Figure 6. Sampled attention value and demand-supply gap in the city center during peak hours. Grids with more orders or higher attention value are shown in red (in green if opposite) and the gap is proportional to the shade of colors.
| Method | DR 20%: Trajectory | AST | TNF | DR 30%: Trajectory | AST | TNF | DR 40%: Trajectory | AST | TNF |
|---|---|---|---|---|---|---|---|---|---|
| RES | 13, 9, W, 14, W, 13, 8, 2, 7, 11 | 8 | 8 | 13, W, 14, W, W, 13, 8, W, 7, 11 | 6 | 6 | 13, W, 14, W, W, W, W, 19, O, 9 | 5 | 4 |
| REV | O, O, 15, W, 20, O, O, O, O, 11 | 9 | 3 | O, O, 15, W, W, W, O, O, O, 11 | 7 | 2 | O, O, 15, W, W, W, 20, W, O, 14 | 6 | 3 |
| CoRide | 13, 9, W, O, 0, 4, O, 2, O, 5 | 9 | 6 | 13, W, O, 11, W, W, O, O, 0, 5 | 7 | 4 | 13, W, W, O, 17, W, W, O, 0, 3 | 6 | 3 |
| CoRide+ | 13, 9, W, O, 0, 4, O, 2, O, 5 | 9 | 6 | 8, 3, 0, 2, O, 4, O, 2, O, 5 | 8 | 5 | 8, 3, 0, 2, O, 4, O, 2, O, 5 | 8 | 5 |

Table 2. Performance comparison of competing methods in terms of AST and TNF with three different discounted rates (DR). The numbers in Trajectory denote gridIDs in Figure 7, and their colors indicate the district in which each grid is located. O and W mean that the vehicle is On-service and Waiting at the current grid, respectively. Underlined numbers denote fleet management.

Ablation Study. In this subsection, we evaluate the effectiveness of the components of CoRide. Notice that the manager and worker modules serve as key components and are integrated through the hierarchical multi-agent architecture, as Figure 3 shows. We therefore investigate the contributions of the multi-head attention network and the hierarchical multi-agent architecture, and set CoRide- as a variant of the proposed method. As shown in the last two rows of Table 1, CoRide- achieves significant advantages over the aforementioned baselines, especially in City A. Similar results hold for CoRide. This can be explained by the fact that City A is the largest city according to Section 5.1, and therefore requires frequent and large volumes of transport among regions. Multi-scale guidance via the multi-head attention network and the hierarchical multi-agent architecture is thus potentially helpful, especially in large-scale cases.

Case Study. The above experimental results show that the success rate of our model is significantly better than that of the others in the single resource dispatching task. To evaluate the performance of CoRide in joint resource dispatching and repositioning, and to further differentiate the formulations of the models, we constructed a synthetic dataset containing 3 districts with 21 grids, as shown in Figure 7. All synthetic data is obtained by sampling the real-world dataset supported by Didi Chuxing. More concretely, the order distributions of all grids are sampled from the average distribution in the real-world dataset; that is, the order distributions of the grids are homogeneous. To distinguish downtown areas from uptown areas, we introduce a sampling rate. The sampling rate of each grid reflects its popularity in the real world. Thus, we set the downtown area (red grids in Figure 7) to a stationary sampling rate of 100%. The other regions (blue and green grids in Figure 7) are sampled at their own rates, one for blue areas and one for green areas. Specifically, we set the discounted rate to 20%, 30%, and 40% respectively, and further verify our proposed model by comparing against the following 3 methods:

  • RES: This response-based method aims to achieve higher ORR, which corresponds to Total Number of Finished orders (TNF) in this section. Orders with short duration will gain high priority to get dispatched first. Once there are multiple orders with the same estimated trip time, then orders with higher prices will be served first.

  • REV: The revenue-based algorithm focuses on a higher ADI, which corresponds to Accumulated on-Service Time (AST) in this section, by assigning vehicles to high price orders. Following the similar principle as described above, the price and duration of orders will be considered as primary and secondary factors respectively.

  • CoRide+: To distinguish CoRide running in the two different environments, single order dispatching versus joint order dispatching and fleet management, we denote the former as CoRide and the latter as CoRide+.
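The two rule-based baselines in the list above amount to different sort orders over the available orders. The sketch below is one reading of those rules: RES puts short trips first and breaks ties by higher price, while REV puts high prices first. Breaking REV ties by shorter duration is an assumption, since the text only names duration as the secondary factor, and the `Order` class and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    order_id: str
    price: float
    duration: int  # estimated trip time, in timesteps (assumed unit)

def res_priority(orders):
    """Response-based: shortest duration first, ties broken by higher price."""
    return sorted(orders, key=lambda o: (o.duration, -o.price))

def rev_priority(orders):
    """Revenue-based: highest price first, ties broken by shorter duration (assumed)."""
    return sorted(orders, key=lambda o: (-o.price, o.duration))

orders = [
    Order("a", price=10.0, duration=1),
    Order("b", price=20.0, duration=2),
    Order("c", price=20.0, duration=3),
    Order("d", price=5.0, duration=1),
]
res_order = [o.order_id for o in res_priority(orders)]
rev_order = [o.order_id for o in rev_priority(orders)]
```

On this toy input RES serves the cheap short trips first, whereas REV commits the vehicle to the expensive long trips, which mirrors the trade-off between TNF and AST examined in the case study.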

Figure 7. Illustration of the grid world in the case study. The color of each grid indicates its entropy. Namely, grids with more orders and higher entropy values are shown in red (in green if opposite) and the gap is colored blue.

To analyze these performances in a more straightforward way, we mainly employ rule-based methods here. We also introduce Accumulated on-Service Time (AST) and Total Number of Finished orders (TNF) as new metrics. To further analyze the long-term behavior, we select one vehicle starting at grid 12, trace it for 10 timesteps, and record its trajectory, as Figure 7 shows; the results are summarized in Table 2. Although we only record the first 10 timesteps, we can observe that our proposed methods, both CoRide+ and CoRide, guide the vehicle to regions with larger entropy. This benefits from the architecture, in which the states of both manager and worker, as well as the ranking features, take the grid information into consideration. In contrast, the other methods greedily optimize either AST (ADI) or TNF (ORR) and ignore this information. Taking a close look at Table 2, we find that CoRide and CoRide+ share the same trajectory at a discounted rate of 20% and differ greatly when the discounted rate moves to 30% and 40%. This can be explained by regarding CoRide+ as the combination of our proposed model CoRide with the joint order dispatching and fleet management setting. Namely, CoRide is a special case of CoRide+ in which fleet management is disabled. Equipped with fleet management, CoRide+ allows the vehicle to move to the hotspots and serve orders there more directly than CoRide. Also, when the discounted rate varies from 20% to 40%, fleet management gives CoRide+ better adaptability, to the point that it can ignore the dynamics of the order distributions in some cases.

From the above analysis, we can conclude that (i) CoRide+ achieves not only state-of-the-art but also more stable results, benefiting from the joint order dispatching and fleet management setting; and (ii) both CoRide and CoRide+ can direct the vehicle to grids with larger entropy by taking grid information into consideration.

7. Conclusion and future work

In this paper, we have proposed CoRide, a hierarchical multi-agent reinforcement learning solution that combines order dispatching and fleet management for multi-scale ride-hailing platforms. Test results on real-world data from multiple cities as well as analytic synthetic data have shown that our proposed algorithm achieves (i) a higher ADI and ORR than the aforementioned methods, (ii) a multi-scale decision-making process, (iii) a hierarchical multi-agent architecture for the ride-hailing task, and (iv) more stable behavior across different cases, as illustrated in the case study. Note that, in theory, CoRide could achieve fully decentralized execution and be integrated closely with other models based on geographical information, such as estimated time of arrival (ETA) (Wang et al., 2018a). It is therefore interesting to conduct further evaluation and investigation. We also note that applying hierarchical reinforcement learning in real-world scenarios is very challenging, and our work is just a start. Much work remains for future research to improve both the stability and the performance of hierarchical reinforcement learning methods on real-world tasks.