1. Introduction
Online ride-hailing platforms such as Uber and Didi Chuxing have substantially transformed our lives by sharing and reallocating transportation resources to greatly improve transportation efficiency. The emergence of Unmanned Ground Vehicles (UGVs) will bring not only more convenience through a large number of intelligent autonomous vehicles but also more challenging tasks for building intelligent transportation. In a general view, there are two major decision-making tasks for such ride-hailing platforms, namely (i) order dispatching, i.e., matching orders and vehicles in real time to directly deliver the service to users (Zou et al., 2013; Zhang et al., 2017; Seow et al., 2010), and (ii) fleet management, i.e., repositioning vehicles to certain areas in advance to prepare for later order dispatching (Lin et al., 2018; Oda and Tachibana, 2018; Simao et al., 2009).
Apparently, the decision of matching an order-vehicle pair or repositioning a vehicle to an area needs to account for the future situation of the vehicle's location and the orders nearby. Thus, much work has modeled order dispatching and fleet management as a sequential decision-making problem and solved it with reinforcement learning (RL) (Wang et al., 2018b; Xu et al., 2018; Lin et al., 2018; Tang and Qin, 2018). Most previous work deals with either order dispatching or fleet management without regard to the high correlation between these two tasks, especially for large-scale ride-hailing platforms in large cities, which leads to sub-optimal performance. To achieve near-optimal performance, inspired by thermodynamics, we model the whole ride-hailing platform in terms of dispatch (order dispatching) and reposition (fleet management). As illustrated in Figure 1, we regard vehicles and orders as different molecules and aim at building up system stability by reducing their number via dispatch and reposition. To address this complex criterion, in addition to interconnecting order dispatching and fleet management, jointly considering intra-district (grid-level) and inter-district (district-level) allocation would substantially help, especially during peak hours. With such a practical motivation, we focus on modeling joint order dispatching and fleet management with a multi-scale decision-making system. There are several significant technical challenges in learning a highly efficient allocation policy for a real-time ride-hailing platform:
Large-scale Agents. A fundamental question in any ride-hailing system is how to deal with a large number of orders and vehicles. One alternative is to model each available vehicle as an agent (Oda and Tachibana, 2018; Xu et al., 2018; Wei et al., 2018). However, such a setting needs to maintain thousands of agents interacting with the environment, which brings a huge computational cost. Instead, we utilize the region grid world (as will be further discussed in Figure 2), which regards each region as an agent, and naturally model the ride-hailing system in a hierarchical learning setting. This formulation allows decentralized learning and control with distributed implementation.
Immediate & Future Rewards. A key challenge in seeking an optimal control policy is to find a trade-off between immediate and future rewards in terms of accumulated driver income (ADI). Greedily matching vehicles with long-distance orders can yield a high immediate gain at a single order dispatching stage, but it would harm the order response rate (ORR) and future revenue, especially during rush hour, because of the long drive time and unpopular destination. When we jointly consider order dispatching and fleet management, deciding whether to serve an order heading to an area with few orders or to reposition the driver to a popular area with zero pay also requires a trade-off between immediate and future gains. Recent attempts (Xu et al., 2018; Wei et al., 2018; Oda and Tachibana, 2018) deployed RL to combine the instant order reward from online planning with a future state-value as the final matching value. However, the coordination between different regions is still far from optimal. Inspired by hierarchical RL (Vezhnevets et al., 2017), we introduce a geographical hierarchical structure of region agents. We treat each large district as a manager agent and each small grid as a worker agent. The manager operates at a lower spatial and temporal dimension and sets abstract goals which are conveyed to its workers. The worker takes specific actions and interacts with the environment, coordinated by the manager-level goal and worker-level messages. This decoupled structure facilitates very long timescale credit assignment (Vezhnevets et al., 2017) and guarantees a balance between immediate and future revenue.

Heterogeneous & Variant Action Space. Traditional RL models require a fixed action space (Mnih et al., 2013). If we model picking an order as an RL action, there is no guarantee of a fixed action space as the available orders keep changing. Zhang et al. (2017) proposed to learn a state-action value function to evaluate each valid order-vehicle match, and then use a combinatorial optimization method such as the Kuhn-Munkres (KM) algorithm (Munkres, 1957) to filter the matches. However, such a method faces another important challenge: order dispatching and fleet management are different tasks, which results in heterogeneous action spaces. To address this issue, we redefine the action as the weight vector for ranking orders and fleet controls, where fleet controls are regarded as fake orders, and all orders are ranked and matched with vehicles in each agent. This bypasses the issue of heterogeneous and variant action spaces as well as high computational costs.

Multi-Scale Ride-Hailing. Xu et al. (2018) introduced a policy-evaluation-based RL method to learn the dynamics for each grid. As its results show, orders and vehicles often concentrate in different districts (e.g., uptown and downtown in Figure 1). How to combine large hotspots in the city (inter-district) with small ones in districts (intra-district) is another challenge that has received little attention until now. To take both inter-district and intra-district allocation into consideration, we adopt and extend the attention mechanism in a hierarchical way (as will be further discussed in Figure 3). Compared with learning a value function for each grid homogeneously (Xu et al., 2018), this attention-based structure can not only capture the impacts of neighbor agents, but also learn to distinguish key grids and districts at the worker (grid) and manager (district) scales, respectively.
Wrapping all modules together, we propose CoRide, a hierarchical multi-agent reinforcement learning framework to resolve the aforementioned challenges. The main contributions are listed as follows:
- We propose a model, CoRide, that learns to collaborate in a hierarchical multi-agent setting for ride-hailing platforms.
- We conduct extensive experiments based on real-world data of multiple cities, as well as analytic synthetic data, demonstrating that CoRide provides superior performance in terms of ADI and ORR in the task of city-wide hybrid order dispatching and fleet management over strong baselines.
- To the best of our knowledge, CoRide is the first work (i) to apply hierarchical reinforcement learning to the ride-hailing platform; (ii) to address the task of joint order dispatching and fleet management on online ride-hailing platforms; (iii) to introduce and study the multi-scale ride-hailing task.
To sum up, our model employs each large district as a manager module and its constituent small grids as worker modules. The manager operates at a lower temporal resolution, sets abstract goals for its workers, and collaborates with other managers. The worker generates primitive actions and sends messages to its peers for cooperation. This structure conveys several benefits: (i) in addition to balancing long-term and short-term reward, it facilitates adaptation in dynamic real-world situations by assigning different goals to workers; (ii) instead of considering all matches between available orders and vehicles globally, these tasks are distributed to each worker and manager agent and fulfilled in parallel.
2. Related Work
Decision-making for Ride-hailing. Order dispatching and fleet management are two major decision-making tasks for online ride-hailing platforms, which have attracted much attention from academia and industry in recent years.
To tackle these challenging transportation problems, rule-based approaches have addressed the order dispatching problem with either centralized or decentralized settings. Lee et al. (2004) and Lee et al. (2007) focused on reducing the pick-up distance or waiting time by choosing the nearest orders or following the first-come, first-serve principle. These approaches usually cannot reach a high success rate, since they ignore many potential orders in the waiting list. To improve global performance, Zhang et al. (2017) proposed a novel model based on centralized combinatorial optimization that concurrently matches multiple vehicle-order pairs within a short time window. However, this approach needs to compute all available vehicle-order matches and requires feature engineering, which is infeasible and prevents it from being adopted in large-scale taxi-order dispatching situations. With a decentralized setting, Seow et al. (2010) addressed this problem with a collaborative multi-agent taxi dispatching system. However, this method requires rounds of direct communications between agents, so it is limited to a local area with a small number of vehicles.

Instead of rule-based approaches, which require additional handcrafted heuristics, the current trending direction is to incorporate reinforcement learning algorithms into complicated traffic management problems.
Xu et al. (2018) proposed a learning and planning method based on reinforcement learning to optimize resource utilization and user experience in a global and more farsighted view. In (Oda and Tachibana, 2018), the authors leveraged the graph structure of the road network and expanded a distributed DQN formulation that maximizes entropy in the agents' learning policy with soft Q-learning, to improve the performance of fleet management. Wei et al. (2018) introduced a reinforcement learning method which takes the uncertainty of future requests into account and can make look-ahead decisions to help the operator improve the global level-of-service of a shared-vehicle system through fleet management. To capture the complicated stochastic demand-supply variations in high-dimensional space, Lin et al. (2018) proposed a contextual multi-agent actor-critic framework to achieve explicit coordination among a large number of agents adaptive to different contexts in the fleet management system.

Different from all aforementioned methods, our approach is, to the best of our knowledge, the first to consider the joint modeling of order dispatching and fleet management, and also the only current work introducing and studying the multi-scale ride-hailing task.
Hierarchical Reinforcement Learning. Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve tasks with long-term dependency or multi-level interaction patterns (Dayan and Hinton, 1993; Dietterich, 2000). Recent works have suggested that several interesting and standout results can be induced by training multi-level hierarchical policy in a multi-task setup (Frans et al., 2017; Sigaud and Stulp, 2018) or implementing hierarchical setting in sparse reward problems (Riedmiller et al., 2018; Vezhnevets et al., 2017).
The options framework (Stolle and Precup, 2002; Precup, 2000; Sutton et al., 1999) formulates the problem with a two-level hierarchy, where the low level - the option - is a sub-policy with a termination condition. Since the traditional options framework suffers from requiring prior knowledge to design options, jointly learning the high-level policy with the low-level policy has been proposed by (Bacon et al., 2017). However, this actor-critic HRL approach needs to either learn sub-policies for each time step or one sub-policy for the whole episode. Therefore, the performance of the whole module often hinges on learning useful sub-policies. To guarantee obtaining effective sub-policies, one alternative direction is to provide auxiliary rewards for the low-level policy: hand-designed rewards based on prior domain knowledge (Kulkarni et al., 2016; Tessler et al., 2017) or mutual information (Florensa et al., 2017; Daniel et al., 2012; Kong et al., [n. d.]). Given that having access to a well-designed and suitable reward is often a luxury, Vezhnevets et al. (2017) proposed FeUdal Networks (FuN), where a generic reward is utilized for low-level policy learning, thus avoiding hand-crafted rewards. Several works extend and improve FuN with off-policy training (Nachum et al., 2018a), a form of hindsight (Levy et al., 2018), and representation learning (Nachum et al., 2018b).
Our work is also developed from FuN (Vezhnevets et al., 2017), originally inspired by feudal RL (Dayan and Hinton, 1993). FuN employs only one pair of manager and worker and connects them with a parameterized goal and intrinsic reward. Instead, we model CoRide with multiple managers. Unlike our method, in FuN the manager and worker modules are paired one-to-one, share the same observation, and operate at different temporal but the same spatial resolution. In CoRide, multiple workers learn to collaborate under one manager while the managers also coordinate with each other. The manager takes a joint observation of all its workers, and each worker produces actions based on its specific observation and the shared goal. Building on this one-to-many setting, the manager can not only operate with long timescale credit but also act at a lower spatial resolution. Recently, Ahilan and Dayan (2019) introduced a novel architecture named FMH for cooperation in multi-agent RL. Different from that method, CoRide not only extends the scale of the multi-agent environment but also facilitates communication through a multi-head attention mechanism, which computes the influences of interactions and differentiates the impacts on each agent. In other words, FuN (Vezhnevets et al., 2017) is actually a special case of CoRide, where a single manager along with its only worker is employed. Yet, the majority of current HRL methods require careful task-specific design, making them difficult to apply in real-world scenarios (Nachum et al., 2018a). To the best of our knowledge, CoRide is the first work to apply hierarchical reinforcement learning to the ride-hailing problem.
3. Problem Formulation
We formulate the problem of controlling large-scale homogeneous vehicles in online ride-hailing platforms, combining the order dispatching system with the fleet management system, with the goal of maximizing city-level ADI and ORR. In practice, vehicles are divided into two groups: the order dispatching (OD) group and the fleet management (FM) group. For the OD group, we match these vehicles with available orders pair-wise; for the FM group, we either reposition them to nearby locations or dispatch orders to them (same as the OD group). An illustration of the problem is shown in Figure 2. We use a hexagonal grid world to represent the map and take each grid as an agent. Considering that only orders within the pick-up distance can be dispatched to vehicles, we set the distance between grids based on the pick-up distance. Given that, in our setting, vehicles in the same spatial-temporal node are homogeneous, i.e., the vehicles located at the same grid share the same setting, we can model order dispatching as a large-scale parallel ranking task, where we rank orders and match them with homogeneous vehicles in each grid. The fleet control for fleet management, i.e., repositioning a vehicle to a neighbor grid or staying at the current grid, is treated as fake orders (as will be further discussed in Section 6) and handled in the same ranking procedure as order dispatching.
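To make the ranking formulation concrete, the sketch below shows one way to wrap fleet controls as fake orders so that dispatching and repositioning share a single item space; the `Order` structure and the `build_item_space` helper are illustrative names of ours, not the platform's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Order:
    origin: int        # gridID
    destination: int   # gridID
    price: float
    duration: float
    is_fake: bool = False  # True for fleet-control "orders"

def build_item_space(real_orders: List[Order], grid_id: int,
                     neighbor_grids: List[int], stay_price: float = 0.0) -> List[Order]:
    """Merge real orders with fake orders that reposition a vehicle
    to a neighbor grid or keep it in the current grid."""
    fake_orders = [Order(grid_id, g, stay_price, 1.0, is_fake=True)
                   for g in neighbor_grids + [grid_id]]
    return real_orders + fake_orders

# Toy usage: grid 0 with two real orders and three neighbor grids.
items = build_item_space(
    [Order(0, 17, 12.5, 2.0), Order(0, 5, 6.0, 1.0)],
    grid_id=0, neighbor_grids=[1, 2, 3])
print(len(items))  # 2 real orders + 4 fake orders (3 neighbors + stay)
```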
Since each agent can only reposition vehicles located in its managing grid, we propose to formulate the problem using a Partially Observable Markov Decision Process (POMDP) (Spaan, 2012). Formally, we model this task as a Markov game for $N$ agents, defined by a tuple $\langle N, \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $N$, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{R}$, and $\gamma$ are the number of agents, the set of states, the set of actions, the state transition probability, the reward function, and the future reward discount factor, respectively. The definitions of the major components are as follows.
Agent. We consider each region cell as an agent identified by $i$, where $i \in \{1, \dots, N\}$. In detail, a single grid represents a worker agent, and a district containing multiple grids represents a manager agent. An example is presented in Figure 2: each individual grid serves as a worker agent, and a grid together with its 6 neighbor grids, as shown in different colors, composes a manager agent. Note that although the number of vehicles and orders varies over time, the number of agents is fixed.
State $\mathcal{S}$, Observation $\mathcal{O}$. Although there are two different types of agents - manager and worker - their observations only differ in scale: the observation of each manager is actually the joint observation of its workers. At each timestep $t$, agent $i$ draws a private observation $o_t^i$ correlated with the environment state $s_t$. In our setting, the observation is expressed as a tuple whose elements represent the number of vehicles, the number of orders, the entropy, the number of vehicles in the FM group, and the distribution of order features in the current grid (e.g., price, duration), respectively. Note that both dispatching and repositioning belong to resource allocation, similar to a thermodynamic system (Figure 1), and once order dispatching or fleet management occurs, the dispatched or fleeted items slip out of the system. Namely, only idle vehicles and available orders contribute to the disorder and unevenness of the ride-hailing system. Therefore, we introduce and extend the concept of entropy here, defined as:
$$E = -k_B \sum_{i} P_i \ln P_i \tag{1}$$

where $k_B$ is a Boltzmann constant, and $P_i$ denotes the probability for each state: $P_i = 1$ for dispatched and fleeted items, and $P_i < 1$ elsewhere. We choose to ignore items at state 1 ($\ln 1 = 0$), according to the aforementioned analysis, and give the formula of $P_i$ for the remaining idle items as:

$$P_i = \frac{N_o}{N_v} \tag{2}$$

which is conditioned on the situation $N_v \geq N_o$ (with $N_v$ and $N_o$ the numbers of idle vehicles and available orders in the grid) and is straightforward to transform to other situations.
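As a worked illustration of Eq. (1), the toy function below computes a grid's entropy from its idle-vehicle and available-order counts; the concrete form of $P$ follows our reconstruction of Eq. (2) above and is an assumption rather than the paper's exact definition.

```python
import math

def grid_entropy(n_vehicles: int, n_orders: int, k_b: float = 1.0) -> float:
    """Toy grid entropy following E = -k_B * sum_i P_i ln P_i (Eq. (1)).
    We assume (our reading, not necessarily the paper's exact Eq. (2)) that
    every unmatched idle item shares P = min(N_o, N_v) / max(N_o, N_v), while
    dispatched/fleeted items have P = 1 and contribute ln 1 = 0."""
    if n_vehicles == 0 or n_orders == 0:
        return 0.0
    p = min(n_orders, n_vehicles) / max(n_orders, n_vehicles)
    if p == 1.0:                          # perfectly balanced grid: no leftover items
        return 0.0
    n_idle = abs(n_vehicles - n_orders)   # items that cannot be matched this step
    return -k_b * n_idle * p * math.log(p)

print(grid_entropy(n_vehicles=10, n_orders=4))  # imbalanced grid -> high disorder
print(grid_entropy(n_vehicles=5, n_orders=5))   # balanced grid -> 0.0
```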
Action $\mathcal{A}$, State Transition $\mathcal{P}$. In our hierarchical RL setting, the manager's action is to generate an abstract, intrinsic goal for its workers, and each worker needs to provide a ranking list over the relevant real orders (OD) and the fleet controls served as fake orders (FM). Thus, the action of a worker is defined as the weight vector for the ranking features; changing a worker's action means changing this weight vector (as will be further discussed in Section 6). At each timestep, the whole multi-agent system produces a joint action $a_t = \{a_t^1, \dots, a_t^N\}$ for all managers and workers, where $a_t \in \mathcal{A}$, which induces a transition in the environment according to the state transition $\mathcal{P}(s_{t+1} \mid s_t, a_t)$.
Reward $\mathcal{R}$. As in previous hierarchical RL settings (Vezhnevets et al., 2017), only the manager interacts with the environment and receives feedback from it. This extrinsic reward function determines the direction of optimization and is proportional to both immediate profit and potential value, while an intrinsic reward encourages the worker to follow the instruction from its manager.

More specifically, we give a simple example based on the above problem setting in Figure 2. At time $t$, worker agent 0 ranks the available real orders and the potential fake orders for fleet control, and selects the Selected-2 (as will be further discussed in Eq. (9)) options: a real order from grid 0 to grid 17, and a fake order from grid 0 to grid 5. After the driver finishes, the manager agent whose sub-workers include worker agent 0 receives the corresponding reward.

4. Methodologies
4.1. Overall Architecture
As shown in Figure 3, CoRide employs two layers of agents, namely the layer of manager agents and the layer of worker agents. Each agent is associated with a communication component for exchanging messages. As both the agents and the decision-making process are organized hierarchically, the multi-head attention mechanism serving for communication is extended into a multi-layer setting.
Compared with the traditional one-to-one manager-worker control in hierarchical RL (Vezhnevets et al., 2017), we assign multiple workers to a single manager, and the agents also learn to collaborate within each of the two layers. The manager internally computes a latent state representation as an input to the manager-level attention network and outputs a goal vector $g_t$. The worker produces its action and the input for the worker-level attention conditioned on its private observation, the peer messages, and the manager's goal $g_t$. The manager-level and worker-level attention networks share the same architecture, introduced in Section 4.4. The details and training procedure for the manager and worker are given in the following parts.

4.2. Manager Module
The architecture of the manager module is presented in Figure 4. The manager network consists of a two-layer Multi-Layer Perceptron (MLP) and a dilated RNN (Vezhnevets et al., 2017). Note that the structure of CoRide and the form of the RNN enable the manager to operate both at a lower spatial resolution, by taking the joint observation of its workers, and at a lower temporal resolution, via the dilated convolutional structure (Yu and Koltun, 2015).
At timestep $t$, the manager receives an observation $o_t^m$ from the environment and feeds it into the dilated RNN together with the peer messages $m_{t-1}$. The goal $g_t$ and the input for manager-level attention are generated as the output of the RNN, governed by the following equations (Vezhnevets et al., 2017):

$$h_t^m, \hat{g}_t = \mathrm{RNN}^m\!\left(o_t^m, m_{t-1}, h_{t-1}^m; \theta^m\right), \qquad g_t = \hat{g}_t / \lVert \hat{g}_t \rVert \tag{3}$$
where $\theta^m$ denotes the parameters of the RNN. The environment responds with a new observation $o_{t+1}^m$ and a scalar reward $r_t$. The goal of the agent is to maximize the discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ with $\gamma \in [0, 1)$. Specifically, in the ride-hailing setting, we design our global reward to take both ADI and ORR into account, which can be formulated as:

$$r_t = r_t^{\mathrm{ADI}} + r_t^{\mathrm{ORR}} \tag{4}$$
where $r_t^{\mathrm{ADI}}$ denotes accumulated driver income, computed according to the price of each served order, while $r_t^{\mathrm{ORR}}$ encourages ORR and is calculated from several correlated factors as:

$$r_t^{\mathrm{ORR}} = -\left| E_t^i - \bar{E}_t \right| - D_{\mathrm{KL}}\!\left( P_t^{\mathrm{vehicle}} \,\Vert\, P_t^{\mathrm{order}} \right) \tag{5}$$
where $E_t^i$ and $\bar{E}_t$ are the manager's entropy and the global average entropy, respectively, which form the former term of Eq. (5) to optimize ORR at a global level. An area, different from a grid, often means a certain region which needs extra care. In our experiments, we select several grids whose entropy differs largely from the average as areas. One can also take a district composed of grids, or a subway station located in a grid, as an area, depending on the specific situation. $D_{\mathrm{KL}}$ denotes the Kullback-Leibler (KL) divergence, which measures the margin between the vehicle and order distributions of a certain area at timestep $t$. $P_t^{\mathrm{vehicle}}$ and $P_t^{\mathrm{order}}$ are realized with Poisson distributions, a common choice for vehicle routing (Ghiani et al., 2003) and order arrival (Lord et al., 2005). In practice, the distribution parameters can be estimated from real trip data by the mean and standard deviation of orders and vehicles in each grid at each timestep. Such a combined ORR reward design at both the grid level and the area level helps optimization both globally and locally.
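The snippet below sketches one possible reading of the ORR reward in Eq. (5): Poisson rates are estimated from historical counts, and the entropy-deviation and KL terms are combined. The weighting factor `beta` and the closed-form Poisson KL are our choices, not taken from the paper.

```python
import numpy as np

def poisson_kl(lam_p: float, lam_q: float) -> float:
    """Closed-form KL divergence between two Poisson distributions."""
    return lam_p * np.log(lam_p / lam_q) + lam_q - lam_p

def orr_reward(manager_entropy: float, global_avg_entropy: float,
               area_vehicle_counts: np.ndarray, area_order_counts: np.ndarray,
               beta: float = 1.0) -> float:
    """Sketch of the ORR part of the reward: penalize (i) deviation of the
    manager's entropy from the global average and (ii) the mismatch between
    vehicle and order distributions in a monitored area, with Poisson rates
    estimated by the maximum-likelihood mean of historical counts."""
    entropy_term = -abs(manager_entropy - global_avg_entropy)
    lam_vehicle = float(np.mean(area_vehicle_counts))
    lam_order = float(np.mean(area_order_counts))
    kl_term = -poisson_kl(lam_vehicle, lam_order)
    return entropy_term + beta * kl_term

print(orr_reward(1.2, 0.8,
                 area_vehicle_counts=np.array([3, 5, 4]),
                 area_order_counts=np.array([8, 9, 10])))
```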
4.3. Worker Module
We adopt the goal embedding from FeUdal Networks (FuN) (Vezhnevets et al., 2017) in our worker framework (see Figure 5), where the goal $g_t$ is embedded into a goal-embedding vector via linear projection. Similar to the manager, at timestep $t$ the worker receives an observation $o_t^w$ from the environment and feeds it into a regular RNN together with the peer messages $m_{t-1}$. As Figure 5 shows, the output of the RNN together with the goal embedding generates the primitive action, i.e., the ranking weight vector.

Given that the worker needs to be encouraged to follow the goal generated by its manager, we adopt the intrinsic reward proposed by (Vezhnevets et al., 2017), defined as:

$$r_t^{I} = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\!\left( o_t^w - o_{t-i}^w,\, g_{t-i} \right) \tag{6}$$

where $d_{\cos}(\alpha, \beta) = \alpha^{\top}\beta / (\lVert\alpha\rVert \lVert\beta\rVert)$ is the cosine similarity between two vectors. Notice that the goal $g_t$ now represents an advantageous direction in the latent state space at a horizon $c$ (Vezhnevets et al., 2017). Such an intrinsic reward design provides a directional shift for workers to follow.

Different from traditional FuN (Vezhnevets et al., 2017), the procedure by which our worker module produces an action consists of two steps: (i) parameter generating and (ii) action generating, inspired by (Zhao et al., 2017). We utilize a state-specific scoring function in the parameter-generating step to map the current state to the ranking weight vector as:
$$a_t^w = f_{\theta^w}\!\left(o_t^w, g_t\right) \tag{7}$$

which is calculated by the neural network shown in Figure 5. In the action-generating step (noting that it is straightforward to extend linear relations to non-linear ones), we combine the scoring-function parameter $a_t^w$ and the ranking feature $f_j$ of order $j$ as:

$$\mathrm{score}_j = \left(a_t^w\right)^{\top} f_j \tag{8}$$
The detailed formulation of the ranking feature will be discussed in Section 5. Then we build the item space $\mathcal{I}$ by adding the real orders and the potential fleet controls - repositioning to neighbor grids and staying at the current grid - as fake orders. After computing scores for all available options in $\mathcal{I}$, instead of directly ranking and selecting the Top-$k$ items for order dispatching and fleet management, we adopt a Boltzmann softmax selector to generate the Selected-$k$ items:

$$P(j) = \frac{\exp\!\left(\mathrm{score}_j / T\right)}{\sum_{j'=1}^{|\mathcal{I}|} \exp\!\left(\mathrm{score}_{j'} / T\right)}, \quad j = 1, \dots, |\mathcal{I}| \tag{9}$$
where $T$ denotes the temperature hyper-parameter that controls the exploration rate, and $|\mathcal{I}|$ is the number of scored order candidates. In practice, we set the initial temperature to 1.0, then gradually reduce it to 0.01 to limit exploration. Thus this approach not only equips the action selection procedure with controllable exploration but also diversifies the policy's decisions, avoiding groups of vehicles being fleeted to the same destination.
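A minimal sketch of the Selected-$k$ step: items are scored with the worker's weight vector as in Eq. (8) and $k$ distinct items are drawn from the Boltzmann softmax of Eq. (9). Sampling without replacement and the fixed toy temperature are our assumptions; only the temperature range (1.0 annealed toward 0.01) comes from the text.

```python
import numpy as np

def select_k(weight_vector: np.ndarray, item_features: np.ndarray,
             k: int, temperature: float = 1.0, rng=None) -> np.ndarray:
    """Score each item with a linear dot product (Eq. (8)-style) and draw k
    distinct item indices from the Boltzmann softmax over scores (Eq. (9)-style)."""
    rng = rng or np.random.default_rng(0)
    scores = item_features @ weight_vector            # one score per candidate item
    logits = scores / temperature
    probs = np.exp(logits - logits.max())             # numerically stable softmax
    probs /= probs.sum()
    k = min(k, len(probs))
    return rng.choice(len(probs), size=k, replace=False, p=probs)

# Toy usage: 5 candidate items (real + fake orders) with 3 ranking features each.
features = np.random.default_rng(1).normal(size=(5, 3))
print(select_k(np.array([0.5, -0.2, 1.0]), features, k=2, temperature=1.0))
```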
4.4. Multi-head Attention for Coordination
The cooperating information for the $i$-th agent at timestep $t$ is generated from the RNN at timestep $t-1$. Note that since the manager and the worker share the same setting of the multi-head attention mechanism, "agent" in this subsection can represent either of them. We extend the self-attention mechanism to learn to evaluate each available interaction as:

$$e_{ij} = \left(W_t\, h_i\right)^{\top} \left(W_s\, h_j\right) \tag{10}$$
where $W_t h_i$ and $W_s h_j$ are the embeddings of the messages from the target agent $i$ and the source agent $j$, respectively. We can model $e_{ij}$ as the value of communication between the $i$-th agent and the $j$-th agent. To retrieve a general attention value between source and target agents, we further normalize this value over the neighborhood scope as:

$$\alpha_{ij} = \frac{\exp\!\left(e_{ij} / \tau\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(e_{ik} / \tau\right)} \tag{11}$$
where $\mathcal{N}_i$ is the neighborhood scope, i.e., the set of agents available for communication with the target agent, and $\tau$ denotes a temperature factor. To jointly attend to the neighborhood from different representation subspaces at different grids, we leverage multi-head attention as in previous work (Vaswani et al., 2017; Veličković et al., 2017; Wei et al., 2019; Zhang et al., 2019) to extend the observation as:

$$m_i = \big\Vert_{h=1}^{H}\, \sigma\!\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{h}\, W_c^{h}\, h_j \Big) \tag{12}$$
where $H$ is the number of attention heads, and $W_t^h$, $W_s^h$, $W_c^h$ are multiple sets of trainable parameters. Thus the peer message $m_i$ is generated and will be fed into the corresponding module to produce the cooperative information. We present the overall CoRide procedure for joint order dispatching and fleet management in Algorithm 1.
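Below is a NumPy sketch of the neighborhood attention in Eqs. (10)-(12) under our reconstructed notation: bilinear interaction scores, a temperature-controlled softmax over the neighborhood scope, and concatenation of heads. All parameter shapes and the concrete aggregation are assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_neighborhood_attention(msg, neighbor_ids, params, tau=1.0):
    """msg: (n_agents, d) message embeddings; neighbor_ids: indices of the
    neighborhood scope of the target agent (index 0 here); params: list of
    (W_target, W_source, W_out) per head. Returns the aggregated peer message."""
    target = msg[0]
    outputs = []
    for (w_t, w_s, w_out) in params:
        # Eq. (10)-style bilinear interaction score for each neighbor.
        scores = np.array([(w_t @ target) @ (w_s @ msg[j]) for j in neighbor_ids])
        # Eq. (11)-style normalization over the neighborhood scope.
        alpha = softmax(scores / tau)
        # Eq. (12)-style weighted aggregation for this head.
        outputs.append(sum(a * (w_out @ msg[j]) for a, j in zip(alpha, neighbor_ids)))
    return np.concatenate(outputs)   # concatenate the heads

rng = np.random.default_rng(0)
msg = rng.normal(size=(4, 8))                           # 4 agents, 8-dim messages
params = [(rng.normal(size=(8, 8)), rng.normal(size=(8, 8)),
           rng.normal(size=(4, 8))) for _ in range(2)]  # 2 heads
print(multi_head_neighborhood_attention(msg, neighbor_ids=[1, 2, 3], params=params).shape)
```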
4.5. Training
As described in Algorithm 1, the managers generate specific goals based on their observations and peer messages (line 2). The workers under each manager generate their weight vectors according to their private observations and the shared goal (line 4). We then build a general item space for order dispatching and fleet management (line 5) and rank the items in it (line 6). Considering that the action is conditioned on the minimum of the number of vehicles and the number of orders, we generate the Selected-$k$ items as the final action (line 7).
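Since Algorithm 1 itself is not reproduced here, the following runnable skeleton summarizes the per-timestep decision loop as we read it from the text; the `Manager`, `Worker`, and `boltzmann_select` stubs are placeholders, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

class Manager:
    """Toy stand-in for the manager module: maps its joint observation to a goal."""
    def __init__(self, dim):
        self.w = rng.normal(size=(dim, dim))

    def act(self, obs):
        g = self.w @ obs
        return g / np.linalg.norm(g)                    # normalized goal, FuN-style

class Worker:
    """Toy stand-in for the worker module: maps observation + goal to ranking weights."""
    def __init__(self, dim, n_feat):
        self.w = rng.normal(size=(n_feat, 2 * dim))

    def act(self, obs, goal):
        return self.w @ np.concatenate([obs, goal])     # ranking weight vector

def boltzmann_select(scores, k, temperature=1.0):
    p = np.exp(np.asarray(scores) / temperature)
    p /= p.sum()
    return rng.choice(len(scores), size=min(k, len(scores)), replace=False, p=p)

# One decision step for a single manager with two workers, mirroring lines 2-7
# of Algorithm 1 as described above.
dim, n_feat = 4, 3
manager, workers = Manager(dim), [Worker(dim, n_feat) for _ in range(2)]
goal = manager.act(rng.normal(size=dim))                # line 2: goal from joint observation
for w in workers:
    weights = w.act(rng.normal(size=dim), goal)         # line 4: ranking weight vector
    item_features = rng.normal(size=(5, n_feat))        # line 5: real + fake orders
    scores = item_features @ weights                    # line 6: score every item
    print(boltzmann_select(scores, k=2))                # line 7: Selected-k action
```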
We extend the learning approach of FuN (Vezhnevets et al., 2017) and HIRO (Nachum et al., 2018a) to train the manager and worker modules in a similar way. In CoRide, we utilize the DDPG algorithm (Lillicrap et al., 2015) to train the parameters of both the manager and worker modules, for the following reasons. Classically, the critic is designed to leverage an approximator to learn an action-value function $Q(s, a)$ and to direct the actor in updating its parameters. The optimal action-value function should follow the Bellman equation (Bellman, 2013):

$$Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right] \tag{13}$$

which requires evaluating every candidate action to select the optimal one. This prevents Eq. (13) from being adopted in real-world scenarios, e.g., the ride-hailing setting, with enormous state and action spaces. However, the actor architecture proposed in Section 4.3 generates a deterministic action for the critic. Furthermore, Lillicrap et al. (2015) proposed a flexible and practical method that uses an approximator function to estimate the action-value function, i.e., $Q(s, a) \approx Q(s, a; \theta)$. In practice, we leverage DQN-style training: the neural network function approximator can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ as:

$$L_i(\theta_i) = \mathbb{E}_{s, a}\!\left[ \left( y_i - Q(s, a; \theta_i) \right)^{2} \right] \tag{14}$$

where $y_i = \mathbb{E}_{s'}\!\left[ r + \gamma\, Q(s', a'; \theta_{i-1}) \right]$, with $a'$ given by the actor, is the target for the current iteration. According to the aforementioned analysis, the general training algorithm for the manager and worker modules is presented in Algorithm 2.
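A hedged sketch of the critic update behind Eqs. (13)-(14): a deterministic actor supplies the next action (so no max over actions is needed) and the critic is regressed toward a bootstrapped target, in the spirit of DDPG/DQN. Network sizes, the target network, and the optimizer are placeholders.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 4, 0.95

# Placeholder actor and critic networks (not the paper's architectures).
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
critic_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
critic_target.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One minibatch sampled from the replay buffer (random placeholders here).
obs = torch.randn(64, obs_dim)
act = torch.randn(64, act_dim)
rew = torch.randn(64, 1)
next_obs = torch.randn(64, obs_dim)

with torch.no_grad():
    next_act = actor(next_obs)                    # deterministic action from the actor
    y = rew + gamma * critic_target(torch.cat([next_obs, next_act], dim=-1))  # Eq. (14) target
q = critic(torch.cat([obs, act], dim=-1))
loss = nn.functional.mse_loss(q, y)               # Eq. (14) loss
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```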
Table 1: Normalized ADI and ORR of the compared methods in three cities (mean ± std).

| Method | ADI (City A) | ORR (City A) | ADI (City B) | ORR (City B) | ADI (City C) | ORR (City C) |
|---|---|---|---|---|---|---|
| DQN | +5.71% ± 0.02% | +2.67% ± 0.01% | +6.30% ± 0.01% | +3.01% ± 0.02% | +6.11% ± 0.02% | +3.04% ± 0.01% |
| MDP | +7.11% ± 0.05% | +2.71% ± 0.03% | +7.89% ± 0.05% | +3.13% ± 0.04% | +7.53% ± 0.03% | +3.19% ± 0.03% |
| DDQN | +6.68% ± 0.04% | +3.19% ± 0.04% | +7.75% ± 0.06% | +4.06% ± 0.05% | +7.62% ± 0.04% | +4.58% ± 0.05% |
| MFOD | +6.62% ± 0.03% | +3.71% ± 0.02% | +7.91% ± 0.04% | +4.01% ± 0.02% | +7.32% ± 0.02% | +4.60% ± 0.01% |
| CoRide- | +9.27% ± 0.04% | +4.23% ± 0.03% | +8.73% ± 0.03% | +4.35% ± 0.02% | +9.06% ± 0.03% | +4.23% ± 0.04% |
| CoRide | +9.80% ± 0.04% | +4.81% ± 0.05% | +8.94% ± 0.06% | +4.89% ± 0.04% | +9.23% ± 0.05% | +5.19% ± 0.04% |
5. Simulator
The trial-and-error nature of reinforcement learning requires a dynamic simulation environment for training and evaluation. Thus, we adopt and extend the grid-based simulator designed by Lin et al. (2018) to support joint order dispatching and fleet management.
5.1. Data Description
The real-world data provided by Didi Chuxing (a similar dataset can be found via the GAIA open dataset: https://outreach.didichuxing.com/research/opendata/en/) includes order information and trajectories of vehicles in the central areas of three large cities, with millions of orders over four consecutive weeks. The data of each day contains million-level orders and tens of thousands of vehicles in each city. The order information includes order price, origin, destination, and duration. The trajectories contain the positions (latitude and longitude) and status (on-line, off-line, on-service) of all vehicles every few seconds. As the radius of a grid is approximately 1.3 kilometers, the central area of each city is covered by a hexagonal grid world consisting of 182, 126, and 112 grids in the three cities, respectively. To adapt to the grid-based simulator, we use a unique gridID to represent position information.
5.2. Simulator Design
In the grid-based simulator, the city is covered by a hexagonal grid world as illustrated in Figure 2. At each timestep $t$, the simulator provides an observation with a set of idle vehicles and a set of available orders, including real orders and the aforementioned fake orders for fleet control. All such fake orders share the same attributes as real orders, except that some attributes, such as price, are set to stationary values. All real orders are generated by bootstrapping from the real-world dataset introduced above. More specifically, suppose the current timestep of the simulator is $t$; we randomly sample real orders occurring in the same period, i.e., happening between $t$ and $t + \Delta t$, where $\Delta t$ denotes the timestep interval. In practice, we set the sampling rate to 100%. Like orders, vehicles are set online and offline alternately according to a distribution learned from the real-world dataset via maximum likelihood estimation. Each order feature, i.e., the ranking feature in Eq. (8), includes the origin gridID, the destination gridID, the price, the duration, and the type of order indicating a real or fake order; each vehicle takes its gridID as its feature, and vehicles located in the same grid are regarded as homogeneous. Moreover, as the travel distance between neighboring grids is approximately 1.3 kilometers and the timestep interval is 10 minutes, we assume that vehicles will not automatically move to other grids before taking another order. The ride-hailing platform then provides an optimal list of vehicle-order pairs according to the current policy. After receiving the list, the simulator returns a new observation and a list of order fees. Based on this feedback, rewards for each agent are calculated and the corresponding records are stored in a replay buffer. The whole network's parameters are updated using a batch of samples from the replay buffer.
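The order bootstrapping described above can be sketched as follows; the DataFrame column names and the resampling details are our assumptions, not the released data schema.

```python
import numpy as np
import pandas as pd

def bootstrap_orders(orders: pd.DataFrame, t: int, dt: int,
                     sampling_rate: float = 1.0, seed: int = 0) -> pd.DataFrame:
    """Resample historical real orders whose start time falls in [t, t + dt).
    `orders` is assumed to have columns: start_ts, origin_grid, dest_grid,
    price, duration."""
    window = orders[(orders.start_ts >= t) & (orders.start_ts < t + dt)]
    if len(window) == 0:
        return window
    n = int(round(len(window) * sampling_rate))
    rng = np.random.default_rng(seed)
    return window.iloc[rng.integers(0, len(window), size=n)]

# Toy usage with a 10-minute timestep interval (600 seconds).
toy = pd.DataFrame({"start_ts": [0, 120, 650, 900], "origin_grid": [3, 3, 7, 8],
                    "dest_grid": [5, 9, 2, 1], "price": [8.0, 12.0, 6.5, 9.0],
                    "duration": [600, 900, 300, 450]})
print(bootstrap_orders(toy, t=0, dt=600))
```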
The effectiveness of the grid-based simulator is evaluated by calibration against the real data in terms of the most important performance measurement: accumulated driver income (ADI) (Lin et al., 2018). The coefficient of determination between the simulated ADI and the real ADI is 0.9331, and the Pearson correlation is 0.9853 with a statistically significant $p$-value.
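For reference, the calibration statistics quoted above can be reproduced from paired real/simulated ADI series as in the sketch below (one common way to compute them; the toy arrays are placeholders).

```python
import numpy as np
from scipy import stats

def calibration_metrics(real_adi: np.ndarray, sim_adi: np.ndarray):
    """Coefficient of determination (R^2 of simulated vs. real ADI) and
    Pearson correlation with its p-value."""
    ss_res = np.sum((real_adi - sim_adi) ** 2)
    ss_tot = np.sum((real_adi - real_adi.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    pearson_r, p_value = stats.pearsonr(real_adi, sim_adi)
    return r2, pearson_r, p_value

real = np.array([100.0, 120.0, 90.0, 150.0, 130.0])
sim = np.array([98.0, 118.0, 95.0, 148.0, 127.0])
print(calibration_metrics(real, sim))
```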
6. Experiment
In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed method in the joint order dispatching and fleet management environment. Given that there are no published methods fitting our task, we first compare our proposed method with other models, either widely used in industry or published as academic papers, in a single order dispatching environment. Then, we further evaluate our proposed method in the joint setting and compare it with its performance in the single setting.
6.1. Compared Methods
As discussed in (Lin et al., 2018), learning-based methods, currently regarded as state-of-the-art, usually outperform rule-based methods. Thus we employ 6 learning-based methods and a random method as the benchmarks for comparison in our experiments.
- RAN: A random dispatching algorithm considering no additional information. It only assigns idle vehicles to available orders randomly at each timestep.
- MDP: Xu et al. (2018) implemented dispatching through a learning and planning approach: each vehicle-order pair is valued in consideration of both immediate rewards and future gains in the learning step, and dispatch is solved using a combinatorial optimization algorithm in the planning step.
- DDQN: Wang et al. (2018b) introduced a double-DQN with spatial-temporal action search; the network architecture is similar to the one described in DQN, except that a selected action space is utilized and network parameters are updated via double-DQN.
- CoRide: Our proposed model, as detailed in Section 4.
- CoRide-: To further evaluate the contribution of the hierarchical setting and agent communication, we use CoRide without the multi-head attention mechanism as one of the baselines.
6.2. Result Analysis
For all learning methods, following (Li et al., 2019), we run 20 episodes for training, store the trained model periodically, and conduct the evaluation on the stored models with 5 random seeds. We compare the performance of different models regarding two criteria: ADI, computed as the total income in a day, and ORR, calculated as the number of orders taken divided by the number of orders generated.
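As a concrete reading of the two metrics (field names are ours), ADI sums the prices of served orders over a day and ORR divides the number of served orders by the number of generated orders:

```python
from typing import Dict, List

def compute_metrics(orders: List[Dict]) -> Dict[str, float]:
    """orders: one day of order records with keys 'price' and 'served' (bool)."""
    served = [o for o in orders if o["served"]]
    adi = sum(o["price"] for o in served)                 # accumulated driver income
    orr = len(served) / len(orders) if orders else 0.0    # order response rate
    return {"ADI": adi, "ORR": orr}

print(compute_metrics([{"price": 10.0, "served": True},
                       {"price": 7.5, "served": False},
                       {"price": 12.0, "served": True}]))
```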
Experimental Results and Analysis. As shown in Table 1, the performance of CoRide surpasses state-of-the-art models like DDQN and industry-deployed models like MDP. DDQN, along with DQN, is mainly limited by the lack of interaction and cooperation in the multi-agent environment. MDP mainly focuses on order price but ignores other order features like duration, which works against finding a balance between earning higher income per order and taking more orders. Instead, our proposed method achieves higher growth in terms of ADI not only by considering every feature of each order concurrently but also by learning to collaborate hierarchically. MFOD tries to capture dynamic demand-supply variations by propagating many local interactions between vehicles and the environment through the mean field. Note that the number and information of available grids are relatively stationary, while the number and features of active vehicles are more dynamic. Thus CoRide, which takes the grid as the agent, can learn to cooperate from interactions between agents more easily.
Apart from cooperation, the multi-head attention network also enables CoRide to capture demand-supply dynamics at both the district (manager) and grid (worker) scales (as will be further discussed in Figure 6). Such a combined-scale setting makes CoRide both effective and efficient.
Visualization Analysis. Apart from quantitative results, we also analyze whether the learned multi-head attention network can capture the demand-supply relation (see Figure 6(b)) through visualization. As shown in Figure 3, our communication mechanism operates in a hierarchical way: attention among the managers communicates and learns to collaborate abstractly and globally, while attention among the peer workers operates and determines key grids locally.
The attention values of several example managers and a group of workers belonging to the same manager are visualized in Figure 6(a). Taking a closer look at Figure 6, we can observe that the areas with high demand-supply imbalance are indeed concentrated at certain places, which is well captured at the manager scale. Such district-level attention values allow precious vehicle resources to be dispatched globally. Apart from the manager-scale values, the multi-head attention network also provides worker-scale attention values, which focus on local allocation. Building on this multi-scale dispatching system design, CoRide can operate like a microscope, where coarse and fine focuses work together to obtain precise actions.

Table 2: Trajectory, accumulated on-service time (AST), and total number of finished orders (TNF) of a traced vehicle under different discounted rates (DR).

| Method | Trajectory (DR 20%) | AST | TNF | Trajectory (DR 30%) | AST | TNF | Trajectory (DR 40%) | AST | TNF |
|---|---|---|---|---|---|---|---|---|---|
| RES | 13, 9, W, 14, W, 13, 8, 2, 7, 11 | 8 | 8 | 13, W, 14, W, W, 13, 8, W, 7, 11 | 6 | 6 | 13, W, 14, W, W, W, W, 19, O, 9 | 5 | 4 |
| REV | O, O, 15, W, 20, O, O, O, O, 11 | 9 | 3 | O, O, 15, W, W, W, O, O, O, 11 | 7 | 2 | O, O, 15, W, W, W, 20, W, O, 14 | 6 | 3 |
| CoRide | 13, 9, W, O, 0, 4, O, 2, O, 5 | 9 | 6 | 13, W, O, 11, W, W, O, O, 0, 5 | 7 | 4 | 13, W, W, O, 17, W, W, O, 0, 3 | 6 | 3 |
| CoRide+ | 13, 9, W, O, 0, 4, O, 2, O, 5 | 9 | 6 | 8, 3, 0, 2, O, 4, O, 2, O, 5 | 8 | 5 | 8, 3, 0, 2, O, 4, O, 2, O, 5 | 8 | 5 |
Ablation Study. In this subsection, we evaluate the effectiveness of the components of CoRide. Notice that the manager and worker modules serve as key components and are integrated through the hierarchical multi-agent architecture, as Figure 3 shows. Thus we choose to investigate the performance of the multi-head attention network and the hierarchical multi-agent architecture here, and set CoRide- as a variation of the proposed method. As shown in the last two rows of Table 1, CoRide- achieves significant advantages over the aforementioned baselines, especially in City A. Similar results hold for CoRide. This phenomenon can be explained by the fact that City A is the largest one according to Section 5.1, which requires frequent and large-scale transportation among regions. Multi-scale guidance via the multi-head attention network and the hierarchical multi-agent architecture is therefore particularly helpful, especially in large-scale cases.
Case Study. The above experimental results show that the success rate of our model is significantly better than others in the single resource dispatching task. To evaluate the performance of CoRide in joint resource dispatching and repositioning, and to further differentiate the formulations of the models, we constructed a synthetic dataset containing 3 districts with 21 grids, as shown in Figure 7. All these synthetic datasets are obtained by sampling the real-world dataset provided by Didi Chuxing. More concretely, the order distributions of all grids are sampled from the average distribution in the real-world dataset; namely, the order distributions in all grids are homogeneous. To distinguish downtown areas from uptown areas, we introduce a sampling rate, which denotes the popularity of each grid in the real world. Thus, we set the downtown grids (red grids in Figure 7) to a stationary sampling rate of 100%. The other regions (blue and green grids in Figure 7) are sampled at discounted rates; specifically, we set the discounted rate to 20%, 30%, and 40%, respectively (a sketch of this sampling scheme is given after the method list below), and further verify our proposed model by comparing against the following 3 methods:
- RES: This response-based method aims to achieve a higher ORR, which corresponds to the Total Number of Finished orders (TNF) in this section. Orders with short duration get high priority and are dispatched first. When multiple orders share the same estimated trip time, orders with higher prices are served first.
- REV: The revenue-based algorithm focuses on a higher ADI, which corresponds to Accumulated on-Service Time (AST) in this section, by assigning vehicles to high-price orders. Following a similar principle as above, the price and duration of orders are considered as the primary and secondary factors, respectively.
- CoRide+: To distinguish CoRide running in different environments - single order dispatching versus joint order dispatching and fleet management - we denote the former CoRide and the latter CoRide+.
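Below is a minimal sketch of the grid-level sampling scheme described before the method list; the per-region factors applied to the blue and green grids are placeholders, since the original assignment is not recoverable from the text.

```python
import numpy as np

def sample_grid_orders(avg_orders_per_step: float, sampling_rate: float,
                       rng: np.random.Generator) -> int:
    """Draw the number of orders appearing in one grid at one timestep by
    thinning the city-average demand with the grid's sampling rate."""
    return int(rng.poisson(avg_orders_per_step * sampling_rate))

rng = np.random.default_rng(0)
dr = 0.3   # one of the discounted rates used in Table 2 (20%, 30%, 40%)
# Downtown grids keep the full rate; the factors for blue and green grids below
# are placeholders only.
for region, rate in [("downtown (red)", 1.0), ("blue", 1.0 - dr), ("green", 1.0 - dr)]:
    print(region, sample_grid_orders(20.0, rate, rng))
```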

To analyze these performances in a more straightforward way, we mainly employ rule-based methods here, and we introduce Accumulated on-Service Time (AST) and Total Number of Finished orders (TNF) as the new metrics. To further analyze the performance over the long term, we select one vehicle starting at grid 12, trace it for 10 timesteps, and record its trajectory, as Figure 7 shows; the results are summarized in Table 2. Although we only record the first 10 timesteps, we can observe that our proposed methods, both CoRide+ and CoRide, guide the vehicle to regions with larger entropy. This benefits from the architecture, where the states of both manager and worker, as well as the ranking features, take the grid information into consideration. In contrast, the other methods greedily optimize either AST (ADI) or TNF (ORR) and ignore this information. Taking a closer look at Table 2, we find that CoRide and CoRide+ share the same trajectory at a discounted rate of 20% but differ greatly when the discounted rate moves to 30% and 40%. This can be explained by regarding CoRide+ as the combination of our proposed model CoRide with the joint order dispatching and fleet management setting. Namely, CoRide is actually a special case of CoRide+, where fleet management is disabled. Equipped with fleet management, CoRide+ allows the vehicle to move to and serve orders in the hotspots more directly than CoRide. Also, when the discounted rate varies from 20% to 40%, fleet management enables CoRide+ to adapt better, even ignoring the dynamics of order distributions in some cases.
According to the aforementioned analysis, we can conclude that (i) CoRide+ achieves not only state-of-the-art but also more stable results, benefiting from the joint order dispatching and fleet management setting; (ii) both CoRide and CoRide+ can direct the vehicle to grids with larger entropy by taking grid information into consideration.
7. Conclusion and future work
In this paper, we have proposed CoRide, a hierarchical multi-agent reinforcement learning solution that combines order dispatching and fleet management for multi-scale ride-hailing platforms. Test results on real-world data from multiple cities, as well as analytic synthetic data, have shown that our proposed algorithm achieves (i) higher ADI and ORR than the aforementioned methods, (ii) a multi-scale decision-making process, (iii) a hierarchical multi-agent architecture for the ride-hailing task, and (iv) a more stable method in different cases, as illustrated in the case study section. Note that, in principle, CoRide could achieve fully decentralized execution and incorporate closely with other geographical-information-based models such as estimated time of arrival (ETA) (Wang et al., 2018a); thus it is interesting to conduct further evaluation and investigation. Also, we notice that applying hierarchical reinforcement learning in real-world scenarios is very challenging and our work is just a start. There is much work for future research to improve both the stability and the performance of hierarchical reinforcement learning methods on real-world tasks.
References
- Ahilan and Dayan (2019) Sanjeevan Ahilan and Peter Dayan. 2019. Feudal Multi-Agent Hierarchies for Cooperative Reinforcement Learning. arXiv (2019).
- Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture.. In AAAI. 1726–1734.
- Bellman (2013) Richard Bellman. 2013. Dynamic programming. Courier Corporation.
- Daniel et al. (2012) Christian Daniel, Gerhard Neumann, and Jan Peters. 2012. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics. 273–281.
- Dayan and Hinton (1993) Peter Dayan and Geoffrey E Hinton. 1993. Feudal reinforcement learning. In Advances in neural information processing systems. 271–278.
- Dietterich (2000) Thomas G Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13 (2000), 227–303.
- Florensa et al. (2017) Carlos Florensa, Yan Duan, and Pieter Abbeel. 2017. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012 (2017).
- Frans et al. (2017) Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. 2017. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767 (2017).
- Ghiani et al. (2003) Gianpaolo Ghiani, Francesca Guerriero, Gilbert Laporte, and Roberto Musmanno. 2003. Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies. European Journal of Operational Research 151, 1 (2003), 1–11.
- Kong et al. ([n. d.]) Xiangyu Kong, Bo Xin, Fangchen Liu, and Yizhou Wang. [n. d.]. Effective Master-Slave Communication On A Multi-Agent Deep Reinforcement Learning System. ([n. d.]).
- Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems. 3675–3683.
- Lee et al. (2004) Der-Horng Lee, Hao Wang, Ruey Long Cheu, and Siew Hoon Teo. 2004. Taxi dispatch system based on current demands and real-time traffic conditions. Transportation Research Record 1882, 1 (2004), 193–200.
- Lee et al. (2007) Junghoon Lee, Gyung-Leen Park, Hanil Kim, Young-Kyu Yang, Pankoo Kim, and Sang-Wook Kim. 2007. A telematics service system based on the Linux cluster. In International Conference on Computational Science. Springer, 660–667.
- Levy et al. (2018) Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. 2018. Learning Multi-Level Hierarchies with Hindsight. (2018).
- Li et al. (2019) Minne Li, Yan Jiao, Yaodong Yang, Zhichen Gong, Jun Wang, Chenxi Wang, Guobin Wu, Jieping Ye, et al. 2019. Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning. arXiv (2019).
- Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
- Lin et al. (2018) Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. 2018. Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning. arXiv preprint arXiv:1802.06444 (2018).
- Lord et al. (2005) Dominique Lord, Simon P Washington, and John N Ivan. 2005. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention 37, 1 (2005), 35–46.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
- Munkres (1957) James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957), 32–38.
- Nachum et al. (2018a) Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine. 2018a. Data-Efficient Hierarchical Reinforcement Learning. arXiv preprint arXiv:1805.08296 (2018).
- Nachum et al. (2018b) Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. 2018b. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning. arXiv preprint arXiv:1810.01257 (2018).
- Oda and Tachibana (2018) Takuma Oda and Yulia Tachibana. 2018. Distributed Fleet Control with Maximum Entropy Deep Reinforcement Learning. (2018).
- Precup (2000) Doina Precup. 2000. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst.
- Riedmiller et al. (2018) Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. 2018. Learning by Playing-Solving Sparse Reward Tasks from Scratch. arXiv preprint arXiv:1802.10567 (2018).
- Seow et al. (2010) Kiam Tian Seow, Nam Hai Dang, and Der-Horng Lee. 2010. A collaborative multiagent taxi-dispatch system. IEEE Transactions on Automation Science and Engineering 7, 3 (2010), 607–616.
- Sigaud and Stulp (2018) Olivier Sigaud and Freek Stulp. 2018. Policy Search in Continuous Action Domains: an Overview. arXiv preprint arXiv:1803.04706 (2018).
- Simao et al. (2009) Hugo P Simao, Jeff Day, Abraham P George, Ted Gifford, John Nienow, and Warren B Powell. 2009. An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Science 43, 2 (2009), 178–197.
- Spaan (2012) Matthijs TJ Spaan. 2012. Partially observable Markov decision processes. In Reinforcement Learning. Springer, 387–414.
- Stolle and Precup (2002) Martin Stolle and Doina Precup. 2002. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation. Springer, 212–223.
- Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112, 1-2 (1999), 181–211.
- Tang and Qin (2018) Xiaocheng Tang and Zhiwei Qin. 2018. A Deep Value-network Based Approach for Multi-Driver Order Dispatching. Technical Report (2018).
- Tessler et al. (2017) Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. 2017. A Deep Hierarchical Approach to Lifelong Learning in Minecraft.. In AAAI, Vol. 3. 6.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv (2017).
- Vezhnevets et al. (2017) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. 2017. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161 (2017).
- Wang et al. (2018a) Zheng Wang, Kun Fu, and Jieping Ye. 2018a. Learning to estimate the travel time. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 858–866.
- Wang et al. (2018b) Zhaodong Wang, Zhiwei Qin, Xiaocheng Tang, Jieping Ye, and Hongtu Zhu. 2018b. Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 617–626.
- Wei et al. (2018) Chong Wei, Yinhu Wang, Xuedong Yan, and Chunfu Shao. 2018. Look-Ahead Insertion Policy for a Shared-Taxi System Based on Reinforcement Learning. IEEE Access 6 (2018), 5716–5726.
- Wei et al. (2019) Hua Wei, Nan Xu, Huichu Zhang, Guanjie Zheng, Xinshi Zang, Chacha Chen, Weinan Zhang, Yanmin Zhu, Kai Xu, and Zhenhui Li. 2019. CoLight: Learning Network-level Cooperation for Traffic Signal Control. arXiv preprint arXiv:1905.05717 (2019).
- Xu et al. (2018) Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan, Chunyang Liu, Wei Bian, and Jieping Ye. 2018. Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms: A Learning and Planning Approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 905–913.
- Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning (ICML).
- Yu and Koltun (2015) Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
- Zhang et al. (2019) Huichu Zhang, Siyuan Feng, Chang Liu, Yaoyao Ding, Yichen Zhu, Zihan Zhou, Weinan Zhang, Yong Yu, Haiming Jin, and Zhenhui Li. 2019. CityFlow: A Multi-Agent Reinforcement Learning Environment for Large Scale City Traffic Scenario. arXiv preprint arXiv:1905.05217 (2019).
- Zhang et al. (2017) Lingyu Zhang, Tao Hu, Yue Min, Guobin Wu, Junying Zhang, Pengcheng Feng, Pinghua Gong, and Jieping Ye. 2017. A taxi order dispatch model based on combinatorial optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2151–2159.
- Zhao et al. (2017) Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2017. Deep Reinforcement Learning for List-wise Recommendations. arXiv preprint arXiv:1801.00209 (2017).
- Zou et al. (2013) Qingnan Zou, Guangtao Xue, Yuan Luo, Jiadi Yu, and Hongzi Zhu. 2013. A novel taxi dispatch system for smart city. In International Conference on Distributed, Ambient, and Pervasive Interactions. Springer, 326–335.