Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning

02/18/2018 ∙ by Kaixiang Lin, et al. ∙ Didi Chuxing ∙ Michigan State University

Large-scale online ride-sharing platforms have substantially transformed our lives by reallocating transportation resources to alleviate traffic congestion and promote transportation efficiency. An efficient fleet management strategy not only can significantly improve the utilization of transportation resources but also increase the revenue and customer satisfaction. It is a challenging task to design an effective fleet management strategy that can adapt to an environment involving complex dynamics between demand and supply. Existing studies usually work on a simplified problem setting that can hardly capture the complicated stochastic demand-supply variations in high-dimensional space. In this paper we propose to tackle the large-scale fleet management problem using reinforcement learning, and propose a contextual multi-agent reinforcement learning framework including two concrete algorithms, namely contextual deep Q-learning and contextual multi-agent actor-critic, to achieve explicit coordination among a large number of agents adaptive to different contexts. We show significant improvements of the proposed framework over state-of-the-art approaches through extensive empirical studies.




Large-scale multi-agent reinforcement learning; Fleet management.


1. Introduction

Large-scale online ride-sharing platforms such as Uber (Uber, [n. d.]) and Didi Chuxing (Chuxing, [n. d.]) have transformed the way people travel, live and socialize. By leveraging the advances in and wide adoption of information technologies such as cellular networks and global positioning systems, ride-sharing platforms redistribute underutilized vehicles on the roads to passengers in need of transportation. This optimization of transportation resources has greatly alleviated traffic congestion and narrowed the once significant gap between transport demand and supply (Li et al., 2016).

One key challenge in ride-sharing platforms is to balance demand and supply, i.e., the orders of passengers and the drivers available to pick up those orders. In large cities, although millions of ride-sharing orders are served every day, an enormous number of passenger requests remain unserved due to the lack of available drivers nearby. On the other hand, there are plenty of available drivers looking for orders in other locations. If the available drivers were directed to locations with high demand, the number of orders served would increase significantly, benefiting all parties at once: the utilization of transportation capacity would improve, the income of drivers and the satisfaction of passengers would increase, and the market share and revenue of the company would expand. Fleet management is a key technical component for balancing the differences between demand and supply: it reallocates available vehicles ahead of time to achieve high efficiency in serving future demand.

Even though rich historical demand and supply data are available, using the data to seek an optimal allocation policy is not an easy task. One major issue is that changes in an allocation policy will impact future demand and supply, and it is hard for supervised learning approaches to capture and model these real-time changes. On the other hand, reinforcement learning (RL) (Sutton and Barto, 1998), which learns a policy by interacting with a complicated environment, has been naturally adopted to tackle the fleet management problem (Godfrey and Powell, 2002a, b; Wei et al., 2017). However, the high-dimensional and complicated dynamics between demand and supply can hardly be modeled accurately by traditional RL approaches.

Recent years have witnessed tremendous success of deep reinforcement learning (DRL) in modeling intellectually challenging decision-making problems (Mnih et al., 2015; Silver et al., 2017) that were previously intractable. In light of such advances, in this paper we propose a novel DRL approach to learn highly efficient allocation policies for fleet management. There are significant technical challenges when modeling fleet management using DRL:

1) Feasibility of problem setting. The RL framework is reward-driven, meaning that a sequence of actions from the policy is evaluated solely by the reward signal from the environment (Arulkumaran et al., 2017). The definitions of agent, reward and action space are essential for RL. If we model the allocation policy using a centralized agent, the action space can be prohibitively large, since an action needs to decide the number of available vehicles to reposition from each location to its nearby locations. Also, the policy is subject to a feasibility constraint enforcing that the number of repositioned vehicles be no larger than the current number of available vehicles. To the best of our knowledge, this high-dimensional policy optimization with exact constraint satisfaction is not computationally tractable in DRL: applying it to even a very small-scale problem could already incur high computational costs (Pham et al., 2017).

2) Large-scale agents. One alternative approach is to use a multi-agent DRL setting, where each available vehicle is considered an agent. The multi-agent recipe indeed alleviates the curse of dimensionality of the action space. However, such a setting creates thousands of agents interacting with the environment at each time. Training a large number of agents using DRL is again challenging: the environment for each agent is non-stationary, since other agents are learning and affecting the environment at the same time. Most existing studies (Lowe et al., 2017; Foerster et al., 2017; Tampuu et al., 2017) allow coordination among only a small set of agents due to high computational costs.

3) Coordination and context dependence of the action space. Facilitating coordination among large-scale agents remains a challenging task. Since each agent typically learns its own policy or action-value function, which changes over time, it is difficult to coordinate a large number of agents. Moreover, the action space changes dynamically over time, since agents are navigating to different locations and the number of feasible actions depends on the geographic context of the location.

In this paper, we propose a contextual multi-agent DRL framework to resolve the aforementioned challenges. Our major contributions are listed as follows:

  • We propose an efficient multi-agent DRL setting for the large-scale fleet management problem through a proper design of agent, reward and state.

  • We propose a contextual multi-agent reinforcement learning framework in which two concrete algorithms are developed: contextual multi-agent actor-critic (cA2C) and contextual deep Q-learning (cDQN). For the first time in multi-agent DRL, the contextual algorithms can not only achieve efficient explicit coordination among thousands of learning agents at each time, but also adapt to dynamically changing action spaces.

  • In order to train and evaluate the RL algorithms, we developed a simulator that closely reproduces real-world traffic activities after being calibrated using real historical data provided by Didi Chuxing (Chuxing, [n. d.]).

  • Last but not least, the proposed contextual algorithms significantly outperform the state-of-the-art methods in multi-agent DRL while requiring far fewer repositions.

The rest of the paper is organized as follows. We first give a literature review of related work in Sec 2. The problem statement is elaborated in Sec 3, the methodology is described in Sec 4, and the simulation platform we built for training and evaluation is introduced in Sec 5. Quantitative and qualitative results are presented in Sec 6. Finally, we conclude our work in Sec 7.

2. Related Works

Intelligent Transportation System.

Advances in machine learning and traffic data analytics have led to widespread applications of machine learning techniques to challenging traffic problems. One trending direction is to incorporate reinforcement learning algorithms in complicated traffic management problems. Many previous studies have demonstrated the possibility and benefits of reinforcement learning. Our work has close connections to these studies in terms of problem setting, methodology and evaluation. Among the traffic applications that are closely related to our work, such as taxi dispatch systems or traffic light control algorithms, multi-agent RL has been explored to model the intricate nature of these traffic activities (Bakker et al., 2010; Seow et al., 2010; Maciejewski and Nagel, 2013). The promising results motivated us to use multi-agent modeling in the fleet management problem. In (Godfrey and Powell, 2002a), an adaptive dynamic programming approach was proposed to model stochastic dynamic resource allocation. It estimates the returns of future states using a piecewise linear function and delivers actions (assigning orders to vehicles, reallocating available vehicles) given the states and one-step future state values, by solving an integer programming problem. In (Godfrey and Powell, 2002b), the authors further extended the approach to situations in which an action can span multiple time periods. These methods are hard to apply directly in the real-world setting, where orders can be served by vehicles located in multiple nearby locations.

Multi-agent reinforcement learning. Another relevant research topic is multi-agent reinforcement learning (Busoniu et al., 2008), where a group of agents share the same environment, in which they receive rewards and take actions. Tan (1993) compared and contrasted independent $Q$-learning and a cooperative counterpart in different settings, and empirically showed that the learning speed can benefit from cooperative $Q$-learning. Independent $Q$-learning is extended into DRL in (Tampuu et al., 2017), where two agents cooperate or compete with each other only through the reward. In (Foerster et al., 2017), the authors proposed a counterfactual multi-agent policy gradient method that uses a centralized advantage to estimate whether the action of one agent would improve the global reward, and decentralized actors to optimize the agent policy. Lowe et al. also utilized the framework of decentralized execution and centralized training to develop a multi-agent actor-critic algorithm that can coordinate agents in mixed cooperative-competitive environments (Lowe et al., 2017). However, none of these methods has been applied to settings with a large number of agents, due to the communication cost among agents. Recently, a few works (Zheng et al., 2017; Yang et al., 2018) scaled DRL methods to a large number of agents, but these methods are not directly applicable to complex real applications such as fleet management. In (Nguyen et al., 2017a, b), the authors studied large-scale multi-agent planning for fleet management by explicitly modeling the expected counts of agents.

Deep reinforcement learning. DRL utilizes neural network function approximation and has been shown to largely improve performance on challenging applications (Silver et al., 2016; Silver et al., 2017; Mnih et al., 2015). Many sophisticated DRL algorithms such as DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016) were demonstrated to be effective in tasks in which we have a clear understanding of the rules and easy access to millions of samples, such as video games (Brockman et al., 2016; Bellemare et al., 2013). However, DRL approaches are rarely applied in complicated real-world applications, especially those with a high-dimensional and non-stationary action space, no well-defined reward function, and a need for coordination among a large number of agents. In this paper, we show that through careful reformulation, DRL can be applied to tackle the fleet management problem.

3. Problem Statement

In this paper, we consider the problem of managing a large set of available homogeneous vehicles for online ride-sharing platforms. The goal of the management is to maximize the gross merchandise volume (GMV: the value of all the orders served) of the platform by repositioning available vehicles to locations with a larger demand-supply gap than the current one. This problem is a variant of the classical fleet management problem (Dejax and Crainic, 1987). A spatial-temporal illustration of the problem is available in Figure 1. In this example, we use a hexagonal-grid world to represent the map and split the duration of one day into time intervals of 10 minutes each. At each time interval, orders emerge stochastically in each grid and are served by the available vehicles in the same grid or in the six nearby grids. The goal of fleet management here is to decide, ahead of time, how many available vehicles to relocate from each grid to its neighbors, so that most orders can be served.

To tackle this problem, we propose to formulate the problem using multi-agent reinforcement learning (Busoniu et al., 2008). In this formulation, we use a set of homogeneous agents with small action spaces, and split the global reward among the grids. This leads to a much more efficient learning procedure than the single-agent setting, due to the simplified action dimension and the explicit credit assignment based on the split reward. Formally, we model the fleet management problem as a Markov game $G$ for $N$ agents, which is defined by a tuple $G = (N, \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $N$, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{R}$ and $\gamma$ are the number of agents, the set of states, the joint action space, the transition probability functions, the reward functions, and a discount factor, respectively. The definitions are given as follows:

  • Agent: We consider an available vehicle (or equivalently an idle driver) as an agent, and the vehicles in the same spatial-temporal node are homogeneous, i.e., the vehicles located in the same region at the same time interval are considered the same agent (they share the same policy). Although the number of unique heterogeneous agents is always $N$, the number of agents $N_t$ changes over time.

  • State $s_t \in \mathcal{S}$: We maintain a global state $s_t$ at each time $t$, considering the spatial distributions of available vehicles and orders (i.e. the number of available vehicles and orders in each grid) and the current time $t$ (using one-hot encoding). The state of an agent $i$, $s_t^i$, is defined as the identification of the grid in which it is located together with the shared global state, i.e. $s_t^i = [s_t, g_j]$, where $g_j$ is the one-hot encoding of the grid ID. We note that agents located in the same grid have the same state $s_t^i$.

  • Action $a_t \in \mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_{N_t}$: a joint action $a_t = \{a_t^i\}_{i=1}^{N_t}$ specifying the allocation strategy of all available vehicles at time $t$. The action space $\mathcal{A}_i$ of an individual agent specifies where the agent is able to arrive at the next time step, which gives a set of seven discrete actions denoted by $\{k\}_{k=1}^{7}$. The first six discrete actions indicate allocating the agent to one of its six neighboring grids, respectively. The last discrete action $a_t^i = 7$ means staying in the current grid. For example, the action $a_t^1 = 2$ means relocating the 1st agent from the current grid to the second nearby grid at time $t$, as shown in Figure 1. For a concise presentation, we also use $a_t^i \triangleq [g_0, g_1]$ to represent agent $i$ moving from grid $g_0$ to $g_1$. Furthermore, the action space of an agent depends on its location: agents located at corner grids have a smaller action space. We also assume that the action is deterministic: if $a_t^i \triangleq [g_0, g_1]$, then agent $i$ will arrive at grid $g_1$ at time $t+1$.

  • Reward function $R_i \in \mathcal{R} = \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: Each agent is associated with a reward function $R_i$, and all agents in the same location have the same reward function. The $i$-th agent attempts to maximize its own expected discounted return: $\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}^i\right]$. The individual reward $r_{t+1}^i$ for the $i$-th agent associated with the action $a_t^i$ is defined as the averaged revenue of all agents arriving at the same grid as the $i$-th agent at time $t+1$. Since the individual rewards at the same time and the same location are the same, we denote the reward of agents at time $t$ and grid $g_j$ as $r_t(g_j)$. This reward design aims at avoiding greedy actions that send too many agents to a location with a high value of orders, and at aligning the maximization of each agent's return with the maximization of GMV (the value of all served orders in one day). Its effectiveness is empirically verified in Sec 6.

  • State transition probability $p(s_{t+1} \mid s_t, a_t): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$: It gives the probability of transitioning to $s_{t+1}$ given that a joint action $a_t$ is taken in the current state $s_t$. Notice that although the action is deterministic, new vehicles and orders become available at different grids at each time, and existing vehicles go off-line via a random process.
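As a rough illustration of the state definition above, the following Python sketch assembles a global state from per-grid vehicle and order counts plus a one-hot time encoding, and then appends a one-hot grid ID to obtain an agent state. The function names, the ordering of the concatenated parts and the toy sizes are assumptions made for illustration; the paper does not specify an implementation.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def global_state(vehicles: List[int], orders: List[int],
                 t: int, num_intervals: int) -> List[float]:
    """Concatenate per-grid vehicle counts, per-grid order counts,
    and a one-hot encoding of the current time interval (assumed layout)."""
    if len(vehicles) != len(orders):
        raise ValueError("vehicles and orders must cover the same grids")
    return ([float(v) for v in vehicles]
            + [float(o) for o in orders]
            + one_hot(t, num_intervals))


def agent_state(s_t: List[float], grid_id: int, num_grids: int) -> List[float]:
    """Agent state = shared global state followed by the one-hot grid ID."""
    return s_t + one_hot(grid_id, num_grids)


# Toy example: 3 grids and 4 time intervals (illustrative sizes only).
s_t = global_state(vehicles=[2, 0, 1], orders=[1, 3, 0], t=1, num_intervals=4)
s_i = agent_state(s_t, grid_id=2, num_grids=3)
```

Because the grid ID is the only agent-specific part, all agents in the same grid receive an identical state vector, which matches the homogeneity assumption in the agent definition.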

To be more concrete, we give an example based on the above problem setting in Figure 1. At time $t = 0$, agent 1 is repositioned from $g_0$ to $g_2$ by action $a_0^1$, and agent 2 is also repositioned from $g_1$ to $g_2$ by action $a_0^2$. At time $t = 1$, the two agents arrive at $g_2$, and a new order with value $v$ also emerges at the same grid. Therefore, the reward $r_1$ for both $a_0^1$ and $a_0^2$ is the averaged value received by the agents at $g_2$, which is $v / 2$.
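The averaged reward in this example can be sketched in a few lines of Python. This is a minimal sketch, assuming the revenue served at each grid is already known; the function name and the order value of 10 are hypothetical and chosen only to make the arithmetic concrete.

```python
from collections import Counter
from typing import Dict, List


def averaged_rewards(arrivals: List[int],
                     revenue_per_grid: Dict[int, float]) -> List[float]:
    """arrivals[i] is the grid that agent i arrives at; revenue_per_grid[g] is
    the total order value served at grid g in that interval. Each agent gets
    the grid's revenue divided by the number of agents arriving there."""
    counts = Counter(arrivals)
    return [revenue_per_grid.get(g, 0.0) / counts[g] for g in arrivals]


# Two agents arrive at grid 2, where one order of (hypothetical) value 10 is served.
rewards = averaged_rewards(arrivals=[2, 2], revenue_per_grid={2: 10.0})
```

Sending a third agent to the same grid would lower every agent's share, which is how this reward discourages crowding a single high-value location.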

Figure 1. The grid world system and a spatial-temporal illustration of the problem setting.

4. Contextual Multi-Agent Reinforcement Learning

In this section, we present two novel contextual multi-agent RL approaches: contextual multi-agent actor-critic (cA2C) and contextual DQN (cDQN) algorithm. We first briefly introduce the basic multi-agent RL method.

4.1. Independent DQN

Independent DQN (Tampuu et al., 2017) combines independent $Q$-learning (Tan, 1993) and DQN (Mnih et al., 2015). A straightforward extension of independent DQN from a small number to a large number of agents is to share network parameters and distinguish agents by their IDs (Zheng et al., 2017). The network parameters can be updated by minimizing the following loss function over the transitions collected from all agents:

$$\mathbb{E}_{\text{sample}}\left[\left( Q(s_t^i, a_t^i; \theta) - \left( r_{t+1}^i + \gamma \max_{a_{t+1}^i} Q(s_{t+1}^i, a_{t+1}^i; \theta') \right) \right)^2\right], \qquad (1)$$

where $\theta'$ denotes the parameters of the target network, which is updated periodically, and $\theta$ denotes the parameters of the behavior network, which outputs the action values for the $\epsilon$-greedy policy, as in the algorithm described in (Mnih et al., 2015). This method can work reasonably well after extensive tuning, but it suffers from high variance in performance and repositions too many vehicles. Moreover, coordination among massive numbers of agents is hard to achieve, since each unique agent executes its action independently based on its own action values.
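The loss in Eq (1) can be illustrated with a small Python sketch. To keep it self-contained, the action values are passed in as plain lists standing in for the outputs of the behavior network and the target network; the function names and this tuple layout are assumptions for illustration, not the authors' code.

```python
from typing import List, Sequence, Tuple

# (Q(s, .; theta), action index, reward r_{t+1}, Q(s', .; theta'))
Transition = Tuple[Sequence[float], int, float, Sequence[float]]


def td_target(reward: float, next_q_target: Sequence[float], gamma: float) -> float:
    """r_{t+1} + gamma * max_a' Q(s_{t+1}, a'; theta')."""
    return reward + gamma * max(next_q_target)


def dqn_loss(batch: List[Transition], gamma: float) -> float:
    """Mean squared TD error over a batch of transitions from all agents."""
    errors = []
    for q_s, action, reward, q_next in batch:
        y = td_target(reward, q_next, gamma)
        errors.append((q_s[action] - y) ** 2)
    return sum(errors) / len(errors)


batch = [
    ([1.0, 2.0], 1, 1.0, [0.0, 2.0]),  # target 2.0, prediction 2.0
    ([0.0, 0.0], 0, 1.0, [1.0, 3.0]),  # target 2.5, prediction 0.0
]
loss = dqn_loss(batch, gamma=0.5)
```

In a real training loop the gradient of this loss would be taken with respect to the behavior network parameters only, with the target network held fixed between periodic updates.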

4.2. Contextual DQN

Since we assume that the location transition of an agent after the allocation action is deterministic, the actions that lead agents to the same grid should have the same action value. In this case, the number of unique action values for all agents should be equal to the number of grids $N$. Formally, for any agent $i$ where $s_t^i = [s_t, g_i]$, $a_t^i \triangleq [g_i, g_d]$ and $g_i \in Ner(g_d)$, the following holds:

$$Q(s_t^i, a_t^i) = Q(s_t, g_d). \qquad (2)$$

Hence, at each time step, we only need $N$ unique action values ($Q(s_t, g_j), \forall j = 1, \dots, N$), and the optimization of Eq (1) can be replaced by minimizing the following mean-squared loss:

$$\left[ Q(s_t, g_d; \theta) - \left( r_{t+1}(g_d) + \gamma \max_{g_p \in Ner(g_d)} Q(s_{t+1}, g_p; \theta') \right) \right]^2. \qquad (3)$$

This accelerates the learning procedure, since the output dimension of the action value function is reduced from $\mathbb{R}^{|s_t|} \to \mathbb{R}^7$ to $\mathbb{R}^{|s_t|} \to \mathbb{R}$. Furthermore, we can build a centralized action-value table at each time for all agents, which can serve as the foundation for coordinating the actions of agents.
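A minimal Python sketch of the per-grid target used in the mean-squared loss above follows. It assumes that the neighborhood of a grid includes the grid itself (so that staying is a candidate), and that rewards and target-network values are available as lists indexed by grid ID; the names and toy numbers are illustrative only.

```python
from typing import Dict, List


def cdqn_targets(rewards: List[float], next_q: List[float],
                 neighbors: Dict[int, List[int]], gamma: float) -> List[float]:
    """For every destination grid g, compute
    r_{t+1}(g) + gamma * max over grids p reachable from g of Q(s_{t+1}, p; theta')."""
    return [rewards[g] + gamma * max(next_q[p] for p in neighbors[g])
            for g in range(len(rewards))]


# Toy line of three grids; each neighborhood includes the grid itself (assumption).
rewards = [1.0, 0.0, 2.0]
next_q = [3.0, 1.0, 0.0]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
targets = cdqn_targets(rewards, next_q, neighbors, gamma=0.5)
```

Only one target per grid is needed, regardless of how many agents move there, which is the source of the reduction in unique action values.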

Geographic context. In hexagonal grid systems, border grids and grids surrounded by infeasible grids (e.g., a lake) have reduced action dimensions. To accommodate this, for each grid $g_j$ we compute a geographic context $G_{g_j} \in \mathbb{R}^7$, which is a binary vector that filters out invalid actions for agents in grid $g_j$. The $k$-th element of the vector, $[G_{g_j}]_k$, represents the validity of moving in the $k$-th direction from grid $g_j$. Denoting by $g_d$ the grid that corresponds to the $k$-th direction of grid $g_j$, the value of the $k$-th element of $G_{g_j}$ is given by:

$$[G_{g_j}]_k = \begin{cases} 1, & \text{if } g_d \text{ is a valid grid}, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $k = 1, \dots, 7$ and the last dimension of the vector represents staying in the same grid, which is always 1.

Collaborative context. To avoid the situation in which agents move in conflicting directions (i.e., agents are repositioned from grid $g_1$ to $g_2$ and from $g_2$ to $g_1$ at the same time), we provide a collaborative context $C_{t, g_j} \in \mathbb{R}^7$ for each grid $g_j$ at each time. Based on the centralized action values $Q(s_t, g_j)$, we restrict the valid actions such that agents at grid $g_j$ either navigate to neighboring grids with higher action values or stay where they are. Therefore, the binary vector $C_{t, g_j}$ eliminates actions to grids whose action values are lower than that of staying. Formally, the $k$-th element of the vector $C_{t, g_j}$, which corresponds to the action value $Q(s_t, g_i)$, is defined as follows:

$$[C_{t, g_j}]_k = \begin{cases} 1, & \text{if } Q(s_t, g_i) \geq Q(s_t, g_j), \\ 0, & \text{otherwise}. \end{cases} \qquad (5)$$

After computing both the collaborative and the geographic context, the $\epsilon$-greedy policy is performed on the action values that survive the two contexts. Suppose the original action values of agent $i$ at time $t$ are $Q(s_t^i) \in \mathbb{R}^7_{\geq 0}$, given state $s_t^i$. The valid action values after applying the contexts are:

$$q(s_t^i) = Q(s_t^i) * C_{t, g_j} * G_{g_j}, \qquad (6)$$

where $*$ denotes element-wise multiplication. Coordination is enabled because the action values of different agents that lead to the same location are restricted to be the same, so that they can be compared, which is impossible in independent DQN. This method requires that action values are always non-negative, which holds because agents always receive non-negative rewards. The cDQN algorithm is elaborated in Alg 2.
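The two contexts and the masked $\epsilon$-greedy selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: `neighbor_ids` lists the six directions (with `None` for an invalid grid) followed by the agent's own grid, `q_table` holds the centralized action values indexed by grid ID, and actions are indexed from 0 in the code rather than from 1. The function names are hypothetical.

```python
import random
from typing import List, Optional

STAY = 6  # index of the "stay" action in the assumed 0-based ordering


def geographic_context(neighbor_ids: List[Optional[int]]) -> List[int]:
    """1 for each direction that leads to a valid grid, 0 otherwise."""
    return [0 if g is None else 1 for g in neighbor_ids]


def collaborative_context(neighbor_ids: List[Optional[int]],
                          q_table: List[float]) -> List[int]:
    """1 where the destination's value is at least the value of staying."""
    stay_value = q_table[neighbor_ids[STAY]]
    return [1 if g is not None and q_table[g] >= stay_value else 0
            for g in neighbor_ids]


def select_action(neighbor_ids: List[Optional[int]], q_table: List[float],
                  epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy choice restricted to actions allowed by both contexts."""
    geo = geographic_context(neighbor_ids)
    col = collaborative_context(neighbor_ids, q_table)
    valid = [k for k, (a, b) in enumerate(zip(geo, col)) if a * b == 1]
    if rng.random() < epsilon:
        return rng.choice(valid)  # explore among valid actions only
    # Exploit: with positive values, argmax over valid actions equals
    # argmax of the element-wise masked action values.
    return max(valid, key=lambda k: q_table[neighbor_ids[k]])


q_table = [2.0, 3.0, 1.0, 2.5]                 # centralized values for grids 0..3
neighbor_ids = [1, 2, None, 3, None, None, 0]  # agent sits in grid 0
rng = random.Random(0)
greedy_action = select_action(neighbor_ids, q_table, epsilon=0.0, rng=rng)
random_action = select_action(neighbor_ids, q_table, epsilon=1.0, rng=rng)
```

The stay action always survives both masks, so the set of valid actions is never empty.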

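Algorithm 1 $\epsilon$-greedy policy for cDQN
Require: Global state $s_t$
1:  Compute the centralized action values $Q(s_t, g_j), \forall j = 1, \dots, N$
2:  for $i = 1$ to $N_t$ do
3:     Compute action values $Q^i$ by Eq (2), where $(Q^i)_k = Q(s_t^i, a_t^i = k)$.
4:     Compute contexts $C_{t, g_j}$ and $G_{g_j}$ for agent $i$.
5:     Compute valid action values $q_t^i = Q_t^i * C_{t, g_j} * G_{g_j}$.
6:     $a_t^i = \arg\max_k q_t^i$ with probability $1 - \epsilon$, otherwise choose an action randomly from the valid actions.
7:  end for
8:  return  Joint action $a_t = \{a_t^i\}_{i=1}^{N_t}$.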
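Algorithm 2 Contextual Deep Q-learning (cDQN)
1:  Initialize replay memory $D$ to capacity $M$
2:  Initialize the action-value function with random weights $\theta$ or pre-trained parameters.
3:  for $m = 1$ to max-iterations do
4:     Reset the environment and reach the initial state $s_0$.
5:     for $t = 0$ to $T$ do
6:        Sample joint action $a_t$ using Alg. 1, given $s_t$.
7:        Execute $a_t$ in the simulator and observe reward $r_{t+1}$ and next state $s_{t+1}$.
8:        Store the transitions of all agents ($s_t^i, a_t^i, r_{t+1}^i, s_{t+1}^i, \forall i = 1, \dots, N_t$) in $D$.
9:     end for
10:     for $m_1 = 1$ to $M_1$ do
11:        Sample a batch of transitions ($s_t^i, a_t^i, r_{t+1}^i, s_{t+1}^i$) from $D$, where $t$ can be different within one batch.
12:        Compute target $y_t^i = r_{t+1}^i + \gamma \max_{a_{t+1}^i} Q(s_{t+1}^i, a_{t+1}^i; \theta')$.
13:        Update the $Q$-network by a gradient descent step on $(y_t^i - Q(s_t^i, a_t^i; \theta))^2$ with respect to $\theta$.
14:     end for
15:  end for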

4.3. Contextual Actor-Critic

We now present the contextual multi-agent actor-critic (cA2C) algorithm, a multi-agent policy gradient algorithm that tailors its policy to the dynamically changing action space. Meanwhile, it achieves not only more stable performance but also a much more efficient learning procedure in a non-stationary environment. There are two main ideas in the design of cA2C: 1) a centralized value function shared by all agents with an expected update; 2) policy context embedding, which establishes explicit coordination among agents, enables faster training and offers the flexibility of adapting the policy to different action spaces. The centralized state-value function is learned by minimizing the following loss function derived from the Bellman equation:

$$L(\theta_v) = \left( V_{\theta_v}(s_t^i) - V_{\text{target}}(s_{t+1}; \theta_v', \pi) \right)^2, \qquad (7)$$

$$V_{\text{target}}(s_{t+1}; \theta_v', \pi) = \sum_{a_t^i} \pi(a_t^i \mid s_t^i) \left( r_{t+1}^i + \gamma V_{\theta_v'}(s_{t+1}^i) \right), \qquad (8)$$

where we use $\theta_v$ to denote the parameters of the value network and $\theta_v'$ to denote those of the target value network. Since agents located in the same grid at the same time are treated as homogeneous and share the same internal state, there are $N$ unique agent states, and thus $N$ unique state values ($V(s_t, g_j), \forall j = 1, \dots, N$) at each time. The state-value output is denoted by $v_t \in \mathbb{R}^N$, where each element $(v_t)_j = V(s_t, g_j)$ is the expected return received by an agent arriving at grid $g_j$ at time $t$. In order to stabilize learning of the value function, we fix a target value network parameterized by $\theta_v'$, which is updated at the end of each episode. Note that the expected update in Eq (7) and the training of actor and critic in an offline fashion differ from the updates in $n$-step actor-critic online training using the TD error (Mnih et al., 2016); the expected updates and this training paradigm are found to be more stable and sample-efficient. Furthermore, efficient coordination among multiple agents can be established upon this centralized value network.
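The expected update can be illustrated with a short Python sketch: the target averages the one-step return over the agent's action distribution instead of using only the sampled action. The per-action rewards and next-state values are passed in as plain lists standing in for simulator and target-network outputs; the function names and numbers are assumptions for illustration.

```python
from typing import Sequence


def expected_value_target(policy_probs: Sequence[float],
                          rewards: Sequence[float],
                          next_values: Sequence[float],
                          gamma: float) -> float:
    """Sum over actions of pi(a|s) * (r_{t+1}(a) + gamma * V'(s_{t+1} after a))."""
    return sum(p * (r + gamma * v)
               for p, r, v in zip(policy_probs, rewards, next_values))


def value_loss(v_pred: float, v_target: float) -> float:
    """Squared error between the value prediction and the expected target."""
    return (v_pred - v_target) ** 2


# Two valid actions with equal probability (toy numbers).
target = expected_value_target(policy_probs=[0.5, 0.5],
                               rewards=[1.0, 3.0],
                               next_values=[2.0, 0.0],
                               gamma=0.5)
loss = value_loss(v_pred=2.0, v_target=target)
```

Averaging over the policy removes the sampling noise of a single action from the target, which is one plausible reading of why the expected update is reported to be more stable.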

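Policy Context Embedding. Coordination is achieved by masking the available action space based on the context. At each time step, the geographic context is given by Eq (4) and the collaborative context is computed according to the value network output:

$$[C_{t, g_j}]_k = \begin{cases} 1, & \text{if } V_{\theta_v}(s_t, g_i) \geq V_{\theta_v}(s_t, g_j), \\ 0, & \text{otherwise}, \end{cases} \qquad (9)$$

where the $k$-th element of the vector $C_{t, g_j}$ corresponds to the probability of the $k$-th action $\mathbb{P}(a_t^i = k)$. Let $P(s_t^i) \in \mathbb{R}^7_{>0}$ denote the original logits from the policy network output for the $i$-th agent conditioned on state $s_t^i$. Let $q_{\text{valid}}(s_t^i) = P(s_t^i) * C_{t, g_j} * G_{g_j}$ denote the valid logits considering both the geographic and the collaborative context for agent $i$ at grid $g_j$, where $*$ denotes element-wise multiplication. In order to achieve effective masking, we restrict the output logits $P(s_t^i)$ to be positive. The probabilities of valid actions for all agents in grid $g_j$ are given by:

$$\pi_{\theta_p}(a_t^i = k \mid s_t^i) = \frac{[q_{\text{valid}}(s_t^i)]_k}{\lVert q_{\text{valid}}(s_t^i) \rVert_1}. \qquad (10)$$

The gradient of the policy can then be written as:

$$\nabla_{\theta_p} J(\theta_p) = \nabla_{\theta_p} \log \pi_{\theta_p}(a_t^i \mid s_t^i) \, A(s_t^i, a_t^i), \qquad (11)$$

where $\theta_p$ denotes the parameters of the policy network and the advantage $A(s_t^i, a_t^i)$ is computed as follows:

$$A(s_t^i, a_t^i) = r_{t+1}^i + \gamma V_{\theta_v'}(s_{t+1}^i) - V_{\theta_v}(s_t^i). \qquad (12)$$

The detailed description of cA2C is summarized in Alg 4.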
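The masking of logits and the advantage computation can be sketched in plain Python as follows. The sketch assumes positive logits and binary context vectors of length seven; the function names and the toy values are hypothetical.

```python
from typing import List, Sequence


def masked_policy(logits: Sequence[float], collab: Sequence[int],
                  geo: Sequence[int]) -> List[float]:
    """Multiply positive logits by both binary contexts, then normalize
    by the L1 norm so that invalid actions get probability zero."""
    valid = [l * c * g for l, c, g in zip(logits, collab, geo)]
    total = sum(valid)  # positive because the stay action is always valid
    return [v / total for v in valid]


def advantage(reward: float, next_value: float, value: float, gamma: float) -> float:
    """r_{t+1} + gamma * V'(s_{t+1}) - V(s_t)."""
    return reward + gamma * next_value - value


logits = [1.0, 2.0, 1.0, 4.0, 1.0, 1.0, 1.0]  # e.g. outputs of a ReLU + 1 layer
collab = [1, 0, 0, 1, 0, 0, 1]
geo = [1, 1, 0, 1, 0, 0, 1]
probs = masked_policy(logits, collab, geo)
adv = advantage(reward=1.0, next_value=2.0, value=1.5, gamma=0.5)
```

Normalizing by the L1 norm rather than applying a softmax is what allows a zero mask entry to translate into exactly zero probability.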

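Algorithm 3 Contextual Multi-agent Actor-Critic Policy forward
Require: The global state $s_t$.
1:  Compute the centralized state values $v_t$
2:  for $i = 1$ to $N_t$ do
3:     Compute contexts $C_{t, g_j}$ and $G_{g_j}$ for agent $i$.
4:     Compute the action probability distribution $\pi_{\theta_p}(\cdot \mid s_t^i)$ for agent $i$ in grid $g_j$ as Eq (10).
5:     Sample an action for agent $i$ in grid $g_j$ based on the action probability $\pi_{\theta_p}(\cdot \mid s_t^i)$.
6:  end for
7:  return  Joint action $a_t = \{a_t^i\}_{i=1}^{N_t}$.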
Figure 2. Illustration of contextual multi-agent actor-critic. The left part shows the coordination of decentralized execution based on the output of centralized value network. The right part illustrates embedding context to policy network.
Algorithm 4 Contextual Multi-agent Actor-Critic Algorithm for $N$ agents
1:  Initialization:
2:  Initialize the value network with a fixed value table.
3:  for $m = 1$ to max-iterations do
4:     Reset the environment, get initial state $s_0$.
5:     Stage 1: Collecting experience
6:     for $t = 0$ to $T$ do
7:        Sample actions $a_t$ according to Alg 3, given $s_t$.
8:        Execute $a_t$ in the simulator and observe reward $r_{t+1}$ and next state $s_{t+1}$.
9:        Compute the value network target as Eq (8) and the advantage as Eq (12) for the policy network, and store the transitions.
10:     end for
11:     Stage 2: Updating parameters
12:     for $m_1 = 1$ to $M_1$ do
13:        Sample a batch of experience: $s_t^i$ and the corresponding target $V_{\text{target}}$ from Eq (8).
14:        Update the value network by minimizing the value loss Eq (7) over the batch.
15:     end for
16:     for $m_2 = 1$ to $M_2$ do
17:        Sample a batch of experience: $s_t^i, a_t^i, A(s_t^i, a_t^i), C_{t, g_j}, G_{g_j}$, where $t$ can be different within one batch.
18:        Update the policy network as $\theta_p \leftarrow \theta_p + \nabla_{\theta_p} J(\theta_p)$.
19:     end for
20:  end for

5. Simulator Design

Unlike standard supervised learning problems, where the data is stationary with respect to the learning algorithm and can be evaluated by the training-testing paradigm, the interactive nature of RL introduces intricate difficulties in training and evaluation. One common solution in traffic studies is to build simulators for the environment (Wei et al., 2017; Seow et al., 2010; Maciejewski and Nagel, 2013). In this section, we introduce a simulator design that models the generation of orders, the procedure of assigning orders, and key driver behaviors such as distribution across the city and on-line/off-line status control in the real world. The simulator serves as the training environment for RL algorithms, as well as for their evaluation. More importantly, our simulator allows us to calibrate the key performance index against the historical data collected from a fleet management system, so that the learned policies are well aligned with real-world traffic.

Data Description. The data provided by Didi Chuxing includes orders and trajectories of vehicles in the central area of a city (Chengdu) over four consecutive weeks. The city is covered by a hexagonal grid world consisting of 504 grids. The order information includes order price, origin, destination and duration. The trajectories contain the positions (latitude and longitude) and status (on-line, off-line, on-service) of all vehicles every few seconds.

Timeline Design. In one time interval (10 minutes), the main activities are conducted sequentially, as illustrated in Figure 4.

  • Vehicle status updates: Vehicles are set offline (i.e. off from service) or online (i.e. start working) following a distribution learned from real data using maximum likelihood estimation.

  • Order generation: The new orders generated at the current time step are bootstrapped from real orders that occurred in the same time interval. Since orders naturally reposition vehicles over a wide range, this procedure keeps the repositioning caused by orders similar to that in the real data.

  • Interact with agents: This step computes the state as input to the fleet management algorithm and applies the allocations to the agents.

  • Order assignments: All available orders are assigned through a two-stage procedure. In the first stage, the orders in one grid are assigned to the vehicles in the same grid. In the second stage, the remaining unfilled orders are assigned to the vehicles in its neighboring grids. This two-stage procedure is essential to simulate real-world activities.
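The two-stage order assignment in the list above can be sketched as follows. This is a simplified sketch that works on per-grid counts rather than individual orders with prices; the order in which grids and neighbors are visited is an assumption, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def assign_orders(orders: List[int], vehicles: List[int],
                  neighbors: Dict[int, List[int]]) -> Tuple[int, List[int]]:
    """Return (number of served orders, remaining vehicles per grid)."""
    orders = list(orders)
    vehicles = list(vehicles)
    served = 0

    # Stage 1: match orders with vehicles located in the same grid.
    for g in range(len(orders)):
        matched = min(orders[g], vehicles[g])
        orders[g] -= matched
        vehicles[g] -= matched
        served += matched

    # Stage 2: match the remaining orders with vehicles in neighboring grids.
    for g in range(len(orders)):
        for n in neighbors[g]:
            if orders[g] == 0:
                break
            matched = min(orders[g], vehicles[n])
            orders[g] -= matched
            vehicles[n] -= matched
            served += matched

    return served, vehicles


# Toy line of three grids: grid 0 has three orders but only one vehicle.
served, remaining = assign_orders(orders=[3, 0, 1], vehicles=[1, 2, 0],
                                  neighbors={0: [1], 1: [0, 2], 2: [1]})
```

In this toy case both of grid 1's vehicles are consumed by grid 0's leftover orders, so the order in grid 2 stays unserved, which illustrates why the visiting order matters in stage 2.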

Calibration. The effectiveness of the simulator is supported by calibration against the real data with respect to the most important performance measure: the gross merchandise volume (GMV). As shown in Figure 3, after the calibration procedure, the GMV in the simulator is very similar to that of the ride-sharing platform. The between simulated GMV and real GMV is and the Pearson correlation is with $p$-value .

Figure 3. The simulator calibration in terms of GMV. The red curves plot the GMV values of real data averaged over 7 days with standard deviation, in 10-minute time granularity. The blue curves are simulated results averaged over 7 episodes.

Figure 4. Simulator time line in one time step (10 minutes).

6. Experiments

In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed method.

6.1. Experimental settings

In the following experiments, both training and evaluation are conducted on the simulator introduced in Sec 5. For all the competing methods, we prescribe two sets of random seeds that control the dynamics of the simulator for training and evaluation, respectively. Examples of dynamics in the simulator include order generation and the stochastic status updates of all vehicles. In this setting, we can test the generalization performance of the algorithms when they encounter unseen dynamics, as in real scenarios. The performance is measured by the GMV (the total value of orders served in the simulator) gained by the platform over one episode (144 time steps in the simulator), and by the order response rate, which is the averaged number of orders served divided by the number of orders generated. We use the first 15 episodes for training and conduct evaluation on the following ten episodes for all learning methods. The number of available vehicles at each time in different locations is counted by a pre-dispatch procedure. This procedure runs a virtual two-stage order dispatching process to compute the remaining available vehicles in each location. On average, the simulator has 5356 agents per time step waiting for management. All the quantitative results of learning methods presented in this section are averaged over three runs.
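The two evaluation metrics can be written down directly. The following sketch assumes that the values of served orders and the order counts for an episode have already been collected; the function names are hypothetical.

```python
from typing import Sequence


def gmv(served_order_values: Sequence[float]) -> float:
    """Gross merchandise volume: total value of all served orders."""
    return sum(served_order_values)


def order_response_rate(num_served: int, num_generated: int) -> float:
    """Fraction of generated orders that were served (0.0 if none were generated)."""
    return num_served / num_generated if num_generated else 0.0


episode_gmv = gmv([10.0, 5.5])
episode_rate = order_response_rate(num_served=3, num_generated=4)
```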

6.2. Performance comparison

In this subsection, the performance of the following methods is extensively evaluated in simulation.

  • [leftmargin=0.1in]

  • Simulation: This baseline simulates the real scenario without any fleet management. The simulated results are calibrated with real data in Sec 5.

  • Diffusion: This method diffuses available vehicles to neighboring grids randomly.

  • Rule-based: This baseline computes a value table , where each element represents the averaged reward of an agent staying in grid at time step . The rewards are averaged over ten episodes controlled by random seeds that are different with testing episodes. With the value table, the agent samples its action based on the probability mass function normalized from the values of neighboring grids at the next time step. For example, if an agent located in at time and the current valid actions are and , the rule-based method sample its actions from .

  • Value-Iter: It dynamically updates the value table based on policy evaluation (Sutton and Barto, 1998). The allocation policy is computed from the new value table in the same way as in the rule-based method, while also taking the collaborative context into account.

  • T-Q learning: The standard independent tabular Q-learning (Sutton and Barto, 1998) learns a table with an ε-greedy policy. In this case the state reduces to time and the location of the agent.

  • T-SARSA: The independent tabular SARSA (Sutton and Barto, 1998) learns a table with the same setting of states as T-Q learning.

  • DQN: The independent DQN is currently the state of the art, as we introduced in Sec 4.1. Our Q-network is parameterized by a three-layer network with ELU activations (Clevert et al., 2015), and we adopt the ε-greedy policy as the agent policy. The ε is annealed linearly from 0.5 to 0.1 across the first 15 training episodes and held fixed during testing.

  • cDQN: The contextual DQN as we introduced in Sec 4.2. The ε is annealed in the same way as in DQN. At the end of each episode, the Q-network is updated over 4000 batches (the batch count used in Alg 2). To ensure a valid context masking, the activation function of the output layer of the Q-network is ReLU + 1.

  • cA2C: The contextual multi-agent actor-critic as we introduced in Sec 4.3. At the end of each episode, both the policy network and the value network are updated over 4000 batches (the batch count used in Alg 2). Similar to cDQN, the output layer of the policy network uses ReLU + 1 as the activation function to ensure that all elements in the original logits are positive.
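Two mechanics from the list above can be sketched in a few lines: the value-proportional sampling of the rule-based baseline and the linear ε schedule of the DQN variants. This is only an illustration under stated assumptions. All function names are hypothetical, the value table is reduced to a plain dictionary of next-step grid values, the uniform fallback for non-positive values is an added safeguard, and ε is assumed to be updated once per episode.

```python
import random
from typing import Dict, Hashable, List


def value_proportional_probs(next_step_values: Dict[Hashable, float],
                             valid_destinations: List[Hashable]) -> Dict[Hashable, float]:
    """Normalize next-step grid values over the valid destinations into a pmf."""
    vals = [next_step_values[g] for g in valid_destinations]
    total = sum(vals)
    if total <= 0:
        # Assumption: fall back to a uniform choice when no value is positive.
        return {g: 1.0 / len(valid_destinations) for g in valid_destinations}
    return {g: v / total for g, v in zip(valid_destinations, vals)}


def sample_destination(probs: Dict[Hashable, float], rng: random.Random) -> Hashable:
    """Draw one destination grid according to the pmf."""
    grids = list(probs)
    return rng.choices(grids, weights=[probs[g] for g in grids], k=1)[0]


def linear_epsilon(episode: int, start: float = 0.5, end: float = 0.1,
                   anneal_episodes: int = 15) -> float:
    """Linearly anneal epsilon from `start` to `end` over the training episodes.

    Episodes are 0-indexed; epsilon stays at `end` after the last training episode.
    """
    if anneal_episodes <= 1:
        return end
    clipped = min(max(episode, 0), anneal_episodes - 1)
    return start + (end - start) * clipped / (anneal_episodes - 1)
```

For instance, with next-step values of 2.0 and 6.0 for two valid destination grids, the sketch assigns them probabilities 0.25 and 0.75.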

Except for the first baseline, the geographic context is considered in all methods so that the agents will not navigate to invalid grids. Unless otherwise specified, the value function approximations and the policy network in the contextual algorithms are parameterized by a three-layer ReLU network (He et al., 2016) with node sizes of 128, 64 and 32, from the first layer to the third layer. The batch size of all deep learning methods is fixed as 3000, and we use AdamOptimizer with a learning rate of . Since the performance of DQN varies a lot when there is a large number of agents, the first column of Table 1 for DQN is averaged over the best three runs out of six, and the results for all other methods are averaged over three runs. Also, the centralized critics of cDQN and cA2C are initialized from a pre-trained value network using the historical mean of order values computed from ten simulated episodes, whose random seeds differ from those used in both training and evaluation.
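The role of the ReLU + 1 output activation in context masking can be seen in a small sketch. It assumes the mask is a binary vector multiplied element-wise with the network outputs, and that the policy is obtained by normalizing the masked outputs to sum to one; both details, and all function names, are assumptions made for illustration rather than the exact procedure of Sec 4.

```python
from typing import List, Sequence


def relu_plus_one(raw: Sequence[float]) -> List[float]:
    """ReLU + 1: every output is at least 1, hence strictly positive."""
    return [max(x, 0.0) + 1.0 for x in raw]


def _apply_mask(raw: Sequence[float], valid_mask: Sequence[int]) -> List[float]:
    """Zero out invalid actions; valid ones keep a strictly positive score."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    return [p * m for p, m in zip(relu_plus_one(raw), valid_mask)]


def masked_policy(raw: Sequence[float], valid_mask: Sequence[int]) -> List[float]:
    """Action probabilities after masking (normalization by sum is an assumption)."""
    masked = _apply_mask(raw, valid_mask)
    total = sum(masked)
    return [v / total for v in masked]


def masked_greedy_action(raw: Sequence[float], valid_mask: Sequence[int]) -> int:
    """Greedy action over masked values; it can never be an invalid action."""
    masked = _apply_mask(raw, valid_mask)
    return max(range(len(masked)), key=masked.__getitem__)
```

With a plain ReLU, negative raw outputs would all map to zero and become indistinguishable from masked entries. Adding one keeps every valid action strictly above the masked ones.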

To test the robustness of the proposed method, we evaluate all competing methods under different numbers of initial vehicles, and the results are summarized in Table 1. Diffusion improves the performance considerably, possibly because it sometimes encourages available vehicles to leave grids with a high density of available vehicles, which alleviates the imbalance. The Rule-based method, which repositions vehicles to the grids with a higher demand value, improves on random repositioning. Value-Iter dynamically updates the value table according to the policy currently applied, so that it further improves upon Rule-based. Comparing the results of Value-Iter, T-Q learning and T-SARSA, the first method consistently outperforms the latter two, possibly because the use of a centralized value table enables coordination, which helps to avoid conflicting repositions. The above methods simplify the state representation into a spatial-temporal value representation, whereas the DRL methods account for the complex dynamics of both supply and demand using neural network function approximations. As shown in the last three rows of Table 1, the deep learning methods outperform the preceding ones. Last but not least, the contextual algorithms (cA2C and cDQN) largely outperform both the independent DQN, which is the state of the art among large-scale multi-agent DRL methods, and all other competing methods.

initial vehicles initial vehicles initial vehicles
Normalized GMV Order response rate Normalized GMV Order response rate Normalized GMV Order response rate
T-Q learning
cDQN 114.29
cA2C 115.27 105.62
Table 1. Performance comparison of competing methods in terms of GMV and order response rate. For a fair comparison, the random seeds that control the dynamics of the environment are set to be the same across all methods.
Normalized GMV Order response rate Repositions
Table 2. The effectiveness of contextual multi-agent actor-critic considering dispatch costs.

6.3. Consideration of reposition costs

In reality, each reposition comes with a cost. In this subsection, we consider such reposition costs and estimate them by fuel costs. Since the travel distance from one grid to another is approximately 1.2km and the fuel cost is around 0.5 RMB/km, we set the cost of each reposition to 0.6 RMB (1.2 km × 0.5 RMB/km). In this setting, the definitions of agent, state, action and transition probability are the same as stated in Sec 3. The only difference is that the repositioning cost is included in the reward when the agent is repositioned to a different location. Therefore, the GMV of one episode is the sum of all served order values minus the total reposition cost in that episode. For example, the objective function for DQN now subtracts the reposition cost from the reward in its learning target: the cost is zero if the agent's destination grid is the same as its current grid, and the full reposition cost otherwise.

Similarly, we can consider the costs in cA2C. However, it is hard to apply them to cDQN, because the assumption that different actions leading to the same location share the same action value does not hold in this setting. Therefore we compare the two deep learning methods that achieve the best performance in the previous section. As shown in Table 2, DQN tends to reposition more agents, while cA2C achieves better performance in terms of both GMV and order response rate, with lower cost. The training procedures and the network architecture are the same as described in the previous section.
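One way to write such a cost-aware objective is sketched below, assuming the usual squared temporal-difference loss of DQN with a target network. The symbols (online parameters $\theta$, target parameters $\theta'$, discount $\gamma$, origin grid $g_o$, destination grid $g_d$, cost $c$) are notational assumptions made for this sketch and may differ from the notation of Sec 4.1. The value 0.6 follows from 1.2 km × 0.5 RMB/km.

```latex
\mathbb{E}\Big[\Big(Q(s_t^i, a_t^i; \theta)
  - \big(r_{t+1}^i - c + \gamma \max_{a'} Q(s_{t+1}^i, a'; \theta')\big)\Big)^2\Big],
\qquad a_t^i = [g_o, g_d],
\qquad
c = \begin{cases}
  0   & \text{if } g_d = g_o, \\
  0.6 & \text{otherwise.}
\end{cases}
```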

6.4. The effectiveness of averaged reward design

In multi-agent RL, the reward design for each agent is essential for the success of learning. In fully cooperative multi-agent RL, all agents receive a single global reward (Busoniu et al., 2008), which suffers from the credit assignment problem for each agent's action. Splitting the reward among the agents alleviates this problem. In this subsection, we compare two different designs for the reward of each agent: the averaged reward of a grid as stated in Sec 3, and the total reward of a grid, which is not averaged over the number of available vehicles at that time. As shown in Table 3, the methods with averaged reward (cA2C, cDQN) largely outperform those using total reward, since this design naturally encourages coordination among agents. Using total reward, on the other hand, is likely to reposition an excessive number of agents to locations with high demand.
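The difference between the two designs can be made concrete with a small sketch. It assumes that the averaged reward is the order value collected in a grid divided by the number of available vehicles there, as the text suggests; the exact definition is the one in Sec 3, and the function name is hypothetical.

```python
def per_agent_reward(grid_order_value: float, num_available_vehicles: int,
                     averaged: bool = True) -> float:
    """Reward credited to each agent that ends up in a grid.

    averaged=True  : order value shared equally among the available vehicles.
    averaged=False : every agent is credited the grid's total (raw) reward.
    """
    if num_available_vehicles <= 0:
        return 0.0
    if averaged:
        return grid_order_value / num_available_vehicles
    return grid_order_value
```

For a grid with an order value of 120 and four available vehicles, the averaged design credits 30 to each agent, whereas the total design credits 120 to each of them. Under the total design an additional vehicle does not lower anyone's reward, which is consistent with the over-repositioning toward high-demand locations noted above.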

Proposed methods Raw Reward
Normalized GMV/Order response rate Normalized GMV/Order response rate
cA2C / /
cDQN / /
Table 3. The effectiveness of averaged reward design. The performance of methods using the raw reward (second column) is much worse than the performance of methods using the averaged reward.
(a) Without reposition cost (b) With reposition cost
Figure 5. Convergence comparison of cA2C and its variations without using context embedding in both settings, with and without reposition costs. The X-axis is the number of episodes. The left Y-axis denotes the number of conflicts and the right Y-axis denotes the normalized GMV in one episode.
Normalized GMV/Order response rate Repositions
Without reposition cost
cA2C /
cA2C-v1 /
cA2C-v2 /
With reposition cost
cA2C /
cA2C-v3 /
Table 4. The effectiveness of context embedding.

6.5. Ablations on policy context embedding

In this subsection, we evaluate the effectiveness of context embedding, including explicitly coordinating the actions of different agents through the collaborative context, and eliminating the invalid actions with geographic context. The following variations of proposed methods are investigated in different settings.

  • cA2C-v1: This variation drops collaborative context of cA2C in the setting that does not consider reposition cost.

  • cA2C-v2: This variation drops both geographic and collaborative context of cA2C in the setting that does not consider reposition cost.

  • cA2C-v3: This variation drops collaborative context of cA2C in the setting that considers reposition cost.

The results of the above variations are summarized in Table 4 and Figure 5. As seen in the first two rows of Table 4 and the red/blue curves in Figure 5 (a), in the setting of zero reposition cost, cA2C achieves the best performance with far fewer repositions than cA2C-v1. Furthermore, collaborative context embedding achieves significant advantages when the reposition cost is considered, as shown in the last two rows of Table 4 and in Figure 5 (b). It not only greatly improves the performance but also accelerates the convergence. Since the collaborative context largely narrows down the action space and leads to a better policy in terms of both effectiveness and efficiency, we can conclude that coordination based on collaborative context is effective. Also, comparing the performance of cA2C and cA2C-v2 (red/green curves in Figure 5 (a)), the policy context embedding (considering both geographic and collaborative context) is clearly essential to performance, as it greatly reduces redundant policy search.

6.6. Qualitative study

In this section, we analyze whether the learned value function can capture the demand-supply relation ahead of time, and whether the allocations are rational. To see this, we present a case study on the region near the airport. The state value and allocation policy are acquired from cA2C that was trained for ten episodes. We then run the well-trained cA2C on one testing episode, and qualitatively examine the state value and allocations under the unseen dynamics. The sum of state values and the demand-supply gap (defined as the number of orders minus the number of vehicles) of seven grids that cover the CTU airport are visualized. As seen in Figure 7, the state value can capture the future dramatic changes of the demand-supply gap. Furthermore, the spatial distribution of state values can be seen in Figure 6. After midnight, the airport has a large number of orders and fewer available vehicles, and therefore the state values of the airport are higher than those of other locations. During the daytime, more vehicles are available at the airport, so each will receive less reward and the state values are lower than in other regions, as shown in Figure 6 (b). From Figure 6 and Figure 7, we can conclude that the value function can estimate the relative shift of the demand-supply gap from both spatial and temporal perspectives. This is crucial to the performance of cA2C, since the coordination is built upon the state values. Moreover, as illustrated by the blue arrows in Figure 6, the allocation policy gives consecutive allocations from lower-value grids to higher-value grids, which can thus fill the future demand-supply gap and increase the GMV.
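The demand-supply gap used in this case study is a simple quantity; a minimal sketch of it for a group of grids follows. The dictionaries keyed by grid identifier and the function name are hypothetical and only serve the illustration.

```python
from typing import Dict, Hashable, Iterable


def demand_supply_gap(orders: Dict[Hashable, int],
                      vehicles: Dict[Hashable, int],
                      grids: Iterable[Hashable]) -> int:
    """Sum over the given grids of (number of orders - number of vehicles)."""
    return sum(orders[g] - vehicles[g] for g in grids)
```

Evaluating this at every time step for the grids covering the airport yields the kind of gap curve that is compared with the summed state values.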

(a) At 01:50 am. (b) At 06:40 pm.
Figure 6. Illustration of the repositions near the airport at 01:50 am and 06:40 pm. A darker color denotes a higher state value, and the blue arrows denote the repositions.
Figure 7. The normalized state value and demand-supply gap over one day.

7. Conclusions

In this paper, we first formulate the large-scale fleet management problem into a feasible setting for deep reinforcement learning. Given this setting, we propose a contextual multi-agent reinforcement learning framework, in which two contextual algorithms, cDQN and cA2C, are developed, and both of them achieve coordination among a large number of agents in the fleet management problem. cA2C enjoys both flexibility and efficiency by capitalizing on a centralized value network and decentralized policy execution embedded with contextual information. It is able to adapt to different action spaces in an end-to-end training paradigm. A simulator is developed and calibrated with the real data provided by Didi Chuxing, which serves as our training and evaluation platform. Extensive empirical studies under different settings in the simulator have demonstrated the effectiveness of the proposed framework.

This material is based in part upon work supported by the National Science Foundation under Grant IIS-1565596, IIS-1615597, IIS-1749940 and Office of Naval Research N00014-14-1-0631, N00014-17-1-2265.