1 Introduction
The emergence of transportation network companies (TNCs), or e-hailing platforms (such as Didi and Uber), has revolutionized the traditional taxi market and provided commuters with a flexible-route door-to-door mobility service. Nonetheless, it is reported that a large portion of passenger requests remain unserved because of the imbalance between demand (i.e., passenger requests) and supply (i.e., available drivers) (lin_efficient_2018), resulting in long cruising trips for taxi drivers to find the next passenger (powell_towards_2011). Such cruising behavior has a negative impact on the urban economy by not only decreasing drivers’ income but also generating additional vehicle miles traveled. Thus, repositioning available drivers to locations with high near-future demand, i.e., balancing supply and demand, becomes the key challenge faced by the taxi and for-hire market, including e-hailing platforms. Leveraging cutting-edge machine learning techniques, this paper aims to improve the efficiency of the taxi and for-hire market.
The essence of the repositioning task is to provide recommendations to idle drivers on where to find the next passenger. Some recommender systems have been proposed for drivers (ge_energyefficient_2010; hwang_effective_2015; yuan_where_2011; qu_costeffective_2014). These studies extracted useful aggregated statistical quantities such as taxi demand and travel time from historical data and recommended a next cruising location (ge_energyefficient_2010), a sequence of potential pickup points (hwang_effective_2015), a driving route (qu_costeffective_2014), or a route and a location (yuan_where_2011).
Although the aforementioned studies provide effective recommendations of the next cruising route or location to drivers at the immediate next step, they are nearsighted and fall short of capturing the future long-run payoffs. To capture the effect of future rewards on the recommendation at the immediate next step, various Markov decision process (MDP) based approaches have been proposed to model idle drivers’ passenger searching process (rong_rich_2016; zhou_optimizing_2018; verma_augmenting_2017; gao_optimize_2018; yu_markov_2019; shou_optimal_2020). In an MDP with a single agent, a driver is the agent who makes decisions of where to go next. The dynamic environment is determined by the stochastic passenger requests and all other traffic information including the road network, distribution of drivers, and traffic conditions. Once the agent takes an action in a state, the agent transits into a new state and receives an immediate reward by following the dynamics of the environment. The agent aims to derive an optimal policy which maximizes her expected cumulative reward. When the dynamic environment is known to the agent, dynamic programming or value iteration can be used to solve the MDP and derive an optimal policy. When the dynamic environment is unknown to the agent, the agent needs to interact with the environment by trial and error and gradually learns an optimal policy through reinforcement learning (RL) algorithms such as Q-learning and temporal difference learning (sutton_introduction_1998).
The competition among multiple agents is, however, neglected in the aforementioned MDP models due to their single-agent setting, resulting in overly optimistic optimal policies. In other words, one agent cannot earn the full amount of the expected reward by following the policy derived in the single-agent setting. In a dynamic environment involving a group of agents, multiple agents interact with both the shared environment and other agents. Multi-agent reinforcement learning (MARL) (busoniu_multiagent_2010) thus fits naturally in this multi-agent system (MAS). Recently, MARL has been attracting significant attention due to its success in tackling high-dimensional and complicated tasks such as playing the game of Go (silver_mastering_2016; silver_mastering_2017), Poker (brown_superhuman_2018; brown_superhuman_2019), Dota 2 (OpenAI_dota), and StarCraft II (vinyals_grandmaster_2019).
MARL tasks can be broadly grouped into three categories, namely, fully cooperative, fully competitive, and a mix of the two, depending on the application (zhang_multiagent_2019): (1) In the fully cooperative setting, agents collaborate with each other to optimize a common goal; (2) In the fully competitive setting, agents have competing goals, and the returns of agents sum to zero; (3) The mixed setting resembles a general-sum game where each agent cooperates with some agents while competing with others. For instance, in the video game Pong, an agent is expected to be either fully competitive if its goal is to beat its opponent or fully cooperative if its goal is to keep the ball in the game as long as possible (tampuu_multiagent_2017). A progression from fully competitive to fully cooperative behavior of agents was also presented in tampuu_multiagent_2017 by simply adjusting the reward.
A key challenge arises in MARL when independent agents have no knowledge of other agents: the theoretical convergence guarantee is no longer applicable since, from the perspective of each agent, the environment is no longer Markovian and stationary (matignon_independent_2012; nguyen_deep_2018). One way to tackle this issue is to exchange some information among agents. In some contexts, agents actually exchange information with their peers through some coordination. For example, in the game of a team of hunters capturing a team of prey, tan_multiagent_1993 proposed multiple ways to enable coordination among agents and concluded that the performance of the hunter agents can be improved through some coordination. However, in other contexts such as the driver repositioning system, agents only have access to their own information. Thus, information exchange among agents involves a central controller which collects the information of all agents and disseminates it to agents. Agents update their value functions and policies based on the information provided by the central controller and their local observations. This is the centralized learning (i.e., based on global information) and decentralized execution (i.e., based on local observation) paradigm, which has become increasingly popular in recent research (foerster_learning_2016; lowe_multiagent_2017; lin_efficient_2018; li_efficient_2019).
While training is stabilized by conditioning on the information of other agents, such as the joint state and joint action, in the centralized training paradigm, scalability becomes a critical issue in MARL because the joint state space and joint action space grow exponentially with the number of agents. To make MARL tractable when a large number of agents coexist, yang_mean_2018 employed mean field theory to simplify the interaction among agents. The basic idea is, from the perspective of an agent, to treat other agents as a mean agent. Thus, the complexity of interactions among a large number of agents is substantially eased by reducing the dimension of the input to the Q-value function, and large-scale MARL with hundreds or even thousands of agents becomes solvable. To investigate the large-scale order dispatching problem where thousands of agents are present, li_efficient_2019 adopted a mean field approximation and proposed to take the average response from neighboring agents as a proxy of the interaction between the agent and other agents.
Recent studies have successfully applied MARL to multi-driver repositioning and large-scale order dispatching problems (lin_efficient_2018; li_efficient_2019; zhou_multiagent_2019). Instead of treating each driver as an agent as in previous studies, jin_coride_2019 treated each spatial grid as a worker agent and each region composed of several spatial grids as a manager agent and adopted hierarchical reinforcement learning to tackle the joint task of order dispatching and fleet management. All these studies rely on an underlying assumption that drivers are willing to cooperate under a specifically crafted reward function. For example, embedding the goal of the platform, such as improving the gross merchandise volume (GMV) or the order response rate (ORR), into the reward function of a driver encourages cooperation among drivers. Human drivers are, however, selfish in nature and will only cooperate if the overall return from cooperation is higher than that from competition. This self-interested behavior can be exploited to achieve a certain degree of cooperation among agents, for example by adjusting the reward for each agent. However, when the imposed reward function (lin_efficient_2018; li_efficient_2019; zhou_multiagent_2019; jin_coride_2019) is not aligned with the goal of real drivers (e.g., a real driver’s goal can simply be maximizing her monetary return), drivers will not follow the derived optimal policy. Thus, in this work, instead of enforcing a reward function for drivers to cooperate, drivers are regarded as selfish and non-cooperative, and the reward for a driver is simply the monetary return that the driver earns.
Although the approaches in lin_efficient_2018; li_efficient_2019; zhou_multiagent_2019; jin_coride_2019 are efficient under a given reward function, the equilibrium reached is very likely to be suboptimal from the overall perspective of the system. In this paper, we show that by integrating a reward design mechanism which adjusts the monetary return that a driver earns, a desirable equilibrium can be reached in this intrinsically large-scale non-cooperative system. The desirable equilibrium refers to a Nash equilibrium in which each independent and selfish agent’s strategy is the best response to other agents’ strategies and which produces better overall performance of the system. mguni_coordinating_2018 proposed a two-layer architecture with an incentive designer as the upper layer and a potential game as the lower layer and formulated the incentive designer’s problem as an optimization problem. In contrast, the MARL problem in our context may not be transformable into a potential game, which complicates the computation of its equilibrium.
In summary, the major contributions of this paper are as follows: (1) Instead of intentionally crafting a reward function, which aligns with the goal of the platform but may not reflect the intrinsic reward of real drivers, this paper takes the monetary return of a driver as the reward function. It aims to improve the performance of the platform by adjusting the monetary return that a driver can earn through a reward design mechanism of the platform (e.g., platform service charge and incentives). (2) With the MAS as the lower level and the reward design as the upper level, this paper formulates a bilevel optimization problem in which a mean field actor-critic algorithm is developed to solve the MAS and a Bayesian optimization algorithm is adopted to efficiently solve the problem.
The remainder of the paper is organized as follows. Section (2) introduces the single-agent actor-critic algorithm, which is a stepping stone for MARL. Section (3) presents the mean field multi-agent reinforcement learning algorithm. Section (4) presents a reward design mechanism and formulates a bilevel optimization problem. Section (5) presents the results and validates the effectiveness of the proposed reward design. Section (6) concludes.
2 Single agent reinforcement learning
As a stepping stone, we first introduce single-agent reinforcement learning, in which only one agent interacts with the environment.
2.1 Problem definition
A Markov decision process (MDP) (puterman_markov_1994) is typically specified by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ stands for the allowable actions, $\mathcal{R}$ collects rewards, $\mathcal{P}$ denotes a state transition probability from one state to another, and $\gamma \in [0, 1]$ is a discount factor. A general MDP proceeds simply as follows. Starting from the initial state, the agent specifies an action $a \in \mathcal{A}$ whenever the agent is in a state $s \in \mathcal{S}$. The agent then transits into a new state $s'$ with probability $P(s' \mid s, a)$ and observes an immediate reward $r$ by obeying the dynamics of the environment. Then the process repeats until a terminal state is reached. A policy $\pi$ simply maps from state $s$ to the probability of taking action $a$ in state $s$, i.e., $\pi(a \mid s)$. The goal of solving an MDP is to derive an optimal policy so that the agent can maximize her long-term expected reward by following the policy. In reinforcement learning problems, the transition probability matrix $\mathcal{P}$ is commonly unknown, and the agent learns about $\mathcal{P}$ from its interaction with the environment.
Denote $V^{\pi}(s)$ as the state value, which is the expected cumulative reward that an agent can earn by starting from state $s$ and following a policy $\pi$. $V^{\pi}(s)$ can be recursively given as (sutton_introduction_1998) $V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]$. Denote $Q^{\pi}(s, a)$ as the state-action value, which is the expected cumulative reward that an agent can earn by starting from state $s$, taking action $a$, and following a policy $\pi$. $Q^{\pi}(s, a)$ is related to $V^{\pi}(s)$ through $V^{\pi}(s) = \sum_{a} \pi(a \mid s) Q^{\pi}(s, a)$.
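The optimal value can then be written as $V^{*}(s) = \max_{\pi} V^{\pi}(s)$. The Bellman optimality equation is given as (sutton_introduction_1998):
$V^{*}(s) = \max_{a} Q^{*}(s, a)$,
where the optimal state-action value is $Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r + \gamma V^{*}(s') \right]$.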
Our task is then to derive an optimal policy (i.e., to solve the MDP) with which the agent can optimize its expected cumulative reward.
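As an illustration of the Bellman optimality equation, the following minimal sketch runs value iteration on a small tabular MDP, which applies only when the transition probabilities are known. The two-state, two-action MDP, the array names `P` and `R`, and the use of an expected reward table `R[s, a]` are assumptions made for illustration; they are not taken from the grid world example in this paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular value iteration.

    P[s, a, s2] is the (assumed known) transition probability and
    R[s, a] is the expected immediate reward of taking action a in state s.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * V[s2]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)  # Bellman optimality backup
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return V, Q, Q.argmax(axis=1)

# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

V, Q, policy = value_iteration(P, R)
print(V, policy)
```

In this toy case the greedy policy moves to state 1 and stays there. When the transition probabilities are unknown, as in the repositioning task, such a backup is not available and a learning-based method is needed.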
To demonstrate how to apply MDPs to the context of e-hailing driver repositioning, we will use examples on a 2-by-2 grid world throughout the paper whenever a model is introduced.
Example 2.0.
(SingleAgent ). The singleagent driver reposition is presented in Figure (1). We adopt a grid world setup where the index of each grid (denoted as ) is shown at the upper left corner. The taxi icon denotes the driver, and the person icon is the passenger request with the corresponding fare shown above. The time beneath the driver and the passenger request records the current time of the driver and the appearance time of the passenger request, respectively. The dashed line with arrow shows the origin and destination of the passenger request.
S. The state of the driver consists of two components, namely, the grid index and current time , i.e., . For instance, the current state of the driver is in this example.
A. The allowable action of the driver is either moving into one of the neighboring grids or staying within the current grid. To be concise, we use the index of grid where the driver chooses to enter as the action. Suppose the driver decides to go rightward in the example, then we can denote . We further assume it takes the driver one time step to enter grid . In other words, the current time of the driver is when the driver arrives in grid .
P. Suppose the driver arrives in grid at time , and at the same time a passenger request appears in grid with probability. If this driver is matched to the passenger and picks up the passenger, the driver will transit to the passenger’s destination, which is grid . Denote the transition time from grid to grid as . We can define the new state . Then the transition probability from the state at time to the state at time is , mathematically, . If there is no passenger request in grid at time , then the driver ends up in state . The transition probability becomes .
R. If we take the fare of the fulfilled passenger request as the reward, in the example. Based on the received reward at this step and the future cumulative reward, the driver chooses an action in the new state , and the state transition process repeats until a terminal state (i.e., where is a predefined ending time, say, the end of the driver’s work time) is reached. ∎
2.2 Actor-critic method
To solve for optimal policies, there are two types of methods, namely, value-based (critic-only) methods and policy-based (actor-only) methods. Although value-based and policy-based are the more common terms, from now on we will use critic-only and actor-only for the purpose of introducing the actor-critic method.
Critic-only methods aim to output the optimal policy through optimizing the state-action value $Q(s, a)$ or the state value $V(s)$. Actor-only methods directly output an optimal policy without resorting to stored value functions $Q(s, a)$ or $V(s)$ as an intermediary. Both methods have pros and cons. Critic-only methods enjoy a low variance in the estimate of the state-action value but may lack guarantees on the optimality or near-optimality of the resulting policy if an optimal policy cannot be easily solved from value functions. Actor-only methods work well on continuous and large action spaces but may suffer from high fluctuation in policies (konda_actorcritic_2003; grondman_survey_2012). To overcome the shortcomings of these methods, actor-critic methods are developed to combine the strengths of both (konda_actorcritic_2003).
Figure (2) presents the architecture of the actor-critic algorithm. One agent, who has an actor and a critic, interacts with the environment. The agent observes its state $s$ from the environment and inputs $s$ to the actor, which outputs the policy, i.e., a probability distribution over all possible actions. The agent samples an action $a$ from the probability distribution and takes action $a$ in the environment. Then the agent observes a state transition $s \to s'$ and receives a reward $r$ from the environment. Based on the one-step transition $s \to s'$ as well as action $a$ and reward $r$, the agent updates its critic. With the updated Q-value $Q(s, a)$, the agent updates its actor using the policy gradient. We now detail the critic and the actor in turn.
Critic. The critic takes as input state $s$ and action $a$ and outputs the Q-value $Q(s, a)$. Q-learning is the most commonly used algorithm to update the Q-value: based on the state transition $s \to s'$ with reward $r$, it updates the Q-value by
(1) $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
where $\alpha$ is the learning rate. If $\alpha$ decreases properly over time, the Q-learning update converges (sutton_introduction_1998). Equation (1), however, is only applicable to a finite and discrete state and action space. In other words, one needs to maintain a Q table with all possible combinations of $s$ and $a$, which is not tractable for a continuous and large state and action space. Therefore we need a function approximation of the original Q-value. A deep neural network, i.e., a deep Q network (DQN), is one of the most popular value approximators (mnih_humanlevel_2015). Denote a deep neural network parameterized by $\phi$ as $Q_{\phi}(s, a)$, to approximate $Q(s, a)$. DQN updates its parameter $\phi$ by minimizing the loss
(2) $L(\phi) = \left( r + \gamma \max_{a'} Q_{\phi}(s', a') - Q_{\phi}(s, a) \right)^{2}$
This problem can be solved by the gradient descent method, whose gradient is straightforward to compute as follows: $\nabla_{\phi} L(\phi) = -2 \left( r + \gamma \max_{a'} Q_{\phi}(s', a') - Q_{\phi}(s, a) \right) \nabla_{\phi} Q_{\phi}(s, a)$, where the gradient is not taken with respect to the target.
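The tabular Q-learning update in Equation (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the table size, the learning rate, the discount factor, and the sample transitions are hypothetical, and the `terminal` flag is an added convenience for episodes that end.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one Q-learning update to the table Q in place and return it."""
    # The target bootstraps from the best action in the next state,
    # unless the next state is terminal.
    target = r if terminal else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical table with 2 states and 2 actions, initialized to zero.
Q_table = np.zeros((2, 2))
q_learning_update(Q_table, s=0, a=1, r=1.0, s_next=1)
q_learning_update(Q_table, s=1, a=1, r=2.0, s_next=1)
print(Q_table)
```

The table grows with the number of state-action pairs, which is the tractability issue that motivates replacing it with a parameterized approximator.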
Actor. The actor takes as input state $s$ and outputs a probability distribution over all allowable actions in this state. Similarly to how we use a value network to approximate the Q-value, we can also use a deep neural network, i.e., a policy network, to approximate the policy $\pi$. Denote the policy network parameterized by $\theta$ as $\pi_{\theta}(a \mid s)$. The goal of the actor is to maximize its expected cumulative reward, denoted as $J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \sum_{t} \gamma^{t} r_{t} \right]$, where $r_{t}$ is the reward the actor receives at time $t$. Solving for the optimal policy of the actor requires the gradient of $J(\theta)$ with respect to $\theta$. This policy gradient is nontrivial to derive and is given as (sutton_policy_1999)
(3) $\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) \left( Q^{\pi_{\theta}}(s, a) - b(s) \right) \right]$
where $Q^{\pi_{\theta}}(s, a)$ denotes the Q-value function following the policy $\pi_{\theta}$, $b(s)$ is some baseline (e.g., $b(s) = V^{\pi_{\theta}}(s)$, i.e., the value function following the policy $\pi_{\theta}$), and $A(s, a) = Q^{\pi_{\theta}}(s, a) - b(s)$ is called the advantage of a taken action $a$, a measure of the goodness of an action. If the advantage is greater than zero, the taken action is better than the baseline; otherwise it is no better. The underlying rationale in computing the policy gradient defined in Equation (3) is to update the policy distribution to concentrate on potentially good actions. When the chosen action leads to a positive advantage, i.e., $A(s, a) > 0$, the policy is updated in the direction of favoring action $a$. When the advantage is negative for action $a$, the policy is updated in the direction against action $a$.
To summarize, in addition to the policy network $\pi_{\theta}$, the actor-critic algorithm also maintains a value network $Q_{\phi}$ so that the calculation of the gradient of the policy in Equation (3) directly uses the Q-function approximator $Q_{\phi}$, to ensure stability of the policy update. The actor-critic algorithm simultaneously updates the critic (by minimizing the loss given in Equation (2)) and the actor (by the gradient given in Equation (3)) as more samples are fed in.
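To make the interplay between the critic and the actor concrete, the following sketch performs a single actor-critic step. It is a simplified illustration, not the networks used in this paper: both the critic and the actor are tables rather than deep neural networks, the policy is a softmax over per-state logits, the baseline is the policy-weighted average of the Q-values, and all step sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

def actor_critic_step(theta, Q, s, a, r, s_next,
                      alpha_critic=0.1, alpha_actor=0.1, gamma=0.9):
    """One update of a tabular critic Q and a tabular softmax actor theta."""
    # Critic: move Q[s, a] toward the one-step target.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha_critic * (td_target - Q[s, a])

    # Actor: advantage = Q(s, a) - baseline, with baseline V(s) = sum_a pi(a|s) Q(s, a).
    pi = softmax(theta[s])
    advantage = Q[s, a] - pi @ Q[s]

    # For a softmax policy, grad of log pi(a|s) w.r.t. the logits is onehot(a) - pi.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * advantage * grad_log_pi
    return theta, Q

# Hypothetical problem with 2 states and 2 actions.
theta = np.zeros((2, 2))
Q_critic = np.zeros((2, 2))
actor_critic_step(theta, Q_critic, s=0, a=1, r=1.0, s_next=1)
print(softmax(theta[0]))
```

Because the sampled action produced a positive advantage, its probability under the updated policy rises above one half, which is the behavior described for Equation (3).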
3 Multiagent reinforcement learning
To tackle a realworld problem with multiple agents, the aforementioned single agent reinforcement learning falls short of capturing the coupling effects or the competition among multiple agents. In this section, we introduce a mean field multiagent reinforcement learning approach to model the multidriver repositioning task.
3.1 Problem definition
The multi-agent problem is modeled as a partially observable Markov decision process (POMDP) (littman_markov_1994), defined by a tuple $(N, \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $N$ is the number of agents and $\mathcal{S}$ is the environment state space. The environment state $s \in \mathcal{S}$ is not fully observable. Instead, agent $i$ draws a private observation $o_i \in \mathcal{O}_i$ which is correlated with $s$. $\mathcal{O}_i$ is the observation space of agent $i$, yielding a joint observation space $\mathcal{O} = \mathcal{O}_1 \times \cdots \times \mathcal{O}_N$; $\mathcal{A}_i$ is the action space of agent $i$, yielding a joint action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$; $\mathcal{P}$ is the state transition probability; $\mathcal{R}_i$ is the reward function for agent $i$; and $\gamma$ is the discount factor.
Agent $i$ uses a policy $\pi_i$ to choose actions after drawing observation $o_i$. After all agents take actions, the joint action $\mathbf{a} = (a_1, \ldots, a_N)$ triggers a state transition $s \to s'$ based on the state transition probability $P(s' \mid s, \mathbf{a})$. Agent $i$ draws a private observation $o_i'$ corresponding to $s'$ and receives a reward $r_i$. Agent $i$ aims to maximize its discounted expected cumulative reward by deriving an optimal policy which is the best response to other agents’ policies. This process repeats until agents reach their own terminal states.
Due to the existence of other agents, the Q-value function for agent $i$, i.e., $Q_i$, is now dependent on the environment state $s$ and the joint action $\mathbf{a}$ of all agents, i.e.,
(4) $Q_i = Q_i(s, \mathbf{a}) = Q_i(s, a_1, \ldots, a_N)$
Similarly, the value function of agent $i$, i.e., $V_i(s)$, is dependent on the environment state $s$.
Subsequently, we will demonstrate how to formulate the multidriver repositioning problem in MARL, building on the singleagent example developed in the previous section.
Example 3.0.
(MultiAgent ). The multiagent driver reposition is presented in Figure (3). Same as before, a grid world setup is adopted. Now we have two drivers with their indices shown above the taxi icon and two passenger requests with fare presented above the passenger icon. The time beneath drivers and passenger requests records the current time of the driver and the appearance time of the passenger request, respectively. The dashed line with arrow shows the origin and destination of the passenger request.
N. There are drivers moving around in the environment. We denote drivers by .
S. The environment state consists of the state information of both drivers. For driver , her state is composed of her current location (i.e., the grid index based on a grid world setup) and current time , i.e., . The joint state of both drivers, i.e., the environment state , at time is denoted as . In this example, at current time , .
A. For driver , her action can be any of the five possible actions, i.e., moving into any of her four neighboring grids or staying in the current grid. The same as before, we use the index of grid where the driver chooses to enter as the action. The joint action of both drivers is . Assuming driver decides to go rightward (i.e, to enter grid ) and driver chooses to go leftward (i.e., to enter grid ), the joint action is . We further assume it then takes driver one time step to enter grid and driver one time step to enter grid . In other words, after driver arrives in grid and driver arrives in grid , the clock ticks one step forward and the current time is now .
P. The joint action triggers a state transition with some probability according to the state transition function, i.e., . Driver gets matched to the passenger request in grid at , loads up the passenger, and drives to the destination of the passenger. Driver then arrives in a new state where is the transition time from grid to grid . Driver gets matched to the passenger request in grid at , loads up the passenger, and drives to the destination of the passenger. Driver then arrives in a new state where is the transition time from grid to grid . . In this simple example, due to the deterministic appearance of passenger requests.
R. Along with the state transition, each driver receives a reward, i.e., . The reward function for each agent is simply the fare of the fulfilled passenger request, i.e., and . ∎
This example will be revisited later in this section to illustrate the algorithm.
3.2 Techniques to simplify the Q-value function
The dependency of the Q-value of an agent on other agents’ states and actions, as shown in Equation (4), however, makes learning the optimal Q-value prohibitively difficult. The reasons are twofold. First, although each agent draws its private observation from the environment state $s$, $s$ cannot be observed by any agent, i.e., $s$ is unknown. Second, one agent does not observe the actual actions taken by all agents, i.e., $\mathbf{a}$ is unknown.
To make the Q-value of an agent in the multi-agent system tractable, the dependency of the Q-value on the environment state and joint action needs to be simplified. A very natural approach, inspired by the single-agent setting, is independent learning, where each agent only has information about its own observation and action but has no information about other agents. Thus, the Q-value function of agent $i$ is reduced to
(5) $Q_i(s, \mathbf{a}) \approx Q_i(o_i, a_i)$
In other words, the private observations and joint action of other agents are not used by agent $i$. After all agents choose actions, the joint action $\mathbf{a}$ triggers a state transition. Agent $i$ then draws a new private observation $o_i'$ and receives a reward $r_i$.
The independent learning algorithm, although intuitive and simple, can be unstable and may fail to converge since the environment is no longer Markovian and stationary due to the presence of other agents (matignon_independent_2012).
3.2.1 Centralized training and decentralized execution
To make the training more stable and ensure convergence, we employ the centralized training and decentralized execution paradigm (foerster_learning_2016; lowe_multiagent_2017; lin_efficient_2018; li_efficient_2019). In this paradigm, to train the policy of agents, we assume these agents know global information such as the joint observation and/or joint action. In other words, in addition to observation $o_i$ and action $a_i$, agent $i$ also has access to the observations and/or actions of other agents during training. In the execution phase, in contrast, execution is decentralized, meaning agents no longer have access to the global information. The aforementioned actor-critic algorithm naturally fits this paradigm, because we can apply global information to the critic, i.e., the joint observation and joint action in $Q_i(o_i, \mathbf{o}_{-i}, a_i, \mathbf{a}_{-i})$, in the training phase, while feeding only local information to the actor, i.e., $o_i$ in $\pi_i(a_i \mid o_i)$, in the execution phase. Decentralized execution becomes possible because only actors are used in execution.
Then the Q-value function of agent $i$ becomes
(6) $Q_i(s, \mathbf{a}) \approx Q_i(o_i, \mathbf{o}_{-i}, a_i, \mathbf{a}_{-i})$
where $\mathbf{o}_{-i}$ and $\mathbf{a}_{-i}$ denote the joint observation and joint action of all agents except agent $i$, respectively.
In the context of e-hailing driver repositioning, considering the definition of the action, which is the index of the grid where the driver chooses to enter, the Q-value function of driver $i$, i.e., $Q_i$, does not depend on the joint observation of other drivers, i.e., $\mathbf{o}_{-i}$. The explanation is as follows. When driver $i$ chooses action $a_i$ based on its observation $o_i$, driver $i$ then enters grid $a_i$. At the same time, other drivers also enter some grids based on their joint action $\mathbf{a}_{-i}$, regardless of their joint observation $\mathbf{o}_{-i}$. The Q-value function of driver $i$ only depends on the current distribution of drivers, which has been determined by their joint action $\mathbf{a}_{-i}$. Therefore it is the joint action $\mathbf{a}_{-i}$ which affects $Q_i$. The Q-value function is thus further reduced to
(7) $Q_i(s, \mathbf{a}) \approx Q_i(o_i, a_i, \mathbf{a}_{-i})$
3.2.2 Mean field approximation
The centralized training and decentralized execution paradigm, however, can easily become intractable due to the exponential increase in the joint action space with the increasing number of agents. For example, the size of the joint action space easily blows up for agents with possible actions (i.e., possibilities). To simplify the interaction among agents, we adopt the mean field approximation. The basic idea of the mean field approximation is to simplify the complicated interaction between one agent and all other agents by a pairwise interaction between the agent and a virtual mean agent which is formed by the neighboring agents of the agent. Thus, the complexity of interactions among a large number of agents is substantially eased by reducing the dimension in the input of the Qvalue function. Therefore the large scale MARL with hundreds of or even thousands of agents becomes solvable.
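The exponential growth of the joint action space can be checked with a short calculation. In this sketch, five actions per agent matches the grid world setup used in the examples (four neighboring grids plus staying), while the agent counts are hypothetical values chosen only to show the trend.

```python
def joint_action_space_size(n_agents: int, n_actions: int) -> int:
    """Number of distinct joint actions when each agent picks one of n_actions."""
    return n_actions ** n_agents

# Hypothetical fleet sizes; each driver has five possible actions.
for n_agents in (2, 10, 100):
    size = joint_action_space_size(n_agents, 5)
    print(f"{n_agents} agents -> {len(str(size))}-digit joint action space")
```

By contrast, a critic that takes a single mean action as input has an input dimension that does not grow with the number of agents.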
To be more precise, we briefly explain why the mean field approximation is applicable in MARL, following yang_mean_2018. First, from the perspective of agent $i$, the multi-agent effect or competition effect mainly comes from its neighboring agents, i.e., $Q_i(o_i, a_i, \mathbf{a}_{-i}) \approx Q_i(o_i, a_i, \mathbf{a}_{\mathcal{N}(i)})$, where $\mathcal{N}(i)$ denotes the neighboring agents of agent $i$. However, it is still cumbersome to compute $Q_i(o_i, a_i, \mathbf{a}_{\mathcal{N}(i)})$ for the neighboring agents of agent $i$ if this number is large. Define a mean action $\bar{a}_i$, which is a proxy of the actions taken by the neighboring agents. Accordingly, $Q_i(o_i, a_i, \mathbf{a}_{\mathcal{N}(i)})$ can be further simplified to $Q_i(o_i, a_i, \bar{a}_i)$ when a Taylor expansion is applied, which is
(8) $Q_i(o_i, a_i, \mathbf{a}_{\mathcal{N}(i)}) \approx Q_i(o_i, a_i, \bar{a}_i)$
Interested readers can refer to yang_mean_2018 for a detailed explanation and proof.
Example 3.0.
(MultiAgent ). The mean action of the neighboring drivers of driver i is defined as the demand to supply ratio in the grid where driver is entering. Assuming both drivers choose action , i.e., in the multiagent example shown in Figure (3), there are 2 drivers and 1 passenger request in grid after both drivers enter grid . The mean action for both drivers is thus . This definition of mean action captures the level of competition in a grid. A larger mean action denotes a higher demand to supply ratio and lower level of competition, and vice versa. ∎
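The demand-to-supply definition of the mean action can be sketched as follows. The function name `mean_actions`, the list-based inputs, and the grid indices in the usage lines are hypothetical; only the definition itself (requests in the entered grid divided by drivers entering that grid) follows the example above.

```python
from collections import Counter

def mean_actions(driver_targets, request_grids):
    """Return, for each driver, the demand-to-supply ratio of the grid it enters.

    driver_targets[i] is the grid index driver i chooses to enter;
    request_grids lists the grid index of every open passenger request.
    """
    supply = Counter(driver_targets)  # drivers entering each grid
    demand = Counter(request_grids)   # requests waiting in each grid
    # supply[g] >= 1 for every g in driver_targets, so the division is safe.
    return [demand[g] / supply[g] for g in driver_targets]

# Two drivers enter the same grid, which holds one request.
both_same = mean_actions(driver_targets=[4, 4], request_grids=[4])
# Drivers split up; only the second driver's grid has a request.
split_up = mean_actions(driver_targets=[3, 4], request_grids=[4])
print(both_same, split_up)
```

With two drivers and one request in the same grid, each driver observes a mean action of one half; a driver entering a grid without requests observes zero.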
3.3 Mean field actor-critic algorithm
As previously mentioned, each agent maintains a policy network (i.e., the actor) and a Q-value network (i.e., the critic). For a real-world multi-agent task, there are typically hundreds or even thousands of agents, so maintaining two deep neural networks (i.e., one for the actor and one for the critic) per agent is not computationally tractable. For a class of multi-agent tasks where anonymous agents share the same state space, action space, and reward function, however, agents are homogeneous. The multi-agent task can then be largely simplified by sharing both the actor and the critic among drivers, i.e., $\pi_{\theta_i} = \pi_{\theta}$ and $Q_{\phi_i} = Q_{\phi}$ for all $i$.
After adopting the mean field approximation, the loss function for the critic, which was presented in Equation (2) for the single-agent setting, now becomes
(9) 
The only difference is the incorporation of the mean action into the Qvalue function approximation. Similarly, the gradient of the policy, which was presented in Equation (3) for singleagent setting, is now
(10) 
Example 3.0.
(MultiAgent ). Now we apply the mean field actor-critic algorithm to the multi-driver example shown in Figure (3). Figure (4) presents the architecture of the mean field actor-critic algorithm particularly for the context of multi-driver repositioning. Homogeneous agents, who share a common actor and a common critic, interact with the environment. The shared actor is a multilayer perceptron with 32 neurons in its hidden layer; it takes as input observation $o_i$ and outputs a five-dimensional vector denoting the probability distribution over the five actions. Similarly, the shared critic takes as input $(o_i, a_i, \bar{a}_i)$ and outputs the Q-value. During training, agent $i$ draws its private observation $o_i$ from the environment and inputs $o_i$ to the actor, which outputs a probability distribution over actions. Agent $i$ samples an action $a_i$ from the probability distribution and takes the sampled action in the environment. The joint action of all agents triggers a state transition in the environment. Agent $i$ then observes the mean action $\bar{a}_i$, draws a new observation $o_i'$, and receives a reward $r_i$ from the environment. The agent then uses $(o_i, a_i, \bar{a}_i, r_i, o_i')$ to update the shared critic by minimizing the loss presented in Equation (9). Based on the advantage calculated from the critic, agent $i$ updates the shared actor using the gradient presented in Equation (10).
The aforementioned training process is centralized because the mean action used in the critic is global information. During execution, agents only need to use the updated actor, which takes as input only local information, i.e., the private observation. In other words, the shared critic is not used in execution.
The derived Q-values corresponding to four scenarios of interest are presented in Figure (5). In Figure (5(a)), when both drivers choose action #4, the observed mean action for both of them is the ratio of demand to supply, i.e., . The resulting expected value for both drivers is , i.e., , because both of them have an equal probability to take the passenger request with . Similarly, the observed mean actions and resulting Q-values can be explained in other scenarios. The Q-value bimatrix is presented in Table (1) where driver is the column player and driver is the row player. When driver chooses action and driver 2 chooses action , Q-values for them are and , respectively, according to Figure (5(d)). Similarly, Q-values for both drivers can be read from Figure (5) for other scenarios. Based on the bimatrix, driver always chooses action because action is strictly better than action regardless of the observed mean action, and driver always chooses action for the same reason. Thus, the optimal policy for both drivers is to enter grid with an expected payoff .
4 Reward design for multi-agent reinforcement learning
Due to the selfishness of each agent, performing MARL under a given reward function in an MAS is very likely to yield an equilibrium that is undesirable from the perspective of the system. In other words, this equilibrium may not be an optimum with respect to some system objective. To guide a multi-agent system towards a desirable equilibrium, system planners could resort to reward design mechanisms that modify the reward function of agents. In this paper, we introduce a new parameter into agents' reward; the parameter takes values in a feasible domain and can be either a scalar or a vector. The goal of system planners is to maximize some system performance measure that depends on this parameter. The system planner first chooses a value of the parameter and inputs it to the MAS. With the given parameter, which determines the reward, the developed mean field actor-critic algorithm is employed to derive an optimal policy, which depends on the parameter, for all agents in the system. The performance measure, which is calculated by executing the derived optimal policy for all agents, is then fed back into the reward design. The performance measure thus depends on the parameter through the dependency of the optimal policy on the parameter.
In summary, the reward design problem is to select a parameter value that maximizes the performance measure on the upper level, while the distributed agents aim to maximize their individual cumulative rewards on the lower level once the parameter is given as part of their reward. This process can be formulated as a bilevel optimization problem, mathematically,
(11)  
The interaction between upper and lower levels through exchange of variables is shown in Figure (6).
The optimization problem presented in Equation (11), however, is not straightforward to solve because the structure of the performance measure as a function of the parameter is unknown and complex. Traditional gradient-based methods such as gradient descent are thus not applicable.
In this paper, we adopt Bayesian optimization (hereafter BO). The procedure of BO is as follows. First, BO places a statistical model, such as a Gaussian process, on the objective function. Second, BO devises an acquisition function to decide where to evaluate next, i.e., to choose a parameter value based on the statistical model. Third, BO updates the statistical model based on the newly evaluated objective value, and the process repeats. The pseudocode of BO is listed in Algorithm (2). Interested readers are referred to (frazier_tutorial_2018) for more details on BO.
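The three BO steps above can be sketched for a scalar parameter as follows. This is not Algorithm (2) itself: the hand-written Gaussian process with an RBF kernel, the upper-confidence-bound acquisition, the candidate grid, and every name and default value here are assumptions chosen to keep the example self-contained. The `objective` argument stands in for training the lower-level MAS and measuring system performance.

```python
import numpy as np


def rbf_kernel(a, b, length_scale):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, length_scale, noise=1e-2):
    """Posterior mean and standard deviation of a GP with unit prior variance."""
    y_mean = y_obs.mean()
    k_oo = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs, length_scale)
    mean = y_mean + k_qo @ np.linalg.solve(k_oo, y_obs - y_mean)
    solved = np.linalg.solve(k_oo, k_qo.T)
    var = 1.0 - np.sum(k_qo * solved.T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def bayes_opt(objective, low, high, budget=20, n_init=3, kappa=2.0, seed=0):
    """Maximize a black-box scalar objective with a fixed evaluation budget."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(low, high, 201)       # candidate parameter values
    length_scale = 0.2 * (high - low)        # assumed kernel width
    xs = [float(x) for x in rng.uniform(low, high, size=n_init)]
    ys = [float(objective(x)) for x in xs]
    while len(xs) < budget:
        # Step 1: fit the statistical model to all evaluations so far.
        mean, std = gp_posterior(np.array(xs), np.array(ys), grid, length_scale)
        # Step 2: the acquisition function picks where to evaluate next.
        x_next = float(grid[np.argmax(mean + kappa * std)])
        # Step 3: evaluate the objective and add the result to the data.
        xs.append(x_next)
        ys.append(float(objective(x_next)))
    mean, _ = gp_posterior(np.array(xs), np.array(ys), grid, length_scale)
    return float(grid[np.argmax(mean)]), np.array(xs), np.array(ys)
```

Returning the maximizer of the fitted mean, instead of the best single evaluation, is a design choice that suits objectives whose evaluations are noisy.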
To be more concrete, now we use the multiagent example presented in Figure (3) to illustrate the potential of the reward design.
Example 4.0.
(Multi-Agent). We take the order response rate (ORR), i.e., the ratio of the number of fulfilled passenger requests to the total number of passenger requests, as the performance measure of the system. The direct application of the mean field actor-critic algorithm yields a 50% ORR, which is clearly not the desired equilibrium from the perspective of the system. Note that the platform typically charges a certain proportion of the fare paid by the passenger as the so-called platform service charge, which reportedly depends on various factors such as distance, duration, and city. We aim to improve the performance of the system by devising a proper reward design.
In Figure (3), trip fares are shown right above each passenger request. Without any charge, the equilibrium reached is for both drivers to enter the same grid, leading to an oversupply (i.e., a low demand to supply ratio) in that grid and an undersupply (i.e., a high demand to supply ratio) in the other grid, which is not beneficial for the system. A reward design which deducts a charge from the fare paid to the driver in the oversupplied grid will effectively attract one driver to leave it for the other grid to get more monetary return, resulting in a higher order response rate.
5 Case Study
We test the bilevel optimization model on a 2-by-2 grid world example, where an analytical solution of the reward design can be derived. We then compare the analytical and BO-derived optimal values to verify the correctness of our BO algorithm.
The dataset consists of seven deterministic passenger requests in a 2by2 grid world setup, as shown in Figure (7). At , there are five idle drivers in grid and five in grid . At time , five passenger requests with fare deterministically appear in grid and two passenger requests with fare appear in grid .
Without any reward design, the optimal policy for all drivers is to enter grid , because the expected return for entering grid is at least (i.e., ) while that for entering grid is at most . The resulting ORR is , which is not desirable from the perspective of the platform because it is expected to achieve a ORR in this setting. Actually, the platform can achieve a better ORR by adjusting the reward that drivers earn through the use of a platform service charge (aka the commission fee). The platform service charge used in this study is denoted as a fare percentage. For instance, a 10% service charge means the platform takes 10% of the fare paid by the passenger to the driver as its revenue. In other words, the driver gets less money under a higher service charge while the payment from the passenger remains the same. To achieve a better ORR, the platform needs to place a high service charge in grids which are oversupplied. Drivers oversupply grid because on average they can earn more by entering grid , compared with entering other grids. A high service charge placed in grid can effectively reduce monetary returns for drivers entering and make grid less attractive to drivers. Thus, some drivers choose other grids and take other passenger requests, resulting in an increase in ORR.
Before introducing a functional form of the platform service charge, we formally introduce two quantities, namely the demand to supply ratio (DS) and the service charge (SC). We then construct an effective form of SC as a function of DS. The rationale relating SC to DS is as follows. In a given grid, a relatively small DS indicates that the grid is oversupplied, and a relatively large DS means the grid is undersupplied. The goal of the platform is to get the DS of each grid as close to 1 as possible, meaning a balance between demand and supply. In a grid with DS below 1, SC is expected to be large to discourage drivers from oversupplying the grid, while in a grid with DS above 1, SC is supposed to be small. To demonstrate the rationale, we use a piecewise linear function with a single parameter as the SC in each grid, i.e.,
(12) 
where a relatively high SC is applied to grids with a low DS and no SC is applied to grids with DS above 1.
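One piecewise linear form consistent with this description is sketched below. It is an assumption, not a transcription of Equation (12): the parameter (named `alpha` here for illustration) is taken to be the service charge at a DS ratio of zero, the charge falls linearly to zero at a DS ratio of 1, and the function name is hypothetical.

```python
def service_charge(ds, alpha):
    """Fraction of the fare kept by the platform in a grid with DS ratio `ds`.

    Assumed form: alpha * (1 - ds) for oversupplied grids (ds < 1), zero otherwise.
    """
    if ds < 1.0:
        return alpha * (1.0 - ds)
    return 0.0
```

Under this form a grid with DS 0.5 and `alpha` 0.8 is charged 40% of the fare, while any grid with DS of 1 or more is not charged.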
With an adjustable parameter , the platform aims to maximize some objective , consisting of two components, namely ORR and overall service charge (OSC), where
The rationale of choosing these two components is as follows. First, the platform aims to maximize ORR, because a larger ORR typically means higher revenue and higher customer satisfaction. To maximize ORR alone, the platform would simply choose the largest possible value of the parameter: with the largest possible value, the platform penalizes drivers heavily for oversupplying a grid, and drivers are therefore directed to other grids. This strategy, however, is a serious threat to the long-term growth of the platform because drivers are very likely to quit under such a high service charge. Thus, the platform also needs to maintain a relatively small OSC. Considering the competition between ORR and OSC, we use a weighted average of ORR and (1 - OSC) as the objective of the platform, i.e.,
(13) 
where is the weight for ORR. In this case study, we set , meaning that the platform cares more about ORR. We then use two methods, namely, BO and an analytical method, to determine the optimal value of .

BO. We first employ BO with the objective function given in Equation (13). For a bilevel optimization problem, we first need to check the convergence of the lower level. As an example to validate the convergence, ORR and (1 - OSC) versus the index of iterations are presented in Figure (8) for a fixed value of the parameter. ORR increases very fast and (1 - OSC) steadily decreases during the first 1,000 iterations, where agents explore the environment and learn the optimal policy. ORR and (1 - OSC) gradually converge after 1,000 iterations, when agents mainly exploit the knowledge they have gained through their previous explorations.
With the validated convergence of the lower level MAS, we run BO with a limited computation budget of 20 evaluations (i.e., we are only allowed to evaluate the objective at 20 different values of the parameter). The result from BO is presented in Figure (9). The evaluations of the objective are noticeably noisy. In other words, the evaluated objective may be slightly different even for the same parameter value. This is expected because there are multiple local optima when solving the lower level MAS. Actually, it is commonly impossible to find a global optimum using deep learning, and researchers usually settle for local optima (Goodfellowetal2016). Local optima introduce noise into the evaluation of the objective at each parameter value. Although the evaluations are noisy, the fitted curve is able to capture the mean objective for each parameter value. Due to the relatively flat shape of the fitted curve around its maximum, there is a range of parameter values yielding the same optimal mean objective, which is 4.0% higher than the objective without any reward design.
Analytical method. Due to the simplicity of this case, we can analytically derive the optimal value of and shed some light on the effectiveness of the proposed platform service charge. Recall that the optimal policy for all drivers is to enter grid when . The resulting DS ratio in grid is , which is well below , meaning that grid is oversupplied. ORR is . To increase ORR, one needs to increase to penalize drivers who oversupply a grid. As gradually increases, grid becomes less attractive, because the expected return one driver can earn decreases as increases. When the expected return one driver can earn is less than , one driver will enter grid instead of grid for a higher monetary return. Note that to ease the analysis, we assume the number of drivers entering a grid is always an integer. Similarly, as one keeps increasing , the second driver will choose to enter instead of grid . Now we present how we calculate the critical value of below which there is no driver choosing to enter grid while above which there is one driver attracted by grid . With one driver entering grid , there are 9 drivers entering grid , resulting a DS ratio in grid . , meaning that the expected return for these 9 drivers is . The expected return for the driver entering grid is . We then have the critical condition , yielding . Similarly, we can calculate the critical value of below which there is one driver choosing to enter grid while above which there are two drivers attracted by grid , and the critical value is .
[Table 2: Values of interest. Columns: DS ratio in grid, ORR, OSC.]
Values of interest are presented in Table (2). With , and . The objective is . With increasing to , there is one driver attracted by grid , resulting in a ORR. The OSC is calculated as follows. The DS ratio in grid is now , resulting in . Thus, . The objective is . Similarly, with increasing to , , , and the objective is . Increasing further does not improve ORR but increases OSC, resulting in a decrease in the objective. Thus, the analytically derived optimal value of is .
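The critical-value reasoning of the analytical method can be traced with a small sketch. The driver and request counts follow the case study setup; the grid labels A (five requests) and B (two requests), the fares `FARE_A` and `FARE_B`, the symbol `alpha`, and the service charge form `alpha * (1 - DS)` are hypothetical choices made only so the arithmetic can be executed.

```python
N_DRIVERS = 10               # five idle drivers in each of two grids
REQ_A, REQ_B = 5, 2          # requests appearing in grid A and grid B
FARE_A, FARE_B = 10.0, 4.0   # hypothetical fares, not taken from the paper


def orr(k):
    """Order response rate when k drivers enter grid B and the rest enter grid A."""
    served = min(N_DRIVERS - k, REQ_A) + min(k, REQ_B)
    return served / (REQ_A + REQ_B)


def expected_return_a(k, alpha):
    """Expected net fare of a driver in grid A when k drivers have left for grid B."""
    ds = REQ_A / (N_DRIVERS - k)
    charge = alpha * (1.0 - ds) if ds < 1.0 else 0.0  # assumed service charge form
    return min(ds, 1.0) * FARE_A * (1.0 - charge)


def critical_alpha(k):
    """Parameter value at which grid A (with k drivers gone) pays the same as grid B.

    Valid while grid A stays oversupplied and grid B is not (1 <= k <= REQ_B).
    """
    ds = REQ_A / (N_DRIVERS - k)
    return (1.0 - FARE_B / (ds * FARE_A)) / (1.0 - ds)
```

Under these assumptions `orr(0)`, `orr(1)`, and `orr(2)` are 5/7, 6/7, and 1, so moving more than two drivers cannot raise ORR further, which matches the argument that larger parameter values only increase OSC.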
The analytically derived optimal value of , i.e., , agrees well with the derived optimal range of from BO, i.e., . The optimum from the analytical solution, i.e., , however, deviates from its numerical counterpart, i.e., . The reason is explained as follows. First, in the analytical solution, the policy for agents is deterministic and exactly two drivers choose grid after increasing to ; while in BO, the derived optimal policy for agents with is stochastic, introducing variance in drivers' actions. For example, each driver has a probability of choosing grid and a probability of choosing grid . Although the expected number of agents in grid is and the expected number of agents in grid is , the probability of all agents choosing grid is . This variance reduces both ORR and (1 - OSC), resulting in a lower objective from BO, compared with the objective from the analytical solution. Second, for a given , there are multiple local optima when solving the lower level MAS, which may also contribute to a smaller objective.
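The effect of a stochastic policy on ORR can be checked with a short calculation. The sketch assumes that each of the ten drivers independently enters the two-request grid (labelled B here) with probability `p_b` and otherwise enters the five-request grid (labelled A), and that each grid serves as many requests as it has drivers for. The value 0.2 used in the usage note is hypothetical, chosen so that the expected split is eight and two drivers.

```python
from math import comb


def expected_orr(p_b, n_drivers=10, req_a=5, req_b=2):
    """Expected ORR when every driver independently enters grid B with probability p_b."""
    total = req_a + req_b
    value = 0.0
    for k in range(n_drivers + 1):  # k = number of drivers that end up in grid B
        prob = comb(n_drivers, k) * p_b ** k * (1.0 - p_b) ** (n_drivers - k)
        served = min(n_drivers - k, req_a) + min(k, req_b)
        value += prob * served / total
    return value


def prob_all_in_a(p_b, n_drivers=10):
    """Probability that no driver at all enters grid B."""
    return (1.0 - p_b) ** n_drivers
```

With `p_b = 0.2`, `expected_orr(0.2)` stays below the ORR of 1 reached by a deterministic split of eight and two drivers, because outcomes with fewer than two drivers in grid B leave requests unserved.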
Despite the intrinsic difference in the policy between the analytical approach and the numerical method (i.e., BO), the overall agreement in the optimal value of from both methods validates the proposed bilevel optimization model. Although the derived optimal seems large, it represents the service charge at a DS ratio of zero, and the DS ratio typically does not go below some value (e.g., in this case). Actually, the OSC is 0.181, which falls in a reasonable range. It is also worth mentioning that the objective can be increased by using a simple platform service charge. Other forms of reward design may further improve the performance of the platform and are left for future research.
6 Conclusion
Noticing the underutilization of taxi resources due to idle taxi drivers' cruising behavior, this study models the multi-driver repositioning task through a mean field multi-agent reinforcement learning approach. A mean field actor-critic algorithm is developed to solve the MAS with a given reward function. The direct application of the mean field actor-critic algorithm to the MAS is, however, very likely to yield a suboptimal equilibrium from the standpoint of the system. Thus, this study proposes a bilevel optimization with the upper level as a reward design and the lower level as the MAS. The upper level interacts with the lower level by adjusting the reward for the MAS.
To improve the performance of the system, current studies intentionally craft a reward function which aligns with the goal of the system but may not reflect the intrinsic reward of real drivers. In other words, drivers are forced to cooperate, which is not the real case. In this study, we treat drivers as selfish and noncooperative. Drivers aim to maximize their own interests instead of the performance of the system. The central controller (e.g., the e-hailing platform) achieves its goal by adjusting the reward that a driver can earn. To effectively solve for the optimal control parameter, we adopted a Bayesian optimization approach.
The bilevel optimization model is applied to a synthetic dataset. Using a simple piecewise linear platform service charge, the optimal intercept is derived as a range. The results show that the objective of the platform can be improved by using the proposed simple platform service charge, compared with no reward design. More complicated forms of reward design are believed to further increase the performance of the platform and are left for future research.