1 Motivation
Taxis, complementary to mass transit systems such as buses and subways, provide flexible-route, door-to-door mobility service. However, taxi drivers usually spend 35 to 60 percent of their time cruising to find the next passenger (Powell et al., 2011). This passenger-seeking process not only decreases taxi drivers’ income but also generates additional vehicle miles traveled, adding congestion and pollution to increasingly saturated roads.
Cruising is primarily caused by an imbalance between travel demand and supply. Market regulation (Yang et al., 2002) and taxi fare structure design (Yang et al., 2010a) have been proposed to balance taxi travel demand and supply. A network equilibrium model was developed (Yang and Wong, 1998; Wong and Yang, 1998) to capture the spatial imbalance between travel demand and supply, where a logit-based probability was introduced to describe the meeting between a vacant taxi and a waiting passenger. This model, in which a taxi driver is assumed to minimize the individual search time for the next passenger, was further extended to incorporate congestion effects and customer demand elasticity (Wong et al., 2001), to include the fare structure and fleet size regulation (Yang et al., 2002), to consider multiple user classes, multiple taxi modes, and customer hierarchical modal choice (Wong et al., 2008), and to use a meeting function to describe the search frictions between vacant taxis and waiting passengers (Yang et al., 2010b; Yang and Yang, 2011; Yang et al., 2014).
As taxi GPS trajectories become increasingly available, qualitative analysis has been performed to uncover drivers’ actual searching strategies. Liu et al. (2010) found that drivers with higher profits prefer routes with higher speed in both operational and idle states. Li et al. (2011) discovered that hunting is a more efficient strategy than waiting by comparing profitable and non-profitable drivers. Several logit-based quantitative models were developed to capture idle drivers’ searching behavior (Szeto et al., 2013; Sirisoma et al., 2010; Wong et al., 2014a, b, 2015a, 2015b). The bilateral searching behavior (i.e., taxis searching for customers and customers searching for taxis) was modeled through an absorbing Markov chain approach (Wong et al., 2005). A probabilistic dynamic programming routing model was proposed to capture the taxi driver’s routing decisions at intersections (Hu et al., 2012). Furthermore, a two-layer approach was presented, in which the first layer models the driver’s pickup location choice and the second layer accounts for the driver’s detailed route choice behavior (Tang et al., 2016).
Building on this understanding of drivers’ searching behavior, recommendations can be provided to idle drivers on where to find the next passenger. Accurate predictions of taxi supply (Phithakkitnukoon et al., 2010), demand (Moreira-Matias et al., 2012), and travel time (Tan et al., 2018) are stepping stones to these recommendations. The objectives that recommendations aim to achieve include minimizing the waiting time at the recommended location (Hwang et al., 2015) or the distance between the current location and the recommended location (Powell et al., 2011; Hwang et al., 2015), and maximizing the expected fare for the next trip (Powell et al., 2011; Hwang et al., 2015), the probability of finding a passenger (Ge et al., 2010), or the potential profit of a driver (Qu et al., 2014; Yuan et al., 2011).
The aforementioned studies mainly focused on recommending cruising routes or the next cruising location at the immediate next step, without optimizing long-run payoffs. A recommended customer-searching strategy may help a driver get an order as fast as possible but may not maximize this driver’s overall profit in one day. Models which can capture drivers’ long-term optimization strategy are needed. In recent years, the Markov Decision Process (MDP) has become increasingly popular for optimizing a single agent’s sequential decision-making over a period of time (Puterman, 1994). Several studies (Rong et al., 2016; Zhou et al., 2018; Verma et al., 2017; Gao et al., 2018; Yu et al., 2019) have employed MDPs to model idle drivers’ optimal searching strategy. In an MDP, an idle driver is an agent who makes sequential decisions on where to go to find the next passenger in a stochastic environment. The environment is characterized by a Markov process and transitions from one state to another once an action is specified by the idle driver. The driver aims to select an optimal policy which maximizes her long-term expected reward. Dynamic programming or Q-learning approaches are commonly used to solve an MDP (Sutton and Barto, 1998). Table (1) summarizes the existing studies using MDPs for passenger-seeking optimization. Note that in e-hailing there is strictly no passenger seeking, because it is the e-hailing platform that matches an idle driver to a passenger. However, e-hailing drivers still need to reposition themselves to improve their chance of being matched to a passenger request. In this paper, we use the terms passenger seeking and repositioning interchangeably.
| Reference | Network representation | State space | Action space | Reward | Algorithm |
|---|---|---|---|---|---|
| Rong et al. (2016) | Grid world | (grid id, time, incoming direction) | Moving to a neighboring grid or staying in the current grid | Taxi fare | Dynamic programming |
| Verma et al. (2017) | Grid world (static and dynamic zone structure) | (day-of-week, grid id, time interval) | Moving to any chosen grid (proposed an action detection algorithm) | Taxi fare - traveling distance cost - time cost | Q-learning (Monte Carlo) |
| Gao et al. (2018) | Grid world | (grid id, operating status) | Driving vacantly to neighboring grids to search, finding a passenger in the current grid, waiting static at the same spot | Ratio of the occupied taxi trip mileage to the previous empty mileage | Q-learning (temporal difference) |
| Yu et al. (2019) | Link-node | (node id, indicator of the current pickup-dropoff cycle) | Outgoing links from the current node | Taxi fare - operating cost | Value iteration |
| Lin et al. (2018) | Grid world | (grid id, time interval, global state) | Moving into a neighboring grid or staying in the current grid | Taxi fare - operating cost | Reinforcement learning (deep Q-learning) |
The existing MDP models were primarily developed for traditional taxi drivers’ sequential decision-making, where a driver has to see a passenger before a match happens. In other words, an idle driver’s searching process ends only when this driver sees a passenger and the passenger accepts the ride (see Figure 1(a)). E-hailing applications (such as Didi and Uber), on the other hand, offer an online platform to match a driver with a passenger even when they are not present in the same space at the same time (He and Shen, 2015; Qian and Ukkusuri, 2017). In other words, even when an idle driver sees a passenger waiting on the roadside, the driver cannot give a ride to the passenger unless the e-hailing platform matches them. However, this does not mean that e-hailing drivers always stay at the previous drop-off spot and wait for the platform to match them. Drivers tend to reposition themselves so that the platform can find them a match sooner. As a result, the decision-making process of e-hailing drivers differs from that of traditional taxi drivers in the following aspects:

(1) An e-hailing driver may receive a matched order before she drops off the previous passenger, in which case there is no passenger seeking at all (see Figure 1(b)).

(2) Unlike a traditional taxi driver, who has to see a passenger to find a match, an e-hailing driver can be matched by the platform even when the driver and the passenger are spatially far from each other. In other words, a driver’s search process may end before a passenger is picked up (see Figure 1(c)).
Because of these inherent differences in drivers’ decision-making, this paper aims to develop an MDP to model e-hailing drivers’ sequential decision-making in searching for the next passenger. The 3-day GPS trajectories of 44,160 Didi drivers are used to calibrate and validate our model. Didi’s data have previously been used to study large-scale fleet management (Lin et al., 2018) and large-scale order dispatch (Xu et al., 2018) on e-hailing platforms.
The major contributions of this paper are as follows: (1) The proposed MDP model incorporates the characteristic features of e-hailing drivers. The derived optimal policy shows that, with the help of an e-hailing platform, drivers can simply choose to wait or stay within the current grid, especially when they are in a busy city area, which is specific to e-hailing. (2) To the best of our knowledge, this is the first study using a large amount of data to devise and calibrate a dynamic adjustment strategy of the order matching probability to address the competition among multiple drivers. The strategy attenuates the order matching probability exponentially for subsequent drivers to be guided into a grid when some drivers have already entered the grid. The strategy is further verified to be effective in providing different recommendations for multiple drivers.
The remainder of the paper is organized as follows. Section 2 introduces our modified MDP model and details definitions of states, actions, and state transitions and the process of extracting parameters from the data. Section 3 presents the proposed dynamic adjustment strategy of the order matching probability and details the calibration process. Section 4 introduces the data we used in this research and presents the results, including the derived optimal policy and the Monte Carlo simulation. Section 5 concludes the paper and provides some future research directions.
2 Markov Decision Process (MDP) for a single agent
2.1 Preliminaries
An MDP is specified by a tuple $(S, A, R, P, s_0)$, where $S$ denotes the state space, $A$ stands for the allowable actions, $R$ collects rewards, $P$ defines a state transition matrix, and $s_0$ is the starting state. Given a state $s_t = s$ and a specified action $a_t = a$ at time $t$, the probability of reaching state $s_{t+1} = s'$ at time $t+1$ is determined by the probability transition matrix $P$, which is defined as
$$P^{a}_{ss'} = \Pr\left(s_{t+1} = s' \mid s_t = s,\ a_t = a\right). \tag{2.1}$$
From the initial state $s_0$, the process proceeds repeatedly by following the dynamics of the environment defined by Equation (2.1) until a terminal state (i.e., either the current time exceeds the terminal time or the current state is an absorbing state) is reached. An MDP satisfies the Markov property, which essentially says that the future process is independent of the past given the present, i.e.,
$$\Pr\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\right) = \Pr\left(s_{t+1} \mid s_t, a_t\right). \tag{2.2}$$
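There are two types of value functions in MDPs, namely, a state value $V(s)$ and a state-action value $Q(s, a)$. The actions that an agent will take form a policy $\pi$, which is a mapping from a state $s$ and an action $a$ to the probability $\pi(s, a)$ of taking action $a$ at state $s$. Then the value function of a state $s$ by following the policy $\pi$, denoted as $V^{\pi}(s)$, can be taken as the expectation of the future rewards, i.e.,
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right], \tag{2.3}$$
where $\gamma$ is a discount factor and $r_{t+k+1}$ is the reward received $k+1$ steps after time $t$. The state-action value of taking action $a$ at state $s$ by following policy $\pi$ is
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]. \tag{2.4}$$
The value function is a weighted average of the state-action values, i.e.,
$$V^{\pi}(s) = \sum_{i=1}^{N_s} \pi(s, a_i)\, Q^{\pi}(s, a_i), \tag{2.5}$$
where $\pi(s, a_i)$ is again the probability of taking action $a_i$ at state $s$ according to policy $\pi$, and $N_s$ is the total number of actions that are allowed to be taken in state $s$.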
Several algorithms, such as the dynamic programming method and the Q-learning approach (Sutton and Barto, 1998), have been developed to solve an MDP, i.e., to derive the optimal policy and the corresponding value functions. The dynamic programming algorithm is used in this work and will be explained later in Section (2.2.4). At optimality, the two types of value functions are related by
$$V^{*}(s) = \max_{a} Q^{*}(s, a). \tag{2.6}$$
The rationale underlying this relationship is simply to choose a policy which maximizes the value function expressed in Equation (2.5). For example, when an agent is at state $s$, the optimal policy suggests that the agent take the action with the largest state-action value, i.e., $\pi^{*}(s, a^{*}) = 1$ with $a^{*} = \arg\max_{a} Q^{*}(s, a)$ (i.e., the probability of taking action $a^{*}$ in state $s$ is 1). Accordingly, the state-action value at optimality can be written as
$$Q^{*}(s, a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right], \tag{2.7}$$
where $P^{a}_{ss'}$ is the probability of landing in state $s'$ after taking action $a$ in state $s$, and $R^{a}_{ss'}$ is the reward for choosing action $a$ at state $s$ and landing in state $s'$.
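As a minimal illustration of the optimality relations in Equations (2.6) and (2.7), the following Python sketch repeatedly applies the Bellman backup to a two-state toy problem. The states, actions, rewards, and the discount factor of 0.9 are hypothetical and chosen only for readability; they are not taken from the e-hailing model, and this generic value iteration is not the finite-horizon algorithm used later in this paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward).
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

TOY: Transitions = {
    "busy": {"stay": [(1.0, "busy", 2.0)], "move": [(1.0, "quiet", 0.0)]},
    "quiet": {"stay": [(1.0, "quiet", 0.5)], "move": [(1.0, "busy", -1.0)]},
}


def q_value(P: Transitions, V: Dict[str, float], s: str, a: str, gamma: float) -> float:
    """Equation (2.7): expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])


def value_iteration(P: Transitions, gamma: float = 0.9, tol: float = 1e-10,
                    max_iter: int = 10_000):
    """Apply Equation (2.6) to every state until the values stop changing."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        converged = max(abs(new_V[s] - V[s]) for s in P) < tol
        V = new_V
        if converged:
            break
    # Greedy (deterministic) policy: pick the action with the largest Q value.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(TOY)
```

In this toy problem the greedy policy stays in the "busy" state and moves out of the "quiet" one, even though moving costs an immediate reward of -1, which is the kind of long-run trade-off that a one-step recommendation would miss.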
2.2 MDP for ehailing drivers
In this section, we develop an MDP model for e-hailing drivers’ stochastic passenger-seeking process. The notation used in the subsequent analysis is listed in Table (2).
| Variable | Explanation |
|---|---|
| $l$ | Index of the current grid |
| $t$ | Current time |
| $I$ | Indicator, denoting whether the driver has been matched to a request before the next drop-off |
| $s$ | State, $s = (l, t, I)$ |
| $S$ | State space, a collection of all states |
| $a$ | Action |
| $A$ | Action space |
| $t_{\text{seek}}(l)$ | Time spent on seeking a passenger in grid $l$ |
| $t_{\text{drive}}(l, k)$ | Time spent on moving from grid $l$ to grid $k$ |
| $d_{\text{seek}}(l)$ | Distance traveled when seeking a passenger in grid $l$ |
| $d_{\text{drive}}(l, k)$ | Distance traveled for moving from grid $l$ to grid $k$ |
| $p_{\text{seek}}(l)$ | The probability that the driver can be matched to a request during cruising in grid $l$ |
| $p_{\text{pickup}}(l, k)$ | The probability of picking up a passenger in grid $k$ when the request from the passenger was matched to the driver in grid $l$ |
| $p_{\text{dest}}(k, m)$ | The probability of dropping off a passenger in grid $m$ when the passenger was picked up in grid $k$ |
| $p_{\text{match}}(k, m)$ | The probability of receiving a new request before the driver finishes her current order at grid $m$ |
| $f(k, m)$ | The average taxi fare from grid $k$ to grid $m$ |
| $\alpha$ | Coefficient of fuel consumption and other operating costs per unit distance |
2.2.1 States
In our MDP model, the state $s = (l, t, I)$ consists of three components, namely, a grid index $l$, the current time $t$, and an indicator $I$. A hexagonal grid world with 6,421 grids is adopted in this research and will be explained later. Because an e-hailing driver may receive a request before she drops off the previous passenger, we add the indicator $I$ to the state. The indicator denotes whether the driver has been matched to a request before she arrives at the current state. Accordingly, states with indicator $I = 0$ are decision-making states, in which the driver needs to spend time seeking the next passenger, and states with indicator $I = 1$ are non-decision-making states. For example, $s = (l, t, 1)$ is a non-decision-making state: the driver is in grid $l$ at time $t$ and has already been matched to a request, so she will not spend time seeking at the current state.
2.2.2 Actions
In decision-making states, the driver has to choose one of eight allowable actions, which form the action space $A$. In non-decision-making states, the driver does not take any action but drives to pick up the next passenger and transports the passenger to the destination. Each of the first six actions in $A$ moves the driver from the current grid to one of the six neighboring grids. Because some of the six neighboring grids may be unreachable, we add a large penalty, i.e., a large distance, to the transition from a grid to an unreachable neighboring grid, which prevents the agent from taking the corresponding action. The seventh action is to stay and cruise around within the current grid. The last action is to wait in the current grid. We stress that the last two actions are essentially different: from the data we have observed that some drivers simply wait near the previous drop-off spot, especially around downtown or transportation terminals, while other drivers usually cruise within the current grid after completing a ride. In addition, the fuel cost associated with waiting can be neglected, while that of staying can be substantial because the driver keeps cruising around while staying in the current grid. Furthermore, drivers can rest during waiting, and hence their driving may be more efficient on future trips. These arguments, however, do not necessarily suggest that the driver should always choose waiting rather than staying. Under certain circumstances, drivers have to cruise around to get closer to potential requests.
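The eight-action space and the penalty for unreachable neighbors can be sketched as follows. This is only an illustration under stated assumptions: the axial `(q, r)` indexing of hexagonal cells, the action names, the one-kilometer hop distance, and the penalty value are hypothetical choices, since the paper does not specify how grids are indexed or how large the penalty distance is.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # axial (q, r) coordinates; an assumed indexing scheme

# Offsets of the six neighbors of a hexagonal cell in axial coordinates.
MOVES: Dict[str, Cell] = {
    "move_0": (1, 0), "move_1": (1, -1), "move_2": (0, -1),
    "move_3": (-1, 0), "move_4": (-1, 1), "move_5": (0, 1),
}
# Six moves plus staying (cruising in place) and waiting: eight actions.
ACTIONS: List[str] = list(MOVES) + ["stay", "wait"]

UNREACHABLE_PENALTY_KM = 1e6  # assumed "large distance" for blocked moves


def next_cell(cell: Cell, action: str) -> Cell:
    """Cell in which the driver seeks after taking the action."""
    if action in ("stay", "wait"):
        return cell
    dq, dr = MOVES[action]
    return (cell[0] + dq, cell[1] + dr)


def move_distance_km(cell: Cell, action: str, reachable: Set[Cell],
                     hop_km: float = 1.0) -> float:
    """Driving distance of the move; a huge value marks unreachable neighbors."""
    if action in ("stay", "wait"):
        return 0.0
    return hop_km if next_cell(cell, action) in reachable else UNREACHABLE_PENALTY_KM
```

Because the reward charges the operating cost per unit distance, a move with the penalty distance receives a very negative state-action value and is never selected by a maximizing policy.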
2.2.3 State transition
After completing a ride, there are two possible scenarios according to the two values of the indicator. If the indicator is 0, the driver needs to specify an action, i.e., where to find the next passenger, and then moves into the grid along the direction defined by the action and spends some time seeking the next passenger in the new grid. There are two possible outcomes of this passenger-seeking process: either the driver confirms a request and ends up in a different grid by following the passenger’s travel plan, or the driver fails to find a request and stays in the current grid. The reward is usually positive for the former and negative for the latter because of fuel consumption and other operating costs. If the indicator is 1, the driver drives to the pickup spot and then transports the passenger to the destination without any passenger seeking involved.
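Figure (2) illustrates the aforementioned state transition process. The driver is currently at state $s = (l, t, I)$.
If $I = 0$, the driver specifies an action $a$, which is taken as going northeast in the demonstration. The driver then moves into the grid $l_a$ along the direction defined by the action and spends some time seeking the next passenger. There are two possible outcomes of this passenger-seeking process.
The first possibility is that the driver fails to get any request in grid $l_a$ after seeking for $t_{\text{seek}}(l_a)$. In this case, the state of the driver will be $s_1 = (l_a,\ t + t_{\text{drive}}(l, l_a) + t_{\text{seek}}(l_a),\ 0)$, where $t_{\text{drive}}(l, l_a)$ is the driving time from grid $l$ to grid $l_a$. The reward for this passenger-seeking process is $-\alpha \left( d_{\text{drive}}(l, l_a) + d_{\text{seek}}(l_a) \right)$, which is negative; here $\alpha$ is the operating cost per unit distance, and $d_{\text{drive}}$ and $d_{\text{seek}}$ are the driving and seeking distances. Let $p_{\text{seek}}(l_a)$ denote the probability that the driver will receive at least one request in grid $l_a$. Then this outcome occurs with probability $1 - p_{\text{seek}}(l_a)$. In other words, with probability $1 - p_{\text{seek}}(l_a)$, the driver will end up in state $s_1$.
The second possibility is that the driver confirms one request during the cruising process in $l_a$, which occurs with probability $p_{\text{seek}}(l_a)$. We let $p_{\text{pickup}}(l_a, k)$ denote the probability of confirming a request in grid $l_a$ and picking up the passenger in grid $k$. Once the passenger is on board, the driver moves directly to the destination $m$, which depends only on the passenger’s travel plan. We let $p_{\text{dest}}(k, m)$ denote the probability of picking up a passenger in grid $k$ and dropping off the passenger in grid $m$. After dropping off the passenger, the driver ends up in grid $m$ at time $t' = t + t_{\text{drive}}(l, l_a) + t_{\text{seek}}(l_a) + t_{\text{drive}}(l_a, k) + t_{\text{drive}}(k, m)$ and earns a reward of $f(k, m) - \alpha \left( d_{\text{drive}}(l, l_a) + d_{\text{seek}}(l_a) + d_{\text{drive}}(l_a, k) + d_{\text{drive}}(k, m) \right)$, where $f(k, m)$ is the average fare from grid $k$ to grid $m$. Hence, the probability that the driver receives a request in grid $l_a$, picks up the passenger in grid $k$, and transports the passenger to grid $m$ is $p_{\text{seek}}(l_a)\, p_{\text{pickup}}(l_a, k)\, p_{\text{dest}}(k, m)$. During the trip from the passenger’s origin $k$ to the passenger’s destination $m$, the driver may confirm a new request before she drops off the passenger. Let $p_{\text{match}}(k, m)$ denote the probability of receiving a request before the driver reaches grid $m$. Then with probability $p_{\text{seek}}(l_a)\, p_{\text{pickup}}(l_a, k)\, p_{\text{dest}}(k, m)\, p_{\text{match}}(k, m)$, the driver will end up in state $s_2 = (m, t', 1)$, and with probability $p_{\text{seek}}(l_a)\, p_{\text{pickup}}(l_a, k)\, p_{\text{dest}}(k, m) \left( 1 - p_{\text{match}}(k, m) \right)$, the driver will end up in state $s_3 = (m, t', 0)$.
If $I = 1$, the driver does not need to specify any action and drives directly to the pickup spot of the next passenger, say in grid $k$, and then transports the passenger to the destination $m$. Again, during her trip to the passenger’s destination, there is a probability, denoted as $p_{\text{match}}(k, m)$, that the driver will receive a request before she drops off the passenger. As illustrated in Figure (2), given the pickup grid $k$ and the destination $m$, with probability $p_{\text{match}}(k, m)$ the driver will end up in state $(m, t'', 1)$; with probability $1 - p_{\text{match}}(k, m)$, the driver will end up in state $(m, t'', 0)$, where $t'' = t + t_{\text{drive}}(l, k) + t_{\text{drive}}(k, m)$.
In both scenarios, namely, either $I = 0$ or $I = 1$, the driver then restarts the whole process from the new state until the current time exceeds the time interval, i.e., a terminal state has been reached.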
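The branching from a decision-making state can be written out as an explicit list of outcomes. The sketch below is a hypothetical encoding, not the authors' implementation: the `Params` container, its field names, the toy numbers, and the placeholder operating cost of 0.5 Yuan per kilometer are all assumptions. It covers the moving and staying actions only; waiting would additionally set the seeking distance to zero and use the waiting-specific matching probability.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int, int]          # (grid, time in minutes, matched indicator)
Outcome = Tuple[float, State, float]  # (probability, next state, reward)
Pair = Tuple[int, int]


@dataclass
class Params:
    p_seek: Dict[int, float]               # match probability while seeking in a grid
    p_pickup: Dict[int, Dict[int, float]]  # seeking grid -> pickup grid -> probability
    p_dest: Dict[int, Dict[int, float]]    # pickup grid -> destination grid -> probability
    p_match: Dict[Pair, float]             # on-trip match probability per (pickup, destination)
    t_drive: Dict[Pair, int]               # driving time in minutes
    d_drive: Dict[Pair, float]             # driving distance in kilometers
    fare: Dict[Pair, float]                # average fare in Yuan
    t_seek: int = 1                        # minutes, as in Section 2.3.7
    d_seek: float = 0.3                    # kilometers, as in Section 2.3.7
    alpha: float = 0.5                     # Yuan per kilometer; placeholder value


def decision_outcomes(state: State, target: int, prm: Params) -> List[Outcome]:
    """Enumerate outcomes of moving from the current grid to `target` and seeking there."""
    l, t, _ = state
    t_fail = t + prm.t_drive[(l, target)] + prm.t_seek
    d_search = prm.d_drive[(l, target)] + prm.d_seek
    p = prm.p_seek[target]
    # Outcome 1: no request is received, so only the operating cost is incurred.
    outcomes = [(1.0 - p, (target, t_fail, 0), -prm.alpha * d_search)]
    # Outcome 2: a request is confirmed, picked up in grid k, and dropped off in grid m.
    for k, p_k in prm.p_pickup[target].items():
        for m, p_m in prm.p_dest[k].items():
            t_end = t_fail + prm.t_drive[(target, k)] + prm.t_drive[(k, m)]
            dist = d_search + prm.d_drive[(target, k)] + prm.d_drive[(k, m)]
            reward = prm.fare[(k, m)] - prm.alpha * dist
            base, q = p * p_k * p_m, prm.p_match[(k, m)]
            outcomes.append((base * q, (m, t_end, 1), reward))          # matched on trip
            outcomes.append((base * (1.0 - q), (m, t_end, 0), reward))  # not matched
    return outcomes


# Hypothetical three-grid example.
GRIDS = (1, 2, 3)
PAIRS = [(i, j) for i in GRIDS for j in GRIDS]
TOY = Params(
    p_seek={2: 0.8},
    p_pickup={2: {2: 0.5, 3: 0.5}},
    p_dest={2: {1: 1.0}, 3: {1: 0.5, 2: 0.5}},
    p_match={pair: 0.25 for pair in PAIRS},
    t_drive={(i, j): 0 if i == j else 5 for i, j in PAIRS},
    d_drive={(i, j): 0.0 if i == j else 2.0 for i, j in PAIRS},
    fare={pair: 10.0 for pair in PAIRS},
)
OUTCOMES = decision_outcomes((1, 0, 0), 2, TOY)
```

A useful sanity check on any calibrated parameter set is that the outcome probabilities returned for a state and action sum to one.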
2.2.4 Solving MDP
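The objective of the MDP model is to maximize the total expected revenue of a driver. Because a driver can finish only a finite number of pickup and drop-off cycles within a time interval, the MDP model is finite-horizon. When the current time of the driver has reached the end of the time interval, no more actions can be taken and no more rewards can be earned. Suppose a driver is currently at state $s = (l, t, I)$. If $I = 0$, meaning that the driver is at a decision-making state, the maximum expected revenue that a driver can earn by starting from $s$ and specifying an action $a$ is
$$\begin{aligned} Q^{*}(s, a) = {} & \left( 1 - p_{\text{seek}}(l_a) \right) \left[ -\alpha \left( d_{\text{drive}}(l, l_a) + d_{\text{seek}}(l_a) \right) + V^{*}(s_1) \right] \\ & + p_{\text{seek}}(l_a) \sum_{k} p_{\text{pickup}}(l_a, k) \sum_{m} p_{\text{dest}}(k, m) \Big[ f(k, m) - \alpha \left( d_{\text{drive}}(l, l_a) + d_{\text{seek}}(l_a) + d_{\text{drive}}(l_a, k) + d_{\text{drive}}(k, m) \right) \\ & \qquad + p_{\text{match}}(k, m)\, V^{*}(s_2) + \left( 1 - p_{\text{match}}(k, m) \right) V^{*}(s_3) \Big], \end{aligned} \tag{2.8}$$
where $l_a = l_a(s, a)$, meaning that the grid in which an e-hailing driver will be cruising depends on the current state $s$ (through its grid index) and the specified action $a$; $s_1 = (l_a,\ t + t_{\text{drive}}(l, l_a) + t_{\text{seek}}(l_a),\ 0)$, $t' = t + t_{\text{drive}}(l, l_a) + t_{\text{seek}}(l_a) + t_{\text{drive}}(l_a, k) + t_{\text{drive}}(k, m)$, $s_2 = (m, t', 1)$, and $s_3 = (m, t', 0)$; and $V^{*}(s_1)$, $V^{*}(s_2)$, and $V^{*}(s_3)$ stand for the maximum expected revenue that a driver can earn by reaching states $s_1$, $s_2$, and $s_3$, respectively. If $I = 1$, meaning that the driver is at a non-decision-making state, the driver will not specify any action, and the expected revenue that the driver can earn is
$$V^{*}(s) = \sum_{k} p_{\text{pickup}}(l, k) \sum_{m} p_{\text{dest}}(k, m) \Big[ f(k, m) - \alpha \left( d_{\text{drive}}(l, k) + d_{\text{drive}}(k, m) \right) + p_{\text{match}}(k, m)\, V^{*}(s_4) + \left( 1 - p_{\text{match}}(k, m) \right) V^{*}(s_5) \Big], \tag{2.9}$$
where $s_4 = (m,\ t + t_{\text{drive}}(l, k) + t_{\text{drive}}(k, m),\ 1)$, $s_5 = (m,\ t + t_{\text{drive}}(l, k) + t_{\text{drive}}(k, m),\ 0)$, and $V^{*}(s_4)$ and $V^{*}(s_5)$ stand for the maximum expected revenue that a driver can earn by reaching states $s_4$ and $s_5$, respectively.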
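Then the optimal policy for a driver to follow at a decision-making state $s$ is
$$\pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a), \tag{2.10}$$
and the maximum expected revenue that a driver can earn by reaching state $s$ is
$$V^{*}(s) = \max_{a \in A} Q^{*}(s, a). \tag{2.11}$$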
The policy in Equation (2.10) is deterministic, meaning that the driver takes exactly one action at the current decision-making state if she follows the policy. Here we slightly abuse the notation. The policy is supposed to be $\pi^{*}(s, a^{*}) = 1$, i.e., the probability of taking action $a^{*}$ at state $s$ is 1. This is equivalent to saying that at state $s$ the action to take is $a^{*}$, and thus we write the policy at state $s$ as in Equation (2.10). A deterministic policy maps each state to exactly one action. The deterministic policy works when there is only one driver who learns and follows the optimal policy. Otherwise, there might be excess taxi supply in some areas, resulting in localized competition among taxis. A circulating mechanism was employed to tackle this overload problem (Ge et al., 2010). A multi-agent reinforcement learning approach (Lin et al., 2018) was proposed to consider the competition among drivers. In this research, we use a dynamic adjustment strategy to update the order matching probability when multiple idle e-hailing drivers are guided into the same grid. The proposed dynamic adjustment strategy will be introduced in Section (3).
To efficiently solve the MDP, i.e., to derive an optimal policy, a dynamic programming approach is employed (Bertsekas, 2000; Sutton and Barto, 1998). The basic idea of the dynamic programming algorithm is to divide the overall problem into subproblems and hence to make use of the results of the subproblems to solve the overall problem. An important advantage of the dynamic programming algorithm is that it caches results of all subproblems and thus it is guaranteed that the same subproblem is only solved once.
Now we elucidate how we apply the dynamic programming algorithm to solve the MDP. The goal is to solve the optimal value $V^{*}(s)$ for all states $s = (l, t, I)$, where $l \in \{1, \ldots, N_l\}$, $t \in \{0, \ldots, T\}$, and $I \in \{0, 1\}$, with $N_l$ the number of grids and $T$ the final time step. There are in total $2 N_l (T + 1)$ states, and half of them are decision-making states. We define one subproblem as solving the optimal value for one state, and thus we have in total $2 N_l (T + 1)$ subproblems. Noticing that at the final time step, i.e., $t = T$, the maximum expected reward that a driver can earn is zero, we have $V^{*}(s) = 0$ for all states where $t = T$. For any state $s$ with $I = 0$ and a chosen action $a$, the calculation of the state-action value $Q^{*}(s, a)$ depends on the values of some future states, i.e., $V^{*}(s_1)$, $V^{*}(s_2)$, and $V^{*}(s_3)$ in Equation (2.8). In other words, solving the optimal value for state $s$ depends on solving the optimal values of future states such as $s_1$, $s_2$, and $s_3$. A future state $s'$ may be reachable from several states, so computing the optimal values of all those states naively would recompute the optimal value of $s'$ multiple times and waste computation. To avoid this repeated calculation, we adopt the dynamic programming algorithm. Since the optimal values for all states with $t = T$ are known and the optimal value of a state depends only on the optimal values of future states, we solve the optimal values backwards in time and store them in a hash table. Then for a state $s$ and a chosen action $a$, we read the optimal values of the future states from the hash table and use Equation (2.8) to calculate the state-action value $Q^{*}(s, a)$, from which the optimal value $V^{*}(s)$ is derived by Equation (2.11). The pseudocode is given in Algorithm (1).
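The backward sweep with a hash table can be sketched in a few lines of Python. This is a hedged illustration rather than a transcription of Algorithm (1): the function name, the `outcomes` callback that returns `(probability, next state, reward)` triples, and the deterministic two-grid toy problem are all hypothetical. The sketch assumes an undiscounted objective and that every transition advances time by at least one step, so that all successor values are already cached when a state is processed.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Tuple[int, int, int]          # (grid, time step, matched indicator)
Outcome = Tuple[float, State, float]  # (probability, next state, reward)


def backward_induction(
    grids: Sequence[int],
    horizon: int,
    actions: Sequence[str],
    outcomes: Callable[[State, Optional[str]], List[Outcome]],
) -> Tuple[Dict[State, float], Dict[State, str]]:
    value: Dict[State, float] = {}  # hash table of solved optimal values
    policy: Dict[State, str] = {}
    for t in range(horizon, -1, -1):  # sweep backwards in time
        for grid in grids:
            for ind in (0, 1):
                s = (grid, t, ind)
                if t == horizon:
                    value[s] = 0.0  # nothing can be earned at the final step
                    continue
                best_a, best_q = None, float("-inf")
                # Non-decision-making states have a single implicit "action".
                for a in (actions if ind == 0 else [None]):
                    q = 0.0
                    for p, s_next, r in outcomes(s, a):
                        assert s_next[1] > t, "transitions must move forward in time"
                        # States beyond the horizon are terminal and worth zero.
                        q += p * (r + value.get(s_next, 0.0))
                    if q > best_q:
                        best_a, best_q = a, q
                value[s] = best_q
                if ind == 0:
                    policy[s] = best_a
    return value, policy


# Hypothetical deterministic toy: staying earns 1 in grid 0 and 3 in grid 1.
REWARD = {0: 1.0, 1: 3.0}


def toy_outcomes(s: State, a: Optional[str]) -> List[Outcome]:
    grid, t, ind = s
    if ind == 1:
        return [(1.0, (grid, t + 1, 0), 0.0)]
    if a == "stay":
        return [(1.0, (grid, t + 1, 0), REWARD[grid])]
    return [(1.0, (1 - grid, t + 1, 0), 0.0)]


VALUE, POLICY = backward_induction([0, 1], 3, ["stay", "move"], toy_outcomes)
```

In the toy problem the driver in grid 0 first moves to the more profitable grid 1 but prefers to stay in grid 0 at the last decision step, when there is no time left to benefit from relocating.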
2.3 Extracting parameters from data
In the dataset we used in this research, we have GPS trajectories for both the empty and occupied trips. We now introduce how to extract the parameters we used in the state transition from the dataset.
2.3.1 Order matching probability
The order matching probability $p_{\text{seek}}(l)$ estimates the probability that a vacant taxi is matched to a passenger while cruising (including staying) or waiting in grid $l$. As mentioned before, waiting and staying are introduced in this work for different purposes. In addition to the six actions which allow an e-hailing driver to move into one of the six neighboring grids, the action staying gives the driver extra flexibility to stay and cruise within the current grid because of potential benefits such as a relatively high order matching probability in the current grid or a high cost of moving vacantly into neighboring grids. As listed in Table (1), several studies in the literature have already included staying in the action space, such as Rong et al. (2016), Verma et al. (2017), and Lin et al. (2018). Thus, the order matching probability for staying is calculated in the same way as that for cruising into one of the six neighboring grids. In other words, the order matching probability for a driver just entering the grid from one of the six neighboring grids is assumed to be the same as that for a driver who was already in the grid and chose to stay.
Waiting, unlike staying and the six moving actions, is included in the action space based on the observation that a driver will sometimes stop cruising and simply wait for passenger requests to come in, especially around downtown or transportation terminals. The action waiting was previously included in the action space by Gao et al. (2018).
We thus approximate the order matching probabilities for cruising and waiting separately. We say a driver is waiting for a passenger request whenever the driver’s traveling distance is less than 200 meters over a 3-minute interval. To rule out unrealistic waiting actions, such as a driver being stuck in traffic, we further limit the possible waiting locations to places around subway stations, bus terminals, airports, and some famous tourist attractions. For cruising, the order matching probability can be approximated as the ratio of the number of times that a taxi is matched to a passenger in grid $l$ while cruising, denoted as $n^{\text{cruise}}_{\text{match}}(l)$, to the number of times that grid $l$ is passed by an empty taxi while cruising, denoted as $n^{\text{cruise}}_{\text{pass}}(l)$. For waiting, the order matching probability can be approximated as the ratio of the number of times that a taxi is matched to a passenger in grid $l$ while waiting, denoted as $n^{\text{wait}}_{\text{match}}(l)$, to the number of times that empty taxis have waited in grid $l$, denoted as $n^{\text{wait}}(l)$.
$$p^{\text{cruise}}_{\text{seek}}(l) = \frac{n^{\text{cruise}}_{\text{match}}(l)}{n^{\text{cruise}}_{\text{pass}}(l)}, \qquad p^{\text{wait}}_{\text{seek}}(l) = \frac{n^{\text{wait}}_{\text{match}}(l)}{n^{\text{wait}}(l)}. \tag{2.12}$$
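The waiting rule stated above (less than 200 meters traveled over a 3-minute interval) can be checked on a GPS trace roughly as follows. This is a sketch under assumptions: the `(timestamp, latitude, longitude)` record layout and the function names are hypothetical, distances are approximated by the haversine formula, and the additional restriction to locations near transit hubs and attractions is not implemented here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (timestamp in seconds, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(h))


def is_waiting(points: List[Point], window_s: float = 180.0,
               max_dist_m: float = 200.0) -> bool:
    """True if some stretch of at least `window_s` seconds covers less than `max_dist_m`."""
    for i in range(len(points)):
        travelled = 0.0
        for j in range(i + 1, len(points)):
            travelled += haversine_m(points[j - 1][1], points[j - 1][2],
                                     points[j][1], points[j][2])
            if travelled >= max_dist_m:
                break  # the driver moved too far from this starting point
            if points[j][0] - points[i][0] >= window_s:
                return True
    return False


# Hypothetical traces: one parked vehicle and one moving about 111 m per minute.
PARKED = [(0.0, 30.0, 104.0), (90.0, 30.0, 104.0), (180.0, 30.0, 104.0)]
MOVING = [(60.0 * k, 30.0 + 0.001 * k, 104.0) for k in range(5)]
```

The quadratic scan is kept for clarity; a two-pointer sliding window would give the same answer in linear time for long traces.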
2.3.2 Pickup probability
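The pickup probability $p_{\text{pickup}}(l, k)$ measures the likelihood of picking up a passenger in grid $k$ when the request sent by the passenger was matched to the driver in grid $l$. This parameter can be estimated as the ratio of the number of passenger pickups in grid $k$ which were matched to drivers in grid $l$, denoted as $n_{\text{pickup}}(l, k)$, to $n_{\text{match}}(l)$, which is the sum of the numbers of matches in grid $l$ while cruising and while waiting.
$$p_{\text{pickup}}(l, k) = \frac{n_{\text{pickup}}(l, k)}{n_{\text{match}}(l)}. \tag{2.13}$$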
2.3.3 Destination probability
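The destination probability $p_{\text{dest}}(k, m)$ measures the likelihood of the passenger’s destination being grid $m$ when the passenger was picked up in grid $k$. This parameter can be estimated by dividing the number of trips ending in grid $m$ which originated from grid $k$, denoted as $n_{\text{dest}}(k, m)$, by the total number of pickups in grid $k$, denoted as $n_{\text{pickup}}(k)$.
$$p_{\text{dest}}(k, m) = \frac{n_{\text{dest}}(k, m)}{n_{\text{pickup}}(k)}. \tag{2.14}$$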
2.3.4 Order matching probability while on trip
As mentioned before, the driver may receive a request while she is transporting the current passenger to the destination. We denote this order matching probability while on trip as $p_{\text{match}}(k, m)$. This probability can be estimated by dividing the number of occupied trips from grid $k$ to grid $m$ during which the driver received at least one request before reaching the destination, denoted as $n^{\text{trip}}_{\text{match}}(k, m)$, by the total number of occupied trips originating in grid $k$ and ending in grid $m$, denoted as $n_{\text{dest}}(k, m)$.
$$p_{\text{match}}(k, m) = \frac{n^{\text{trip}}_{\text{match}}(k, m)}{n_{\text{dest}}(k, m)}. \tag{2.15}$$
2.3.5 Driving time and driving distance
The driving time $t_{\text{drive}}(l, k)$ and the driving distance $d_{\text{drive}}(l, k)$ denote the estimated driving time and driving distance from grid $l$ to grid $k$, respectively. Here we simply take the average of all observed driving times from grid $l$ to grid $k$ as an approximation of $t_{\text{drive}}(l, k)$. Similarly, the driving distance is calculated by taking the average of all observed driving distances between grid $l$ and grid $k$.
2.3.6 Taxi fare
The taxi fare $f(k, m)$ denotes the estimated gross revenue that a driver can earn by transporting a passenger from grid $k$ to her destination grid $m$. Here we take the average of the fares of all occupied trips from grid $k$ to grid $m$ as a proxy for the real taxi fare from grid $k$ to grid $m$.
2.3.7 Seeking time and seeking distance
The seeking time $t_{\text{seek}}(l)$ and the seeking distance $d_{\text{seek}}(l)$ denote the estimated seeking time and seeking distance within grid $l$, respectively. The distribution of the seeking time in each grid was extracted from the field data and is shown in Figure (3). The median of the seeking time distribution is approximately 45 seconds. Since the time step size is 1 minute in this work, we simply take the seeking time as 1 minute. Given the average speed of seeking trips (around 300 meters/minute), the seeking distance is taken as 300 meters for each grid. Note that the seeking distance is zero when the driver chooses to wait.
2.3.8 The fuel consumption coefficient
The coefficient $\alpha$ estimates the fuel consumption and other operating costs per unit distance during driving. Here we take (Yuan/kilometer).
2.4 Numerical example
To illustrate the Markov Decision Process of e-hailing drivers, we use a 3-by-3 grid world numerical example, as shown in Figure 4(a).
Suppose now we have the following five trajectories.
Each element in the trajectory is a tuple consisting of three items, namely, the grid index, current time, and a status indicator showing if the driver has been matched to another order during the trip. For example, basically states that the driver is at grid at time , and the driver has not been matched to any order before she finished the previous trip.
All five trajectories started in grid , and then the driver moved into grid during idling. After the driver searched the grid , there are two possible outcomes: either the driver finds an order match or the driver fails to find any e-hailing order. If the driver fails to get an order match after searching, the driver will either move into other grids to find another order or stop working. To simplify the demonstration, we simply assume the trajectory ends in grid . For the other four trajectories, the driver managed to find an e-hailing order in grid . Based on this piece of information, four of the five searches ended in a match, so we can calculate the probability of finding an e-hailing order in grid as 4/5.
After confirming an order match in grid , the driver drives into grid to pick up the passenger in trajectories and , and stays within grid to pick up the passenger in trajectories and . Since each case covers two of the four matched trajectories, both pickup probabilities equal 2/4 = 0.5.
When the driver picks up the passenger in grid , as illustrated in trajectories and , the passenger’s destination is grid in and grid in , respectively. Since each destination appears in one of these two trajectories, both destination probabilities equal 1/2 = 0.5.
In trajectories and , the driver drives to grid to pick up the passenger, and the passenger goes to grid . During the trip, the driver has a of receiving a new order before she arrives at the destination of the passenger. The order matching probability while on trip can thus be calculated as .
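The counting behind these estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grid labels ("B" for the searched grid, "C" through "F" for the others), the dictionary layout, and the choice that exactly one of the two on-trip trajectories receives a new order are all hypothetical and are not taken from the trajectories listed above.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical summaries of five trajectories that search grid "B".
# Grid labels and the "rematched" outcomes are illustrative assumptions.
trips = [
    {"matched": False},
    {"matched": True, "pickup": "C", "dest": "D", "rematched": True},
    {"matched": True, "pickup": "C", "dest": "D", "rematched": False},
    {"matched": True, "pickup": "B", "dest": "E", "rematched": False},
    {"matched": True, "pickup": "B", "dest": "F", "rematched": False},
]


def frequencies(values):
    """Return the relative frequency of each distinct value as a Fraction."""
    values = list(values)
    counts = Counter(values)
    return {value: Fraction(count, len(values)) for value, count in counts.items()}


matched = [t for t in trips if t["matched"]]

# Probability of finding an order in the searched grid.
p_match = Fraction(len(matched), len(trips))

# Pickup-grid probabilities, conditional on a match.
p_pickup = frequencies(t["pickup"] for t in matched)

# Destination probabilities, conditional on a pickup in the searched grid.
p_dest_from_b = frequencies(t["dest"] for t in matched if t["pickup"] == "B")

# Probability of receiving a new order during trips picked up in grid "C".
trips_from_c = [t for t in matched if t["pickup"] == "C"]
p_rematch = Fraction(sum(t["rematched"] for t in trips_from_c), len(trips_from_c))
```

Using `Fraction` keeps each estimate as an exact ratio of counts, which makes it easy to check the result against a hand calculation.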
Based on the probabilities calculated above, an example of state transition is presented in Figure (b). To make the state transition consistent with the five trajectories, we suppose the driver is initially in grid with a status indicator , i.e., the driver is in state . If , meaning that the driver needs to seek an e-hailing order, the driver drives into grid and seeks e-hailing orders in that grid. There are two possible outcomes associated with this case.

The driver fails to find any e-hailing order in grid . The driver will end up in state and receive a negative reward , which is the fuel cost. This outcome happens with probability 1 - 4/5 = 0.2.

The driver successfully finds an e-hailing order in grid . For the purpose of demonstration, we assume the driver goes to grid to pick up the passenger, and the destination of the passenger is grid . The probability of this outcome is . The driver will receive a total reward of by completing this ride. During the trip, the driver has a probability of getting a new request before she arrives at the destination of the previous passenger. Hence, there are two possible subbranches from this outcome.

If the driver is matched to a new request before she drops off the previous passenger, then the driver will end up in state . The probability of the occurrence of this subbranch is .

If the driver fails to be matched to another request during the trip, the driver will then end up in . This subbranch occurs with probability .

For completeness of the state transition, the other two subbranches associated with are also displayed in Figure (b). These two subbranches are self-explanatory, so a detailed discussion is omitted.
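The branching structure described above can be enumerated as a small sketch. Every number and label here is an assumption made for illustration: the grid labels, the ride rewards, the trip durations, the seeking cost, and the on-trip matching probabilities are hypothetical, and the sketch assumes that a completed ride earns only the ride reward. A state is written as (grid, time, status indicator), following the tuple convention of the trajectories.

```python
# Every number and grid label below is an illustrative assumption.
P_MATCH = 0.8          # probability of finding an order in the searched grid
SEEK_REWARD = -0.15    # hypothetical fuel cost of one seeking step (Yuan)

# (pickup grid, destination grid, P(ride | match), ride reward, trip minutes,
#  P(new order received during the trip))
RIDES = [
    ("C", "D", 0.50, 12.0, 6, 0.5),
    ("B", "E", 0.25, 8.0, 4, 0.0),
    ("B", "F", 0.25, 9.0, 5, 0.0),
]


def seek_branches(t=0):
    """List (probability, reward, next_state) for seeking grid "B" at time t."""
    # Failure: the driver is still idle in the searched grid one step later.
    branches = [(1 - P_MATCH, SEEK_REWARD, ("B", t + 1, 0))]
    for _pickup, dest, p_ride, reward, minutes, p_rematch in RIDES:
        p = P_MATCH * p_ride
        arrival = t + 1 + minutes
        # Split each ride by whether a new order arrives during the trip.
        for indicator, p_indicator in ((1, p_rematch), (0, 1 - p_rematch)):
            if p_indicator > 0:
                branches.append((p * p_indicator, reward, (dest, arrival, indicator)))
    return branches


branches = seek_branches()
total_probability = sum(p for p, _, _ in branches)
expected_reward = sum(p * r for p, r, _ in branches)
```

Checking that `total_probability` equals one is a quick way to confirm that the branches cover every outcome of the seeking action.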
3 MDP for multiple agents
Note that the deterministic policy derived is only applicable when a single agent follows the policy. Otherwise, local competition can arise among e-hailing drivers, since several drivers may be guided into the same grid. We thus need to address this competition when multiple idling e-hailing drivers are present in the same region within a short time interval. Lin et al. (2018) proposed a contextual multi-agent reinforcement learning approach in which the multi-agent effect is captured by attenuating the reward through averaging. Zhou et al. (2018) employed a simple discounting factor to update the order matching probability when a taxi is guided to a road to which other taxis are already heading. The discounting factor is effective in the sense that it makes the order matching probability smaller for subsequent taxis following the policy. However, it may underestimate the order matching probability, since the effect of the number of orders in each grid is neglected. In other words, besides the effect of the number of drivers guided into a grid, the decrease in the order matching probability is also correlated with the number of orders in that grid. We use an example to show this correlation. Suppose an e-hailing driver is guided into grid with an order matching probability of 50%. We consider two extreme scenarios: (1) only 1 order emerges in grid and (2) infinitely many orders emerge in grid . After one driver is guided into grid , the order matching probability in grid for a second driver should decrease substantially in the first scenario while remaining almost the same in the second scenario.
The rationale underlying this argument is that, compared with a grid with a smaller number of orders, a grid with a larger number of orders can accept more cruising drivers while still maintaining an acceptable level of order matching probability.
To incorporate this correlation, we develop a dynamic adjustment strategy. Before giving the form of the strategy, we list four intuitive observations: (1) The order matching probability for the first driver being guided into grid is simply ; (2) The order matching probability for the driver being guided into grid decreases with , meaning that the order matching probability becomes smaller as more drivers cruise vacantly in grid ; (3) For the driver, the order matching probability increases with the number of orders in grid , meaning that a grid with a larger number of orders is able to accept more cruising drivers; (4) Under the extreme scenario where there are infinitely many orders in grid , the order matching probability stays at regardless of the number of drivers being guided into grid , as long as that number is finite. Based on these four observations, the order matching probability of the driver in grid can be expressed as
(3.1) 
where is the number of orders in grid and is a parameter to be determined.
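To make the four observations concrete, the sketch below uses a stand-in functional form, an exponential attenuation in the ratio of earlier drivers to orders. This form is an assumption chosen only because it satisfies all four observations. It is not claimed to be Equation (3.1), and the function name `adjusted_match_probability` and the parameter `alpha` are hypothetical.

```python
import math


def adjusted_match_probability(p_first, k, n_orders, alpha=1.0):
    """Illustrative order matching probability for the k-th guided driver.

    Assumed form (not the paper's Equation (3.1)):
        p_first * exp(-alpha * (k - 1) / n_orders)
    """
    if k < 1 or n_orders <= 0:
        raise ValueError("k must be at least 1 and n_orders must be positive")
    return p_first * math.exp(-alpha * (k - 1) / n_orders)


# The two extreme scenarios for a second driver with a 50% base probability.
second_driver_one_order = adjusted_match_probability(0.5, 2, 1)
second_driver_infinite_orders = adjusted_match_probability(0.5, 2, math.inf)
```

With this stand-in form, the first driver keeps the base probability, the probability falls as more drivers are guided in, rises with the number of orders, and stays at the base level when the number of orders is infinite.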
To calibrate the strategy for the driver, the order matching probability is required. Note that