I Introduction
With the development of smart devices and large-scale data processing technology, most ride-hailing fleet networks (e.g., Uber, Lyft, and taxi services) can now track vehicles' GPS locations and passengers' pickup requests in real time. This data can then be used to predict future passenger demand and vehicle mobility patterns, reducing passengers' waiting times by proactively dispatching vehicles to predicted future pickup locations [1].
Proactive taxi dispatch over a large city (we use "taxi" and "vehicle" interchangeably in this work) poses significant coordination and uncertainty challenges: it requires real-time decision making over uncertain future demand for thousands of drivers competing to service pickup requests. Moreover, individual drivers may have an incentive to deviate from coordinated solutions, e.g., if the globally optimal coordinated solution requires them to drive a long distance. Solving these challenges simultaneously is difficult: computing a coordinated dispatch solution for thousands of vehicles takes time, exacerbating the uncertainty challenge of optimizing over rapidly changing passenger demand. Even evaluating possible solutions is difficult due to the many sources of future uncertainty (e.g., passenger demand, vehicle trip times), which are hard to model. Yet realistic models are needed to assess the tradeoffs between multiple, possibly conflicting objectives like minimizing the passenger waiting time, the number of unserved requests, and vehicles' idle cruising time. For instance, vehicles may need to drive long distances to the locations of predicted pickup requests, increasing their idle cruising time in order to reduce the number of unserved requests. Thus, in this work we answer two major research questions:

Can a distributed dispatch approach that does not rely on system models outperform a coordinated approach?

What are the performance tradeoffs of these approaches in a realistic environment with uncertain future demand, vehicle trip times, and driving routes?
I-A Related Work
Traditional taxi networks dispatch taxis by having individual drivers look for passengers hailing vehicles on the street. Digitizing these systems allows drivers to view passenger demands through a mobile application and move to regions of higher demand, reducing passenger waiting times. However, such apps still rely on drivers’ human intuition; they do not show future demand, preventing drivers from proactively heading to locations where future pickups are likely. Our goal is to develop optimized dispatch algorithms that do not rely on human intuition and account for likely future demands.
Most previous work on fleet management addresses prediction challenges with a model-based approach, which first models pickup request locations, vehicle travel times, etc., and then optimally dispatches vehicles given these models. Indeed, vehicle routing from a central depot is a classical operations research problem [2, 3, 4]. Recent studies have taken advantage of real-time taxi information to fit system models and minimize passengers' waiting times and vehicle cruising times [5, 6, 7, 8]. For instance, Miao et al. [1, 9] designed a Receding Horizon Control (RHC) framework, which incorporates a demand/supply model and real-time GPS location and occupancy information. Both studies show a reduction in the total idle distance through extensive trace-driven analysis. Others have proposed matching algorithms [10] and rebalancing methods for autonomous vehicles [11], considering both global service fairness and future costs.
Though the model-based approaches considered in these works can improve system performance, they are inherently limited by pre-specified system models [12]. Such specification may be particularly restrictive in a highly dynamic environment like fleet management, where components like trip times and the actual routes vehicles should take must be continually updated based on historical information.
In this work, we introduce MOVI (Model-free Optimization of Vehicle dIspatching), the first model-free approach to fleet management. MOVI uses a reinforcement learning technique called deep Q-network (DQN) [13, 14] that focuses on finding the optimal actions rather than accurately modeling the system. DQNs' known strengths for systems with a large number of input variables allow them to solve the uncertainty challenge presented by fleet management, but they exacerbate the coordination challenge: the complexity of the DQN solution grows exponentially with the number of dispatch possibilities, which in our scenario can be very large given the thousands of taxi vehicles in a city. Indeed, most model-free approaches would face this challenge, due to their lack of a model to guide the search through dispatch possibilities. Thus, we take a distributed approach in which each vehicle solves its own DQN problem, without coordination. We introduce a new DQN training method to ensure fast training at each vehicle. Prior studies have taken a similar vehicle-centric approach by providing route recommendations that aim to maximize individual drivers' profits [15, 16] or by modeling individual driver behavior [17]. We show that a distributed DQN decision framework outperforms a model-based centralized dispatch framework, indicating that model-free approaches can add significant value to fleet management and that there may be limited value to a coordinated vehicle dispatch approach.
I-B Our Contributions
In this paper, we focus on modern fleet networks that can collect vehicles' GPS locations and occupancy status in real time and receive pickup requests from passengers over the Internet at a cloud-based dispatch center. Our goal is to optimally direct a fleet of taxi vehicles to different locations in a city so as to minimize passengers' wait times and vehicles' idle driving costs. Our contributions are as follows:

To the best of our knowledge, MOVI is the first model-free approach designed for a large-scale taxi dispatch problem. To ensure scalability, we use a distributed DQN with a streamlined training algorithm.

To evaluate our model-free, distributed DQN approach, we formulate a baseline model-based, centralized RHC policy based on a linear program, integrating predicted demands, trip times, and fleet system dynamics.

We design and build MOVI as a large-scale realistic fleet simulator based on 15.6 million New York City taxi records and OpenStreetMap road data [18, 19]. MOVI uses a modular architecture that ensures policy-agnostic dispatch responses from the simulated environment, allowing us to fairly compare our RHC and DQN policies.

Despite relying on individual vehicles' decisions, our DQN approach reduces the average reject rate by 76% compared to the results without dispatch and by 20% compared to the model-based RHC approach in our simulator. Moreover, DQN leads to a higher minimum vehicle utilization rate, indicating that drivers have more incentive to follow its policies.
A DQN-based dispatch framework not only outperforms RHC, but also offers significant practical benefits, e.g., better scalability to large numbers of drivers. We formally define the taxi dispatch problem in Section II before introducing our RHC and DQN policies in Sections III and IV, respectively. We then present our fleet simulator in Section V and our results in Section VI. We finally conclude the paper in Section VII.
II Problem Definition
We assume the ride service consists of a dispatch center, a large number of geographically distributed vehicles, and passengers with a mobile ride request application. Figure 1 illustrates this framework. The dispatch center tracks each vehicle's real-time GPS location and availability status and all passenger pickup requests. It uses this information to proactively dispatch vehicles to locations where it predicts future pickups will be requested, and to match vehicles to incoming pickup requests. We focus our optimization on policies for proactive dispatching, as shown in Figure 1, rather than vehicle matching. In this section, we formulate the proactive dispatch problem using the notation summarized in Table I.
We view the dispatch center as an agent that interacts with its external environment through a sequence of observations, actions, and rewards. We divide the geographical service area into M regions and consider timeslots of length τ indexed by t, where t_0 is the current timeslot. The number of pickup requests at the ith region within time slot t is then denoted by d_{t,i}, and the number of available vehicles in this region at the beginning of time slot t is denoted by v_{t,i}. We also define b_{t,i} as the number of vehicles that are occupied at time t_0 but will drop off passengers and become idle in the ith region in time slot t. To predict the future given a set of dispatch actions, we use x_{t,n} to denote the current location, occupied/idle status, and destination of each vehicle n available at time t for the dispatch center. By combining this data, we can predict V, a matrix that gives the number of vehicles available in each region from time t_0 to time t_0 + T, given the dispatch actions. Similarly, we define the future demand d. The state of the external environment at time t is then s_t = (X_t, V, d), where X_t collects the vehicle states x_{t,n}.
At each time step t, the agent receives some representation of the environment's state s_t and reward r_t. It then takes action a_t to dispatch vehicles to the different regions so as to maximize the expected future reward:

E[ Σ_{k=0}^{T} γ^k r_{t+k} ],   (1)
where γ represents a time discount rate. The action a_t routes idle (i.e., unoccupied) vehicles, the set of which we denote by 𝒱_t, to different regions. We formally define s_t, a_t, and r_t for each policy in Sections III and IV. To define r_t, we wish to minimize three performance criteria: the number of service rejects, the passenger waiting time, and the idle cruising time. A reject is a ride request that could not be served within a given amount of time because no vehicle was available near the customer. The waiting time is defined as the time between a passenger's placing a pickup request and the matched driver picking up the passenger; even if a request is not rejected, passengers would prefer to be picked up sooner rather than later. Finally, the idle cruising time is the time in which a taxi is unoccupied and therefore not generating revenue, while still incurring costs like gasoline and wear on the vehicle.
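As a small illustration of the objective in (1), the discounted return can be computed directly from a reward sequence. This is a sketch for intuition only; the function name and inputs are ours, not part of the paper's implementation.

```python
def discounted_return(rewards, gamma):
    """Expected future reward from Eq. (1): sum over k of gamma^k * r_{t+k}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

With gamma = 0.5, three unit rewards yield 1 + 0.5 + 0.25 = 1.75, showing how the discount rate weights near-term rewards more heavily.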
In the next two sections, we develop a baseline Receding Horizon Control (RHC) policy and a Deep QNetwork (DQN) policy to solve this dispatch problem.
Parameter  Description

N  the number of vehicles
M  the number of regions
γ  time discount rate
τ  step size
T  maximum time steps
s_t  state of the environment at the beginning of t
a_t  action taken at the beginning of t (dispatch order)
r_t  reward gained at the beginning of t
x_{t,n}  nth vehicle's state at the beginning of t
v_t  number of idle vehicles in each region at time slot t
b_{t,i}  number of occupied vehicles at time t_0 that become idle in region i at time slot t
d_t  number of requests in each region at time slot t
d̂_t  number of predicted requests in each region at time slot t
u_{t,ij}  number of vehicles to be dispatched from region i to region j at time slot t
η_{t,ij}  expected travel time between regions i and j at time slot t
P_t(j|i)  probability distribution of the destination region j given the origin region i at time slot t
β  cost of a reject
θ  network parameters in the Q-network Q(s, a; θ)
θ^-  network parameters in the target network Q(s, a; θ^-)
δ(ℓ)  demand-supply distribution mismatch at location ℓ
III RHC Policy Baseline
In the RHC formulation, we define our action variables in Section II to be a_t = (u_{t,ij}), where each u_{t,ij} is the number of vehicles dispatched within time slot t from the ith to the jth region. We wish to choose the u_{t,ij} so as to minimize a weighted sum of the number of rejects and the vehicles' idle cruising time, defining the reward r_t as the negative of this sum:

r_t = -( β Σ_i [d_{t,i} - v_{t,i}]^+ + Σ_{i,j} η_{t,ij} u_{t,ij} ).   (2)
The first term in this objective, Σ_i [d_{t,i} - v_{t,i}]^+, represents the difference between taxi demand d_{t,i} and supply v_{t,i} at each region i within time slot t, where [x]^+ = max(x, 0). Demand that cannot be served by these resources is deemed rejected (this definition can easily be extended by summing over multiple time slots in the rejection term to allow for greater waiting times before rejection). The second term in (2) corresponds to the idle vehicle cruising cost, where η_{t,ij} is the expected travel time from the ith to the jth region. Here β weights the rejection cost relative to the idle cruising time.
To find r_t in terms of the action variables u_{t,ij}, we find the future number of available vehicles:
Lemma 1
The number of idle vehicles in each region j in time slot t + 1 is:

v_{t+1,j} = [v_{t,j} - d_{t,j}]^+ + Σ_i (u_{t,ij} - u_{t,ji}) + b_{t+1,j} + Σ_{(t',i)} min(v_{t',i}, d_{t',i}) P_{t'}(j|i),   (3)

where the final sum is over region-time pairs (t', i) whose expected trips end in region j during time slot t + 1.
Here the first term corresponds to "leftover" vehicles from time slot t, and the second term to the net number of idle vehicles dispatched to region j at time t, i.e., right before the start of time slot t + 1. (For simplicity, we assume that dispatched vehicles are not assigned to any customers while traveling and that they always reach their destination regions in the next time slot, as we specify in (4); extending this definition still results in a linear optimization problem as in (4).) The last two terms represent the vehicles that come into region j at time t + 1: the term b_{t+1,j} corresponds to occupied vehicles at time t_0 that will drop off their passengers in time slot t + 1. The final term corresponds to currently idle vehicles that will serve customers in the future and drop them off in the jth region within time slot t + 1. To derive this term, we sum over all regions i and times t' for which the expected travel time to region j places them in region j at time t + 1. The number of these trips given d_{t',i} and v_{t',i} is then min(d_{t',i}, v_{t',i}) P_{t'}(j|i), where P_{t'}(j|i) is the fraction of trips that start at time t' in region i and end in region j.
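The vehicle dynamics above can be sketched directly in code. This is an illustrative one-step implementation under our own naming; for brevity, the drop-off term from currently idle vehicles is collapsed to a single earlier slot rather than a sum over all (t', i) pairs.

```python
def next_idle_vehicles(v, d, u, b_next, trip_frac):
    """One step of the vehicle dynamics in Lemma 1 for M regions.

    v[j]: idle vehicles in region j this slot
    d[j]: requests in region j this slot
    u[i][j]: vehicles dispatched from region i to region j
    b_next[j]: occupied vehicles dropping off (becoming idle) in j next slot
    trip_frac[i][j]: fraction of this slot's served trips from i ending in j
    """
    M = len(v)
    v_next = []
    for j in range(M):
        leftover = max(v[j] - d[j], 0)                       # unused idle vehicles
        net_dispatch = (sum(u[i][j] for i in range(M))
                        - sum(u[j][i] for i in range(M)))    # dispatched in minus out
        dropoffs = b_next[j] + sum(min(v[i], d[i]) * trip_frac[i][j]
                                   for i in range(M))        # passengers dropped off in j
        v_next.append(leftover + net_dispatch + dropoffs)
    return v_next
```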
Assuming the trip times η_{t,ij} and destination distributions P_t(j|i) are known, we choose the dispatch actions u_{t,ij} so as to maximize the expected reward over the horizon T:
Proposition 1
The optimal RHC policy solves the linear optimization problem
maximize over u_{t,ij} ≥ 0:  Σ_{t=t_0}^{t_0+T} r_t   (4)
subject to  Σ_j u_{t,ij} ≤ v_{t,i} for all i, t;  u_{t,ij} = 0 if η_{t,ij} > τ.
The first constraint in (4) ensures that the total number of vehicles dispatched from the ith region does not exceed the number of idle vehicles v_{t,i} in that region. The second constraint ensures that we do not dispatch vehicles to regions with travel times that exceed τ, ensuring that all dispatch movement completes within a time interval; as noted above, this constraint may be relaxed without changing the linearity of the optimization problem. Using the definition of r_t in (2) and the vehicle dynamics (3), we see by inspection that (4) can be written as a linear optimization problem. For simplicity, we assume that the u_{t,ij} are continuous variables, as there are generally a large number of taxis to be dispatched; we can then solve (4) efficiently with known linear programming methods (we show in our simulations that even with this approximation, the RHC policy yields significant performance improvement). We retain only the first action a_{t_0} to execute now, updating the future dispatch actions by re-solving (4) in each future timeslot as new information arrives.
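To make the objective and constraints concrete, the following sketch evaluates the one-slot RHC cost (the negative of the reward in (2)) and brute-forces a tiny two-region instance in place of a linear-program solver. The function names and the exhaustive search are ours, for illustration only; the actual system would solve (4) with an LP solver.

```python
def rhc_cost(u, v, d, eta, beta):
    """One-slot RHC cost (negative reward of Eq. (2)): weighted rejects plus
    idle cruising time. u[i][j] is the number of vehicles moved from i to j."""
    M = len(v)
    supply = [v[j] - sum(u[j]) + sum(u[i][j] for i in range(M)) for j in range(M)]
    rejects = sum(max(d[j] - supply[j], 0) for j in range(M))
    cruising = sum(eta[i][j] * u[i][j] for i in range(M) for j in range(M))
    return beta * rejects + cruising

def best_dispatch(v, d, eta, beta):
    """Enumerate all integer dispatch plans for a 2-region instance,
    respecting the supply constraint sum_j u[i][j] <= v[i]."""
    best, best_u = None, None
    for u01 in range(v[0] + 1):
        for u10 in range(v[1] + 1):
            u = [[0, u01], [u10, 0]]
            c = rhc_cost(u, v, d, eta, beta)
            if best is None or c < best:
                best, best_u = c, u
    return best_u, best
```

With two idle vehicles in region 0, two requests in region 1, and a high reject penalty, the minimizer moves both vehicles to region 1, trading a small cruising cost against rejections.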
Algorithm 1 presents the RHC dispatch algorithm using Proposition 1. In addition to solving (4), the algorithm predicts the input trip times η_{t,ij} and destination distributions P_t(j|i) from historical trip data (cf. Section V). It then assigns specific idle vehicles to fine-grained dispatch locations within each region, given the number of vehicles u_{t,ij} to be dispatched to each region. For computational efficiency, we specify vehicle locations in a greedy manner. We define the set of locations within the ith region as

L_i = { ℓ : location ℓ lies in the ith region },   (5)

where v_ℓ and d_ℓ represent the number of available vehicles and requests at location ℓ, respectively. The demand-supply distribution mismatch at location ℓ is then given by:

δ(ℓ) = v_ℓ / Σ_{ℓ'∈L_i} v_{ℓ'} - d_ℓ / Σ_{ℓ'∈L_i} d_{ℓ'}.   (6)
For each dispatch order u_{t_0,ij}, we send vehicles from locations with a greater mismatch, i.e., higher δ(ℓ), to those with a lower one.
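A sketch of this greedy location selection follows. We assume here that the mismatch compares a location's share of idle supply against its share of demand, so that locations with the largest surplus are drained first; the function names and the dictionary-based representation are ours.

```python
def location_mismatch(vehicles, requests):
    """Assumed form of the demand-supply distribution mismatch delta(l):
    the location's share of idle supply minus its share of demand."""
    total_v, total_d = sum(vehicles.values()), sum(requests.values())
    return {loc: vehicles[loc] / total_v - requests.get(loc, 0) / total_d
            for loc in vehicles}

def pick_sources(vehicles, requests, k):
    """Greedily pick k source locations with the largest surplus (highest delta)."""
    delta = location_mismatch(vehicles, requests)
    return sorted(delta, key=lambda loc: delta[loc], reverse=True)[:k]
```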
IV Distributed DQN Policy
Our DQN policy learns the optimal dispatch actions for individual vehicles. To do so, we suppose that all idle vehicles sequentially decide where to go within a time slot t. Each vehicle's decision accounts for the current locations of all other vehicles, but does not anticipate their future actions. Since drivers have an app that updates with other drivers' actions in real time, and it is unlikely that drivers would make decisions at exactly the same times, they would have access to this knowledge. We can thus express the DQN reward for each vehicle n as r_{t,n}, the weighted sum of the number of rides e_{t,n} that the nth vehicle picks up at time t and the total dispatch time h_{t,n}, analogous to the RHC reward (2). Rewards are accumulated over the action update cycle, or minimum time between dispatches sent to a given vehicle. The action a_{t,n} represents the region to which the nth vehicle should head. We limit the action space to regions reachable within the dispatch cycle, similar to the RHC. Note that this reward is not an explicitly specified function of (s_t, a_{t,n}): DQN's model-free approach means that the exact relationship between (s_t, a_{t,n}) and r_{t,n} will be learned by the DQN algorithm.
We define the optimal action-value function Q*(s, a) for vehicle n as the maximum expected return achievable by any policy π:

Q*(s, a) = max_π E[ Σ_{k≥0} γ^k r_{t+k,n} | s_t = s, a_{t,n} = a, π ],   (7)
which satisfies the Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ],   (8)
where E_{s'} denotes the expectation with respect to the environment after time t. Instead of using the full representation of the state s_t, we approximate Q*(s, a) with a neural network Q(s, a; θ). We use θ to denote the weights of this Q-network, which can be trained by updating the parameters θ_i at each iteration i to minimize the following loss function:

L_i(θ_i) = E[ ( r + γ max_{a'} Q(s', a'; θ_i^-) - Q(s, a; θ_i) )^2 ].   (9)
This function represents the mean-squared error in the Bellman equation, where the optimal target values r + γ max_{a'} Q*(s', a') are substituted with approximate target values r + γ max_{a'} Q(s', a'; θ_i^-), using parameters θ_i^- from some previous iteration.
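For a single transition, the target and loss in (9) reduce to a few lines. This sketch treats the Q-values as plain numbers rather than network outputs; the helper names are ours.

```python
def td_target(reward, gamma, next_q_values):
    """Approximate target value r + gamma * max_a' Q(s', a'; theta^-)."""
    return reward + gamma * max(next_q_values)

def td_loss(q_value, reward, gamma, next_q_values):
    """Squared Bellman error from Eq. (9) for a single transition."""
    return (td_target(reward, gamma, next_q_values) - q_value) ** 2
```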
The dispatch algorithm for the DQN policy is shown in Algorithm 2. The input and output of the DQN policy are the same as for the RHC policy (Algorithm 1). An action a_{t,n} for each vehicle is selected by taking the argmax of the Q-network output. Whenever the algorithm adds a dispatch order to the solution, we update s_t according to the selected action. This update enables subsequent vehicles to take other vehicles' actions into account; note, however, that decisions are still made myopically with respect to possible future decisions taken by other vehicles, limiting coordination between vehicles. As in the RHC algorithm, after determining dispatched regions, the DQN policy finds specific dispatch locations in a greedy manner using the demand-supply mismatch δ(ℓ).
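The sequential decision loop above can be sketched as follows. The linear "demand minus supply" score is a stand-in for the trained Q-network, and the function name is ours; the point is that the shared state is updated after each vehicle's argmax, so later vehicles see earlier decisions.

```python
def dispatch_sequentially(idle_vehicles, demand, supply):
    """Sketch of Algorithm 2: each idle vehicle greedily picks the region with
    the highest (stub) Q-value, and the shared state is updated so that
    subsequent vehicles account for earlier dispatch orders."""
    supply = list(supply)
    orders = []
    for vehicle in idle_vehicles:
        q_values = [demand[j] - supply[j] for j in range(len(demand))]
        action = max(range(len(q_values)), key=q_values.__getitem__)  # argmax
        orders.append((vehicle, action))
        supply[action] += 1          # update the state seen by the next vehicle
    return orders
```

With demand [2.0, 1.5] and no initial supply, the first vehicle heads to region 0; the second, seeing region 0 already covered, heads to region 1.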
V MOVI Fleet Simulator Design
To realistically evaluate our RHC and DQN policies, we design and implement MOVI as a taxi fleet simulator based on 15.6 million taxi trip records from New York City [18]. We used Python and TensorFlow [20] for the implementation. We base MOVI's implementation on NYC Taxi and Limousine Commission trip records from May and June 2016, including the pickup and drop-off dates/times, locations, and travel distances for each trip [18]. While each pickup location in this dataset represents where a passenger hailed a taxi on the street, we assume the distribution of demand is similar when passengers use a mobile application. We use 12.8 million trip records from May 2016 to train the simulator and 2.8 million trip records from June 2016 for testing.
We extract trip records in New York City within the area shown in Figure 2's heat map, which covers more than 95% of trips in the dataset after removing records with outliers. The colored zones in the figure represent the total number of ride requests in the training data. The weekly numbers of ride requests for the training and test datasets are shown in Figure 3, indicating that the demand curves in both datasets exhibit the same daily periodicity, with a dip in demand over the weekend. In the discussion below, we outline the architecture of the simulator and then our RHC and DQN implementations. More details are given in [21].
V-A A Modular Architecture
Figure 4 presents MOVI's modular architecture: to ensure a fair comparison between different dispatch policies, MOVI does not rely on the DQN policy. Instead, the dispatch policy is a separate module that does not affect the other simulator modules, which simulate policy responses in the surrounding environment. Thus, the simulated responses to dispatch decisions are policy-agnostic.
MOVI is based around the fleet object, which maintains the states of all vehicles at all times. In every time step, all vehicles update their states according to their matching and dispatch assignments. We discretize the city into square grid locations, which are later grouped into regions to compute the RHC and DQN dispatch policies. As detailed in Algorithm 3, MOVI first initializes vehicles and generates ride requests based on the real trip records. The agent then computes the actions a_t, using either the RHC or DQN policy, and after the vehicles have gone to these locations, the dispatcher matches appropriate idle vehicles to requesting customers. When the agent sends a dispatch order to the vehicles, MOVI creates an estimated trajectory to the dispatched location based on the shortest path in the road network graph, and the vehicles move to the dispatched locations within the trip time given by our ETA model. If there are no available vehicles in the customer's region, the ride request is rejected and disappears.
Road Network Graph. We construct a directed graph to model the road network in the service area from OpenStreetMap data [19]. Whenever a vehicle is dispatched from an origin to a destination, the simulator first finds the road edges closest to the origin and destination coordinates and then conducts an A* search for the shortest path between them.
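An A* search of the kind used here can be sketched compactly with a Euclidean heuristic, which is admissible when edge costs are at least the straight-line distance. This is an illustrative implementation with our own data representation, not MOVI's actual routing code.

```python
import heapq

def a_star(graph, coords, start, goal):
    """A* shortest path on a road graph; graph[u] is a list of (v, edge_cost)
    pairs, coords[u] gives (x, y) for the Euclidean distance heuristic."""
    def h(u):
        (x1, y1), (x2, y2) = coords[u], coords[goal]
        return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

    frontier = [(h(start), 0.0, start, [start])]   # (f, g, node, path)
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue                               # already expanded more cheaply
        best_g[node] = g
        for nxt, cost in graph.get(node, []):
            heapq.heappush(frontier, (g + cost + h(nxt), g + cost, nxt, path + [nxt]))
    return None, float('inf')
```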
ETA (Estimated Time of Arrival) Model. To estimate the trip time for every dispatch at time t from an origin location to a destination location, we trained a multilayer perceptron. The input feature vector consists of the sine and cosine of the day of the week and hour of the day, the pickup latitude and longitude, the drop-off latitude and longitude, and the trip distance. We use a random 70% of the trip records in the training dataset to train the perceptron and the remaining 30% for validation. With the trained model, the root-mean-square errors (RMSEs) for the training and validation datasets are 4.740 and 4.739 minutes, respectively.
Matching Algorithm. When a pickup request arrives, we assign it to the closest available vehicle. If there are no idle vehicles within five kilometers of the pickup location, the request is rejected. An assigned vehicle heads towards the pickup location with the trip time predicted by the ETA model. After the pickup, the vehicle drives to the drop-off location within the trip time recorded in the actual trip record.
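The matching rule can be sketched as follows. For simplicity, this sketch measures straight-line distance on (x, y) positions in kilometers, whereas the simulator would use road distances; names and signature are ours.

```python
def match_request(pickup, idle_vehicles, max_km=5.0):
    """Assign a pickup request to the closest idle vehicle; return None (a
    reject) if no idle vehicle lies within max_km of the pickup location.
    Positions are (x, y) coordinates in kilometers for this sketch."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    if not idle_vehicles:
        return None
    best = min(idle_vehicles, key=lambda vid: dist(pickup, idle_vehicles[vid]))
    return best if dist(pickup, idle_vehicles[best]) <= max_km else None
```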
V-B Optimized Dispatch Policies
The agent in Figure 4 runs either the RHC or DQN algorithm (Algorithms 1 and 2). We detail our implementations of both in this section, after outlining the demand prediction that both algorithms use to represent the environment state.
Demand Prediction.
To predict future demand, we build a small convolutional neural network. The output of the network is a heat map image in which each pixel stands for the predicted number of ride requests in a given region over the next 30 minutes. The network inputs are the actual demand heat maps from the last two time steps; we capture daily periodicity by also including the sine and cosine of the day of the week and hour of the day. The network consists of two hidden layers; configuration details are in [21]. We use thirty-minute timeslots, with the first 70% of timeslots used for training and the last 30% for validation. The RMSEs for the training and validation datasets are 1.047 and 0.980, respectively, i.e., our predicted demand is accurate to within a single request. Figure 5 shows an example of the target and predicted demand heat maps; they are visually identical.
RHC Implementation. Since the number of dispatch variables u_{t,ij} in the RHC optimization grows quadratically with the number of regions, the number of regions significantly affects the computation time. Thus, we use the 226 taxi zones shown in Figure 2 as our dispatch regions. The predicted demand in each zone is calculated by aggregating the outputs of the demand prediction model. We choose the timeslot length τ to reflect the minute-scale runtime of the RHC algorithm. The destination distribution P_t(j|i) given a trip's origin was extracted from the training data using a histogram count of the number of trips between regions for time t's day of the week and hour of the day, thus taking into account cyclical demand patterns (cf. Figure 3).
Streamlined DQN Training. For the DQN policy, we use smaller dispatch regions so as to utilize spatial convolution in training the network. We divide the entire service area into a grid of small square regions. Each vehicle can move at most 7 regions horizontally or vertically from its current region, matching the constraint on vehicle travel times in the RHC optimization problem (4) and resulting in a 15 × 15 map of possible destination regions. We select one minute as the length of each simulation step and a horizon of T steps, retraining the Q-network after each simulation step. Our technical report [21] has more details on the Q-network input features and training.
Figure 6 presents our Q-network's architecture. We use a convolutional neural network with a 15 × 15 output map corresponding to the estimated Q-value for each possible action, given the input state. The input features are summarized in Table II. In addition to the predicted ride requests and future available vehicles, we include environment features like the vehicle location, time of day, and day of week. We use three hidden layers and one output layer.
Type  Feature  # of planes  Description

Main  Demand  1  Predicted number of ride requests in each region over the next 30 minutes
Main  Supply  3  Expected number of available vehicles in each region in 0, 15, and 30 minutes
Main  Idle  1  Number of idle vehicles in each region
Main*  Cropped  5  Main features with (23, 23) cropping applied
Main*  Average  5  Main features with (15, 15) average pooling, (1, 1) stride
Main*  Double Average  5  Main features with (30, 30) average pooling, (1, 1) stride
Auxiliary  Day of week  2  Constant planes filled with sin and cos of the day of the week
Auxiliary  Hour of day  2  Constant planes filled with sin and cos of the hour of the day
Auxiliary  Position  1  A constant plane filled with 0 except for 1 at the vehicle's current position
Auxiliary  Coordinate  2  Constant planes filled with the vehicle's current normalized coordinates
Auxiliary  Move Coordinate  2  Normalized coordinates of each candidate move
Auxiliary  Distance  1  Normalized distance from the center to each candidate move
Auxiliary  Sensibleness  1  Whether a move to each candidate position is legal
Reinforcement learning is known to be unstable when a nonlinear approximator like a neural network is used to represent the Q function. This instability is mainly due to correlations in the sequence of experiences and between the action-values Q(s, a; θ) and the target values r + γ max_{a'} Q(s', a'; θ^-). We use experience replay to remove these correlations and the Double DQN algorithm to prevent overestimation [22, 23], and train the Q-network with the RMSProp algorithm.
We further streamline this training procedure to handle one of the biggest challenges in applying DQN to a fleet of vehicles: as vehicles execute policies, the state from the perspective of other vehicles changes, disrupting their Q-network training. Thus, we introduce a new parameter, the probability that a vehicle moves in each simulation step, increasing linearly from 0.3 to 1.0 over the first 5000 training steps. As a result, only 30% of the vehicles move in the first step, which is roughly the fraction of vehicles taking actions in the optimal policy. We trained the Q-network for a total of 20,000 steps, corresponding to two weeks of data, and used a replay memory of the 10,000 most recent transitions. As illustrated in Figure 7, our method achieves stable loss and maximum Q-values over time. Once the average max Q-value reaches 100, it starts decreasing: training in the previous time steps has improved taxis' Q-networks, allowing them to compete more for passengers and decreasing the average return an individual taxi can gain.
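The move-probability ramp described above is a simple linear schedule; a minimal sketch (the function name and keyword defaults are ours, with the paper's 0.3-to-1.0 ramp over 5000 steps as stated):

```python
def move_probability(step, start=0.3, end=1.0, ramp_steps=5000):
    """Probability that a vehicle moves at a given training step, increasing
    linearly from `start` to `end` over the first `ramp_steps` steps."""
    if step >= ramp_steps:
        return end
    return start + (end - start) * step / ramp_steps
```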
VI Results and Discussion
For our evaluation, we use 2.8 million trip records from Monday, 6/6/2016 to Sunday, 6/12/2016. We assume that a day starts at 4 a.m. and ends at 4 a.m. on the following day, e.g., "Monday" is defined as 4 a.m. on Monday 6/6/2016 to 4 a.m. on Tuesday 6/7/2016. For each day, we conduct a dispatch simulation with 8000 taxi vehicles, whose initial locations are chosen from the pickup locations of the first 8000 ride requests in our data. We initialize the environment by first running the simulation for 30 minutes without dispatching.
For each day of the week, we compute three performance metrics: the average reject rate, wait time, and idle cruising time. The average reject rate is defined as the number of rejected requests divided by the number of total requests in each day, and the wait time is defined as the average time between the moment an (unrejected) pickup request originates and the moment the passenger is picked up. We define the idle cruising time as the total driving time without passengers divided by the number of accepted requests. We also track the total trip time with passengers for each vehicle to compute the utilization rate, or fraction of time for which a given vehicle is occupied.
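The metrics above can be computed from per-day aggregates as in the following sketch. The function and field names are ours, and for simplicity the sketch treats a vehicle's time as split between idle and occupied periods only.

```python
def fleet_metrics(total_requests, rejected, idle_seconds, occupied_seconds):
    """Compute the evaluation metrics described above from one day's aggregates:
    reject rate, idle cruising time per accepted request, and utilization."""
    accepted = total_requests - rejected
    return {
        'reject_rate': rejected / total_requests,
        'idle_cruise_per_request': idle_seconds / accepted if accepted else float('inf'),
        'utilization': occupied_seconds / (occupied_seconds + idle_seconds),
    }
```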
VI-A Performance Results
We show the results of each policy before comparing them.
RHC Policy. We conduct a simulation with the test dataset using the RHC policy and compute each metric's average value over a week. Figure 7(a) shows the results with different reject penalties β from 0 to 40. While all three metrics improve as β increases from 10 to 20, they take nearly the same value for larger β. This result indicates that in practice, our three performance criteria do not conflict, which is surprising: we would expect the idle cruising time to increase as the reject rate decreases, due to vehicles traveling longer distances to pick up more passengers. The result suggests that most vehicles are close to passenger demand locations, yet many requests are not served quickly because drivers do not realize this proximity. The floor on the reject rate as β increases, however, indicates that some requests are simply too far from any idle vehicles; our constraint on idle cruising time in (4) then prevents any vehicles from traveling to their locations.
We also investigate the importance of the maximum horizon T. In the technical report [21], we show that the performance does not change significantly with T, indicating that there is limited value to coordinating vehicle locations too far into the future, perhaps due to limited ability to predict future demand.
DQN Policy. As for RHC, we evaluate our DQN policy in a simulation calculating each metric's average value over a week. Figure 7(b) shows the results with different reject penalties β from 0 to 20. As seen in the figure, the reject rate improves as β increases until it levels off, while the idle cruising time increases modestly for larger β. As for the RHC policy, all metric values level off as β increases, indicating that there is a nonzero floor for the reject rate.
Performance Comparison.
We compare both dispatch policies to a simulation without dispatch. We summarize the results of no dispatch (NO), DQN with its chosen reject penalty β (DQN), and RHC with its chosen reject penalty β (RHC) in Figure 9. Our results indicate that DQN outperforms RHC, but both significantly outperform no dispatching, indicating the value of optimized dispatch algorithms. They also suggest that DQN's better adaptability compensates for RHC's better coordination between vehicles.
In every day of the week, the RHC and DQN policies significantly reduce the reject rate and wait time compared to no dispatching, while the idle cruising time stays almost the same. The reject rate and average wait time of the DQN policy are reduced by 76% and 34%, respectively, compared with no dispatch, and by 20% and 12% compared with the RHC policy. The idle cruising time of the DQN policy increases by 1.3% compared with that of no dispatch, and by 4.0% compared with that of RHC. Since DQN optimizes individual vehicle rewards, its policies may have individual drivers travel further to pickup requests, even though closer vehicles could also have served those requests.
Figure 10 shows the reject rate, wait time, and idle cruising time with RHC and DQN dispatch and without dispatch on Tuesday. DQN dispatch consistently reduces the reject rate and wait time more than RHC; the technical report [21] shows that this holds for Saturday as well. We note that the greatest reduction in the reject rate occurs at the time of highest demand, around 8pm to midnight (cf. Figure 3). Thus, optimized dispatch policies realize the most benefit at times of high demand. At these times, without dispatching, drivers may not search for the locations of future ride requests, instead simply waiting for a request at their current locations. At these times DQN, but not RHC, drastically reduces passenger wait times, perhaps due to DQN having vehicles drive more to look for pickups. Indeed, the idle cruising times for the DQN policy are slightly higher than those for the RHC policy at these times, which is consistent with the overall results in Figure 9.
We finally show that DQN distributes ride requests between vehicles more evenly than RHC by considering our 8000 vehicles' mean and minimum utilization rates in Figure 11. While the mean utilization rates for the two policies are almost the same, DQN's minimum utilization rate is much greater than RHC's. This smaller variance may stem from the fact that the DQN policy learns the optimal policy for an individual vehicle, so every vehicle tries to take the best actions for itself. The RHC policy, on the other hand, aims to maximize the total reward, forcing some vehicles to take actions that do not benefit themselves but do benefit the system as a whole.
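This comparison of utilization statistics can be sketched as follows; the per-vehicle utilization values are invented solely to mirror the qualitative finding (similar means, but a much higher minimum under DQN):

```python
import statistics

# Illustrative (made-up) per-vehicle utilization rates: the fraction of
# time each vehicle spends serving passengers.
rhc_util = [0.10, 0.45, 0.50, 0.55, 0.90]  # same mean, low minimum
dqn_util = [0.40, 0.45, 0.50, 0.55, 0.60]  # same mean, high minimum

for name, util in [("RHC", rhc_util), ("DQN", dqn_util)]:
    print(f"{name}: mean={statistics.mean(util):.2f}, "
          f"min={min(util):.2f}, stdev={statistics.stdev(util):.2f}")
```

Under these assumed values, both fleets have the same mean utilization, but DQN's minimum is far higher and its spread smaller, matching the pattern in Figure 11.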
VI-B DQN Advantages
Despite the fact that the DQN policy does not make coordinated decisions for idle vehicles, our results show that DQN's reject rate is lower than RHC's on every day of the week. We conjecture that this is due to DQN's much faster dispatch decisions, allowing the dispatch policies to rapidly adapt to the environment state. A neural network forward pass for each vehicle under DQN takes less than a hundred milliseconds, while solving the RHC policy's linear program with tens of thousands of variables is far more expensive, taking from seconds to tens of seconds.
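The cheapness of a per-vehicle decision can be illustrated with a toy forward pass. The layer sizes, input features, and pure-Python implementation below are assumptions for illustration only (a real deployment would use an optimized library such as TensorFlow, as the paper does), but they show why one greedy dispatch decision fits easily within a sub-second budget:

```python
import random
import time

# Hypothetical tiny fully connected Q-network scoring candidate dispatch
# moves for a single vehicle. Sizes are illustrative, not the paper's.
random.seed(0)
IN, HID, ACTIONS = 64, 32, 15
W1 = [[random.gauss(0, 1) for _ in range(HID)] for _ in range(IN)]
W2 = [[random.gauss(0, 1) for _ in range(ACTIONS)] for _ in range(HID)]

def q_forward(state):
    """One forward pass: ReLU hidden layer, then a Q-value per action."""
    hidden = [max(0.0, sum(s * W1[i][j] for i, s in enumerate(state)))
              for j in range(HID)]
    return [sum(h * W2[i][k] for i, h in enumerate(hidden))
            for k in range(ACTIONS)]

state = [random.gauss(0, 1) for _ in range(IN)]
start = time.perf_counter()
q_values = q_forward(state)
action = q_values.index(max(q_values))  # greedy dispatch decision
elapsed_ms = (time.perf_counter() - start) * 1e3
```

Even in unoptimized Python, this decision completes in well under a millisecond; by contrast, a centralized linear program must couple all vehicles' decisions and be re-solved each dispatch cycle.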
To investigate the effect of the dispatch cycle length, we simulate the DQN policy with the same dispatch cycle as the RHC policy. The results are plotted as DQN* in Figure 9 and show that the reject rate is then almost the same as RHC's. DQN's faster, on-demand dispatch thus helps agents adapt to disruptive environmental changes more quickly than RHC, at the expense of centralized cooperation. Even when DQN and RHC use the same dispatch cycle, DQN's freedom from model constraints allows it to compensate for its lack of coordination and perform as well as RHC.
Obeying the DQN policy is generally more beneficial for drivers than obeying the RHC policy, as DQN predicts the best action for each individual vehicle given its current state. Thus, the DQN policy may be more realistic to implement in real-world applications. Indeed, ride-sharing platforms like Uber allow drivers to choose where they go and which pickup requests to accept [24], which may partially explain their success in improving passenger wait times compared to traditional taxi services. Other potential advantages of a DQN approach include the fact that the same network architecture and input features can be used for different service areas; DQN's forward computation time is also independent of the number of dispatch regions, making it suitable for large service areas. In addition, other input features, such as a vehicle's speed and capacity, can easily be taken into account simply by adding them to the network input.
VII Conclusion
In this paper, we propose MOVI, a Deep Q-network (DQN) framework for taxi dispatch that uses value-based function approximation with deep learning models and learns an optimal policy by directly interacting with the environment. Dispatch simulations using taxi trip records in New York City show that DQN policies lead to significantly fewer service rejects and shorter wait times compared to no dispatching, outperforming the RHC policy's centralized coordination. In the future, it will be important to explore different network architectures and other input features, such as the estimated time of each action, to improve DQN's performance and computational efficiency, as well as to establish a theoretical basis for DQN's superiority to RHC. Our work takes a first step in demonstrating the benefits of applying a model-free, practical dispatch solution with state-of-the-art deep reinforcement learning techniques to large-scale taxi dispatch problems.
References
 [1] F. Miao, S. Han, S. Lin, J. A. Stankovic, D. Zhang, S. Munir, H. Huang, T. He, and G. J. Pappas, “Taxi Dispatch With Real-Time Sensing Data in Metropolitan Areas: A Receding Horizon Control Approach,” IEEE Trans. Autom. Sci. Eng., vol. 13, no. 2, pp. 463–478, Apr. 2016.
 [2] G. Laporte, “The vehicle routing problem: An overview of exact and approximate algorithms,” European journal of operational research, vol. 59, no. 3, pp. 345–358, 1992.
 [3] B. L. Golden, E. A. Wasil, J. P. Kelly, and I.-M. Chao, “The impact of metaheuristics on solving the vehicle routing problem: algorithms, problem sets, and computational results,” Fleet management and logistics, pp. 33–56, 1998.
 [4] J.-F. Cordeau, M. Gendreau, and G. Laporte, “A tabu search heuristic for periodic and multi-depot vehicle routing problems,” Networks, vol. 30, no. 2, pp. 105–119, 1997.
 [5] K. T. Seow, N. H. Dang, and D.-H. Lee, “A Collaborative Multiagent Taxi-Dispatch System,” IEEE Trans. Autom. Sci. Eng., vol. 7, no. 3, pp. 607–616, Jul. 2010.
 [6] D. Zhang, T. He, S. Lin, S. Munir, and J. A. Stankovic, “Dmodel: Online Taxicab Demand Model from Big Sensor Data in a Roving Sensor Network,” in IEEE International Congress on Big Data, 2014, pp. 152–159.
 [7] B. Li, D. Zhang, L. Sun, C. Chen, S. Li, G. Qi, and Q. Yang, “Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset,” in IEEE International Conference on Pervasive Computing and Communications Workshops, 2011, pp. 63–68.
 [8] A. Jauhri, C. Joe-Wong, and J. P. Shen, “On the real-time vehicle placement problem,” in NIPS 2017 Workshop on Machine Learning for Intelligent Transportation Systems, 2017.
 [9] F. Miao, S. Han, A. M. Hendawi, M. E. Khalefa, J. A. Stankovic, and G. J. Pappas, “Data-driven distributionally robust vehicle balancing using dynamic region partitions,” in Proc. of ACM ICCPS, 2017, pp. 261–271.
 [10] H. Zheng and J. Wu, “Online to Offline Business: Urban Taxi Dispatching with Passenger-Driver Matching Stability,” in Proc. of IEEE ICDCS, 2017.
 [11] R. Zhang and M. Pavone, “Control of robotic mobility-on-demand systems: A queueing-theoretical perspective,” Int. J. Rob. Res., vol. 35, no. 1–3, pp. 186–203, Jan. 2016.
 [12] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement Learning: A Survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
 [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
 [14] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
 [15] J. W. Powell, Y. Huang, F. Bastani, and M. Ji, “Towards Reducing Taxicab Cruising Time Using Spatio-Temporal Profitability Maps,” Springer, Berlin, Heidelberg, 2011, pp. 242–260.
 [16] M. Qu, H. Zhu, J. Liu, G. Liu, and H. Xiong, “A cost-effective recommender system for taxi drivers,” in Proc. of ACM KDD, 2014, pp. 45–54.
 [17] B. D. Ziebart, A. L. Maas, A. K. Dey, and J. A. Bagnell, “Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior,” in Proc. of ACM UbiComp, 2008, pp. 322–331.
 [18] New York City Taxi and Limousine Commission. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
 [19] OpenStreetMap https://www.openstreetmap.org
 [20] TensorFlow https://www.tensorflow.org/
 [21] “A ModelFree Approach to Dynamic Fleet Management,” technical report, 2017, https://www.dropbox.com/s/ujqova12lnklgn5/dynamicfleetmanagementTR.pdf?dl=0.
 [22] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
 [23] H. van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning,” Sep. 2015.
 [24] Uber, “How to use the Uber driver app,” 2017, https://www.uber.com/drive/resources/howtousethedriverapp/