Two Phase Q-learning for Bidding-based Vehicle Sharing

by   Yinlam Chow, et al.

We consider one-way vehicle sharing systems where customers can rent a car at one station and drop it off at another. The problem we address is to optimize the distribution of cars, and quality of service, by pricing rentals appropriately. We propose a bidding approach that is inspired by auctions and takes into account the significant uncertainty inherent in the problem data (e.g., pick-up and drop-off locations, time of requests, and duration of trips). Specifically, in contrast to current vehicle sharing systems, the operator does not set prices. Instead, customers submit bids and the operator decides whether or not to rent. The operator can even accept negative bids to motivate drivers to rebalance available cars to unpopular destinations within a city. We model the operator's sequential decision-making problem as a constrained Markov decision problem (CMDP) and propose and rigorously analyze a novel two-phase Q-learning algorithm for its solution. Numerical experiments are presented and discussed.






1 Introduction

One-way vehicle sharing systems represent an increasingly popular mobility paradigm aimed at effectively utilizing idle vehicles, reducing demand for parking spaces, and possibly cutting down the excessive carbon footprint of personal transportation. One-way vehicle sharing systems (also referred to as mobility-on-demand –MOD– systems) consist of a network of parking stations and a fleet of vehicles. A customer arriving at a given station can pick up a vehicle (if available) and drop it off at any other station within the city. Existing vehicle sharing systems include Zipcar [13], Car2Go [27] and Autoshare [25] for one-way car sharing, and Velib [21] and City-bike [8] for one-way bike sharing. Figure 1 shows a typical Toyota i-Road one-way vehicle sharing system [15].

Figure 1: A typical one-way vehicle sharing system that allows different pick-up and drop-off locations [15].

Despite the apparent advantages, one-way vehicle sharing systems present significant operational challenges. Due to the asymmetry of travel patterns within a city, several stations will eventually experience imbalances of vehicle departures and customer arrivals. Stations with low customer demands (e.g., in suburbs) will possess excessive unused vehicles and require a large number of parking spaces, while stations with high demands (e.g., in the city center) will not be able to fulfill most customers’ requests during rush hours.

Literature Review: In general, there are two main methods available in the literature to address demand-supply imbalances in one-way vehicle sharing systems (however, if the vehicles can drive autonomously, additional rebalancing strategies are possible [34]). A first class of methods is to hire crew drivers to periodically relocate vehicles among stations. From a theoretical standpoint, optimal rebalancing of drivers and vehicles has been analyzed in [33], under the framework of queueing networks. In [2], [14], [28], the effectiveness of similar rebalancing strategies is numerically investigated via discrete event simulations. The work in [20] considers a stochastic mixed-integer programming model where the objective is to minimize vehicle relocation cost subject to a probabilistic constraint on the service level. The works in [30] and [18] consider a similar approach. While rebalancing can be carried out quite effectively in this way, these methods substantially increase sunk costs due to the large number of staff drivers that need to be hired (a characterization of the number of drivers is provided in [33]), and may not scale well to large transportation networks.

Alternatively, demand-supply imbalances can be addressed by imposing incentive pricing to vehicle rentals. A typical incentive pricing mechanism is described in [23] and portrayed in Figure 2.

Figure 2: Depiction of an incentive pricing mechanism where rental prices are adjusted based on inventories and customers’ demands [23].

The strategy is to adjust rental prices at each station as a function of current vehicle inventory and customers' requests. The work in [31] proposes a method to optimize vehicle assignments by trip splitting and trip joining, and [19] proposes a dynamic pricing strategy that enables clients of a carpooling system to trade off convenience of a trip (i.e., duration) and cost. Carpooling strategies, however, may not be a scalable solution due to safety, convenience, and sociological reasons. Recently, the work in [6] proposed a bidding mechanism for vehicle rentals where at each station customers place bids and the vehicle sharing company decides which bids to accept. The operator's sequential decision-making problem is posed as a constrained Markov decision problem (CMDP), which is solved exactly, or approximately using an actor-critic method.

A bidding strategy, such as the one proposed in [6], is attractive for several reasons. First, accepted vehicle rental bids instantly reflect current demands and supplies at different stations. Second, by providing on-demand financial rewards for rebalancing vehicles, the rental company can save overhead costs associated with hiring crew drivers and renting extra parking spaces. Third, such a pricing mechanism promotes high vehicle utilization by encouraging extra vehicle rentals to less popular destinations and during non-rush hours.

Contributions: The contribution of this paper is threefold.

  • Leveraging our recent findings in [6], we propose a novel CMDP formulation for the problem of optimizing one-way vehicle sharing systems. The actions in the CMDP model represent vehicle rental decisions as a function of bids placed by customers. The objective is to maximize total revenue from rental assignments, subject to a vehicle utilization constraint. (The vehicle utilization constraint ensures that the assignment policy does not excessively favor short-term rentals. Accordingly, this constraint provides service guarantees for customers in need of long-term rentals. Further details about the practical relevance of this constraint can be found in [5].)

  • We derive a two-phase Bellman optimality condition for CMDPs. This condition shows that CMDPs can be solved exactly by two-phase dynamic programming (DP).

  • We propose a novel sampling-based two-phase learning algorithm for the solution of CMDPs, and show under mild assumptions convergence to an optimal solution.

This paper provides a first step toward designing market-based mechanisms for the operational optimization of one-way vehicle sharing systems. We describe a wealth of open problems at the end of the paper. Furthermore, the results concerning CMDPs are of independent interest and applicable more broadly. Due to space limitations, in this paper we only include the statements of our theoretical results. All proofs can be found in the supplementary material section.

2 Mathematical Model

In this section we present a mathematical model for one-way vehicle sharing systems and then pose a CMDP decision-making problem for its optimization.

2.1 General Model and Problem Data

Assume the vehicle sharing company owns vehicles, indexed , that can be parked at stations, indexed . The company only allows passengers to rent for a maximum of time slots; furthermore, the maximum fare/reward for each rental period is . We consider a discrete time model , where is the time horizon. At time and at each station , customers' destinations, rental durations, and proposed travel fares can be modeled via a multi-variate (three-dimensional) stationary probability distribution with domain . This distribution models the inherent stochastic, time-variant customer demand for car rentals. A practical approach to estimate the demand distribution at each station is to use data-driven methods, for example [10].

At each time and station , there are rental requests, where is modeled as a Poisson random variable with rate (as usually done in the literature; see, for example, [17]). The associated customers' destinations, rental durations, and proposed travel fares are i.i.d. samples drawn from the distribution . Such samples are collected in a random vector , that is:

where the random variables , , and denote, respectively, a customer's destination, rental duration, and proposed fare (the meaning of the indices is clear from the definitions).

For , denote by the number of customers that arrive at time at station and that wish to travel to station . By definition, , where represents the indicator function.
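The demand model above can be sampled with a short routine. The sketch below is purely illustrative: the function names, the duration and fare ranges, and the uniform choice of destinations are assumptions standing in for the paper's (unspecified) demand distribution; only the Poisson request count is from the text.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method: multiply uniform draws until the running product
    # drops below exp(-lam); the number of factors used is Poisson(lam).
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while product > threshold:
        count += 1
        product *= rng.random()
    return count - 1

def sample_requests(lam, destinations, rng):
    # One station, one time slot: a Poisson number of requests, each a
    # (destination, duration, fare) triple; a negative fare is a bid the
    # operator would pay the customer for rebalancing.
    n = sample_poisson(lam, rng)
    return [(rng.choice(destinations),           # destination station
             rng.randint(1, 4),                  # rental duration (slots)
             round(rng.uniform(-5.0, 20.0), 2))  # proposed fare
            for _ in range(n)]
```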

This model captures both concepts of renting and rebalancing. Note that at station , the random price offered by customer , i.e., , can either be positive or negative. When this quantity is positive, it means that the customer is willing to pay fare units for a vehicle, traveling to station in time units. If this quantity is negative, it means that the company is paying fare units to customer to travel to station in time units to fulfill rebalancing needs.

Note that at each time instant and at each station, there can potentially be more rental requests than vehicles available. The strategy is to rank all incoming rental requests by a price-to-travel time function and assign vehicles according to such ranking. Specifically, for destination , we define the price-to-travel time function as

By assigning vehicles according to the price-to-travel time function (in a descending order), one favors customers with high rental prices and short travel times (when renting), and drivers with low rewards and short travel times (when rebalancing). (If all vehicles move at similar speeds, one could equivalently consider distances instead of times.)
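The descending assignment rule can be sketched in a few lines; the representation of a request as a (fare, travel time) pair and the function name are illustrative assumptions, not the paper's notation.

```python
# Rank the requests for a given destination by fare per unit travel time
# (descending) and serve as many as there are available vehicles.
def rank_and_assign(requests, num_available):
    """requests: list of (fare, travel_time) pairs; fare < 0 is a
    rebalancing reward the operator would pay. Returns accepted requests."""
    ranked = sorted(requests, key=lambda r: r[0] / r[1], reverse=True)
    return ranked[:num_available]
```

Note that a negative-fare (rebalancing) request with a short travel time outranks one with a long travel time, matching the rule stated above.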

To explicitly model the instantaneous demand-supply imbalance throughout the transportation network, one can refine function by including dependencies on the arrival station and time instant . While this generalization is straightforward, we omit the details in the interest of brevity.

2.2 State Variables

We now proceed to construct a CMDP model for a one-way vehicle sharing system. We consider the following state variables:

  • For , is a counter state.

  • For and , is the destination station at time of the vehicle. Let .

  • For and , is the remaining travel time to reach the destination for the vehicle. Let .

Collectively, the state space is defined as . We let denote the initial state, and the state at time .

2.3 Decision Variables

At each time , in order to maximize expected revenue, the company makes decisions about renting vehicles to customers. Specifically, at each time , the decision variables are:

  • For each station , each vehicle , and , is a binary decision variable that indicates whether vehicle is destined to station at time . Let .

For each station-destination pair , we consider the following constraint to upper bound the number of vehicle dispatches at time :


Intuitively, constraints (1) restrict the number of vehicle dispatches to be at most the number of customer requests. In the special case when , the upper bound is instead of , because one needs to account for the case when a vehicle stays idle at station . Additionally, we consider the following constraints to guarantee well-posedness of the vehicle assignments:


Accordingly, the action space is , and is the action taken at time . Furthermore, define the set of admissible controls at state as , such that .
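The dispatch upper bound in constraints (1) admits a one-line reading, sketched below with illustrative names (stations as identifiers, requests as a count); the "+1" slot for a station's own destination is the idle-vehicle case noted above.

```python
# Sketch of the dispatch bound behind constraints (1): at most one vehicle
# per customer request, plus one extra slot when origin == destination to
# account for vehicles that simply stay idle at their station.
def dispatch_upper_bound(origin, destination, num_requests):
    return num_requests + 1 if origin == destination else num_requests
```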

2.4 State Dynamics

For each time and station-destination pair , let

denote the set of vehicles at location which are allocated to destination . For , denote the rental durations and proposed fares of the highest ranked customers (according to the price-to-travel time function ) as .

The case is special, as the quantity comprises vehicles that either (1) will eventually return to station or (2) stay idle at station . Notice that the definition of does not allow one to distinguish between these two cases. However, in this case, it is not profitable to rent vehicles to customers submitting negative bids. Hence one concludes that the number of vehicles that will eventually return to station should be equal to the number of customers that want to return to station and submit a positive bid. We denote such number as . We similarly define the vector , with the understanding that the first elements correspond to the customers that submit positive bids, while each remaining element is set to .

The state dynamics are then given as follows:

  • For each vehicle such that ,

  • For all other vehicles, for each station-destination pair , and each vehicle , uniformly randomly allocate to the available vehicles, i.e.,

    for all .

As the problem is high-dimensional even for a moderate number of vehicles and stations, and since the state update equations depend in a rather involved fashion on the information vector , the explicit derivation of the state transition probabilities (denoted by ) is impractical. This motivates the Q-learning approach proposed in this paper.

2.5 Revenue and Constraint Cost Functions

Define the immediate reward function as

The total revenue is then given by

From a pure profit maximization standpoint, an operator should favor short-term rental assignments, in order to minimize the opportunity cost of rejecting future customers that might potentially be more profitable. While this strategy optimizes the long-term revenue, it is not fair toward customers that require extended rental periods. To balance total profit with customers' satisfaction, we impose an additional constraint that lower bounds vehicle utilization. Specifically, the constraint cost function captures vehicle utilization at each time instant according to the definition

where represents a threshold for the average utilization rate specified by the system operator. The vehicle utilization constraint is then given by

The objective is then to maximize expected revenue while satisfying the vehicle utilization constraint.
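As a minimal illustration of the utilization constraint (with assumed names and units, since the paper's symbols are not reproduced here), one can check whether the average fraction of the horizon that vehicles spend rented reaches the operator's threshold:

```python
# Hypothetical sketch: average utilization across the fleet must meet a
# threshold chosen by the system operator.
def utilization_ok(busy_slots_per_vehicle, horizon, threshold):
    avg = sum(busy_slots_per_vehicle) / (len(busy_slots_per_vehicle) * horizon)
    return avg >= threshold
```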

2.6 CMDP Formulation

Equipped with the state space , control space , immediate reward , transition probability (implicitly defined above), initial state , and immediate constraint cost function , we pose the problem of controlling a vehicle sharing system as a CMDP. Specifically, note that the problem can be formulated on an infinite horizon by interpreting the set of states (indeed containing a single state) as absorbing, i.e., once , the system enters a termination state, call it END, and stays there with zero reward and constraint cost. From this perspective, represents a deterministic stopping time, and the upper limits in the summations for the reward and cost functions can be replaced with . We also define as the set of transient states. Let be the set of closed-loop, Markovian stationary policies . A policy induces a stationary mass distribution over the realizations of the stochastic process . It is well known that for CMDPs there is no loss of optimality in restricting the attention to policies in (instead, e.g., of also considering history-dependent policies).

Accordingly, in this paper we wish to solve the CMDP problem:

Problem – Solve

subject to

Since, in our problem, the state and action spaces are exponentially large (in terms of the vehicle number and station number ) and the explicit derivation of the state transition probability is intractable, exact solution methods for CMDPs (e.g., [1]) are not applicable.

On the other hand, even if one computes an optimal randomized policy for problem , executing such a policy on a vehicle sharing platform with a simple system architecture may lead to vehicle mis-coordination problems [24]. This motivates us to restrict the admissible policies in problem to be Markovian, stationary, and deterministic. In the next section we will introduce a two-phase Bellman optimality condition for problem , which serves as the theoretical underpinning for the design of an asymptotically optimal two-phase Q-learning algorithm.

3 Two-Phase Dynamic Programming Algorithm

In this section, by leveraging the results in [11], we present a two-phase DP algorithm for problem . As we shall see, the first step is to compute a Q-function that allows one to refine the set of feasible control actions (essentially, we retain only those actions that can guarantee the fulfillment of the constraint in problem ). We then define a Bellman operator (restricted to the refined set of control actions) that allows the computation of an optimal policy. The DP algorithm presented in this section provides the conceptual basis for the two-phase Q-learning algorithm presented in the next section.

3.1 Phase 1: Finding the Feasible Set

In this section we characterize the feasible set for problem using the set of optimal policies from an auxiliary MDP problem, defined as follows:

Problem – Solve


where is the set of closed-loop, Markovian, stationary, and deterministic policies.

The following (trivial) result relates minimizers of problem to feasible solutions of problem .

Let be a Markovian stationary deterministic policy that minimizes problem , that is

Then the solution cost is equal to zero if and only if

Equipped with the above result, one can immediately deduce the following:

  • If the solution cost to problem is strictly larger than zero, i.e.,

    then problem is infeasible.

  • Otherwise, the feasible set of Markovian stationary deterministic policies is given by

In order to characterize the feasible set , we derive a Bellman optimality condition for problem and demonstrate how can be computed via DP.

Before getting into the main result, we define the Bellman operator for problem as follows:

where is the indicator function

With this definition of the Bellman operator , we will later see that the fixed-point solution of is equal to the solution of problem , given by (3).

For any bounded initial value function estimate with for , we define the value function sequence


The following theorem shows that this sequence of value function estimates converges to the solution of problem , which is also the unique fixed point of , .

[Bellman Optimality for ] For any initial value function estimate where at , there exists a limit function such that


Furthermore, is a unique solution to the fixed point equation: , .

By running the value iteration algorithm in (4), one obtains the optimal value function for problem . If , then, by Lemma 3.1, every feasible policy for problem , denoted by , can be obtained as

However, since the number of feasible policies is exponential in the size of the state and action spaces, their exhaustive enumeration is intractable, and the above result is useful only from a conceptual standpoint. To address this problem, we consider a refined notion of feasible control actions in terms of optimal state-action value functions (Q-functions), which provides the basis for the two-phase DP algorithm. Specifically, a Q-function is defined as:

where for . By defining the state-action Bellman operator

equivalently is a unique fixed point solution to for any , . Note the interpretation of from these equations: it is the constraint cost of starting at state , using control action in the first stage, and using an optimal policy of problem thereafter.

We are now in a position to define a refined set of feasible control actions, denoted by , whereby we retain only those actions that can guarantee the fulfillment of the constraint in problem . Specifically, for any state , we define:
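The whole of Phase 1 (value iteration for the auxiliary constraint problem, state-action values, then restriction to the actions certifying feasibility) can be sketched in tabular form. Everything below is an illustrative assumption: the names, the deterministic toy dynamics, and the finite state/action sets stand in for the paper's (much larger) model.

```python
# Tabular sketch of Phase 1 under toy deterministic dynamics.
def phase1_feasible_actions(states, actions, step, cost, n_iters=50):
    """step(s, a) -> next state; cost(s, a) -> immediate constraint-
    violation cost (0 when the constraint is satisfied). Returns the
    state-action values and, per state, the actions achieving the
    minimal (i.e., feasible) constraint cost."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: min(cost(s, a) + V[step(s, a)] for a in actions)
             for s in states}
    Q = {(s, a): cost(s, a) + V[step(s, a)] for s in states for a in actions}
    feasible = {s: [a for a in actions
                    if Q[(s, a)] <= min(Q[(s, b)] for b in actions)]
                for s in states}
    return Q, feasible
```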

3.2 Phase 2: Constrained Optimization

From the definition of the set of feasible policies , one can reformulate problem as:


This problem can be solved via value iteration by defining the Bellman operator (with respect to the refined set of control actions ):

The following theorem shows that there exists a unique fixed-point solution to and that this solution corresponds to the value function for the problem in (6) (and, hence, problem ).

[Bellman Optimality for ] For any initial value function estimate such that for any , there exists a limit function such that

Furthermore, is a unique solution to the fixed point equation: , for any .

Therefore, for any bounded initial value function estimate such that at , the value function estimate sequence


converges to the value function for problem .
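Phase 2 is then ordinary value iteration, except that the maximization at each state ranges only over the Phase-1 feasible actions. A toy sketch (with the same illustrative deterministic dynamics and assumed names as before, not the paper's model):

```python
# Sketch of Phase 2: value iteration for revenue, restricted to the
# feasibility-refined action set at every state.
def phase2_value_iteration(states, feasible, step, reward, n_iters=50):
    """feasible: dict state -> list of Phase-1 feasible actions."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: max(reward(s, a) + V[step(s, a)] for a in feasible[s])
             for s in states}
    return V
```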

Finally, we define the Q-function for problem (6) (denoted so as to distinguish it from the Q-function for problem ):

where is the value function for problem (6). By defining the state-action Bellman operator

one can show, similarly as before, that is a unique fixed point solution of for , .

The above two-phase Bellman optimality condition immediately leads to a two-phase DP algorithm for the solution of problem . However, such a DP algorithm presents two main implementation challenges. First, the algorithm is not applicable to the vehicle sharing problem considered in this paper since the state transition probabilities are not available explicitly (as discussed above). Second, when the sizes of the state and action spaces are large, updating the value iteration estimates is computationally intractable.

To address these computational challenges, in the next section we present a sampling-based two-phase Q-learning algorithm that approximates the solution to problem . Similar to the two-phase DP algorithm, in the first phase one updates the Q-function estimates for problem by sampling the vehicle sharing model. Then, in a second phase, one updates the Q-function estimates for problem (recall the notation introduced above to distinguish the two Q-functions).

4 Two-Phase Q-learning

In this section we present both synchronous and asynchronous versions of two-phase Q-learning to solve problem . In the synchronous version, the Q-function estimates of all state-action pairs are updated at each step. In contrast, in the asynchronous version, only the Q-function estimate of a sampled state-action pair is updated. Under mild assumptions, we show that both algorithms are asymptotically optimal. While the convergence rate of synchronous Q-learning is higher [12], asynchronous Q-learning is more computationally efficient.

4.1 Synchronous Two-Phase Q-learning

Suppose is an initial Q-function estimate such that for any . At iteration , the synchronous two-phase Q-learning algorithm samples states and updates the Q-function estimates for each state-action pair as follows:


The step size pair follows the update rule


The last equation implies that the first Q-function update is on the fast time scale, and the second is on the slow time scale. Notice that in the sampling approach, the state trajectory will enter the absorbing set in steps. While the convergence of Q-learning is polynomial in and (see the finite sample analysis in Theorem 1 of [12]), in order to get an accurate estimate of the Q-function one needs more state-action samples from the transient state space. However, once the state trajectory enters , it will never visit the transient state space again. To collect more samples from the transient state space, similar to the approaches adopted by sampling-based methods in [32], [9], here we reset the state to its initial condition immediately after it enters the absorbing set. The convergence result for the synchronous two-phase Q-learning algorithm is given in the following theorem.

[Convergence of Synchronous Q-learning] Suppose the step-sizes follow the update rule in (10). Then the sequence of Q-function estimates computed via synchronous two-phase Q-learning converges to the optimal Q-function pair component-wise with probability 1.
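One synchronous sweep of such a scheme can be sketched in tabular form. This is a toy reading, not the paper's algorithm: the names, the single shared step sizes, and the deterministic feasibility test (keep actions attaining the minimal constraint estimate) are all assumptions standing in for the unreproduced equations (8)-(10).

```python
# Hypothetical sketch: every state-action pair refreshes both estimates
# from a sampled next state. Q1 tracks constraint cost (Phase 1) and
# defines the feasible set used by the reward estimate Q2 (Phase 2).
def sync_two_phase_sweep(Q1, Q2, states, actions, sample_next, cost, reward,
                         alpha, beta):
    for s in states:
        for a in actions:
            s2 = sample_next(s, a)
            # Phase-1 target: constraint cost plus best achievable cost-to-go
            t1 = cost(s, a) + min(Q1[(s2, b)] for b in actions)
            Q1[(s, a)] += alpha * (t1 - Q1[(s, a)])
            # Phase-2 target: reward, maximized over currently feasible actions
            q1_min = min(Q1[(s2, b)] for b in actions)
            feas = [b for b in actions if Q1[(s2, b)] <= q1_min]
            t2 = reward(s, a) + max(Q2[(s2, b)] for b in feas)
            Q2[(s, a)] += beta * (t2 - Q2[(s, a)])
```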

After both Q-functions converge, a near-optimal policy can be computed as


where is the iteration index when the learning is stopped.
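The policy-extraction step admits a simple tabular sketch (assumed names; feasibility again approximated by attaining the minimal Phase-1 estimate): restrict to the Q1-feasible actions, then act greedily with respect to Q2.

```python
# Sketch: near-optimal deterministic policy from converged estimates.
def extract_policy(Q1, Q2, states, actions):
    policy = {}
    for s in states:
        q1_min = min(Q1[(s, a)] for a in actions)
        feasible = [a for a in actions if Q1[(s, a)] <= q1_min]
        policy[s] = max(feasible, key=lambda a: Q2[(s, a)])
    return policy
```

Note that an action with the highest reward estimate is rejected whenever it is not in the feasible set, which is exactly the point of the two-phase construction.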

4.2 Asynchronous Two-Phase Q-learning

Suppose is an initial Q-function estimate such that for any . At iteration and state , the asynchronous two-phase Q-learning algorithm (1) generates a control action


(2) samples states , and (3) updates the Q-function estimates as follows:

  • for and , the Q-function estimates are updated according to equations (8) and (9),

  • otherwise, the Q-function estimates are equal to their previous values, i.e.,

The convergence result for the asynchronous two-phase Q-learning algorithm is given in the following theorem.

[Convergence of Asynchronous Q-learning] Suppose the step-sizes follow the update rule in (10). Also suppose each state-action pair is visited infinitely often. Then, the sequence of Q-function estimates computed via asynchronous two-phase Q-learning converges to the optimal Q-function pair with probability 1.

Note that the convergence result relies on the assumption that each state-action pair is visited infinitely often. While this is a standard assumption in the Q-learning literature [3], by following analogous arguments as in [29], the above result can be proven under milder assumptions by using PAC analysis. As for synchronous two-phase Q-learning, a near-optimal policy can be computed by (11) after the Q-functions converge.
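A standard way to keep every pair visited is persistent exploration. The epsilon-greedy rule below is one hypothetical instantiation of the action-generation step (names and the feasibility test are assumptions, as in the earlier sketches), not the paper's rule:

```python
import random

# With probability eps pick uniformly at random; otherwise act greedily
# on the feasibility-restricted set. Persistent exploration (eps > 0) is
# one simple way to approach the infinite-visitation requirement.
def epsilon_greedy(Q1, Q2, s, actions, eps, rng):
    if rng.random() < eps:
        return rng.choice(actions)
    q1_min = min(Q1[(s, a)] for a in actions)
    feasible = [a for a in actions if Q1[(s, a)] <= q1_min]
    return max(feasible, key=lambda a: Q2[(s, a)])
```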

In the next section, we perform numerical experiments to compare the proposed two-phase Q-learning algorithms to a number of alternative approaches. In particular, we consider a Lagrangian relaxation method [1], whereby one transforms problem into a min-max MDP and solves for the optimal saddle point. However, finding the optimal Lagrange multiplier is a challenging problem. While multi-scale stochastic approximation algorithms such as actor-critic [4] are available for optimizing both the Lagrange multiplier and policy online, in order to update the Lagrange multiplier, one requires sequential gradient approximations. This makes the convergence of these algorithms very sensitive to the multiple step-sizes and thus non-robust for large-scale problems. Further numerical insights are provided below.

4.3 Numerical Results

Consider a small vehicle sharing system that consists of vehicles (), stations (), and a horizon of hours (). In problem , one aims to find an optimal assignment strategy that maximizes the total revenue subject to the vehicle utilization constraint. The constraint threshold is set equal to to ensure that the average utilization time of each vehicle is at least hours. In our comparative study, we consider (1) a Q-learning algorithm [12] that maximizes total revenue and does not take into account the utilization constraint; (2) a penalized Q-learning algorithm, which maximizes a combined utility function of revenue and constraint violation penalty; (3) a Q-learning algorithm with Lagrangian update [4], which approximately solves problem using an actor-critic method; and (4) the proposed two-phase Q-learning algorithm. The performance of these algorithms is evaluated via Monte Carlo trials, for which the corresponding empirical rewards and constraint costs are shown in Figures 3 and 4, respectively. The policy computed via Q-learning returns the highest total revenue, but the average vehicle utilization time is only hours. On the other hand, the policies computed via two-phase Q-learning and Q-learning with Lagrangian update decrease total revenue by but guarantee that the average vehicle utilization time is over hours. Also, one can note that the proposed two-phase Q-learning algorithm converges faster than the Q-learning algorithm with Lagrangian update. Finally, the policy from penalized Q-learning (here we perform a grid search on the penalty parameter in order to maximize total revenue while satisfying the vehicle utilization constraint) has the highest average vehicle utilization time ( hours) but the lowest total revenue (with a gap). As a further comparison, a greedy policy that assigns as many rentals as possible provides a total revenue as low as .

Figure 3: Reward Curve for Various Assignment Methods.
Figure 4: Constraint Curve for Various Assignment Methods.

In each iteration of the two-phase Q-learning algorithm, the assignment problem in (12) is cast as a bilinear integer linear programming (BILP) problem. Although BILP problems are NP-hard in general, with readily available optimization packages such as CPLEX [7], our algorithm is capable of solving medium-scale problems with up to vehicles, stations, and a horizon of hours. We believe there is still ample room for improvement by leveraging parallelization and representing the Q-functions with function approximation.

5 Conclusion

In this paper, we propose a novel CMDP model for one-way vehicle sharing systems whereby the real-time rental assignment of vehicles relies on an auction-style bidding paradigm. We rigorously derive a two-phase Bellman optimality condition for the CMDP and show that this problem can be solved (conceptually) using two-phase dynamic programming. Building upon this result, we propose a practical sampling-based two-phase Q-learning algorithm and show that its solution converges asymptotically to the value function of the CMDP.

Future work includes: 1) Providing a convergence rate for our two-phase Q-learning algorithm; 2) Generalizing the proposed bidding mechanism by using market-design mechanisms [16] and game theory [22]; and 3) Evaluating our algorithm on a large-scale vehicle sharing system.

6 Acknowledgements

Y.-L. Chow and M. Pavone are partially supported by Croucher Foundation Doctoral Scholarship, the Office of Naval Research, Science of Autonomy Program, under Contract N00014-15-1-2673, and by the National Science Foundation under CAREER Award CMMI-1454737.


  • [1] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
  • [2] M. Barth and M. Todd. Simulation Model Performance Analysis of a Multiple Station Shared Vehicle System. Transportation Research Part C: Emerging Technologies, 7(4):237–259, 1999.
  • [3] D. Bertsekas and J. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, 1996.
  • [4] V. Borkar. An Actor-critic Algorithm for Constrained Markov Decision Processes. Systems & Control letters, 54(3):207–213, 2005.
  • [5] D. Brook. Carsharing and Vehicle Sharing Report, 2005. California Center for Innovative Transportation at the University of California at Berkeley and Caltrans.
  • [6] Y. Chow and J. Yu. Real-time Bidding Based Vehicle Sharing. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1829–1830. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
  • [7] CPLEX, IBM ILOG. V12. 1: User’s Manual for CPLEX. International Business Machines Corporation, 46(53):157, 2009.
  • [8] M. DiDonato. City-bike Maintenance and Availability. PhD thesis, Worcester Polytechnic Institute, 2002.
  • [9] D. Egloff et al. Monte Carlo Algorithms for Optimal Stopping and Statistical Learning. The Annals of Applied Probability, 15(2):1396–1432, 2005.
  • [10] V. Epanechnikov. Non-parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications, 14(1):153–158, 1969.
  • [11] Z. Gábor, Z. Kalmár, and C. Szepesvári. Multi-criteria Reinforcement Learning. In ICML, volume 98, pages 197–205, 1998.
  • [12] M. Kearns and S. Singh. Finite-sample Convergence Rates for Q-learning and Indirect Algorithms. NIPS, pages 996–1002, 1999.
  • [13] P. Keegan. Zipcar-The Best New Idea in Business. CNNMoney.com, 2009.
  • [14] A. Kek et al. Relocation Simulation Model for Multiple-station Shared-use Vehicle Systems. Transportation Research Record: Journal of the Transportation Research Board, 1986(1):81–88, 2006.
  • [15] J. Kendall. Toyota Takes the i-Road. Automotive Engineering International, 121(3), 2013.
  • [16] V. Krishna. Auction Theory. Academic press, 2009.
  • [17] Y. Lim, S. Lee, and J. Kim. A Stochastic Process Model for Daily Travel Patterns and Traffic Information. In Agent and Multi-Agent Systems: Technologies and Applications, pages 102–110. Springer, 2007.
  • [18] M. Dell’Amico et al. The Bike Sharing Rebalancing Problem: Mathematical Formulations and Benchmark Instances. Omega, 45:7–19, 2014.
  • [19] W. Mitchell. Reinventing the Automobile: Personal Urban Mobility for the 21st Century. MIT press, 2010.
  • [20] R. Nair. Fleet Management for Vehicle Sharing Operations. Transportation Science, 45(4):524–540, 2011.
  • [21] R. Nair et al. Large-Scale Vehicle Sharing Systems: Analysis of Vélib. International Journal of Sustainable Transportation, 7(1):85–106, 2013.
  • [22] N. Nisan et al. Algorithmic Game Theory. Cambridge University Press, 2007.
  • [23] D. Papanikolaou. The Market Economy of Trips. PhD thesis, Massachusetts Institute of Technology, 2011.
  • [24] P. Paruchuri, M. Tambe, F. Ordonez, and S. Kraus. Towards a Formalization of Teamwork with Resource Constraints. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 596–603. IEEE Computer Society, 2004.
  • [25] E. Reynolds and K. McLaughlin. Autoshare: The Smart Alternative to Owning a Car. Autoshare, Toronto, Ontario, Canada, 2001.
  • [26] S. Ross. Stochastic Processes, volume 2. John Wiley & Sons New York, 1996.
  • [27] A. Schmauss. Car2go in Ulm, Germany, as an Advanced Form of Car-sharing. European Local Transport Information Service (ELTIS), 2009.
  • [28] J. Shu et al. Bicycle-sharing System: Deployment, Utilization and the Value of Re-distribution. National University of Singapore-NUS Business School, Singapore, 2010.
  • [29] A. Strehl, L. Li, E. Wiewiora, J. Langford, and M. Littman. PAC Model-free Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.
  • [30] T. Raviv et al. Static Repositioning in a Bike-sharing System: Models and Solution Approaches. EURO Journal on Transportation and Logistics, 2:187–229, 2013.
  • [31] K. Uesugi et al. Optimization of Vehicle Assignment for Car Sharing System. In Knowledge-based intelligent information and engineering systems, pages 1105–1111. Springer, 2007.
  • [32] H. Yu and D. Bertsekas. A Least Squares Q-learning Algorithm for Optimal Stopping Problems. Lab. for Information and Decision Systems Report, 2731, 2006.
  • [33] R. Zhang and M. Pavone. A Queueing Network Approach to the Analysis and Control of Mobility-on-demand Systems. In Proceedings of American Control Conference, 2015.
  • [34] R. Zhang and M. Pavone. Control of Robotic Mobility-On-Demand Systems: A Queueing-Theoretical Perspective. International Journal of Robotics Research, 2015.

Appendix A: Technical Proofs

A.1 Proof of Lemma 3.1

First, notice that

Thus, for any minimizer of problem whose solution is , it directly follows that

i.e., is a feasible policy for problem .

Conversely, suppose a control policy is feasible for problem , i.e.,

This implies that

Therefore, is a minimizer of problem , because the objective function of this problem is always non-negative.

A.2 Technical Properties of Bellman Operators

The Bellman operator has the following properties:

  • (Monotonicity) If , for any , then .

  • (Translation Invariance) For any constant , , for any .

  • (Contraction) There exists a positive vector and a constant such that .

The proofs of the monotonicity and translation-invariance properties follow directly from the definition of the Bellman operator. We now prove the contraction property. Recall that the element of state is a time counter, whose transition probability is given by if and by if . Clearly, the transition probability , which is a multivariate probability distribution over state , is bounded above by the marginal probability distribution of the element. Thus, for a vector such that


we have that

Here one observes that the effective “discount factor” is given by


Then for any vectors , ,

This further implies that the following contraction property holds: . The Bellman operator enjoys analogous properties: it is monotone, translation invariant, and a contraction mapping with respect to the norm. The proof is identical to that of Lemma A.2 and is omitted for brevity.
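The paper's operator contracts with respect to a weighted norm induced by the time counter; as a simpler numerical illustration of the same phenomenon, the sketch below checks the standard sup-norm contraction of the discounted Bellman optimality operator on a randomly generated toy MDP. All sizes and parameters are assumptions, unrelated to the vehicle sharing model.

```python
import numpy as np

# For the discounted Bellman optimality operator T,
# ||T(v1) - T(v2)||_inf <= gamma * ||v1 - v2||_inf for any v1, v2.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # row-stochastic transition kernels
R = rng.random((n_actions, n_states))    # rewards r(s, a)

def bellman(v):
    # (T v)(s) = max_a [ r(s, a) + gamma * sum_s' P(s' | s, a) v(s') ]
    return np.max(R + gamma * (P @ v), axis=0)

v1, v2 = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.max(np.abs(bellman(v1) - bellman(v2)))
rhs = gamma * np.max(np.abs(v1 - v2))
assert lhs <= rhs + 1e-12  # contraction with modulus gamma
```

The same check with a positive weight vector in the norm recovers the weighted-norm statement used in the proof above.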

A.3 Proof of Theorem 4

The first part of the proof is to show by induction that, for ,


For , the definition of the Bellman operator implies that

By the induction hypothesis, assume that (15) holds at . For ,