1 Introduction
One-way vehicle sharing systems represent an increasingly popular mobility paradigm aimed at making effective use of idle vehicles, reducing the demand for parking spaces, and possibly cutting the carbon footprint of personal transportation. One-way vehicle sharing systems (also referred to as mobility-on-demand, or MOD, systems) consist of a network of parking stations and a fleet of vehicles. A customer arriving at a given station can pick up a vehicle (if available) and drop it off at any other station within the city. Existing vehicle sharing systems include Zipcar [13], Car2Go [27] and Autoshare [25] for one-way car sharing, and Vélib [21] and Citybike [8] for one-way bike sharing. Figure 1 shows a typical Toyota i-Road one-way vehicle sharing system [15].
Despite these apparent advantages, one-way vehicle sharing systems present significant operational challenges. Due to the asymmetry of travel patterns within a city, several stations will eventually experience imbalances between vehicle departures and customer arrivals. Stations with low customer demand (e.g., in suburbs) will accumulate unused vehicles and require a large number of parking spaces, while stations with high demand (e.g., in the city center) will not be able to fulfill most customers’ requests during rush hours.
Literature Review: In general, there are two main methods available in the literature to address demand-supply imbalances in one-way vehicle sharing systems (if the vehicles can drive autonomously, additional rebalancing strategies are possible [34]). The first class of methods hires crew drivers to periodically relocate vehicles among stations. From a theoretical standpoint, optimal rebalancing of drivers and vehicles has been analyzed in [33] under the framework of queueing networks. In [2], [14], [28], the effectiveness of similar rebalancing strategies is numerically investigated via discrete event simulations. The work in [20] considers a stochastic mixed-integer programming model where the objective is to minimize vehicle relocation cost subject to a probabilistic constraint on the service level. The works in [30] and [18] consider a similar approach. While rebalancing can be carried out quite effectively in this way, these methods substantially increase sunk costs due to the large number of staff drivers that need to be hired (a characterization of the number of drivers is provided in [33]), and may not scale well to large transportation networks.
Alternatively, demand-supply imbalances can be addressed by applying incentive pricing to vehicle rentals. A typical incentive pricing mechanism is described in [23] and portrayed in Figure 2.
The strategy is to adjust rental prices at each station as a function of current vehicle inventory and customers’ requests. The work in [31] proposes a method to optimize vehicle assignments by trip splitting and trip joining, and [19] proposes a dynamic pricing strategy that enables clients of a carpooling system to trade off the convenience of a trip (i.e., its duration) against its cost. Carpooling strategies, however, may not be a scalable solution due to safety, convenience, and sociological reasons. Recently, the work in [6] proposes a bidding mechanism for vehicle rentals where at each station customers place bids and the vehicle sharing company decides which bids to accept. The operator’s sequential decision-making problem is posed as a constrained Markov decision problem (CMDP), which is solved exactly, or approximately using an actor-critic method.
A bidding strategy, such as the one proposed in [6], is attractive for several reasons. First, accepted vehicle rental bids instantly reflect current demand and supply at different stations. Second, by providing on-demand financial rewards for rebalancing vehicles, the rental company can save overhead costs associated with hiring crew drivers and renting extra parking spaces. Third, such a pricing mechanism promotes high vehicle utilization by encouraging extra vehicle rentals to less popular destinations and during non-rush hours.
Contributions: The contribution of this paper is threefold.

Leveraging our recent findings in [6], we propose a novel CMDP formulation for the problem of optimizing one-way vehicle sharing systems. The actions in the CMDP model represent vehicle rental decisions as a function of bids placed by customers. The objective is to maximize total revenue from rental assignments, subject to a vehicle utilization constraint. (Footnote 1: The vehicle utilization constraint ensures that the assignment policy does not excessively favor short-term rentals. Accordingly, this constraint provides service guarantees for customers in need of long-term rentals. Further details about the practical relevance of this constraint can be found in [5].)

We derive a two-phase Bellman optimality condition for CMDPs. This condition shows that CMDPs can be solved exactly by two-phase dynamic programming (DP).

We propose a novel sampling-based two-phase Q-learning algorithm for the solution of CMDPs, and show convergence to an optimal solution under mild assumptions.
This paper provides a first step toward designing market-based mechanisms for the operational optimization of one-way vehicle sharing systems. We describe a wealth of open problems at the end of the paper. Furthermore, the results concerning CMDPs are of independent interest and applicable more broadly. Due to space limitations, in this paper we only include the statements of our theoretical results. All proofs can be found in the supplementary material section.
2 Mathematical Model
In this section we present a mathematical model for one-way vehicle sharing systems and then pose a CMDP decision-making problem for its optimization.
2.1 General Model and Problem Data
Assume the vehicle sharing company owns vehicles, indexed , that can be parked at stations, indexed . The company allows a passenger to rent a vehicle for at most time slots; furthermore, the maximum fare/reward for each rental period is . We consider a discrete time model , where is the time horizon. At time and at each station , customers’ destinations, rental durations, and proposed travel fares are modeled via a multivariate (three-dimensional) stationary probability distribution with domain . The distribution models the inherently stochastic, time-variant customer demand for car rentals. A practical approach to estimating the demand distribution at each station is to use data-driven methods, for example [10].
At each time and station , there are rental requests, where is modeled as a Poisson random variable with rate (as usually done in the literature, see, for example, [17]). The associated customers’ destinations, rental durations, and proposed travel fares are i.i.d. samples drawn from the distribution. Such samples are collected in a random vector , that is:
where random variables , , and denote, respectively, a customer’s destination, rental duration, and proposed fare (the meaning of the indices is clear from the definitions).
For , denote by the number of customers that arrive at time at station and that wish to travel to station . By definition, , where represents the indicator function.
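The demand model above can be illustrated with a short sampling sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the joint distribution of (destination, duration, fare) is replaced by independent uniform draws as a placeholder, whereas the text assumes a general three-dimensional distribution estimated from data.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_requests(rate, destinations, max_duration, max_fare, rng):
    """Sample one station's requests for one time slot.

    Each request is a (destination, duration, fare) tuple. The uniform draws
    are a placeholder (assumption) for the station's demand distribution;
    a negative fare stands for a rebalancing reward paid by the company.
    """
    n_requests = sample_poisson(rate, rng)
    return [
        (
            rng.choice(destinations),
            rng.randint(1, max_duration),
            rng.uniform(-max_fare, max_fare),
        )
        for _ in range(n_requests)
    ]


def count_by_destination(requests, destinations):
    """Number of requests per destination (a sum of indicator functions)."""
    counts = {j: 0 for j in destinations}
    for destination, _, _ in requests:
        counts[destination] += 1
    return counts
```

For example, `count_by_destination(sample_requests(3.0, [1, 2, 3], 4, 10.0, random.Random(0)), [1, 2, 3])` returns the per-destination request counts for one sampled time slot.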
This model captures both concepts of renting and rebalancing. Note that at station , the random price offered by customer , i.e., , can either be positive or negative. When this quantity is positive, it means that the customer is willing to pay fare units for a vehicle, traveling to station in time units. If this quantity is negative, it means that the company is paying fare units to customer to travel to station in time units to fulfill rebalancing needs.
Note that at each time instant and at each station, there can potentially be more rental requests than vehicles available. The strategy is to rank all incoming rental requests by a price-to-travel-time function and assign vehicles according to such ranking. Specifically, for destination , we define the price-to-travel-time function as
By assigning vehicles according to the price-to-travel-time function (in descending order), one favors customers with high rental prices and short travel times (when renting), and drivers with low rewards and short travel times (when rebalancing). (If all vehicles move at similar speeds, one could equivalently consider distances instead of times.)
To explicitly model the instantaneous demand-supply imbalance throughout the transportation network, one can refine function by including dependencies on the arrival station and time instant . While this generalization is straightforward, we omit the details in the interest of brevity.
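The ranking rule can be sketched as follows. The scoring function used here is an assumption: it is one simple form that reproduces the ordering described above (fare divided by duration for paying customers, fare multiplied by duration for rebalancing rewards), and the function and parameter names are hypothetical.

```python
def price_to_travel_time(fare, duration):
    """Score a request; higher is better.

    Assumed form: fare / duration for non-negative fares, so high prices and
    short trips rank first; fare * duration for negative fares, so small
    rewards and short trips rank first. Durations are assumed to be >= 1.
    """
    return fare / duration if fare >= 0 else fare * duration


def assign_vehicles(requests, vehicles_available):
    """Return the requests served when ranking by score in descending order.

    Each request is a (fare, duration) tuple.
    """
    ranked = sorted(
        requests,
        key=lambda request: price_to_travel_time(request[0], request[1]),
        reverse=True,
    )
    return ranked[:vehicles_available]
```

With three vehicles and the requests `[(10.0, 5), (10.0, 2), (-2.0, 3), (-4.0, 1), (-2.0, 1)]`, the two paying customers are served first (the shorter trip ahead of the longer one), followed by the cheapest and shortest rebalancing trip.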
2.2 State Variables
We now proceed to construct a CMDP model for a one-way vehicle sharing system. We consider the following state variables:

For , is a counter state.

For and , is the destination station at time of the vehicle. Let .

For and , is the remaining travel time to reach the destination for the vehicle. Let .
Collectively, the state space is defined as . We let denote the initial state, and the state at time .
2.3 Decision Variables
At each time , in order to maximize expected revenue, the company makes decisions about renting vehicles to customers. Specifically, at each time , the decision variables are:

For each station , each vehicle , and , is a binary decision variable that indicates whether vehicle is destined to station at time . Let .
For each station-destination pair , we consider the following constraint to upper bound the number of vehicle dispatches at time :
(1) 
Intuitively, constraints (1) restrict the number of vehicle dispatches to be less than the number of customer requests. In the special case when , the upper bound is instead of because one needs to take into account the case when a vehicle stays idle at station . Additionally, we consider the following constraints to guarantee well-posedness of the vehicle assignments:
(2) 
Accordingly, the action space is , and is the action taken at time . Furthermore, define the set of admissible controls at state as , such that .
2.4 State Dynamics
For each time and station-destination pair , let
denote the set of vehicles at location which are allocated to destination . For , denote the rental durations and proposed fares of the highest ranked customers (according to the price-to-travel-time function ) as .
The case is special, as the quantity comprises vehicles that (1) will eventually return to station , and (2) stay idle at station . Notice that the definition of does not allow one to distinguish between these two cases. However, in this case, it is not profitable to rent vehicles to customers submitting negative bids. Hence one concludes that the number of vehicles that will eventually return to station should be equal to the number of customers that want to return to station and submit a positive bid. We denote such number as . We similarly define the vector , with the understanding that the first elements correspond to the customers that submit positive bids, while each remaining element is set to .
The state dynamics are then given as follows:

For each vehicle such that ,

For all other vehicles, for each station-destination pair , and each vehicle , uniformly randomly allocate to the available vehicles, i.e.,
for all .
As the problem is high-dimensional even for a moderate number of vehicles and stations, and since the state update equations depend in a rather involved fashion on the information vector , the explicit derivation of the state transition probabilities (denoted by ) is impractical. This motivates the learning approach proposed in this paper.
2.5 Revenue and Constraint Cost Functions
Define the immediate reward function as
The total revenue is then given by
From a pure profit maximization standpoint, an operator should favor short-term rental assignments, in order to minimize the opportunity cost of rejecting future customers that might be more profitable. While this strategy optimizes long-term revenue, it is not fair toward customers who require extended rental periods. To balance total profit with customers’ satisfaction, we impose an additional constraint that lower bounds vehicle utilization. Specifically, the constraint cost function captures vehicle utilization at each time instant according to the definition
where represents a threshold for the average utilization rate specified by the system operator. The vehicle utilization constraint is then given by
The objective is then to maximize expected revenue while satisfying the vehicle utilization constraint.
2.6 CMDP Formulation
Equipped with the state space , control space , immediate reward , transition probability (implicitly defined above), initial state , and immediate constraint cost function , we pose the problem of controlling a vehicle sharing system as a CMDP. Specifically, note that the problem can be formulated on an infinite horizon by interpreting the set of states (indeed containing a single state) as absorbing, i.e., once , the system enters a termination state, call it END, and stays there with zero reward and constraint cost. From this perspective, represents a deterministic stopping time, and the upper limits in the summations for the reward and cost functions can be replaced with . We also define as the set of transient states. Let be the set of closed-loop, Markovian stationary policies . A policy induces a stationary mass distribution over the realizations of the stochastic process . It is well known that for CMDPs there is no loss of optimality in restricting attention to policies in (instead of also considering, e.g., history-dependent policies).
Accordingly, in this paper we wish to solve the CMDP problem:
Problem – Solve
subject to
Since, in our problem, the state and action spaces are exponentially large (in terms of the vehicle number and station number ) and the explicit derivation of the state transition probability is intractable, exact solution methods for CMDPs (e.g., [1]) are not applicable.
On the other hand, even if one computes an optimal randomized policy for Problem , executing such a policy on a vehicle sharing platform with a simple system architecture may lead to vehicle miscoordination problems [24]. This motivates us to restrict the structure of admissible policies in problem to Markovian, stationary and deterministic. In the next section we will introduce a two-phase Bellman optimality condition for problem , which serves as the theoretical underpinning for the design of an asymptotically optimal two-phase Q-learning algorithm.
3 Two-Phase Dynamic Programming Algorithm
In this section, by leveraging the results in [11], we present a two-phase DP algorithm for problem . As we shall see, the first step is to compute a function that allows one to refine the set of feasible control actions (essentially, we retain only those actions that can guarantee the fulfillment of the constraint in problem ). We then define a Bellman operator (restricted to the refined set of control actions) that allows the computation of an optimal policy. The DP algorithm presented in this section provides the conceptual basis for the two-phase Q-learning algorithm presented in the next section.
3.1 Phase 1: Finding the Feasible Set
In this section we characterize the feasible set for problem using the set of optimal policies from an auxiliary MDP problem, defined as follows:
Problem – Solve
(3) where is the set of closed-loop, Markovian, stationary, and deterministic policies.
The following (trivial) result relates the minimizers of problem to the feasible solutions of problem .
Let be a Markovian stationary deterministic policy that minimizes problem , that is
Then the solution cost is equal to zero if and only if
Equipped with the above result, one can immediately deduce the following:

If the solution cost to problem is strictly larger than zero, i.e.,
then problem is infeasible.

Otherwise, the feasible set of Markovian stationary deterministic policies is given by
In order to characterize the feasible set , we derive a Bellman optimality condition for problem and demonstrate how can be computed via DP.
Before getting into the main result, we define the Bellman operator for problem as follows:
where is the indicator function
With such definition of Bellman operator , we will later see that the fixed point solution of , is equal to the solution of problem , given by (3).
For any bounded initial value function estimate with for , we define the value function sequence
(4) 
The following theorem shows that this sequence of value function estimates converges to the solution of problem , which is also the unique fixed point of , . [Bellman Optimality for ] For any initial value function estimate where at , there exists a limit function such that
(5) 
Furthermore, is a unique solution to the fixed point equation: , . By running the value iteration algorithm in (4), one obtains the optimal value function for problem . If , then, by Lemma 3.1, every feasible policy for problem , denoted by , can be obtained as
However, since the number of feasible policies is exponential in the size of the state and action spaces, their exhaustive enumeration is intractable, and the above result is useful only from a conceptual standpoint. To address this problem, we consider a refined notion of feasible control actions in terms of optimal state-action value functions (Q-functions), which provides the basis for the two-phase DP algorithm. Specifically, a Q-function is defined as:
where for . By defining the state-action Bellman operator
equivalently, is the unique fixed point solution to for any , . Note the interpretation of from these equations: it is the constraint cost of starting at state , using control action in the first stage, and using an optimal policy of problem thereafter.
We are now in a position to define a refined set of feasible control actions, denoted by , whereby we retain only those actions that can guarantee the fulfillment of the constraint in problem . Specifically, for any state , we define:
3.2 Phase 2: Constrained Optimization
From the definition of the set of feasible policies , one can reformulate problem as:
(6) 
This problem can be solved via value iteration by defining the Bellman operator (with respect to the refined set of control actions ):
The following theorem shows there exists a unique fixed point solution to and such solution corresponds to the value function for the problem in (6) (and, hence, problem ).
[Bellman Optimality for ] For any initial value function estimate such that for any , there exists a limit function such that
Furthermore, is a unique solution to the fixed point equation: , for any .
Therefore, for any bounded initial value function estimate such that at , the value function estimate sequence
(7) 
converges to the value function for problem .
Finally, we define the Q-function for problem (6) (denoted as to distinguish it from the Q-function for problem ):
where is the value function for problem (6). By defining the state-action Bellman operator
one can show, similarly as before, that is the unique fixed point solution of for , .
The above two-phase Bellman optimality condition immediately leads to a two-phase DP algorithm for the solution of problem . However, such a DP algorithm presents two main implementation challenges. First, the algorithm is not applicable to the vehicle sharing problem considered in this paper since the state transition probabilities are not available explicitly (as discussed above). Second, when the state and action spaces are large, updating the value iteration estimates is computationally intractable.
To address these computational challenges, in the next section we present a sampling-based two-phase Q-learning algorithm that approximates the solution to problem . Similar to the two-phase DP algorithm, in the first phase one updates the Q-function estimates for problem “by sampling” the vehicle sharing model. Then, in a second phase, one updates the Q-function estimates for problem (recall that such functions are referred to as functions ).
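As a rough illustration of the two-phase DP idea of this section, the sketch below runs two value-iteration passes on a tiny hand-made model. Everything in it is an assumption made for illustration: the two-stage toy model and its numbers are hypothetical, and the feasibility test (keep actions whose minimal expected constraint cost-to-go is nonpositive) is a simplified stand-in for the Phase 1 operator of the paper, not a reproduction of it.

```python
END = "END"  # absorbing termination state with zero reward and constraint cost

# Hypothetical toy model:
# MODEL[state][action] = (reward, constraint_cost, [(prob, next_state), ...])
MODEL = {
    0: {"short": (5.0, 1.0, [(1.0, 1)]), "long": (3.0, -1.0, [(1.0, 1)])},
    1: {"short": (5.0, 1.0, [(1.0, END)]), "long": (3.0, -1.0, [(1.0, END)])},
}


def value_iteration(model, stage_value, allowed, pick, sweeps=50):
    """Iterate V(x) = pick over allowed u of [stage value + expected V(next)]."""
    V = {x: 0.0 for x in model}
    V[END] = 0.0
    Q = {}
    for _ in range(sweeps):
        for x, actions in model.items():
            for u in allowed[x]:
                reward, cost, transitions = actions[u]
                Q[(x, u)] = stage_value(reward, cost) + sum(
                    p * V[y] for p, y in transitions
                )
            V[x] = pick(Q[(x, u)] for u in allowed[x])
    return V, Q


def two_phase_dp(model):
    all_actions = {x: list(actions) for x, actions in model.items()}
    # Phase 1: minimal expected constraint cost-to-go for every state-action pair.
    _, Q_D = value_iteration(model, lambda r, d: d, all_actions, min)
    # Simplified feasibility test (assumption): keep actions with Q_D <= 0.
    feasible = {
        x: [u for u in all_actions[x] if Q_D[(x, u)] <= 0.0] for x in model
    }
    if any(not actions for actions in feasible.values()):
        raise ValueError("toy check failed: some state has no feasible action")
    # Phase 2: maximize revenue using only the retained actions.
    V, Q = value_iteration(model, lambda r, d: r, feasible, max)
    policy = {x: max(feasible[x], key=lambda u: Q[(x, u)]) for x in model}
    return feasible, V, policy
```

On this toy model the unconstrained optimum would pick the higher-revenue action twice, while the two passes keep only the lower-revenue action in the last stage, so the returned policy trades some revenue for a nonpositive cumulative constraint cost.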
4 Two-Phase Q-learning
In this section we present both synchronous and asynchronous versions of two-phase Q-learning to solve problem . In the synchronous version, the Q-function estimates of all state-action pairs are updated at each step. In contrast, in the asynchronous version, only the Q-function estimate of a sampled state-action pair is updated. Under mild assumptions, we show that both algorithms are asymptotically optimal. While the convergence rate of synchronous Q-learning is higher [12], asynchronous Q-learning is more computationally efficient.
4.1 Synchronous Two-Phase Q-learning
Suppose is an initial function estimate such that for any . At iteration , the synchronous two-phase Q-learning algorithm samples states and updates the Q-function estimates for each state-action pair as follows:
(8)  
(9) 
The step size pair follows the update rule
(10) 
The last equation implies that the update of the constraint-cost Q-function is on the fast time scale, and the update of the reward Q-function is on the slow time scale. Notice that in the sampling approach, the state trajectory will enter the absorbing set in steps. While the convergence rate of Q-learning is polynomial in and (see the finite sample analysis in Theorem 1 of [12]), in order to get an accurate estimate of the Q-function one needs more state-action samples from the transient state space. However, once the state trajectory enters , it will never visit the transient state space again. To collect more samples from the transient state space, similar to the approaches adopted by the sampling-based methods in [32], [9], here we reset the state to its initial condition immediately after it enters the absorbing set. The convergence result for the synchronous two-phase Q-learning algorithm is given in the following theorem.
[Convergence of Synchronous Q-learning] Suppose the step sizes follow the update rule in (10). Then the sequence of Q-function estimates computed via synchronous two-phase Q-learning converges to the optimal Q-function pair componentwise with probability .
After both Q-functions converge, a near-optimal policy can be computed as
(11) 
where is the iteration index at which the learning is stopped.
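The synchronous scheme can be sketched on a tiny tabular example. This is a sketch under several assumptions, not the algorithm as stated in updates (8) to (10): the toy model and its numbers are hypothetical, the step sizes are one common two-timescale choice (a faster-decaying step for the reward estimates than for the constraint-cost estimates), the feasibility test is the simplified "constraint cost-to-go is nonpositive" rule, and a fallback to the least-cost action is added for states where no action currently passes the test.

```python
import random

END = "END"  # absorbing termination state with zero reward and constraint cost

# Hypothetical toy model:
# MODEL[state][action] = (reward, constraint_cost, [(prob, next_state), ...])
MODEL = {
    0: {"short": (5.0, 0.5, [(1.0, 1)]), "long": (3.0, -1.0, [(1.0, 1)])},
    1: {"short": (5.0, 1.0, [(1.0, END)]), "long": (3.0, -1.0, [(1.0, END)])},
}


def feasible_actions(Q_D, model, x):
    """Actions with nonpositive estimated constraint cost-to-go.

    Falls back to the least-cost action if none qualifies (an assumption).
    """
    keep = [u for u in model[x] if Q_D[(x, u)] <= 0.0]
    return keep or [min(model[x], key=lambda u: Q_D[(x, u)])]


def two_phase_q_learning(model, iterations=2000, seed=0):
    rng = random.Random(seed)
    pairs = [(x, u) for x in model for u in model[x]]
    Q_D = {pair: 0.0 for pair in pairs}  # constraint-cost estimates (fast)
    Q = {pair: 0.0 for pair in pairs}    # reward estimates (slow)
    for k in range(1, iterations + 1):
        fast, slow = k ** -0.6, 1.0 / k  # assumed two-timescale step sizes
        new_Q_D, new_Q = {}, {}
        for x, u in pairs:  # synchronous: every pair is updated each iteration
            reward, cost, transitions = model[x][u]
            weights = [p for p, _ in transitions]
            y = rng.choices([s for _, s in transitions], weights=weights)[0]
            if y == END:
                next_cost, next_value = 0.0, 0.0
            else:
                next_cost = min(Q_D[(y, v)] for v in model[y])
                next_value = max(
                    Q[(y, v)] for v in feasible_actions(Q_D, model, y)
                )
            new_Q_D[(x, u)] = Q_D[(x, u)] + fast * (
                cost + next_cost - Q_D[(x, u)]
            )
            new_Q[(x, u)] = Q[(x, u)] + slow * (
                reward + next_value - Q[(x, u)]
            )
        Q_D, Q = new_Q_D, new_Q
    policy = {
        x: max(feasible_actions(Q_D, model, x), key=lambda u: Q[(x, u)])
        for x in model
    }
    return Q_D, Q, policy
```

Calling `two_phase_q_learning(MODEL)` returns both tables and the greedy policy over the retained actions. The toy transitions are deterministic, so the sampling step is trivial here; with stochastic transitions the same loop applies unchanged.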
4.2 Asynchronous Two-Phase Q-learning
Suppose is an initial function estimate such that for any . At iteration and state , the asynchronous two-phase Q-learning algorithm (1) generates a control action
(12) 
(2) samples states , and (3) updates the Q-function estimates as follows:

otherwise, the Q-function estimates are equal to their previous values, i.e.,
The convergence result for the asynchronous two-phase Q-learning algorithm is given in the following theorem.
[Convergence of Asynchronous Q-learning] Suppose the step sizes follow the update rule in (10). Also suppose each state-action pair is visited infinitely often. Then, the sequence of Q-function estimates computed via asynchronous two-phase Q-learning converges to the optimal Q-function pair with probability .
Note that the convergence result relies on the assumption that each state-action pair is visited infinitely often. While this is a standard assumption in the Q-learning literature [3], by following arguments analogous to those in [29], the above result can be proven under milder assumptions by using PAC analysis. As for synchronous two-phase Q-learning, a near-optimal policy can be computed by (11) after the Q-functions converge.
In the next section, we perform numerical experiments to compare the proposed two-phase Q-learning algorithms to a number of alternative approaches. In particular, we consider a Lagrangian relaxation method [1], whereby one transforms problem into a min-max MDP and solves for the optimal saddle point. However, finding the optimal Lagrange multiplier is a challenging problem. While multiscale stochastic approximation algorithms such as actor-critic [4] are available for optimizing both the Lagrange multiplier and the policy online, updating the Lagrange multiplier requires sequential gradient approximations. This makes the convergence of these algorithms very sensitive to the multiple step sizes and thus not robust on large-scale problems. Further numerical insights are provided below.
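For reference, a Lagrangian relaxation of this kind is typically written as the saddle-point problem below. This is a generic sketch in assumed notation ($r$ and $d$ for the immediate reward and constraint cost, $\lambda$ for the multiplier, $\pi$ for the policy, $T$ for the horizon), and it assumes that the utilization constraint is expressed as a nonpositive expected cumulative constraint cost.

```latex
\min_{\lambda \ge 0} \; \max_{\pi} \;
\mathbb{E}\!\left[ \sum_{t=0}^{T-1} r(x_t, u_t) \right]
- \lambda \, \mathbb{E}\!\left[ \sum_{t=0}^{T-1} d(x_t, u_t) \right]
```

For a fixed multiplier the inner maximization is an unconstrained MDP with reward $r - \lambda d$, which is why the difficulty concentrates in the choice of $\lambda$.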
4.3 Numerical Results
Consider a small vehicle sharing system that consists of vehicles (), stations (), and a horizon of hours (). In problem , one aims to find an optimal assignment strategy that maximizes the total revenue subject to the vehicle utilization constraint. The constraint threshold is set equal to to ensure that the average utilization time of each vehicle is at least hours. In our comparative study, we consider (1) a Q-learning algorithm [12] that maximizes total revenue and does not take into account the utilization constraint; (2) a penalized Q-learning algorithm, which maximizes a combined utility function of revenue and constraint violation penalty; (3) a Q-learning algorithm with Lagrangian update [4], which approximately solves problem using an actor-critic method; and (4) the proposed two-phase Q-learning algorithm. The performance of these algorithms is evaluated via Monte Carlo trials, for which the corresponding empirical rewards and constraint costs are shown in Figures 3 and 4, respectively. The policy computed via Q-learning returns the highest total revenue, but the average vehicle utilization time is only hours. On the other hand, the policies computed by two-phase Q-learning and Q-learning with Lagrangian update decrease total revenue by but guarantee that the average vehicle utilization time is over hours. One can also note that the proposed two-phase Q-learning algorithm converges faster than the Q-learning algorithm with Lagrangian update. Finally, the policy from penalized Q-learning has the highest average vehicle utilization time ( hours) but the lowest total revenue (with a gap). (Footnote 2: Here we perform a grid search on the penalty parameter in order to maximize the total revenue while satisfying the vehicle utilization constraint.) As a further comparison, a greedy policy that assigns as many rentals as possible provides a total revenue as low as .
In each iteration of the two-phase Q-learning algorithm, the assignment problem in (12) is cast as a bilinear integer linear programming (BILP) problem. Although BILP problems are NP-hard in general, with readily available optimization packages such as CPLEX [7], our algorithm is capable of solving medium-scale problems with up to vehicles, stations, and a horizon of hours. We believe there is still ample room for improvement by leveraging parallelization and characterizing the Q-functions with function approximations.
5 Conclusion
In this paper, we propose a novel CMDP model for one-way vehicle sharing systems whereby the real-time rental assignment of vehicles relies on an auction-style bidding paradigm. We rigorously derive a two-phase Bellman optimality condition for the CMDP and show that this problem can be solved (conceptually) using two-phase dynamic programming. Building upon this result, we propose a practical sampling-based two-phase Q-learning algorithm and show that the solution converges asymptotically to the value function of the CMDP.
Future work includes: 1) providing a convergence rate for our two-phase Q-learning algorithm; 2) generalizing the proposed bidding mechanism by using market-design mechanisms [16] and game theory [22]; and 3) evaluating our algorithm on a large-scale vehicle sharing system.
6 Acknowledgements
Y.L. Chow and M. Pavone are partially supported by a Croucher Foundation Doctoral Scholarship, the Office of Naval Research, Science of Autonomy Program, under Contract N00014-15-1-2673, and by the National Science Foundation under CAREER Award CMMI-1454737.
References

[1] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
[2] M. Barth and M. Todd. Simulation Model Performance Analysis of a Multiple Station Shared Vehicle System. Transportation Research Part C: Emerging Technologies, 7(4):237–259, 1999.
[3] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[4] V. Borkar. An Actor-Critic Algorithm for Constrained Markov Decision Processes. Systems & Control Letters, 54(3):207–213, 2005.
[5] D. Brook. Carsharing and Vehicle Sharing Report. http://fresno.ts.odu.edu/newitsd/ITS_Serv_Tech/carsharing/carsharing_report.html, 2005. California Center for Innovative Transportation at the University of California at Berkeley and Caltrans.
[6] Y. Chow and J. Yu. Real-time Bidding Based Vehicle Sharing. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1829–1830. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
[7] CPLEX, IBM ILOG. V12.1: User’s Manual for CPLEX. International Business Machines Corporation, 46(53):157, 2009.
[8] M. DiDonato. Citybike Maintenance and Availability. PhD thesis, Worcester Polytechnic Institute, 2002.
[9] D. Egloff et al. Monte Carlo Algorithms for Optimal Stopping and Statistical Learning. The Annals of Applied Probability, 15(2):1396–1432, 2005.
[10] V. Epanechnikov. Nonparametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications, 14(1):153–158, 1969.
[11] Z. Gábor, Z. Kalmár, and C. Szepesvári. Multi-criteria Reinforcement Learning. In ICML, volume 98, pages 197–205, 1998.
[12] M. Kearns and S. Singh. Finite-sample Convergence Rates for Q-learning and Indirect Algorithms. NIPS, pages 996–1002, 1999.
[13] P. Keegan. Zipcar: The Best New Idea in Business. CNNMoney.com, 2009.
[14] A. Kek et al. Relocation Simulation Model for Multiple-station Shared-use Vehicle Systems. Transportation Research Record: Journal of the Transportation Research Board, 1986(1):81–88, 2006.
[15] J. Kendall. Toyota Takes the i-Road. Automotive Engineering International, 121(3), 2013.
[16] V. Krishna. Auction Theory. Academic Press, 2009.
[17] Y. Lim, S. Lee, and J. Kim. A Stochastic Process Model for Daily Travel Patterns and Traffic Information. In Agent and Multi-Agent Systems: Technologies and Applications, pages 102–110. Springer, 2007.
[18] D. Mauro et al. The Bike Sharing Rebalancing Problem: Mathematical Formulations and Benchmark Instances. Omega, 45:7–19, 2014.
[19] W. Mitchell. Reinventing the Automobile: Personal Urban Mobility for the 21st Century. MIT Press, 2010.
[20] R. Nair. Fleet Management for Vehicle Sharing Operations. Transportation Science, 45(4):524–540, 2011.
[21] R. Nair et al. Large-Scale Vehicle Sharing Systems: Analysis of Vélib. International Journal of Sustainable Transportation, 7(1):85–106, 2013.
[22] N. Nisan et al. Algorithmic Game Theory. Cambridge University Press, 2007.
[23] D. Papanikolaou et al. The Market Economy of Trips. PhD thesis, Massachusetts Institute of Technology, 2011.
[24] P. Paruchuri, M. Tambe, F. Ordonez, and S. Kraus. Towards a Formalization of Teamwork with Resource Constraints. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 596–603. IEEE Computer Society, 2004.
[25] E. Reynolds and K. McLaughlin. Autoshare: The Smart Alternative to Owning a Car. Autoshare, Toronto, Ontario, Canada, 2001.
[26] S. Ross. Stochastic Processes, volume 2. John Wiley & Sons, New York, 1996.
[27] A. Schmauss. Car2go in Ulm, Germany, as an Advanced Form of Carsharing. European Local Transport Information Service (ELTIS), 2009.
[28] J. Shu et al. Bicycle-sharing System: Deployment, Utilization and the Value of Redistribution. National University of Singapore, NUS Business School, Singapore, 2010.
[29] A. Strehl, L. Li, E. Wiewiora, J. Langford, and M. Littman. PAC Model-free Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.
[30] R. Tal et al. Static Repositioning in a Bike-sharing System: Models and Solution Approaches. European Journal of Transportation and Logistics, 2:187–229, 2013.
[31] K. Uesugi et al. Optimization of Vehicle Assignment for Car Sharing System. In Knowledge-based Intelligent Information and Engineering Systems, pages 1105–1111. Springer, 2007.
[32] H. Yu and D. Bertsekas. A Least Squares Q-learning Algorithm for Optimal Stopping Problems. Lab. for Information and Decision Systems Report, 2731, 2006.
[33] R. Zhang and M. Pavone. A Queueing Network Approach to the Analysis and Control of Mobility-on-Demand Systems. In Proceedings of American Control Conference, 2015.
[34] R. Zhang and M. Pavone. Control of Robotic Mobility-On-Demand Systems: A Queueing-Theoretical Perspective. International Journal of Robotics Research, 2015.
Appendix A: Technical Proofs
A.1 Proof of Lemma 3.1
First notice that
Thus, for any minimizer of problem whose solution cost is zero, it directly follows that
i.e., is a feasible policy of problem .
On the other hand, suppose a control policy is feasible for problem , i.e.,
This implies that
Therefore is a minimizer of problem because the objective function of this problem is always nonnegative.
A.2 Technical Properties of Bellman Operators
The Bellman operator has the following properties:

(Monotonicity) If , for any , then .

(Translation Invariance) For any constant , , for any .

(Contraction) There exists a positive vector and a constant such that .
The proofs of the monotonicity and translation invariance properties follow directly from the definition of the Bellman operator. We now prove the contraction property. Recall that the element in state is a time counter; its transition probability is given by if and if . The transition probability , which is a multivariate probability distribution of state , is less than or equal to the marginal probability distribution of the element. Thus for vector such that
(13) 
we have that
Here one observes that the effective “discounting factor” is given by
(14) 
Then for any vectors , ,
This further implies that the following contraction property holds: . Similarly, the Bellman operator also has the following properties. The Bellman operator is monotonic, translation invariant, and a contraction mapping with respect to the norm. The proof of this lemma is identical to the proof of Lemma A.2 and is omitted for the sake of brevity.
A.3 Proof of Theorem 4
The first part of the proof is to show by induction that for ,
(15) 
For , the definition of Bellman operator implies that
By the induction hypothesis, assume (15) holds at . For ,