I Introduction
Riding on the wave of the sharing economy, carsharing services such as Car2go (https://www.car2go.com), Wunder Mobility (https://www.wundermobility.com), TURO (https://www.turo.com), Zipcar (https://www.zipcar.com) and Communauto (https://www.communauto.com) play an increasingly important role in offering economical and environmentally conscious mobility options to citizens, especially in highly populated urban areas. For society, carsharing can free up parking space and reduce traffic congestion and air pollution [1]. For individual users, it satisfies mobility needs with fewer ownership responsibilities and lower costs. In addition, carsharing gives users access to a wide range of vehicles, which allows them to match the vehicle to the purpose of the trip. The earliest carsharing services can be traced back to the 1940s in Europe and the 1980s in North America [2]. Despite these early origins, only the past decade has seen significant growth in large-scale carsharing businesses, which can be mainly attributed to the proliferation of the mobile internet.
A carsharing service can be financed by public and/or private entities and is managed by a service organization that maintains a fleet of cars and light trucks in a network of vehicle locations. Individuals gain access to carsharing by becoming members of the organization. Typically, a member pays a modest fixed charge plus a usage fee each time they use a vehicle. Vehicles are usually deployed in a lot located in a neighborhood or at a transit station. A member can reserve a vehicle by phone or over the Internet. Once the reservation is approved, the vehicle is assigned to the member, who picks it up at the appointed time and leaves it at a specific carsharing location. The return location may be the pickup point itself (two-way systems), a different station (one-way carsharing systems [3]), or anywhere in a specified zone (free-floating carsharing systems [4]).
Three levels of decision-making, namely the strategic, tactical and operational levels, are involved in the management of carsharing [4, 5]. Strategic decisions include determining the mode of the network (one-way, two-way or free-floating), the number, location and capacity of stations, and the fleet size. Tactical decisions mainly involve management policies that govern the service in the medium term, such as reservation and pricing policies. Operational decisions are those that need to be made on a daily basis according to dynamic market and fleet conditions. Typical examples include placing initial inventories at each location and relocating vehicles across the network of locations to accommodate the realized demand. In this paper, we propose a data-driven optimization framework to support vehicle relocation decisions as well as initial inventory placement decisions in carsharing management. To begin with, we review the related work in the literature.
I-A Related Works
Vehicle relocation problems in the carsharing context have been studied extensively in the literature. One major stream of work models the carsharing relocation problem (CSRP) with deterministic optimization techniques; the resulting models are solved either by exact large-scale optimization algorithms such as Lagrangian relaxation and branch-and-bound, or by heuristic algorithms such as neighborhood search and simulated annealing. For example, Gambella et al. [6] formulate the electric vehicle relocation problem (EVRP) as two mixed-integer programming (MIP) models that maximize the profit associated with the trips performed by users in operating hours and non-operating hours, respectively. The models take EV battery consumption and the recharging process into consideration. Because of the computational complexity, two model-based heuristic algorithms, based on removing relocations and on a rolling-horizon mechanism, are designed to solve the relocation model. The experimental results show that the proposed algorithms achieve near-optimal solutions and outperform the solutions obtained by CPLEX under a time limit. Similarly, the authors in [7] investigate the electric vehicle fleet size and trip pricing problem, which is formulated as a mixed-integer nonlinear programming (MINLP) model that maximizes the overall profit by defining both the long-term resource allocation and the short-term operation strategy. Specifically, the proposed MINLP model optimizes the station location, station capacity and fleet size simultaneously. To solve this large-scale MINLP problem, a customized gradient algorithm is introduced and validated in a real case study. An integrated framework for electric vehicle rebalancing and staff relocation (EVR&SR) is proposed in [8]. The EVR&SR problem is represented on a space-time network and formulated as a mixed-integer linear programming (MILP) model that minimizes the overall cost, including investment costs and operating expenses. The framework considers both the optimal allocation plan of EVs and staff at the strategic level and the EV relocation and staff relocation decisions. Since even medium-scale instances cannot be solved effectively by CPLEX or Gurobi, a Lagrangian relaxation-based solution approach is introduced for large-scale instances; it decomposes the primal problem into a group of subproblems that are solved with dynamic programming and a greedy algorithm, and it reaches near-optimal solutions in a short time. In [9], a more general framework involving a multi-objective MILP model and a virtual hub is introduced. In detail, the multi-objective model considers both vehicle relocation and electric charging requirements, while the virtual hub is an aggregation device for handling the extremely large number of relocation variables. The problem can be solved by a typical branch-and-bound approach, which generates the efficient frontier and balances the operator's and the users' benefits while maximizing the operator's net revenue. To guarantee the flexibility of the carsharing service, [10] proposes a two-stage optimization model that optimizes destination locations and maximizes the manager's profit. However, the aforementioned studies do not consider uncertain parameters such as demand, supply and travel time. Thus, these modeling approaches cannot be directly applied to our CSRP.

Another line of literature models the CSRP with stochastic programming (SP) techniques. A similar application, the bike sharing allocation and rebalancing problem (BSA&RP), is introduced in [5].
To minimize the total expected penalty, which is the sum of all penalties charged for delivery, rebalancing, extra and excess inventory, and stockouts, the problem is formulated as a two-stage stochastic programming model. In this model, the initial allocation at the strategic level is the first-stage decision, while rebalancing is handled in the second-stage decision. A heuristic algorithm based on scenario generation is devised to solve the model. A multi-stage stochastic linear programming (SLP) model is developed in [11] for optimizing the strategic allocation of carsharing vehicles (OSACV) under dynamic and uncertain demand. In the problem setting, vehicles are assumed to be in use, in transit empty, or stationary empty, and the travel time between locations is one day. The aim is to maximize the total expected profit, which accounts for revenue and moving costs at both the strategic and operational levels. Since the model involves seven stages, a scenario tree approach is used to solve it. In [12], the authors address large-scale dynamic repositioning and routing problem (DRRP) instances with stochastic customer demand. With minor extensions, the DRRP can be applied in similar fields such as bike sharing. A two-stage stochastic programming model based on a network flow formulation is built to minimize the expected cost, where customer arrivals and starting times are assumed to follow a Poisson distribution. An iterative algorithm called SPAR (separable, projective, approximation, routine) is adapted to solve the model in a real-world case study. Nevertheless, the above models and approaches cannot be applied directly in a data-driven environment, since they do not exploit the historical data accurately. Furthermore, mathematical models based on SP assume that the probability distribution is known and of a specific type. In real historical data, however, the distribution may require many, even infinitely many, parameters and cannot be described by a simple known distribution such as the Gaussian distribution or the Poisson distribution used in [12].
I-B Research Gaps
Nowadays, with the rapid development of urban transportation, a huge amount of data is generated every day, which is leading to significant changes in intelligent transportation systems [13, 14]. However, the growing volume of data brings new challenges to the traditional optimization of the carsharing relocation problem (CSRP), which plays a key role in carsharing systems (CSS). For example, the variability of customer demand (traffic flow) has a great impact on inventory levels, and inappropriate decisions may lead to a poor service level [15]. Therefore, handling uncertainty in a data-driven environment is the key issue for the CSRP.
The major limitation of previous SP-related works is that the probability distribution is assumed to be known or is estimated from experience. In those works, the probability distributions are determined by decision-makers using a parametric approach: the decision-makers first select a specific parametric distribution (e.g., a Gaussian distribution), and the parameters of the distribution are then determined by statistical methods. However, in most real applications, the true distribution may be too complex to be described by simple parametric approaches. Therefore, we explore machine learning approaches to make the SP model more practical. Recently, combining machine learning (ML) and deep learning (DL) [16] with optimization techniques has become a trend in the operations research (OR) community [17, 18]; this combination is known as data-driven optimization. A few researchers have leveraged ML to make optimization models more realistic and applied the idea in the chemical industry [19, 20]. In detail, they applied the Dirichlet process mixture model (DPMM) and principal component analysis (PCA) to a distributionally robust optimization (DRO) model, which does not serve our purpose. To the best of our knowledge, no similar work has been applied to the CSRP.
I-C Objectives and Contributions
In light of the results of previous works [19, 20], and to apply the concept to the CSRP, we propose an innovative data-driven stochastic programming framework named DDKSP, which integrates a nonparametric approach, kernel density estimation (KDE), with a stochastic programming model. Specifically, unlike previous work in which the probability distributions are assumed to be known or are estimated by a parametric approach, the probability distribution of customer demand is extracted by KDE. A two-stage nonlinear stochastic programming model with the derived parameters is then proposed to formulate the CSRP. Finally, the sample average approximation method is integrated with the Benders decomposition algorithm to solve the two-stage nonlinear SP model. It is worth noting that the proposed framework can be easily extended to similar problems such as bike-sharing and EV-sharing problems [21, 22, 23, 24].
The rest of the paper is organized as follows. The problem description and formulation are discussed in Section II. Section III describes the DDKSP framework, which involves KDE, the sample average approximation (SAA) method and the Benders decomposition algorithm. Data preprocessing and numerical experiments are presented in Section IV. Finally, we conclude our work and propose future work in Section V.
II Problem Formulation
II-A Problem Statement
We study the CSRP, a typical decision-making problem in a centralized environment. It involves two roles: a car company and its customers. Consider a one-way carsharing system (pick-up at one location and drop-off at any location) in which a car company owns a number of vehicles and operates a number of locations for car dispatch. Customers reserve cars in advance and pick them up at a specific location. The CSRP can be considered a two-stage decision-making problem, described as follows. In the first stage (the strategic phase), during a time window before the upcoming customer demand is realized (e.g., from 0 am to 4 am), each vehicle location is allocated a certain number of cars (the initial inventory decision), which incurs holding costs. In the second stage (the operational phase), after the real customer demand is revealed (we assume there is a deadline, e.g., 4 am, after which no customer orders are accepted for the day), customers who reserved cars visit the locations to pick up the vehicles, which brings revenue. Meanwhile, the company's truck carriers must dynamically move cars from lower-demand locations to higher-demand locations to prevent an imbalance of vehicles among locations, which incurs moving costs.
Since the first-stage decision must be made before the second stage, the decision-makers must choose the most appropriate number of cars at each location to cover all possible realizations of customer demand (called scenarios in stochastic programming) while reducing the moving cost as much as possible; allocating more cars incurs more holding cost, whereas allocating fewer cars incurs more moving cost. The mathematical model must therefore be able to hedge against the uncertainty in customer demand. Based on the problem setting, the objective of the CSRP is to maximize the overall expected profit, which involves the total revenue, the holding costs at each location and the moving costs between locations. In this sense, the CSRP in this work focuses on answering the following questions: (1) how many vehicles are required at each location before the real demand is revealed, and (2) how to move cars between locations to satisfy customer demand while maximizing the overall profit.
In this work, the most critical concern for the CSRP is how to model uncertainty in a data-driven environment. For convenience, only customer demand is treated as an uncertain parameter. Since the CSRP is a typical two-stage problem with demand uncertainty, we use a two-stage stochastic programming model to formulate it. In the two-stage SP model, decision variables are divided into two groups: the first-stage (here-and-now) decision variables, which must be determined before the real demand is revealed, and the second-stage (wait-and-see) decision variables, which are determined after the real demand is realized.
Meanwhile, without loss of generality, the following assumptions are made in the problem setting.
Assumption 1.
We assume that vehicle reservations are fixed before the operational phase (second stage) starts, which implies that customers cannot cancel or delay their reservations.
Assumption 2.
We assume that all vehicles are in the same condition, which means that homogeneous cars are provided to customers.
Assumption 3.
We assume that the historical customer demand at each location is available, which indicates that the probability distribution information can be derived from historical data.
Assumption 4.
It is assumed that the true demands at all the locations are realized simultaneously.
II-B Model Formulation
In this section, we discuss the CSRP model formulations, including the deterministic model and its two-stage SP counterpart. It is worth noting that probability distributions are required for the SP model. For clarity, the notation is listed below.
Indices/Sets
$i, j \in I$: regional origins and/or destinations
$s \in S$: the set of scenarios
Parameters
$h_i$: holding cost at location $i$.
$c_{ij}$: moving cost from location $i$ to location $j$.
$\bar{d}_i$: the average demand of location $i$.
Decision Variables
$x_i$: the first-stage decision variable, which denotes the number of vehicles at location $i$.
$y_{ij}^{s}$: the second-stage decision variable, which denotes the number of vehicles moving from location $i$ to location $j$ under scenario $s$.
Random Variables (for stochastic programming model)
$\xi_i$: random demand, which denotes the number of cars that will be picked up by customers at location $i$.
$p_s$: the probability of scenario $s$.
II-B1 Deterministic CSRP Model
In the deterministic model, we allocate the limited vehicles to different locations in order to maximize the overall profit. For convenience, we use the average demands. The deterministic model for the CSRP can be formulated as follows.
(1) 
s.t.
(2) 
(3) 
(4) 
(5) 
The objective function (1) maximizes the overall profit, which equals the difference between the total revenue and the total holding cost. Constraint (2) ensures that the total number of allocated vehicles does not exceed the fleet capacity, which can be easily estimated from historical data. Constraints (3) have a twofold meaning. If the number of cars allocated to location $i$ exceeds the customer demand at that location, then the number of vehicles moved out of $i$ must not exceed the difference between the two. Otherwise, no cars are moved out of location $i$, because the available vehicles cannot even cover the local demand. Constraints (4) and (5) define the types of the decision variables.
Although the deterministic model handles the optimization problem in a simple way, using average demands may lead to an optimal solution that carries high risk or is even infeasible. Additionally, the objective function (1) is piecewise linear; therefore, it must be reformulated as a linear function before solving.
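As an illustration only, one formulation consistent with the description of (1)-(5) can be sketched as follows. All symbols are assumptions introduced for this sketch: $r$ is the revenue per rented car, $h_i$ the holding cost at location $i$, $\bar{d}_i$ the average demand, $V$ the number of available vehicles, $x_i$ the number of cars placed at location $i$, and $y_{ij}$ the number of cars moved from $i$ to $j$. The exact form of the revenue term is one plausible reading of the text, not a definitive statement of the model.

```latex
\begin{align}
\max_{x,\,y} \quad & \sum_{i \in I} \Big( r \,\min\Big\{ x_i - \sum_{j \in I} y_{ij} + \sum_{j \in I} y_{ji},\ \bar{d}_i \Big\} - h_i x_i \Big) \\
\text{s.t.} \quad & \sum_{i \in I} x_i \le V, \\
& \sum_{j \in I} y_{ij} \le \max\{ x_i - \bar{d}_i,\ 0 \}, && \forall i \in I, \\
& x_i \in \mathbb{Z}_{+}, && \forall i \in I, \\
& y_{ij} \in \mathbb{Z}_{+}, && \forall i, j \in I.
\end{align}
```

Under this reading, the $\min$ term is what makes the objective piecewise linear: a location cannot rent out more cars than it holds after relocation, nor more than are demanded.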
II-B2 Two-Stage SP CSRP Model
The carsharing operator wishes to maximize the expected profit over all possible scenario realizations. Since customer demand is uncertain, we assume that the demand scenarios are sampled from the probability distribution derived from historical data. The two-stage SP model of the CSRP can then be formulated as follows.
(6) 
s.t.
(7) 
(8) 
(9) 
(10) 
The objective function (6) maximizes the overall profit, which is the difference between the revenue and the overall cost (the sum of the holding cost and the moving/transferring cost). Constraint (7) is identical to constraint (2). Constraints (8) are similar to constraints (3) and also have a twofold meaning, except that they are indexed by the SP scenarios. Specifically, if the number of cars allocated to location $i$ exceeds the customer demand at that location under scenario $s$, then the number of vehicles moved out of $i$ under scenario $s$ must not exceed the difference between the two. Otherwise, no cars are moved out of location $i$ under scenario $s$. Constraints (9) and (10) define the types of the decision variables.
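A scenario-based formulation matching this description could look like the sketch below. The symbols are assumptions made for illustration: $r$ is the revenue per rented car, $h_i$ and $c_{ij}$ are the holding and moving costs, $V$ is the number of available vehicles, $p_s$ is the probability of scenario $s$, $\xi_i^{s}$ is the demand at location $i$ in scenario $s$, and $x_i$, $y_{ij}^{s}$ are the first- and second-stage decisions. The revenue term is again only one plausible reading.

```latex
\begin{align}
\max_{x,\,y} \quad & -\sum_{i \in I} h_i x_i
  + \sum_{s \in S} p_s \Big[ \sum_{i \in I} r \,\min\Big\{ x_i - \sum_{j \in I} y_{ij}^{s} + \sum_{j \in I} y_{ji}^{s},\ \xi_i^{s} \Big\}
  - \sum_{i \in I} \sum_{j \in I} c_{ij}\, y_{ij}^{s} \Big] \\
\text{s.t.} \quad & \sum_{i \in I} x_i \le V, \\
& \sum_{j \in I} y_{ij}^{s} \le \max\{ x_i - \xi_i^{s},\ 0 \}, && \forall i \in I,\ s \in S, \\
& x_i \in \mathbb{Z}_{+}, && \forall i \in I, \\
& y_{ij}^{s} \in \mathbb{Z}_{+}, && \forall i, j \in I,\ s \in S.
\end{align}
```

Because the problem is a maximization, each $\min\{a, b\}$ term can be replaced by an auxiliary variable $z_i^{s}$ with $z_i^{s} \le a$ and $z_i^{s} \le b$, which is the standard way to linearize such an objective.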
III DDKSP
Inspired by the idea of integrating ML with OR, we propose the DDKSP framework, which is briefly described as follows. The framework involves four components: the ML/DL part (KDE in our problem setting) extracts the probability distribution from uncertain data; the SP part focuses on problem modeling; the SAA and Benders decomposition part reformulates the SP model; and the last part yields the final decisions. The DDKSP framework is illustrated in Fig. 1. It is worth noting that our framework can be readily extended by replacing components. For example, the ML/DL part can adopt general supervised or unsupervised learning algorithms depending on the specific problem, the SP part can be replaced by robust optimization (RO) [25] or distributionally robust optimization (DRO) [26], and the SAA and Benders decomposition part can be replaced by other large-scale decomposition algorithms such as column generation or Lagrangian relaxation.
III-A KDE
For the first component, we adopt kernel density estimation (KDE). KDE is a typical nonparametric approach used to describe a probability distribution without specifying its form in advance [27]. Let $f$ be the density function of the uncertain parameter. Given a set of data $x_1, x_2, \dots, x_n$, the KDE of $f$ is obtained as
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$
where $K$ is the kernel function and $h$ is the bandwidth. In this work, we select the Gaussian kernel, which is given by
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right).$$
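A minimal sketch of Gaussian KDE and of drawing demand scenarios from it is given below, using only the Python standard library. The demand values and the bandwidth are hypothetical and chosen by hand for illustration; the function names are not taken from the paper's implementation.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_pdf(x, data, h):
    """Kernel density estimate at x: (1 / (n h)) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

def kde_sample(data, h, size, rng):
    """Sample from the Gaussian KDE: pick an observation uniformly at random,
    then add N(0, h^2) noise."""
    return [rng.choice(data) + rng.gauss(0.0, h) for _ in range(size)]

# Hypothetical daily demands at one location, forming two clusters (bimodal).
demands = [410, 425, 430, 445, 450, 620, 635, 640, 655, 660]
h = 15.0  # bandwidth, an assumption for this illustration

rng = random.Random(0)
scenarios = kde_sample(demands, h, size=5, rng=rng)
density_at_mode = kde_pdf(430, demands, h)  # inside the lower cluster
density_in_gap = kde_pdf(535, demands, h)   # between the two clusters
```

The estimate keeps both clusters of the data, so sampled scenarios fall near either mode rather than in the gap between them, which a single Gaussian fit would not reproduce.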
III-B Two-Stage SP CSRP Model Reformulation
Unlike the deterministic model, which can be solved effectively by off-the-shelf commercial solvers, the two-stage SP model requires reformulation because a continuous probability distribution contains infinitely many scenarios. In this paper, we use the sample average approximation (SAA) [28], a Monte Carlo method, to reformulate the two-stage SP model. The procedure of SAA can be summarized as follows.
Input: the probability distribution, the number of samples $M$, the sample size $N$, and the two-stage SP model
Output: the optimal value
In the SAA reformulation, the expectation in the objective function is replaced by the average over the sampled scenarios, where $N$ is the number of scenarios. The objective function is still nonlinear, so we introduce an auxiliary variable to transform it into a linear one. The two-stage SP model then becomes
s.t.
(11) 
(12) 
(13) 
(14) 
(15) 
(16) 
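The SAA idea can be illustrated on a deliberately simplified problem: a single location with no relocation, where stocking `x` cars yields `revenue * min(x, demand) - holding * x`. This toy model, the demand sampler and all names below are assumptions for the sketch; the full CSRP would replace the enumeration with a MILP solve.

```python
import random

def profit(x, demand, revenue, holding):
    """Profit of stocking x cars when `demand` customers show up (no relocation)."""
    return revenue * min(x, demand) - holding * x

def solve_saa(scenarios, revenue, holding, capacity):
    """Maximize the sample-average profit by enumerating integer stock levels."""
    best_x, best_val = 0, float("-inf")
    for x in range(capacity + 1):
        val = sum(profit(x, d, revenue, holding) for d in scenarios) / len(scenarios)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

def saa(sampler, n_replications, n_scenarios, revenue, holding, capacity):
    """Solve the SAA problem on fresh samples and average the optimal values."""
    results = [solve_saa(sampler(n_scenarios), revenue, holding, capacity)
               for _ in range(n_replications)]
    avg_value = sum(v for _, v in results) / n_replications
    return avg_value, [x for x, _ in results]

rng = random.Random(1)

def sampler(n):
    """Assumed demand model for the demo: rounded Gaussian, truncated at zero."""
    return [max(0, round(rng.gauss(100, 20))) for _ in range(n)]

avg_value, stock_levels = saa(sampler, n_replications=10, n_scenarios=200,
                              revenue=100.0, holding=20.0, capacity=200)
```

Averaging over several replications gives a sense of how much the SAA optimum varies with the sample, which is the reason each scenario group is typically run more than once.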
III-C Two-Stage SP CSRP Model Decomposition
After the reformulation, the two-stage SP model becomes a very large-scale deterministic model. For example, with 50 locations and 1000 scenarios, the number of second-stage decision variables is 50 × 50 × 1000 = 2,500,000. To solve such a large-scale model effectively, a decomposition algorithm is required. In this work, we use Benders decomposition [29] to solve the reformulated model. Benders decomposition is an effective algorithm for solving mixed-integer linear programming (MILP) models: the primal model is decomposed into one master problem (MP) and a group of subproblems (SUBP) in dual form, and the solution is obtained by iteratively solving the SUBP and updating the MP.
For convenience, in the following we neglect the constant in the objective function. We then divide the reformulated model into an MP
(17) 
and a SUBP in the dual form
(18) 
s.t.
(19) 
Here the decision variables are the dual variables of the SUBP, and the remaining quantities are fixed values determined by the MP. In each iteration, the MP adjusts these values and passes them to the SUBP. Finally, the algorithm can be summarized as follows.
The convergence tolerance used in the stopping criterion is a very small fractional number. Therefore, in our case, either the upper bound or the lower bound at termination can be taken as the optimal value.
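The bound-tightening loop can be sketched on a one-location toy problem: maximize `-holding * x + mean_s(revenue * min(x, d_s))` over integer `x`. This is not the paper's algorithm: the toy model, the enumeration-based master problem and all names are assumptions. The cut slope plays the role of the subproblem's dual value, since an extra car is worth `revenue` only in scenarios where demand exceeds `x`.

```python
def second_stage(x, scenarios, revenue):
    """Average second-stage value at x and a supergradient (dual information)."""
    n = len(scenarios)
    value = sum(revenue * min(x, d) for d in scenarios) / n
    slope = revenue * sum(1 for d in scenarios if d > x) / n
    return value, slope

def benders_1d(scenarios, revenue, holding, capacity, eps=1e-6, max_iter=100):
    cuts = []                          # (intercept, slope): theta <= intercept + slope * x
    theta_cap = revenue * capacity     # trivial upper bound on the second-stage value
    lower, upper, best_x = float("-inf"), float("inf"), 0
    for _ in range(max_iter):
        # Master problem: enumerate x, with theta bounded by all cuts so far.
        x_k, upper = 0, float("-inf")
        for x in range(capacity + 1):
            theta = min([theta_cap] + [a + g * x for a, g in cuts])
            obj = -holding * x + theta
            if obj > upper:
                x_k, upper = x, obj
        # Subproblem: evaluate the true second-stage value at the master solution.
        value, slope = second_stage(x_k, scenarios, revenue)
        true_obj = -holding * x_k + value
        if true_obj > lower:
            lower, best_x = true_obj, x_k
        if upper - lower <= eps:       # bounds have met: stop
            break
        cuts.append((value - slope * x_k, slope))  # optimality cut at x_k
    return best_x, lower, upper

result = benders_1d([10, 20], revenue=100.0, holding=20.0, capacity=30)
```

The master objective is an upper bound and the best evaluated solution a lower bound; once they differ by at most `eps`, either bound can be reported as the optimal value.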
IV Numerical Experiment
Experiment Design. We design a group of experiments. First, we carry out data preprocessing and analysis, including the aggregation of trip records into demand and an analysis of the demand distributions. Next, both the nonparametric approach (KDE) and parametric approaches (Gaussian, Laplace and Poisson distributions) are applied to derive probability distributions for the SP model. We then compare the SP model with the deterministic model in terms of objective values and running time. Moreover, we validate KDE against the three parametric approaches. Finally, we examine the two-stage decisions based on a one-day record.
Experiment Setup. The algorithms (SAA, Benders decomposition (BD), KDE and the parametric approaches) are implemented in Python 3.7, and the mathematical models are solved by Gurobi 8.1, academic version (https://www.gurobi.com/academia/academicprogramandlicenses/), on a platform with an Intel i7 CPU, 16 GB RAM and Windows 10. It is worth noting that the deterministic parameters of our SP model, such as the revenue and the transferring cost, can be easily estimated from the data set. For convenience, in the following experiments the revenue per car is set to $100, the transferring cost is roughly estimated from the distance between locations and ranges from 10 to 100, the number of available vehicles is set to 16,000, and the holding cost is assumed to follow a Gaussian distribution.
IV-A Data Analysis
The data sets come from the New York taxi trip records (https://www1.nyc.gov/site/tlc/about/tlctriprecorddata.page). We collected three years (July 2016 to June 2019) of green taxi trip records, archived by month, as the data source. We split the data into a training set (July 2016 to December 2018, 919 days) and a testing set (January 2019 to June 2019, 181 days). Each data set contains thousands of raw single-trip records with a complex structure. For example, the data set 201801 contains 793,529 records and 19 attributes. For our application, we use the 6 attributes listed in Table I. In this data set, New York City is divided into 259 locations; details of the location division can be found at https://data.world/nyctaxilimo/taxizonelookup. The main task of data processing is to aggregate the trip records into daily demands. After the data processing, we selected the 20 locations with the highest average demands (location IDs: 74, 41, 7, 75, 255, 82, 166, 42, 181, 97, 129, 25, 95, 244, 33, 260, 256, 66, 223 and 65, sorted by demand in descending order), which are plotted on the map in Fig. LABEL:fig:nyc.
Attribute  Description 

lpep_pickup_datetime  pickup time 
lpep_dropoff_datetime  dropoff time 
PULocationID  pickup location ID 
DOLocationID  dropoff location ID 
trip_distance  total distance 
fare_amount  passenger fare 
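The aggregation of trip records into daily demand per pickup location can be sketched as follows. The column names come from Table I; the sample rows, the `YYYY-MM-DD HH:MM:SS` timestamp layout and the helper names are assumptions made for this illustration.

```python
import csv
import io
from collections import Counter

def daily_pickup_demand(rows):
    """Count trips per (pickup date, pickup location)."""
    demand = Counter()
    for row in rows:
        day = row["lpep_pickup_datetime"][:10]  # keep the 'YYYY-MM-DD' prefix
        demand[(day, int(row["PULocationID"]))] += 1
    return demand

def top_locations(demand, k):
    """Rank locations by average daily demand over all days in the data."""
    days = {day for day, _ in demand}
    totals = Counter()
    for (_, loc), count in demand.items():
        totals[loc] += count
    ranked = sorted(totals, key=lambda loc: totals[loc] / len(days), reverse=True)
    return ranked[:k]

# Hypothetical sample in the shape of the six attributes of Table I.
sample = """lpep_pickup_datetime,lpep_dropoff_datetime,PULocationID,DOLocationID,trip_distance,fare_amount
2018-01-01 00:18:50,2018-01-01 00:24:39,74,41,1.2,6.5
2018-01-01 08:05:10,2018-01-01 08:20:00,74,75,2.4,11.0
2018-01-01 09:30:00,2018-01-01 09:41:12,41,74,1.9,9.0
2018-01-02 07:45:31,2018-01-02 07:58:02,74,42,2.0,9.5
"""

demand = daily_pickup_demand(csv.DictReader(io.StringIO(sample)))
busiest = top_locations(demand, k=1)
```

For the real monthly files, the same functions could be fed with `csv.DictReader(open(path, newline=""))`, one file at a time, accumulating into a single `Counter`.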
Among the 20 high-demand locations estimated from the data sets, there are mainly two types of demand distributions: unimodal and bimodal, with the bimodal type representing most locations. For the first type, a specific functional form for the density, such as a Gaussian distribution, can be assumed; in other words, parametric methods can be applied. Most SP-related works adopt this approach. For the second type, a particular parametric form cannot provide an appropriate representation of the real density. In such cases, nonparametric or semiparametric approaches such as KDE or the Gaussian mixture model (GMM) must be considered.
Most parametric methods work well for unimodal distributions but not for bimodal ones. That is why the KDE approach is introduced in this work.
IV-B Stochastic Model vs. Deterministic Model Results
In order to compare the deterministic model with the SP model under different scenarios, we generate 5 groups of scenarios for the SP model based on the probability distributions derived from KDE. The numbers of scenarios are 20, 50, 100, 200 and 500, and each group is run 10 times under SAA. Additionally, we consider the deterministic model using the average demands calculated from the training set (average demand over 919 days) and the testing set (average demand over 181 days). The average objective values and elapsed times are shown in Table II.
Number of Scenarios  Objective Value  Elapsed Time (s) 

20  $1,477,845  2.73 
50  $1,487,606  6.87 
100  $1,475,688  10.89 
200  $1,484,367  21.73 
500  $1,469,642  53.12 
deterministic (average on training set)  $1,325,723  0.24 
deterministic (average on testing set)  $1,017,054  0.24 
Based on the experimental results, we conclude that the two-stage SP model yields a higher profit than the deterministic model: averaged over the five scenario groups, the objective value of the two-stage SP model is 11.56% and 45.42% higher than that of the deterministic counterpart on the training set and the testing set, respectively. Additionally, with average demands, the overall profit on the training set is higher than that on the testing set.
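The two percentages can be reproduced from Table II with a few lines of arithmetic, under the assumption that the SP figure being compared is the mean objective value over the five scenario groups.

```python
# Objective values copied from Table II.
sp_values = {20: 1_477_845, 50: 1_487_606, 100: 1_475_688,
             200: 1_484_367, 500: 1_469_642}
det_train, det_test = 1_325_723, 1_017_054

def relative_gain(new, base):
    """Percentage by which `new` exceeds `base`."""
    return 100.0 * (new - base) / base

sp_mean = sum(sp_values.values()) / len(sp_values)
gain_train = relative_gain(sp_mean, det_train)
gain_test = relative_gain(sp_mean, det_test)
print(f"SP mean: {sp_mean:,.1f}")
print(f"gain vs. training-set deterministic model: {gain_train:.2f}%")
print(f"gain vs. testing-set deterministic model:  {gain_test:.2f}%")
```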
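IV-C Validations on Parametric Approaches
Besides the nonparametric approach, we also use several popular parametric distributions (Gaussian, Laplace and Poisson) as the customer demand distributions based on the data sets. The parameters of the Laplace ($\mu$, $b$), Gaussian ($\mu$, $\sigma$) and Poisson ($\lambda$) distributions are estimated by maximum likelihood estimation (MLE) from the sample data $x_1, \dots, x_n$, which gives
$$\text{Laplace:}\quad \hat{\mu} = \operatorname{median}(x_1, \dots, x_n), \qquad \hat{b} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{\mu}|,$$
$$\text{Gaussian:}\quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^{2},$$
$$\text{Poisson:}\quad \hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
where $n$ denotes the number of sample points.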
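The standard closed-form maximum likelihood estimators for the three distributions can be sketched with the standard library alone. The sample values below are hypothetical, and the function names are illustrative rather than taken from the paper's code.

```python
import math
import statistics

def fit_gaussian(data):
    """MLE for a Gaussian: sample mean and the 1/n (not 1/(n-1)) standard deviation."""
    mu = sum(data) / len(data)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
    return mu, sigma

def fit_laplace(data):
    """MLE for a Laplace distribution: median and mean absolute deviation from it."""
    mu = statistics.median(data)
    b = sum(abs(x - mu) for x in data) / len(data)
    return mu, b

def fit_poisson(data):
    """MLE for a Poisson distribution: the sample mean."""
    return sum(data) / len(data)

# Hypothetical demand sample, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
gaussian_params = fit_gaussian(sample)   # (mean, standard deviation)
laplace_params = fit_laplace(sample)     # (location, scale)
poisson_rate = fit_poisson(sample)
```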
Number of Scenarios  KDE  Gaussian  Laplace  Poisson 

20  $1,477,845  $1,467,117  $1,425,569  $1,299,895 
50  $1,487,606  $1,422,868  $1,402,279  $1,312,471 
100  $1,475,688  $1,417,811  $1,417,403  $1,315,831 
200  $1,484,367  $1,406,112  $1,412,343  $1,321,364 
500  $1,469,642  $1,406,103  $1,398,546  $1,332,124 
average  $1,479,030  $1,424,002  $1,411,228  $1,316,337 
The comparison between KDE and the three parametric approaches is shown in Table III. The overall profit obtained with the Gaussian distribution is slightly better than that obtained with the Laplace distribution, and both are better than that obtained with the Poisson distribution. However, all of the parametric approaches are inferior to the nonparametric KDE approach in terms of overall profit (on average 3.72%, 4.58% and 11% lower, respectively).
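As a check on these figures, the column averages of Table III and the shortfall of each parametric approach relative to KDE can be recomputed as follows. The values are copied from the table; the shortfall is taken relative to the KDE average, which is an assumption about how the percentages were derived.

```python
# Objective values per scenario count (20, 50, 100, 200, 500), from Table III.
table_iii = {
    "KDE":      [1_477_845, 1_487_606, 1_475_688, 1_484_367, 1_469_642],
    "Gaussian": [1_467_117, 1_422_868, 1_417_811, 1_406_112, 1_406_103],
    "Laplace":  [1_425_569, 1_402_279, 1_417_403, 1_412_343, 1_398_546],
    "Poisson":  [1_299_895, 1_312_471, 1_315_831, 1_321_364, 1_332_124],
}

averages = {name: sum(vals) / len(vals) for name, vals in table_iii.items()}

# Percentage by which each parametric average falls short of the KDE average.
shortfall = {
    name: 100.0 * (averages["KDE"] - avg) / averages["KDE"]
    for name, avg in averages.items() if name != "KDE"
}
for name, pct in shortfall.items():
    print(f"{name}: {pct:.2f}% below KDE")
```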
IV-D Two-Stage Decision-Making
In the two-stage SP model, a solution involves two parts: the first-stage decision variables, which denote the number of cars placed at each location (the initial inventory level) before the demand is realized, and the second-stage decision variables, which denote the number of cars moved between locations for rebalancing. We design a group of experiments in this subsection.
Firstly, the values of the first-stage decision variables are derived from the two-stage SP model using KDE, Poisson, Laplace and Gaussian distributions based on the training set (30 months); the results under the different distributions are shown in Table IV, Table V, Table VI and Table VII, respectively. Taking Table IV as an example, the rows denote the number of scenarios in the SP model, and the columns denote the top 20 locations with the highest demands (sorted in descending order), as mentioned before. We conclude that the solutions obtained with KDE are more stable (lower variance) than those obtained with the Poisson, Laplace and Gaussian distributions. In practical applications, the decision-makers can use the average values as the first-stage decisions.
scenario  top 20 locations with highest demands  
20  1544  1469  1529  1119  1034  736  825  483  452  849  513  630  466  593  580  495  447  498  413  325 
50  1541  1308  1055  1215  1074  978  732  504  664  663  653  663  591  609  561  509  469  468  403  340 
100  1595  1356  1212  1052  1046  876  770  474  641  652  630  634  655  560  544  534  528  505  406  330 
200  1564  1293  1315  1059  1008  822  822  507  655  681  658  642  596  535  573  549  490  473  428  330 
500  1567  1338  1316  1079  1027  843  814  473  599  660  638  634  620  544  557  529  499  462  451  350 
scenario  top 20 locations with highest demands  
20  1393  1488  1637  1044  1085  790  888  502  485  903  469  616  501  468  527  476  463  463  447  355 
50  1545  1390  1170  982  1092  867  809  641  485  718  633  624  581  576  521  543  586  414  454  369 
100  1553  1244  1391  1120  999  902  813  470  639  648  656  651  609  560  514  534  482  448  422  345 
200  1539  1248  1288  1073  1028  871  785  566  690  704  622  637  588  560  559  523  499  428  443  349 
500  1562  1300  1229  1099  1032  850  814  572  653  658  630  637  593  579  539  532  490  455  431  345 
scenario  top 20 locations with highest demands  
20  1267  1297  1164  1273  1223  687  519  733  607  862  538  625  565  560  568  648  554  465  467  377 
50  1670  1255  1401  801  920  914  798  427  526  894  621  717  630  423  585  540  537  518  472  351 
100  1607  1275  1383  1061  983  849  798  520  649  586  633  615  596  550  582  526  497  433  491  366 
200  1523  1312  1250  1002  1028  938  814  561  575  672  665  627  594  594  541  505  500  485  452  362 
500  1522  1322  1255  1104  1021  872  781  596  614  683  618  634  569  571  568  537  501  449  440  343 
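Table VII: First-stage decisions (cars placed at each location) by number of scenarios (rows) for the top 20 locations with highest demands (columns)
Scenarios  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 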
20  238  1541  1527  1330  1193  1063  961  900  812  826  0  689  0  662  648  582  585  539  516  388 
50  0  1483  1466  1276  1149  1052  275  834  796  812  698  679  672  646  612  582  563  527  492  386 
100  0  1492  1477  1261  1151  1032  475  861  752  791  717  679  661  608  580  572  550  519  457  365 
200  0  1481  1443  1282  1139  1008  546  829  787  783  707  660  648  623  601  561  565  498  473  366 
500  0  1472  1439  1281  1117  1011  757  796  755  779  698  665  633  621  591  544  531  502  449  359 
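The averaging step suggested above can be sketched as follows. This is a minimal illustration that uses only the Table IV rows: for each location it computes the mean and the sample standard deviation across the five scenario counts. The function name `summarize` and the rounding to whole cars are assumptions made for illustration, not part of the original framework.

```python
from statistics import mean, stdev

# First-stage decisions copied from Table IV: one row per scenario count,
# one column per location (top 20 locations by demand).
TABLE_IV = {
    20:  [1544, 1469, 1529, 1119, 1034, 736, 825, 483, 452, 849,
          513, 630, 466, 593, 580, 495, 447, 498, 413, 325],
    50:  [1541, 1308, 1055, 1215, 1074, 978, 732, 504, 664, 663,
          653, 663, 591, 609, 561, 509, 469, 468, 403, 340],
    100: [1595, 1356, 1212, 1052, 1046, 876, 770, 474, 641, 652,
          630, 634, 655, 560, 544, 534, 528, 505, 406, 330],
    200: [1564, 1293, 1315, 1059, 1008, 822, 822, 507, 655, 681,
          658, 642, 596, 535, 573, 549, 490, 473, 428, 330],
    500: [1567, 1338, 1316, 1079, 1027, 843, 814, 473, 599, 660,
          638, 634, 620, 544, 557, 529, 499, 462, 451, 350],
}


def summarize(table):
    """Return per-location means and sample standard deviations
    across the scenario counts (the rows of the table)."""
    columns = list(zip(*table.values()))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]
    return means, stds


means, stds = summarize(TABLE_IV)

# Rounding to whole cars is an illustrative choice (hypothetical rule).
averaged_decision = [round(m) for m in means]
print(averaged_decision)
```

The same function can be applied to the other three tables to compare the per-location spread between distributions.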
Second, after the real demands are revealed, the decision-makers must decide how to move vehicles between locations (second-stage decision-making). We validate this using a one-day record (2019-01-01) from the testing set, which is shown in Table VIII. Based on the first-stage decisions from KDE, Gaussian, Laplace and Poisson, the outcomes of the second-stage decisions are shown in Table X, Table XI, Table XII and Table XIII, respectively. The tables are structured as follows: the rows denote the locations that cars move into, while the columns denote the locations that cars move out of. The cell values give the number of cars moved between the two locations. For convenience, the row and column indices refer to the top 20 locations with the highest demands, as mentioned above. Note that the first-stage decision values we use are those obtained with 20 scenarios for each of the four distributions; the moves may differ if we adopt 50, 100, 200 or 500 scenarios. In this use case, the total number of cars moved is lowest under KDE: 314 cars in Table X, compared with 349, 572 and 1930 cars in Table XI, Table XII and Table XIII. Meanwhile, we conclude that, for a given data set, the distribution type and its parameters have a great impact on the result of the stochastic programming model. For example, Table VII shows that the first-stage decision under Poisson differs markedly from the other three, especially at the first location. This in turn leads to a different second-stage decision, which is shown in Table XIII. It is also worth noting that these outcomes are based on a single-day record; they will differ when the model is applied to the records of other days.
Table VIII: Demand at the top 20 locations on 2019-01-01 (testing set)
Location  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 
Demand  1370  687  1120  861  1041  374  780  487  505  785  326  308  325  572  536  373  325  289  663  245 
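As a rough consistency check on the second-stage results, the sketch below computes the unmet demand at each location, max(demand − inventory, 0), for the 20-scenario rows of Tables IV to VII against the Table VIII demand. It assumes that cars are moved only to cover such shortfalls; this is an assumption for illustration, not a restatement of the SP model. The names `FIRST_STAGE` and `shortfall` are hypothetical. The totals it prints coincide with the sums of the entries in Tables X to XIII.

```python
# Demand on 2019-01-01 at the top 20 locations (Table VIII).
DEMAND = [1370, 687, 1120, 861, 1041, 374, 780, 487, 505, 785,
          326, 308, 325, 572, 536, 373, 325, 289, 663, 245]

# First-stage decisions for 20 scenarios, keyed by source table.
FIRST_STAGE = {
    "Table IV":  [1544, 1469, 1529, 1119, 1034, 736, 825, 483, 452, 849,
                  513, 630, 466, 593, 580, 495, 447, 498, 413, 325],
    "Table V":   [1393, 1488, 1637, 1044, 1085, 790, 888, 502, 485, 903,
                  469, 616, 501, 468, 527, 476, 463, 463, 447, 355],
    "Table VI":  [1267, 1297, 1164, 1273, 1223, 687, 519, 733, 607, 862,
                  538, 625, 565, 560, 568, 648, 554, 465, 467, 377],
    "Table VII": [238, 1541, 1527, 1330, 1193, 1063, 961, 900, 812, 826,
                  0, 689, 0, 662, 648, 582, 585, 539, 516, 388],
}


def shortfall(inventory, demand):
    """Unmet demand per location: cars that must be moved in
    if every request is to be served (illustrative assumption)."""
    return [max(d - x, 0) for x, d in zip(inventory, demand)]


totals = {name: sum(shortfall(x, DEMAND)) for name, x in FIRST_STAGE.items()}
for name, total in totals.items():
    print(f"{name}: {total} cars to move in")
```

For instance, location 18 under the Table IV decision holds 413 cars against a demand of 663, so 250 cars must be moved in, which matches the row sum for location 18 in Table X.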
Finally, we investigate the profits obtained with the different approaches over the entire testing set. Specifically, we compute and compare the overall profit using KDE, Gaussian, Laplace and Poisson over six months (181 days); the outcomes are shown in Figs. 5 and 9. The plots indicate that the KDE approach outperforms the other three approaches in terms of overall profit. On average, the Gaussian and Laplace distributions rank second and third, respectively, with a slight gap to KDE, while the Poisson distribution yields a profit roughly 10% lower than KDE. The summarized results are shown in Table IX.
Table IX: Profit of each approach on the testing set
Approach  KDE  Gaussian  Laplace  Poisson 
Profit  $1,339,604  $1,317,018  $1,304,749  $1,200,684 
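The relative gaps between the approaches can be recomputed directly from the Table IX figures. The short sketch below expresses each profit as a percentage shortfall from the best approach; the helper name `gap_to_best` is hypothetical.

```python
# Profits in dollars, copied from Table IX.
PROFIT = {
    "KDE": 1_339_604,
    "Gaussian": 1_317_018,
    "Laplace": 1_304_749,
    "Poisson": 1_200_684,
}


def gap_to_best(profits):
    """Percentage by which each approach falls short of the best profit."""
    best = max(profits.values())
    return {name: 100.0 * (best - p) / best for name, p in profits.items()}


gaps = gap_to_best(PROFIT)
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}% below the best approach")
```

With these figures the gaps are about 1.7% for Gaussian, 2.6% for Laplace and 10.4% for Poisson, all measured relative to KDE.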
V Conclusions and Future Work
In this paper, we propose a data-driven stochastic programming framework, DDKSP, to solve the CSRP using New York taxi trip record data sets. In the real world, the demand distribution is time-variant and evolves gradually (or at least its parameters vary), which renders the model outdated and deteriorates the quality of the resulting solutions [30]. To describe this evolution more precisely, we will investigate Bayesian learning, which focuses on the posterior probability distribution derived from the prior probability distribution and the likelihood of the current data. Namely, we will explore a dynamic data-driven stochastic programming model for the CSRP.
Additionally, the proposed framework treats customer demands by day, so it can be considered an offline data-driven framework. In several applications, such as the taxi dispatch problem, customer demands may fluctuate strongly within hours or even minutes. Therefore, we will explore data-driven optimization frameworks with online learning on real-time data in future work. Meanwhile, for convenience, this paper leaves some factors out of consideration. For example, we do not consider the capacity of locations, or the route conditions during rebalancing, which may lead to different transportation costs. Later on, we will extend the two-stage SP model to a more practical one.
References

[1] M. Bruglieri, F. Pezzella, and O. Pisacane, “A two-phase optimization method for a multiobjective vehicle relocation problem in electric carsharing systems,” Journal of Combinatorial Optimization, vol. 36, pp. 162–193, 2018.
[2] S. Shaheen, D. Sperling, and C. Wagner, “Carsharing in Europe and North America: past, present, and future,” 1998.
[3] R. Vosooghi, J. Puchinger, M. Jankovic, and G. Sirin, “A critical analysis of travel demand estimation for new one-way carsharing systems,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 199–205.
[4] S. Illgen and M. Höck, “Literature review of the vehicle relocation problem in one-way car sharing networks,” Transportation Research Part B: Methodological, 2018.
[5] R. Cavagnini, L. Bertazzi, F. Maggioni, and M. Hewitt, “A two-stage stochastic optimization model for the bike sharing allocation and rebalancing problem,” 2018.
[6] C. Gambella, E. Malaguti, F. Masini, and D. Vigo, “Optimizing relocation operations in electric carsharing,” Omega, vol. 81, pp. 234–245, 2018.
[7] K. Huang, G. H. de Almeida Correia, and K. An, “Solving the station-based one-way carsharing network planning problem with relocations and nonlinear demand,” Transportation Research Part C: Emerging Technologies, vol. 90, pp. 1–17, 2018.
[8] M. Zhao, X. Li, J. Yin, J. Cui, L. Yang, and S. An, “An integrated framework for electric vehicle rebalancing and staff relocation in one-way carsharing systems: Model formulation and Lagrangian relaxation-based solution approach,” Transportation Research Part B: Methodological, vol. 117, pp. 542–572, 2018.
[9] B. Boyacı, K. G. Zografos, and N. Geroliminis, “An optimization framework for the development of efficient one-way carsharing systems,” European Journal of Operational Research, vol. 240, no. 3, pp. 718–733, 2015.
[10] A. Di Febbraro, N. Sacco, and M. Saeednia, “One-way carsharing profit maximization by means of user-based vehicle relocation,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 2, pp. 628–641, 2018.
[11] W. D. Fan, “Optimizing strategic allocation of vehicles for one-way carsharing systems under demand uncertainty,” in Journal of the Transportation Research Forum, vol. 53, no. 3, 2014.
[12] J. Warrington and D. Ruchti, “Two-stage stochastic approximation for dynamic rebalancing of shared mobility systems,” Transportation Research Part C: Emerging Technologies, vol. 104, pp. 110–134, 2019.
[13] J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, and C. Chen, “Data-driven intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1624–1639, 2011.
[14] L. Zhu, F. R. Yu, Y. Wang, B. Ning, and T. Tang, “Big data analytics in intelligent transportation systems: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 1, pp. 383–398, 2018.
[15] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: a deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2014.
[16] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[17] Y. Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinatorial optimization: a methodological tour d’horizon,” arXiv preprint arXiv:1811.06128, 2018.
[18] E. Larsen, S. Lachapelle, Y. Bengio, E. Frejinger, S. Lacoste-Julien, and A. Lodi, “Predicting solution summaries to integer linear programs under imperfect information with machine learning,” arXiv preprint arXiv:1807.11876, 2018.
[19] C. Ning and F. You, “Data-driven stochastic robust optimization: General computational framework and algorithm leveraging machine learning for optimization under uncertainty in the big data era,” Computers & Chemical Engineering, vol. 111, pp. 115–133, 2018.
[20] C. Shang and F. You, “Distributionally robust optimization for planning and scheduling under uncertainty,” Computers & Chemical Engineering, vol. 110, pp. 53–68, 2018.
[21] S. Faridimehr, S. Venkatachalam, and R. B. Chinnam, “A stochastic programming approach for electric vehicle charging network design,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 5, pp. 1870–1882, 2018.
[22] M. Cocca, D. Giordano, M. Mellia, and L. Vassio, “Free floating electric car sharing: A data driven approach for system design,” IEEE Transactions on Intelligent Transportation Systems, 2019.
[23] M. Cocca, D. Giordano, M. Mellia, and L. Vassio, “Data driven optimization of charging station placement for ev free floating car sharing,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2490–2495.
[24] X. Huo, X. Wu, M. Li, N. Zheng, and G. Yu, “The allocation problem of electric carsharing system: A data-driven approach,” Transportation Research Part D: Transport and Environment, vol. 78, p. 102192, 2020.
[25] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust optimization. Princeton University Press, 2009, vol. 28.
[26] E. Delage and Y. Ye, “Distributionally robust optimization under moment uncertainty with application to data-driven problems,” Operations Research, vol. 58, no. 3, pp. 595–612, 2010.
[27] C. M. Bishop et al., Neural networks for pattern recognition. Oxford University Press, 1995.
[28] T. Santoso, S. Ahmed, M. Goetschalckx, and A. Shapiro, “A stochastic programming approach for supply chain network design under uncertainty,” European Journal of Operational Research, vol. 167, no. 1, pp. 96–115, 2005.
[29] J. F. Benders, “Partitioning procedures for solving mixed-variables programming problems,” Computational Management Science, vol. 2, no. 1, pp. 3–19, 2005.
[30] C. Ning and F. You, “Optimization under uncertainty in the era of big data and deep learning: When machine learning meets mathematical programming,” Computers & Chemical Engineering, 2019.
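Appendix A Moving between Locations Based on the First-Stage Decision
Table X: Number of cars moved between locations in the second stage (rows: locations that cars move into; columns: locations that cars move out of)
To \ From  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 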
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  7  0  0 
5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  4 
8  0  0  0  0  0  0  45  0  0  0  0  0  0  8  0  0  0  0  0  0 
9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
17  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
18  174  0  0  0  0  0  0  0  0  0  0  0  0  13  44  19  0  0  0  0 
19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
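Table XI: Number of cars moved between locations in the second stage (rows: locations that cars move into; columns: locations that cars move out of)
To \ From  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 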
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
8  0  20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
13  0  0  0  0  44  0  0  0  0  0  0  0  0  0  0  0  0  0  0  60 
14  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
17  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
18  23  90  0  0  0  0  0  0  0  0  0  0  0  0  0  103  0  0  0  0 
19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
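Table XII: Number of cars moved between locations in the second stage (rows: locations that cars move into; columns: locations that cars move out of)
To \ From  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 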
0  0  103  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
6  0  261  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
13  0  12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
17  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
18  0  196  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
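Table XIII: Number of cars moved between locations in the second stage (rows: locations that cars move into; columns: locations that cars move out of)
To \ From  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19 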
0  0  854  0  0  0  0  0  0  0  0  0  0  0  0  112  0  0  166  0  0 
1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  260  0  0  66 
11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
12  0  0  0  0  117  0  0  0  0  41  0  0  0  90  0  0  0  0  0  77 
13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
17  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
18  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  147  0  0  0  0 
19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 