I Introduction
In rapidly expanding metropolitan cities, taxis (including cars working with ridesharing platforms such as Uber, Lyft and DiDi) play a vital role in residents’ daily commute among all the available modes of transportation [1]. Based on a survey in NYC [2], there is a stable demand of taxis, by passengers per day, which is fulfilled by more than taxis in the region. For these expanding cities, meeting the increasing demand for taxis raises an emerging problem: how to utilize the existing road networks efficiently so as to reduce potential traffic congestion and to optimize the effective travel time and distance. One promising solution to this problem is a taxi carpool service [3]. In recent years, advances in data-driven technologies and the availability of big data have made it possible to develop more advanced algorithms for difficult problems such as taxi travel time estimation [4, 5, 6] and taxi routing [7, 8].
Carpooling is a quick and convenient way to reduce traffic congestion, air pollution, fuel consumption and, of course, travelers’ costs. We consider the decision problem in a centralized carpooling service, where carpool assignments are issued to taxi drivers from a central decision system. In a typical carpooling context, a taxi picks up multiple passengers heading in a similar direction and proceeds to each of the destinations one by one in an efficient manner. Usually, taxis roam around with no passenger on board, and requests are assigned to a taxi that is in close proximity to them. Request assignment is a crucial part of a carpooling service because a bad assignment might lead a taxi to an area where taxi calls are less frequent, leaving the taxi roaming with no passenger on board. Such situations not only reduce transportation efficiency but also cause revenue loss to taxi drivers.
A crucial point to consider in optimizing a carpooling policy is to gauge, at each decision point, the future prospect of being able to pick up additional passengers along the way. Reinforcement Learning (RL) is a data-driven approach for solving a Markov decision process (MDP), which models a multi-stage sequential decision-making process with a long optimization horizon. We develop a framework powered by RL to generate a data-driven carpooling policy which tells the driver when to accept a carpool request in order to maximize long-term transportation efficiency and reduce traffic congestion. To generate the samples of experience for RL, we develop a carpooling simulator which returns a reward and a new state for each state-action pair.
A key piece of information required to build a carpooling simulation environment is the estimated travel time. Accurate estimates of travel time also help in building intelligent transportation systems, for example in developing efficient navigation systems, planning better routes and identifying key bottlenecks in traffic networks. Travel time and distance prediction depends heavily on the observable daily and weekly traffic patterns and also on time-varying features such as weather conditions and traffic incidents. For instance, bad weather or an accident on the road slows vehicles down and lengthens travel time. To tackle this problem, we propose a Spatio-Temporal Neural Network (STNN) approach which jointly learns the travel time and the travel distance from the raw GPS coordinates of an origin and a destination and the time-of-the-day. An attractive property of the STNN is that it does not require any feature engineering. We then use these estimates to optimize carpooling in simulation so as to minimize the extra travel time incurred by each of the passengers on board. Throughout the paper, we denote vectors by boldface letters.
Summary of results: We summarize our main technical contributions as follows:

In Section IV, we develop a carpooling simulation environment using the NYC taxi trip dataset to generate training experiences for the RL framework.

In Section V, we develop a deep neural network model (STNN) which predicts the travel time and distance directly from the GPS coordinates of origin and destination locations in the city, without building any route or map between the locations.

In Section VI, we present an RL framework to obtain a data-driven carpooling policy for the taxi drivers in order to maximize the long-term transportation efficiency and reduce traffic congestion based on the time of the day and the day of the week.
The rest of the paper is organized as follows: Section II describes the background and related work on carpooling, travel time estimation and provides a brief introduction to reinforcement learning. In Section III, we first describe the problem and the MDP formulation of the problem for reinforcement learning. We empirically evaluate the learned RL policy and the performance of STNN approach in Section VII, and Section VIII concludes the work with future directions.
II Background and Related Work
II-A Carpooling
Carpooling in taxis and rideshare services has been widely studied. A fair-share carpool scheduling algorithm was first presented in [9]. In classical carpool settings, various assumptions were made, such as fixed and regular travel paths of passengers [9]. The authors of [10] showed that with a small increase in travel time it is possible for almost of the taxi trips in Manhattan to be shared by two riders. The work in [11] considered solutions to ridesharing which are scalable to a large number of passengers. Other works which explore carpooling include real-time carpooling on a mobile-cloud architecture [12] and the social effects of carpooling [13]. Our focus is on a data-driven approach to optimize central carpool decision policies through reinforcement learning and on building a training simulation environment based on historical taxi trip data.
II-B Travel Time Estimation
Most of the studies in the literature on travel time estimation focus on predicting the travel time for a sequence of locations along a fixed route. Commonly used techniques include (1) estimating travel time using historical trips and (2) using real-time road speed information [14]. The two common approaches for route travel time estimation are segment-based approaches and path-based approaches. In a segment-based approach, travel time is first estimated on links (straight subsections of a travel path with no intersections) and then summed up to estimate the overall travel time. The link travel time is generally calculated by using loop detector data and floating car data [15] [16]. In floating car data, GPS-enabled cars are used to collect timestamped GPS coordinates. The dataset used for STNN can be thought of as a special case of floating car data, where only the origin and destination GPS coordinates are recorded.
One of the major drawbacks of the segment-based approach is that it is unable to capture the waiting times of a vehicle at traffic lights, which is a very important factor in estimating the travel time accurately. Therefore, some methods have been developed that consider the waiting time at intersections for travel time estimation [17]. In path-based methods, sub-paths (links plus waiting time at intersections) are concatenated to predict the travel time more accurately [18]. Our method is a special case of the path-based method, where the sub-path is the entire path from origin to destination and contains information about the waiting times at all the intersections. In addition to these methods, [19] proposes a neighbor-based method for travel time estimation, which averages the travel time of all the samples in the training data having the same origin, destination, and time-of-day.
In this paper, we jointly predict the travel time and distance from an origin to a destination as a function of the time-of-day using the historical NYC taxi trip data. Since the available NYC taxi trip dataset does not contain full trip trajectories, we treat this as a full-path travel time estimation problem. As an alternative solution, one can first find the specific trajectory (route) between origin and destination and then estimate the travel time for that route [20]. Although obtaining the travel route is important, there are real scenarios where route information is less important than travel time. For example, the travel route is of much less concern than the travel time to a non-driving taxi passenger.
II-C Reinforcement Learning
Reinforcement learning (RL) is used for learning an optimal policy in a dynamic environment. In RL, an agent observes a state, takes an action in that state, receives a reward from the environment, transitions to the next state, and repeats this procedure until it reaches a terminal state, which ends the episode. Initially, the agent picks an action at random from the action space because it has no knowledge of which action should be taken in a given state; that is, the agent explores its environment by taking random actions. As time proceeds, the agent gains confidence in its estimates and starts exploiting its knowledge by taking the action with the highest estimated value, which it expects to produce the greatest reward. In RL, the tradeoff between exploration and exploitation is crucial.
RL methods can be broadly divided into model-based and model-free learning methods. The model consists of the knowledge of the environment: the state transition probabilities and the reward function. In both types of methods, the model is not known in advance. In model-based RL methods, the transition model is first learned and then used to derive an optimal policy [21]. Learning a model requires exhaustive exploration, which is very costly for a large state space. However, it is possible to learn an optimal policy without knowing the model by using model-free RL methods such as temporal difference learning and Monte Carlo methods [22]. RL is primarily concerned with these model-free methods, where an optimal policy is learned from the samples of experience obtained by interacting with the environment. Model-free methods often require a large number of experiences to learn an optimal policy. In this work, we develop a carpooling simulator which generates a large number of experiences for model-free RL methods.
Q-learning [23] is a widely used model-free RL method because of its computational simplicity. The simplest way to obtain a policy is tabular Q-learning, where the algorithm keeps a record of the value function in tabular form [22]. However, when the state and/or action space is large, maintaining such a big table is expensive and sometimes even infeasible. Therefore, function approximation techniques are used to learn this table approximately. For example, deep RL methods use deep neural networks to approximate the Q-value function (Deep Q-Networks (DQN)) [24]. Deep RL has become popular because of its success in playing games [25, 26], where the state space has hundreds of features. In carpooling, the state space is huge, as the state is composed of latitude and longitude coordinates along with a continuous variable, the time of day. Therefore, DQN is suitable for generating an optimal policy in this problem.
III Problem Definition
In this section, we set the basic terminology, Markov Decision Process (MDP) formulation and the travel time estimation problem that is integral to building the simulation environment for reinforcement learning.
III-A Data Mapping
A large, publicly available taxi trip dataset contains M taxi trips for New York City during the year 2013 [27]. This dataset describes every single trip by 21 different variables. Fig. 1 outlines the provided GPS coordinates, where Fig. 1(a) and 1(b) show the density of pickup and dropoff GPS coordinates, respectively.
Geo-coordinates are continuous variables. In dense urban areas such as NYC, tall buildings make it quite possible for erroneous GPS coordinates to be reported. Other sources of erroneous GPS recordings include atmospheric effects, multipath effects and clock errors; for more information, we refer the reader to [28]. To combat these uncertainties, a preprocessing step is needed for the raw GPS data. Hence, we discretize the GPS coordinates into 2D square cells of a fixed size in longitude and latitude. All the GPS coordinates within a square cell are represented by the lower-left corner of that cell.
Similar to location mapping, we also discretize the time-of-day into 1D time cells. From the NYC dataset, we observe that the average travel time of a taxi on a weekday differs from that on the weekend. Therefore, we differentiate the time-of-day of weekdays from that of weekends. The time-of-day of the weekend is incremented by seconds of time-of-day of the weekday. For a time cell of minutes we obtain a total time cells.
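The data mapping described above can be illustrated with a minimal sketch. The cell sizes are left as parameters, and the one-day (86400 s) weekend offset and the function names `location_cell` and `time_cell` are assumptions made for illustration only; they are not taken from the dataset or from the implementation used in this paper.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def location_cell(lat, lon, cell_deg):
    """Snap a GPS point to the lower-left corner of its square cell.

    Returns the integer cell indices and the corner coordinates.
    """
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return (row, col), (row * cell_deg, col * cell_deg)


def time_cell(seconds_of_day, is_weekend, cell_minutes):
    """Map a time-of-day to a 1D time cell, keeping weekends separate.

    ASSUMPTION: weekend times are shifted by one full day (86400 s) so
    that weekday and weekend cells never collide.
    """
    offset = SECONDS_PER_DAY if is_weekend else 0
    return (seconds_of_day + offset) // (cell_minutes * 60)
```

Under these assumptions, two pickups that fall in the same square cell and the same time cell are treated as identical inputs by any downstream model.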
III-B MDP Formulation
We model the carpooling problem from a driver’s perspective through the following MDP.
State: The state of an agent (taxi) at each step consists of two components: the location, a 2D tuple representing the GPS coordinates (latitude, longitude), and the time of the day in seconds. One should not confuse the state of the taxi with the actual origin of a taxi trip; the state of the taxi can be different from the origin of a trip.
Action: The action set contains three actions: the wait action; the take 1 action, which assigns a single passenger (non-carpool); and the take 2 action, which corresponds to carpool. Any one of these actions can be assigned to a taxi. In this work, we assume that at most two taxi calls can be assigned to a taxi. As we will explain later, the action set corresponds to the top-level carpool decisions. We leave the low-level trip assignment to the environment in this work.
Reward: We define the reward as the effective distance traveled by the taxi throughout a transition. The effective distance is defined as the sum of the actual distances from origin to destination of the individual trips served, obtained from the historical dataset. For instance, when the take 1 action is assigned to the taxi, the effective distance is the actual distance between the origin and destination of the trip, whereas for the wait action the effective distance is always zero. A good action will yield an effective trip distance longer than the distance actually traveled by the taxi.
We choose effective distance as the reward because, for a fixed interval of time, if all the taxis cover more effective distance, then a large demand for taxi rides can be fulfilled by fewer drivers on the road. In general, the ideal situation is that, through carpooling, the entire group of drivers covers more effective trip distance than they actually travel, thus reducing the possibility of traffic congestion.
Episode: One episode is one complete day, from 0:00 to 23:59. Hence, one episode completes when the time component of the state of the taxi reaches 23:59.
State Transition: When a taxi completes an assigned action, the state of the taxi is updated, and this change of state is termed a state transition. Fig. 2 illustrates one state transition and one episode, from the start of the day, through the completion of a state transition, to the end of the day (the end of the episode). State transitions continue until the termination state is reached. After a state transition the taxi could either pause for a while (for example, when the driver wants to rest) or become available for other assignments immediately; we assume that the taxi is willing to take assignments at all times.
III-C Travel Time Estimation
We define the travel time as the time taken by a vehicle to move from one location to another. Similarly, travel distance is defined as the distance traversed by a vehicle between two locations. In simple words, STNN estimates the travel distance and time between an origin and a destination at a particular time-of-day.
We define a taxi trip as a 5-tuple (origin, destination, time-of-day, travel distance, travel time): the trip starts from the origin at the given time-of-day and heads to the destination. Both the origin and the destination are 2-tuples of GPS coordinates, and the time-of-day is in seconds. The intuitive reason to include time-of-day as part of a taxi trip is that traffic conditions differ at different times. For example, one encounters heavier traffic at peak hours than at off-peak hours. The traffic patterns on weekdays are also different from those on weekends. Similar to [29], we assume that the intermediate locations, or travel trajectory, are not known and only the end locations are available. We define a query as the triple (origin, destination, time-of-day) that is input to the system, and the corresponding pair (travel time, travel distance) as the output. Given the historical database of taxi trips, our goal is to estimate the travel distance and time for a query.
IV Training Environment: Carpooling Simulator
In order to train an RL agent that makes optimal carpooling decisions, we develop a carpooling simulation environment from a single taxi driver’s perspective, corresponding to our MDP formulation. In our training environment, the transition dynamics are divided into two levels: the action space defined in Section III-B and the more granular decision on trip assignment. We assume that the system performs the decision-making at both levels for the taxi driver. The learned RL policy makes only the first-level decisions (assigning the taxi an action which maximizes the long-term transportation efficiency), whereas the secondary decisions are determined by a fixed algorithm described below in this section.
As Fig. 2 shows, the taxi begins the episode in its initial state, which should not be confused with the actual origin of a taxi trip. The intermediate state is the state of the taxi when it picks up the first passenger. We now define all the actions.
Wait Action: When a wait action is assigned to the taxi in a given state, the taxi stays at its current location while the time advances by a fixed delay time. The next state of the driver therefore has the same location and a time incremented by the delay, as described in Algorithm 1.
Take 1 Action: Given the initial state of the taxi and the take 1 action, the taxi trip search area is first reduced by finding all the taxi trips whose pickup time lies within a search time window starting at the taxi's current time, irrespective of the origin of the taxi trips; the size of the search time window is fixed, say ten minutes. The search area is further reduced by keeping only the taxi trips whose origin the taxi can reach before the pickup time from its initial state. If there is no such trip, the taxi continues waiting at its current location while the time advances, and the state of the taxi is updated accordingly. If such taxi trips exist, the trip with the minimum pickup time is assigned to the taxi. Finally, the taxi picks up the passenger at the origin of the trip, drops the passenger at the destination, and updates its state to the drop-off location and time of the first passenger, which completes the state transition. The take 1 action is described in Algorithm 2.
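As an illustration of the wait and take 1 transitions, the following is a minimal sketch rather than the simulator used in this paper. The `Trip` record, the `travel_time(origin, destination, time_of_day)` callback (a stand-in for a travel time estimator) and the choice to advance the clock to the end of the search window when no trip is feasible are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Location = Tuple[float, float]
State = Tuple[Location, float]  # (GPS location, time of day in seconds)


@dataclass
class Trip:
    origin: Location
    destination: Location
    pickup_time: float  # seconds
    distance: float     # historical trip distance (the effective distance)


def wait(state: State, delay: float) -> Tuple[State, float]:
    """Wait action: same location, time advances by `delay`, zero reward."""
    loc, t = state
    return (loc, t + delay), 0.0


def take_one(state: State, trips: List[Trip], window: float,
             travel_time: Callable[[Location, Location, float], float]
             ) -> Tuple[State, float]:
    """Take 1 action: serve the reachable trip with the earliest pickup."""
    loc, t = state
    # Step 1: keep trips whose pickup time falls inside the search window.
    in_window = [tr for tr in trips if t <= tr.pickup_time <= t + window]
    # Step 2: keep trips whose origin the taxi can reach before pickup.
    reachable = [tr for tr in in_window
                 if t + travel_time(loc, tr.origin, t) <= tr.pickup_time]
    if not reachable:
        # ASSUMPTION: with no feasible trip, the clock moves to the window end.
        return (loc, t + window), 0.0
    trip = min(reachable, key=lambda tr: tr.pickup_time)
    dropoff_time = trip.pickup_time + travel_time(
        trip.origin, trip.destination, trip.pickup_time)
    # Reward is the effective distance of the served trip.
    return (trip.destination, dropoff_time), trip.distance
```

Both functions return the next state together with the reward, which matches the (state, action) to (reward, next state) interface expected of the simulator.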
Take 2 Action: If the take 2 action (which corresponds to carpool) is assigned to a taxi in its initial state, the first taxi call is assigned to the taxi exactly as in the take 1 action. At the resulting intermediate state, a second taxi call is assigned to the driver by following the same procedure used for the first taxi call. The only difference is the pickup time range. For the second taxi call, the search area is reduced by selecting all the taxi calls within a pickup time range that starts at the time of the intermediate state, irrespective of the origin locations of the taxi trips. This means that the taxi has to wait at the intermediate state for the duration of this range while the search for another taxi call is being made. The length of this range is an important parameter which controls the taxi trip search area for the second taxi call assignment.
In carpooling scenarios, for the sake of the first passenger's satisfaction, we cannot use a fixed search window for the second taxi call assignment. For instance, suppose we fix the size of the search time window as for the first taxi call, so that the pickup time search range for the second call has that fixed length, and suppose that the historical dataset tells us the taxi can complete the trip of the first passenger, from its origin to its destination, in less time than that. In this case, it is clearly better to assign the take 1 action to the taxi rather than take 2. Therefore, we need a dynamic pickup time search range for selecting the second taxi call. After reducing the pickup time search range for second taxi calls, we further reduce the search area by selecting all the taxi trips whose origin the taxi can reach before their pickup time from its intermediate state. Finally, the second taxi call with the minimum total extra travel time, described later in this section, is assigned to the taxi.
Now the taxi has two different passengers on board with different destinations. The next question is which passenger to drop off first. This is a routing problem and, for simplicity, we treat its solution as deterministic and embed this decision into the environment, i.e. once an action is assigned to the taxi, the secondary-level decision is executed automatically by the environment. The two possible solutions to this routing problem are depicted in Fig. 3: the taxi can follow either Path I in Fig. 3(a) or Path II in Fig. 3(b). The final state of the taxi corresponds to the destination of the passenger who is dropped off last, shown in green for both solutions.
Since the NYC dataset contains trip information only for a selected number of origin and destination pairs, we develop STNN in Section V, a travel time estimation method which takes the raw GPS coordinates of origin and destination and the time-of-the-day as input and predicts the travel time.
To choose between these paths, we define the notion of extra travel time incurred when a path P is chosen. The extra travel time is an estimate of the extra time each passenger would travel during carpool, which is zero when there is no carpool. For instance, in Fig. 3(a), passenger 1 and passenger 2 each have an actual travel time when they travel alone, from their own origin directly to their own destination. In carpool, on the other hand, each passenger's travel time is measured along the shared path, from that passenger's pickup to that passenger's drop-off. Therefore, the extra travel time for passenger 1 and for passenger 2, when path I is followed, is the carpool travel time minus the travel time when traveling alone.
Similarly, when path II is followed by the taxi, the extra travel time of each passenger is the carpool travel time along path II minus the travel time when traveling alone.
Having calculated the individual extra travel time of each on-board passenger for both paths, we calculate the total extra travel time of path I and of path II as the sum of the two passengers' extra travel times. Path I is followed by the driver if its total extra travel time is smaller than that of path II; otherwise path II is followed. When the take 1 action is assigned, the extra travel time is always zero.
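The drop-off order decision can be sketched as follows. This is an illustrative example only: it assumes that Path I visits the first destination before the second and that Path II visits them in the reverse order, it ignores the time-of-day dependence of travel time, and the `travel_time(a, b)` callback and the function name are hypothetical.

```python
def choose_dropoff_order(o1, d1, o2, d2, travel_time):
    """Pick the drop-off order with the smaller total extra travel time.

    o1, d1: origin and destination of passenger 1 (picked up first).
    o2, d2: origin and destination of passenger 2 (picked up second).
    """
    # Travel times when each passenger rides alone.
    solo1 = travel_time(o1, d1)
    solo2 = travel_time(o2, d2)

    # ASSUMED Path I: o1 -> o2 -> d1 -> d2 (passenger 1 dropped first).
    p1_path1 = travel_time(o1, o2) + travel_time(o2, d1)
    p2_path1 = travel_time(o2, d1) + travel_time(d1, d2)
    total1 = (p1_path1 - solo1) + (p2_path1 - solo2)

    # ASSUMED Path II: o1 -> o2 -> d2 -> d1 (passenger 2 dropped first).
    p1_path2 = travel_time(o1, o2) + travel_time(o2, d2) + travel_time(d2, d1)
    p2_path2 = travel_time(o2, d2)
    total2 = (p1_path2 - solo1) + (p2_path2 - solo2)

    path = "I" if total1 < total2 else "II"
    return path, total1, total2
```

In the simulator, the same total extra travel time could also serve as the score for ranking candidate second taxi calls.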
V STNN
We have seen in Section IV that the travel time between two points is a key quantity for ensuring accurate simulation in the carpool training environment. In this section, we describe our approach, based on deep neural networks, for learning the travel time of origin-destination pairs that are not part of the NYC dataset.
Deep neural networks are known for solving very difficult computational tasks such as object recognition [30, 31], regression [32] and other predictive modeling tasks. They can do so because of their strong ability to learn feature representations from the data and to map the input features to the output variables.
In Fig. 4, we describe the STNN architecture. In this architecture, we define two deep neural network (DNN) modules, one for travel distance estimation and one for travel time estimation, called the “DistDNN Module” and the “TimeDNN Module”, respectively. We first describe the input to the two modules. The input to the distDNN module consists only of the binned GPS coordinates of the origin and the destination. This module is not exposed to time-of-day information because that information is irrelevant to travel distance estimation and might mislead the network. A taxi service platform typically routes a driver onto the path of shortest length. As route planning is not a part of this work, we assume that all the taxis in the available taxi trip dataset have chosen the shortest path for a trip irrespective of the time-of-day. Therefore, the input to the distDNN module is a 4D vector, that is, OriginLatBin, OriginLonBin, DestLatBin and DestLonBin. The input to the timeDNN module is the activations of the last hidden layer of the distDNN module, which encode the raw GPS coordinates into a feature vector, concatenated with the time-of-day information.
Here, both the distDNN module and the timeDNN module are three-layer MLPs with different numbers of neurons per layer. We cross-validated the parameters and selected those with the best performance. The best-performing configuration of the number of layers and the number of neurons per layer for both modules is shown in Fig. 4, where the predicted distance and time are the outputs of the distDNN module and the timeDNN module, respectively. The STNN architecture is then trained via stochastic gradient descent jointly for both travel distance and travel time, using a single loss function defined over the two outputs.
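The data flow through the two modules can be illustrated with a small, dependency-free forward pass. This is a sketch under stated assumptions, not the trained model: the hidden-layer sizes, the random initialization, the ReLU activation and the sum-of-squared-errors `joint_loss` are all chosen here for illustration and are not taken from Fig. 4.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dense(x, weights, bias):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class STNNSketch:
    """Two coupled MLPs; layer sizes are ASSUMED for illustration."""

    def __init__(self, hidden_dist=8, hidden_time=8, seed=0):
        rng = random.Random(seed)
        # DistDNN: 4-D binned origin/destination -> hidden -> hidden -> distance
        self.d1 = init_layer(4, hidden_dist, rng)
        self.d2 = init_layer(hidden_dist, hidden_dist, rng)
        self.d_out = init_layer(hidden_dist, 1, rng)
        # TimeDNN: [DistDNN last hidden activations, time-of-day] -> ... -> time
        self.t1 = init_layer(hidden_dist + 1, hidden_time, rng)
        self.t2 = init_layer(hidden_time, hidden_time, rng)
        self.t_out = init_layer(hidden_time, 1, rng)

    def forward(self, origin_bin, dest_bin, time_of_day):
        x = list(origin_bin) + list(dest_bin)
        h = relu(dense(x, *self.d1))
        h = relu(dense(h, *self.d2))        # last hidden layer of DistDNN
        dist = dense(h, *self.d_out)[0]     # never sees time-of-day
        z = relu(dense(h + [time_of_day], *self.t1))
        z = relu(dense(z, *self.t2))
        time = dense(z, *self.t_out)[0]
        return dist, time


def joint_loss(pred, target):
    """ASSUMED joint loss: squared error on distance plus squared error on time."""
    return (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
```

Because the distance branch receives only the four binned coordinates, its prediction is unchanged when the time-of-day varies, while the time branch can use both the learned spatial features and the time-of-day.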
VI Q-Learning for Carpooling
In this work, we consider that the taxi relies entirely on RL to decide on carpooling, by learning the value function of a taxi’s state-action pair from the experience generated by the carpooling simulator. We adopt a model-free RL approach to learn an optimal policy, as the agent has no knowledge of the state transitions and reward distributions. A policy $\pi$ is a map which models the agent’s action selection given a state, where the value of a policy is determined by the state-action value function $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$. Here, $R_t$ denotes the sum of discounted rewards. The value function estimates how good it is for an agent to be in a given state when following the policy $\pi$. Given an optimal policy $\pi^*$ and an action $a$ in a given state $s$, the action value under the optimal policy is defined by $Q^*(s, a) = Q^{\pi^*}(s, a)$. The optimal action can be found by $a^* = \arg\max_{a} Q^*(s, a)$. With tabular Q-learning, the Q-value function is estimated by updating the lookup table as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \qquad (1)$$
Here, $\gamma$ is the discount rate, which models how strongly the agent prefers long-term reward over immediate reward, and $\alpha$ is the step size parameter which controls the learning rate. In training, we use the epsilon-greedy policy, where with probability $1 - \epsilon$ an agent in state $s$ selects the action with the highest value (exploitation), and with probability $\epsilon$ chooses a random action to ensure exploration.
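The tabular update and the epsilon-greedy rule can be made concrete with a short sketch. The integer action encoding, the dictionary-based table and the function names are assumptions made for illustration; states are assumed to be already discretized and hashable.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1, 2)  # ASSUMED encoding: 0 = wait, 1 = take 1, 2 = take 2


def epsilon_greedy(q, state, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, alpha, gamma,
             terminal=False):
    """One tabular Q-learning step; returns the updated table entry."""
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Unseen state-action pairs default to a value of 0.0.
q_table = defaultdict(float)
```

A training loop would repeatedly call `epsilon_greedy` to choose an action, step the simulator, and pass the observed reward and next state to `q_update`.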
Tabular Q-learning works well for small MDP problems, but when the state-action space is huge or the state space is continuous, we use a function approximator to model $Q(s, a)$. Neural networks, which are universal function approximators, are a natural choice. Here, we adopt the basic neural network architecture in [24], where the network takes the state (longitude, latitude, time of day) as input and outputs multiple Q-values, one for each action. To approximate the Q-function we use a three-layer deep neural network which learns the state-action value function. As in [24], we store the state transitions (experiences) in a replay memory, and each iteration samples a minibatch from this replay memory. In the DQN framework, the minibatch update through backpropagation is essentially a step for solving a bootstrapped regression problem with the loss function
$$L(\theta) = \mathbb{E}\left[ \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') - Q(s_t, a_t; \theta) \right)^2 \right] \qquad (2)$$
where $\theta'$ denotes the parameters of the Q-network from the previous iteration.
Here the max operator is used both for selecting and for evaluating an action, which makes the Q-network training unstable. To improve the training stability we use Double DQN as proposed in [33], where a target network with parameters $\theta^{-}$ is maintained and synchronized periodically with the original network. Thus the modified minibatch target is
$$y_t = r_t + \gamma \, Q\left( s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \theta^{-} \right) \qquad (3)$$
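The difference between the standard bootstrapped target and the Double DQN target can be seen in a few lines. In this sketch the Q-values of the next state are passed in as plain lists indexed by action; in practice they would come from forward passes of the online and target networks. The function names are illustrative.

```python
def dqn_target(reward, next_q, gamma, terminal=False):
    """Standard target: the same Q-values both select and evaluate the action."""
    if terminal:
        return reward
    return reward + gamma * max(next_q)


def double_dqn_target(reward, next_q_online, next_q_target, gamma,
                      terminal=False):
    """Double DQN target: the online network selects, the target network evaluates."""
    if terminal:
        return reward
    best_action = max(range(len(next_q_online)),
                      key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

Decoupling selection from evaluation means that an action whose value is overestimated by one network is not automatically assigned that overestimated value in the target.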
To maximize total effective trip distance, we set the discount factor for all the experiments. We summarize the DQN algorithm in Algorithm 4. We compare the performance of the DQN-learned policy with that of a fixed policy which always favors carpooling. Details of the fixed policy generation are described in Algorithm 5.
VII Performance Evaluation
Based on the NYC dataset, we conducted experiments first on travel time estimation using STNN and then on first-level carpool policy optimization using reinforcement learning. We report the detailed results below.
VII-A STNN Results
We divide the entire dataset into training and test subsets in the ratio 80:20. All the parameters of the STNN network architecture, such as the number of layers per module and the number of units per hidden layer, are shown in Fig. 4. We cross-validated the hyperparameters to achieve the best performance. We also use the data mapping described in Section III-A. For location mapping we use a 2D square cell, and for time mapping we use a 1D time cell of 10 minutes. All the parameters of STNN are kept fixed throughout all the experiments.
VII-A1 Outliers Rejection
From the initial exploration of the NYC taxi trip data, we find that the dataset contains a number of anomalous taxi trips, termed outliers: for example, trips with more than 7 passengers or with no passenger, trips with missing pickup or dropoff GPS coordinates, trips with a travel time of zero seconds but a nonzero travel distance, and trips with a travel distance of zero miles but a nonzero travel time. These outliers can cause large errors in our estimates, so we experimentally detected the anomalous trips and removed them from the dataset.

Task | Method | $R^2$ | MAE | MRE | MedAE | MedRE
Time | LRT | 1.84 | 724.14 | 1.01 | 638.52 | 1.10
Time | TimeNN | 0.71 | 158.29 | 0.22 | 100.24 | 0.18
Time | STNN | 0.75 | 145.9 | 0.20 | 91.48 | 0.16
VII-A2 Evaluation Methods
Here we list the methods compared to STNN:

Linear Regression for Time (LRT): We implement a simple linear regression method for time estimation.

Unified learning (STNN): This is the proposed approach described in Section V.

TimeDNN module (TimeNN): Only the timeDNN module of the STNN is used to learn the travel time. The inputs to this module are the origin and destination GPS coordinates along with the time-of-day.

BTE: We also compare the performance of STNN with the best method introduced in [29].
VII-A3 Evaluation Measures
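We evaluate the performance of STNN on five different metrics. The Mean Absolute Error (MAE) is defined as the mean of the absolute difference between the estimated travel time $\hat{T}_i$ and the ground truth $T_i$ over the $N$ test trips,
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{T}_i - T_i|,$$
and the Mean Relative Error (MRE) is defined as
$$\mathrm{MRE} = \frac{\sum_{i=1}^{N} |\hat{T}_i - T_i|}{\sum_{i=1}^{N} T_i}.$$
Since the dataset contains anomalous taxi trip entries, we also measure the Median Absolute Error (MedAE) and the Median Relative Error (MedRE) as
$$\mathrm{MedAE} = \mathrm{median}\left( |\hat{T}_i - T_i| \right), \qquad \mathrm{MedRE} = \mathrm{median}\left( \frac{|\hat{T}_i - T_i|}{T_i} \right),$$
where median has its usual meaning. Finally, to measure how close the data are to the fitted hypersurface, we also use the coefficient of determination to evaluate the performance of STNN:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (T_i - \hat{T}_i)^2}{\sum_{i=1}^{N} (T_i - \bar{T})^2},$$
where $\bar{T}$ is the mean of the observed data.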
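The five measures can be computed with the Python standard library alone, as in the following sketch. It assumes that MRE is the ratio of the summed absolute errors to the summed ground-truth travel times; the function name `evaluate` and the dictionary keys are illustrative.

```python
from statistics import mean, median


def evaluate(pred, truth):
    """Return MAE, MRE, MedAE, MedRE and R^2 for paired predictions."""
    abs_err = [abs(p - t) for p, t in zip(pred, truth)]
    truth_mean = mean(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - truth_mean) ** 2 for t in truth)
    return {
        "MAE": mean(abs_err),
        # ASSUMPTION: ratio of sums rather than the mean of per-trip ratios.
        "MRE": sum(abs_err) / sum(truth),
        "MedAE": median(abs_err),
        "MedRE": median(e / t for e, t in zip(abs_err, truth)),
        "R2": 1.0 - ss_res / ss_tot,
    }
```

The median-based measures are insensitive to a small number of anomalous trips, which is why they are reported alongside the mean-based ones.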
Table I compares the performance of the proposed approach for travel time estimation. From Table I, we observe that TimeNN is far better than the simple linear regression method for travel time estimation, e.g. about improvement in MAE. This is expected because simple linear regression does not account for uncertain traffic conditions and simply tries to find a linear relationship between the raw origin-destination GPS coordinates and the travel time.
With encoded travel distance information, STNN further improves the performance of travel time estimation; that is, MAE is improved by 13 seconds in comparison to TimeNN. To investigate further, we plot the MAE for all the approaches in Fig. 5(c) to find the regimes in which STNN is better than TimeNN. It is clear from the plots that the slope of the orange curve is larger than that of the green curve, which means that the longer a trip lasts, the larger the gap in performance. We also plot the MAE and the predicted travel time of the STNN network as a function of taxi travel time in Fig. 5(a). As expected, for shorter taxi trips STNN succeeds in predicting the actual travel time, but for longer trips it incurs a larger MAE, around minutes. In Fig. 5(b), we show the performance of STNN for travel time estimation with respect to the trip distance, where we make similar observations.
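Table II: Travel time estimation performance of BTE [34] and STNN with and without outliers.

| Test data | Method | MAE (s) | MRE | MedAE (s) | MedRE |
|---|---|---|---|---|---|
| With outliers | BTE [34] | 170.04 | 0.2547 | 97.435 | 0.196 |
| With outliers | STNN | 123.13 | 0.2282 | 81.21 | 0.183 |
| Without outliers | BTE [34] | 142.73 | 0.2173 | 90.046 | 0.1874 |
| Without outliers | STNN | 121.48 | 0.2155 | 80.77 | 0.182 |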
We also compare the performance of STNN with the best approach in [34] and study the impact of outliers on the performance of STNN for travel time estimation in Table II. For a fair comparison, we restrict the training dataset to the Manhattan region and use the same data mapping parameters as described in [34]. In Section VII-A1, we studied the types of outliers present in the dataset and applied filters (on time and distance, GPS coordinates, etc.) to remove them. To analyze the robustness of STNN with respect to outliers, we train STNN on the cleaned training data and test the network on uncleaned data (with outliers). Without outliers, STNN clearly improves on [34] in travel time estimation, by 17% in terms of MAE. Even when outliers are prevalent in the data, our proposed approach not only outperforms [34] but also appears to be more robust to outliers: the difference in STNN performance with and without outliers is negligible (less than 2 seconds in MAE, 123.13 versus 121.48 in Table II).
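The arithmetic behind these statements can be checked directly from the MAE column of Table II. The sketch below assumes that the first pair of rows in Table II is the with-outliers setting and the second pair the without-outliers setting. The text does not state which denominator the 17% figure uses, so both candidates are computed; all variable names are hypothetical.

```python
# MAE values in seconds, copied from Table II.
# Assumption: the first BTE/STNN pair is "with outliers", the second "without".
mae = {
    "with_outliers": {"BTE": 170.04, "STNN": 123.13},
    "without_outliers": {"BTE": 142.73, "STNN": 121.48},
}

# Sensitivity of each method to outliers: change in MAE between the two settings.
stnn_gap = mae["with_outliers"]["STNN"] - mae["without_outliers"]["STNN"]
bte_gap = mae["with_outliers"]["BTE"] - mae["without_outliers"]["BTE"]

# Improvement of STNN over BTE on data without outliers, under two normalizations.
diff = mae["without_outliers"]["BTE"] - mae["without_outliers"]["STNN"]
gain_vs_bte = diff / mae["without_outliers"]["BTE"]    # relative to the baseline MAE
gain_vs_stnn = diff / mae["without_outliers"]["STNN"]  # relative to the STNN MAE

print(f"STNN gap: {stnn_gap:.2f} s, BTE gap: {bte_gap:.2f} s")
print(f"gain vs BTE: {gain_vs_bte:.1%}, gain vs STNN: {gain_vs_stnn:.1%}")
```

Under this reading, the STNN gap is 1.65 s against 27.31 s for BTE, and the quoted 17% matches the improvement normalized by the STNN MAE (about 17.5%) rather than by the BTE MAE (about 14.9%).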
VII-B Carpooling Results
With STNN, we deployed the trip time estimation module in the carpool training environment developed in Section IV. We trained an RL agent on experiences generated from the simulation environment to optimize the carpooling policy of a single taxi driver. In this work, we consider a single-agent carpooling policy search in which the decision taken by an agent (taxi) is independent of the other agents. In general, in a single-agent or multi-agent RL framework, the agent is the ridesharing platform, which makes decisions for the taxis. In our problem, the platform makes decisions for only a single taxi, so the taxi itself acts as the agent. For learning a tabular-Q policy, we discretized the selected geographical region into square cells of 0.0002 degree latitude × 0.0002 degree longitude (about 200 m × 200 m), forming a 2D grid, and discretized the time of day with a sampling period of 600 s. For learning a DQN policy, we do not discretize any of the variables; the original variable values are used as input to the agent's neural network.
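The discretization used for the tabular-Q policy can be sketched as follows. The cell size and sampling period are the values quoted above; the function name `discretize_state`, the origin arguments `lat0` and `lon0` (the south-west corner of the selected region), and the example coordinates are hypothetical, since the region bounds are not restated here.

```python
import math

CELL_DEG = 0.0002   # cell edge in degrees, as quoted in the text
TIME_BIN_S = 600    # sampling period for the time of day, in seconds


def discretize_state(lat, lon, seconds_since_midnight, lat0, lon0):
    """Map a continuous (lat, lon, time) state to (row, col, time-bin) indices.

    lat0 and lon0 are the assumed south-west corner of the selected region.
    """
    row = math.floor((lat - lat0) / CELL_DEG)
    col = math.floor((lon - lon0) / CELL_DEG)
    t_bin = int(seconds_since_midnight // TIME_BIN_S)
    return row, col, t_bin


# Hypothetical example: a taxi at 08:15 (29700 s after midnight).
state = discretize_state(40.8011, -73.9493, 29700, lat0=40.8, lon0=-73.95)
```

The resulting index triple can serve as a key of the Q-table, whereas the DQN agent would consume the raw `(lat, lon, seconds_since_midnight)` values directly.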
We evaluate the performance of the DQN-learned policy on both a weekday and a weekend by comparing its mean cumulative reward with that of the tabular-Q policy and of a fixed baseline policy that always favors carpooling. The fixed policy is greedy in the sense that the agent always chooses the action that accepts a carpool (the action associated with the maximum immediate reward), as described in Algorithm 5.
We generate the samples of experience in real time from the carpool simulation environment described in Section IV. We study the performance of the learned RL policy in two regions of NYC with different taxi call densities, Uptown Manhattan and Downtown Manhattan, in Fig. 6.
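The evaluation protocol, a policy scored by its mean cumulative reward over full episodes, can be sketched as below. `ToyCarpoolEnv` is a hypothetical stand-in with random rewards and is not the carpool simulator of Section IV; the action encoding and the `reset`/`step` interface are likewise assumptions made only so that the sketch runs.

```python
import random

ACCEPT, DECLINE = 1, 0  # hypothetical action encoding


class ToyCarpoolEnv:
    """Stand-in environment with random rewards; NOT the paper's simulator."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        # Accepting a carpool pays a random immediate reward; declining pays nothing.
        reward = self.rng.uniform(0.0, 1.0) if action == ACCEPT else 0.0
        self.t += 1
        return self.t, reward, self.t >= self.horizon


def fixed_policy(state):
    """Baseline that always accepts a carpool, whatever the state."""
    return ACCEPT


def mean_cumulative_reward(env, policy, episodes):
    """Average, over episodes, of the sum of rewards collected in one episode."""
    totals = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(policy(state))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


baseline_score = mean_cumulative_reward(ToyCarpoolEnv(), fixed_policy, episodes=5)
```

A learned policy would be scored by passing a different `policy` callable to the same function, so that all policies are compared on identical terms.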
VII-B1 Uptown Manhattan
We select a square region in northern Manhattan, shown in Fig. 5(a), where the binned red dots represent the selected region.
In Table III, the first row compares the performance of the DQN-learned policy with the fixed policy and the tabular-Q policy on both a weekday and a weekend. The DQN-learned policy outperforms both the fixed policy and the tabular-Q policy on both days. For a weekday, we plot the action values (Q-values) averaged over minibatches for DQN in Fig. 6(a) and the Q-values averaged over a number of episodes for tabular Q in Fig. 6(b). In both cases, the mean Q-value converged smoothly after a few thousand episodes, at which point we stopped training the RL agent.
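One simple way to decide that the mean Q-value has flattened out is to compare its average over two consecutive windows of training. The criterion below is an illustrative assumption, not the stopping rule used in the experiments; `has_converged`, `window`, and `tol` are hypothetical names and values.

```python
def has_converged(mean_q_history, window=100, tol=1e-2):
    """Return True when the windowed mean of Q changes by at most `tol` (relative).

    mean_q_history holds one averaged Q-value per minibatch or per episode.
    """
    if len(mean_q_history) < 2 * window:
        return False  # not enough history to compare two full windows
    recent = sum(mean_q_history[-window:]) / window
    previous = sum(mean_q_history[-2 * window:-window]) / window
    return abs(recent - previous) <= tol * max(abs(previous), 1e-12)


# Toy learning curve that saturates at 10: still rising early on, flat much later.
curve = [10.0 * (1.0 - 0.99 ** k) for k in range(2000)]
early, late = has_converged(curve[:300]), has_converged(curve)
```

Averaging over a window smooths the minibatch-to-minibatch noise, so the test reacts to the trend of the curve rather than to individual updates.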
VII-B2 Downtown Manhattan
We select a square region of Downtown Manhattan, shown in Fig. 5(b).
As for Uptown Manhattan, we plot the action values on a weekday for DQN and tabular Q in Fig. 6(c) and Fig. 6(d), respectively. In Table III, the second row compares the performance of the DQN-learned policy with the fixed policy and the tabular-Q policy on both a weekday and a weekend. On the weekday, DQN and the fixed policy perform equally well. This can be explained by the fact that Downtown Manhattan has a high density of taxi calls, whose destinations are usually within the downtown area as well. Hence, an always-carpool policy is near-optimal with respect to the objective, i.e., the effective total trip distance. On the weekend, by contrast, the density of taxi calls is lower, and DQN learns a policy that is better than the baseline.
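Table III: Mean cumulative reward of the fixed, tabular-Q, and DQN policies.

| Region | Day | Fixed Policy | Tabular Q | DQN |
|---|---|---|---|---|
| Uptown | Weekday | 41.543 | 39.17 | 46.08 |
| Uptown | Weekend | 25.39 | 14.37 | 27.86 |
| Downtown | Weekday | 340.06 | 186.00 | 339.42 |
| Downtown | Weekend | 259.57 | 145.63 | 261.23 |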
Tabular Q always performs worst because the state-action space is huge, and obtaining a Q-value for every entry of such a space is not practical. In all the experiments, the learned Q-value table was very sparse. Therefore, at test time we encounter some states in which the Q-values of all the actions are equal to zero.
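The effect of a sparse table at test time can be illustrated with a dictionary-backed Q-table. How the agent acts in a state whose Q-values are all zero is not specified above; falling back to the always-carpool action is an assumption of this sketch, and `ACTIONS`, `make_q_table`, and `greedy_action` are hypothetical names.

```python
from collections import defaultdict

ACTIONS = (0, 1)  # hypothetical encoding: 0 = decline carpool, 1 = accept carpool


def make_q_table():
    """Sparse Q-table: states never written to read as all-zero Q-values."""
    return defaultdict(lambda: [0.0] * len(ACTIONS))


def greedy_action(q_table, state, fallback_action=1):
    """Return (action, informed): informed is False if the table has no signal."""
    q_values = q_table.get(state)  # .get() does not create a new entry
    if q_values is None or all(q == 0.0 for q in q_values):
        # Assumption: with no learned signal, fall back to always-carpool.
        return fallback_action, False
    best = max(range(len(ACTIONS)), key=lambda a: q_values[a])
    return ACTIONS[best], True


q_table = make_q_table()
q_table[(5, 3, 49)] = [0.2, 1.3]   # a visited state with learned values
seen = greedy_action(q_table, (5, 3, 49))
unseen = greedy_action(q_table, (9, 9, 9))
```

Counting how often `informed` is False over a test episode gives a direct measure of how much of the visited state space the table failed to cover.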
We suspect that in Downtown Manhattan, where taxi calls are very frequent, the DQN policy always favors carpooling and therefore generates a reward similar to that of the fixed policy. In Uptown Manhattan, where taxi calls are less frequent, the DQN-learned policy is able to choose selectively between actions, leading the taxi into regions with higher long-term value. To better understand the cumulative reward, we randomly selected a location in Uptown Manhattan and ran a full episode to generate the sequence of actions and rewards for both the fixed policy and the DQN-learned policy. We observed that during the morning hours the two policies followed the same action sequence, but later in the day the DQN-learned policy began to sacrifice immediate reward in exchange for a higher long-term cumulative reward by steering the taxi towards regions with high action values.
VIII Conclusion
We have developed a reinforcement learning system that generates a carpooling policy for a taxi driver to maximize transportation efficiency in terms of fulfilling passenger orders. We have developed a carpool simulation environment that uses historical taxi trip data to generate the samples of experience for training the RL agent. To support an accurate simulator, we propose STNN, an end-to-end deep neural network approach that takes the raw GPS coordinates of the origin and destination to estimate the travel time of potential trips. We conducted experiments on two different areas of Manhattan. The results show that the RL-learned policy is able to decide intelligently when to accept a carpool trip based on the current driver state and the future prospects of the actions, with a demonstrated advantage in optimizing the total effective trip distance of a driver within a day. In this work, we assume that the decisions made for different taxis are independent of each other. One obvious direction for future research is to extend our framework to a multi-agent setting. Another potential extension is to make the action space more granular, down to the level of trip assignment.
References
 [1] B. Schaller, “The New York City taxicab fact book,” Schaller Consulting, Mar. 2006.
 [2] NYC Taxi and Limousine Commission, “Taxi of tomorrow survey results,” http://www.nyc.gov/html/tlc/downloads/pdf/tot_survey_results_02_10_11.pdf, 2011.
 [3] D. Zhang, T. He, F. Zhang, M. Lu, Y. Liu, H. Lee, and S. H. Son, “Carpooling service for large-scale taxicab networks,” ACM Transactions on Sensor Networks (TOSN), vol. 12, no. 3, p. 18, 2016.
 [4] I. Jindal, T. Qin, X. Chen, M. Nokleby, and J. Ye, “A unified neural network approach for estimating travel time and distance for a taxi trip,” arXiv preprint arXiv:1710.04350, 2017.
 [5] Z. Wang, K. Fu, and J. Ye, “Learning to estimate the travel time,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 858–866.
 [6] Y. Li, K. Fu, Z. Wang, C. Shahabi, J. Ye, and Y. Liu, “Multi-task representation learning for travel time estimation,” in International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
 [7] M. Han, P. Senellart, S. Bressan, and H. Wu, “Routing an autonomous taxi with reinforcement learning,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016, pp. 2421–2424.
 [8] T. Verma, P. Varakantham, S. Kraus, and H. C. Lau, “Augmenting decisions of taxi drivers through reinforcement learning for improving revenues,” 2017.
 [9] R. Fagin and J. H. Williams, “A fair carpool scheduling algorithm,” IBM Journal of Research and Development, vol. 27, no. 2, pp. 133–139, 1983.
 [10] P. Santi, G. Resta, M. Szell, S. Sobolevsky, S. H. Strogatz, and C. Ratti, “Quantifying the benefits of vehicle pooling with shareability networks,” Proceedings of the National Academy of Sciences, vol. 111, no. 37, pp. 13290–13294, 2014.
 [11] J. Alonso-Mora, S. Samaranayake, A. Wallar, E. Frazzoli, and D. Rus, “On-demand high-capacity ride-sharing via dynamic trip-vehicle assignment,” Proceedings of the National Academy of Sciences, p. 201611675, 2017.
 [12] S. Ma, Y. Zheng, and O. Wolfson, “Real-time city-scale taxi ridesharing,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1782–1795, 2015.
 [13] J. Meurer, M. Stein, D. Randall, M. Rohde, and V. Wulf, “Social dependency and mobile autonomy: supporting older adults’ mobility with ridesharing ICT,” in Proceedings of the 32nd annual ACM conference on Human factors in computing systems. ACM, 2014, pp. 1923–1932.
 [14] A. Narayanan, N. Mitrovic, M. T. Asif, J. Dauwels, and P. Jaillet, “Travel time estimation using speed predictions,” in Intelligent Transportation Systems (ITSC), 2015 IEEE 18th International Conference on. IEEE, 2015, pp. 2256–2261.
 [15] A. Kesting and M. Treiber, “Traffic flow dynamics: Data, models and simulation,” 2013.
 [16] X. Zhan, S. Hasan, S. V. Ukkusuri, and C. Kamga, “Urban link travel time estimation using large-scale taxi data with partial information,” Transportation Research Part C: Emerging Technologies, vol. 33, pp. 37–49, 2013.
 [17] M. Li, A. Ahmed, and A. J. Smola, “Inferring movement trajectories from GPS snippets,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. ACM, 2015, pp. 325–334.
 [18] A. Hofleitner, R. Herring, P. Abbeel, and A. Bayen, “Learning the dynamics of arterial traffic from probe data using a dynamic Bayesian network,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1679–1693, 2012.
 [19] E. F. Morgul, K. Ozbay, S. Iyer, and J. Holguin-Veras, “Commercial vehicle travel time estimation in urban networks using GPS data from multiple sources,” in Transportation Research Board 92nd Annual Meeting, no. 134439, 2013.
 [20] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang, “T-drive: driving directions based on taxi trajectories,” in Proceedings of the 18th SIGSPATIAL International conference on advances in geographic information systems. ACM, 2010, pp. 99–108.
 [21] D. Chakraborty and P. Stone, “Structure learning in ergodic factored MDPs without knowledge of the transition function’s in-degree,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 737–744.
 [22] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [23] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
 [24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [25] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 [27] C. Whong, “Foiling nyc boro taxi trip data,” http://chriswhong.com/opendata/foilingnycsborotaxitripdata/.
 [28] M. S. Grewal, L. R. Weill, and A. P. Andrews, Global positioning systems, inertial navigation, and integration. John Wiley & Sons, 2007.
 [29] H. Wang, Y.-H. Kuo, D. Kifer, and Z. Li, “A simple baseline for travel time estimation using large-scale trip data,” in Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2016, p. 61.
 [30] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep big multilayer perceptrons for digit recognition,” in Neural networks: tricks of the trade. Springer, 2012, pp. 581–598.
 [31] I. Jindal, M. Nokleby, and X. Chen, “Learning deep networks from noisy labels with dropout regularization,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 967–972.
 [32] D. West, “Neural network credit scoring models,” Computers & Operations Research, vol. 27, no. 11, pp. 1131–1152, 2000.
 [33] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, 2016, pp. 2094–2100.
 [34] Y. Wang, Y. Zheng, and Y. Xue, “Travel time estimation of a path using sparse trajectories,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 25–34.