1 Introduction
Taxi/car on Demand (ToD) services (e.g., UberX, Lyft, Grab) not only provide a comfortable means of transport for customers, but are also good for the environment because they enable sharing of vehicles over time (while each vehicle serves one request at any one point in time). A further improvement over ToD is on-demand ride pooling (e.g., UberPool, LyftLine, GrabShare), where vehicles are shared not only over time but also in space (within the taxi/car). On-demand ride pooling reduces the number of vehicles required, thereby reducing emissions and traffic congestion compared to ToD services. This is achieved while providing benefits to all the stakeholders involved: (a) individual passengers have reduced costs due to sharing of space; (b) drivers make more money per trip as multiple passengers (or passenger groups) are present; (c) the aggregation company can satisfy more customer requests with the same number of vehicles.
In this paper, we focus on this on-demand ride pooling problem at city scale, referred to as the Ride-Pool Matching Problem (RMP) [Alonso-Mora et al. (2017); Bei and Zhang (2018); Lowalekar et al. (2019)]. The goal in an RMP is to assign combinations of user requests to vehicles (of arbitrary capacity) online such that quality constraints (e.g., the delay in reaching the destination due to sharing is not more than 10 minutes) and matching constraints (each request is assigned to at most one vehicle, and each vehicle is assigned at most one request combination) are satisfied while maximizing an overall objective (e.g., number of requests served, revenue). Unlike the ToD problem, which requires solving a bipartite matching problem between vehicles and customers, RMP requires effective matching on a tripartite graph of requests, trips (combinations of requests) and vehicles. This matching on a tripartite graph significantly increases the complexity of solving RMP online, especially at city scale, where there are hundreds or thousands of vehicles, hundreds of requests arrive every minute, and request combinations have to be computed for each vehicle.
Due to this complexity and the need to make decisions online, most existing work on solving RMP has focused on computing the best greedy assignments [Ma et al. (2013); Tong et al. (2018); Huang et al. (2014); Lowalekar et al. (2019); Alonso-Mora et al. (2017)]. While these approaches scale well, they are myopic and, as a result, do not consider the impact of a given assignment on future assignments. The works closest to this paper are by Shah et al. [Shah et al. (2020)] and Lowalekar et al. [Lowalekar et al. (2021)]. We specifically focus on the work by Shah et al., as it has the best performance while being scalable. That work considers the future impact of the current assignment from an individual agent's perspective without sacrificing scalability (to city scale). However, a key limitation of that work is that it does not consider the impact of other agents' (vehicles') actions on an agent's (vehicle's) future impact, which, as we demonstrate in our experiments, can have a major effect (primarily because vehicles are competing for the common demand).
To that end, we develop a conditional expectation based value decomposition approach that considers not only the future impact of current assignments but also the impact of other agents' states and actions, through the use of conditional probabilities and tighter estimates of individual impact. Due to these conditional probability based tighter estimates of individual value functions, we can scale the work by Guestrin et al. [Guestrin et al. (2002)] and Li et al. [Li et al. (2021)] to solve problems with no explicit coordination graphs and hundreds/thousands of homogeneous agents. Unlike value decomposition approaches [Rashid et al. (2018); Sunehag et al. (2018)], which were developed for solving cooperative Multi-Agent Reinforcement Learning (MARL) with tens of agents under a centralized training and decentralized execution setup, we focus on problems with hundreds or thousands of agents with centralized training and centralized execution (e.g., Uber, Lyft, Grab).
In this application domain of taxi on demand services, where an improvement of 0.5%-1% is a major achievement [Lin et al. (2018)], we demonstrate that our approach outperforms the existing best approach, NeurADP [Shah et al. (2020)], by at least 3.8% and up to 9.76% across a wide variety of settings on the benchmark real-world taxi dataset [NYYellowTaxi (2016)].
2 Background
In this section, we formally describe the RMP and provide details of NeurADP, the existing approach for on-demand ride pooling that we improve upon.
Ride-pool Matching Problem (RMP): We consider a fleet of vehicles/resources with random initial locations, travelling on a predefined road network with intersections as nodes, road segments as edges, and weights on edges indicating the travel time on the road segment. Passengers who want to travel from one location to another send requests to a central entity that collects these requests over a time window called the decision epoch. The goal of the RMP is to match these collected requests to empty or partially filled vehicles that can serve them such that an objective is maximised subject to constraints on the delay. We upper-bound and consider the objective to be the number of requests served. The RMP is thus defined using a tuple; please refer to Appendix A.1 for a detailed description. (Footnote 1: a dedicated symbol is used throughout the paper as the concatenation operator.)
Delay constraints: We consider two delays. The first is the maximum allowed pickup delay, an upper bound on the difference between the arrival time of a request and the time at which a vehicle picks the user up. The second is the maximum allowed detour delay, an upper bound on the difference between the time at which the user arrives at their destination in a shared cab and the time at which they would have arrived had they taken a single-passenger cab.
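The two delay constraints described above can be illustrated with a minimal sketch. The function name, argument names and the use of seconds as the time unit are assumptions made for illustration; they are not taken from the paper.

```python
def satisfies_delay_constraints(arrival_time, pickup_time,
                                shared_dropoff_time, solo_dropoff_time,
                                max_pickup_delay, max_detour_delay):
    """Check one request against the two delay bounds (all times in seconds).

    Hypothetical helper: names and units are assumptions for illustration.
    """
    # Pickup delay: time between the request arriving and the user being picked up.
    pickup_delay = pickup_time - arrival_time
    # Detour delay: extra time at the destination compared to a single-passenger trip.
    detour_delay = shared_dropoff_time - solo_dropoff_time
    return (0 <= pickup_delay <= max_pickup_delay
            and detour_delay <= max_detour_delay)


# A request picked up 80 s after arriving, reaching its destination 100 s late.
ok = satisfies_delay_constraints(0, 80, 700, 600,
                                 max_pickup_delay=90, max_detour_delay=180)
```

A request combination would be feasible for a vehicle only if every request in it, together with the requests the vehicle has already accepted, passes such a check.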
Neural Approximate Dynamic Programming (NeurADP) for Solving RMP: Figure 1 provides the overall approach. Two contributions of NeurADP [Shah et al. (2020)] are relevant to this paper:

To estimate the Future Value (FV) of current actions, a method for solving the underlying Approximate Dynamic Program (ADP) [Powell (2007)] using neural network representations of value functions.

To ensure scalability, Decomposing the Joint Value (DJV) function into individual vehicle value functions by extending the work of Russell and Zimdars [Russell and Zimdars (2003)].
Future Value (FV): An ADP is similar to a Markov Decision Problem (MDP), with the key difference that the transition uncertainty is extrinsic to the system and not dependent on the action. The ADP problem for RMP is formulated using a tuple with the following elements:

State: The state of the system consists of the state of all vehicles together with all the requests waiting to be served. The state is obtained in Step A of Figure 1.

Actions: At each time step a large number of requests arrive at the taxi service provider; however, for an individual vehicle only a small number of these requests are reachable. The feasible set of request combinations for each vehicle at each time step is computed in Step B of Figure 1 (expression 1). A binary decision variable indicates whether a vehicle takes a given action (a combination of requests) at a decision epoch. Joint actions across vehicles have to satisfy the matching constraints (expression 2): (i) each vehicle can be assigned at most one request combination; (ii) each request can be assigned to at most one vehicle; and (iii) a vehicle is either assigned or not assigned to a request combination, i.e., the decision variable is binary.

Exogenous information: The source of randomness in the system. This corresponds to the user requests, i.e., the demand, observed at each time step.

Transition: The transitions of the system state. In an ADP, the system evolves from a pre-decision state at a decision epoch to a post-decision state [Powell (2007)], and then to the pre-decision state of the next epoch. The transition from one pre-decision state to the next depends on the action vector and the exogenous information. It should be noted that the action-dependent part of the transition (from the pre-decision to the post-decision state) is deterministic, as uncertainty is extrinsic to the system.

Reward: The reward function; in RMP, this is the revenue from a trip.
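The matching constraints on joint actions listed above can be sketched as a simple validity check. The representation is an assumption for illustration: a joint action is a dictionary from a vehicle identifier to either one request combination (a set of request identifiers) or `None`, which structurally enforces constraints (i) and (iii), leaving constraint (ii) to be checked.

```python
def satisfies_matching_constraints(assignment):
    """Return True if no request is assigned to more than one vehicle.

    `assignment` maps vehicle id -> set of request ids (one combination) or
    None for "no new assignment". This data layout is a hypothetical choice.
    """
    already_assigned = set()
    for vehicle, combination in assignment.items():
        if combination is None:
            continue  # vehicle takes no new requests this epoch
        if already_assigned & set(combination):
            return False  # some request would be served by two vehicles
        already_assigned |= set(combination)
    return True


valid = satisfies_matching_constraints({"v1": {1, 2}, "v2": {3}, "v3": None})
invalid = satisfies_matching_constraints({"v1": {1, 2}, "v2": {2}})
```

In the approach described in this section these constraints are handled inside an ILP; the check here only illustrates what a feasible joint action looks like.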
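Let the value function give the value of being in a given state at a given decision epoch. By the Bellman equation (Equation 3), this value is the maximum over actions of the immediate reward plus the discounted expected value of the next state, where the discount factor weights future value. Using the post-decision state, this expression breaks down into two steps (Equation 4).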
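As a sketch of what Equations 3 and 4 express, the standard Bellman recursion and its post-decision-state form [Powell (2007)] can be written as follows. The notation is an assumption made here for readability ($V$ for the value function, $s_t$ and $a_t$ for state and joint action, $\mathcal{A}_t$ for the feasible joint actions, $O$ for the reward, $\gamma$ for the discount factor, $\xi_{t+1}$ for the exogenous information, $s_t^{a}$ for the post-decision state and $V^{a}$ for its value) and may differ from the authors' own symbols.

```latex
% Bellman equation around the pre-decision state (assumed notation)
V(s_t) = \max_{a_t \in \mathcal{A}_t}
         \Big( O(s_t, a_t) + \gamma\, \mathbb{E}\big[ V(s_{t+1}) \mid s_t, a_t \big] \Big)

% Two-step form using the post-decision state s_t^{a}
V(s_t) = \max_{a_t \in \mathcal{A}_t}
         \Big( O(s_t, a_t) + \gamma\, V^{a}(s_t^{a}) \Big),
\qquad
V^{a}(s_t^{a}) = \mathbb{E}_{\xi_{t+1}}\big[ V(s_{t+1}) \mid s_t^{a} \big]
```

In this form the expectation is moved inside the post-decision value, so the maximization over actions becomes deterministic given an approximation of $V^{a}$.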
The advantage of this two-step value estimation is that the maximization problem in Equation 4 can be solved using an Integer Linear Program (ILP) with the matching constraints indicated in expression 2. Step D of Figure 1 provides this aspect of the overall algorithm. The value function approximation around the post-decision state is a neural network and is updated (Step E of Figure 1) by stepping forward through time using sample realizations of exogenous information (i.e., demand observed in data). However, as we describe next, maintaining a joint value function is not scalable, and hence we decompose it and maintain individual value functions.

Decomposing Joint Value (DJV): Non-linear value functions, unlike their linear counterparts, cannot be directly integrated into the ILP mentioned above. One way to incorporate them is to evaluate the value function for all possible post-decision states and then add these values as constants. However, the number of post-decision states is exponential in the number of resources/vehicles.
Shah et al. (2020) introduced a two-step decomposition of the joint value function that converts it into a linear combination of individual value functions associated with each vehicle. In the first step, following [Russell and Zimdars (2003)], the joint value function is written as the sum of the individual value functions.
In the second step, the individual vehicles' value functions are approximated. They assumed that the long-term expected reward of a given vehicle is not significantly affected by the specific actions another vehicle takes in the current decision epoch, and thereby completely neglected the impact of the actions taken by other vehicles at the current time step. Thus they model the value function of a vehicle using the pre-decision, rather than post-decision, state of all other vehicles.
This allows NeurADP to get around the combinatorial explosion of the post-decision state of all vehicles. NeurADP thus approximates the joint value function as a sum of individual value functions, each depending on the vehicle's own post-decision state and the pre-decision state of all other vehicles.
They then evaluate these individual values (Step C of Figure 1) for each vehicle's possible individual post-decision states (using the individual value neural network) and then integrate the overall value function into the ILP as a linear function over these individual values. This reduces the number of evaluations of the non-linear value function from exponential to linear in the number of vehicles.
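To illustrate why an additive decomposition helps, the following toy sketch scores each joint action as a sum of precomputed individual values and picks the best joint action that respects the matching constraints. It uses brute-force enumeration in place of the ILP purely for illustration; the data layout, function name and numbers are assumptions, not the authors' implementation.

```python
from itertools import product


def best_joint_action(values, feasible):
    """Pick the joint action maximizing the sum of individual values.

    values[v][k]   : individual value of vehicle v for its k-th feasible action
    feasible[v][k] : frozenset of request ids in that action (empty = no assignment)
    Brute force stands in for the ILP here (illustrative assumption).
    """
    best_score, best_choice = float("-inf"), None
    for choice in product(*(range(len(actions)) for actions in feasible)):
        served = [r for v, k in enumerate(choice) for r in feasible[v][k]]
        if len(served) != len(set(served)):
            continue  # a request would be assigned to two vehicles
        score = sum(values[v][k] for v, k in enumerate(choice))
        if score > best_score:
            best_score, best_choice = score, choice
    return best_choice, best_score


feasible = [[frozenset(), frozenset({1}), frozenset({1, 2})],
            [frozenset(), frozenset({2})]]
values = [[0.0, 1.0, 1.8],
          [0.0, 1.0]]
choice, score = best_joint_action(values, feasible)
```

Only 3 + 2 = 5 individual value evaluations are needed as inputs here, even though there are 3 x 2 = 6 joint actions to choose among; the gap grows exponentially with the number of vehicles, which is why the real system hands the linear objective to an ILP solver.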
3 Conditional Expectation based Value Decomposition, CEVD
One of the fundamental drawbacks of NeurADP is that each agent/vehicle (we use agent and vehicle interchangeably) is, to a large extent, kept unaware of the values of the feasible actions of other agents/vehicles. Since our problem execution (assignment of requests to agents) is centralized, this independence of individual agents leads, as shown in our experimental results, to suboptimal actions for the entire system.
While there are dependencies between agents, not all agents depend on each other, and one mechanism typically employed to represent sparsely connected multi-agent systems is a coordination graph [Guestrin et al. (2002); Li et al. (2021)]. The joint value of the system, for a given joint state and joint action, in the context of a coordination graph is given by:
(5) 
where the first term represents the value of each individual agent and the second term represents the impact of one agent's actions on a connected agent's value. Such an approach is scalable if there are only a few agents. However, when considering thousands of agents and a central ILP that requires values for all the different joint action pairs, there is a combinatorial explosion that makes the model non-deployable in real time. To put things into perspective, the number of value evaluations grows from one per feasible action of each agent in DJV to one per pair of feasible actions for each pair of connected agents when using a coordination graph. Note further that the number of feasible actions corresponds to request combinations and hence can itself increase combinatorially.
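The growth in the number of value evaluations can be made concrete with a small counting sketch. The formulas below are an assumption made for illustration: individual terms are counted once per agent and feasible action, and pairwise terms once per coordination-graph edge and pair of actions. The exact counts stated by the authors may differ.

```python
def djv_evaluations(num_agents, actions_per_agent):
    """Individual-value evaluations: one per agent and feasible action."""
    return num_agents * actions_per_agent


def coordination_graph_evaluations(num_agents, actions_per_agent, num_edges):
    """Individual terms plus one pairwise term per edge and action pair.

    Assumed counting model, not the paper's exact expression.
    """
    individual = num_agents * actions_per_agent
    pairwise = num_edges * actions_per_agent ** 2
    return individual + pairwise


# Hypothetical sizes: 1000 agents, 10 feasible actions each, 5000 edges.
djv = djv_evaluations(1000, 10)
cg = coordination_graph_evaluations(1000, 10, 5000)
```

Under these hypothetical sizes the pairwise terms dominate the count, which matches the qualitative point of the paragraph above.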
Thus, we need a mechanism that is scalable while considering the impact of an agent's actions on other agents. In the well-known Expectation Maximization algorithm [Dempster (1977)] for handling missing data, the likelihood is calculated by introducing a conditional probability of the unknown data given the known data. In a similar vein, our method deals with the unknown impact of other agents by considering the conditional probability of one agent taking a particular action given that another agent takes a particular action in the current state. This ensures that the overall value depends on individual agent values and not on joint values. More specifically, the value of an agent is computed as an expectation under these conditional probabilities.

To make this broad idea of conditional expectation operational in the case of RMP, we have to address multiple key challenges. We describe these key challenges and our ways of addressing them below. Figure 2 provides the overall method, with step (II) outlining the conditional expectation idea and the key difference from NeurADP described in Figure 1.
3.1 No explicit/static coordination graph
While it is clear that agents that are very far apart will not have any dependency, there is no explicit coordination graph that is present in RMP. However, RMP has two characteristics that make it easier to identify neighbouring agents for any given agent:

Agents that are spatially close are more likely to compete over the same set of requests and hence have a dependency.

Agents/vehicles do not have identity, i.e., they are all homogeneous.
Due to these characteristics, we can cluster the intersections in the road network (to capture spatial dependencies) and consider the agents located in the same intersection cluster at a given time step as neighbouring agents. Due to the homogeneity of agents, the only aspect of importance is whether there are agents (and not which specific agents) competing for the same requests. Unlike in previous works, the coordination graph keeps changing at each time step, as agents move between clusters. At any time step, an agent located at an intersection belonging to a given cluster coordinates with all other agents in that cluster.
We define a function that maps each intersection of the road network to one of the K clusters, based on the average travel times between intersections. Because locations are clustered and agents are assigned to location clusters, the total number of agents becomes less of an issue with respect to scalability.
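The dynamic neighbourhood structure described above can be sketched as follows. The names `cluster_of` (a mapping from intersection to cluster index) and `agent_locations` (a mapping from agent to its current intersection) are hypothetical; the clustering itself is assumed to have been computed beforehand.

```python
from collections import defaultdict


def neighbours_by_cluster(agent_locations, cluster_of):
    """Return, for each agent, the other agents currently in the same cluster.

    agent_locations : dict agent id -> intersection id (current position)
    cluster_of      : dict intersection id -> cluster index
    Both inputs are illustrative assumptions about the data layout.
    """
    members = defaultdict(list)
    for agent, location in agent_locations.items():
        members[cluster_of[location]].append(agent)
    return {
        agent: [other for other in members[cluster_of[location]] if other != agent]
        for agent, location in agent_locations.items()
    }


cluster_of = {"n1": 0, "n2": 0, "n3": 1}
agent_locations = {"v1": "n1", "v2": "n2", "v3": "n3"}
neighbours = neighbours_by_cluster(agent_locations, cluster_of)
```

Recomputing this mapping at every decision epoch gives a coordination structure that changes as vehicles move, without ever storing an explicit graph over all agents.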
3.2 How to consider impact of other agents in the cluster?
Consider an agent present in some cluster at a given time step. The other agents in that cluster are termed its neighbours. In NeurADP, the agent credits each of its actions with an individual value that is oblivious to the presence of neighbouring agents. Since execution is centralized, the agent can instead weigh the losses/gains its action causes within the cluster by getting useful feedback from the individual values of other agents. Take a neighbouring agent with its own set of feasible actions. From the first agent's perspective, a conditional probability distribution is formed over the neighbour's feasible actions, and the feedback term from the neighbour is written as the expectation of the neighbour's individual values under this distribution.
We compute this feedback for all the neighbours and take the average.
To calculate the value of the agent for a given action after receiving this feedback, we take an affine combination of its own individual value and the averaged feedback, controlled by a learnable parameter. It should be noted that this individual value considers not only the future impact of the current action but also the impact on other agents.
Our overall ILP objective thus becomes the maximization of the sum of these new individual values over the chosen actions,
subject to the feasibility constraints in expression 2.
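The feedback and affine-combination steps above can be sketched as follows. The specific affine form `(1 - alpha) * own + alpha * feedback` is an assumption chosen so that `alpha = 0` recovers the individual value alone; the exact expression used by the authors is not recoverable from the text, and all names here are hypothetical.

```python
def cevd_value(own_value, neighbour_values, cond_probs, alpha):
    """Combine an agent's own value with averaged neighbour feedback.

    own_value        : individual value of the agent's candidate action
    neighbour_values : list (one entry per neighbour) of lists of that
                       neighbour's individual values, one per feasible action
    cond_probs       : same shape; conditional probability of each neighbour
                       action given the agent's candidate action
    alpha            : weight on the neighbour feedback (assumed affine form)
    """
    if not neighbour_values:
        return own_value  # no neighbours in the cluster: nothing to add
    # Feedback from one neighbour: expected value under the conditional distribution.
    feedbacks = [
        sum(p * v for p, v in zip(probs, vals))
        for probs, vals in zip(cond_probs, neighbour_values)
    ]
    average_feedback = sum(feedbacks) / len(feedbacks)
    return (1.0 - alpha) * own_value + alpha * average_feedback


value = cevd_value(2.0, [[1.0, 3.0], [4.0]], [[0.5, 0.5], [1.0]], alpha=0.5)
```

Because each combined value still belongs to a single agent and a single action, the ILP objective remains linear in the decision variables.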
What should be the functional form for the conditional probabilities? Each request has a pickup location. Recall the function that maps each intersection to its cluster. Every action is associated with some user request, so we define a second function that maps each action to the pickup intersection of its corresponding request. The composition of these two functions maps each action to a cluster via the pickup location of the action's corresponding user request. We also use the average travel time between two clusters. We model the conditional probability of a neighbouring agent taking one action, given that the agent takes another action, as a function of the average travel time between the clusters of the two actions, governed by a learnable parameter. The normalizing constant is computed by summing over the neighbouring agent's feasible actions.
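One plausible instance of such a conditional probability is a softmax over negative travel times between action clusters. This exponential form, its sign convention and the parameter name `beta` are assumptions made for illustration; the source only states that the probability depends on inter-cluster travel time, has a learnable parameter, and is normalized over the neighbour's feasible actions.

```python
import math


def conditional_probs(own_cluster, neighbour_action_clusters, travel_time, beta):
    """Probability of each neighbour action given the agent's own action.

    own_cluster               : cluster of the agent's candidate action
    neighbour_action_clusters : cluster of each feasible action of the neighbour
    travel_time[a][b]         : average travel time between clusters a and b
    beta                      : assumed temperature-like parameter; beta = 0
                                gives a uniform distribution in this sketch
    """
    weights = [
        math.exp(-beta * travel_time[own_cluster][cluster])
        for cluster in neighbour_action_clusters
    ]
    normalizer = sum(weights)  # sum over the neighbour's feasible actions
    return [w / normalizer for w in weights]


travel_time = [[0.0, 5.0], [5.0, 0.0]]
uniform = conditional_probs(0, [0, 1], travel_time, beta=0.0)
peaked = conditional_probs(0, [0, 1], travel_time, beta=0.2)
```

With a positive `beta` in this sketch, neighbour actions whose pickups lie in nearby clusters receive more weight, which reflects the intuition that nearby requests are the ones vehicles compete for.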
3.3 Over/under estimation of individual values in FV and DJV
NeurADP makes the optimistic assumption that each individual agent will get to take its best action in the next time step. Since the central ILP decides the joint action, this can result in an overestimation or underestimation of the individual agent value. Due to this and other issues, in our experiments we found that NeurADP values can have significant errors compared to the discounted future rewards, as shown in Figure 3. We fix this problem through two key enhancements:

Controlling the large variance of exogenous information: Recall from Equation 4 that, to calculate the values of post-action states, NeurADP uses the exogenous information, which is the global demand at the current time step. Even within a small number of consecutive epochs, the global demand displays significant variance. This results in individual values showing a large variance, instead of varying smoothly over time. We thus consider the expected discounted future demand (over a large but finite horizon), which varies smoothly over time, unlike the current demand. In our approach, we use this quantity as the exogenous information.

Enforcing values to be positive: NeurADP models the value function as a shared-parameter neural network. The final layer of this neural network is a fully connected multi-layer perceptron (MLP) whose range is the entire real line. However, as our objective function (number of requests served) is non-negative, negative values (admissible in NeurADP) are not reasonable. We thus apply a SoftPlus activation after the final MLP to ensure that the computed values are strictly positive.
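Both enhancements can be sketched in a few lines. The discounted-demand function below uses an observed demand series as a stand-in for the expectation, and starts the sum at the current epoch; both choices, along with the function and argument names, are assumptions made for illustration. The SoftPlus function is the standard `log(1 + exp(x))`, written in a numerically stable form.

```python
import math


def discounted_future_demand(demand, t, horizon, gamma):
    """Discounted sum of demand over `horizon` epochs starting at epoch t.

    `demand` is a list of per-epoch request counts (e.g. a historical
    average profile); epochs past the end of the list are ignored.
    """
    return sum(
        gamma ** k * demand[t + k]
        for k in range(horizon)
        if t + k < len(demand)
    )


def softplus(x):
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


smoothed = discounted_future_demand([10, 20, 30], t=0, horizon=2, gamma=0.5)
positive_value = softplus(-5.0)
```

The discounted sum changes gradually from one epoch to the next because consecutive windows share most of their terms, and SoftPlus maps any real-valued network output to a positive value while staying close to the identity for large inputs.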
To evaluate the impact of the above modifications (we call the resulting model NeurADP+), in Figure 3 we plot the total estimated value of the actions chosen for the vehicles at each time step against the discounted sum of the total number of requests served at subsequent time steps, for NeurADP, NeurADP+ and CEVD. Notice how the gap between the Estimated Value curve and the Discounted Reward curve is very small for NeurADP+ and CEVD (as it should be by the Bellman Equation 3), whereas the gap is quite significant for NeurADP. Note that the heights of the graphs differ: while NeurADP+ improves the quality of the estimation, CEVD is responsible for the bulk of the performance gain.
4 Algorithm
Given a post-decision state, we need a parameterized function to compute its value. Our joint function has three parameters: (i) the parameters of a neural-network-based individual agent value function estimator (as in NeurADP; in this section, by NeurADP we mean NeurADP updated with the proposals in Section 3.3); (ii) a parameter that controls the importance given to an agent's neighbours when taking the affine combination; and (iii) a parameter of the conditional probabilities that controls the relative importance given to different feasible actions of neighbours. We infer the parameters step by step. For a particular setting of the parameters, this function reduces to NeurADP. We first estimate the optimal neural network parameters (following the algorithm in NeurADP). This gives us the NeurADP parameters, which are a good starting point for estimating values at an individual level. To estimate the affine-combination parameter, we fix the conditional-probability parameter to the value that corresponds to a uniform distribution over actions, and do a linear search over a set of sampled points on the real line to find the value that is optimal for the ILP objective subject to the constraints in expression 2. At this stage, we have not changed the preference over actions from an individual perspective; however, for the central executor the values are now much more refined, as the individual over/under-estimates have been smoothed by considering the neighbours. Finally, we estimate the conditional-probability parameter by a linear search over a set of sampled points on the real line, again subject to the constraints in expression 2. We can now compute values on unseen data using the learned parameters.
5 Experiments
The goal of the experiments is to compare the performance of our approach, CEVD, to NeurADP [Shah et al. (2020)] (henceforth referred to as the baseline), which is the current best approach for solving the RMP. We make this comparison on a real-world dataset [NYYellowTaxi (2016)] across different RMP parameter settings. We quantify performance by comparing the service rate, i.e., the percentage improvement in the total number of requests served. We vary the following parameters: the maximum allowed waiting time from 90 seconds to 150 seconds, the number of vehicles from 500 to 1000 and the capacity from 4 to 5. The value of maximum allowable detour delay is taken as . The decision epoch duration is taken as 60 seconds.
Setup: We perform our experiments on the demand distribution from the publicly available New York Yellow Taxi Dataset [NYYellowTaxi (2016)]. The experimental setup is similar to the setup used by Shah et al. [Shah et al. (2020)]. Street intersections are used as the set of locations. They are identified by taking the street network of the city from OpenStreetMap using osmnx with the 'drive' network type [Boeing (2017)]. Nodes that do not have outgoing edges are removed, i.e., we take the largest strongly connected component of the network. The resulting network has 4373 locations (street intersections) and 9540 edges. The travel time on each road segment of the street network is taken as the daily mean travel time estimate computed using the method proposed in [Santi et al. (2014)]. We further cluster these intersections into K clusters using K-Means clustering based on the average travel times between different intersections. We choose the value of K based on the number of vehicles: K is chosen to be 100, 150 and 200 for 500, 750 and 1000 vehicles, respectively. Similar to previous work, we only consider the street network of Manhattan, as a majority (75%) of requests have both pickup and drop-off locations within it. The dataset contains data about past customer requests for taxis at different times of the day and different days of the week. From this dataset, we take the following fields: (1) pickup and drop-off locations (latitude and longitude coordinates), which are mapped to the nearest street intersection; (2) pickup time, which is converted to the appropriate decision epoch based on the decision epoch duration. The dataset contains on average 322714 requests in a day (on weekdays) and 19820 requests during the peak hour.

We evaluate the approaches over 24 hours on different days starting at midnight and take the average value over 5 weekdays (4-8 April 2016), running them with a single instance of initial random locations of taxis. (All experiments are run on a 60-core 3.8GHz Intel Xeon C2 processor with 240GB RAM. The algorithms are implemented in Python and the optimisation models are solved using CPLEX 20.1.) CEVD is trained using the data for 8 weekdays (23 March - 1 April 2016) and is validated on 22 March 2016. For the experimental analysis, we consider that all vehicles have identical capacities.
Results : We compare CEVD to NeurADP (referred to as baseline). Table 1 gives a detailed performance analysis for the service rates of CEVD and baseline. Here are some key observations:
Effect of changing the tolerance to delay: CEVD obtains a 9.37% improvement over the baseline approach for a maximum pickup delay of 90 seconds. The difference between the baseline and CEVD decreases as the allowed delay increases. A lower delay tolerance makes it difficult for vehicles to accept new requests while satisfying the constraints for already accepted requests. The interactions between neighbouring vehicles in CEVD prevent a vehicle from picking up requests that it values highly but that, given the delay constraints, would have been more suitable for some other vehicle; instead, it picks up requests that it might value less but that are still feasible. Thus the overall number of requests served increases.
Effect of changing the capacity: CEVD obtains a 9.76% gain over the baseline for capacity 5. The difference between the baseline and CEVD increases with capacity because, for higher-capacity vehicles, there is larger scope for improvement if vehicles cooperate well.
Effect of changing the number of vehicles: CEVD obtains a 9.37% improvement over the baseline for 500 vehicles. The difference between the baseline and CEVD decreases as the number of vehicles increases because, in the presence of a large number of vehicles, there will always be a vehicle that can serve the request. As a result, the quality of assignments plays a smaller role.
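| Varying parameter | Number of vehicles | Pickup delay (s) | Capacity | Baseline: requests served | Our approach: requests served | Percentage improvement (%) |
|---|---|---|---|---|---|---|
| Pickup delay | 500 | 90 | 4 | 90286.8 ± 2108.78 | 98748.2 ± 2449.38 | 9.37 ± 0.59 |
| Pickup delay | 500 | 120 | 4 | 103933.0 ± 2604.05 | 113184.0 ± 2774.00 | 8.90 ± 0.23 |
| Pickup delay | 500 | 150 | 4 | 113051.8 ± 2771.45 | 117351.6 ± 2818.49 | 3.80 ± 0.24 |
| Number of vehicles | 500 | 90 | 4 | 90286.8 ± 2108.78 | 98748.2 ± 2449.38 | 9.37 ± 0.59 |
| Number of vehicles | 750 | 90 | 4 | 129791.6 ± 3516.83 | 139365.6 ± 3853.98 | 7.38 ± 0.38 |
| Number of vehicles | 1000 | 90 | 4 | 165110.2 ± 4677.10 | 175453.4 ± 5173.82 | 6.26 ± 0.18 |
| Capacity | 500 | 90 | 4 | 90286.8 ± 2108.78 | 98748.2 ± 2449.38 | 9.37 ± 0.59 |
| Capacity | 500 | 90 | 5 | 91509.4 ± 2144.54 | 100443.8 ± 2502.34 | 9.76 ± 0.38 |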
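As a quick arithmetic check on how the improvement column relates to the requests-served columns, the following sketch computes the relative gain from the mean values. The paper's reported figure may instead be an average of per-day improvements, so treating it as a ratio of means is an assumption; the function name is hypothetical.

```python
def percentage_improvement(baseline_served, cevd_served):
    """Relative gain of CEVD over the baseline, in percent."""
    return 100.0 * (cevd_served - baseline_served) / baseline_served


# Mean requests served for 500 vehicles, 90 s pickup delay, capacity 4 and 5.
gain_capacity_4 = round(percentage_improvement(90286.8, 98748.2), 2)
gain_capacity_5 = round(percentage_improvement(91509.4, 100443.8), 2)
```

For these two rows the ratio of means rounds to 9.37 and 9.76 respectively, in agreement with the reported improvements.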
We further analyse the improvements obtained by CEVD over the baseline by comparing the number of requests served by both approaches at each decision epoch throughout the day. Figure 4 shows the number of requests served by the baseline and CEVD at different decision epochs (results for other settings are shown in the appendix). As shown in the figure, initially at night time, when the demand is low, both approaches serve all available demand. During the transition from the low-demand period to the high-demand period, the baseline algorithm starts to choose suboptimal actions without considering the impact of each vehicle's action on the whole system, while CEVD is able to capitalize on the joint action values and serve many more requests than the baseline.
The approach can be executed in real-time settings. The average time taken to compute each batch assignment using CEVD is less than 60 seconds in all cases (60 seconds is the decision epoch duration considered in the experiments). These results indicate that using our approach can help ride-pooling platforms to better meet customer demand.
6 Conclusion
On-demand ride pooling is challenging because it requires matching on a tripartite graph between user requests, trips (combinations of user requests) and vehicles. Improving on existing methods, we provide a scalable, novel value decomposition method based on conditional probabilities, in which the individual value is able to consider not only the future impact of current matches but also the impact on other agents' values. This new approach outperforms the best existing method in all settings of the benchmark taxi dataset employed for on-demand ride pooling, by margins of up to 9.76%. To put this result in perspective, an improvement of 1% is typically considered significant on ToD for an entire city [Xu et al. (2018); Lowalekar et al. (2019)].
References
Alonso-Mora et al. (2017). On-demand high-capacity ride-sharing via dynamic trip-vehicle assignment. Proceedings of the National Academy of Sciences, pp. 201611675.
Bei and Zhang (2018). Algorithms for trip-vehicle assignment in ride-sharing.
Boeing (2017). OSMnx: new methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems 65, pp. 126–139.
Dempster (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological).
Guestrin et al. (2002). Multiagent planning with factored MDPs. In Neural Information Processing Systems.
Huang et al. (2014). Large scale real-time ridesharing with service guarantee on road networks. Proceedings of the VLDB Endowment 7 (14), pp. 2017–2028.
Li et al. (2021). Deep implicit coordination graphs for multi-agent reinforcement learning. In International Conference on Autonomous Agents and MultiAgent Systems, AAMAS.
Lin et al. (2018). Efficient large-scale fleet management via multi-agent deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1774–1783.
Lowalekar et al. (2019). ZAC: A zone path construction approach for effective real-time ridesharing. In Proceedings of the Twenty-Ninth International Conference on Automated Planning and Scheduling, ICAPS 2019, Berkeley, CA, USA, July 11-15, 2019, pp. 528–538.
Lowalekar et al. (2021). Zone path construction (ZAC) based approaches for effective real-time ridesharing. Journal of Artificial Intelligence Research, JAIR.
Ma et al. (2013). T-share: a large-scale dynamic taxi ridesharing service. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 410–421.
NYYellowTaxi (2016). New York yellow taxi dataset. Note: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Parragh et al. (2008). A survey on pickup and delivery problems. Journal für Betriebswirtschaft 58 (1), pp. 21–51.
Powell (2007). Approximate dynamic programming: solving the curses of dimensionality. Vol. 703, John Wiley & Sons.
Rashid et al. (2018). QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, ICML.
Ritzinger et al. (2016). A survey on dynamic and stochastic vehicle routing problems. International Journal of Production Research 54 (1), pp. 215–231.
Ropke and Cordeau (2009). Branch and cut and price for the pickup and delivery problem with time windows. Transportation Science 43 (3), pp. 267–286.
Russell and Zimdars (2003). Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 656–663.
Santi et al. (2014). Quantifying the benefits of vehicle pooling with shareability networks. Proceedings of the National Academy of Sciences 111 (37), pp. 13290–13294.
Shah et al. (2020). Neural approximate dynamic programming for on-demand ride-pooling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 507–515.
Sunehag et al. (2018). Value decomposition networks for cooperative multi-agent learning based on team reward. In International Conference on Autonomous Agents and MultiAgent Systems, AAMAS.
Tong et al. (2018). A unified approach to route planning for shared mobility. Proceedings of the VLDB Endowment 11 (11), pp. 1633–1646.
Xu et al. (2018). Large-scale order dispatch in on-demand ride-hailing platforms: a learning and planning approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 905–913.
Appendix A Appendix
A.1 Ride-pool Matching Problem, RMP
Here, we provide specific details of the RMP problem.

Road network: Following Alonso-Mora et al. (2017), the road network is represented by a weighted graph whose nodes are the street intersections and whose weighted edges define the adjacency of these intersections and capture the travel time for each road segment. We assume that vehicles only pick up and drop people off at intersections.

Requests: The combination of requests that we observe at each decision epoch. Each request is represented by a tuple consisting of its origin, its destination and its arrival epoch.

Vehicles: The set of resources/vehicles, where each element is represented by a tuple consisting of the capacity of the vehicle (i.e., the maximum number of passengers it can carry simultaneously), its current position, and the ordered list of locations that the vehicle should visit next to satisfy the requests currently assigned to it.

Delay constraints: The set of constraints on delay. The maximum allowed pickup delay bounds the difference between the arrival time of a request and the time at which a vehicle picks the user up. The maximum allowed detour delay bounds the difference between the time at which the user arrives at their destination in a shared cab and the time at which they would have arrived had they taken a single-passenger cab.

Decision epoch: The duration of each decision epoch.

Objective: The objective assigns a value to serving each request at each decision epoch. The goal of the online assignment problem is to maximize the overall objective over a given time horizon.
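The request and vehicle tuples described above map naturally onto small record types. The sketch below uses Python dataclasses; the class and field names, and the use of integer intersection identifiers, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Request:
    """A user request: origin, destination and arrival epoch."""
    origin: int          # intersection id of the pickup location
    destination: int     # intersection id of the drop-off location
    arrival_epoch: int   # decision epoch at which the request arrived


@dataclass
class Vehicle:
    """A vehicle: capacity, current position and planned route."""
    capacity: int        # maximum number of simultaneous passengers
    position: int        # intersection id of the current location
    route: List[int] = field(default_factory=list)  # ordered locations to visit next


request = Request(origin=12, destination=87, arrival_epoch=3)
vehicle = Vehicle(capacity=4, position=10)
vehicle.route.extend([request.origin, request.destination])
```

Requests are made immutable here because their attributes never change once observed, while a vehicle's position and route are updated at every decision epoch.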
A.2 Related Work
There are three main threads of existing work in solving RMP problems:
(i) The first set of approaches are traditional planning approaches that model RMP as an optimization problem [Ropke and Cordeau (2009); Ritzinger et al. (2016); Parragh et al. (2008)]. The problem with this class of approaches is that they do not scale to on-demand city-scale scenarios.
(ii) The second set of approaches focuses on making the best greedy assignments [Ma et al. (2013); Tong et al. (2018); Huang et al. (2014); Lowalekar et al. (2019); Alonso-Mora et al. (2017)]. While these scale well, they are myopic and, as a result, do not consider the impact of a given assignment on future assignments.
(iii) The third thread of methods [Shah et al. (2020); Lowalekar et al. (2021)] focuses on the use of Reinforcement Learning (RL) or online Multi-Stage Stochastic Optimization to address the myopia associated with approaches from the second category. This set of approaches considers the future impact of current matches through the use of individual value functions, but achieves this by ignoring the impact of other agents on the value (e.g., future revenue) of a vehicle.
With respect to our technical contributions on value decomposition, we extend the work of Guestrin et al. [Guestrin et al. (2002)] and Li et al. [Li et al. (2021)], which was previously applicable to tens of agents with coordination graphs, to scale to hundreds/thousands of homogeneous agents with no explicit coordination graphs. The key difference lies in the use of conditional probability based dependencies amongst neighbouring agents.
Another relevant line of work is the value decomposition approaches [Rashid et al. (2018); Sunehag et al. (2018)] developed for solving cooperative Multi-Agent Reinforcement Learning (MARL). These approaches were developed to solve problems with tens of agents under a centralized training and decentralized execution setup, whereas we focus on problems with hundreds or thousands of agents with centralized training and centralized execution (e.g., Uber, Lyft, Grab).
A.3 Pseudocode
A.4 NeurADP Algorithm
A.5 Additional Plots
We provide additional graphs of number of requests served in Figure 5 with different settings.
A.6 Parameter Ranges for CEVD
To estimate the affine-combination parameter, we uniformly sample points in a bounded range and choose as optimal the value giving the largest rewards over a horizon. Similarly, to estimate the conditional-probability parameter, we uniformly sample points in a bounded range and choose as optimal the value giving the largest rewards over a horizon, after the affine-combination parameter has been chosen.
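The step-by-step search described above can be sketched as two one-dimensional grid searches. The `evaluate` callback is hypothetical: it is assumed to run the assignment procedure over a validation horizon with the given parameters and return the total reward. The names `alpha` and `beta`, the grids, and the convention that `beta = 0` yields uniform conditional probabilities are likewise assumptions.

```python
def sequential_line_search(evaluate, alpha_grid, beta_grid, beta_uniform=0.0):
    """Choose alpha first (with uniform conditional probabilities), then beta.

    evaluate(alpha, beta) -> total reward over a validation horizon
    (hypothetical callback standing in for running the full pipeline).
    """
    # Step 1: search alpha with the conditional distribution held uniform.
    best_alpha = max(alpha_grid, key=lambda a: evaluate(a, beta_uniform))
    # Step 2: search beta with alpha fixed at its chosen value.
    best_beta = max(beta_grid, key=lambda b: evaluate(best_alpha, b))
    return best_alpha, best_beta


def toy_reward(alpha, beta):
    """Stand-in reward surface with a single peak, used only for illustration."""
    return -(alpha - 0.3) ** 2 - (beta - 1.0) ** 2


alpha_grid = [i / 10 for i in range(11)]   # uniformly sampled points in [0, 1]
beta_grid = [0.0, 0.5, 1.0, 1.5]
best = sequential_line_search(toy_reward, alpha_grid, beta_grid)
```

Searching the two parameters one at a time needs only as many evaluations as the two grids have points in total, rather than one evaluation per pair, at the cost of possibly missing interactions between the parameters.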