Tolling for Constraint Satisfaction in Markov Decision Process Congestion Games

by Sarah H. Q. Li, et al.
University of Washington

The Markov decision process (MDP) congestion game is an extension of the classic congestion game, in which a continuous population of selfish agents solves Markov decision processes with congestion: the payoff of a strategy decreases as more of the population uses it. We draw parallels between key concepts from capacitated congestion games and MDPs. In particular, we show that population mass constraints in MDP congestion games are equivalent to imposing tolls/incentives on the reward function, which social planners can use to achieve auxiliary objectives. We demonstrate such methods in a simulated Seattle ride-share model, where tolls and incentives are enforced for two separate objectives: to guarantee a minimum driver density in downtown Seattle, and to shift the game equilibrium towards a maximum social output.




I Introduction

We consider a class of non-cooperative games, Markov decision process congestion games (MDPCG) [1, 2], which combine features of classic nonatomic routing games [3, 4, 5], i.e. games where a continuous population of agents each solve a shortest path problem, and stochastic games [6, 7], i.e. games where each agent solves a Markov decision process (MDP). In MDP congestion games, similar to mean field games with congestion effects [8, 9], a continuous population of selfish agents each solve an MDP with congestion effects on its state-action rewards: the payoff of a strategy decreases as more population mass chooses it. An equilibrium concept for MDPCGs akin to the Wardrop equilibrium [3] for routing games was introduced in [1].

In this paper, we consider modifying the game rewards of an MDPCG to enforce artificial state constraints that may arise at the system level. For example, in a traffic network with selfish users, tolls can be used to lower the traffic in certain neighbourhoods to decrease ambient noise. Drawing on techniques from capacitated routing games [10, 11] and constrained MDPs [12, 13], we derive reward modification methods that shift the game equilibrium mass distribution. Constraints may also arise when a central agent, which we call a social planner, enforces them to improve user performance as measured by an alternative objective. Equilibria of MDPCGs have been shown to exhibit inefficiencies similar to those of classic routing games [14, 15]. As in routing games, we show how reward adjustments can minimize the gap between the equilibrium distribution and the socially optimal distribution [16, 17].

Since MDPCG models selfish population behaviour under stochastic dynamics, our constraint enforcing methods can be considered as an incentive design framework. One practical application in particular is modifying the equilibrium behaviour of ride-sharing drivers competing in an urban setting. Ride-share has become a significant component of urban mobility in the past decade [18]. As data becomes more readily available and computation more automated, drivers will have the option of employing sophisticated strategies to optimize their profits; as indicated in popular media, a number of mechanisms are already available to support strategic decision-making by ride-sharing drivers [19, 20, 21]. This motivates game theoretic models of ride-sharing competition [22]: while rational drivers only seek to optimize their individual profits, ride-sharing companies may choose to incentivize driver behaviours that serve other objectives, such as maintaining driver coverage over large urban areas with varied rider demand and increasing overall profits.

The rest of the paper is organized as follows. Section II provides a discussion of related work. In Section III, we introduce the optimization model of MDPCGs and highlight the relationship between the classical congestion game equilibrium, i.e. the Wardrop equilibrium, and Q-value functions from the MDP literature. Section IV-A shows how a social planner can shift the game equilibrium through reward adjustments. Section IV-B adopts the Frank-Wolfe numerical method [23] to solve for the game equilibrium and provides an online interpretation of Frank-Wolfe in the context of MDPCGs. Section V provides an illustrative application of MDPCGs, in which agents repeatedly play a ride-share game while the social planner either enforces population constraints or improves the social welfare. Section VI concludes and comments on future work.

II Related Work

Stochastic population games were first studied in the literature as anonymous sequential games [24, 25, 26, 27]. Recent developments in stochastic population games have come from the mean field game [8, 28] community. Our work is related to potential mean field games [8, 29] in discrete time, discrete state space [30] and mean field games on graphs [31, 9].

Our work can also be thought of as a continuous population potential game [32] where the strategies are policies of an MDP or as a modification of classic nonatomic routing games [5] where routes have been replaced by policies.

Techniques for cost modification to satisfy capacity constraints in nonatomic routing games were developed in [10]. See also [5, Sec. 2.8.2] for a discussion of tolling to enforce side-constraints and [5, Sec. 2.4] for a discussion of tolling to improve social welfare in routing games.

III MDP Congestion Games

We consider a continuous population of selfish agents each solving a finite-horizon MDP with horizon of length $T$, finite state space $[S]$, and finite action space $[A]$. We use the notation $[K] = \{1, \ldots, K\}$ to denote the integer set of length $K$.

The population mass distribution, $y \in \mathbb{R}^{T \times S \times A}$, is defined for each time step $t \in [T]$, state $s \in [S]$, and action $a \in [A]$. $y_{tsa}$ is the population mass in state $s$ taking action $a$ at time $t$, and $\sum_{a} y_{tsa}$ is the total population mass in state $s$ at time $t$.


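Let $P \in \mathbb{R}^{(T-1) \times S \times S \times A}$ be a stochastic transition tensor. $P_{t s' s a}$ defines the probability of transitioning to state $s'$ in stage $t+1$ when action $a$ is chosen in state $s$ at stage $t$. The transition tensor is defined such that

$$P_{t s' s a} \geq 0, \qquad \sum_{s' \in [S]} P_{t s' s a} = 1, \qquad \forall\, t \in [T-1],\ s \in [S],\ a \in [A].$$

The population mass distribution obeys the stochastic mass propagation equation

$$\sum_{a \in [A]} y_{1 s a} = p_s, \qquad \sum_{a \in [A]} y_{t+1, s, a} = \sum_{s' \in [S]} \sum_{a \in [A]} P_{t s s' a}\, y_{t s' a}, \qquad \forall\, t \in [T-1],\ s \in [S], \tag{1}$$

where $p_s$ is the initial population mass in state $s$.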

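The reward of each time-state-action triplet is given by a function $r_{tsa} : \mathbb{R}^{T \times S \times A} \to \mathbb{R}$. $r_{tsa}(y)$ is the reward for taking action $a$ in state $s$ at time $t$ for a given population distribution $y$. One important case is where $r_{tsa}(y)$ simply depends on $y_{tsa}$, i.e. there exist functions $\ell_{tsa} : \mathbb{R} \to \mathbb{R}$ such that

$$r_{tsa}(y) = \ell_{tsa}(y_{tsa}), \tag{2}$$

where $y_{tsa} = e_{tsa}^\top y$ and $e_{tsa}$ is an indicator vector for the triplet $(t, s, a)$, with $y$ treated as a vector. We say the game is a congestion game if the rewards have the form of (2) and the functions $\ell_{tsa}$ satisfy the following assumption.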

Assumption 1.

$\ell_{tsa}(y_{tsa})$ is a strictly decreasing continuous function of $y_{tsa}$ for each $t \in [T]$, $s \in [S]$, $a \in [A]$.

Intuitively, the reward of each time-state-action triplet decreases as more members of the population choose that state-action pair at that time. We will use $r(y)$ or $\ell(y)$ to refer to the tensor of all reward functions in each case.

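Each member of the population solves an MDP with population dependent rewards $r(y)$. As in the MDP literature, we define Q-value functions for each time $t$ and state-action pair $(s, a)$ as

$$Q_{tsa}(y) = \begin{cases} r_{tsa}(y) + \sum_{s'} P_{t s' s a} \max_{a'} Q_{t+1, s', a'}(y), & t \in [T-1], \\ r_{tsa}(y), & t = T. \end{cases} \tag{3}$$

In the game context, the Q-value function $Q_{tsa}(y)$ represents the distribution dependent payoff that the population receives when choosing action $a$ at $(t, s)$. The Q-value functions can be used to define an equilibrium akin to the Wardrop equilibrium of routing games [1].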
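The backward recursion behind the Q-value functions can be sketched in a few lines. This is a minimal sketch for a fixed population distribution, so the rewards are plain numbers; the function name `q_values` and the index layout `P[t][s][a][s2]` (probability of moving from `s` to `s2` under action `a`) are assumptions made for this illustration.

```python
def q_values(r, P):
    """Finite-horizon Q-values by backward induction.

    r[t][s][a]     : reward at time t, state s, action a (population held fixed)
    P[t][s][a][s2] : probability of moving from s to s2 under action a at time t
                     (assumed layout; P has one entry fewer than r along time)
    """
    T, S, A = len(r), len(r[0]), len(r[0][0])
    Q = [[[0.0] * A for _ in range(S)] for _ in range(T)]
    for t in reversed(range(T)):
        for s in range(S):
            for a in range(A):
                Q[t][s][a] = r[t][s][a]
                if t < T - 1:
                    # expected value of acting optimally from the next stage on
                    Q[t][s][a] += sum(
                        P[t][s][a][s2] * max(Q[t + 1][s2]) for s2 in range(S)
                    )
    return Q


# Two stages, two states; action 0 stays put, action 1 switches state.
r = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [3.0, 2.0]]]
P = [[[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]]
Q = q_values(r, P)
```

In this toy instance the first-stage Q-values only reflect which second-stage state an action leads to, since all first-stage rewards are zero.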

Definition 1 (Wardrop Equilibrium [1]).

A population distribution over time-state-action triplets, $\{y_{tsa}\}$, is an MDP Wardrop equilibrium for the corresponding MDPCG, if for any $(t, s)$, $y_{tsa} > 0$ implies

$$Q_{tsa}(y) \geq Q_{tsa'}(y), \qquad \forall\, a' \neq a,\ a' \in [A]. \tag{4}$$

Intuitively, Definition 1 amounts to the fact that at every state and time, population members only choose actions that are optimal.
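The equilibrium condition can be checked numerically once the Q-values at a candidate distribution are available. The following is a minimal sketch; the function name `is_wardrop`, the nested-list layout `y[t][s][a]`, and the tolerance are assumptions made for illustration.

```python
def is_wardrop(y, Q, tol=1e-9):
    """Return True if every action carrying positive mass attains the
    maximal Q-value at its (time, state) pair, up to the tolerance."""
    for t in range(len(y)):
        for s in range(len(y[t])):
            best = max(Q[t][s])
            for a, mass in enumerate(y[t][s]):
                if mass > tol and Q[t][s][a] < best - tol:
                    return False  # mass sits on a strictly suboptimal action
    return True


# One stage, one state, two actions with Q-values 2 and 1.
all_on_best = is_wardrop([[[1.0, 0.0]]], [[[2.0, 1.0]]])
split_mass = is_wardrop([[[0.5, 0.5]]], [[[2.0, 1.0]]])
```

Here `all_on_best` is true because only the better action is used, while `split_mass` is false because half of the mass uses the worse action.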

When the game rewards satisfy Assumption 1, the MDPCG can be characterized as a potential game.

Definition 2 (Potential Game [32, 1]).

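We say that the MDPCG associated with rewards $r(y)$ is a potential game if there exists a continuously differentiable function $F$ such that

$$\frac{\partial F}{\partial y_{tsa}}(y) = r_{tsa}(y).$$

In the specific case when the rewards have form (2), we can use the potential function

$$F(y) = \sum_{t \in [T]} \sum_{s \in [S]} \sum_{a \in [A]} \int_0^{y_{tsa}} \ell_{tsa}(x)\, dx. \tag{5}$$
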
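As shown in [1, Theorem 1.3], given a potential function $F$, the equilibrium of the finite-horizon MDPCG can be found by solving the following optimization problem for an initial population distribution $p$.

$$\begin{aligned} \max_{y} \quad & F(y) \\ \text{s.t.} \quad & \sum_{a} y_{t+1, s, a} = \sum_{s'} \sum_{a} P_{t s s' a}\, y_{t s' a}, \quad \forall\, t \in [T-1],\ s \in [S], \\ & \sum_{a} y_{1 s a} = p_s, \quad \forall\, s \in [S], \\ & y_{tsa} \geq 0, \quad \forall\, t \in [T],\ s \in [S],\ a \in [A]. \end{aligned} \tag{6}$$

The proof that the optimizer of (6) is a Wardrop equilibrium relies on the fact that the Q-value functions (3) are encoded in the KKT optimality conditions of the problem. The equilibrium condition (4) is then specifically derived from the complementary slackness condition [1]. When $F$ has form (5) and Assumption 1 is satisfied, $F$ is strictly concave, and MDPCG (6) has a unique Wardrop equilibrium.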
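The way the KKT conditions encode the Q-values can be sketched as follows. The notation is an assumption made for this illustration: $y_{tsa}$ is the mass taking action $a$ in state $s$ at time $t$, $P_{ts'sa}$ is the probability of reaching $s'$ from $s$ under $a$, $p_s$ is the initial mass, $V_{ts}$ are multipliers for the mass propagation constraints, and $\mu_{tsa} \geq 0$ are multipliers for $y_{tsa} \geq 0$.

```latex
% Lagrangian of: max F(y) subject to mass propagation and y >= 0
\begin{align*}
L(y, V, \mu) &= F(y) + \sum_{t,s,a} \mu_{tsa}\, y_{tsa}
   + \sum_{s} V_{1s}\Big(p_s - \sum_{a} y_{1sa}\Big) \\
 &\quad + \sum_{t=1}^{T-1} \sum_{s'} V_{t+1,s'}
   \Big(\sum_{s,a} P_{ts'sa}\, y_{tsa} - \sum_{a} y_{t+1,s',a}\Big).
\end{align*}
% Stationarity in y_{tsa}, using \partial F / \partial y_{tsa} = r_{tsa}(y):
\begin{align*}
V_{ts} &= r_{tsa}(y) + \sum_{s'} P_{ts'sa}\, V_{t+1,s'} + \mu_{tsa},
   && t < T, \\
V_{Ts} &= r_{Tsa}(y) + \mu_{Tsa}.
\end{align*}
% Complementary slackness: \mu_{tsa}\, y_{tsa} = 0 with \mu_{tsa} \ge 0, hence
\begin{align*}
V_{ts} \;\ge\; r_{tsa}(y) + \sum_{s'} P_{ts'sa}\, V_{t+1,s'}
\quad \text{with equality whenever } y_{tsa} > 0 .
\end{align*}
```

On states that carry positive mass, $V_{ts}$ therefore plays the role of $\max_{a} Q_{tsa}(y)$, and an action with $y_{tsa} > 0$ attains this maximum, which is the Wardrop condition.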

IV Constrained MDPCG

In this section, we analyze the problem of shifting the game equilibrium by augmenting players’ reward functions. In Section IV-A, we show that introducing constraints causes the optimal population distribution to obey the Wardrop equilibrium for a new set of Q-value functions. Section IV-B outlines the Frank-Wolfe numerical method for solving a constrained MDPCG and provides a population behavioural interpretation of the numerical method.

IV-A Planning Perspective: Model and Constraints

The Wardrop equilibrium of an MDP congestion game is given by (6). The planner may use additional constraints to achieve auxiliary global objectives. For example, in a city’s traffic network, certain roads may pass through residential neighbourhoods. A city planner may wish to artificially limit traffic levels to ensure residents’ wellbeing.

We consider the case where the social planner wants the equilibrium population distribution to satisfy constraints of the form

$$g_i(y) \geq 0, \qquad \forall\, i \in \mathcal{I}, \tag{7}$$

where $g_i$ are continuously differentiable concave functions.

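The social planner cannot explicitly constrain players’ behaviour, but rather seeks to add incentive functions $f_{tsa}(y)$ to the reward functions in order to shift the equilibrium to be within the constrained set defined by (7). The modified rewards have form

$$\bar{r}_{tsa}(y) = r_{tsa}(y) + f_{tsa}(y). \tag{8}$$

To determine the incentive functions, the social planner first solves the constrained optimization problem

$$\begin{aligned} \max_{y} \quad & F(y) \\ \text{s.t.} \quad & \text{the constraints of (6)}, \\ & g_i(y) \geq 0, \quad \forall\, i \in \mathcal{I}, \end{aligned} \tag{9}$$

where the last line of (9) is referred to as (9a), and then computes the incentive functions as

$$f_{tsa}(y) = \sum_{i \in \mathcal{I}} \tau_i^\star\, \frac{\partial g_i}{\partial y_{tsa}}(y), \tag{10}$$

where $\{\tau_i^\star\}$ are the optimal Lagrange multipliers associated with the additional constraints (7).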

The following theorem shows that the Wardrop equilibrium of the MDPCG with modified rewards in (8) satisfies the new constraints in (9a).

Theorem 1.

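Let the MDPCG (6) with rewards $r(y)$ be a potential game with a strictly concave potential function $F(y)$. If $\hat{y}$ is a Wardrop equilibrium for a modified MDPCG with reward functions

$$\bar{r}_{tsa}(y) = r_{tsa}(y) + \sum_{i \in \mathcal{I}} \tau_i^\star\, \frac{\partial g_i}{\partial y_{tsa}}(y), \tag{11}$$

then $\hat{y}$ also solves (9) and thus satisfies the additional constraints (7).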


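Proof. The Lagrangian of (9) is given by

$$\begin{aligned} L(y, \mu, V, \tau) = {} & F(y) + \sum_{t,s,a} \mu_{tsa}\, y_{tsa} + \sum_{i \in \mathcal{I}} \tau_i\, g_i(y) + \sum_{s} V_{1s}\Big(p_s - \sum_{a} y_{1sa}\Big) \\ & + \sum_{t \in [T-1]} \sum_{s} V_{t+1,s}\Big(\sum_{s',a} P_{t s s' a}\, y_{t s' a} - \sum_{a} y_{t+1,s,a}\Big), \end{aligned} \tag{12}$$

with $\mu \geq 0$ and $\tau \geq 0$, and note that by strict concavity

$$\max_{y}\ L(y, \mu^\star, V^\star, \tau^\star), \tag{13}$$

where $(\mu^\star, V^\star, \tau^\star)$ are the optimal dual variables of (9), has a unique solution, which we denote by $y^\star$. We then note that

$$\bar{F}(y) = F(y) + \sum_{i \in \mathcal{I}} \tau_i^\star\, g_i(y) \tag{14}$$

is a potential function for the MDPCG with modified rewards (11). Since $F$ is strictly concave, $g_i$ is concave, and $\tau_i^\star$ is nonnegative, $\bar{F}$ is strictly concave. The equilibrium for the MDPCG with modified rewards can be computed by solving (6) with $\bar{F}$ as the objective.

The Lagrangian for (6) with $\bar{F}$ is given by

$$\bar{L}(y, \mu, V) = L(y, \mu, V, \tau^\star). \tag{15}$$

Again by strict concavity,

$$\max_{y}\ \bar{L}(y, \mu^\star, V^\star)$$

has a unique solution, which we denote as $\hat{y}$. It follows that $\hat{y} = y^\star$. Thus the equilibrium of the game with modified rewards, $\hat{y}$, satisfies $g_i(\hat{y}) \geq 0$ as desired. ∎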

For the social planner, Theorem 1 has the following interpretation: in order to impose constraints of form (7) on an MDPCG, the planner could solve the constrained game (9) for the optimal dual variables and offer incentives of form (10).
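As a minimal illustration of this recipe, consider a one-shot game (a single state and time step) with two actions and linear rewards $\ell_a(x) = c_a - b_a x$, for which both the equilibrium and the optimal multiplier have closed forms. The example, its function names, and its numbers are assumptions made for illustration and are not taken from the ride-share model.

```python
def equilibrium_split(c1, b1, c2, b2, mass):
    """Mass on action 1 at the Wardrop equilibrium of a one-shot game with
    rewards l_a(x) = c_a - b_a * x, b_a > 0 (rewards equalize when both
    actions are used)."""
    y1 = (c1 - c2 + b2 * mass) / (b1 + b2)
    return min(max(y1, 0.0), mass)


def toll_for_minimum_mass(c1, b1, c2, b2, mass, d):
    """Optimal multiplier for the constraint g(y) = y1 - d >= 0, 0 <= d < mass.

    If the unconstrained equilibrium is already feasible the multiplier is 0.
    Otherwise the constraint is active (y1 = d) and the multiplier makes up
    the reward difference between the two actions at that split.
    """
    if equilibrium_split(c1, b1, c2, b2, mass) >= d:
        return 0.0
    return (c2 - b2 * (mass - d)) - (c1 - b1 * d)


c1, b1, c2, b2, mass, d = 4.0, 1.0, 10.0, 1.0, 10.0, 4.0
y1_before = equilibrium_split(c1, b1, c2, b2, mass)
tau = toll_for_minimum_mass(c1, b1, c2, b2, mass, d)
# dg/dy1 = 1 and dg/dy2 = 0, so the incentive is tau on action 1 only.
y1_after = equilibrium_split(c1 + tau, b1, c2, b2, mass)
```

With these numbers the unconstrained equilibrium places less than `d` on action 1, and adding the constant incentive `tau` to action 1 moves the equilibrium exactly onto the constraint boundary, without the planner dictating any player's choice.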

IV-B Population Perspective: Numerical Method

After the social planner has offered incentives, the population plays the Wardrop equilibrium defined by the modified rewards (8); this equilibrium can be computed using the Frank-Wolfe (FW) method [23], given in Algorithm 3, with the optimal dual variables known.

FW is a numerical method for convex optimization problems with continuously differentiable objectives and compact feasible sets [33], including routing games. One advantage of this learning paradigm is that the population does not need to know the reward functions themselves. Instead, they simply react to the realized rewards of the previous game at each iteration. It also provides an interpretation for how a Wardrop equilibrium might be asymptotically reached by agents in an MDPCG in an online fashion.

Assume that we have a repeated game play, where players execute a fixed strategy determined at the start of each game. At the end of each game, the rewards realized in that game under the current population distribution are revealed to all players. FW models the population as having two sub-types: adventurous and conservative. Upon receiving this reward information, the adventurous population decides to change its strategy while the conservative population does not. To determine its new strategy, the adventurous population uses value iteration on the latest reward information (Algorithm 1) to compute a new optimal policy. Their resultant density trajectory is then computed using Algorithm 2. The step size at each iteration is equivalent to the fraction of the total population that switches strategy. The stopping criterion for the FW algorithm is determined by the Wardrop equilibrium notion: as the population iteratively gets closer to an optimal strategy, the marginal increase in potential decreases to zero.

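Algorithm 1 Value Iteration Method
Input: $r$, $P$. Output: $\pi$.
for $t = T, \ldots, 1$ do
     for each $s \in [S]$ do
          $V_{ts} = \max_{a} Q_{tsa}$    Eqn. (3)
          $\pi_{ts} = \arg\max_{a} Q_{tsa}$    Eqn. (3)
     end for
end for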
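Algorithm 2 Retrieving density trajectory from a policy
Input: $P$, $p$, $\pi$. Output: $d$ (initialized to zero).
for $t = 1, \ldots, T$ do
     for each $s \in [S]$, $a \in [A]$ do
         if $a = \pi_{ts}$ then
              $d_{tsa} = p_s$ if $t = 1$, otherwise $d_{tsa} = \sum_{s', a'} P_{t-1, s, s', a'}\, d_{t-1, s', a'}$
         end if
     end for
end for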
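Algorithm 3 Frank-Wolfe Method with Value Iteration
Input: $\ell$, $P$, $p$, $N$, $\epsilon$. Output: $y^k$. Initialize a feasible $y^0$.
for $k = 0, 1, \ldots, N$ do
      $\pi^k$ = ValueIteration($\ell(y^k)$, $P$)    Alg. 1
      $d^k$ = RetrieveDensity($\pi^k$)    Alg. 2
      $y^{k+1} = (1 - \alpha^k)\, y^k + \alpha^k d^k$
     Stop if the marginal increase in potential is below $\epsilon$
end for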

In contrast to implementations of FW in the routing game literature, Algorithm 3’s descent direction is determined by solving an MDP [34, Section 4.5] as opposed to a shortest path problem from origin to destination [5, Sec. 4.1.3]. Algorithm 3 is guaranteed to converge to a Wardrop equilibrium if the predetermined step sizes decrease to zero as a harmonic series [23]. FW with predetermined step sizes has been shown to have sub-linear worst case convergence in routing games [33]. On the other hand, replacing fixed step sizes with optimal step sizes found by a line search method leads to a much better convergence rate.
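The loop structure of Frank-Wolfe with value iteration can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `best_response` and `frank_wolfe`, the layout `P[t][s][a][s2]`, the predetermined step size $2/(k+2)$, and the stopping rule (the standard Frank-Wolfe gap) are all choices made for this illustration.

```python
def best_response(r, P, p0):
    """Value iteration on fixed rewards r[t][s][a], then forward propagation
    of the initial mass p0 under the resulting deterministic policy."""
    T, S, A = len(r), len(r[0]), len(r[0][0])
    V = [0.0] * S  # value beyond the horizon is zero
    policy = [[0] * S for _ in range(T)]
    for t in reversed(range(T)):
        new_V = [0.0] * S
        for s in range(S):
            q = [r[t][s][a]
                 + (sum(P[t][s][a][s2] * V[s2] for s2 in range(S))
                    if t < T - 1 else 0.0)
                 for a in range(A)]
            policy[t][s] = max(range(A), key=q.__getitem__)
            new_V[s] = q[policy[t][s]]
        V = new_V
    d = [[[0.0] * A for _ in range(S)] for _ in range(T)]
    mass = list(p0)
    for t in range(T):
        nxt = [0.0] * S
        for s in range(S):
            a = policy[t][s]
            d[t][s][a] = mass[s]  # all mass in s follows the policy action
            if t < T - 1:
                for s2 in range(S):
                    nxt[s2] += P[t][s][a][s2] * mass[s]
        mass = nxt
    return d


def frank_wolfe(reward, P, p0, T, A, iters=5000, tol=1e-6):
    """reward(t, s, a, x) returns the congestion reward at mass x."""
    S = len(p0)
    idx = [(t, s, a) for t in range(T) for s in range(S) for a in range(A)]

    def rewards_at(y):
        return [[[reward(t, s, a, y[t][s][a]) for a in range(A)]
                 for s in range(S)] for t in range(T)]

    zero = [[[0.0] * A for _ in range(S)] for _ in range(T)]
    y = best_response(rewards_at(zero), P, p0)  # feasible starting point
    for k in range(1, iters + 1):
        r = rewards_at(y)
        d = best_response(r, P, p0)
        # Frank-Wolfe gap: first-order gain available from switching to d
        gap = sum(r[t][s][a] * (d[t][s][a] - y[t][s][a]) for t, s, a in idx)
        if gap <= tol:
            break
        alpha = 2.0 / (k + 2)  # fraction of the population that switches
        for t, s, a in idx:
            y[t][s][a] += alpha * (d[t][s][a] - y[t][s][a])
    return y


# One-shot game: rewards 4 - x and 10 - x, total mass 10.
y = frank_wolfe(lambda t, s, a, x: [4.0, 10.0][a] - x, [], [10.0], T=1, A=2)
# y[0][0] approaches the split [2, 8], where both rewards equal 2.
```

Because every iterate is a convex combination of feasible density trajectories, mass is conserved at each time step throughout the run, which matches the interpretation of the step size as the fraction of the population that changes strategy.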

V Numerical example

In this section, we apply the techniques developed in Section IV to model competition among ride-sharing drivers in metro Seattle. Using the set up described in Section V-A, we demonstrate how a ride-share company takes on the role of social planner and shifts the equilibrium of the driver game in the following two scenarios:

  • Ensuring minimum driver density in various neighborhoods (Section V-B).

  • Improving the social welfare (Section V-C).

V-A Ride-sharing Model

Fig. 1: State representation of metro Seattle.

Consider a ride-share scenario in metro Seattle, where rational ride-sharing drivers seek to optimize their profits while repeatedly working Friday nights. Assume that the demand for riders is constant during the game and for each game play. The time step is taken to be 15 minutes, i.e. the average time for a ride, after which the driver needs to take a new action.

We model Seattle’s individual neighbourhoods as an abstract set of states, as shown in Fig. 1. Adjacent neighbourhoods are connected by edges. The following states are characterized as residential: ‘Ballard’ (3), ‘Fremont’ (4), ‘Sand Point’ (8), ‘Magnolia’ (9), ‘Green Lake’ (11), ‘Ravenna’ (12). Assume drivers have equal probabilities of starting from any of the residential neighbourhoods.

Because drivers cannot see a rider’s destination until after accepting a ride, the game has MDP dynamics. At each state, drivers can choose between two types of action: wait for a rider in the current state, or transition to an adjacent state. When choosing to wait, we assume the driver will eventually pick up a rider, though it may take more time if there are many drivers waiting for riders in that neighborhood. Longer wait times decrease the driver’s reward for choosing to wait.

On the other hand, there are two possible scenarios when drivers choose to transition. The driver either drives to the adjacent state and pays the travel costs without receiving a fare, or picks up a rider in the current state. We allow the second scenario with a small probability to model the possibility of drivers deviating from their predetermined strategy during game play.

The transition probabilities for each action at each state are given in (16), in terms of the set of neighbouring states and the number of neighbouring states of that state.


The reward function for taking each action is given by

where is the monetary cost for transitioning from state to , is the travel cost from state to , is the coefficient of the cost of waiting for a rider. We compute these various parameters as


where is the congestion effect from drivers who all decide to traverse from to , and is a time-money tradeoff parameter, computed as

Here the average trip length is equivalent to the average distance between neighbouring states. The values independent of specific transitions are listed in Table I.

Rate     Velocity   Fuel Price   Fuel Eff    Time-money tradeoff   Average trip length
$6/mi    8 mph      $2.5/gal     20 mi/gal   $27/hr                1.25 mi
TABLE I: Parameters for the driver reward function.

V-B Ensuring Minimum Driver Density

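To ensure rider satisfaction, the ride-share company aims for a minimum driver coverage of 10 drivers in ‘Belltown’ (state 7), a neighborhood with highly variable rider demand. To this end, they solve the optimization problem in (9), where the constraints (9a) on state 7 take on the form

$$g_t(y) = \sum_{a} y_{t 7 a} - 10 \geq 0 \tag{18}$$

for each constrained time step $t$. The modified rewards (11) are given by

$$\bar{r}_{t 7 a}(y) = r_{t 7 a}(y) + \tau_t^\star, \tag{19}$$

where each $\tau_t^\star$ is the optimal dual variable corresponding to each new constraint.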

The optimal population distribution in ‘Belltown’ (state 7) and an adjacent neighbourhood, ‘Capitol Hill’ (state 2), are shown in Fig. 2.

Fig. 2: State density of the optimal trajectory solution to (9). A constraint of form (18) is placed on ‘Belltown’ (state 7). The imposed constraints also affect the optimal population distribution of other states, as shown by changes in the population distribution of the neighboring state ‘Capitol Hill’ (state 2).

Note that the incentive is applied to all actions of state 7. Furthermore, if the solution to the unconstrained problem is feasible for the constrained problem, then the optimal dual variables are zero, i.e. no incentive is offered. We simulate drivers’ behaviour with Algorithm 3 for decreasing values of the termination tolerance. Fig. 4 shows that the population distribution computed by the FW algorithm converges to the Wardrop equilibrium as the tolerance decreases.

Fig. 4: Convergence of the FW approximation to the constrained optimal solution. The plot shows the normalized difference between the constrained optimal solution and the FW approximation.

V-C Increasing Social Welfare

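In most networks with congestion effects, the population does not achieve the maximum social welfare, which can be achieved by optimizing (6) with objective

$$J(y) = \sum_{t \in [T]} \sum_{s \in [S]} \sum_{a \in [A]} y_{tsa}\, \ell_{tsa}(y_{tsa}). \tag{20}$$

In general, a gap exists between $J(y^{\mathrm{eq}})$ and $J(y^{\mathrm{so}})$, where $y^{\mathrm{eq}}$ is the optimal solution to (6), and $y^{\mathrm{so}}$ is the optimal solution to (6) with objective (20).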
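The gap can be seen in a one-shot, two-action example with linear rewards $\ell_a(x) = c_a - b_a x$, where both distributions have closed forms. This is a minimal sketch: the welfare measure used here (total reward collected, $\sum_a y_a \ell_a(y_a)$), the function names, and the numbers are assumptions made for illustration.

```python
def welfare(y, c, b):
    """Total reward collected by the population: sum_a y_a * (c_a - b_a * y_a)."""
    return sum(ya * (ca - ba * ya) for ya, ca, ba in zip(y, c, b))


def wardrop_split(c, b, mass):
    """Equilibrium split: the rewards of both used actions are equal."""
    y1 = (c[0] - c[1] + b[1] * mass) / (b[0] + b[1])
    y1 = min(max(y1, 0.0), mass)
    return [y1, mass - y1]


def social_split(c, b, mass):
    """Welfare-maximizing split: the marginal welfare of both actions is equal."""
    y1 = (c[0] - c[1] + 2 * b[1] * mass) / (2 * (b[0] + b[1]))
    y1 = min(max(y1, 0.0), mass)
    return [y1, mass - y1]


c, b, mass = [4.0, 10.0], [1.0, 1.0], 10.0
y_eq = wardrop_split(c, b, mass)   # [2.0, 8.0]
y_so = social_split(c, b, mass)    # [3.5, 6.5]
gap = welfare(y_so, c, b) - welfare(y_eq, c, b)   # 24.5 - 20.0 = 4.5
```

At equilibrium the more rewarding action is overused relative to the welfare-maximizing split, because individual players ignore the congestion they impose on others.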

The typical approach to closing the social welfare gap is to impose mass dependent incentives. An alternative method, perhaps under-explored, is to impose constraints. As opposed to congestion dependent taxation methods for improving social welfare [16, 4], constraint generated tolls are congestion independent.

We can compare the two distributions and generate upper/lower bound constraints with an $\epsilon$ threshold; see Algorithm 4 for the constraint selection method. The number of constraints increases with decreasing $\epsilon$. Since the objective function in (6) is continuous in $y$, as $\epsilon$ approaches zero, the objective will also approach the social optimum.

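Algorithm 4 Constraint Generation
Input: $y^{\mathrm{eq}}$, $y^{\mathrm{so}}$, $\epsilon$. Output: set of bound constraints.
for each $t \in [T]$, $s \in [S]$, $a \in [A]$ do
     if $y^{\mathrm{eq}}_{tsa} - y^{\mathrm{so}}_{tsa} > \epsilon$ then
          impose the upper bound $y_{tsa} \leq y^{\mathrm{so}}_{tsa}$
     else if $y^{\mathrm{so}}_{tsa} - y^{\mathrm{eq}}_{tsa} > \epsilon$ then
          impose the lower bound $y_{tsa} \geq y^{\mathrm{so}}_{tsa}$
     end if
end for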
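The constraint selection step can be sketched as follows. This is a minimal sketch under assumptions: constraints are generated per time-state-action triplet, each bound is placed at the socially optimal mass, and the function name and tuple format are hypothetical.

```python
def generate_constraints(y_eq, y_so, eps):
    """Compare the equilibrium and socially optimal distributions entry by
    entry and emit a bound wherever they differ by more than eps.

    Returns tuples (t, s, a, kind, bound) with kind "lower" or "upper".
    """
    constraints = []
    for t, (eq_t, so_t) in enumerate(zip(y_eq, y_so)):
        for s, (eq_s, so_s) in enumerate(zip(eq_t, so_t)):
            for a, (eq, so) in enumerate(zip(eq_s, so_s)):
                if so - eq > eps:
                    # too little mass at equilibrium: require at least so
                    constraints.append((t, s, a, "lower", so))
                elif eq - so > eps:
                    # too much mass at equilibrium: allow at most so
                    constraints.append((t, s, a, "upper", so))
    return constraints


# One stage, one state, two actions.
loose = generate_constraints([[[2.0, 8.0]]], [[[3.5, 6.5]]], eps=2.0)
tight = generate_constraints([[[2.0, 8.0]]], [[[3.5, 6.5]]], eps=1.0)
```

With the looser threshold no constraint is generated, while the tighter threshold produces one lower bound and one upper bound, illustrating how the number of constraints grows as the threshold shrinks.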
Fig. 6: With population of , the social welfare of the user selected equilibrium is shown as a function of the number of imposed constraints; increasing the number of constraints is equivalent to decreasing tolerance.
Fig. 9: Maximum and minimum toll values are shown in (a) as a function of number of constraints. In (b), the income/loss required to increase social welfare is shown as a function of constraints imposed.

In Fig. 6, we compare the optimal social welfare to the social welfare at the Wardrop equilibrium of the unconstrained congestion game, modeled in Section V-A, as a function of the population size. We use CVXPY [35] to solve the optimization problem.

We utilize Algorithm 4 to generate incentives for the congestion game. Then, we simulate (9) and compare the game output to the social objective in Fig. 6. For a population size of , there is a discernible gap between the social and user-selected optimal values. Note that with only constraints, the gap between the social optimal and the user-selected equilibrium is already less than .

An interesting question to ask is how much of the total market worth is affected by the incentives. In Fig. 9, we demonstrate how payouts vary based on the number of constraints imposed. Let and . The total payout from the drivers to the social planner and the social planner to the drivers are given by

The net revenue the social planner receives from tolls is

Fig. 9(b) shows how these quantities change as the total number of constraints is increased.

VI Conclusions

We presented a method for adjusting the reward functions of an MDPCG in order to shift the Wardrop equilibrium to satisfy population mass constraints. Applications of this constraint framework have been demonstrated in a ride-share example in which a social planner aims to constrain state densities or to maximize overall social gain without explicitly constraining the population. Future work includes developing online methods that update the incentives corresponding to constraints while the game population adjusts its strategy.


  • [1] D. Calderone and S. S. Sastry, “Markov decision process routing games,” in Proc. Int. Conf. Cyber-Physical Syst.   ACM, 2017, pp. 273–279.
  • [2] D. Calderone and S. Shankar, “Infinite-horizon average-cost markov decision process routing games,” in Proc. Intell. Transp. Syst.   IEEE, 2017, pp. 1–6.
  • [3] J. G. Wardrop, “Some theoretical aspects of road traffic research,” in Inst. Civil Engineers Proc. London/UK/, 1952.
  • [4] M. Beckmann, “A continuous model of transportation,” Econometrica, pp. 643–660, 1952.
  • [5] M. Patriksson, The traffic assignment problem: models and methods.   Courier Dover Publications, 2015.
  • [6] L. S. Shapley, “Stochastic games,” Proc. Nat. Acad. Sci., vol. 39, no. 10, pp. 1095–1100, 1953.
  • [7] J.-F. Mertens and A. Neyman, “Stochastic games,” Int. J. Game Theory, vol. 10, no. 2, pp. 53–66, 1981.
  • [8] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japan J. Math., vol. 2, no. 1, pp. 229–260, 2007.
  • [9] O. Guéant, “Existence and uniqueness result for mean field games with congestion effect on graphs,” Appl. Math. Optim., vol. 72, no. 2, pp. 291–303, 2015.
  • [10] T. Larsson and M. Patriksson, “An augmented lagrangean dual algorithm for link capacity side constrained traffic assignment problems,” Transport. Res. B, vol. 29, no. 6, pp. 433–455, 1995.
  • [11] D. Hearn, “Bounding flows in traffic assignment models,” Research Report, pp. 80–4, 1980.
  • [12] E. Altman, Constrained Markov decision processes.   CRC Press, 1999, vol. 7.
  • [13] M. El Chamie, Y. Yu, B. Acikmese, and M. Ono, “Controlled markov processes with safety state constraints,” IEEE Trans. Autom. Control, 2018.
  • [14] T. Roughgarden, Selfish routing and the price of anarchy.   MIT press Cambridge, 2005, vol. 174.
  • [15] D. Calderone, “Models of competition for intelligent transportation infrastructure: Parking, ridesharing, and external factors in routing decisions,” Ph.D. dissertation, U.C. Berkeley, ProQuest ID: Calderone_berkeley_0028E_17079, May 2017.
  • [16] A. Pigou, The economics of welfare.   Routledge, 2017.
  • [17] R. Cole, Y. Dodis, and T. Roughgarden, “How much can taxes help selfish routing?” Journal of Computer and System Sciences, vol. 72, no. 3, pp. 444–467, 2006.
  • [18] M. Furuhata, M. Dessouky, F. Ordóñez, M.-E. Brunet, X. Wang, and S. Koenig, “Ridesharing: The state-of-the-art and future directions,” Transport. Res., vol. 57, pp. 28–46, 2013.
  • [19] J. Knope. Download These 12 Best Apps For Uber Drivers & Lyft Drivers. [Online]. Available:
  • [20] M. Katz. This App Let’s Drivers Juggle Competing Uber and Lyft Rides. [Online]. Available:
  • [21] P. Solman. How Uber drivers game the app and force surge pricing. [Online]. Available:
  • [22] A. Ahmed, P. Varakantham, and S.-F. Cheng, “Uncertain congestion games with assorted human agent populations,” arXiv preprint arXiv:1210.4848 [cs.GT], 2012.
  • [23] R. M. Freund and P. Grigas, “New analysis and results for the frank–wolfe method,” Math. Program., vol. 155, no. 1-2, pp. 199–230, 2016.
  • [24] B. Jovanovic and R. W. Rosenthal, “Anonymous sequential games,” J. Math. Econ., vol. 17, no. 1, pp. 77–87, 1988.
  • [25] J. Bergin and D. Bernhardt, “Anonymous sequential games with aggregate uncertainty,” J. Math. Econ., vol. 21, no. 6, pp. 543–562, 1992.
  • [26] ——, “Anonymous sequential games: existence and characterization of equilibria,” Econ. Theory, vol. 5, no. 3, pp. 461–489, 1995.
  • [27] P. Wiecek and E. Altman, “Stationary anonymous sequential games with undiscounted rewards,” J. Optim. Theory Appl., vol. 166, no. 2, pp. 686–710, 2015.
  • [28] O. Guéant, J.-M. Lasry, and P.-L. Lions, “Mean field games and applications,” in Paris-Princeton lectures on mathematical finance 2010.   Springer, 2011, pp. 205–266.
  • [29] O. Guéant, “From infinity to one: The reduction of some mean field games to a global control problem,” arXiv preprint arXiv:1110.3441 [math.OC], 2011.
  • [30] D. A. Gomes, J. Mohr, and R. R. Souza, “Discrete time, finite state space mean field games,” J. Math. Pures Appl., vol. 93, no. 3, pp. 308–328, 2010.
  • [31] O. Guéant, “Mean field games on graphs,” in NETCO 2014, 2014.
  • [32] W. H. Sandholm, “Potential Games with Continuous Player Sets,” J. Econ. Theory, vol. 97, no. 1, pp. 81–108, 2001.
  • [33] W. B. Powell and Y. Sheffi, “The convergence of equilibrium algorithms with predetermined step sizes,” Transport. Sci., vol. 16, no. 1, pp. 45–55, 1982.
  • [34] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
  • [35] S. Diamond and S. Boyd, “Cvxpy: A python-embedded modeling language for convex optimization,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2909–2913, 2016.