In many large-scale networks with self-motivated users, competition for network resources induces congestion effects and directly couples individual users' decisions. However, aided by widespread technological advances, decision makers may now adaptively optimize their strategies given the plethora of information on the states of the network with which they interact. Knowing such behaviour, a system designer may choose to impose tolls on the optimum-seeking decision makers to satisfy system objectives. For example, a ride-share company whose drivers seek to maximize their earnings may impose monetary penalties on drivers who roam in suburban neighbourhoods, in order to reduce traffic noise and promote residential well-being. Similarly, monetary incentives may be given to drivers in commercial areas where ride-share demand is large but highly congested traffic makes these areas unattractive to drivers.
This paper considers the design of optimal tolls that may be imposed on a stochastic system experiencing congestion in order to achieve a set of design specifications in an online setting. We model such systems as an MDP congestion game played by a continuous population of decision makers. We formulate the toll synthesis problem from the perspective of a game designer who does not necessarily know the objectives of individual decision makers a priori. By stating the game designer's objective as a bilevel optimization, we devise learning algorithms for both the designer and the decision makers, such that the optimal toll and the induced optimal strategies can be achieved simultaneously.
Our analysis utilizes the MDP congestion game model, a class of non-cooperative games in which decision makers solve an MDP with congestion effects embedded in its cost structure; i.e., the cost of a state-action pair depends on the population of decision makers who choose that state-action pair. In such a game, an MDP Wardrop equilibrium exists if the MDP cost functions are non-decreasing. The equilibrium is optimal for all decision makers, as they cannot achieve a lower expected cost by changing their strategies [4, 3].
When the MDP Wardrop equilibrium is not unique, convergence to an optimal strategy is non-trivial and requires coordination between decision makers. One form of coordination is realized when decision makers adapt their strategy profile based on the history of other players' strategies. We interpret this adaptive behaviour as an online learning process, in which decision makers converge to an MDP Wardrop equilibrium according to myopic update rules; i.e., decision makers tend to select actions with the highest return without considering the effect on congestion in the future.
Under an online setting, we employ tolls on state-action costs as a mechanism to ensure that the system-level design specifications are satisfied by the resulting MDP Wardrop equilibrium. Extending our prior work, we formulate tolls for a more general class of state-action cost functions (ones that include the limiting case where costs are independent of the population distribution, as in classic MDPs) and show the existence of optimal toll values whose addition to the game guarantees the satisfaction of design specifications for a minimal increase in the cost of playing for decision makers.
When the state-action cost functions of decision makers are unknown to the game designer, the optimal toll values cannot be derived explicitly. Hence we consider a scenario where the game designer must deduce the optimal toll value by observing and interacting with a group of decision makers. By incorporating results from constrained traffic assignment literature [7, 8] with MDP algorithms [9, 10, 11], we formulate a myopic toll update based on the population distribution of decision makers converging to an MDP Wardrop equilibrium. Under the interactive scenario between game designer and decision makers, we show that the toll update converges to the optimal tolling values. In particular, the convergence holds whether the decision makers themselves are converging to Wardrop equilibrium exactly or asymptotically.
The rest of this paper is organized as follows. We summarize related work in Section 2. In Section 3, we introduce preliminaries and background for MDP congestion games. In Section 4, we consider a tolling method to enforce affine constraints, and in Section 5.1 we consider learning dynamics for the game designer. In Section 6, we present an implementation of our adaptive tolling method on a ride-share example, where a central authority utilizes toll updates to enforce constraints on decision makers who are themselves following learning dynamics.
2 Related Work
MDP congestion games [3, 12] combine features of non-atomic routing games [4, 13, 14] (where decision makers influence each other's edge costs through congestion effects over a network) and stochastic games [15, 16] (where each decision maker solves an MDP). The MDP Wardrop equilibrium for MDP congestion games, akin to the Wardrop equilibrium of routing games, was introduced in .
Our approach to satisfying design constraints is similar to techniques for constraint satisfaction through link cost modification in non-atomic routing games, which were developed in . See also [14, Sec. 2.8.2] for a discussion of tolling to enforce side-constraints and [14, Sec. 2.4] for a discussion of tolling to improve social welfare in routing games. In , computation of tolls without knowledge of a routing game’s latency functions is shown to converge in polynomial time. Despite similarities in our approach, the algorithms developed in this paper depend on the stochastic dynamics underlying MDP congestion games, and therefore differ from those in routing games.
Adaptive incentive design has also been considered in both deterministic and stochastic settings in , and in the partial information setting . By leveraging our game's network structure and specific design objective, we formulate more explicit algorithms with specific tolling functions. The sensitivity of Nash equilibria to incentive function parameters is analyzed in  to obtain the effects of incentives on equilibrium.
Tolling on routing game link costs to satisfy external objectives can also be interpreted as a Stackelberg game  between a leader and followers, where the followers play a non-cooperative game whose cost is influenced by the leader's actions. Leader action update techniques are developed for leaders optimizing the social cost of their followers; see  for a recent result.
3 Preliminaries and Background
In this section, we present an overview of mathematical preliminaries relevant to MDP congestion games, including a numerical method for solving the game. We use the notation $[K]$ to denote an index set of length $K$, $\mathbb{R}$ to denote the set of real numbers, and $\mathbb{R}_+$ for the set of non-negative real numbers. For any tensor $X$, we use $X_{ijk}$ to denote its components. To denote a vector of ones of size $N$, we use $\mathbf{1}_N$. The support of a vector $v$ is defined to be the set of indices $i$ where $v_i \neq 0$.
In an archetypal finite MDP problem, each decision maker solves a finite-horizon MDP  with horizon of length , finite state space and finite action space given by
where the objective is to minimize the total expected cost over the finite time horizon . The optimization variable
defines a time-varying state-action probability distribution, such thatdenotes a decision maker’s probability of taking action at state and time .
The probability kernel defines the MDP transition dynamics, where denotes the time varying transition probability from state to when taking action . The kernel satisfies
Solutions to a finite horizon MDP problem described by (1) can be obtained using Algorithms 1 and 2 . Algorithm 1 iteratively evaluates the Bellman equation temporally backwards to determine the optimal expected cost-to-go, and the optimal policy, , which determines the optimal action to take at time and state . Note the operator in Algorithm 1 returns two values: the minimum value of the input set, assigned to , and the argument of the minimum value, assigned to .
Algorithm 2 propagates an initial probability density forward in time to obtain the probability distribution, , that results from employing policy .
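The backward and forward passes described above can be illustrated with a minimal NumPy sketch. It is not a transcription of Algorithms 1 and 2: the array layouts `P[t, n, s, a]` (probability of moving from state `s` to state `n` under action `a` at time `t`) and `c[t, s, a]`, the zero terminal cost, and the two-state toy instance are all assumptions made for illustration.

```python
import numpy as np

def backward_induction(P, c):
    """Evaluate the Bellman equation backwards in time.

    P[t, n, s, a]: transition probability from s to n under a (assumed layout).
    c[t, s, a]: state-action cost.
    Returns the cost-to-go V[t, s] and a deterministic policy pi[t, s].
    """
    T, S, A = c.shape
    V = np.zeros((T + 1, S))            # V[T] = 0: no terminal cost (assumption)
    pi = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        # Q[s, a] = immediate cost + expected cost-to-go from the next state
        Q = c[t] + np.einsum("nsa,n->sa", P[t], V[t + 1])
        V[t] = Q.min(axis=1)            # minimum value
        pi[t] = Q.argmin(axis=1)        # argument of the minimum
    return V[:T], pi

def forward_propagation(P, pi, p0):
    """Propagate the initial density p0 forward in time under policy pi."""
    T, S = pi.shape
    y = np.zeros((T, S, P.shape[3]))
    x = p0.astype(float)
    for t in range(T):
        y[t, np.arange(S), pi[t]] = x             # all mass in s takes pi[t, s]
        x = np.einsum("nsa,sa->n", P[t], y[t])    # state density at t + 1
    return y

# Hypothetical toy instance: 2 states, action 0 = stay, action 1 = switch.
T, S, A = 3, 2, 2
P = np.zeros((T, S, S, A))
for s in range(S):
    P[:, s, s, 0] = 1.0
    P[:, 1 - s, s, 1] = 1.0
c = np.zeros((T, S, A))
c[:, 0, 0], c[:, 0, 1], c[:, 1, 1] = 1.0, 1.5, 0.5   # staying in state 1 is free
p0 = np.array([1.0, 0.0])

V, pi = backward_induction(P, c)
y = forward_propagation(P, pi, p0)
```

In this toy instance the total cost of the propagated distribution, `np.sum(y * c)`, coincides with the expected cost-to-go `p0 @ V[0]`, which is a convenient sanity check for any implementation of the two passes.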
3.2 MDP Congestion Game
As in non-atomic games , we now consider a continuous population of decision makers each solving a finite horizon MDP given by (1). The total population distribution is given by , such that describes the portion of total population taking action at state and time . Given a probability kernel and initial distribution , the set of feasible population distributions is described by
We say the population has mass if .
In a non-atomic MDP congestion game, each decision maker in the total population distribution seeks to minimize their expected total cost in the finite horizon . The state-action cost is given by
where . We use to denote the cost tensor whose component is .
When individual decision makers optimize an MDP with -dependent state-action costs, an MDP Wardrop equilibrium is reached if the population cannot achieve a better expected total cost by choosing a different feasible distribution within .
Definition 1 (MDP Wardrop Equilibrium )
We say a population distribution is an MDP Wardrop equilibrium if it satisfies
3.2.1 Potential Game Formulation
MDP congestion games are a subset of potential games . If an explicit potential function exists for , such that , then all solutions of the following optimization problem are Wardrop equilibria of the MDP congestion game corresponding to :
In the rest of this paper, we specialize to the case where is only a function of . This allows for an explicit potential function in the form of an integral. In particular, the optimization problem given in (4) can be re-formulated as
We also require to satisfy a monotonicity, non-decreasing, property.
Assumption 1 (Monotonicity)
For each , , the component of state-action cost tensor satisfies .
Analogous to the colloquial meaning of congestion, Assumption 1 states that the state-action cost for triplet cannot decrease as the portion of population that chooses it increases. A limiting scenario is the classic MDP, in which is a constant function of .
Assumption 1 implies convexity of optimization problem (5) but, because costs need only be non-decreasing, does not guarantee a unique MDP Wardrop equilibrium. Non-uniqueness can be readily observed by noting that the optimal solution of a classic MDP problem is itself not unique in general. Convexity of (5) allows us to leverage existing convex optimization techniques to solve MDP congestion games.
3.3 Frank-Wolfe Algorithm
Although optimization problem (5) is an explicit formulation of the MDP Wardrop equilibrium set and can be solved numerically, arriving at the same equilibrium requires decision makers to anticipate each other's strategies. Coordination between individual decision makers is difficult even when the set of decision makers is finite. Additionally, decision makers themselves may not know the full congestion cost model and may only observe the resulting costs from previous games.
Despite these difficulties, decision makers who perform myopic updates derived from the Frank-Wolfe algorithm will achieve asymptotic convergence to an MDP Wardrop equilibrium. Frank-Wolfe is a gradient-based iterative algorithm consisting of two steps: 1) compute the steepest feasible descent direction using an objective linearized around the current iterate, and 2) form a convex combination of this descent direction and the current iterate to produce the next iterate. The full algorithm is given in Algorithm 3. At each iteration, the duality gap provides a certificate of performance in terms of how close the approximate solution is to the optimal solution; that is,
This gap can be interpreted as the population regret at iteration .
The Frank-Wolfe algorithm leverages the fact that the linear sub-problems can be solved at lower computational cost, which reduces the cost of each iteration. Indeed, by leveraging dynamic programming solutions for MDPs (Algorithms 1 and 2), Algorithm 3 has superior computational efficiency compared to off-the-shelf solvers, especially for lower precision requirements.
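The two Frank-Wolfe steps and the duality-gap certificate can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 3: the helper names `best_response` and `frank_wolfe`, the affine congestion cost `alpha * y + beta`, the step size `2 / (k + 2)`, the array layout `P[t, n, s, a]`, and the toy instance are all hypothetical choices.

```python
import numpy as np

def best_response(P, c, p0):
    """Linear sub-problem: solve a standard MDP with fixed costs c by dynamic
    programming and return the induced population distribution."""
    T, S, A = c.shape
    V = np.zeros(S)
    pi = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        Q = c[t] + np.einsum("nsa,n->sa", P[t], V)
        pi[t], V = Q.argmin(axis=1), Q.min(axis=1)
    y, x = np.zeros((T, S, A)), p0.astype(float)
    for t in range(T):
        y[t, np.arange(S), pi[t]] = x
        x = np.einsum("nsa,sa->n", P[t], y[t])
    return y

def duality_gap(P, cost, p0, y):
    """Frank-Wolfe gap <c(y), y - y_hat>; non-negative for feasible y."""
    c = cost(y)
    y_hat = best_response(P, c, p0)
    return float(np.sum(c * (y - y_hat))), y_hat

def frank_wolfe(P, cost, p0, y0, iters=500, tol=1e-3):
    y = y0.copy()
    for k in range(iters):
        gap, y_hat = duality_gap(P, cost, p0, y)
        if gap <= tol:
            return y, gap
        gamma = 2.0 / (k + 2.0)                  # assumed step-size rule
        y = (1.0 - gamma) * y + gamma * y_hat    # convex combination
    return y, duality_gap(P, cost, p0, y)[0]

# Hypothetical toy game: 2 states, action 0 = stay, action 1 = switch.
T, S, A = 3, 2, 2
P = np.zeros((T, S, S, A))
for s in range(S):
    P[:, s, s, 0] = 1.0
    P[:, 1 - s, s, 1] = 1.0
beta = np.zeros((T, S, A))
beta[:, 0, 0], beta[:, 0, 1], beta[:, 1, 1] = 1.0, 1.5, 0.5
alpha = 2.0                                      # congestion slope (non-decreasing cost)
cost = lambda y: alpha * y + beta
p0 = np.array([1.0, 0.0])

y0 = best_response(P, cost(np.zeros((T, S, A))), p0)
y_fw, gap_fw = frank_wolfe(P, cost, p0, y0)
```

Because each iterate is a convex combination of feasible distributions, the population mass at every time step is preserved, and the returned gap bounds how far the iterate's potential value is from optimal.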
3.4 Online Learning Framework for Decision Makers
We interpret the Frank-Wolfe algorithm as a stochastic policy seeking strategy that guarantees convergence to an MDP Wardrop equilibrium over repeated game play.
In the repeated game play scenario, each iteration corresponds to a play of the game. The iteration (lines 4–11) describes the policy changes that occur between and play of the game. Before the start of iteration, an oracle provides to the population of decision makers. Then, the decision makers calculate the optimal marginal policy (line 7) based on the marginal state-action costs (line 5) from the previous iteration. Each decision maker then either uses its previous policy from iteration with probability or uses the current policy determined in line 7 with probability . The population distribution that results from assigning these two policies to two subpopulations (where assignments are made uniformly at random) is described in line 9. With the chosen sequence for , convergence to the optimal policy and population distribution is guaranteed by the convergence properties of the Frank-Wolfe algorithm as described in the preceding subsection.
4 Optimal Toll Synthesis for Constraint Satisfaction
In this section, we formulate toll synthesis as a bilevel optimization problem and show the existence of an optimal tolling value. We show that when the game designer can modify decision makers’ state-action costs, design specifications on the MDP Wardrop equilibrium distribution can be satisfied. Furthermore, we show there exist optimal tolling values for which design specifications are exactly satisfied.
There are design specifications that the game designer must enforce, which are affine constraints on the population density . In practice, affine constraints can appear due to a broad range of design requirements. For example, in a city's traffic network, artificial limits (expressed as linear combinations of street traffic density) may be imposed in residential areas to minimize noise. When building investment portfolios, banks are faced with stochastic returns based on investment actions, but may be legally required to satisfy minimum thresholds in reserve, which are also affine constraints.
Assumption 2 (Affine State-Action Constraints)
The constraints imposed on the MDP congestion game can be written as a finite set of affine functions specified by
where , , and is a finite set of indices corresponding to the affine constraints.
To enforce constraint , we propose the following toll function:
Tolls are enforced only when the constraint is violated, where the toll on time-state-action triplet is proportional to both and the tolling value . Tolling value is a parameter controlled by the game designer that determines the severity of the penalty for the constraint violation. The toll function changes decision makers' costs as
The game designer must utilize its control to enforce constraints on the MDP Wardrop equilibrium induced by in an MDP congestion game. This can be formulated as a bilevel optimization problem given by
where denotes the set of MDP Wardrop equilibria that correspond to ,
We note that a related problem is the synthesis of a policy which maps distributions in to state-action costs, such that the state-action costs induce a Wardrop equilibrium that satisfies the constraint specifications. The synthesis of can be thought of as a reverse Stackelberg game. All possible pairs define the space of allowable tolling functions over which the game designer (leader) is able to optimize to induce a desirable distribution that satisfies the given constraints. Indeed, the problem as viewed from this perspective is a more general 'toll synthesis' problem. Synthesizing mappings is a much more difficult problem, and the typical approach is to parameterize the space of mappings and then optimize over the parameters.
Given a non-atomic MDP congestion game played by a continuous population of decision makers, and design constraints satisfying Assumption 2, we show the existence of an optimal , such that the MDP Wardrop equilibrium corresponding to costs in (8) satisfies the constraints .
Theorem 1 (Existence of Optimal Toll)
Consider the game designer's optimization problem formulated in (9) where the following hold: (i) the decision makers' state-action cost satisfies Assumption 1; (ii) the constraint set satisfies Assumption 2; and (iii) the intersection of constraint set and is non-empty. Then, there exists that solves the game designer's optimal toll synthesis problem given by (9), where belongs to constraint set . Furthermore, is the MDP Wardrop equilibrium induced by state-action costs described by (8).
See Section 8.1.
The optimal toll threshold is exactly the optimal dual variable that corresponds to constraint in the constrained optimization problem given in (A.1).
Under the Frank-Wolfe-based learning dynamics in Algorithm 3, the penalty cost from (7) can be interpreted as a tolling scheme in the learning process, where the game designer decides to enforce penalties based on the last game's distribution .
When the constraint is violated during the previous game, will include toll during the next game play. Alternatively, when the state constraints are not violated, during the next game play. Intuitively, the additional toll discourages decision makers from choosing . Theorem 1 shows that there exists an optimal level of discouragement, such that the constraint is satisfied by the MDP Wardrop equilibrium.
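The violation-triggered toll described above can be sketched in a few lines. This is one reading of the description rather than the paper's definition in (7) and (8): it assumes each constraint has the form `<A_con[i], y> <= b[i]`, and that a violated constraint `i` adds `tau[i] * A_con[i]` to the state-action costs. The function name `tolled_cost` and the toy numbers are hypothetical.

```python
import numpy as np

def tolled_cost(c, y_prev, A_con, b, tau):
    """Add tolls for every constraint violated by the previous distribution.

    c:      (T, S, A) untolled state-action costs evaluated at y_prev
    y_prev: (T, S, A) population distribution from the last game
    A_con:  (I, T, S, A) constraint tensors, b: (I,) bounds (assumed form)
    tau:    (I,) non-negative tolling values
    """
    violation = np.tensordot(A_con, y_prev, axes=3) - b   # shape (I,)
    active = violation > 0                                # toll only if violated
    toll = np.tensordot(tau * active, A_con, axes=1)      # shape (T, S, A)
    return c + toll, violation

# Hypothetical example: one time step, two states, one action.
# Constraint: the mass in state 0 must not exceed 0.4.
A_con = np.zeros((1, 1, 2, 1))
A_con[0, 0, 0, 0] = 1.0
b = np.array([0.4])
tau = np.array([2.0])
c = np.zeros((1, 2, 1))

y_bad = np.array([[[0.7], [0.3]]])   # violates the constraint
y_ok = np.array([[[0.3], [0.7]]])    # satisfies the constraint

c_bad, v_bad = tolled_cost(c, y_bad, A_con, b, tau)
c_ok, v_ok = tolled_cost(c, y_ok, A_con, b, tau)
```

Here only the state-action pair that appears in the violated constraint becomes more expensive in the next play, and no toll is charged once the constraint holds.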
In (8), it is evident that , unlike , is not only a function of . However, we note that still corresponds to an explicit potential function, as shown in the potential game formulation in (10). Additionally, we note that the tolling scheme in (8) is dynamic under the online learning framework for decision makers, in the sense that the toll enforced varies between game plays. In our previous work, we proved convergence of an optimal toll method that does not vary between game plays. This constant tolling method can be applied when state-action costs are strictly increasing functions of .
5 Toll Learning Dynamics
When implementing Algorithm 4, the optimal tolling value is assumed to be known a priori. We now consider the toll design problem: how to adaptively determine the optimal tolling value that guarantees constraint satisfaction.
We approach this under the assumption that the game designer does not have access to the state-action cost models of decision makers, , and therefore cannot obtain the optimal tolling value by utilizing offline solvers to solve the constrained optimization problem (A.1). This is often the case for traffic networks where the toll setting authority does not have a model of drivers’ cost function. Instead, we assume the game designer can interact with an ongoing MDP congestion game by setting tolls on its state-action costs and observe how different tolls modify MDP Wardrop equilibrium of the game.
5.1 Game Designer’s Online Learning Framework
In the game designer's online learning framework, the game designer repeatedly sets the toll on an existing MDP congestion game. Here, we assume that the decision makers will always converge to an MDP Wardrop equilibrium given any toll. Once the decision-making population reaches an MDP Wardrop equilibrium that corresponds to the toll set by the game designer, the game designer derives a new toll based on the amount of constraint violation that occurred in the equilibrium distribution.
The goal of the game designer is to enforce the constraint set on MDP Wardrop equilibrium. To achieve this, we consider a tolled state-action cost with a population dependent toll and a dynamic toll which varies between each game play.
where is a constraint-based toll function, is a constant scaling factor of the toll function, and is a toll value updated by the game designer.
When satisfies Assumption 1, is the gradient of a convex potential function for any toll , , that is enforced. The set of MDP Wardrop equilibria induced by state-action costs is defined as follows:
The online learning framework has a feedback control interpretation as shown in Fig. 1. The process by which decision makers achieve MDP Wardrop equilibrium is the plant , a memoryless process that outputs an MDP Wardrop equilibrium that corresponds to state-action costs . The toll setting mechanism utilized by the game designer can be analogously viewed as a feedback controller, , that updates based on the amount of constraint violation in .
We derive a myopic update rule for based on the Alternating Direction Method of Multipliers (ADMM). The complete algorithm is given in Algorithm 5.
Theorem 2 (Optimal Toll Estimation)
See Section 8.2
In the online learning framework, Algorithm 5 has the following interpretation. The designer first determines the amount of constraint violation that exists in an MDP Wardrop equilibrium distribution . For any constraint that is violated, is increased by exactly the amount that the mass-dependent toll function, , incurred during the last iteration at every time step. When constraint is not violated, the toll is set to zero. After each toll adjustment, the designer announces a toll update, and the decision makers converge to an MDP Wardrop equilibrium based on the new toll. Theorem 2 shows that such a toll adjustment algorithm will result in convergence of to the optimal tolling value , and convergence of to the MDP Wardrop equilibrium of the game with state-action costs given by (11).
Although the constant penalty applied to the MDP congestion game at iteration in Algorithm 5 is , is the parameter that converges to . That is, we can view as an internal variable of the ADMM process.
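The observe-then-adjust loop can be illustrated on a deliberately tiny example. The sketch below is a simplification and not Algorithm 5: it replaces the ADMM-derived update with a plain projected dual ascent step, and it uses a hypothetical one-step game with two actions whose equilibrium is available in closed form. The costs `l1(x) = x + tau` and `l2(x) = 1 - x`, the bound `d`, and the step size `rho` are all assumptions.

```python
def equilibrium(tau):
    """Equilibrium mass x on action 1 in a hypothetical one-step game.

    A unit mass splits between action 1, with cost l1 = x + tau, and
    action 2, with cost l2 = 1 - x. Equal costs give x = (1 - tau) / 2,
    clipped to the feasible interval [0, 1].
    """
    return min(1.0, max(0.0, (1.0 - tau) / 2.0))

def designer_loop(d=0.3, rho=1.0, iters=60):
    """Adjust the toll until the constraint x <= d holds at equilibrium."""
    tau = 0.0
    for _ in range(iters):
        x = equilibrium(tau)                   # observe the population response
        tau = max(0.0, tau + rho * (x - d))    # raise toll if violated, keep tau >= 0
    return tau, equilibrium(tau)

tau_star, x_star = designer_loop()
```

In this toy game the untolled equilibrium places mass 0.5 on action 1, so the constraint `x <= 0.3` is violated; the loop settles at the smallest toll for which the equilibrium meets the bound, without the designer ever using the cost functions directly.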
While Algorithm 5 only provides an estimate of the optimal tolling value at each finite iteration, Theorem 1 states that any will result in an optimal solution to the bilevel optimization problem given in (9). This implies that at the iteration of Algorithm 5, we obtain a tolling value which ensures constraint satisfaction for the set of MDP Wardrop equilibria of the game with state-action costs given by (8), and whose distance to decreases on the order of . A natural question is what benefit greater accuracy in provides. In Fig. 6 in Section 6, we show that greater accuracy in results in a faster convergence rate for the decision makers when they themselves undergo learning dynamics.
In order to implement Algorithm 5 in an online learning setting, the game designer does not need an explicit model of the congestion effects of the decision makers, as long as the designer can add to the costs of decision makers through imposing monetary tolls on the relevant links. In this way, the toll learning dynamics are effectively decoupled from the state-action cost model utilized by decision makers.
5.2 Inexact ADMM
In the iterative approach outlined in Algorithm 5, the convergence of and constraint satisfaction require lower level decision makers to return the Wardrop equilibrium exactly. In this section, we extend the convergence results to the case where the decision makers only approximately solve for the Wardrop equilibrium.
The assumption that decision makers can always determine the exact Wardrop equilibrium is restrictive. In applications on large-scale networks, it is often difficult to solve for the exact solution, whereas approximate solutions are often computationally cheaper to obtain. In Algorithm 5, an approximate solution of may be preferred to an exact solution if it results in a similar level of accuracy in estimating the optimal tolling threshold . This is especially true when is still far from , where may gain a comparable amount of accuracy from both and an approximation of .
Another motivation is to consider the learning dynamics of a game designer interacting with decision makers who are themselves following learning dynamics. In most learning dynamics for non-atomic congestion games, decision makers only asymptotically approach the true MDP Wardrop equilibrium [27, 28]. In any finite time, the game designer can therefore only observe an approximate MDP Wardrop equilibrium. In such a setup, it is not obvious that the game designer's learning dynamics will still converge to the optimal tolling threshold.
In Algorithm 6, we make the assumption that decision makers are using an iterative method to solve for . To measure closeness of an approximation, we consider an $\epsilon$-MDP Wardrop equilibrium.
Definition 2 ($\epsilon$-MDP Wardrop Equilibrium)
Given , , there exists , such that at the iteration, satisfies
where , and defined by (11), then satisfies –equilibrium at penalty , denoted by .
where is the diameter of feasible set .
To show convergence of Algorithm 6, we also assume strictly increasing costs for the MDP congestion game.
For each , , and there exists such that
holds for all , .
Under Assumption 3, the corresponding MDP Wardrop equilibrium set is a singleton for any toll . Although it is possible for decision makers to individually solve for the exact equilibrium, we note that decision makers can still reduce their computational cost significantly by utilizing learning dynamics that approximate the equilibrium.
Theorem 3 (Designer-Follower Learning Convergence)
For an MDP congestion game satisfying Assumption 3, suppose the strategies employed by decision makers converge asymptotically, such that at each tolling iteration , the decision makers’ learning dynamics satisfy
Then, the game designer’s updates for in Algorithm 6 converge when the approximate MDP Wardrop equilibrium observed by the game designer satisfies
Furthermore, if also converges, must converge to an optimal solution of (9).
See Section 8.3.
Examples of learning dynamics where the optimal solution converges to are Frank-Wolfe algorithms as well as projected subgradient algorithms. In the Frank-Wolfe case, it is not obvious when the iteration of the decision makers' learning dynamics satisfies , as the convergence bounds are given in terms of the potential function.
If approximate solutions to the MDP Wardrop equilibrium can be obtained with less computation, Theorem 3 shows that approximations of can also be obtained by the game designer with less computation. Algorithm 6 is especially useful for large-scale networks where computing the exact MDP Wardrop equilibrium can be difficult, but methods such as Frank-Wolfe offer fast convergence to an approximate equilibrium. In the next section, we consider a heuristic for the finite termination of decision makers' dynamics based on the Frank-Wolfe duality gap defined in (6), and show that Algorithm 5 and Algorithm 6 achieve similar convergence behaviour while the latter is significantly faster, owing to the lower accuracy required to solve .
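The interaction between a learning designer and learning decision makers can be sketched on a toy problem. This is an illustration under stated assumptions and not Algorithm 6: the game is a hypothetical one-step, two-action game with costs `l1(x) = x + tau` and `l2(x) = 1 - x`, the decision makers run a Frank-Wolfe iteration that stops once its duality gap falls below a tolerance, the tolerance schedule `1 / (k + 1) ** 2` is an arbitrary choice rather than the condition of Theorem 3, and the designer uses a plain projected dual ascent step.

```python
def approx_equilibrium(tau, x, eps, max_iters=100_000):
    """Frank-Wolfe on the toy game, stopped when the duality gap is <= eps.

    x is the mass on action 1. The gradient of the potential with respect
    to x is the cost difference l1(x) - l2(x) = (x + tau) - (1 - x).
    """
    for k in range(max_iters):
        grad = (x + tau) - (1.0 - x)
        x_hat = 0.0 if grad > 0 else 1.0     # vertex minimizing the linearization
        gap = grad * (x - x_hat)             # non-negative duality gap
        if gap <= eps:
            break
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * x_hat
    return x

def inexact_designer_loop(d=0.3, rho=1.0, outer_iters=40):
    """Designer update that only ever sees approximate equilibria."""
    tau, x = 0.0, 1.0
    for k in range(outer_iters):
        eps = 1.0 / (k + 1) ** 2             # assumed tightening tolerance
        x = approx_equilibrium(tau, x, eps)  # decision makers learn, then stop
        tau = max(0.0, tau + rho * (x - d))  # toll reacts to observed violation
    return tau, x

tau_inexact, x_inexact = inexact_designer_loop()
```

For this toy game the exact optimal toll is 0.4 (the toll at which the equilibrium mass on action 1 equals the bound 0.3), and the inexact loop approaches it even though early iterations observe very coarse equilibria.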
In this section, we model competition among ride-sharing drivers in metro Seattle as an MDP congestion game. Section 6.1 describes the model and necessary assumptions. In Section 6.2, we utilize Algorithm 4, Algorithm 5 and Algorithm 6 to demonstrate how the ride-share company can enforce tolls to ensure minimum driver density constraints.
6.1 Ride-sharing Model
Consider a ride-share scenario in metro Seattle, where rational ride-sharing drivers seek to optimize their profits while repeatedly working Friday nights. Assume that the demand for riders is constant during the game and for each game play. The time increment is taken to be 15 minutes, i.e. the average time for a ride, after which the driver needs to take a new action.
We model the Seattle metro area as an abstract set of states, , corresponding to individual neighbourhoods, as shown in Fig. 2. is a neighbouring state of if a direct route from to exists without passing through other states. The following states are characterized as residential: ‘Ballard’ (3), ‘Fremont’ (4), ‘Sand Point’ (8), ‘Magnolia’ (9), ‘Green Lake’ (11), ‘Ravenna’ (12). Assume drivers have equal probabilities of starting from any of the residential neighbourhoods.
Drivers in the same neighbourhood compete for riders. We assume that drivers cannot see a rider's destination before choosing to pick the rider up, nor can they refuse a rider based on his or her destination. From the driver's perspective, the rider's destination is randomly chosen from a uniform distribution over the neighbouring states. Therefore, the set of ride-sharing drivers can be modeled as a congestion game with MDP dynamics. At each state, the following actions are available to each driver:
: wait for a rider, transition to rider’s destination state.
: transition to an adjacent state without waiting for a rider.
When choosing , we assume the driver will eventually pick up a rider, though it may take more time if there are many drivers waiting for riders in that neighborhood. As discussed below, longer wait times decrease the driver’s reward for choosing .
On the other hand, there are two possible scenarios when drivers choose . The driver either drives to and pays the travel costs without receiving a fare, or picks up a rider in . We allow the second scenario with a small probability to model the possibility of drivers deviating from their predetermined strategy during game play.
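The two actions described above can be turned into a transition kernel along the following lines. This sketch is an assumption-laden illustration, not the paper's kernel: the three-neighbourhood line graph, the deviation probability `delta`, the action indexing (action 0 waits for a rider, action `j + 1` drives to state `j`), the time-invariant layout `P[n, s, a]`, and the choice to keep a driver in place for moves to non-neighbouring states are all hypothetical.

```python
import numpy as np

def ride_share_kernel(neighbours, delta=0.05):
    """Build P[n, s, a]: probability of reaching state n from state s under a.

    Action 0: wait for a rider; the destination is uniform over neighbours.
    Action j + 1: drive to neighbour j; with small probability delta the
    driver instead picks up a rider and lands in a uniformly chosen neighbour.
    """
    S = len(neighbours)
    P = np.zeros((S, S, S + 1))
    for s, nbrs in neighbours.items():
        uniform = np.zeros(S)
        uniform[nbrs] = 1.0 / len(nbrs)
        P[:, s, 0] = uniform
        for j in range(S):
            if j in nbrs:
                P[:, s, j + 1] = delta * uniform
                P[j, s, j + 1] += 1.0 - delta
            else:
                P[s, s, j + 1] = 1.0     # placeholder for an unavailable move
    return P

# Hypothetical line graph of three neighbourhoods: 0 - 1 - 2.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
P_ride = ride_share_kernel(neighbours, delta=0.1)
```

Each column `P_ride[:, s, a]` sums to one, so the kernel can be used directly in a forward propagation of the driver density.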
We denote by the set of neighbouring states for state . The transition probability for state-action pair given current state is given by
The reward function for taking each action is given by
where is the monetary cost for transitioning from state to , is the travel cost from state to , is the coefficient of the cost of waiting for a rider. We estimate various parameters in (6.1) as
where is the mild congestion effect from of drivers who all decide to traverse from to , which we take to be throughout the rest of this section. Parameter quantifies the time-money tradeoff, computed as
where the average trip length, , is equivalent to the average distance between neighbouring states.
The values independent of specific transitions are listed in Tab. 1. Distance is the geophysical distance between neighbourhoods.
| Rate | Velocity | Fuel Price | Fuel Eff | Time-money tradeoff | Average trip length |
|---|---|---|---|---|---|
| $6/mi | 8 mph | $2.5/gal | 20 mi/gal | $27/hr | 1.25 mi |
6.2 Ensuring Minimum Driver Density
To ensure rider satisfaction for a neighbourhood with high fluctuation in rider demand during the later time steps, the ride-share company aims for a minimum driver coverage of in ‘Belltown’, ,
The ride-share company must determine the optimal toll to charge to ensure that the resulting MDP Wardrop equilibrium satisfies the constraints while maximizing its own profits. Assuming that the company does not have a model of its drivers' state-action costs, the optimal toll problem can be expressed under the game designer's online learning framework: the ride-share company must adaptively determine neighbourhood tolls based on how the drivers' population distribution changes in response to different tolls.
In the interaction with drivers, who are the decision makers of the game, we consider two different scenarios: one in which the drivers themselves undergo learning dynamics, such that only an approximate MDP Wardrop equilibrium can be returned in finite time (Algorithm 6), and one in which drivers can solve for and form an exact MDP Wardrop equilibrium in finite time (Algorithm 5). We utilize Algorithm 4 to return approximate MDP Wardrop equilibria for a toll set by the company, and solve optimization problem (12) directly with CVXPY to simulate the case in which drivers do reach an MDP Wardrop equilibrium in finite time. We compare both solutions against the baseline solution we obtain with CVXPY by solving the bilevel optimization problem (9).
Although Algorithm 6 requires a bound on the accuracy of the approximate MDP Wardrop equilibrium when solving the sub-problem in line 4, such a bound cannot be explicitly related to the termination criteria for Algorithm 4. Instead, we introduce the following heuristic, used in lieu of the original termination criteria,
We run Algorithm 6 with , and . The convergence of and optimal tolling threshold for both Algorithm 5 and Algorithm 6 are shown in Fig. 3. As expected, Algorithm 5 has a sublinear convergence rate in both MDP congestion game’s objective value and the optimal tolling threshold. Algorithm 6 does not have any convergence guarantees, however from comparison with Algorithm 5, we observe that it has the same rate of convergence on the optimal tolling threshold, and even slightly better rates of convergence in terms of population regret.
The amount of constraint violation for both algorithms is shown in Fig. 4. During the toll estimation process, we observe that the amount of constraint violation monotonically approaches at similar rates for both algorithms. However, Algorithm 5 has significantly less constraint violation than Algorithm 6.
With the estimated from Algorithm 6, we can apply optimal tolling Algorithm 4 for tolling parameters . We utilize the toll function as in (7). Convergence behaviour of the primal objective for five different tolling values is shown in Fig. 5.
The toll applied in each data set is , where is given by the legend shown. We see that the convergence rate improves when is closer to , which motivates obtaining an accurate estimate of the optimal tolling value.
The population density as a function of time for the constrained state that results from Algorithm 4 is shown in Fig. 6a in green. Here we use a tolling value of and run the algorithm for 500 iterations; the optimal tolling value used is as derived using Algorithm 6. We compare the distribution to the optimal unconstrained density shown in blue, and the constrained MDP Wardrop equilibrium distribution as solved by CVXPY, shown in orange. Because Algorithm 4 produces a population distribution that only asymptotically approaches the constrained MDP Wardrop equilibrium, the population density after 500 iterations comes close to, but does not satisfy, the imposed constraints.
In Fig. 6b, we compare the primal objective achieved for the unconstrained MDP Wardrop equilibrium (uCVX), the constrained equilibrium (cCVX), and the approximate equilibrium achieved using Algorithm 4 after 500 iterations (eFW). After 500 iterations, Algorithm 4 returns a population distribution whose potential value is significantly reduced from the unconstrained potential.
Finally, in Fig. 7, we see that Algorithm 6 is computationally much more efficient than Algorithm 5 for the same accuracy. We simulate both algorithms for increasing accuracy in toll estimation . In Algorithm 5, line 4, each MDP Wardrop equilibrium is computed using CVXPY. Since the potential game optimization formulation (5) has the same structure for all iterations, CVXPY consistently solves it in seconds. In Algorithm 6, the MDP Wardrop equilibrium at iteration is approximated by using Algorithm 4, which terminates at iteration when where is a scaling factor taken to be the primal objective value of the constrained MDP Wardrop equilibrium, as we obtained in Fig. 6b under column cCVX. Since decreases sublinearly in , we see that the number of iterations that Algorithm 4 requires to satisfy such a bound also increases sublinearly. Despite the increase in solving time at each iteration , Algorithm 6 is still significantly more efficient than Algorithm 5, especially given that the two algorithms result in similar levels of accuracy in optimal toll estimation.
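The trade-off between tolerance and iteration count for an approximate equilibrium solver can be sketched with a textbook Frank-Wolfe loop. The example below is an assumption-laden toy, not the paper's Algorithm 4: it uses a single-stage congestion game over parallel options with hypothetical affine costs, the standard step size 2/(k+2), and the Frank-Wolfe duality gap as the stopping test. The function name and coefficients are illustrative only.

```python
def frank_wolfe(a, b, tol=1e-3, max_iters=100000):
    """Approximate Wardrop split over parallel options with costs a[i]*x[i] + b[i].

    The costs are the gradient of the potential sum(a[i]*x[i]**2/2 + b[i]*x[i]),
    so minimising the potential over the simplex yields the equilibrium.
    Returns (x, gap, iterations_used).
    """
    n = len(a)
    x = [1.0 / n] * n
    gap = float("inf")
    for k in range(max_iters):
        cost = [a[i] * x[i] + b[i] for i in range(n)]
        best = min(range(n), key=lambda i: cost[i])   # best response: cheapest option
        # Frank-Wolfe gap: average cost paid minus the cheapest available cost.
        gap = sum(cost[i] * x[i] for i in range(n)) - cost[best]
        if gap <= tol:
            return x, gap, k
        step = 2.0 / (k + 2.0)
        # Move a fraction `step` of the population onto the cheapest option.
        x = [(1.0 - step) * x[i] + (step if i == best else 0.0) for i in range(n)]
    return x, gap, max_iters


for tol in (1e-1, 1e-2, 1e-3):
    x, gap, iters = frank_wolfe([2.0, 1.0], [1.0, 0.5], tol=tol)
    print(f"tol={tol:g}: iterations={iters}, gap={gap:.2e}")
```

Because every run follows the same trajectory and stops at the first iterate whose gap meets the tolerance, a tighter tolerance can never terminate earlier than a looser one, which is the qualitative behaviour discussed above.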
In this paper, we addressed how to satisfy given state constraints of an MDP congestion game. We provided an online learning interpretation of the Frank-Wolfe algorithm for decision makers, and presented an online learning interpretation of ADMM to determine the optimal tolling value corresponding to a set of design constraints. Future directions include extending the ADMM algorithm to the distributed scenario by considering its convergence properties when each decision maker receives only partial information about state-action costs at each game iteration.
8.1 Proof of Theorem 1
The decision makers play an MDP congestion game with state-action costs given by (8), resulting in optimal distribution set which solves the optimization problem in (10). The objective of the optimization problem in (10) can be decomposed into a potential function due to the original cost and a constraint dependent function which satisfies
By construction, is a continuous exterior penalty function relative to . When we take the limit of , becomes an indicator function. The potential game formulation of MDP congestion games would be equivalent to the following constrained optimization problem:
When , the optimization problem in (A.1) has a non-empty solution set. From the exact penalty theorem [34, Prop. 5.4.5], there exists a finite for which if , . Finally, we observe that for all such that , therefore solves the game designer’s toll synthesis problem given by (9) when .
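The exact penalty behaviour used in this proof can be checked numerically on a one-dimensional toy problem. The sketch below is only an illustration under assumed data: it minimizes x^2 subject to x >= 1, whose optimal multiplier is 2, and replaces the constraint with the exterior penalty k*max(0, 1 - x). The functions `penalized` and `argmin_convex` are hypothetical helpers, and the ternary search is just a convenient solver for a one-dimensional convex function.

```python
def penalized(x, k):
    """Objective x**2 plus an exterior penalty for violating x >= 1."""
    return x * x + k * max(0.0, 1.0 - x)


def argmin_convex(fn, lo=-2.0, hi=3.0, iters=200):
    """Ternary search for the minimiser of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fn(m1) <= fn(m2):
            hi = m2   # a minimiser lies in [lo, m2]
        else:
            lo = m1   # a minimiser lies in [m1, hi]
    return 0.5 * (lo + hi)


for k in (1.0, 2.0, 3.0, 10.0):
    x_k = argmin_convex(lambda x: penalized(x, k))
    print(f"penalty weight {k:>4}: minimiser {x_k:.6f}")
```

For penalty weights below the multiplier the minimizer violates the constraint, while any weight at or above it recovers the constrained solution exactly, with no need to take the weight to infinity.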
8.2 Proof of Theorem 2
where the dual parameter is the optimal dual parameter corresponding to constraint (A.2c).
When satisfies the non-decreasing property in Assumption 1 and satisfies Assumption 2, (A.2a)–(A.2c) form a convex optimization problem. To iteratively solve for the optimal threshold , we use the convex optimization algorithm ADMM, first noting that (A.2) is equivalent to an equality constrained optimization given by
We use the following variable updates:
by defining the augmented Lagrangian,
Updates for and can be shown to be equivalent to lines 6–11 in Algorithm 5. In particular, the update for is the analytical solution to . The update for is equivalent to finding an MDP Wardrop equilibrium for a game played with state-action costs given by (11) when .
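For readers unfamiliar with ADMM, the pattern of alternating a smooth minimization, a closed-form projection, and a dual update can be seen on a scalar toy problem. The sketch below uses the standard scaled-form ADMM on an assumed problem, minimizing 0.5*(x - a)^2 subject to x >= b via the splitting x = z; it does not reproduce the updates of Algorithm 5, and the symbols `a`, `b`, `rho`, `u` are generic rather than the paper's.

```python
def admm_lower_bound(a, b, rho=1.0, iters=100):
    """Scaled-form ADMM for: minimise 0.5*(x - a)**2 subject to x >= b.

    Splitting: f(x) = 0.5*(x - a)**2, g(z) = indicator of {z >= b}, x = z.
    Returns (x, z, multiplier), where multiplier estimates the optimal
    dual variable of the constraint x >= b.
    """
    x = z = u = 0.0
    for _ in range(iters):
        # x-update: argmin of 0.5*(x - a)^2 + (rho/2)*(x - z + u)^2.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-update: analytical solution, a projection onto {z >= b}.
        z = max(b, x + u)
        # Scaled dual update accumulates the residual x - z.
        u = u + x - z
    multiplier = -rho * u
    return x, z, multiplier


x, z, lam = admm_lower_bound(a=0.0, b=1.0)
print(f"active constraint:   x={x:.6f}, multiplier={lam:.6f}")
x, z, lam = admm_lower_bound(a=2.0, b=1.0)
print(f"inactive constraint: x={x:.6f}, multiplier={lam:.6f}")
```

When the constraint is active the recovered multiplier is strictly positive, and when it is inactive the multiplier is zero; in the proof above, the analogous optimal dual parameter is what the tolling threshold is identified with.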
8.3 Proof of Theorem 3
When , from the variational inequality characterization for convex minimization, there exists a , such that