Due in large part to the increasing adoption of digital technologies, many applications that once treated users as passive entities must now consider users as active participants. In many application domains, a planner or coordinator, such as a platform provider (e.g., transportation network companies), is tasked with optimizing the performance of a system that people are actively interacting with, often in real time. For instance, the planner may want to drive the system toward more desirable behavior. While perhaps on competing ends of the spectrum, both revenue maximization and social welfare maximization fall under this umbrella.
A significant challenge in optimizing such an objective is that human preferences are unknown a priori and that solicited responses, on which the system depends, may not be reported truthfully (i.e., in accordance with true preferences) due to issues related to privacy or trust.
We consider a class of incentive design problems in which a planner does not know the underlying preferences, or decision-making process, of the agents that it is trying to coordinate. In the economics literature these types of problems are known as problems of asymmetric information, meaning that the involved parties do not possess the same information sets and, as is often the case, one party possesses some information to which the other party is not privy.
The particular type of information asymmetry which we consider, i.e. where the preferences of the agents are unknown to the planner, results in a problem of adverse selection. The classic example of adverse selection is the market for lemons [akerlof:1970aa] in which the seller of a used car knows more about the car than the buyer. There are a number of components that are hidden from the buyer such as the maintenance upkeep history, engine health, etc. Hence, the buyer could end up with a lemon instead of a cherry—i.e. a broken down piece of junk versus a sweet ride. Such problems have long been studied by economists.
The incentive design problem has also been explored by the control community, usually in the context of (reverse) Stackelberg games (see, e.g., [Ho:1981aa, Ho:1984aa, Liu:1992aa]). More recently, dynamic incentive design has been studied in the context of applications such as the power grid [Zhou:2017aa] and network congestion games [Barrera:2015aa]. We take a slightly different view by employing techniques from learning and control to develop an adaptive method of designing incentives in a setting where repeated decisions are made by multiple, competing agents whose preferences are unknown to the designer but who are subject to the incentives.
We assume that agents, including the planner, are cost minimizers. (While in the remainder we formulate the entire problem with all agents as cost minimizers, the utility maximization formulation is completely analogous.) The decision space of the agents is assumed to be continuous. We model each agent’s cost as a parametric function that is dependent on the choices of other agents and is modified by an incentive chosen by the planner. The planner not knowing the underlying preferences of the agents is tantamount to it not knowing the values of the parameters of the agents’ cost functions. Such parameters can be thought of as the type of the agent.
We formulate an adaptive incentive design problem in which the planner iteratively learns the agents’ preferences and optimizes the incentives offered to the agents so as to drive them to a more desirable set of choices. We derive an algorithm to solve this problem and provide theoretical results on convergence both for the case when the agents play according to a Nash equilibrium and for the case when the agents play myopically, e.g., according to a myopic update rule common in the theory of learning in games [fudenberg:1998aa]. Specifically, we formulate an algorithm for iteratively estimating preferences and designing incentives. By adopting tools from adaptive control and online learning, we show that the algorithm converges under reasonable assumptions.
The results have strong ties to both the adaptive control literature [goodwin:1984aa, kumar:1986aa, sastry:1999aa] and the online learning literature [cesa-bianchi:2006aa, nemirovski:2009aa, raginsky:2010aa]. The former gives us tools to do tracking of both the observed output (agents’ strategies) and the control input (incentive mechanism). It also allows us to go one step further and prove parameter convergence under some additional assumptions—persistence of excitation—on the problem formulation and, in particular, the utility learning and incentive design algorithm. The latter provides tools that allow us to generalize the algorithm and get faster convergence of the observed actions of the agents to a more desirable or even socially optimal outcome.
The remainder of the paper is organized as follows. We first introduce the problem of interest in Section 2. In Sections 3 and 4, we mathematically formulate the utility learning and incentive design problems and provide an algorithm for adaptive incentive design. We present convergence results for the Nash and myopic-play cases in Section LABEL:sec:main after which we draw heavily on adaptive control techniques to provide convergence results when the planner receives noisy observations. We provide illustrative numerical examples in Section LABEL:sec:examples and conclude in Section LABEL:sec:discussion.
2 Problem Formulation
We consider a problem in which there is a coordinator or planner with an objective, , that it desires to optimize by selecting ; however, this objective is a function of which is the response of non–cooperative strategic agents each providing response . Regarding the dimension of each , we assume without loss of generality that they are scalars. All the theoretical results and insights apply to the more general setting where each agent’s choice is of arbitrary finite dimension.
The goal of the planner is to design a mechanism to coordinate the agents by incentivizing them to choose an that ultimately leads to minimization of . Yet, the coordinator does not know the decision-making process by which these agents arrive at their collective response . As a consequence, there is asymmetric information between the agents and the planner.
Let us suppose that each agent has some type and a process that determines their choice . This process is dependent on the other agents and any mechanism designed by the planner. The classical approach in the economics literature is to solve this problem of so-called adverse selection [bolton:2005aa] by designing mechanisms that induce agents to take actions in a way that corresponds with their true decision-making process . In this approach, it is assumed that the coordinator has a prior on the type space of the agents, e.g., a probability distribution on . The coordinator then designs a mechanism (usually static) based on this assumed prior that encourages agents to act in accordance with their true preferences.
We take an alternative view in which we adopt control-theoretic and optimization techniques to adaptively learn the agents’ types while designing incentives to coordinate the agents around a more desirable (from the point of view of the planner) choice . Such a framework departs from one-shot decisions that assume all prior information is known at the start of the engagement and opens up opportunities for mechanisms that are dynamic and can learn over time.
We thus take the view that, in order to optimize its objective, the planner must learn the decision-making process and simultaneously design a mechanism that induces the agents to respond in such a way that the planner’s objective is optimized.
The planner first optimizes its objective function to find the desired response and the desired . That is, it determines the optimizers of its cost as if and are its decision variables. Of course, it may be the case that the set of optimizers of contains more than one pair ; in this case, the coordinator must choose amongst the set of optimizers. In order to realize , the planner must incentivize the agents to play by synthesizing mappings for each such that and is the collective response of the agents under their true processes .
We will consider two scenarios: (i) agents play according to a Nash equilibrium strategy; (ii) agents play according to a myopic update rule, e.g., approximate gradient play or fictitious play [fudenberg:1998aa].
In the first scenario, if the agents are assumed to play according to a Nash equilibrium strategy, then must be a Nash equilibrium in the game induced by . In particular, using the notation $x = (x_i, x_{-i})$, let agent $i$ have nominal cost $f_i$ and incentivized cost
$$f_i^{\gamma_i}(x_i, x_{-i}) = f_i(x_i, x_{-i}) + \gamma_i(x_i, x_{-i}).$$
The desired response is a Nash equilibrium of the incentivized game if
Hence, is a best response to for each . Formally, we define a Nash equilibrium as follows.
[Nash Equilibrium of the Incentivized Game] A point is a Nash equilibrium of the incentivized game if
If, for each , the inequality in (2) holds only for a neighborhood of , then is a local Nash equilibrium.
We make use of a sub-class of Nash equilibria called differential Nash equilibria, as they can be characterized locally and are thus amenable to computation. Let the differential game form [ratliff:2015aa, Definition 2] be defined by . [[ratliff:2015aa, Definition 4]] A strategy is a differential Nash equilibrium of if and is positive definite for each . Differential Nash equilibria are known to be generic amongst local Nash equilibria [ratliff:2014aa], structurally stable, and attracting under tâtonnement [ratliff:2015aa].
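To make the first- and second-order characterization concrete, the following minimal sketch checks the differential Nash conditions numerically for a two-player game with scalar actions. The quadratic costs, the linear incentives, the finite-difference derivatives, and all function names are illustrative assumptions rather than anything specified in the text.

```python
# Illustrative sketch only: the costs, incentives, and names are assumptions.

def d1(cost, xi, xo, h=1e-5):
    """Central-difference derivative of cost with respect to the own action xi."""
    return (cost(xi + h, xo) - cost(xi - h, xo)) / (2.0 * h)


def d2(cost, xi, xo, h=1e-4):
    """Central-difference second derivative of cost with respect to xi."""
    return (cost(xi + h, xo) - 2.0 * cost(xi, xo) + cost(xi - h, xo)) / (h * h)


def incentivize(cost, gamma):
    """Build the incentivized cost f_i + gamma_i."""
    return lambda xi, xo: cost(xi, xo) + gamma(xi, xo)


def is_differential_nash(costs, x, tol=1e-6):
    """Two players, scalar actions: check D_i f_i = 0 and D_ii f_i > 0 at x."""
    x1, x2 = x
    own_other = [(x1, x2), (x2, x1)]
    return all(
        abs(d1(c, xi, xo)) < tol and d2(c, xi, xo) > tol
        for c, (xi, xo) in zip(costs, own_other)
    )


# Hypothetical nominal costs; arguments are (own action, other's action).
def f1(xi, xo):
    return (xi - 1.0) ** 2 + xi * xo


def f2(xi, xo):
    return (xi + 1.0) ** 2 - xi * xo


# Hypothetical linear incentives chosen so that x = (0, 0) meets the conditions.
def g1(xi, xo):
    return 2.0 * xi


def g2(xi, xo):
    return -2.0 * xi


nominal = [f1, f2]
incentivized = [incentivize(f1, g1), incentivize(f2, g2)]

print(is_differential_nash(nominal, (1.2, -0.4)))
print(is_differential_nash(nominal, (0.0, 0.0)))
print(is_differential_nash(incentivized, (0.0, 0.0)))
```

In this toy game the origin is not an equilibrium of the nominal game, but it becomes a differential Nash equilibrium once the incentives are added, which is the effect the planner is after.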
In the second scenario, we assume the agents play according to a myopic update rule [fudenberg:1998aa] defined as follows. Given the incentive , agent ’s response is determined by the mapping
In addition, function maps the history, from time up to time , of the agents’ previous collective response to the current response where is the product space with copies of the space .
We aim to design an algorithm in which the planner performs a utility learning step and an incentive design step such that as the planner iterates through the algorithm, agents’ collective observed response converges to the desired response and the value of the incentive mapping evaluated at converges to the desired value . In essence, we aim to ensure asymptotic or approximate incentive compatibility. In the sections that follow, we describe the utility learning and the incentive design steps of the algorithm and then present the algorithm itself.
3 Utility Learning Formulation
We first formulate a general utility learning problem and then give examples in the Nash–play and myopic–play cases.
3.1 Utility Learning Under Nash–Play
We assume that the planner knows the parametric structure of the agents’ nominal cost functions and receives observations of the agents’ choices over time. That is, for each , we assume that the nominal cost function of agent has the form of a generalized linear model
is a vector of basis functions given by, assumed to be known to the planner, and is a parameter vector, , assumed unknown to the planner.
While our theory is developed for this case, we show through simulations in Section LABEL:sec:examples that the planner can be agnostic to the agents’ decision-making processes and still drive them to the desired outcome.
Let the set of basis functions for the agents’ cost functions be denoted by . We assume that elements of are and Lipschitz continuous. Thus the derivative of any function in is uniformly bounded.
The admissible set of parameters for agent , denoted by , is assumed to be a compact subset of and to contain the true parameter vector . We will use the notation when we need to make the dependence on the parameter explicit.
Note that we are limiting the problem of asymmetric information to one of adverse selection [bolton:2005aa] since it is the parameters of the cost functions that are unknown to the coordinator.
Similarly, we assume that the admissible incentive mappings have a generalized linear model of the form
where is a vector of basis functions, belonging to a finite collection , and assumed to be and Lipschitz continuous, and are parameters.
This framework can be generalized to use different subsets of the basis functions for different players, simply by constraining some of the parameters or to be zero. We choose to present the theory with a common number of basis functions across players in an effort to minimize the amount of notation that needs to be tracked by the reader.
At each iteration , the planner receives the collective response from the agents, i.e. , and has the incentive parameters that were issued.
We denote the set of observations up to time by —where is the observed Nash equilibrium of the nominal game (without incentives)—and the set of incentive parameters . Each of the observations is assumed to be an Nash equilibrium.
For the incentivized game , a Nash equilibrium necessarily satisfies the first- and second-order conditions and for each (see [ratliff:2015aa, Proposition 1]).
Under this model, we assume that the agents are playing a local Nash—that is, each is a local Nash equilibrium so that
for and , where with denoting the derivative of with respect to and where we define similarly. By an abuse of notation, we treat derivatives as vectors instead of co-vectors.
As noted earlier, without loss of generality, we take . This simplifies the notation and the presentation of the results considerably. All details for the general setting are provided in [ratliff:2015ab].
In addition, for each , we have
for and where and are the second derivative of and , respectively, with respect to .
Let the admissible set of ’s at iteration be denoted by . They are defined using the second–order conditions from the assumption that the observations at times are local Nash equilibria and are given by
These sets are nested, i.e. since at each iteration an additional constraint is added to the previous set. These sets are also convex since they are defined by semidefinite constraints [boyd:2004aa]. Moreover, for all since, by assumption, each observation is a local Nash equilibrium.
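With scalar actions, each second-order condition is a single linear inequality in the unknown parameter, so each admissible set is an intersection of halfspaces. The sketch below illustrates the nesting and the membership test under that reading. The basis functions, the sign convention of the inequality, and all names (`make_constraint`, `in_admissible_set`) are hypothetical choices made for illustration.

```python
# Illustrative sketch only: bases, sign convention, and names are assumptions.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_constraint(d2_phi, d2_psi, alpha):
    """Encode <d2_phi, theta> + <d2_psi, alpha> >= 0 as a pair (c, d)."""
    return list(d2_phi), dot(d2_psi, alpha)


def in_admissible_set(theta, constraints):
    """True if theta satisfies every second-order inequality collected so far."""
    return all(dot(c, theta) + d >= 0.0 for c, d in constraints)


# Hypothetical bases: Phi = (x_i^2, x_i * x_other, x_i), Psi = (x_i, x_i^2).
# Their second derivatives in x_i are constant for these choices.
D2_PHI = [2.0, 0.0, 0.0]
D2_PSI = [0.0, 2.0]

# One constraint per observed equilibrium; alpha is the incentive issued then.
c0 = make_constraint(D2_PHI, D2_PSI, [0.0, 0.0])    # 2 * theta_1 >= 0
c1 = make_constraint(D2_PHI, D2_PSI, [0.0, -0.5])   # 2 * theta_1 - 1 >= 0

set_k0 = [c0]
set_k1 = [c0, c1]  # adding a constraint can only shrink the set

theta_a = [1.0, 1.0, -2.0]
theta_b = [0.2, 0.0, 0.0]

print(in_admissible_set(theta_a, set_k0), in_admissible_set(theta_a, set_k1))
print(in_admissible_set(theta_b, set_k0), in_admissible_set(theta_b, set_k1))
```

Any parameter that passes the later test also passes the earlier one, which is the nesting property; the converse fails for `theta_b`.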
Since the planner sets the incentives, given the response , it can compute the quantity , which is equal to
by the first order Nash condition (6). Thus, if we let and , we have
Then, the coordinator has observations and regression vectors . We use the notation for the regression vectors of all the agents at iteration .
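The construction of observations and regression vectors from the first-order condition can be sketched as follows for a single agent with a scalar action. The sketch assumes the observation is the derivative of the incentive at the observed response and the regression vector is the negated derivative of the nominal basis; this sign convention, the bases, and the names `theta_true`, `alpha`, and `regression_pair` are illustrative assumptions.

```python
# Illustrative sketch only: bases, sign convention, and names are assumptions.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def d_phi(xi, xo):
    """Derivative in xi of the hypothetical nominal basis (xi^2, xi*xo, xi)."""
    return [2.0 * xi, xo, 1.0]


def d_psi(xi, xo):
    """Derivative in xi of the hypothetical incentive basis (xi, xi^2)."""
    return [1.0, 2.0 * xi]


def regression_pair(xi, xo, alpha):
    """Return (y, xi_vec) such that y = <xi_vec, theta_true> at a Nash response.

    The first-order condition <d_phi, theta> + <d_psi, alpha> = 0 is rearranged
    so that the known incentive term is the observation.
    """
    y = dot(d_psi(xi, xo), alpha)
    xi_vec = [-v for v in d_phi(xi, xo)]
    return y, xi_vec


# Hypothetical true parameters: f(xi, xo) = xi^2 + xi*xo - 2*xi.
theta_true = [1.0, 1.0, -2.0]

# Two responses that satisfy the agent's first-order condition
# under the incentive parameters listed next to them.
samples = [
    ((0.0, 0.0), [2.0, 0.0]),
    ((0.2, 0.4), [1.0, 0.5]),
]

for (xi, xo), alpha in samples:
    y, xi_vec = regression_pair(xi, xo, alpha)
    print(y, dot(xi_vec, theta_true))
```

Each printed pair agrees up to rounding, so every observed equilibrium gives the planner one linear equation in the unknown parameter.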
3.2 Utility Learning Under Myopic–Play
As in the Nash–play case, we assume the planner knows the parametric structure of the myopic update rule. That is to say, the nominal update function is parameterized by over basis functions and the incentive mapping at iteration is parameterized by over basis functions . We assume the planner observes the initial response and we denote the past responses up to iteration by . The general architecture for the myopic update rule is given by
Note that the update rule does not need to depend on the whole sequence of past responses. It could depend only on the most recent response or on a subset, say for .
As before, we denote the set of admissible parameters for player by which we assume to be a compact subset of . In contrast to the Nash–play case, our admissible set of parameters is no longer time varying so that for all .
Keeping consistent with the notation of the previous sections, we let and so that the myopic update rule can be re-written as Analogous to the previous case, the coordinator has observations and regression vectors . Again, we use the notation for the regression vectors of all the agents at iteration .
Note that the form of the myopic update rule is general enough to accommodate a number of game-theoretic learning algorithms including approximate fictitious play and gradient play [fudenberg:1998aa].
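As an illustration of how gradient play fits a rule that is linear in known basis functions, the sketch below writes gradient play for a two-player quadratic game in that form and adds an incentive term. The additive structure of the update, the step size `ETA`, the bases `phi` and `psi`, and the specific costs are assumptions made for the example.

```python
# Illustrative sketch only: update form, bases, and costs are assumptions.

ETA = 0.1  # hypothetical gradient-play step size


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Nominal basis evaluated at the most recent joint response."""
    return [x[0], x[1], 1.0]


def psi(x):
    """Incentive basis (a single constant function in this example)."""
    return [1.0]


def myopic_step(x, thetas, alphas):
    """Next response of each agent: <phi(x), theta_i> + <psi(x), alpha_i>."""
    return [dot(phi(x), th) + dot(psi(x), al) for th, al in zip(thetas, alphas)]


def run(x0, thetas, alphas, steps=300):
    """Iterate the myopic rule from x0 for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        x = myopic_step(x, thetas, alphas)
    return x


# Gradient play on f1 = (x1 - 1)^2 + x1*x2 and f2 = (x2 + 1)^2 - x1*x2:
#   x1 <- x1 - ETA * (2*(x1 - 1) + x2),  x2 <- x2 - ETA * (2*(x2 + 1) - x1).
THETAS = [
    [1.0 - 2.0 * ETA, -ETA, 2.0 * ETA],
    [ETA, 1.0 - 2.0 * ETA, -2.0 * ETA],
]

NO_INCENTIVE = [[0.0], [0.0]]
# Linear incentives 2*x1 and -2*x2 shift each gradient by a constant.
INCENTIVE = [[-2.0 * ETA], [2.0 * ETA]]

x_nominal = run([0.0, 0.0], THETAS, NO_INCENTIVE)
x_incentivized = run([1.0, 1.0], THETAS, INCENTIVE)
print(x_nominal, x_incentivized)
```

Without incentives the iterates settle at the nominal equilibrium of this toy game; with the constant shifts they settle at the origin instead.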
3.3 Unified Framework for Utility Learning
We can describe both the Nash–play and myopic–play cases in a unified framework as follows. At iteration , the planner receives a response which lives in the set which is either the set of local Nash equilibria of the incentivized game or the unique response determined by the incentivized myopic update rule at iteration . The planner uses the past responses and incentive parameters to generate the set of observations and regression vectors .
The utility learning problem is formulated as an online optimization problem in which parameter updates are calculated following the gradient of a loss function. For each agent, consider the loss function given by
$$\ell(\theta_i^k) = \frac{1}{2}\left\|y_i^{k+1} - \xi_i^k\theta_i^k\right\|_2^2,$$
which evaluates the error between the predicted observation and the true observation at time for each player. Here $\theta_i^k$ denotes the estimate of agent $i$'s parameter vector at iteration $k$.
In order to minimize this loss, we introduce a well-known generalization of the projection operator. Denote by the set of subgradients of at . A convex continuous function is a distance generating function with modulus with respect to a reference norm , if the set is convex and restricted to , is continuously differentiable and strongly convex with parameter , that is
The function , defined by
is the Bregman divergence [bregman:1967aa] associated with . By definition, is non-negative and strongly convex with modulus . Given a subset and a point , the mapping defined by
is the prox-mapping induced by on . This mapping is well-defined, the minimizer is unique by strong convexity of , and is a contraction at iteration , [moreau:1965aa, Proposition 5.b].
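The following sketch illustrates the two standard instances of these objects. It assumes the common convention in which the divergence from a to b is ω(b) − ω(a) − ⟨∇ω(a), b − a⟩ and the prox-mapping at θ applied to g minimizes ⟨g, t⟩ plus the divergence from θ to t over the feasible set. The argument order and all function names are assumptions, since the exact convention is not recoverable from the text here.

```python
# Illustrative sketch only: argument order and names are assumed conventions.
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def bregman(omega, grad_omega, a, b):
    """V(a, b) = omega(b) - omega(a) - <grad omega(a), b - a>."""
    diff = [bj - aj for aj, bj in zip(a, b)]
    return omega(b) - omega(a) - dot(grad_omega(a), diff)


def half_sq_norm(t):
    """Euclidean distance generating function 0.5 * ||t||^2."""
    return 0.5 * dot(t, t)


def grad_half_sq_norm(t):
    return list(t)


def neg_entropy(t):
    """Negative entropy on the interior of the probability simplex."""
    return sum(v * math.log(v) for v in t)


def grad_neg_entropy(t):
    return [1.0 + math.log(v) for v in t]


def entropy_prox(theta, g):
    """Closed-form minimizer over the simplex of <g, t> + V(theta, t)
    for the negative-entropy generator: t_j proportional to theta_j * exp(-g_j)."""
    w = [tj * math.exp(-gj) for tj, gj in zip(theta, g)]
    z = sum(w)
    return [v / z for v in w]


def objective(theta, g, t):
    """The quantity that the entropy prox-mapping minimizes over the simplex."""
    return dot(g, t) + bregman(neg_entropy, grad_neg_entropy, theta, t)


theta = [0.5, 0.3, 0.2]
g = [1.0, 0.0, -1.0]
p = entropy_prox(theta, g)
uniform = [1.0 / 3.0] * 3

print(bregman(half_sq_norm, grad_half_sq_norm, [0.0, 0.0], [3.0, 4.0]))
print(objective(theta, g, p), objective(theta, g, theta), objective(theta, g, uniform))
```

With the Euclidean generator the divergence is half the squared distance, and with negative entropy the prox-mapping becomes a multiplicative-weights style update that stays on the simplex.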
Given the loss function , a positive, non-increasing sequence of learning rates , and a distance generating function , the parameter estimate of each is updated at iteration as follows
Note that if the distance generating function is , then the associated Bregman divergence is the Euclidean distance, , and the corresponding prox–mapping is the Euclidean projection on the set , which we denote by , so that
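In the Euclidean case the update is a projected gradient step on the squared prediction error. The sketch below shows one such step and a short run, assuming a scalar observation, a box-shaped admissible set, a constant learning rate, and noise-free observations generated from a hypothetical true parameter. The regressors cycle through the coordinate directions so that every parameter is excited.

```python
# Illustrative sketch only: the box set, step size, and data are assumptions.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def project_box(theta, lo, hi):
    """Euclidean projection onto the box [lo, hi]^m (a compact convex set)."""
    return [min(max(t, lo), hi) for t in theta]


def update(theta, xi_vec, y, eta, lo=-3.0, hi=3.0):
    """One projected gradient step on 0.5 * (y - <xi_vec, theta>)^2."""
    residual = y - dot(xi_vec, theta)
    # The gradient of the loss in theta is -residual * xi_vec.
    stepped = [t + eta * residual * x for t, x in zip(theta, xi_vec)]
    return project_box(stepped, lo, hi)


theta_true = [1.0, 1.0, -2.0]   # hypothetical parameter, unknown to the learner
theta_hat = [0.0, 0.0, 0.0]     # initial estimate
regressors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

for k in range(90):
    xi_vec = regressors[k % 3]
    y = dot(xi_vec, theta_true)  # noise-free observation
    theta_hat = update(theta_hat, xi_vec, y, eta=0.5)

print(theta_hat)
```

If the regressors excited only one direction, the prediction error would still vanish but the remaining parameters would not be identified, which is why a persistence of excitation condition is needed for parameter convergence.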
4 Incentive Design Formulation
In the previous section, we described the parameter update step that will be used in our utility learning and incentive design problem. We now describe how the incentive parameters for each iteration are selected. In particular, at iteration , after updating the parameter estimates for each agent, the data available to the planner include the past observations , the incentive parameters , and an estimate of each for . The planner then uses the past data along with the parameter estimates to find an such that the incentive mapping for each player evaluates to at and . That is, if the agents are rational and play Nash, then is a local Nash equilibrium of the game where denotes the incentivized cost of player parameterized by . On the other hand, if the agents are myopic, then, for each ,
In the following two subsections, for each of these cases, we describe how is selected.
4.1 Incentive Design: Nash–Play
Given that is parameterized by , the goal is to find for each such that is a local Nash equilibrium of the game
and such that for each .
Assumption 4.1. For every where , there exist for each such that is the induced differential Nash equilibrium in the game and where .
We remark that the above assumption is not restrictive in the following sense. Finding that induces the desired Nash equilibrium and results in evaluating to the desired incentive value amounts to finding such that the first- and second-order sufficient conditions for a local Nash equilibrium are satisfied given our estimate of the agents’ cost functions. That is, for each , we need to find satisfying
If is full rank, i.e. has rank , then there exists a that solves the first equation in (14). If the number of basis functions satisfies , then the rank condition is not unreasonable and, in fact, there are multiple solutions. In essence, by selecting to be large enough, the planner is allowing for enough degrees of freedom to ensure there exists a set of parameters that induce the desired result. Moreover, the problem of finding reduces to a convex feasibility problem.
The convex feasibility problem defined by (14) can be formulated as a constrained least-squares optimization problem. Indeed, for each ,
for some . By Assumption 4.1, for each , there is an such that the cost is exactly minimized.
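A simplified version of this design step for one agent with a scalar action can be sketched as follows. It solves the two linear conditions (the first-order condition at the desired response and the requirement that the incentive take the desired value) with a minimum-norm solution, and then checks the second-order condition against a margin `eps` afterwards rather than imposing it as a constraint. This is a swapped-in simplification of the constrained least-squares problem, and the bases, the estimate `theta_hat`, and all names are hypothetical.

```python
# Illustrative sketch only: a minimum-norm solve followed by a curvature check,
# not the paper's constrained least-squares program. Bases and names are assumed.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_solution(rows, rhs):
    """Minimum-norm alpha with <rows[0], alpha> = rhs[0] and <rows[1], alpha> = rhs[1]."""
    a, b = rows
    g11, g12, g22 = dot(a, a), dot(a, b), dot(b, b)
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("rank condition fails: the two rows are linearly dependent")
    l1 = (g22 * rhs[0] - g12 * rhs[1]) / det
    l2 = (g11 * rhs[1] - g12 * rhs[0]) / det
    return [l1 * ai + l2 * bi for ai, bi in zip(a, b)]


def design_incentive(theta_hat, x_d, v_d, eps=1e-3):
    """Return (alpha, ok) for nominal basis (xi^2, xi*xo, xi) and
    incentive basis (xi, xi^2, 1), evaluated at the desired response x_d."""
    xi, xo = x_d
    d_phi = [2.0 * xi, xo, 1.0]       # derivative of the nominal basis in xi
    d_psi = [1.0, 2.0 * xi, 0.0]      # derivative of the incentive basis in xi
    psi = [xi, xi * xi, 1.0]          # incentive basis values
    alpha = min_norm_solution([d_psi, psi], [-dot(d_phi, theta_hat), v_d])
    # Second derivative of the incentivized cost in xi under the estimate.
    curvature = 2.0 * theta_hat[0] + 2.0 * alpha[1]
    return alpha, curvature >= eps


theta_hat = [1.0, 1.0, -2.0]          # hypothetical current parameter estimate
x_d, v_d = (0.5, -0.5), 0.3           # hypothetical desired response and value
alpha, ok = design_incentive(theta_hat, x_d, v_d)
print(alpha, ok)
```

With three incentive basis functions and two linear conditions the system is underdetermined, so solutions exist whenever the rank condition holds, mirroring the role of having enough basis functions.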
The choice of determines how well-conditioned the second-order derivatives of the agents’ costs with respect to their own choice variables are. In addition, we note that if there are a large number of incentive basis functions, it may be reasonable to incorporate a cost for sparsity, e.g., ; however, the optimal solution in this case is not guaranteed to satisfy (14).
It is desirable for the induced local Nash equilibrium to be a stable, non-degenerate differential Nash equilibrium so that it is attracting in a neighborhood under the gradient flow [ratliff:2015aa]. To enforce this, the planner must add constraints to the feasibility problem defined by (14). In particular, second-order conditions on the players’ cost functions must be satisfied, i.e., the derivative of the differential game form must be positive definite [ratliff:2015aa, Theorem 2]. This reduces to ensuring where