## 1 Introduction

Due in large part to the increasing adoption of digital technologies, many applications that once treated users as passive entities must now consider them active participants. In many application domains, a planner or coordinator, such as a platform provider (e.g., a transportation network company), is tasked with optimizing the performance of a system that people actively interact with, often in real time. For instance, the planner may want to drive the system to a more desirable behavior. While perhaps on competing ends of the spectrum, both revenue maximization and social welfare maximization fall under this umbrella.

A significant challenge in optimizing such an objective is the
fact that human preferences are unknown *a priori* and the
solicited responses on which the system depends may not be reported truthfully (i.e., in
accordance with users' true preferences) due to issues related to privacy or trust.

We consider a class of incentive design problems in which a planner
does not know the underlying preferences, or decision-making process, of the agents that it is trying to coordinate. In the economics
literature these types of problems are known as problems of *asymmetric
information*—meaning that the involved parties do not possess the same
information sets and, as is often the case, one party possesses some information
to which the other party is not privy.

The particular type of information
asymmetry which we consider, i.e. where the preferences of the agents are
unknown to the planner, results in a problem of *adverse selection*. The classic example of adverse selection is the *market for
lemons* [akerlof:1970aa] in which the seller of a used car knows more
about the car than
the buyer. There are a number of components that are hidden from the buyer such
as the maintenance upkeep history, engine health, etc. Hence, the buyer could
end up with a *lemon* instead of a *cherry*—i.e. a broken down
piece of junk versus a sweet ride.
Such problems have long been studied by economists.

The incentive design problem has also been explored by the control community, usually in the context of (reverse) Stackelberg games (see, e.g., [Ho:1981aa, Ho:1984aa, Liu:1992aa]). More recently, dynamic incentive design has been studied in the context of applications such as the power grid [Zhou:2017aa] and network congestion games [Barrera:2015aa]. We take a slightly different view by employing techniques from learning and control to develop an adaptive method of designing incentives in a setting where repeated decisions are made by multiple, competing agents whose preferences are unknown to the designer, yet who are subjected to the incentives.

We assume that all agents, including the planner, are cost minimizers. (While in the remainder we formulate the entire problem assuming all agents are cost minimizers, the utility maximization formulation is completely analogous.) The decision spaces of the agents are assumed to be continuous.
We model each agent’s cost as a parametric function that is dependent on the
choices of other agents and is modified by an incentive chosen by the planner.
The planner not knowing the underlying preferences of the
agents is tantamount to it not knowing the values of the parameters
of the agents' cost functions. Such parameters can be thought of as the
*type* of an agent.

We formulate an adaptive incentive design problem in which the planner iteratively learns the agents' preferences and optimizes the incentives offered to the agents so as to drive them to a more desirable set of choices. We derive an algorithm to solve this problem and provide theoretical convergence results both for the case when the agents play according to a Nash equilibrium and for the case when the agents play myopically—e.g., according to a myopic update rule common in the theory of learning in games [fudenberg:1998aa]. Specifically, we formulate an algorithm for iteratively estimating preferences and designing incentives. By adopting tools from adaptive control and online learning, we show that the algorithm converges under reasonable assumptions.

The results have strong ties to both the adaptive control
literature [goodwin:1984aa, kumar:1986aa, sastry:1999aa] and the online
learning literature [cesa-bianchi:2006aa, nemirovski:2009aa, raginsky:2010aa]. The former provides tools for tracking both the observed output
(agents' strategies) and the control input (incentive mechanism). It also allows
us to go one step further and prove parameter convergence under an
additional assumption—*persistence of excitation*—on the problem formulation and, in particular, on the
utility learning and incentive design algorithm. The latter
provides tools that allow us to generalize the algorithm and obtain faster
convergence of the observed actions of the agents to a more desirable or even
socially optimal outcome.

The remainder of the paper is organized as follows. We first introduce the problem of interest in Section 2. In Sections 3 and 4, we mathematically formulate the utility learning and incentive design problems and provide an algorithm for adaptive incentive design. We present convergence results for the Nash and myopic-play cases in Section LABEL:sec:main after which we draw heavily on adaptive control techniques to provide convergence results when the planner receives noisy observations. We provide illustrative numerical examples in Section LABEL:sec:examples and conclude in Section LABEL:sec:discussion.

## 2 Problem Formulation

We consider a problem in which there is a coordinator or planner with an objective that it desires to optimize by selecting an incentive mechanism; however, this objective is a function of the collective response of non-cooperative strategic agents, each providing an individual response. Regarding the dimension of each agent's response, we assume without loss of generality that it is scalar. All the theoretical results and insights apply to the more general setting where each agent's choice is of arbitrary finite dimension.

The goal of the planner is to design a mechanism to *coordinate* the
agents by incentivizing them to choose responses that ultimately lead to the minimization of
its objective.
Yet, the
coordinator does not know the decision-making process by which these agents
arrive at their collective response. As a consequence there is *asymmetric
information* between the agents and the planner.

Let us suppose that each agent has some type and a decision-making process
that determines their *choice*. This process depends on the other agents and on any mechanism
designed by the planner. The classical approach in the economics literature
is to solve this problem of so-called *adverse
selection* [bolton:2005aa] by designing mechanisms that induce agents to
take actions in a way that corresponds with their true decision-making
process. In this approach, it is assumed that the coordinator has a
*prior* on the type space of the agents—e.g., a probability distribution over
possible types. The coordinator then designs a mechanism (usually static) based on this assumed prior that encourages agents to act in accordance with their true preferences.

We take an alternative view in which we adopt control-theoretic and optimization
techniques to adaptively learn the agents' types while designing incentives to
coordinate the agents around a choice that is more desirable from the point of view of the
planner.
Such a framework departs from one-shot decisions that assume all prior
information is known at the start of the engagement and opens up opportunities
for mechanisms that are dynamic and can *learn* over time.

We thus take the view that, in order to optimize its objective, the planner must learn the agents' decision-making processes and simultaneously design a mechanism that induces the agents to respond in such a way that the planner's objective is optimized.

The planner first optimizes its objective function to find the
*desired* collective response and the *desired* incentive value. That is, it
determines the optimizers of its cost as if the response and the incentive value were its decision
variables.
Of course, it may be the case that the set of optimizers
contains more than one such pair; in this case, the coordinator must
choose amongst the set of optimizers. In order to realize the desired response, the planner
must incentivize the agents to play it by synthesizing an incentive mapping
for each agent such that the desired response is the collective response of the agents under their true
decision-making processes and the incentive mappings evaluate to the desired incentive value at that response.

We will consider two scenarios: (i) agents play
according to a *Nash equilibrium* strategy; (ii) agents play according to a *myopic*
update rule—e.g., approximate gradient play or fictitious play [fudenberg:1998aa].

In the first scenario, if the agents are assumed to play according to a Nash equilibrium strategy,
then the desired response x^d must be a Nash equilibrium of the game induced by
the incentives γ_i. In particular, using the notation x = (x_i, x_-i), let agent i have
*nominal* cost f_i(x_i, x_-i) and *incentivized* cost

f_i^γ_i(x_i, x_-i) = f_i(x_i, x_-i) + γ_i(x_i, x_-i).

The desired response x^d is a Nash equilibrium of the incentivized game if

f_i^γ_i(x_i^d, x_-i^d) ≤ f_i^γ_i(x_i, x_-i^d) for all admissible x_i.   (1)

Hence, x_i^d is a *best response* to x_-i^d for each agent i.
Formally, we define a Nash equilibrium as follows.

[Nash Equilibrium of the Incentivized Game] A point x* is a Nash equilibrium of the incentivized game if, for each agent i,

f_i^γ_i(x_i*, x_-i*) ≤ f_i^γ_i(x_i, x_-i*) for all admissible x_i.   (2)

If, for each i, the inequality in (2) holds only for x_i in
a neighborhood of x_i*, then x* is a *local Nash
equilibrium*.

We make use of a sub-class of Nash equilibria called
*differential Nash equilibria*, as they can be characterized locally and are
thus amenable to computation. Let the *differential game form* [ratliff:2015aa, Definition 2]
be defined by the collection of the agents' individual derivatives (D_1 f_1^γ_1, …, D_n f_n^γ_n), where D_i denotes differentiation with respect to x_i.
[[ratliff:2015aa, Definition 4]]
A strategy x* is a differential Nash
equilibrium of the incentivized game if D_i f_i^γ_i(x*) = 0 and D_i^2 f_i^γ_i(x*) is positive definite for
each i.
Differential Nash equilibria are known to be generic amongst local Nash
equilibria [ratliff:2014aa], and to be structurally stable and attracting under
tâtonnement [ratliff:2015aa].
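As a concrete illustration (the game, parameter values, and incentive form below are our own toy choices, not from the paper), the following sketch picks linear incentives that induce a desired response in a two-player quadratic game and numerically verifies the differential Nash conditions at that point:

```python
import numpy as np

# Hypothetical 2-player quadratic game (illustrative parameters only):
# nominal cost f_i(x_i, x_-i) = 0.5*a_i*x_i^2 + b_i*x_i*x_-i.
a = np.array([2.0, 3.0])
b = np.array([0.5, -1.0])

x_d = np.array([1.0, -2.0])   # desired collective response

# Linear incentive gamma_i(x_i) = c_i*x_i chosen so that the first-order
# Nash condition of the incentivized game holds at x_d:
#   a_i*x_i^d + b_i*x_-i^d + c_i = 0.
c = -(a * x_d + b * x_d[::-1])

def grad_i(i, x):
    """Derivative of player i's incentivized cost with respect to x_i."""
    return a[i] * x[i] + b[i] * x[1 - i] + c[i]

# Differential Nash conditions at x_d: zero individual gradient and
# positive second derivative (here simply a_i > 0) for each player.
for i in range(2):
    assert abs(grad_i(i, x_d)) < 1e-12
    assert a[i] > 0
print("x_d is a differential Nash equilibrium of the incentivized game")
```

The choice of c here is exactly the design freedom the planner exploits: with the incentive absorbed into each player's first-order condition, any target response can be made stationary.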

In the second scenario, we assume the agents play according to a *myopic
update rule* [fudenberg:1998aa] defined as follows.
Given the incentive γ_i, agent i's response at iteration k+1
is determined by a mapping of the form

x_{i,k+1} = g_i^γ_i(x_k, x_{k-1}, …, x_{k-m}).   (3)

That is, the function g_i^γ_i maps the history, from time k-m up to time k, of the agents' previous collective responses to the current response; it is a map from X^{m+1} to X_i, where X^{m+1} is the product space with m+1 copies of the joint strategy space X.
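A minimal simulation of one such myopic rule, assuming gradient play on a toy two-player quadratic game (the costs, incentives, and step size are our own illustrative choices):

```python
import numpy as np

# Toy gradient-play dynamics for two agents with quadratic incentivized
# costs f_i^gamma(x) = 0.5*a_i*x_i^2 + b_i*x_i*x_-i + c_i*x_i.
# All parameter values are illustrative, not from the paper.
a = np.array([2.0, 3.0])
b = np.array([0.5, -1.0])
c = np.array([-1.0, 7.0])    # incentives inducing the equilibrium below
eta = 0.1                    # step size of the myopic update

x = np.array([5.0, 5.0])     # initial collective response
for _ in range(500):
    grad = a * x + b * x[::-1] + c
    x = x - eta * grad       # update depends only on x_k, i.e. m = 0

# The play converges to the point where both first-order conditions vanish,
# which solves the linear system [[a0, b0], [b1, a1]] @ x = -c.
x_star = np.linalg.solve(np.array([[a[0], b[0]], [b[1], a[1]]]), -c)
print(x, x_star)
```

With this step size the update map is a contraction, so the iterates settle at the induced equilibrium x_star = (1, -2).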

We aim to design an algorithm in which the planner performs a *utility
learning* step and an *incentive design* step such that, as the planner
iterates through the algorithm, the agents' collective observed response converges to the
desired response and the value of the incentive mapping evaluated at that response converges to
the desired incentive value. In essence, we aim to ensure *asymptotic* or
*approximate* incentive compatibility. In the sections that follow, we describe the utility
learning and incentive design steps of the algorithm and then present the
algorithm itself.
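The interplay of the two steps can be sketched as a schematic loop; the helper functions below are hypothetical placeholders standing in for the utility-learning and incentive-design steps, not the paper's algorithm:

```python
# Schematic adaptive incentive design loop (placeholder callables).
def adaptive_incentive_design(theta_hat, v, observe, learn, design, K):
    """Iterate: issue incentives, observe play, update estimates, redesign.

    theta_hat -- initial estimate of the agents' cost parameters
    v         -- initial incentive parameters
    observe   -- maps issued incentive parameters to the agents' response
    learn     -- utility-learning update (e.g. projected gradient on a loss)
    design    -- incentive-design step (e.g. constrained least squares)
    K         -- number of iterations
    """
    history = []
    for k in range(K):
        x = observe(v)                      # agents respond to incentives
        theta_hat = learn(theta_hat, x, v)  # utility-learning step
        v = design(theta_hat)               # incentive-design step
        history.append((x, theta_hat, v))
    return history
```

For instance, instantiating the placeholders with simple scalar maps runs the loop end to end, which is how the concrete steps of the following sections plug together.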

## 3 Utility Learning Formulation

We first formulate a general utility learning problem, and then we treat the Nash-play and myopic-play cases.

### 3.1 Utility Learning Under Nash–Play

We assume that the planner knows the parametric structure of the agents' nominal cost functions and receives observations of the agents' choices over time. That is, for each agent i, we assume that the nominal cost function has the form of a generalized linear model

f_i(x_i, x_-i) = ⟨θ_i, φ_i(x_i, x_-i)⟩,   (4)

where φ_i is a vector of basis functions, assumed to be known to the planner, and θ_i is a parameter vector, assumed unknown to the planner. While our theory is developed for this case, we show through simulations in Section LABEL:sec:examples that the planner can be agnostic to the agents' decision-making processes and still drive them to the desired outcome.

Let the set of basis functions for the agents' cost functions be denoted by Φ. We assume that the elements of Φ are twice continuously differentiable and *Lipschitz continuous*. Thus the derivative of any function
in Φ is uniformly bounded.

The admissible set of parameters for agent i, denoted by Θ_i, is assumed to be a compact set containing the true parameter vector θ_i. We will write f_i(·; θ) when we need to make the dependence on the parameter explicit.

Note that we are limiting the problem of asymmetric information to
one of *adverse selection* [bolton:2005aa] since it is the parameters
of the cost functions that are unknown to the coordinator.

Similarly, we assume that the admissible incentive mappings have a generalized linear model of the form

γ_i(x_i, x_-i) = ⟨v_i, ψ_i(x_i, x_-i)⟩,   (5)

where ψ_i is a vector of basis functions, belonging to a finite collection Ψ and assumed to be twice continuously differentiable and *Lipschitz continuous*, and v_i is the vector of incentive parameters.

This framework can be generalized to use different subsets of the basis functions for different players, simply by constraining some of the parameters θ_i or v_i to be zero. We choose to present the theory with a common number of basis functions across players in an effort to minimize the amount of notation that needs to be tracked by the reader.

At each iteration k, the planner receives the collective response x_{k+1} from the agents and knows the incentive parameters v_k that were issued.

We denote the set of observations up to iteration k by {x_0, x_1, …, x_k}—where x_0 is the observed Nash equilibrium of the nominal game (without incentives)—and the set of issued incentive parameters by {v_0, …, v_{k-1}}. Each of the observations is assumed to be a Nash equilibrium.

For the incentivized game, a Nash equilibrium x* necessarily satisfies the first- and second-order conditions D_i f_i^γ_i(x*) = 0 and D_i^2 f_i^γ_i(x*) ≥ 0 for each i (see [ratliff:2015aa, Proposition 1]).

Under this model, we assume that the agents are playing a local Nash equilibrium—that is, each observation x_{k+1} is a local Nash equilibrium, so that

⟨θ_i, D_i φ_i(x_{k+1})⟩ + ⟨v_{i,k}, D_i ψ_i(x_{k+1})⟩ = 0   (6)

for each i and each k, where D_i φ_i denotes the derivative of φ_i with respect to x_i and D_i ψ_i is defined similarly. By an abuse of notation, we treat derivatives as vectors instead of co-vectors.

As noted earlier, without loss of generality, we take each agent's choice variable to be scalar. This makes the notation significantly simpler and the presentation of the results much cleaner. All details for the general setting are provided in [ratliff:2015ab].

In addition, for each observation, we have the second-order condition

⟨θ_i, D_i^2 φ_i(x_{k+1})⟩ + ⟨v_{i,k}, D_i^2 ψ_i(x_{k+1})⟩ ≥ 0   (7)

for each i and each k, where D_i^2 φ_i and D_i^2 ψ_i are the second derivatives of φ_i and ψ_i, respectively, with respect to x_i.

Let the admissible set of θ_i's at iteration k be denoted by Θ_{i,k}. These sets are defined using the second-order conditions from the assumption that the observations at times 0 through k are local Nash equilibria, and are given by

Θ_{i,k} = {θ ∈ Θ_i : ⟨θ, D_i^2 φ_i(x_{j+1})⟩ + ⟨v_{i,j}, D_i^2 ψ_i(x_{j+1})⟩ ≥ 0, j = 0, …, k-1}.   (8)

These sets are nested, i.e. Θ_{i,k+1} ⊆ Θ_{i,k}, since at each iteration an additional constraint is added to the previous set. They are also convex since they are defined by semidefinite constraints [boyd:2004aa]. Moreover, the true parameter vector θ_i belongs to Θ_{i,k} for all k since, by assumption, each observation is a local Nash equilibrium.

Since the planner sets the incentives, given the response x_{k+1} it can compute the quantity ⟨v_{i,k}, D_i ψ_i(x_{k+1})⟩, which is equal to −⟨θ_i, D_i φ_i(x_{k+1})⟩
by the first-order Nash condition (6). Thus, if we let y_{k+1}^i = −⟨v_{i,k}, D_i ψ_i(x_{k+1})⟩ and ξ_k^i = D_i φ_i(x_{k+1}), we have y_{k+1}^i = ⟨θ_i, ξ_k^i⟩.
Then, the coordinator has observations {y_{k+1}^i} and *regression
vectors* {ξ_k^i}.
We use the notation ξ_k = (ξ_k^1, …, ξ_k^n) for the regression
vectors of all the agents at iteration k.
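As an illustrative sketch (scalar choices; the basis functions and parameter values are our own, not from the paper), the planner-side construction of an observation and regression vector from the first-order condition can look like:

```python
import numpy as np

# Toy example: agent 1's nominal marginal cost is <theta, dphi(x)> and the
# incentive's marginal is <v, dpsi(x)>; at a local Nash response they sum
# to zero. All basis functions below are our own illustrative choices.
theta_true = np.array([2.0, 0.5])            # unknown to the planner

def dphi(x):   # derivatives of cost basis functions w.r.t. x_1
    return np.array([x[0], x[1]])            # e.g. phi = (x1^2/2, x1*x2)

def dpsi(x):   # derivatives of incentive basis functions w.r.t. x_1
    return np.array([1.0, x[0]])             # e.g. psi = (x1, x1^2/2)

v = np.array([1.0, 0.2])                     # issued incentive parameters

# Solve the first-order condition <theta, dphi(x)> + <v, dpsi(x)> = 0 for
# agent 1's scalar response, with the opponent's choice fixed at x2 = 1.0:
#   theta0*x1 + theta1*1.0 + v0 + v1*x1 = 0.
x2 = 1.0
x1 = -(theta_true[1] * x2 + v[0]) / (theta_true[0] + v[1])
x = np.array([x1, x2])

# Planner-side data: observation y and regression vector xi.
y = -np.dot(v, dpsi(x))
xi = dphi(x)
assert np.isclose(y, np.dot(theta_true, xi))  # y = <theta, xi> holds
print(y, xi)
```

The key point is that y and xi are computable from quantities the planner knows (the issued incentive and the observed response), yet they are linear in the unknown parameter vector.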

### 3.2 Utility Learning Under Myopic–Play

As in the Nash–play case, we assume the planner knows the parametric structure of the myopic update rule. That is to say, the nominal update function is parameterized by θ_i over basis functions φ_i, and the incentive mapping at iteration k is parameterized by v_{i,k} over basis functions ψ_i. We assume the planner observes the initial response x_0, and we denote the history of responses from iteration k-m up to iteration k by x_{k:k-m} = (x_k, …, x_{k-m}). The general architecture for the myopic update rule is given by

x_{i,k+1} = ⟨θ_i, φ_i(x_{k:k-m})⟩   (9)

in the nominal game, and

x_{i,k+1} = ⟨θ_i, φ_i(x_{k:k-m})⟩ + ⟨v_{i,k}, ψ_i(x_{k:k-m})⟩   (10)

in the incentivized game.

Note that the update rule does not need to depend on the whole sequence of past responses. It could depend just on the most recent response x_k or on a subset of the history.

As before, we denote the set of admissible parameters for player i by Θ_i, which we assume to be compact. In contrast to the Nash–play case, the admissible set of parameters is no longer time-varying, so that Θ_{i,k} = Θ_i for all k.

Keeping consistent with the notation of the previous sections, we let y_{k+1}^i = x_{i,k+1} − ⟨v_{i,k}, ψ_i(x_{k:k-m})⟩ and ξ_k^i = φ_i(x_{k:k-m}), so that the myopic update rule can be re-written as y_{k+1}^i = ⟨θ_i, ξ_k^i⟩. Analogous to the previous case, the coordinator has observations {y_{k+1}^i} and regression vectors {ξ_k^i}. Again, we use the notation ξ_k = (ξ_k^1, …, ξ_k^n) for the regression vectors of all the agents at iteration k.

Note that the form of the myopic update rule is general enough to accommodate a number of game-theoretic learning algorithms including approximate fictitious play and gradient play [fudenberg:1998aa].

### 3.3 Unified Framework for Utility Learning

We can describe both the Nash–play and myopic–play cases in a unified framework as follows. At iteration k, the planner receives a response x_{k+1}, which either lies in the set of local Nash equilibria of the incentivized game or is the unique response determined by the incentivized myopic update rule at iteration k. The planner uses the past responses and incentive parameters to generate the set of observations {y_{k+1}^i} and regression vectors {ξ_k^i}.

The utility learning problem is formulated as an online optimization problem in which parameter updates are calculated following the gradient of a loss function. For each agent i, consider the loss function

ℓ(θ_k^i) = ½ ‖y_{k+1}^i − ⟨ξ_k^i, θ_k^i⟩‖_2^2,

which evaluates the error between the predicted observation and the true observation for each player. In order to minimize this loss, we introduce a well-known generalization of the projection operator.
Denote by ∂f(θ) the set of
*subgradients* of f at θ. A convex continuous function ω is a *distance
generating* function with modulus α with respect to a reference norm ‖·‖ if the set
Θ° = {θ ∈ Θ : ∂ω(θ) ≠ ∅} is convex and ω, restricted to Θ°, is continuously differentiable and
*strongly convex* with
parameter α, that is,

⟨∇ω(θ) − ∇ω(θ′), θ − θ′⟩ ≥ α‖θ − θ′‖² for all θ, θ′ ∈ Θ°.

The function

V(z, θ) = ω(θ) − ω(z) − ⟨∇ω(z), θ − z⟩

is the Bregman divergence [bregman:1967aa] associated with ω. By definition, V(z, ·) is non-negative and strongly convex with modulus α. Given a subset Θ and a point z ∈ Θ°, the mapping defined by

P_z(ζ) = argmin_{θ ∈ Θ} {⟨ζ, θ⟩ + V(z, θ)}   (11)

is the prox-mapping induced by V on Θ. This mapping is well-defined (the minimizer is unique by strong convexity of V(z, ·)) and is non-expansive [moreau:1965aa, Proposition 5.b].
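To make the prox-mapping concrete, here is a standard instance (not specific to this paper): with the negative-entropy distance generating function on the probability simplex, the Bregman divergence is the KL divergence and the prox-mapping has a closed-form multiplicative update. The vectors z and zeta below are arbitrary illustrative values:

```python
import numpy as np

# With omega(theta) = sum_j theta_j*log(theta_j) on the simplex, the
# prox-mapping argmin_theta {<zeta, theta> + KL(theta, z)} has the
# closed form  P_z(zeta)_j  proportional to  z_j * exp(-zeta_j).
def prox_entropy(z, zeta):
    w = z * np.exp(-zeta)
    return w / w.sum()

# Brute-force sanity check in 2D: minimize the prox objective on a grid.
z = np.array([0.3, 0.7])
zeta = np.array([0.5, -1.0])

def objective(t0):
    theta = np.array([t0, 1.0 - t0])
    kl = np.sum(theta * np.log(theta / z))
    return np.dot(zeta, theta) + kl

grid = np.linspace(1e-6, 1 - 1e-6, 20001)
best = grid[np.argmin([objective(t) for t in grid])]
assert abs(best - prox_entropy(z, zeta)[0]) < 1e-3
print(prox_entropy(z, zeta))
```

This multiplicative form is the familiar exponentiated-gradient update; the Euclidean choice of distance generating function, discussed next in the text, recovers ordinary projection instead.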

Given the loss function ℓ, a positive, non-increasing sequence of learning rates {η_k}, and a distance generating function ω, the parameter estimate of each agent is updated at iteration k as follows:

θ_{k+1}^i = P_{θ_k^i}(η_k ∇ℓ(θ_k^i)).   (12)

Note that if the distance generating function is ω(θ) = ½‖θ‖_2^2, then the associated Bregman divergence is the squared Euclidean distance, V(z, θ) = ½‖θ − z‖_2^2, and the corresponding prox-mapping is the Euclidean projection onto the set Θ_{i,k}, which we denote by Π_{Θ_{i,k}}, so that

θ_{k+1}^i = Π_{Θ_{i,k}}(θ_k^i − η_k ∇ℓ(θ_k^i)).   (13)
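A minimal sketch of the Euclidean special case, assuming projected-gradient updates from streaming noiseless linear observations; the data model, box constraint set, and learning-rate schedule are our own toy choices:

```python
import numpy as np

# Online projected-gradient updates of a parameter estimate from
# streaming observations y_k = <xi_k, theta_true> (illustrative setup).
rng = np.random.default_rng(0)
theta_true = np.array([1.5, -0.5])

def project_box(theta, lo=-2.0, hi=2.0):
    # Euclidean projection onto a compact (box) admissible set.
    return np.clip(theta, lo, hi)

theta_hat = np.zeros(2)
for k in range(1, 2001):
    xi = rng.normal(size=2)                  # regression vector at step k
    y = np.dot(xi, theta_true)               # noiseless observation
    grad = (np.dot(xi, theta_hat) - y) * xi  # gradient of the squared loss
    eta = 1.0 / np.sqrt(k)                   # non-increasing learning rate
    theta_hat = project_box(theta_hat - eta * grad)

print(theta_hat)   # approaches theta_true when the xi's are exciting
```

Because the random regression vectors keep exciting every direction of the parameter space, the estimate converges; this is the role persistence of excitation plays in the theoretical results.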

## 4 Incentive Design Formulation

In the previous section, we described the parameter update step that will be used in our utility learning and incentive design problem. We now describe how the incentive parameters for each iteration are selected. In particular, at iteration k, after updating the parameter estimates, the data the planner has includes the past observations, the past incentive parameters, and an estimate θ_k^i of each agent's parameter vector. The planner then uses the past data along with the parameter estimates to find incentive parameters v_k such that, under the estimated cost functions, each player's incentive mapping evaluates to the desired incentive value at the desired response x^d. That is to say, if the agents are rational and play Nash, then x^d is a local Nash equilibrium of the game defined by the incentivized costs parameterized by the current estimates. On the other hand, if the agents are myopic, then, for each agent, the estimated incentivized update rule should return x^d when evaluated at x^d.

In the following two subsections, for each of these cases, we describe how v_k is selected.

### 4.1 Incentive Design: Nash–Play

Given that γ_i is parameterized by v_i, the goal is to find, for each agent i, a v_i such that the desired response x^d is a local Nash equilibrium of the game defined by the estimated incentivized costs,

and such that γ_i(x^d) evaluates to the desired incentive value for each i.

We assume that, for every admissible parameter estimate, there exist v_i for each i such that x^d is the induced differential Nash equilibrium of the estimated incentivized game and each incentive mapping evaluates to the desired value at x^d.

We remark that the above assumption is not restrictive in the following sense. Finding v_k that induces the desired Nash equilibrium and results in the incentive mappings evaluating to the desired incentive value amounts to finding v_k such that the first- and second-order sufficient conditions for a local Nash equilibrium are satisfied given our estimates of the agents' cost functions. That is, for each i, we need to find v_i satisfying

⟨v_i, D_i ψ_i(x^d)⟩ = −⟨θ_k^i, D_i φ_i(x^d)⟩ and ⟨v_i, ψ_i(x^d)⟩ = γ_i^d,   (14)

where γ_i^d denotes the desired incentive value for agent i.
If the 2×q matrix obtained by stacking D_i ψ_i(x^d)^T and ψ_i(x^d)^T is full rank, i.e. has rank two, then there exists a
v_i that solves (14). If the number q of incentive
basis functions satisfies q ≥ 2, then the rank condition is not
unreasonable and, in fact, there are in general multiple solutions. In essence, by selecting
q to be *large enough*, the planner allows for enough degrees of freedom to ensure there exists a set of parameters that induces the desired result. Moreover, the problem of finding v_i reduces to a convex feasibility problem.

The convex feasibility problem defined by (14) can be formulated as a constrained least-squares optimization problem: for each i, minimize the squared residual of the linear equations in (14) over v_i, subject to the second-order constraints. By Assumption 4.1, for each i, there is a v_i such that this cost is exactly minimized, i.e. the residual is zero.

The choice of regularization in this problem affects how well-conditioned the second-order derivatives of the agents' costs with respect to their own choice variables are. In addition, we note that if there are a large number of incentive basis functions, it may be reasonable to incorporate a cost for sparsity, e.g., an ℓ_1 penalty on v_i; however, the optimal solution in this case is not guaranteed to satisfy (14).
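As an illustration (the basis functions, parameter estimate, and desired values below are our own toy choices), the design step for one agent with a scalar choice variable reduces to a small linear system, which least squares solves directly, returning the minimum-norm solution when the system is underdetermined:

```python
import numpy as np

# Toy incentive-design step: find v satisfying a linear first-order
# condition at the desired response x_d and pinning the incentive's
# value there to a desired level gamma_d.
x_d = np.array([1.0, -2.0])
theta_hat = np.array([2.0, 0.5])         # current estimate of the type
gamma_d = 0.0                            # desired incentive value at x_d

phi_grad = np.array([x_d[0], x_d[1]])    # D_1 phi at x_d (our own phi)
psi = np.array([x_d[0], x_d[0]**2, x_d[0]*x_d[1]])   # psi basis at x_d
psi_grad = np.array([1.0, 2*x_d[0], x_d[1]])         # D_1 psi at x_d

A = np.vstack([psi_grad, psi])           # 2 constraints, 3 unknowns
b = np.array([-np.dot(theta_hat, phi_grad), gamma_d])
v, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(A @ v, b)             # both design conditions hold
print(v)
```

With three incentive basis functions and two constraints the system has a one-parameter family of solutions, which is exactly the slack that extra degrees of freedom (or a sparsity penalty) can exploit.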

It is desirable for the induced local Nash equilibrium to be a
*stable, non-degenerate
differential Nash equilibrium* so that it is attracting in a neighborhood
under the gradient flow [ratliff:2015aa].
To enforce this, the planner must add
additional constraints to the feasibility problem defined by
(14). In particular,
second-order
conditions on the players' cost functions must be satisfied, i.e. the
derivative of
the differential game form must be positive definite [ratliff:2015aa, Theorem 2].
This reduces to
ensuring that the matrix of second derivatives of the estimated incentivized costs is positive definite at x^d.