# Learning Nash Equilibria in Monotone Games

We consider multi-agent decision making where each agent's cost function depends on all agents' strategies. We propose a distributed algorithm to learn a Nash equilibrium, whereby each agent uses only the observed values of her cost function at each joint played action, lacking any information on the functional form of her cost or on other agents' costs or strategy sets. In contrast to past work, where convergent algorithms required strong monotonicity, we prove convergence under a mere monotonicity assumption. This significantly widens the algorithm's applicability, for instance to games with linear coupling constraints.


## I Introduction

Game theory is a powerful framework for analyzing and optimizing multi-agent decision making problems. In several such problems, each agent (also referred to as a player) does not have full information on her objective function, due to the unknown interactions and other players’ strategies affecting her objective. Consider, for example, a transportation network in which an agent’s objective is minimizing travel time, or an electricity network in which an agent’s objective is minimizing her own electricity price. In these instances, the travel times and prices, respectively, depend non-trivially on the strategies of other agents. Motivated by this limited-information setup, we consider computing Nash equilibria given only so-called payoff-based information. That is, each player can only observe the values of her objective function at a joint played action; she does not know the functional form of her or others’ objectives, nor the strategy sets and actions of other players, and cannot communicate with other players. In this setting, we address the question of how agents should update their actions to converge to a Nash equilibrium strategy.

A large body of literature on learning Nash equilibria with payoff-based information has focused on the finite action setting or on potential games; see, for example, [11, 12, 7] and references therein. For games with continuous (uncountable) action spaces, a payoff-based approach was developed based on the extremum seeking idea in optimization [3, 13], and, assuming strongly convex objectives, almost sure convergence to the Nash equilibrium was proven. A payoff-based approach inspired by the logit dynamics in finite action games [2] was extended to the continuous action setting for the case of potential games [14]. The work in [16] considered learning Nash equilibria in continuous action games on networks. Crucially, that work additionally assumed that each player exchanges information with her neighbors, to facilitate online estimation of the gradient of her objective function.

Recently, we proposed a payoff-based approach to learn Nash equilibria in a class of convex games [15]. Our approach hinged upon connecting the Nash equilibria of a game to the solution set of a related variational inequality problem. Convergence of our algorithm was established for the cases in which the game mapping is strongly monotone or the game admits a potential function. Apart from the possibly limited scope of potential games, strong monotonicity can be too much to ask for. In particular, if the objective function of an agent is linear in her own action, or in the presence of coupling constraints on the action sets, the game mapping is not strongly monotone.

Our goal here is to extend existing payoff-based learning approaches to a broader class of games, characterized by monotone game mappings. While algorithms for solving monotone variational inequalities exist (see, for example, Chapter 12 in [9]), these algorithms either operate on two timescales (Tikhonov regularization approaches) or require an extra gradient step per iteration (extra-gradient methods). As such, they demand more coordination among players than is possible in a payoff-based information structure.
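To see why an extra gradient step matters, consider the following minimal sketch (a hypothetical bilinear example, not taken from the cited references): for the skew-symmetric, merely monotone mapping $M(x,y)=(y,-x)$ of the saddle game $\min_x\max_y xy$, plain gradient play spirals away from the equilibrium, while the extra-gradient method converges, at the cost of two evaluations of $M$ per round.

```python
import math

# Hypothetical bilinear saddle game min_x max_y x*y.  Its pseudo-gradient
# M(x, y) = (y, -x) is skew-symmetric: monotone but not strongly monotone.

def M(z):
    x, y = z
    return (y, -x)

def step(z, g, gamma):
    # one (projection-free) gradient step z - gamma * g
    return (z[0] - gamma * g[0], z[1] - gamma * g[1])

gamma = 0.5
z_plain = (1.0, 0.0)   # plain gradient play
z_eg = (1.0, 0.0)      # extra-gradient play
for _ in range(200):
    # plain play: a single evaluation of M per round
    z_plain = step(z_plain, M(z_plain), gamma)
    # extra-gradient: a trial step, then a step along the trial gradient
    # -- TWO evaluations of M per round
    z_trial = step(z_eg, M(z_eg), gamma)
    z_eg = step(z_eg, M(z_trial), gamma)

dist_plain = math.hypot(*z_plain)   # spirals away from the equilibrium (0, 0)
dist_eg = math.hypot(*z_eg)         # contracts toward (0, 0)
```

In a payoff-based information structure a player obtains only one cost observation per joint played action, so the trial evaluation at the intermediate point is unavailable.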

Our contributions are as follows. First, we propose a distributed payoff-based algorithm to learn Nash equilibria in a monotone game, extending our past work [15], applicable to strongly monotone games, and inspired by the single-timescale algorithm for solving stochastic variational inequalities in [6]. Second, despite the lack of gradient information in the payoff-based setting, contrary to the setup in [6], we show that the proposed procedure can be interpreted as a stochastic gradient descent with additional bias and regularization terms. Third, we prove convergence of the proposed algorithm to Nash equilibria by suitably bounding the bias and noise variance terms, using established results on boundedness and convergence of discrete-time Markov processes.

Notations. The set $\{1,\ldots,N\}$ is denoted by $[N]$. Boldface is used to distinguish between vectors in a multi-dimensional space and scalars. Given $N$ vectors $x^i\in\mathbb{R}^d$, $i\in[N]$, we write $x:=(x^1,\ldots,x^N)\in\mathbb{R}^{Nd}$ and $x^{-i}:=(x^1,\ldots,x^{i-1},x^{i+1},\ldots,x^N)$. $\mathbb{R}^n_+$ and $\mathbb{Z}_+$ denote, respectively, the vectors from $\mathbb{R}^n$ with non-negative coordinates and the non-negative whole numbers. The standard inner product on $\mathbb{R}^n$ is denoted by $(\cdot,\cdot):\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$, with associated norm $\|x\|:=\sqrt{(x,x)}$. Given a matrix $A\in\mathbb{R}^{n\times n}$, $A\succeq 0$ if and only if $(Ax,x)\geq 0$ for all $x$. We use the big-$O$ notation: $f(t)=O(g(t))$ as $t\to t_0$ if $|f(t)|\leq K|g(t)|$ for some positive constant $K$. We say that a function $f(x)$ grows not faster than a function $g(x)$ as $\|x\|\to\infty$ if there exists a positive constant $Q$ such that $f(x)\leq Qg(x)$ as $\|x\|\to\infty$.

###### Definition 1

A mapping $M:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is monotone over a set $X\subseteq\mathbb{R}^{n}$ if $(M(x)-M(y),\,x-y)\ge 0$ for every $x,y\in X$.
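As a hypothetical numeric illustration of Definition 1 (not an example from the paper), the linear mapping $M(x)=Ax$ with a positive semidefinite but singular matrix is monotone, yet not strongly monotone, since the monotonicity gap vanishes along a direction in the null space of $A$:

```python
import random

# A hypothetical linear mapping M(x) = A x with A positive semidefinite
# but singular: monotone per Definition 1, yet not strongly monotone,
# because the monotonicity gap vanishes along the direction (1, 1).

A = [[1.0, -1.0],
     [-1.0, 1.0]]

def M(x):
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def inner(u, v):
    return u[0] * v[0] + u[1] * v[1]

def gap(x, y):
    # the quantity (M(x) - M(y), x - y) from Definition 1
    Mx, My = M(x), M(y)
    return inner([Mx[0] - My[0], Mx[1] - My[1]], [x[0] - y[0], x[1] - y[1]])

random.seed(0)
pairs = [([random.uniform(-5, 5) for _ in range(2)],
          [random.uniform(-5, 5) for _ in range(2)]) for _ in range(1000)]
min_gap = min(gap(x, y) for x, y in pairs)   # monotone: never negative
flat_gap = gap([1.0, 1.0], [0.0, 0.0])       # exactly zero along (1, 1)
```

The zero gap along $(1,1)$ rules out any bound of the form $m\|x-y\|^2$ with $m>0$, which is exactly what strong monotonicity would require.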

## II Problem Formulation

Consider a game $\Gamma(N,\{A^i\},\{J_i\})$ with $N$ players, the sets of players’ actions $A^i\subseteq\mathbb{R}^{d}$, $i\in[N]$, and the cost (objective) functions $J_i:A\to\mathbb{R}$, where $A=A^1\times\cdots\times A^N$ denotes the set of joint actions. We restrict the class of games as follows.

###### Assumption 1

The game under consideration is convex. Namely, for all $i\in[N]$ the set $A^i$ is convex and closed, and the cost function $J_i(a^i,a^{-i})$ is defined on $\mathbb{R}^{Nd}$, continuously differentiable in $a$, and convex in $a^i$ for each fixed $a^{-i}$.

###### Assumption 2

The mapping $M:\mathbb{R}^{Nd}\to\mathbb{R}^{Nd}$, referred to as the game mapping, defined by

$$M(a)=\big(\nabla_{a^i}J_i(a^i,a^{-i})\big)_{i=1}^{N}=\big(M^1(a),\ldots,M^N(a)\big)^{\top},$$

where $M^i(a)=\big(M_{i,1}(a),\ldots,M_{i,d}(a)\big)^{\top}$ and

$$M_{i,k}(a)=\frac{\partial J_i(a)}{\partial a^i_k},\quad a\in A,\; i\in[N],\; k\in[d],\tag{1}$$

is monotone on $A$ (see Definition 1).

We consider a Nash equilibrium in the game $\Gamma$ as a stable outcome, because it represents a joint action from which no player has any incentive to unilaterally deviate.

###### Definition 2

A point $a^*=(a^{i*},a^{-i*})\in A$ is called a Nash equilibrium if for any $i\in[N]$ and any $a^i\in A^i$

$$J_i(a^{i*},a^{-i*})\le J_i(a^{i},a^{-i*}).$$

Our goal is to learn such a stable action in a game through designing a payoff-based algorithm. We first connect the existence of Nash equilibria in $\Gamma$ with the solution set of a corresponding variational inequality problem.

###### Definition 3

Consider a mapping $M:\mathbb{R}^{Nd}\to\mathbb{R}^{Nd}$ and a set $A\subseteq\mathbb{R}^{Nd}$. The solution set $SOL(A,M)$ of the variational inequality problem $VI(A,M)$ is the set of vectors $a^*\in A$ such that $(M(a^*),\,a-a^*)\ge 0$ for all $a\in A$.

###### Theorem 1

(Proposition 1.4.2 in [9]) Given a game $\Gamma$ with game mapping $M$, suppose that the action sets $A^i$ are closed and convex, and that the cost functions $J_i$ are continuously differentiable in $a$ and convex in $a^i$ for every fixed $a^{-i}$ on the interior of $A$. Then, a vector $a^*\in A$ is a Nash equilibrium in $\Gamma$ if and only if $a^*\in SOL(A,M)$.

It follows that, under Assumptions 1 and 2, for a game $\Gamma$ with mapping $M$, any solution of $VI(A,M)$ is also a Nash equilibrium of the game, and vice versa. While under Assumptions 1 and 2 the game might admit a Nash equilibrium, these two assumptions alone do not guarantee its existence. To guarantee existence, one needs a more restrictive assumption, for example, strong monotonicity of the game mapping or compactness of the action sets [9]. Here, we do not restrict our attention to such cases. However, to have a meaningful discussion, we do assume existence of at least one Nash equilibrium in the game.
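The equivalence in Theorem 1 can be spot-checked on a small hypothetical convex game (the cost functions below are illustrative choices, not from the paper): with $A^i=[-1,1]$, $J_1(a)=(a^1-2)^2+a^1a^2$ and $J_2(a)=(a^2)^2+a^1a^2$, the equilibrium $a^*=(1,-0.5)$ sits on the boundary of $A^1$, and one can verify both the no-deviation property and the variational inequality on a grid:

```python
# A hypothetical two-player convex game on A^i = [-1, 1] illustrating
# Theorem 1.  With J_1(a) = (a^1 - 2)^2 + a^1 a^2 and
# J_2(a) = (a^2)^2 + a^1 a^2, the game mapping is
# M(a) = (2(a^1 - 2) + a^2, 2 a^2 + a^1), and the Nash equilibrium
# a* = (1, -0.5) lies on the boundary of A^1.

def J1(a1, a2):
    return (a1 - 2.0) ** 2 + a1 * a2

def J2(a1, a2):
    return a2 ** 2 + a1 * a2

def M(a1, a2):
    return (2.0 * (a1 - 2.0) + a2, 2.0 * a2 + a1)

a_star = (1.0, -0.5)
grid = [-1.0 + 0.05 * k for k in range(41)]  # grid over A^i = [-1, 1]

# (i) no profitable unilateral deviation at a*
ne_ok = all(J1(*a_star) <= J1(a1, a_star[1]) + 1e-12 for a1 in grid) and \
        all(J2(*a_star) <= J2(a_star[0], a2) + 1e-12 for a2 in grid)

# (ii) a* solves VI(A, M): (M(a*), a - a*) >= 0 for all a in A
g = M(*a_star)
vi_ok = all(g[0] * (a1 - a_star[0]) + g[1] * (a2 - a_star[1]) >= -1e-12
            for a1 in grid for a2 in grid)
```

Note that at this boundary equilibrium $M(a^*)\neq 0$; the VI condition holds because $M(a^*)$ points outward from the feasible set.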

###### Assumption 3

The set $SOL(A,M)$ is not empty.

###### Corollary 1

Let $\Gamma$ be a game with game mapping $M$ for which Assumptions 1, 2, and 3 hold. Then, there exists at least one Nash equilibrium in $\Gamma$. Moreover, any Nash equilibrium in $\Gamma$ belongs to the set $SOL(A,M)$.

The following additional assumptions are needed for convergence of the proposed payoff-based algorithm to a Nash equilibrium (see proofs of Lemma 3 and Theorem 2).

###### Assumption 4

Each element $M_{i,k}$ of the game mapping $M$, defined in Assumption 2, is Lipschitz continuous on $\mathbb{R}^{Nd}$ with a Lipschitz constant $L_i$.

###### Assumption 5

Each cost function $J_i$, $i\in[N]$, grows not faster than a linear function of $\|a\|$ as $\|a\|\to\infty$.

## III Payoff-Based Algorithm

Given payoff-based information, at iteration $t$ each agent $i$ has access to her current action, referred to as her state and denoted by $x^i(t)$, and to the value $\hat J_i(t)=J_i(x^1(t),\ldots,x^N(t))$ of her cost at the joint states. Using this information, in the proposed algorithm each agent “mixes” her next state $x^i(t+1)$. Namely, she chooses $x^i(t+1)$ randomly according to the multidimensional normal distribution $\mathcal N(\mu^i(t+1),\sigma(t+1))$ with the density

$$p^i\big(x^i_1,\ldots,x^i_d;\,\mu^i(t+1),\sigma(t+1)\big)=\frac{1}{\big(\sqrt{2\pi}\,\sigma(t+1)\big)^{d}}\exp\left\{-\sum_{k=1}^{d}\frac{\big(x^i_k-\mu^i_k(t+1)\big)^2}{2\sigma^2(t+1)}\right\}.$$

The initial value of the means $\mu^i(0)$, $i\in[N]$, can be set to any finite value. The successive means are updated as follows:

$$\mu^i(t+1)=\mathrm{Proj}_{A^i}\left[\mu^i(t)-\gamma(t)\sigma^2(t)\left(\hat J_i(t)\frac{x^i(t)-\mu^i(t)}{\sigma^2(t)}+\epsilon(t)\mu^i(t)\right)\right].\tag{2}$$

In the above, $\mathrm{Proj}_{A^i}$ denotes the projection operator onto the set $A^i$, $\gamma(t)$ is a step-size parameter, and $\epsilon(t)$ is a regularization parameter. We highlight the difference between the proposed approach and that of [15]: the additional regularization term $\epsilon(t)\mu^i(t)$ in (2). In the absence of this term, the algorithm would not be convergent under a mere monotonicity assumption on the game mapping (see the counterexample provided in [4]).
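The update (2) is simple to implement. Below is a minimal simulation sketch on a two-player quadratic game with $A^i=[-1,1]$; the game, the initial means, and the decay schedules for $\gamma(t)$, $\sigma(t)$, $\epsilon(t)$ are illustrative assumptions only and are not claimed to satisfy Assumption 6 exactly:

```python
import math
import random

# A minimal simulation sketch of update (2) on a simple two-player
# quadratic game with strategy sets A^i = [-1, 1].  The game, the
# initial means, and the decay schedules below are illustrative
# assumptions, not the exact conditions of Assumption 6.

random.seed(1)

def J1(a1, a2):
    return a1 ** 2 + a1 * a2      # game mapping M(a) = (2a1 + a2, 2a2 + a1)

def J2(a1, a2):
    return a2 ** 2 + a1 * a2      # unique Nash equilibrium: (0, 0)

def proj(v):
    return max(-1.0, min(1.0, v))  # projection onto A^i = [-1, 1]

mu = [0.9, -0.8]                   # arbitrary finite initial means
for t in range(1, 20001):
    gamma = t ** -0.9              # step size (assumed schedule)
    sigma = t ** -0.2              # exploration standard deviation
    eps = t ** -0.1                # Tikhonov regularization parameter
    # each player "mixes": she samples her state from N(mu^i, sigma)
    x = [random.gauss(mu[0], sigma), random.gauss(mu[1], sigma)]
    costs = [J1(x[0], x[1]), J2(x[0], x[1])]   # payoff-based feedback only
    for i in (0, 1):
        grad_sample = costs[i] * (x[i] - mu[i]) / sigma ** 2
        mu[i] = proj(mu[i] - gamma * sigma ** 2 * (grad_sample + eps * mu[i]))

final_dist = math.hypot(mu[0], mu[1])   # distance of the means from the NE
```

Each player uses only her realized cost $\hat J_i(t)$, her own state, and her own mean, matching the payoff-based information structure; the projection keeps the means in the strategy sets, and the iterates drift toward the equilibrium.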

Let us provide insight into the algorithm by deriving an analogy to a regularized stochastic gradient algorithm. Given $\sigma>0$, for any $\mu=(\mu^1,\ldots,\mu^N)$ define $\tilde J_i$ as

$$\tilde J_i(\mu^1,\ldots,\mu^N,\sigma)=\int_{\mathbb{R}^{Nd}}J_i(x)\,p(\mu,x,\sigma)\,dx,\tag{4}$$

where $p(\mu,x,\sigma)=\prod_{i=1}^{N}p^i(x^i;\mu^i,\sigma)$. Above, $\tilde J_i$, $i\in[N]$, can be interpreted as the $i$th player’s cost function in mixed strategies. We can now show that the second term inside the projection in (2) is a sample of the gradient of this cost function with respect to the mixed strategies. Let $x(t)=(x^1(t),\ldots,x^N(t))$.

###### Lemma 1

Under Assumptions 1 and 5,

$$\frac{\partial\tilde J_i(\mu(t),\sigma(t))}{\partial\mu^i_k}=\mathbb{E}_{x(t)}\left\{\hat J_i(t)\frac{x^i_k(t)-\mu^i_k(t)}{\sigma^2(t)}\right\}=\mathbb{E}\left\{J_i(x^1(t),\ldots,x^N(t))\frac{x^i_k(t)-\mu^i_k(t)}{\sigma^2(t)}\,\middle|\,x^i_k(t)\sim\mathcal N(\mu^i_k(t),\sigma(t)),\,i\in[N],\,k\in[d]\right\}.\tag{5}$$
###### Proof:

We verify that the differentiation under the integral sign in (4) is justified; (5) then readily follows by taking the derivative inside the integral. A sufficient condition for differentiation under the integral sign is that the integral of the formally differentiated function converges uniformly with respect to $\mu$, whereas the differentiated function is continuous (see [17], Chapter 17). By formally differentiating the function under the integral sign in (4) with respect to $\mu^i_k$, and omitting the time arguments, we obtain

$$\frac{1}{\sigma^2}\int_{\mathbb{R}^{Nd}}J_i(x)\,(x^i_k-\mu^i_k)\,p(\mu,x,\sigma)\,dx.\tag{6}$$

Given Assumption 1, this function is continuous. Thus, it remains to check that its integral converges uniformly with respect to $\mu$. To this end, we write the first-order Taylor expansion of $J_i$, with respect to the coordinate $x^i_k$, around the point $\mu^{(i,k)}$ obtained from $x$ by replacing $x^i_k$ with $\mu^i_k$, inside the integral (6):

$$\begin{aligned}\int_{\mathbb{R}^{Nd}}J_i(x)(x^i_k-\mu^i_k)p(\mu,x,\sigma)\,dx&=\int_{\mathbb{R}^{Nd}}\Big[J_i(\mu^{(i,k)})+\frac{\partial J_i(\eta(x,\mu))}{\partial x^i_k}(x^i_k-\mu^i_k)\Big](x^i_k-\mu^i_k)p(\mu,x,\sigma)\,dx\\&=\int_{\mathbb{R}^{Nd}}\frac{\partial J_i(\eta(x,\mu))}{\partial x^i_k}(x^i_k-\mu^i_k)^2\,p(\mu,x,\sigma)\,dx\\&=\int_{\mathbb{R}^{Nd}}\frac{\partial J_i(\eta_1(y,\mu))}{\partial x^i_k}(y^i_k)^2\,p(0,y,\sigma)\,dy,\end{aligned}$$

where $y=x-\mu$, $\eta(x,\mu)$ lies on the segment between $x$ and $\mu^{(i,k)}$, and $\eta_1(y,\mu)=\eta(y+\mu,\mu)$. The uniform convergence of the integral above follows from the fact (see the basic sufficient condition using a majorant, [17], Chapter 17.2.3) that, under Assumption 5, $\big|\frac{\partial J_i(\eta_1(y,\mu))}{\partial x^i_k}\big|\le l$ for some positive constant $l$, for all $y$ and $\mu$. Hence,

$$\Big|\frac{\partial J_i(\eta_1(y,\mu))}{\partial x^i_k}(y^i_k)^2\,p(0,y,\sigma)\Big|\le h(y)=l\,(y^i_k)^2\,p(0,y,\sigma),$$

where the majorant $h(y)$ is integrable over $\mathbb{R}^{Nd}$.

Lemma 1 shows that the second term inside the projection in (2) is a sample of the gradient of the cost function in mixed strategies. Hence, algorithm (2) can be interpreted as a regularized stochastic projection algorithm. To bound the bias and variance terms of the stochastic projection, and consequently establish convergence of the iterates $\mu(t)$, the parameters $\gamma(t)$, $\sigma(t)$, $\epsilon(t)$ need to satisfy certain assumptions.
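The identity in Lemma 1 is easy to verify numerically in one dimension (a sanity check under assumed data, not part of the paper's argument): for $J(x)=x^2$ the smoothed cost is $\tilde J(\mu,\sigma)=\mu^2+\sigma^2$, whose $\mu$-derivative is $2\mu$, and the payoff-based samples $J(x)(x-\mu)/\sigma^2$ with $x\sim\mathcal N(\mu,\sigma)$ indeed average to that value:

```python
import random

# One-dimensional sanity check of Lemma 1 for the assumed cost J(x) = x^2:
# the smoothed cost is tilde{J}(mu, sigma) = mu^2 + sigma^2, so its
# mu-derivative is 2*mu; the payoff-based samples J(x)(x - mu)/sigma^2
# with x ~ N(mu, sigma) should average to the same value.

random.seed(42)
mu, sigma = 1.0, 0.5
n = 200_000
acc = 0.0
for _ in range(n):
    x = random.gauss(mu, sigma)
    acc += (x * x) * (x - mu) / sigma ** 2
estimate = acc / n        # Monte-Carlo estimate of d tilde{J} / d mu
```

The single-sample version of this estimator is exactly what enters the projection in (2); its variance scales like $1/\sigma^2$, which is why the analysis below must bound the noise carefully as $\sigma(t)\to 0$.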

###### Assumption 6

Let $\beta(t)=\gamma(t)\sigma^2(t)$, and choose $\gamma(t)$, $\sigma(t)$, and $\epsilon(t)$ such that

a)

b)

c) ,

d) ,

###### Theorem 2

Let the players in game $\Gamma$ choose their states $x^i(t)$ at time $t$ according to the normal distribution $\mathcal N(\mu^i(t),\sigma(t))$, where the initial mean $\mu^i(0)$ is arbitrary and $\mu^i(t)$ is updated as in (2). Under Assumptions 1-6, as $t\to\infty$, the mean vector $\mu(t)$ converges almost surely to a Nash equilibrium of the game, and the joint state $x(t)$ converges in probability to this Nash equilibrium.

###### Remark 1

As an example for existence of parameters to satisfy Assumption 6, let , , .

## IV Analysis of the Algorithm

To prove Theorem 2, we first prove boundedness of the iterates $\mu(t)$. Due to the regularization term $\epsilon(t)\mu^i(t)$, this is done by analyzing the distance of $\mu(t)$ from the so-called Tikhonov trajectory. Having established this boundedness, we can readily show that the limit of the iterates exists and satisfies the conditions of a Nash equilibrium of the game $\Gamma$. For the boundedness and convergence proofs, we use established results on boundedness (Theorem 2.5.2 in [8]) and on convergence of a sequence of stochastic processes (Lemma 10, page 49, in [10]), respectively. For ease of reference, we provide the statements of both results in the appendix.

### IV-A Boundedness of the Algorithm Iterates

We first show that algorithm (2) falls under the framework of the well-studied Robbins-Monro stochastic approximation procedures [1], with an additional regularization term. Next, leveraging this analogy and results on stability of discrete-time Markov processes (Theorem 2.5.2 in [8]) applied to the sequence $\mu(t)$, we prove boundedness of the iterates.

Using the notation $\beta(t)=\gamma(t)\sigma^2(t)$, we can rewrite the algorithm step in (2) in the following form:

$$\mu^i(t+1)=\mathrm{Proj}_{A^i}\Big[\mu^i(t)-\gamma(t)\sigma^2(t)\big(M^i(\mu(t))+Q^i(\mu(t),\sigma(t))+R^i(x(t),\mu(t),\sigma(t))+\epsilon(t)\mu^i(t)\big)\Big],\tag{7}$$

for all $i\in[N]$, where

$$Q^i(\mu(t),\sigma(t))=\tilde M^i(\mu(t),\sigma(t))-M^i(\mu(t)),$$
$$R^i(x(t),\mu(t),\sigma(t))=F^i(x(t),\mu(t),\sigma(t))-\tilde M^i(\mu(t),\sigma(t)),$$
$$F^i(x(t),\mu(t),\sigma(t))=\hat J_i(t)\frac{x^i(t)-\mu^i(t)}{\sigma^2(t)},$$

and $\tilde M^i$ is the $d$-dimensional mapping with the following elements:

$$\tilde M_{i,k}(\mu(t),\sigma(t))=\frac{\partial\tilde J_i(\mu(t),\sigma(t))}{\partial\mu^i_k},\quad\text{for }k\in[d].\tag{10}$$

The vector $M(\mu(t))$ corresponds to the gradient term in stochastic approximation procedures, whereas

$$Q(\mu(t),\sigma(t))=\big(Q^1(\mu(t),\sigma(t)),\ldots,Q^N(\mu(t),\sigma(t))\big)$$

is a disturbance of the gradient term. Finally,

$$R(x(t),\mu(t),\sigma(t))=\big(R^1(x(t),\mu(t),\sigma(t)),\ldots,R^N(x(t),\mu(t),\sigma(t))\big)$$

is a martingale difference; namely, according to (5),

$$R^i(x(t),\mu(t),\sigma(t))=F^i(x(t),\mu(t),\sigma(t))-\mathbb{E}_{x(t)}\{F^i(x(t),\mu(t),\sigma(t))\},\quad i\in[N].\tag{11}$$

To ensure boundedness of $\mu(t)$ (Lemma 3), we bound the martingale term above (see Inequality (48)). To bound the disturbance $Q$ of the gradients (see Equation (42)), we observe that the mapping $\tilde M^i$ evaluated at $\mu(t)$ is equivalent to the game mapping in mixed strategies (see the Appendix for the proof of this observation). That is,

$$\tilde M^i(\mu(t))=\int_{\mathbb{R}^{Nd}}M^i(x)\,p(\mu(t),x)\,dx.\tag{12}$$

In contrast to standard stochastic approximation algorithms and the proof in [15], we have an additional regularization term, which enables us to address merely monotone game mappings. As such, to bound the iterates we also relate the variations of the sequence $\mu(t)$ to those of the Tikhonov sequence defined below. Let $y(t)$ denote the solution of the regularized variational inequality, namely

$$y(t)\in SOL\big(A,\,M(y)+\epsilon(t)y\big).\tag{13}$$

The sequence $y(t)$ is known as the Tikhonov sequence and enjoys the following two important properties.

###### Theorem 3

(Theorem 12.2.3 in [9]) Under Assumptions 2, 3, and 4, $y(t)$ defined in (13) exists and is unique for each $\epsilon(t)>0$. Moreover, as $\epsilon(t)\to 0$, $y(t)$ is uniformly bounded and converges to the least norm solution of $VI(A,M)$.
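The least-norm selection in Theorem 3 can be seen concretely in a hypothetical unconstrained linear example (not from the paper): for $M(y)=Ay-b$ with a monotone but singular $A$, the solution set of $VI(\mathbb{R}^2,M)$ is a whole line, and the Tikhonov solution, solving $(A+\epsilon I)y=b$, converges to the point of that line closest to the origin as $\epsilon\to 0$:

```python
# A hypothetical unconstrained linear VI illustrating Theorem 3:
# M(y) = A y - b with the monotone but singular A = [[1, -1], [-1, 1]]
# and b = (1, -1).  Solutions of VI(R^2, M) form the line y1 - y2 = 1;
# the least-norm solution is (0.5, -0.5).

def tikhonov(eps):
    # the Tikhonov solution solves (A + eps I) y = b; for a 2x2 system
    # Cramer's rule suffices
    a11, a12 = 1.0 + eps, -1.0
    a21, a22 = -1.0, 1.0 + eps
    b1, b2 = 1.0, -1.0
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det)

# the regularized solutions approach the least-norm point as eps -> 0
path = [tikhonov(eps) for eps in (1.0, 0.1, 0.01, 1e-6)]
y_limit = path[-1]
```

In this example $y(\epsilon)=\big(\tfrac{1}{2+\epsilon},-\tfrac{1}{2+\epsilon}\big)$, so the whole Tikhonov path stays bounded, consistent with the uniform boundedness claimed in the theorem.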

###### Lemma 2

(Lemma 3 in [6]) Under Assumption 2,

$$\|y(t)-y(t-1)\|\le M_y\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)},\quad\forall t\ge 1,$$

where $M_y$ is a uniform bound on the norm of the Tikhonov sequence, i.e., $\|y(t)\|\le M_y$ for all $t$.

With the results above in place, we connect the squared distance $\|\mu-y(t)\|^2$ to the squared distance $\|\mu-y(t-1)\|^2$ for any $\mu$ and $t$. Due to the triangle inequality,

$$\|\mu-y(t)\|\le\|\mu-y(t-1)\|+\|y(t-1)-y(t)\|\le\|\mu-y(t-1)\|+M_y\frac{|\epsilon(t-1)-\epsilon(t)|}{\epsilon(t)},\tag{14}$$

where in the last inequality we used Lemma 2. Hence, by taking into account that for any $a,b\in\mathbb{R}$ and $\theta>0$

$$2ab\le\theta a^2+\frac{b^2}{\theta},$$

we conclude from (14) that for any $\theta>0$

$$\|\mu-y(t)\|^2\le(1+\theta)\|\mu-y(t-1)\|^2+\Big(1+\frac{1}{\theta}\Big)M_y^2\frac{|\epsilon(t-1)-\epsilon(t)|^2}{\epsilon^2(t)}.\tag{15}$$

The bound above serves as the main new inequality for showing almost sure boundedness of $\mu(t)$, in comparison to non-regularized stochastic gradient procedures.

###### Lemma 3

Let Assumptions 2-6 hold in $\Gamma$, and let $\mu(t)$ be the vector updated in the run of the payoff-based algorithm (7). Then, the sequence $\{\mu(t)\}$ is bounded almost surely.

In the following, for simplicity of notation, we omit the argument $\sigma(t)$ in the terms $Q^i$, $R^i$, and $F^i$. In certain derivations, for the same reason, we omit the time parameter $t$ as well.

###### Proof:

Define $V(t,\mu)=\|\mu-y(t-1)\|^2$, where $y(t)$ is the Tikhonov sequence defined by (13). We consider the generating operator of the Markov process $\mu(t)$:

$$LV(t,\mu)=\mathbb{E}[V(t+1,\mu(t+1))\mid\mu(t)=\mu]-V(t,\mu),$$

and aim to show that $LV$ satisfies the following decay:

$$LV(t,\mu)\le-\alpha(t+1)\psi(\mu)+\phi(t)\big(1+V(t,\mu)\big),\tag{17}$$

where $\psi(\mu)\ge 0$ on $A$, $\alpha(t)\ge 0$, and $\phi(t)\ge 0$ with $\sum_t\phi(t)<\infty$. This enables us to apply Theorem 2.5.2 in [8] to directly conclude almost sure boundedness of $\mu(t)$.

Let us bound the growth of $V(t+1,\mu)$ in terms of $V(t,\mu)$. Let $\theta=\beta(t)\epsilon(t)$ in (15). From Assumption 6 b), $\big(1+\frac{1}{\beta(t)\epsilon(t)}\big)\frac{|\epsilon(t-1)-\epsilon(t)|^2}{\epsilon^2(t)}=O(1)$ as $t\to\infty$. Hence,

$$V(t+1,\mu)=\|\mu-y(t)\|^2\le\big(1+\beta(t)\epsilon(t)\big)\|\mu-y(t-1)\|^2+\Big(1+\frac{1}{\beta(t)\epsilon(t)}\Big)M_y^2\frac{|\epsilon(t-1)-\epsilon(t)|^2}{\epsilon^2(t)}=O\big(1+\|\mu-y(t-1)\|^2\big)=O\big(1+V(t,\mu)\big).\tag{18}$$

From the procedure for the update of $\mu^i(t)$, the non-expansion property of the projection operator, and the fact that $y^i(t)$ belongs to $A^i$, namely, that

$$y^i(t)=\mathrm{Proj}_{A^i}\big[y^i(t)-\beta(t)\big(M^i(y(t))+\epsilon(t)y^i(t)\big)\big],$$

we obtain that for any $i\in[N]$

$$\begin{aligned}\|\mu^i(t+1)-y^i(t)\|^2&\le\big\|\mu^i(t)-y^i(t)-\beta(t)\big[\epsilon(t)\big(\mu^i(t)-y^i(t)\big)\\&\qquad+\big(M^i(\mu(t))-M^i(y(t))+Q^i(\mu(t))+R^i(x(t),\mu(t))\big)\big]\big\|^2\\&=\|\mu^i(t)-y^i(t)\|^2\\&\quad-2\beta(t)\big(M^i(\mu(t))-M^i(y(t)),\,\mu^i(t)-y^i(t)\big)\\&\quad-2\beta(t)\epsilon(t)\big(\mu^i(t)-y^i(t),\,\mu^i(t)-y^i(t)\big)\\&\quad-2\beta(t)\big(Q^i(\mu(t))+R^i(x(t),\mu(t)),\,\mu^i(t)-y^i(t)\big)\\&\quad+\beta^2(t)\|G^i(x(t),\mu(t))\|^2,\end{aligned}\tag{20}$$

where, for ease of notation, we have defined

$$G^i(x(t),\mu(t))=\epsilon(t)\big(\mu^i(t)-y^i(t)\big)+M^i(\mu(t))-M^i(y(t))+Q^i(\mu(t))+R^i(x(t),\mu(t)).\tag{28}$$

Our goal is to bound $\|G^i(x(t),\mu(t))\|^2$ from above, and to use this bound in constructing Inequality (17). As such, we expand $\|G^i(x(t),\mu(t))\|^2$ as below and bound the terms in the expansion:

$$\begin{aligned}\|G^i(x(t),\mu(t))\|^2&=\epsilon^2(t)\|\mu^i(t)-y^i(t)\|^2+\|M^i(\mu(t))-M^i(y(t))\|^2\\&\quad+\|Q^i(\mu(t))\|^2+\|R^i(x(t),\mu(t))\|^2\\&\quad+2\big(Q^i(\mu(t)),\,R^i(x(t),\mu(t))\big)\\&\quad+2\epsilon(t)\big(M^i(\mu(t))-M^i(y(t)),\,\mu^i(t)-y^i(t)\big)\\&\quad+2\big(\epsilon(t)(\mu^i(t)-y^i(t))+M^i(\mu(t))-M^i(y(t)),\,Q^i(\mu(t))+R^i(x(t),\mu(t))\big).\end{aligned}\tag{31}$$

Due to Assumption 4, we conclude that

$$\|M^i(\mu)-M^i(y(t))\|^2\le L_i^2\|\mu-y(t)\|^2=O\big(V(t+1,\mu)\big)\le O\big(1+V(t,\mu)\big),\tag{38}$$

$$\big(M^i(\mu)-M^i(y(t)),\,\mu^i-y^i(t)\big)\le\|M^i(\mu)-M^i(y(t))\|\,\|\mu^i-y^i(t)\|\le O\big(1+V(t+1,\mu)\big)\le O\big(1+V(t,\mu)\big),\tag{39}$$

where in the last inequalities in (38) and (39) we used (18). Let us analyze the terms containing the disturbance of the gradient, namely $Q^i(\mu)$, in Equation (31). Since $\int_{\mathbb{R}^{Nd}}p(\mu,x)\,dx=1$, due to Assumption 4 and Equation (12), we obtain

$$\begin{aligned}\|Q^i(\mu)\|&=\Big\|\int_{\mathbb{R}^{Nd}}\big[M^i(x)-M^i(\mu)\big]p(\mu,x)\,dx\Big\|\\&\le\int_{\mathbb{R}^{Nd}}\|M^i(x)-M^i(\mu)\|\,p(\mu,x)\,dx\\&\le\int_{\mathbb{R}^{Nd}}L_i\|x-\mu\|\,p(\mu,x)\,dx\\&\le\int_{\mathbb{R}^{Nd}}L_i\Big(\sum_{i=1}^{N}\sum_{k=1}^{d}|x^i_k-\mu^i_k|\Big)p(\mu,x)\,dx=O(\sigma),\end{aligned}\tag{42}$$

where the last equality is due to the fact that the first central absolute moment of a random variable with the normal distribution $\mathcal N(\mu^i_k,\sigma)$ is $\sigma\sqrt{2/\pi}$. The estimate above and (18) imply, in particular, that for any $i\in[N]$

$$\|Q^i(\mu)\|\,\|\mu^i-y^i(t)\|\le O(\sigma)\big(1+V(t,\mu)\big),\tag{46}$$

$$\|Q^i(\mu)\|\,\|M^i(\mu)-M^i(y(t))\|\le L_i\|Q^i(\mu)\|\,\|\mu-y(t)\|\le O(\sigma)\big(1+V(t,\mu)\big).\tag{47}$$

Finally, we bound the second moment of the martingale term $R^i(x(t),\mu(t))$:

$$\mathbb{E}\{\|R^i(x(t),\mu(t))\|^2\,|\,\mu(t)=\mu\}\le\sum_{k=1}^{d}\int_{\mathbb{R}^{Nd}}J_i^2(x)\frac{(x^i_k-\mu^i_k(t))^2}{\sigma^4(t)}p(\mu,x)\,dx\le\frac{f_i(\mu,\sigma(t))}{\sigma^4(t)}\le\frac{O\big(1+V(t,\mu)\big)}{\sigma^4(t)},\tag{48}$$

where the first inequality is due to the fact that $\mathbb{E}\|R^i\|^2\le\mathbb{E}\|F^i\|^2$, taking into account (11), and the second inequality is due to Assumption 5, with $f_i(\mu,\sigma)$ being a quadratic function of $\mu$ and $\sigma$. Bringing the inequalities (38)-(48) into the inequality (20), taking into account (18), the Cauchy-Schwarz inequality, and the martingale property (11) of $R^i$, $i\in[N]$, we get

$$\begin{aligned}\mathbb{E}\{\|\mu^i(t+1)-y^i(t)\|^2\,|\,\mu(t)=\mu\}&\le\big(1-2\beta(t)\epsilon(t)\big)\|\mu^i-y^i(t)\|^2\\&\quad-2\beta(t)\big(M^i(\mu)-M^i(y(t)),\,\mu^i-y^i(t)\big)\\&\quad-2\beta(t)\big(Q^i(\mu),\,\mu^i-y^i(t)\big)\\&\quad+\beta^2(t)\mathbb{E}\{\|G^i(x(t),\mu(t))\|^2\,|\,\mu(t)=\mu\}.\end{aligned}\tag{49}$$