Game theory is a powerful framework for analyzing and optimizing multi-agent decision-making problems. In several such problems, each agent (also referred to as a player) does not have full information on her objective function, because it depends on unknown interactions and on the strategies of the other players. Consider, for example, a transportation network in which an agent’s objective is to minimize her travel time, or an electricity network in which an agent’s objective is to minimize her own electricity price. In these instances, the travel times and prices, respectively, depend non-trivially on the strategies of other agents. Motivated by this limited-information setup, we consider computing Nash equilibria given only so-called payoff-based information. That is, each player can only observe the values of her objective function at a jointly played action, does not know the functional form of her own or others’ objectives, nor the strategy sets and actions of other players, and cannot communicate with other players. In this setting, we address the question of how agents should update their actions to converge to a Nash equilibrium strategy.
A large body of literature on learning Nash equilibria with payoff-based information has focused on finite action settings or potential games; see, for example, [11, 12, 7] and references therein. For games with continuous (uncountable) action spaces, a payoff-based approach was developed based on the extremum seeking idea in optimization [3, 13], and, assuming strongly convex objectives, almost sure convergence to the Nash equilibrium was proven. A payoff-based approach inspired by the logit dynamics in finite action games was extended to the continuous action setting for the case of potential games. A further line of work considered learning Nash equilibria in continuous action games on networks. Crucially, that work additionally assumed that each player exchanges information with her neighbors, to facilitate online estimation of the gradient of her objective function.
Recently, we proposed a payoff-based approach to learn Nash equilibria in a class of convex games. Our approach hinged upon connecting the Nash equilibria of a game to the solution set of a related variational inequality problem. Convergence of our algorithm was established for the cases in which the game mapping is strongly monotone or the game admits a potential function. Apart from the possibly limited scope of potential games, strong monotonicity can be too much to ask for. In particular, if the objective function of an agent is linear in her action, or if the action sets are subject to coupling constraints, the game mapping will not be strongly monotone.
Our goal here is to extend the existing payoff-based learning approaches to a broader class of games characterized by monotone game mappings. While algorithms for solving monotone variational inequalities exist (see, for example, Chapter 12 in ), these algorithms either operate on two timescales (Tikhonov regularization approach) or require an extra gradient step (extra-gradient methods). As such, they require more coordination between players than is possible under a payoff-based information structure.
Our contributions are as follows. First, we propose a distributed payoff-based algorithm to learn Nash equilibria in a monotone game. The algorithm extends our past work, which is applicable to strongly monotone games, and is inspired by the single timescale algorithm for solving stochastic variational inequalities. Second, although gradients are not available under payoff-based information, unlike in gradient-based setups, we show that our proposed procedure can be interpreted as stochastic gradient descent with additional bias and regularization terms. Third, we prove convergence of the proposed algorithm to Nash equilibria by suitably bounding the bias and noise variance terms using established results on boundedness and convergence of discrete-time Markov processes.
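Notations. The set $\{1,\ldots,N\}$ is denoted by $[N]$. Boldface is used to distinguish between vectors in a multi-dimensional space and scalars. Given $N$ vectors $\boldsymbol{x}^i\in\mathbb{R}^d$, $i\in[N]$, $\boldsymbol{x}:=(\boldsymbol{x}^1,\ldots,\boldsymbol{x}^N)$; $\boldsymbol{x}^{-i}:=(\boldsymbol{x}^1,\ldots,\boldsymbol{x}^{i-1},\boldsymbol{x}^{i+1},\ldots,\boldsymbol{x}^N)$. $\mathbb{R}^d_+$ and $\mathbb{Z}_+$ denote, respectively, vectors from $\mathbb{R}^d$ with non-negative coordinates and non-negative whole numbers. The standard inner product on $\mathbb{R}^d$ is denoted by $(\cdot,\cdot)$: $\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, with associated norm $\|\boldsymbol{x}\|:=\sqrt{(\boldsymbol{x},\boldsymbol{x})}$. Given some matrix $A\in\mathbb{R}^{d\times d}$, $A\succeq(\succ)\,0$ if and only if $\boldsymbol{x}^TA\boldsymbol{x}\ge(>)\,0$ for all $\boldsymbol{x}\neq 0$. We use the big-$O$ notation, that is, the function $f(x):\mathbb{R}\to\mathbb{R}$ is $O(g(x))$ as $x\to a$, $f(x)=O(g(x))$ as $x\to a$, if $\lim_{x\to a}\frac{|f(x)|}{|g(x)|}\le K$ for some positive constant $K$. We say that a function $f(\boldsymbol{x})$ grows not faster than a function $g(\boldsymbol{x})$ as $\boldsymbol{x}\to\infty$, if there exists a positive constant $Q$ such that $f(\boldsymbol{x})\le g(\boldsymbol{x})$ $\forall\boldsymbol{x}$ with $\|\boldsymbol{x}\|\ge Q$.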
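Definition 1. A mapping $\boldsymbol{g}:\mathbb{R}^d\to\mathbb{R}^d$ is monotone over $X\subseteq\mathbb{R}^d$, if $(\boldsymbol{g}(\boldsymbol{u})-\boldsymbol{g}(\boldsymbol{v}),\boldsymbol{u}-\boldsymbol{v})\ge 0$ for every $\boldsymbol{u},\boldsymbol{v}\in X$.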
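As a purely illustrative sketch, the monotonicity condition can be probed numerically by sampling pairs of points and evaluating the inner product of the difference of the mapped points with the difference of the points. The two maps below and all function names are hypothetical examples, not taken from the paper: `rotation_map` is the game mapping of the bilinear two-player game with costs $x_1x_2$ and $-x_1x_2$, which is monotone but not strongly monotone, and `negation_map` is a map that violates monotonicity. Sampling can only refute monotonicity, never prove it.

```python
import random


def inner(u, v):
    """Standard inner product of two equally long tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rotation_map(x):
    # Hypothetical example: game mapping of J1 = x1*x2, J2 = -x1*x2.
    return (x[1], -x[0])


def negation_map(x):
    # Hypothetical example of a map that is NOT monotone.
    return (-x[0], -x[1])


def monotonicity_gap(F, u, v):
    """Value of (F(u) - F(v), u - v); monotonicity requires it to be >= 0."""
    dF = [a - b for a, b in zip(F(u), F(v))]
    dx = [a - b for a, b in zip(u, v)]
    return inner(dF, dx)


def smallest_sampled_gap(F, trials=1000, seed=0):
    """Smallest gap over randomly sampled pairs of points in [-5, 5]^2."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        u = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        v = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        worst = min(worst, monotonicity_gap(F, u, v))
    return worst
```

For the rotation map the gap is identically zero, which is why no strong monotonicity constant exists for it, while the negation map produces strictly negative gaps.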
II Problem Formulation
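Consider a game $\Gamma(N,\{A_i\},\{J_i\})$ with $N$ players, the sets of players’ actions $A_i\subseteq\mathbb{R}^d$, $i\in[N]$, and the cost (objective) functions $J_i:\boldsymbol{A}\to\mathbb{R}$, where $\boldsymbol{A}=A_1\times\ldots\times A_N$ denotes the set of joint actions. We restrict the class of games as follows.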
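Assumption 1. The game under consideration is convex. Namely, for all $i\in[N]$ the set $A_i$ is convex and closed, the cost function $J_i(\boldsymbol{a}^i,\boldsymbol{a}^{-i})$ is defined on $\mathbb{R}^{Nd}$, continuously differentiable in $\boldsymbol{a}$ and convex in $\boldsymbol{a}^i$ for fixed $\boldsymbol{a}^{-i}$.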
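Assumption 2. The mapping $\boldsymbol{M}:\mathbb{R}^{Nd}\to\mathbb{R}^{Nd}$, referred to as the game mapping, defined by
$$\boldsymbol{M}(\boldsymbol{a})=\left(\nabla_{\boldsymbol{a}^1}J_1(\boldsymbol{a}^1,\boldsymbol{a}^{-1}),\ldots,\nabla_{\boldsymbol{a}^N}J_N(\boldsymbol{a}^N,\boldsymbol{a}^{-N})\right)^T$$
is monotone on $\boldsymbol{A}$ (see Definition 1).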
We consider a Nash equilibrium in the game $\Gamma$ as a stable solution outcome because it represents a joint action from which no player has any incentive to unilaterally deviate.
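Definition 2. A point $\boldsymbol{a}^*\in\boldsymbol{A}$ is called a Nash equilibrium if for any $i\in[N]$ and $\boldsymbol{a}^i\in A_i$
$$J_i(\boldsymbol{a}^{*i},\boldsymbol{a}^{*-i})\le J_i(\boldsymbol{a}^i,\boldsymbol{a}^{*-i}).$$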
Our goal is to learn such a stable action in a game by designing a payoff-based algorithm. We first connect the existence of Nash equilibria of $\Gamma$ with the solution set of a corresponding variational inequality problem.
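Definition 3. Consider a mapping $\boldsymbol{T}(\cdot)$: $\mathbb{R}^d\to\mathbb{R}^d$ and a set $Y\subseteq\mathbb{R}^d$. A solution $SOL(Y,\boldsymbol{T})$ to the variational inequality problem $VI(Y,\boldsymbol{T})$ is a set of vectors $\boldsymbol{y}^*\in Y$ such that $(\boldsymbol{T}(\boldsymbol{y}^*),\boldsymbol{y}-\boldsymbol{y}^*)\ge 0$, $\forall\boldsymbol{y}\in Y$.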
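(Proposition 1.4.2 in ) Given a game $\Gamma(N,\{A_i\},\{J_i\})$ with game mapping $\boldsymbol{M}$, suppose that the action sets $\{A_i\}$ are closed and convex, the cost functions $\{J_i\}$ are continuously differentiable in $\boldsymbol{a}$ and convex in $\boldsymbol{a}^i$ for every fixed $\boldsymbol{a}^{-i}$ on the interior of $\boldsymbol{A}$. Then, some vector $\boldsymbol{a}^*\in\boldsymbol{A}$ is a Nash equilibrium in $\Gamma$, if and only if $\boldsymbol{a}^*\in SOL(\boldsymbol{A},\boldsymbol{M})$.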
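To make the equivalence between Nash equilibria and variational inequality solutions concrete, the following sketch checks both conditions on a grid for a small example. Everything in it is an assumption made for illustration: the two-player game with scalar actions in $[0,1]$ and costs $a_1(1+a_2)$ and $a_2(1-a_1)$ (each linear in the player’s own action, so the game mapping is monotone but not strongly monotone), the grid resolution, and the function names. A grid check is only a necessary test, not a proof.

```python
# Hypothetical two-player game on the action sets A_1 = A_2 = [0, 1].
GRID = [k / 20.0 for k in range(21)]


def costs(a):
    """Costs (J_1, J_2), each linear in the player's own action."""
    return (a[0] * (1.0 + a[1]), a[1] * (1.0 - a[0]))


def game_mapping(a):
    """Partial derivative of each player's cost in her own action."""
    return (1.0 + a[1], 1.0 - a[0])


def is_nash_on_grid(a, tol=1e-12):
    """True if no unilateral deviation on the grid lowers a player's cost."""
    base = costs(a)
    for d in GRID:
        if costs((d, a[1]))[0] < base[0] - tol:
            return False
        if costs((a[0], d))[1] < base[1] - tol:
            return False
    return True


def solves_vi_on_grid(a, tol=1e-12):
    """True if (M(a), b - a) >= 0 for every grid point b of the joint set."""
    m = game_mapping(a)
    return all(
        m[0] * (b0 - a[0]) + m[1] * (b1 - a[1]) >= -tol
        for b0 in GRID
        for b1 in GRID
    )
```

In this example the point $(0,0)$ passes both checks, while points such as $(1,1)$ fail both, in line with the stated equivalence.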
It follows that under Assumptions 1 and 2 for a game $\Gamma$ with mapping $\boldsymbol{M}$, any solution of $VI(\boldsymbol{A},\boldsymbol{M})$ is also a Nash equilibrium in such games and vice versa. While under Assumptions 1 and 2 $\Gamma$ might admit a Nash equilibrium, these two assumptions alone do not guarantee existence of a Nash equilibrium. To guarantee existence, one needs to consider a more restrictive assumption, for example, strong monotonicity of the game mapping or compactness of the action sets. Here, we do not restrict our attention to such cases. However, to have a meaningful discussion, we do assume existence of at least one Nash equilibrium in the game.
Assumption 3. The set $SOL(\boldsymbol{A},\boldsymbol{M})$ is not empty.
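Assumption 4. Each element $\boldsymbol{M}_i$ of the game mapping $\boldsymbol{M}:\mathbb{R}^{Nd}\to\mathbb{R}^{Nd}$, defined in Assumption 2, is Lipschitz continuous on $\mathbb{R}^{Nd}$ with a Lipschitz constant $L_i$.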
Assumption 5. Each cost function $J_i(\boldsymbol{a})$, $i\in[N]$, grows not faster than a linear function of $\boldsymbol{a}$ as $\|\boldsymbol{a}\|\to\infty$.
III Payoff-Based Algorithm
Given payoff-based information, each agent has access to its current action, referred to as its state, and to the value of its cost at the joint state in each iteration. Using this information, in the proposed algorithm each agent “mixes” its next state. Namely, it chooses the next state randomly according to the multidimensional normal distribution with the density:
The initial values of the means can be set to any finite values. The successive means are updated as follows:
In the above, $\mathrm{Proj}_{A_i}[\cdot]$ denotes the projection operator on the set $A_i$, $\gamma_t$ is a step-size parameter and $\varepsilon_t$ is a regularization parameter. We highlight that the proposed approach differs from our past work through the additional regularization term in (2). In the absence of this term the algorithm would not be convergent under a mere monotonicity assumption on the game mapping (see counterexample provided in ).
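The following is a minimal sketch of a payoff-based update of this kind, written under explicit assumptions rather than as the exact procedure of the paper. It assumes that each player samples her state from a normal distribution around her current mean, forms the gradient sample as cost value times (state minus mean) divided by the variance, adds the regularization term, and projects the result onto her action set. The two-player game, the initial means, the parameter schedules and all names are hypothetical; in particular, the schedules have not been checked against the parameter conditions required for convergence.

```python
import random


def clip(value, lower, upper):
    """Projection of a scalar onto the interval [lower, upper]."""
    return max(lower, min(upper, value))


def costs(x):
    # Hypothetical game: each cost is linear in the player's own action.
    return (x[0] * (1.0 + x[1]), x[1] * (1.0 - x[0]))


def run_payoff_based(steps=2000, seed=0):
    """Return the list of mean vectors produced by the sketched update."""
    rng = random.Random(seed)
    mu = [0.8, 0.8]  # arbitrary finite initial means (assumption)
    history = [tuple(mu)]
    for t in range(1, steps + 1):
        gamma = 1.0 / t ** 0.6   # step size (illustrative schedule)
        sigma = 1.0 / t ** 0.1   # standard deviation (illustrative schedule)
        eps = 1.0 / t ** 0.1     # regularization (illustrative schedule)
        # Each player samples her state around her own mean.
        x = [rng.gauss(m, sigma) for m in mu]
        # The only feedback: each player's own cost at the joint state.
        J = costs(x)
        new_mu = []
        for i in range(2):
            # Player i uses only J[i], x[i] and mu[i].
            grad_sample = J[i] * (x[i] - mu[i]) / sigma ** 2
            step = gamma * sigma ** 2 * (grad_sample + eps * mu[i])
            new_mu.append(clip(mu[i] - step, 0.0, 1.0))
        mu = new_mu
        history.append(tuple(mu))
    return history
```

Note that each player’s update touches only her own sampled state, her own mean and her own observed cost value, which is what makes the scheme payoff-based; no gradients and no information about the other player are used.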
Let us provide insight into the algorithm by deriving an analogy to a regularized stochastic gradient algorithm. Given , for any define as
where . Above, , , can be interpreted as the th player’s cost function in mixed strategies. We can now show that the second term inside the projection in (2) is a sample of the gradient of this cost function with respect to the mixed strategies. Let .
We verify that the differentiation under the integral sign in (4) is justified. It can then readily be verified that (1) holds, by taking the differentiation inside the integral. A sufficient condition for differentiation under the integral is that the integral of the formally differentiated function with respect to converges uniformly, whereas the differentiated function is continuous (see , Chapter 17). By formally differentiating the function under the integral sign and omitting the arguments , we obtain
Given Assumption 1, is continuous. Thus, it remains to check that the integral of this function converges uniformly with respect to any . To this end, we can write the Taylor expansion of the function around the point with the coordinates and for any , , in the integral (6):
where , , , . The uniform convergence of the integral above follows from the fact (see the basic sufficient condition using a majorant , Chapter 17.2.3) that, under Assumption 5, for some positive constant and for all and . Hence,
Lemma 1 shows that the second term inside the projection in (2) is a sample of the gradient of the cost function in mixed strategies. Hence, algorithm (2) can be interpreted as a regularized stochastic projection algorithm. To bound the bias and variance terms of the stochastic projection and consequently establish convergence of the iterates, the step-size, variance, and regularization parameters need to satisfy certain assumptions.
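The gradient-sample interpretation can be illustrated with a simple Monte Carlo check, given here only as a sketch under stated assumptions. For a scalar state $x$ drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$, and the hypothetical cost $J(x)=x^2$, the cost in mixed strategies is $\mu^2+\sigma^2$, whose derivative in $\mu$ is $2\mu$. The sketch averages the payoff-based samples $J(x)(x-\mu)/\sigma^2$ and compares the average with $2\mu$. The cost function, the sample size and the function names are choices made for this illustration.

```python
import random


def cost(x):
    # Hypothetical cost; its Gaussian expectation is mu**2 + sigma**2.
    return x * x


def payoff_based_gradient_estimate(mu, sigma, n_samples, seed=0):
    """Average of cost(x) * (x - mu) / sigma**2 over Gaussian samples x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        total += cost(x) * (x - mu) / sigma ** 2
    return total / n_samples


def exact_mixed_gradient(mu):
    """Derivative of mu**2 + sigma**2 with respect to mu."""
    return 2.0 * mu
```

Only cost values enter the estimate, never derivatives of the cost. The price is a variance that grows as $\sigma$ shrinks, which is one reason the variance terms need to be controlled through the parameter choices.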
Let and choose , and , respectively, such that
As an example for existence of parameters to satisfy Assumption 6, let , , .
IV Analysis of the Algorithm
To prove Theorem 2 we first prove boundedness of the iterates. Because of the regularization term, this is done by analyzing the distance of the iterates from the so-called Tikhonov trajectory. Having established this boundedness, we can readily show that the limit of the iterates exists and satisfies the conditions of a Nash equilibrium of the game. For the boundedness and the convergence proofs, we use established results on boundedness (, Theorem 2.5.2) and convergence of a sequence of stochastic processes (Lemma 10 (page 49) in ), respectively. For ease of reference, we provide the statements of (, Theorem 2.5.2) and (Lemma 10 (page 49) in ) in the appendix.
IV-A Boundedness of the Algorithm Iterates
We first show that algorithm (2) falls under the framework of well-studied Robbins-Monro stochastic approximation procedures with an additional regularization term. Next, leveraging this analogy and results on stability of discrete-time Markov processes (, Theorem 2.5.2) applied to the sequence of iterates, we prove boundedness of the iterates.
Using the notation , we can rewrite the algorithm step in (2) in the following form:
and is the -dimensional mapping with the following elements:
The vector corresponds to the gradient term in stochastic approximation procedures, whereas
is a disturbance of the gradient term. Finally,
is a martingale difference, namely, according to (1),
To ensure boundedness of (Lemma 3) we bound the martingale term above (see Inequality (IV-A)). To bound the disturbance of the gradients (see Equation (42)), we observe that the mapping evaluated at is equivalent to the game mapping in mixed strategies (please see Appendix for the proof of this observation). That is,
In contrast to stochastic approximation algorithms and the proof in , we have an additional term in order to address merely monotone game mappings. As such, to bound the iterates we also relate the variations of the sequence of iterates to those of the Tikhonov sequence defined below. Let $\boldsymbol{y}(t)$ denote the solution of the variational inequality $VI(\boldsymbol{A},\boldsymbol{M}(\boldsymbol{y})+\varepsilon_t\boldsymbol{y})$, namely
The sequence $\boldsymbol{y}(t)$ is known as the Tikhonov sequence and enjoys the following two important properties.
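As a small illustration of Tikhonov regularization, and not of the specific properties used in the proof, the sketch below computes regularized solutions for a hypothetical monotone linear map in the unconstrained case, where solving the variational inequality reduces to finding a zero of the regularized map. The map $(y_1+y_2-1,\;y_1+y_2-1)$ has the whole line $y_1+y_2=1$ as its solution set; a standard fact about Tikhonov regularization of monotone problems is that the regularized solutions approach the least-norm solution, here $(0.5,0.5)$, as the regularization parameter tends to zero. The map and the function names are assumptions made for this example.

```python
import math


def tikhonov_point(eps):
    """Zero of y -> B y - b + eps * y for B = [[1, 1], [1, 1]], b = (1, 1).

    The 2x2 linear system (B + eps * I) y = b is solved by Cramer's rule.
    """
    a11, a12 = 1.0 + eps, 1.0
    a21, a22 = 1.0, 1.0 + eps
    det = a11 * a22 - a12 * a21
    y1 = (1.0 * a22 - a12 * 1.0) / det
    y2 = (a11 * 1.0 - a21 * 1.0) / det
    return (y1, y2)


def regularized_residual(y, eps):
    """Largest absolute entry of B y - b + eps * y; zero at the solution."""
    s = y[0] + y[1] - 1.0
    return max(abs(s + eps * y[0]), abs(s + eps * y[1]))


def norm(y):
    """Euclidean norm of a 2-vector."""
    return math.sqrt(y[0] ** 2 + y[1] ** 2)
```

For each positive regularization parameter the regularized problem has a unique solution even though the original problem does not, which is the property that makes the Tikhonov sequence a convenient reference trajectory.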
With the results above in place, we connect the squared distance to the squared distance for any and . Due to the triangle inequality,
where in the last inequality we used Lemma 2. Hence, by taking into account that for any and
we conclude from (14) that for any
The above bound serves as the main new inequality for showing almost-sure boundedness of the iterates, in comparison to non-regularized stochastic gradient procedures.
In the following, for simplicity in notation, we omit the argument in the terms , , and . In certain derivations, for the same reason we omit the time parameter as well.
Define , where is the Tikhonov sequence defined by (13). We consider the generating operator of the Markov process
and aim to show that satisfies the following decay
where on , , , , , . This enables us to apply Theorem 2.5.2 in  to directly conclude almost sure boundedness of .
From the procedure for the update of , the non-expansion property of the projection operator, the fact that belongs to , namely, that
we obtain that for any
where, for ease of notation, we have defined
Our goal is to bound above, and use this bound in constructing Inequality (17). As such, we expand as below and bound the terms in the expansion.
Due to Assumption 4, we conclude that
where in the last inequalities in (IV-A)-(39) we used (18). Let us analyze the terms containing the disturbance of gradient, namely , in Equation (31). Since , due to Assumption 2 and Equation (12), we obtain
Finally, we bound the martingale term .
where the first inequality is due to the fact that and taking into account (11), the second inequality is due to Assumption 5, with being a quadratic function of and , . Substituting the inequalities (IV-A)-(IV-A) into the inequality (20), taking into account (18), the Cauchy-Schwarz inequality, and the martingale properties in (11) of , , we get