In multi-agent reinforcement learning in games, an agent repeatedly adjusts his strategy in response to a stream of payoffs, which in turn depend on the actions of the other agents. The objective is for agents to arrive at a strategy profile that yields the best obtainable outcome for each of them, under information constraints or in the presence of noise. Whether such a strategy profile can be reached depends on the learning process used by the agents. Learning in finite games typically involves discrete-time stochastic processes, where stochasticity arises from the agents’ randomized choices [1, 2, 3, 4, 5, 6, 7]. A key approach to analyzing such processes is based on the ordinary differential equation (ODE) method of stochastic approximation, which relates their behaviour to that of a “mean field” ODE. Motivated by this, we follow [9, 10, 12] and consider a reinforcement learning scheme directly in continuous time. By so doing, we are able to focus on the relationship between reinforcement learning, convex analysis, and passivity. (Footnote 1: One could use the results developed in this paper to analyze discrete-time learning schemes as in , , but we leave this for future work.)
Our starting point is a variant of the continuous-time exponentially-discounted reinforcement learning (EXP-D-RL) scheme in , versions of which are also known as the exponential-weight algorithm or Q-learning. Under this continuous-time learning process, each agent maintains a vector of scores for all his actions, based on aggregation of his exponentially-discounted stream of payoffs. These scores are then converted into mixed strategies using a static logit (soft-max) rule, which assigns choice probabilities proportionally to the exponential of each action’s score. The resulting learning dynamics describe the evolution of the scores (dual variables), rather than of the mixed strategies (primal variables). The same score dynamics also result from stochastic approximation of a Q-learning scheme, which was shown in  to converge to a Nash distribution (logit equilibrium) in 2-player zero-sum and partnership (potential) games. It is also related to the scheme proposed in  for traffic games in a repeated-game setup and shown to converge to a Nash distribution in $N$-player potential games. The strategy dynamics induced by these score dynamics are analyzed in  and proved to converge towards logit equilibria in potential games. With the exception of , the majority of these works have focused exclusively on convergence in potential games [4, 5, 6]. In contrast, less attention has been paid to stable games [14, 15], which encompass zero-sum games and potential games with concave payoffs, and include well-known examples from evolutionary game theory such as the Rock-Paper-Scissors (RPS) game.
Motivated by the above, in this paper we exploit passivity techniques and the natural monotonicity property associated with this class of games to show convergence to a Nash distribution. The use of passivity to investigate game dynamics was first proposed in  for population games, based on the notion of $\delta$-passivity. The authors showed that certain game dynamics, and the class of stable games, naturally satisfy this type of passivity. The coupling of a $\delta$-passive system with a stable game implies stable behaviour of the closed-loop solution. It was also shown that if an evolutionary dynamic does not satisfy a passivity property, then it is possible to construct a higher-order stable game that results in instability. Here we use an alternative concept, that of equilibrium-independent passivity. This allows us to directly address convergence towards equilibria. We note that equilibrium-independent passivity was used recently in  for continuous-kernel games, to relax the assumption of perfect information on other players’ actions.
Contributions. Our contributions are twofold: we show (1) that a passivity framework can be used to prove convergence of reinforcement learning in finite games, and (2) that its principles can be used towards designing higher-order learning dynamics that preserve convergence to a Nash distribution. Our approach is based on reformulating the overall learning dynamics of all agents as a feedback interconnected system. We show that the continuous-time EXP-D-RL scheme can be naturally posed as a payoff-feedback system in the dual space, where the forward system satisfies more than passivity, namely it is output-strictly (equilibrium-independent) passive. We exploit its particular storage function, related to a Bregman divergence, as a Lyapunov function and show convergence of EXP-D-RL to a Nash distribution (logit equilibrium) in any $N$-player game for which the (negative) payoff is merely monotone, not necessarily strictly monotone. To the best of our knowledge, this is the first convergence result for such games. This class of games corresponds to the class of stable games in population games , and to the class of monotone games in continuous games. It subsumes, in the case of finite-action games, potential games with concave potential, 2-player zero-sum games, as in , as well as the standard 2-player RPS game. Key to our approach is the fact that we analyze convergence based on the natural score dynamics, which are in the dual (payoff) space, as in . This is in contrast to analyzing the induced strategy dynamics as done in , and is unlike the indirect analysis done in , via connection to the smooth best-response. Unlike [5, 6], we show convergence to a Nash distribution in games that go beyond the class of potential games, for example the RPS game or the Shapley game.
Furthermore, we exploit cocoercivity of the soft-max to show convergence even for hypo-monotone games (corresponding to unstable games in ), such as unstable RPS games and Shapley’s game, if the temperature parameter is above a certain threshold. To achieve this, we balance the game’s shortage of passivity (hypo-monotonicity) by the excess of passivity coming from the soft-max map (cocoercivity).
In the second part of the paper, we build on the passivity interpretation and propose a method to design higher-order extensions of EXP-D-RL. Higher-order dynamics, obtained via the introduction of auxiliary states, can have different properties. They can have significant benefits, fostering convergence in larger classes of games, as shown in [10, 11] for fictitious-play/gradient-play and in evolutionary games. In reinforcement learning, higher-order extensions of the un-discounted reinforcement learning have been proposed in  based on second or $n$-th order payoff-integration (equivalent to a cascade modification of the first-order replicator dynamics by a chain of integrators). These dynamics have been shown to lead to the elimination of weakly dominated strategies, followed by the iterated deletion of strictly dominated strategies, a property not exhibited by standard replicator dynamics. However, as shown in , such cascade augmentation does not guarantee that passivity/convergence properties are preserved when extending from first to higher-order dynamics. Our second result shows that if higher-order dynamics are built by feedback modification via a passive system that preserves the equilibrium point, convergence to a Nash distribution can be guaranteed in the same class of games. We explicitly build a second-order learning scheme based on this method, by specifying a particular LTI positive-real system for the feedback modification path. We show numerically that these higher-order dynamics can converge faster and, in some cases, can converge in larger classes of games (more hypo-monotone) than the first-order scheme. A short version of this paper will appear in .
The paper is organized as follows. Section II provides background material. Section III introduces the continuous-time score-based EXP-D-RL reinforcement learning scheme. Section IV provides convergence analysis of the first-order EXP-D-RL scheme. Section V proposes and analyzes a class of higher-order dynamics. Section VI discusses connections to population games. Section VII discusses several examples and presents simulation results. Section VIII provides the conclusions.
II-A Convex Optimization and Monotone Operator Theory
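An operator (or mapping) $F: \mathcal{D} \subseteq \mathbb{R}^n \to \mathbb{R}^n$ is said to be monotone on $\mathcal{D}$ if $(F(z) - F(z'))^\top (z - z') \geq 0$, $\forall z, z' \in \mathcal{D}$. It is strictly monotone if strict inequality holds $\forall z \neq z'$. $F$ is $\mu$-strongly monotone if $(F(z) - F(z'))^\top (z - z') \geq \mu \|z - z'\|_2^2$, $\forall z, z' \in \mathcal{D}$, for some $\mu > 0$. We note that a function $f$ is convex if and only if $(\nabla f(z) - \nabla f(z'))^\top (z - z') \geq 0$, $\forall z, z'$, where $\nabla f$ is its gradient, and strictly convex if and only if the inequality is strict $\forall z \neq z'$. $F$ is $L$-Lipschitz if there exists an $L > 0$ such that $\|F(z) - F(z')\|_2 \leq L \|z - z'\|_2$, $\forall z, z' \in \mathcal{D}$. $F$ is nonexpansive if $L = 1$, and contractive if $L < 1$. $F$ is $\beta$-cocoercive if there exists a $\beta > 0$ such that $(F(z) - F(z'))^\top (z - z') \geq \beta \|F(z) - F(z')\|_2^2$, $\forall z, z' \in \mathcal{D}$. $F$ is referred to as firmly nonexpansive for $\beta = 1$.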
II-B Equilibrium Independent Passivity
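Consider the system
$$\Sigma: \quad \dot{x} = f(x, u), \quad y = h(x, u), \tag{1}$$
with $x \in \mathbb{R}^n$, $u \in \mathbb{R}^q$, $y \in \mathbb{R}^q$, $f$ locally Lipschitz, $h$ continuous. Consider a differentiable function $V: \mathbb{R}^n \to \mathbb{R}$. The time derivative of $V$ along solutions of (1) is $\dot{V}(x) = \nabla V(x)^\top f(x, u)$ or just $\dot{V}$. Let $\bar{u}$, $\bar{x}$, $\bar{y}$ be an equilibrium condition, such that $0 = f(\bar{x}, \bar{u})$, $\bar{y} = h(\bar{x}, \bar{u})$. Assume there exists $\Gamma \subseteq \mathbb{R}^q$ and a continuous function $k_x(\bar{u})$ such that for any constant $\bar{u} \in \Gamma$, $f(k_x(\bar{u}), \bar{u}) = 0$ (basic assumption).
System (1) is Equilibrium Independent Passive (EIP) if it is passive with respect to $\bar{u}$ and $\bar{y}$; that is, for every $\bar{u} \in \Gamma$ there exists a differentiable, positive semi-definite storage function $V_{\bar{x}}: \mathbb{R}^n \to \mathbb{R}$ such that $V_{\bar{x}}(\bar{x}) = 0$ and, for all $u \in \mathbb{R}^q$, $x \in \mathbb{R}^n$,
$$\dot{V}_{\bar{x}}(x) \leq (y - \bar{y})^\top (u - \bar{u}). \tag{2}$$
$\Sigma$ is output-strictly EIP (OSEIP) if there is a $\rho > 0$ such that
$$\dot{V}_{\bar{x}}(x) \leq (y - \bar{y})^\top (u - \bar{u}) - \rho \|y - \bar{y}\|_2^2. \tag{3}$$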
EIP requires that (2) holds for every $\bar{u} \in \Gamma$ ($\Sigma$ is passive independently of the equilibrium point), while traditional passivity requires that it holds only for a particular $\bar{u}$ (usually associated with the origin as equilibrium). EIP properties help in deriving stability and convergence properties for feedback systems without requiring exact knowledge of an equilibrium point, but rather only that it exists. The parallel interconnection of two EIP systems is an EIP system, and the feedback interconnection of two EIP systems that satisfies the basic assumption is an EIP system (cf. Property 2 and 3 in ). When the system $\Sigma$ is just a static map, EIP is equivalent to incremental passivity and to monotonicity. A static nonlinear function $y = \Phi(u)$ is defined to be EIP (OSEIP) if $\Phi$ is monotone ($\rho$-cocoercive). A linear (output-strictly) passive system, $\dot{x} = Ax + Bu$, $y = Cx + Du$, with $(A, B)$ controllable, $(A, C)$ observable, and $A$ invertible is (OS)EIP (cf. Ex. 1 in ). This can be shown using $V_{\bar{x}}(x) = V(x - \bar{x})$, where $V$ is the quadratic storage function associated with the passivity of the linear system relative to the origin equilibrium, by direct application of the KYP lemma (cf. Section 6.4 of ). The additional requirement of invertibility of $A$ is necessary to satisfy the basic assumption on the existence and continuity of $k_x(\bar{u})$, which is defined by $k_x(\bar{u}) = -A^{-1} B \bar{u}$. The equilibrium input-output map is defined by $k_y(\bar{u}) = (D - C A^{-1} B) \bar{u}$.
II-C Games in Normal Form
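Consider a game $\mathcal{G}$ between a set of players (agents) $\mathcal{P} = \{1, \dots, N\}$, where each player $p \in \mathcal{P}$ has a finite set of actions (or pure strategies) $\mathcal{A}^p$, and a payoff $\mathcal{U}^p: \mathcal{A} \to \mathbb{R}$, with $\mathcal{A}$ the overall action set of all players, $\mathcal{A} = \prod_{p \in \mathcal{P}} \mathcal{A}^p$.
Let $n^p = |\mathcal{A}^p|$ and $n = \sum_{p \in \mathcal{P}} n^p$. Without loss of generality we identify $\mathcal{A}^p$ as the corresponding index set, i.e., $\mathcal{A}^p = \{1, \dots, n^p\}$, and denote a generic action as $i \in \mathcal{A}^p$. Let $x^p = (x^p_i)_{i \in \mathcal{A}^p}$ denote the mixed strategy of player $p$, a probability distribution over his set of actions. Then $x^p \in \Delta^p$, where $\Delta^p = \{x^p \in \mathbb{R}^{n^p}_{\geq 0} : \sum_{i \in \mathcal{A}^p} x^p_i = 1\}$ is the set of mixed strategies for player $p$. A mixed strategy profile is denoted as $x = (x^p)_{p \in \mathcal{P}} \in \Delta$, where $\Delta = \prod_{p \in \mathcal{P}} \Delta^p$ is called the game’s strategy space. We also use the shorthand notation $x = (x^p; x^{-p})$, where $x^{-p}$ is the strategy profile of the other players except $p$. Player $p$’s expected payoff to using $x^p$ in the mixed strategy profile $x$ is
$$\mathcal{U}^p(x) = \sum_{i_1 \in \mathcal{A}^1} \cdots \sum_{i_N \in \mathcal{A}^N} \mathcal{U}^p(i_1, \dots, i_N) \, x^1_{i_1} \cdots x^N_{i_N}, \tag{4}$$
where $\mathcal{U}^p(i_1, \dots, i_N)$ denotes his payoff in the pure (action) profile $(i_1, \dots, i_N)$, with $i_p \in \mathcal{A}^p$. We denote by $U^p_i(x) := \mathcal{U}^p(i; x^{-p})$ his expected payoff corresponding to using pure strategy $i$ in the mixed profile $x$. Note that we can write (4) as
$$\mathcal{U}^p(x) = \sum_{i \in \mathcal{A}^p} x^p_i \, U^p_i(x) = x^{p\top} U^p(x), \tag{5}$$
or $\mathcal{U}^p(x) = \langle x^p, U^p(x) \rangle$, where $U^p(x) = (U^p_i(x))_{i \in \mathcal{A}^p}$ is called the payoff vector of player $p$ at $x$, indicating the duality pairing between $x^p$ and $U^p$. With the players’ (expected) payoffs $\mathcal{U}^p$, the tuple $(\mathcal{P}, \Delta, \mathcal{U})$ is called the mixed extension of $\mathcal{G}$, also denoted by $\mathcal{G}$.
A mixed strategy profile $x^\star \in \Delta$ is a Nash equilibrium of game $\mathcal{G}$ if
$$\mathcal{U}^p(x^{p\star}; x^{-p\star}) \geq \mathcal{U}^p(x^p; x^{-p\star}), \quad \forall x^p \in \Delta^p, \ \forall p \in \mathcal{P}. \tag{6}$$
Define the mixed best-response map of player $p$, $BR^p(x^{-p}) = \arg\max_{x^p \in \Delta^p} \mathcal{U}^p(x^p; x^{-p})$. Then $x^\star$ satisfies
$$x^{p\star} \in BR^p(x^{-p\star}), \quad \forall p \in \mathcal{P}. \tag{7}$$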
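To make the payoff-vector notation concrete, the following is a minimal sketch for the two-player case only. The function names (`payoff_vectors`, `expected_payoffs`, `is_nash`) and the matrix convention (`A[i][j]` and `B[i][j]` are the payoffs of player 1 and player 2 when player 1 plays `i` and player 2 plays `j`) are assumptions made for illustration, not notation from the paper. The Nash test only checks pure deviations, which suffices because the expected payoff is linear in a player's own mixed strategy.

```python
def payoff_vectors(A, B, x1, x2):
    """Payoff vectors of both players at the mixed profile (x1, x2)."""
    # Expected payoff of each pure action i of player 1 against x2.
    U1 = [sum(A[i][j] * x2[j] for j in range(len(x2))) for i in range(len(x1))]
    # Expected payoff of each pure action j of player 2 against x1.
    U2 = [sum(B[i][j] * x1[i] for i in range(len(x1))) for j in range(len(x2))]
    return U1, U2


def expected_payoffs(A, B, x1, x2):
    """Expected payoffs as the pairing of each strategy with its payoff vector."""
    U1, U2 = payoff_vectors(A, B, x1, x2)
    return (sum(p * u for p, u in zip(x1, U1)),
            sum(p * u for p, u in zip(x2, U2)))


def is_nash(A, B, x1, x2, tol=1e-9):
    """True if no pure deviation improves either player's expected payoff."""
    U1, U2 = payoff_vectors(A, B, x1, x2)
    u1, u2 = expected_payoffs(A, B, x1, x2)
    return max(U1) <= u1 + tol and max(U2) <= u2 + tol


# Standard Rock-Paper-Scissors (zero-sum): player 2 receives the negative of A.
A_rps = [[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]
B_rps = [[-a for a in row] for row in A_rps]
uniform = [1.0 / 3.0] * 3

print(is_nash(A_rps, B_rps, uniform, uniform))
print(is_nash(A_rps, B_rps, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
```

The first call tests the uniform profile, the second a pure profile from which player 1 can profitably deviate.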
III Exponentially-Discounted Learning (EXP-D-RL)
In the following, we describe the score-based reinforcement learning (RL) scheme, modeled in continuous time as in [5, 9, 7]. Each player $p$ keeps a score $z^p$ based on his received payoff $U^p$, and maps it into a strategy $x^p$. He plays the game according to the strategy $x^p$. This process is repeated indefinitely, with an infinitesimal time-step between each stage; hence it can be modeled in continuous time as a three-stage process, described as follows:
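(1) Assessment Stage: Each player $p$ keeps a vector score variable $z^p \in \mathbb{R}^{n^p}$, with each $i$-th action having a score $z^p_i$. Starting from an initial score, he updates it based on exponentially-discounted aggregation (learning) (EXP-D-RL) of his own payoff stream,
$$z^p_i(t) = e^{-\gamma t} z^p_i(0) + \gamma \int_0^t e^{-\gamma (t - \tau)} \, U^p_i(x(\tau)) \, d\tau, \tag{8}$$
or, in differential form,
$$\dot{z}^p_i = \gamma \left( U^p_i(x) - z^p_i \right), \quad z^p_i(0) \in \mathbb{R}, \tag{9}$$
where $\gamma > 0$ is the learning rate and $z^p_i(0)$ is the bias towards strategy $i$ at the beginning of the game. EXP-D-RL can be represented by a scalar-valued transfer function $\frac{\gamma}{s + \gamma}$, similar to the scheme studied in , where . The case , i.e., integration of the (un-discounted) payoff is studied in ,.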
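(2) Choice Stage: Each player $p$ maps his own score $z^p$ into a mixed strategy $x^p$ using a choice map, e.g. the best-response choice, $\beta^p(z^p) = \arg\max_{x^p \in \Delta^p} x^{p\top} z^p$. To ensure that the choice map is single-valued, an at-least strictly convex function called regularizer is used, $h^p: \Delta^p \to \mathbb{R}$, which yields the regularized/smooth best-response choice,
$$\sigma^p(z^p) = \arg\max_{x^p \in \Delta^p} \left[ x^{p\top} z^p - \epsilon \, h^p(x^p) \right]. \tag{10}$$
Depending on the context, the regularizer is also referred to as admissible deterministic perturbation, penalty function, smoothing function, barrier function or Bregman function. For a detailed construction of the regularizer, see . We consider the commonly used (negative) Gibbs entropy,
$$h^p(x^p) = \sum_{j \in \mathcal{A}^p} x^p_j \log x^p_j, \tag{11}$$
for which (10) becomes the soft-max function
$$\sigma^p(z^p) = \frac{1}{\sum_{j \in \mathcal{A}^p} \exp(z^p_j / \epsilon)} \left[ \exp(z^p_1 / \epsilon), \dots, \exp(z^p_{n^p} / \epsilon) \right]^\top, \tag{12}$$
where $\epsilon > 0$. Note that $\sigma^p: \mathbb{R}^{n^p} \to \mathrm{int}\, \Delta^p$, where $\mathrm{int}\, \Delta^p$ is the (relative) interior of the simplex $\Delta^p$. $\epsilon$ is typically referred to as the temperature parameter. For $\epsilon = 1$, (12) is known as the standard soft-max function. As $\epsilon \to \infty$, actions are selected with uniform probability, and as $\epsilon \to 0$, the soft-max function selects the action associated with the highest score, provided that the difference between any two scores is not too small. With (12), the mixed strategy for player $p$ is taken as $x^p = \sigma^p(z^p)$, so that, combined with (9), the EXP-D-RL scheme of player $p$ is
$$\dot{z}^p = \gamma \left( U^p(x) - z^p \right), \quad x^p = \sigma^p(z^p). \tag{14}$$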
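As an illustration of the soft-max choice map and its temperature limits, here is a minimal sketch in plain Python. The function name `softmax`, the parameter name `eps` for the temperature, and the sample scores are assumptions made for this example. Subtracting the largest score before exponentiating is a standard numerical safeguard; it does not change the output because the soft-max is invariant to adding a constant to all scores.

```python
import math


def softmax(z, eps=1.0):
    """Soft-max choice map with temperature eps > 0 (illustrative sketch)."""
    if eps <= 0:
        raise ValueError("temperature eps must be positive")
    m = max(z)  # shift by the largest score to avoid overflow in exp
    w = [math.exp((zi - m) / eps) for zi in z]
    s = sum(w)
    return [wi / s for wi in w]


scores = [1.0, 2.0, 3.0]
for eps in (100.0, 1.0, 0.01):
    # Large eps gives nearly uniform play; small eps concentrates on the top score.
    print(eps, [round(p, 4) for p in softmax(scores, eps)])
```

Every output lies in the interior of the simplex: all probabilities are strictly positive as long as the score gaps divided by `eps` do not underflow in floating point.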
Similar forms of “exponentially-discounted” score dynamics have been investigated in ,. When $\gamma = 1$, (14) coincides with the learning rule studied in  for a unit discount factor. When $\gamma \neq 1$ there is a slight difference: we use $\gamma$ not only as a discount factor (as in ), but also as a multiplicative factor (learning rate), and by doing so the rest points of (14) are independent of $\gamma$. Structurally, EXP-D-RL is similar to online mirror descent (OMD) in convex optimization, recently studied for continuous games in ,. In particular, the score $z^p$ is the dual variable to the primal variable $x^p$. Therefore, (14) describes the evolution of learning in the dual space $\mathbb{R}^{n^p}$, whereas the induced strategy trajectory describes the evolution in the primal space $\Delta^p$. This type of duality is discussed in . Lastly, we note that using stochastic approximation, (14) corresponds to the individual Q-learning algorithm, which has been shown to converge in 2-player zero-sum games and 2-player partnership games (Proposition 4.2, ).
Note that a first-order Euler discretization of the dynamics (14), with discretization step $\delta > 0$, is
$$z^p_i(k+1) = z^p_i(k) + \delta \gamma \left( U^p_i(x(k)) - z^p_i(k) \right), \quad x^p(k) = \sigma^p(z^p(k)),$$
where $x^p(k)$ is the mixed strategy and $x^p_i(k)$ the probability of playing $i$ at the $k$-th instance of play. This recursion tracks (14) arbitrarily well over finite-time horizons when the discretization step $\delta$ is sufficiently small, but requires perfect monitoring of the mixed strategies of the other players. This can be relaxed if players are assumed to possess a bounded, unbiased estimate of their actions’ payoffs, or if they observe their in-game realized payoffs. In fact, (14) is the mean field of the discrete-time stochastic process,
$$z^p_i(k+1) = z^p_i(k) + \alpha(k) \, \gamma \left( \hat{U}^p_i(k) - z^p_i(k) \right),$$
where $\hat{U}^p_i(k)$ is an unbiased estimator of $U^p_i(x(k))$, i.e., such that $\mathbb{E}[\hat{U}^p_i(k) \mid \mathcal{F}_k] = U^p_i(x(k))$, with $\mathcal{F}_k$ the history of play up to stage $k$, and $\alpha(k)$ is a diminishing sequence of step-sizes, e.g. $\alpha(k) = 1/k$. If player $p$ can observe the action profile $i^{-p}(k)$ played by his opponents (or can otherwise calculate his strategies’ payoffs), such an estimate is provided by $\hat{U}^p_i(k) = \mathcal{U}^p(i; i^{-p}(k))$. If instead player $p$ can only observe the payoff of his chosen action, then a typical choice for $\hat{U}^p_i(k)$ is $\mathcal{U}^p(i^p(k); i^{-p}(k)) / x^p_i(k)$ if $i^p(k) = i$, and $0$ otherwise [3, 5, 9], where division by $x^p_i(k)$ compensates for the infrequency with which the score of the $i$-th strategy is updated. Results from the theory of stochastic approximation  can be used to tie convergence of such discrete-time algorithms to the asymptotic behaviour of (14), (see ). In this paper we restrict our focus to the continuous-time learning scheme, as in [9, 10, 12].
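The Euler recursion described above can be sketched for the two-player Rock-Paper-Scissors game as follows. This is an illustrative sketch under stated assumptions, not the paper's simulation code: it assumes score dynamics of the form "score rate = learning rate times (payoff vector minus score)" composed with the soft-max, perfect monitoring of the opponent's mixed strategy, and hypothetical parameter values (`gamma`, `eps`, `dt`, `steps`) and initial scores.

```python
import math

# Standard Rock-Paper-Scissors payoffs for player 1; player 2 receives the negative.
A = [[0.0, -1.0, 1.0],
     [1.0, 0.0, -1.0],
     [-1.0, 1.0, 0.0]]


def softmax(z, eps):
    m = max(z)  # shift for numerical stability
    w = [math.exp((zi - m) / eps) for zi in z]
    s = sum(w)
    return [wi / s for wi in w]


def payoff_vectors(x1, x2):
    # Payoff of each pure action against the opponent's current mixed strategy.
    U1 = [sum(A[i][j] * x2[j] for j in range(3)) for i in range(3)]
    U2 = [sum(-A[i][j] * x1[i] for i in range(3)) for j in range(3)]
    return U1, U2


def exp_d_rl_euler(z1, z2, gamma=1.0, eps=1.0, dt=0.01, steps=5000):
    """Forward-Euler integration of the score dynamics for both players."""
    z1, z2 = list(z1), list(z2)
    for _ in range(steps):
        x1, x2 = softmax(z1, eps), softmax(z2, eps)
        U1, U2 = payoff_vectors(x1, x2)
        # Score update: move each score towards the current payoff of its action.
        z1 = [z + dt * gamma * (u - z) for z, u in zip(z1, U1)]
        z2 = [z + dt * gamma * (u - z) for z, u in zip(z2, U2)]
    return softmax(z1, eps), softmax(z2, eps)


x1, x2 = exp_d_rl_euler([1.0, 0.0, -1.0], [0.0, 2.0, 0.0])
print([round(p, 4) for p in x1], [round(p, 4) for p in x2])
```

For this symmetric zero-sum game the uniform profile is a rest point of the recursion (the payoff vectors vanish there and the soft-max of a zero score vector is uniform), so the final strategies can be compared against it.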
IV Convergence of EXP-D-RL Dynamics
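In this section we analyze the convergence of EXP-D-RL, (14). Let $z = (z^p)_{p \in \mathcal{P}}$, $x = (x^p)_{p \in \mathcal{P}}$ and $U(x) = (U^p(x))_{p \in \mathcal{P}}$ denote the score vector, stacked mixed strategies and the overall payoff vector for all players, respectively. With (14), EXP-D-RL is written for all players as
$$\dot{z} = \gamma \left( U(x) - z \right), \quad x = \sigma(z), \tag{15}$$
where $\sigma(z) = (\sigma^p(z^p))_{p \in \mathcal{P}}$, $z \in \mathbb{R}^n$.
Given that the payoff functions are Lipschitz and bounded on $\Delta$, the scores $z(t)$ will remain finite for all $t \geq 0$, so $x(t) = \sigma(z(t))$ will be defined for all $t \geq 0$ and for all $z(0)$. Moreover, $\sigma(z) \in \mathrm{int}\, \Delta$, $\forall z \in \mathbb{R}^n$, hence only strategy trajectories on the interior of the simplex are obtained.
EXP-D-RL (15) can be represented as the feedback interconnection (Figure 1) between
$$\Sigma: \quad \dot{z} = \gamma (u - z), \quad x = \sigma(z), \tag{16}$$
and $u = U(x)$, where $U$ is the game (static) map. In the following we characterize the asymptotic behaviour of solutions of (15).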
IV-A Equilibrium Points of EXP-D-RL
System (15) is written as,
$$\dot{z} = \gamma \left( (U \circ \sigma)(z) - z \right), \tag{17}$$
where $(U \circ \sigma)(z) = U(\sigma(z))$. Note that $\sigma$ is Lipschitz (as shown in Proposition 2) and since $U$ is Lipschitz and bounded, existence and uniqueness of global solutions of (17) follows from standard arguments. An equilibrium (rest) point $\bar{z}$ is $\bar{z} = U(\sigma(\bar{z}))$, hence a fixed point of the map $U \circ \sigma$. The existence of a fixed point is guaranteed by Brouwer’s fixed-point theorem provided that $U \circ \sigma$ is a continuous function with bounded range. Based on (16), such a $\bar{z}$ can be represented as,
$$\bar{z} = U(\bar{x}), \quad \bar{x} = \sigma(\bar{z}). \tag{18}$$
From (18), it follows that, over the set of rest points of (17), the function $\sigma$ has an inverse given by $U$. As shown next, any $\bar{x} = \sigma(\bar{z})$ corresponding to (18) is a Nash equilibrium of a game with perturbed payoff (Nash distribution), and for small $\epsilon$ it approximates the Nash equilibria of $\mathcal{G}$, (see also  and Proposition 2, ). Note that as $\epsilon \to \infty$, all actions of each player are selected with uniform probability and there exists a unique $\bar{x}$ at the centroid of the simplex $\Delta$.
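Proposition 1. Any $\bar{x} = \sigma(\bar{z})$, where $\bar{z}$ is a rest point of (17), is a Nash equilibrium of game $\mathcal{G}$ with perturbed payoff,
$$\tilde{\mathcal{U}}^p(x^p; x^{-p}) = x^{p\top} U^p(x) - \epsilon \sum_{j \in \mathcal{A}^p} x^p_j \log x^p_j, \tag{19}$$
where $\epsilon > 0$, $p \in \mathcal{P}$.
Proof. Recall from (10)-(12), that for all $p \in \mathcal{P}$, $\sigma^p(z^p) = \arg\max_{x^p \in \Delta^p} \left[ x^{p\top} z^p - \epsilon \, h^p(x^p) \right]$. On the other hand, by (7) and (5), a Nash equilibrium $\bar{x}$ of $\mathcal{G}$ with perturbed payoff (19), is a fixed-point of the (perturbed) best-response function $\widetilde{BR}$, where $\widetilde{BR}^p(x^{-p}) = \arg\max_{x^p \in \Delta^p} \left[ x^{p\top} U^p(x) - \epsilon \, h^p(x^p) \right]$ and $U^p(x)$ is independent of $x^p$ (by (5)). Thus, $\forall p \in \mathcal{P}$, we can write $\widetilde{BR}^p(x^{-p}) = \sigma^p(z^p)$, where $z^p = U^p(x)$, or, concatenating all relations, $\widetilde{BR}(x) = \sigma(z)$, where $z = U(x)$. A fixed-point $\bar{x}$ of $\widetilde{BR}$ satisfies $\bar{x} = \sigma(\bar{z})$, where $\bar{z} = U(\bar{x})$, which are exactly (18). ∎
We note that $\bar{x}$ corresponding to (18) is referred to as a Nash distribution, , a (perturbed) logit equilibrium of the game [15, p. 191], or a type of quantal response equilibrium (QRE), . It satisfies $\bar{x} = \sigma(U(\bar{x}))$, hence is a fixed-point of $\sigma \circ U$, which also exists by Brouwer’s fixed point theorem (Theorem 1, ) and is parameterized by $\epsilon$. At a Nash distribution each player plays a smooth best-response ($\sigma^p$) to payoffs arising from the others’ play.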
IV-B Soft-max Choice Map and Passivity of $\Sigma$
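Let
$$\mathrm{lse}(z) := \epsilon \log \Big( \sum_{j=1}^{n} \exp(z_j / \epsilon) \Big), \quad \epsilon > 0, \tag{20}$$
denote the log-sum-exp function, ($C^2$ and convex by [22, p. 72]).
Since the following properties are valid for $\sigma^p$, (12), for all $p \in \mathcal{P}$, without loss of generality, we consider a single player (drop the superscript) and denote $\sigma^p = \sigma$ and $z^p = z$. Note that $\sigma$ is not injective and $\sigma(z + c \mathbf{1}) = \sigma(z)$, for all $z \in \mathbb{R}^n$, $c \in \mathbb{R}$, where $\mathbf{1} = [1, \dots, 1]^\top$.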
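Proposition 2. The soft-max function $\sigma$ satisfies the following properties:
(i) $\sigma$ is the gradient of $\mathrm{lse}$, that is, $\sigma(z) = \nabla \mathrm{lse}(z)$.
(ii) $\sigma$ is monotone, i.e., $(\sigma(z) - \sigma(z'))^\top (z - z') \geq 0$, $\forall z, z' \in \mathbb{R}^n$.
(iii) $\sigma$ is $\frac{1}{\epsilon}$-Lipschitz, that is, $\|\sigma(z) - \sigma(z')\|_2 \leq \frac{1}{\epsilon} \|z - z'\|_2$, $\forall z, z' \in \mathbb{R}^n$,
where $\epsilon > 0$ is the temperature constant.
(iv) $\sigma$ is $\epsilon$-cocoercive, that is, $(\sigma(z) - \sigma(z'))^\top (z - z') \geq \epsilon \|\sigma(z) - \sigma(z')\|_2^2$, $\forall z, z' \in \mathbb{R}^n$.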
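Proof. (i) Evaluating the partial derivative of $\mathrm{lse}$ in (20) at each component of $z$ yields $\frac{\partial\, \mathrm{lse}(z)}{\partial z_i} = \frac{\exp(z_i / \epsilon)}{\sum_{j} \exp(z_j / \epsilon)} = \sigma_i(z)$, for all $i$, hence $\nabla \mathrm{lse}(z) = \sigma(z)$.
(ii) The log-sum-exp function is $C^2$ and convex, based on positive semi-definiteness of its Hessian [22, p. 74]. Together with (i), this implies that $\sigma$ is monotone.
(iii) For the Hessian of $\mathrm{lse}$, $\nabla^2 \mathrm{lse}(z) = \frac{1}{\epsilon} \left( \mathrm{diag}(\sigma(z)) - \sigma(z) \sigma(z)^\top \right)$, it can be shown ([22, p. 74]) that $v^\top \nabla^2 \mathrm{lse}(z)\, v \leq \frac{1}{\epsilon} \|v\|_2^2$
for all $z, v \in \mathbb{R}^n$, hence $\lambda_{\max}(\nabla^2 \mathrm{lse}(z)) \leq \frac{1}{\epsilon}$. Since $\nabla^2 \mathrm{lse}(z) = \nabla \sigma(z)$ is symmetric and positive semidefinite, this implies that $\|\sigma(z) - \sigma(z')\|_2 \leq \frac{1}{\epsilon} \|z - z'\|_2$.
(iv) Since $\mathrm{lse}$ is convex and its gradient $\sigma$ is $\frac{1}{\epsilon}$-Lipschitz, $\epsilon$-cocoercivity of $\sigma$ follows from the Baillon-Haddad theorem. ∎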
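By Proposition 2(iv), $\sigma^p$ is $\epsilon$-cocoercive for all $p \in \mathcal{P}$, therefore, the stacked map $\sigma = (\sigma^p)_{p \in \mathcal{P}}$ is $\epsilon$-cocoercive,
$$(\sigma(z) - \sigma(z'))^\top (z - z') \geq \epsilon \|\sigma(z) - \sigma(z')\|_2^2, \quad \forall z, z' \in \mathbb{R}^n, \tag{21}$$
or, equivalently, output-strictly EIP (OSEIP).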
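The monotonicity, Lipschitz and cocoercivity bounds stated for the soft-max can be spot-checked numerically. The sketch below is a sanity check on random pairs of score vectors, not a proof; the function names, the sampling range, the dimension and the tolerance are all assumptions chosen for illustration.

```python
import math
import random


def softmax(z, eps):
    m = max(z)  # shift for numerical stability
    w = [math.exp((zi - m) / eps) for zi in z]
    s = sum(w)
    return [wi / s for wi in w]


def check_softmax_bounds(eps, n=4, trials=1000, seed=0, tol=1e-12):
    """Test monotonicity, (1/eps)-Lipschitz and eps-cocoercivity on random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        z = [rng.uniform(-5.0, 5.0) for _ in range(n)]
        w = [rng.uniform(-5.0, 5.0) for _ in range(n)]
        ds = [a - b for a, b in zip(softmax(z, eps), softmax(w, eps))]
        dz = [a - b for a, b in zip(z, w)]
        inner = sum(a * b for a, b in zip(ds, dz))  # (sigma(z)-sigma(w))^T (z-w)
        ns = sum(a * a for a in ds)                 # squared norm of output gap
        nz = sum(a * a for a in dz)                 # squared norm of input gap
        if inner < -tol:
            return False  # monotonicity violated
        if math.sqrt(ns) > math.sqrt(nz) / eps + tol:
            return False  # Lipschitz bound violated
        if inner < eps * ns - tol:
            return False  # cocoercivity bound violated
    return True


for eps in (0.5, 1.0, 2.0):
    print(eps, check_softmax_bounds(eps))
```

A larger temperature gives a smaller Lipschitz constant and a larger cocoercivity constant, which is the excess of passivity that is traded against hypo-monotonicity of the game.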
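Next, we characterize $\Sigma$, (16).
System (16) is output-strictly EIP (OSEIP).
Proof. At an equilibrium condition of (16), $\bar{z} = \bar{u}$ and $\bar{x} = \sigma(\bar{z})$. Consider the following candidate storage function,
$$V_{\bar{z}}(z) = \frac{1}{\gamma} \sum_{p \in \mathcal{P}} \left[ \mathrm{lse}(z^p) - \mathrm{lse}(\bar{z}^p) - \sigma^p(\bar{z}^p)^\top (z^p - \bar{z}^p) \right],$$
which is the Bregman divergence of the log-sum-exp function, hence positive semi-definite by convexity, with $V_{\bar{z}}(\bar{z}) = 0$. Along solutions of (16), with $\dot{z} = \gamma (u - z)$ and $\bar{z} = \bar{u}$,
$$\dot{V}_{\bar{z}}(z) = (\sigma(z) - \sigma(\bar{z}))^\top (u - z) = (x - \bar{x})^\top (u - \bar{u}) - (\sigma(z) - \sigma(\bar{z}))^\top (z - \bar{z}).$$
Using (21) in the foregoing yields
$$\dot{V}_{\bar{z}}(z) \leq (x - \bar{x})^\top (u - \bar{u}) - \epsilon \|x - \bar{x}\|_2^2,$$
which is (3) with $\rho = \epsilon$.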
IV-C Convergence Analysis
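Next, based on the representation in Figure 1 (feedback interconnection between $\Sigma$, (16) and the static game map $U$ on the feedback path), we analyze the asymptotic behaviour of EXP-D-RL (15) by leveraging passivity properties of $\Sigma$.
First, solutions of (15) remain bounded, (see also proof of Lemma 3.2, ). To see this, note that since $U$ is continuous, and $\sigma(z) \in \Delta$ with $\Delta$ compact, $U(\sigma(z))$ is bounded, i.e., there is some $M > 0$ such that $\|U(\sigma(z))\|_2 \leq M$ for any $z \in \mathbb{R}^n$. From the integral form of EXP-D-RL, we can write for all $t \geq 0$, $\|z(t)\|_2 \leq e^{-\gamma t} \|z(0)\|_2 + M (1 - e^{-\gamma t})$, hence $\|z(t)\|_2 \leq \max\{\|z(0)\|_2, M\}$, $\forall t \geq 0$. Also, $\Omega = \{z \in \mathbb{R}^n : \|z\|_2 \leq M\}$ is a compact, positively invariant set.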
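We make the following assumption on the payoff $U$.
(i) The negative payoff, $-U$, is monotone, $(x - x')^\top (U(x) - U(x')) \leq 0$, $\forall x, x' \in \Delta$.
(ii) The negative payoff, $-U$, is $\mu$-hypo-monotone, i.e., $(x - x')^\top (U(x) - U(x')) \leq \mu \|x - x'\|_2^2$, $\forall x, x' \in \Delta$, for some $\mu > 0$.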
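The monotonicity condition on the negative payoff can be probed numerically for two-player games by sampling pairs of mixed-strategy profiles. The sketch below is only a sampled estimate: the helper names (`random_simplex_point`, `payoff_map`, `monotonicity_gap`), the sampling scheme and the example games are assumptions for illustration. It returns the largest observed ratio between the inner product in the assumption and the squared distance between profiles. A value at or below zero (up to rounding) is consistent with a monotone negative payoff, while a positive value is a lower bound on any admissible hypo-monotonicity constant.

```python
import random


def random_simplex_point(n, rng):
    # Normalised positive weights give a point in the interior of the simplex.
    w = [rng.random() + 1e-12 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]


def payoff_map(A, B, x1, x2):
    """Stacked payoff vector of a two-player game with payoff matrices A and B."""
    U1 = [sum(A[i][j] * x2[j] for j in range(len(x2))) for i in range(len(x1))]
    U2 = [sum(B[i][j] * x1[i] for i in range(len(x1))) for j in range(len(x2))]
    return U1 + U2


def monotonicity_gap(A, B, trials=2000, seed=0):
    """Largest sampled value of (x - y)^T (U(x) - U(y)) / ||x - y||^2."""
    rng = random.Random(seed)
    n1, n2 = len(A), len(A[0])
    worst = float("-inf")
    for _ in range(trials):
        x1, x2 = random_simplex_point(n1, rng), random_simplex_point(n2, rng)
        y1, y2 = random_simplex_point(n1, rng), random_simplex_point(n2, rng)
        dx = [a - b for a, b in zip(x1 + x2, y1 + y2)]
        dU = [a - b for a, b in zip(payoff_map(A, B, x1, x2),
                                    payoff_map(A, B, y1, y2))]
        norm2 = sum(d * d for d in dx)
        if norm2 > 1e-12:  # skip numerically identical profiles
            worst = max(worst, sum(a * b for a, b in zip(dx, dU)) / norm2)
    return worst


# Standard RPS (zero-sum) and a 2x2 coordination game as contrasting examples.
A_rps = [[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]
B_rps = [[-a for a in row] for row in A_rps]
I2 = [[1.0, 0.0], [0.0, 1.0]]

print(monotonicity_gap(A_rps, B_rps))
print(monotonicity_gap(I2, I2))
```

For standard RPS the inner product cancels identically, so the estimate is zero up to rounding; for the coordination game it is strictly positive, so that game fails condition (i).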