On the Convergence of Competitive, Multi-Agent Gradient-Based Learning

04/16/2018 ∙ by Eric Mazumdar, et al. ∙ UC Berkeley ∙ University of Washington

As learning algorithms are increasingly deployed in markets and other competitive environments, understanding their dynamics is becoming increasingly important. We study the limiting behavior of competitive agents employing gradient-based learning algorithms. Specifically, we introduce a general framework for competitive gradient-based learning that encompasses a wide breadth of learning algorithms including policy gradient reinforcement learning, gradient-based bandits, and certain online convex optimization algorithms. We show that unlike the single-agent case, gradient learning schemes in competitive settings do not necessarily correspond to gradient flows and, hence, it is possible for limiting behaviors like periodic orbits to exist. We introduce a new class of games, Morse-Smale games, that correspond to gradient-like flows. We provide guarantees that competitive gradient-based learning algorithms (both in the full information and gradient-free settings) avoid linearly unstable critical points (i.e. strict saddle points and unstable limit cycles). Since generic local Nash equilibria are not unstable critical points—that is, in a formal mathematical sense, almost all Nash equilibria are not strict saddles—these results imply that gradient-based learning almost surely does not get stuck at critical points that do not correspond to Nash equilibria. For Morse-Smale games, we show that competitive gradient learning converges to linearly stable cycles (which include stable Nash equilibria) almost surely. Finally, we specialize these results to commonly used multi-agent learning algorithms and provide illustrative examples that demonstrate the wide range of limiting behaviors competitive gradient learning exhibits.


1 Introduction

As machine learning algorithms are increasingly being deployed in real-world settings, understanding how they interact and the dynamics that can arise from their interactions is becoming crucial. In recent years, there has been a resurgence in research efforts on multi-agent learning and learning in games. Indeed, game-theoretic tools are even being used to robustify and improve the performance of machine learning algorithms (see, e.g., [21]). Despite this activity, there is still a lack of understanding of the dynamics and limiting behaviors of machine learning algorithms in competitive settings, and in games in particular.

Concurrently, there has been no shortage of papers on the convergence of gradient descent and its avoidance of saddle points (see, e.g., [19, 28, 45, 17]). Due to their versatility and ease of implementation, gradient descent and gradient-descent-based algorithms are extremely popular in a variety of machine learning and algorithmic decision-making problems. The advantages of gradient-based methods have led to them being widely adopted in multi-agent and adversarial learning problems [15, 24, 25, 21, 12, 34]. However, a thorough understanding of their convergence and limiting behaviors is still lacking.

Inspired by recent works that focus on the single-agent case, such as [30, 39], and by long-standing work in dynamical systems theory, including the convergence of stochastic approximation [3, 4, 7, 8] and urn processes [6, 40], we investigate the convergence of competitive gradient-based learning to Nash equilibria and other limiting behaviors. Specifically, we are interested in settings where there are two or more competing agents in potentially uncertain environments. Each agent optimizes their individual objective, which depends on the decisions of all the other agents and possibly an external environmental signal. This scenario can be modeled most naturally as a game.

It is common in these settings to consider agents that adopt a learning algorithm for determining a strategy (policy) that governs their decisions. Many different types of learning algorithms have been proposed in the literature, several of which have their origins in the single-agent case. We consider the case where the agents adopt a gradient-based learning algorithm, one of the more common approaches in a number of different domains. In fact, in support of the latter point, we show that a wide variety of learning algorithms from different fields fit into the gradient learning framework we analyze.

We remark that there is a fundamental difference between the dynamics that are analyzed in much of the single-agent gradient-based learning and optimization literature and the ones we analyze in the competitive multi-agent case. As we show in the following sections, the combined dynamics of gradient-based learning schemes in games do not necessarily correspond to a gradient flow. This may seem a subtle point, but it turns out to be extremely important. Gradient flows are a very narrow class of flows admitting nice convergence guarantees—e.g., almost sure convergence to local minimizers—due to the fact that they preclude flows with the worst geometries [41]. In particular, they do not exhibit non-equilibrium limiting behavior such as periodic orbits. Gradient-based learning in games, on the other hand, does not preclude such behavior. This makes the analysis more challenging.

Given the prominence of gradient-based learning schemes in multi-agent reinforcement learning, online optimization, and other machine learning contexts where game-theoretic ideas are being employed, it is important to understand and be able to interpret the limiting behavior of these coupled algorithms. Recent works have noted that limit cycles emerge in gradient-based learning algorithms. For instance, in [12], it is demonstrated that limit cycles abound in gradient descent for training generative adversarial networks (GANs). Other very recent works have explored the existence of cycles in adversarial learning when the problem is considered in a game-theoretic context [35], thereby highlighting the importance of understanding limiting behaviors of competitive learning algorithms other than equilibria. Dynamical systems exhibiting periodic orbits and other limiting behaviors have long been studied, and we borrow tools from dynamical systems theory in order to characterize the limiting behavior of competitive gradient-based learning.

1.1 Contributions

The high-level contributions of this paper are two-fold. We first provide guarantees that competitive gradient-based learning algorithms (both when the agents have oracle access to their gradients and in the stochastic gradient setting) avoid linearly unstable critical points (i.e. strict saddle points) of the dynamics. This has positive implications for zero-sum games, where such saddle points are generically not local Nash equilibria. For gradient-based learning in potential games and general-sum games, however, this is a strongly negative result. Indeed, this result implies that gradient-based learning algorithms will almost surely avoid a subset of the Nash equilibria of the game. This is a particularly interesting observation for potential games, which are ostensibly the nicest games one could hope to face when employing gradient-based learning, as they admit a transformation of coordinates under which the agents can be viewed as optimizing a single objective function. Unlike in single-agent gradient descent, however, saddle points can be critical points of interest—i.e. Nash equilibria.

Secondly, by viewing gradient-based learning in games through the lens of dynamical systems theory, we highlight many of the problems plaguing such algorithms in practice. Specifically, we show that the dynamics formed from the individual agents' gradients of their own costs are not, in general, gradient flows and thus, competitive gradient-based learning may converge to periodic orbits. Further, we highlight the existence of non-Nash locally asymptotically stable equilibria of these dynamics. Such equilibria arise in both zero-sum and general-sum games and can be seen as artifacts of the choice of algorithm, and their relevance to the underlying game is unknown. Hence, some care needs to be taken regarding the implementation and the interpretation of the limit points of gradient-based learning algorithms in competitive settings. We provide examples in Appendix A demonstrating various limiting behaviors of gradient-based learning algorithms in competitive settings.

Concretely, the contributions of this paper are summarized as follows:


  • A characterization of the limiting behaviors of competitive gradient-based learning when all players either have access to their gradient, or to an unbiased estimate of their gradient. This is done by leveraging dynamical systems theory and the theory of stochastic approximations.

  • A new class of games, namely Morse-Smale games, for which the dynamics correspond to a Morse-Smale vector field. There are a couple of points to make here on the significance of this class of games and the results we have for them. First, it is well-known that Morse-Smale vector fields are generic [26, 38]—that is, almost all smooth vector fields are Morse-Smale. Morse-Smale vector fields (on compact manifolds) also have a finite number of critical points; hence, our results imply that almost all games on compact smooth manifolds admit gradient-like flows with a finite number of critical points (candidate Nash equilibria). Further, linearly unstable cycles (which include some Nash equilibria) are almost surely avoided under the gradient-like flow.

  • A general framework for modeling competitive gradient-based learning that applies to a broad swath of learning algorithms. Gradient-based learning is a favored approach to multi-agent reinforcement learning, multi-armed bandits, adversarial learning, and multi-agent online learning and optimization. Such algorithms are increasingly being employed in these domains without a solid understanding of how to formally interpret the results. We provide a general framework for analyzing competitive gradient-based learning and a taxonomy of limiting behaviors that allows us to apply our theoretical results to commonly used learning algorithms.

  • Illustrative examples highlighting the frequency of non-Nash critical points, and Nash equilibria that are saddle points of the gradient dynamics.

1.2 Organization

The remainder of this paper is organized as follows. In Section 2, we introduce our framework for analyzing competitive gradient-based learning algorithms, as well as some mathematical and game-theoretic preliminaries. In Section 3, we provide a brief taxonomy by drawing connections between the limiting behavior of gradient-based learning algorithms and game-theoretic and dynamical systems notions of equilibria. In Section 4, we present our main theoretical results for competitive gradient-based learning in both the deterministic (where agents have oracle access to their gradients at each iteration) and stochastic (where agents have an unbiased estimator of their gradient at each iteration) settings. In the second case, we include a high-level overview of the categories of commonly used learning algorithms that fit into the framework we consider. We present empirical evidence showing that gradient-play will avoid Nash equilibria in a potentially large subset of linear quadratic (LQ) games in Section 5. We conclude with a discussion of the results and comments on future directions in Section 6. In Appendix A, we specialize our results to a number of very popular multi-agent learning algorithms and provide several illustrative examples that highlight the different kinds of limiting behavior that gradient-based learning admits.

2 Preliminaries

Consider agents, indexed by . Each agent has their own decision variable , where is their finite-dimensional strategy space of dimension . Define to be the finite-dimensional joint strategy space with dimension . Each agent is endowed with a cost function , such that where we use the notation to make the dependence on the action of the agent , and the actions of all agents excluding agent , explicit. Each agent seeks to minimize their own cost, but only has control over their own decision variable . In this competitive setting, agents’ costs are not necessarily aligned with one another.

Given the game , our focus is on settings in which agents employ gradient-based learning algorithms in the search for an equilibrium. In particular, agents are assumed to update their strategies simultaneously according to a gradient-based learning algorithm of the form

(1)

where agents either have oracle access to the gradient of their cost with respect to their own choice variable—i.e.  where denotes the derivative of with respect to —or they have an unbiased estimator for their gradient—i.e.  where is a zero-mean, finite-variance stochastic process. We refer to the former setting as deterministic gradient-based learning and to the latter setting as stochastic gradient-based learning.
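To make the update concrete, the following minimal Python sketch simulates a simultaneous update of the form (1) for a simple two-player game; the particular costs, step sizes, and noise model are illustrative assumptions rather than the paper's notation.

# Minimal sketch of simultaneous gradient-based learning (cf. (1)).
# The quadratic costs, step sizes, and noise model below are illustrative assumptions.
import numpy as np

def grad_f1(x):  # agent 1's gradient of its own cost with respect to x[0]
    return x[0] + 0.5 * x[1]

def grad_f2(x):  # agent 2's gradient of its own cost with respect to x[1]
    return x[1] - 0.5 * x[0]

def step(x, gammas, stochastic=False, rng=None):
    """One simultaneous update: x_i <- x_i - gamma_i * g_i(x)."""
    g = np.array([grad_f1(x), grad_f2(x)])
    if stochastic:  # unbiased estimate: exact gradient plus zero-mean noise
        g = g + rng.normal(scale=0.1, size=g.shape)
    return x - gammas * g

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
gammas = np.array([0.05, 0.05])  # per-agent learning rates
for t in range(1000):
    x = step(x, gammas, stochastic=True, rng=rng)
print(x)  # approaches the unique critical point at the origin, a stable differential Nash

In the deterministic setting the oracle returns the exact partial gradient; in the stochastic setting it returns the gradient corrupted by zero-mean, finite-variance noise, which is all the analysis below requires.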

Assuming that agents are employing an algorithm such as (1), our goal is to analyze the stationary behavior of these coupled algorithms leveraging the following game-theoretic notion of a Nash equilibrium. A strategy is a local Nash equilibrium for the game if for each there exists an open set such that and for all . If the above inequalities are strict, then we say is a strict local Nash equilibrium. If for each , then is a global Nash equilibrium.

Another important and useful characterization of Nash leverages first- and second-order conditions on player cost functions [42, 43]. A point is said to be a differential Nash equilibrium for the game if and for each . Define

(2)

to be the vector of player derivatives of their own cost functions with respect to their own choice variables. A point is said to be a critical point for the game if . Note that and are necessary conditions for a point to be a local Nash equilibrium [43]. Hence, all local Nash equilibria must be critical points.

Differential Nash need not be isolated, as the simple illustrative example in [42] shows. However, for a differential Nash , if is non-degenerate—i.e. —then is an isolated strict local Nash equilibrium.
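Since the inline formulas above did not survive extraction, the conditions can be summarized as follows; the notation here is ours, chosen to be consistent with the definitions in [42, 43].

\[
\omega(x) = \big(D_1 f_1(x), \dots, D_n f_n(x)\big), \qquad \text{critical point: } \omega(x^\ast) = 0,
\]
\[
\text{differential Nash: } D_i f_i(x^\ast) = 0 \ \text{and} \ D_i^2 f_i(x^\ast) \succ 0 \ \text{for each } i, \qquad \text{non-degenerate: } \det D\omega(x^\ast) \neq 0,
\]
\[
\text{necessary conditions for local Nash: } D_i f_i(x^\ast) = 0 \ \text{and} \ D_i^2 f_i(x^\ast) \succeq 0 \ \text{for each } i.
\]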

3 Links Between Dynamical Systems and Game-Theoretic Notions of Equilibria

Before continuing, we remark that non-degenerate differential Nash equilibria are structurally stable [43]. Structural stability ensures that equilibria are stable and persist under small perturbations. We define stability of non-degenerate differential Nash equilibria as follows [43]. A differential Nash equilibrium is stable if the spectrum of is in the open right-half plane—i.e. . If agents initialize in a neighborhood of a stable differential Nash equilibrium and follow the flow defined by , then they will asymptotically converge to . Specifically, if the spectrum of is strictly in the right-half plane, then the differential Nash equilibrium is locally (exponentially) attracting under the flow of  [43, Proposition 2]. This, in turn, implies that a discretized version of , namely

(3)

converges locally for appropriately selected step size . Such results motivate the study of the continuous time dynamical system in order to understand convergence properties of gradient-based learning algorithms of the form (1).

Along this line of thinking, let us draw a few more links between equilibria of and characterizations of local Nash equilibria. To do so, we characterize the critical points of by their properties under the flow of .

A point is a locally asymptotically stable equilibrium of the continuous time dynamics if and for all . Let for denote the eigenvalues of at where —that is, is the eigenvalue with the smallest real part. A point is a saddle point of the dynamics if and is such that . A saddle point such that for and for with and is a strict saddle point or linearly unstable critical point—we use these terms interchangeably—of the continuous time dynamics .
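These eigenvalue conditions are straightforward to check numerically. The sketch below assumes the convention used above, namely that the dynamics are the flow of minus the vector of individual gradients, so a hyperbolic critical point is locally asymptotically stable when the spectrum of the game Jacobian lies in the open right-half plane and is a strict saddle when that spectrum has eigenvalues with real parts of both signs.

# Classify a hyperbolic critical point from the eigenvalues of the game Jacobian
# J = Domega(x*). Sign convention assumed: the dynamics are xdot = -omega(x), so
# all eigenvalues of J in the open right-half plane means locally asymptotically stable.
import numpy as np

def classify_critical_point(J, tol=1e-9):
    re = np.linalg.eigvals(J).real
    if np.any(np.abs(re) <= tol):
        return "non-hyperbolic (an eigenvalue has zero real part)"
    if np.all(re > 0):
        return "locally asymptotically stable"
    if np.all(re < 0):
        return "strictly unstable (repelling)"
    return "strict saddle (linearly unstable)"

# Example used again below: J = [[1, 3], [3, 1]] has eigenvalues 4 and -2.
print(classify_critical_point(np.array([[1.0, 3.0], [3.0, 1.0]])))  # strict saddle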

We now present a few preliminary propositions that highlight the links between critical points under the flow , and those critical points that have a particular game theoretic relevance.

A non-degenerate differential Nash equilibrium is either a locally asymptotically stable equilibrium or a strict saddle point of . Suppose that is a non-degenerate differential Nash equilibrium. We claim that . Since is a differential Nash equilibrium, for each ; these are the diagonal blocks of . Further implies that . Since , . Thus, it is not possible for all the eigenvalues to have negative real part. Since is non-degenerate, so that none of the eigenvalues can have zero real part. Hence, at least one eigenvalue has strictly positive real part.

To finish, we show that the conditions for non-degenerate differential Nash equilibrium are not sufficient to guarantee that is locally asymptotically stable for the gradient dynamics—that is, not all eigenvalues of have strictly positive real part. We do this by constructing a class of games with the strict saddle point property. Consider a class of two player games on such that has the form

(4)

with . If is a non-degenerate differential Nash equilibrium, and which implies that . Choosing such that will guarantee that one of the eigenvalues of is negative and the other is positive, making it a strict saddle point. This shows that non-degenerate differential Nash equilibria can be strict saddle points of the combined gradient dynamics.

Hence, for any game , a non-degenerate differential Nash equilibrium is either a locally asymptotically stable equilibrium or a strict saddle point, but it is neither strictly unstable nor marginally stable (i.e. having all eigenvalues on the imaginary axis).

Another important point to make is that not every locally asymptotically stable equilibrium of is a non-degenerate differential Nash equilibrium. Indeed, the following example provides an entire class of games whose corresponding dynamics admit locally asymptotically stable equilibria that are not even local Nash equilibria. Consider the same class of games presented in the proof of Proposition 3. That is, two-player games on with as in (4). Suppose that is a critical point such that , or —i.e.  is not Nash since it violates the necessary conditions for a local Nash equilibrium. Then as long as , . Thus, the set of locally asymptotically stable equilibria that are not Nash equilibria may be arbitrarily large.
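The two phenomena just described are easy to see in a concrete quadratic instance of the class of games used above; since the displayed costs in (4) were not reproduced, the particular costs below are assumptions chosen to match the description.

# Assumed quadratic two-player games:
#   f1(x) = 0.5*a*x1**2 + b*x1*x2,   f2(x) = 0.5*d*x2**2 + c*x1*x2,
# so omega(x) = (a*x1 + b*x2, c*x1 + d*x2) and Domega = [[a, b], [c, d]] everywhere.
import numpy as np

def report(name, a, b, c, d):
    J = np.array([[a, b], [c, d]], dtype=float)
    re = np.linalg.eigvals(J).real
    is_diff_nash = a > 0 and d > 0    # D_i^2 f_i(0) > 0 for both agents at the origin
    is_stable = bool(np.all(re > 0))  # spectrum of Domega in the open right-half plane
    print(name, "differential Nash:", is_diff_nash,
          "stable for xdot = -omega(x):", is_stable, "Re(eigs):", np.round(re, 3))

# (i) A differential Nash equilibrium that is a strict saddle of the dynamics.
report("(i)", a=1.0, b=3.0, c=3.0, d=1.0)
# (ii) A critical point that is locally asymptotically stable but not a Nash equilibrium.
report("(ii)", a=-1.0, b=2.0, c=-2.0, d=3.0)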

We now momentarily constrain ourselves to 2-player zero-sum games. Such games arise when training GANs, in adversarial learning, and in multi-agent reinforcement learning [22, 10, 37].

For an arbitrary two-player zero-sum game, on , if is a differential Nash equilibrium, then is both a non-degenerate differential Nash equilibrium and a locally asymptotically stable equilibrium of .

Consider a two-player game on with . For such a game,

Note that . Suppose that is a differential Nash equilibrium and let with and . Then,

where the second line follows since and for , a differential Nash equilibrium. Since is arbitrary, this implies that is positive definite and hence, clearly non-degenerate. Thus, for two-player zero-sum games, all differential Nash equilibria are both non-degenerate differential Nash equilibria and locally asymptotically stable equilibria of

The preceding proposition shows that all non-degenerate differential Nash equilibria in two-player zero-sum games are locally asymptotically stable equilibria under the flow of . This has been shown before as a consequence of the results in [42]. However, we again remark that the converse is not true. Not every locally asymptotically stable equilibrium in a two-player zero-sum game is a non-degenerate differential Nash equilibrium. Indeed, there may be many locally asymptotically stable equilibria in a zero-sum game that are not local Nash equilibria. The following example highlights this fact. Consider a two-player game with and of the form

where . If and , then has eigenvalues with strictly negative real part. Thus there exists a continuum of zero-sum games with a large set of locally asymptotically stable equilibria of the corresponding dynamics that are not differential Nash.

We now briefly focus on a particularly nice set of games known as potential games [36]. These are games for which corresponds to a gradient flow under a coordinate transformation—that is, there exists a function such that for each , for all . Note that a necessary and sufficient condition for to be a potential game is that is symmetric [36]—that is, . This gives potential games the desirable property that the only locally asymptotically stable equilibria of the gradient dynamics are local Nash equilibria.

For an arbitrary potential game, on , if is a locally asymptotically stable equilibrium of then is a non-degenerate differential Nash equilibrium.

The proof follows from the definition of a potential game. Since is a potential game, it admits a potential function such that for all . This, in turn, implies that at a locally asymptotically stable equilibrium of , , where is the Hessian matrix of the function . Further must have strictly positive eigenvalues for to be a locally asymptotically stable equilibrium of . Since the Hessian matrix of a function must be symmetric, , must be positive definite, which through Sylvester’s criterion ensures that each of the diagonal blocks of is positive definite. Thus, we have that the existence of a potential function guarantees that the only locally asymptotically stable equilibria of , are differential Nash equilibria.

The preceding proposition rules out non-Nash locally asymptotically stable equilibria of the gradient dynamics in potential games. However, the following example shows that the existence of a potential function is not enough to rule out local Nash equilibria that are saddle points of the dynamics.

Consider a two-player potential game with . At a differential Nash equilibrium, has the form:

where , and . If , has one positive eigenvalue and one negative eigenvalue. Thus there exists a continuum of potential games with a large set of differential Nash equilibria that are strict saddle points of the corresponding dynamics.
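A concrete assumed instance (not the paper's) makes the point: take the identical-interest potential game generated by a quadratic potential whose diagonal second derivatives are positive, so the origin is a differential Nash equilibrium, but whose cross term is large enough that the full Hessian (which equals the game Jacobian here) is indefinite.

# Assumed potential game: f1 = f2 = phi with phi(x) = 0.5*x1**2 + 0.5*x2**2 + c*x1*x2.
# Each agent's own second derivative at the origin is 1 > 0 (a differential Nash),
# but for |c| > 1 the Hessian of phi is indefinite, so the origin is a strict saddle
# of the joint gradient dynamics xdot = -grad(phi)(x).
import numpy as np

c = 2.0
H = np.array([[1.0, c], [c, 1.0]])  # Hessian of phi, which is Domega for this game
print(np.linalg.eigvalsh(H))        # eigenvalues 1 - c and 1 + c: one negative, one positive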

We finish our mathematical preliminaries with a note on the relationship between non-degenerate differential Nash equilibria and local Nash equilibria. It turns out that non-degenerate differential Nash equilibria are generic among local Nash equilibria. Hence, by proving statements about convergence to non-degenerate differential Nash equilibria, we are able to make statements about convergence to local Nash equilibria for almost all games in a formal mathematical sense. The following theorem first appeared in [44] for the two-player case, and while the extension to the -player case is straightforward, we provide the proof in Appendix B for completeness. Non-degenerate differential Nash equilibria are generic among local Nash equilibria: for any smooth boundaryless manifolds there exists an open-dense subset with such that for all , if is a local Nash equilibrium for , then is a non-degenerate differential Nash equilibrium for .

Genericity implies that local Nash equilibria in an open-dense set of continuous games (in the topology on agent costs) are non-degenerate differential Nash equilibria. Thus, for almost all games, the set of local Nash equilibria coincides exactly with the set of non-degenerate differential Nash equilibria. This also implies that saddle points for the dynamics induced by the flow of that are local Nash equilibria are also generically strict saddle points.

For a game , denote the set of strict saddle points and the set of locally asymptotically stable equilibria of the corresponding dynamics as and respectively. Similarly, denote the set of local Nash equilibria, differential Nash equilibria, and non-degenerate differential Nash equilibria of as , , and , respectively. Combining the comments on genericity with the observations on stability gives us the following key takeaways:


  • If is a generic -player general-sum game, then

  • If is a generic 2-player zero-sum game, then

  • If is a generic -player potential game, then

The inclusions are strict due to the existence of non-Nash locally asymptotically stable equilibria in both the general-sum and zero-sum cases.

4 Main Theoretical Results

In this section, we provide the main theoretical results and differentiate those results from existing work. We also include a high-level overview of well-known algorithms that fit into the class of learning algorithms we consider—and hence, to which our theory applies.

4.1 Deterministic Competitive Gradient-Based Learning

Let us first consider the deterministic setting in which agents have oracle access to their gradients at each time step. This setting encapsulates the case where agents know their own cost functions and observe their own actions as well as their competitors' actions—and hence, can compute the gradient of their cost with respect to their own choice variable—as well as the setting where agents do not necessarily know their cost or observe their competitors' actions, but rather some external oracle provides to them at each iteration .

Each agent has their own learning rate (i.e. step sizes ) so that the joint dynamics of all the players are given by

(5)

where and, by a slight abuse of notation, is defined to be elementwise multiplication of and where is multiplied by the first components of , is multiplied by the next components, and so on. We make the following assumptions on the cost functions and learning rates .

Assumption 1

For each , with , , and .

Note that the norm is the induced -norm. We rewrite the game dynamics in the following form

(6)

where and element-wise.

Gradient-based optimization schemes always correspond to gradient flows whereas games are not afforded this luxury—indeed, is not in general symmetric—which makes them a very interesting class of problems to study. In particular, the vector field defined by is not necessarily a vector field defined by the gradient of a function, and thus, the dynamics admit limit cycles amongst their periodic orbits. This distinguishes gradient-based optimization from gradient-based learning in a multi-agent or game-theoretic setting. The following example is of a class of games admitting a variety of periodic orbits.

Consider the game on with defined by

(7)

where agents are minimizers and is a parameter. Then,

Transforming the dynamics to radial coordinates, and , it is easy to see that there is a periodic orbit on a circle with unit radius for any . Moreover, the periodic orbit is a stable limit cycle for and an unstable limit cycle if . When , on the other hand, there are an infinite number of periodic orbits and no limit cycles. Moreover, when , is a local, stable Nash equilibrium (and a locally stable critical point for the dynamics).
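The costs in (7) were not reproduced above, so the simulation below uses one pair of costs consistent with the description (an assumption on our part): their gradient-play dynamics reduce, in polar coordinates, to rdot = a*r*(1 - r^2) and thetadot = 1, so the unit circle is a stable limit cycle for a > 0 and an unstable one for a < 0.

# Simultaneous gradient play for an assumed instance of the game in (7):
#   f1(x) =  x1*x2 - (a/2)*x1**2 + (a/4)*x1**4 + (a/2)*x1**2*x2**2
#   f2(x) = -x1*x2 - (a/2)*x2**2 + (a/4)*x2**4 + (a/2)*x1**2*x2**2
# For these costs xdot = -omega(x) gives rdot = a*r*(1 - r^2), thetadot = 1, so
# trajectories spiral onto the unit circle (a stable limit cycle) when a > 0.
import numpy as np

def omega(x, a):
    x1, x2 = x
    r2 = x1**2 + x2**2
    return np.array([x2 - a * x1 * (1.0 - r2),    # D_1 f_1(x)
                     -x1 - a * x2 * (1.0 - r2)])  # D_2 f_2(x)

a, gamma = 0.5, 1e-3
x = np.array([0.1, 0.0])             # initialize well inside the unit circle
for t in range(200_000):
    x = x - gamma * omega(x, a)      # discretized gradient play
print(np.linalg.norm(x))             # approximately 1.0: the limit cycle, not a Nash equilibrium

No matter how small the step size, trajectories started off the circle do not settle at a critical point; they converge to the cycle itself, which is exactly the kind of non-equilibrium limiting behavior a single-agent gradient flow cannot exhibit.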

Having shown that gradient-based learning can exhibit limit cycles, the question remains: what are the limiting behaviors of gradient-based learning in competitive environments? The following result states that the set of initial conditions leading to linearly unstable equilibria is of measure zero. Let and satisfy Assumption 1. Suppose that is open and convex. If , the set of initial conditions from which competitive gradient-based learning converges to linearly unstable critical points (strict saddle points) is of measure zero.

First, we note that the above theorem holds for , in particular, since holds trivially in this case. It is also important to note that differential Nash equilibria can be linearly unstable critical points—that is, they can be strict saddle points of the dynamics—and due to Theorem 3, generically, so can local Nash equilibria. The above theorem says that all local Nash equilibria that are linearly unstable critical points for the discretization of are avoided almost surely.

The proof of Theorem 1 relies on the stable manifold theorem [46, Theorem III.7], [47]. We provide its statement in Theorem C in Appendix C. Some parts of the proof follow similar arguments to the proofs of results in [30, 39] which apply to (single-agent) gradient-based optimization. Due to the different learning rates employed by the agents and the introduction of the differential game form , the proof differs.

[Proof of Theorem 1] We claim the mapping is a diffeomorphism. If we can show that is invertible and a local diffeomorphism, then the claim follows. Let us first prove that is invertible.

Consider and suppose so that . The assumption implies that satisfies the Lipschitz condition on . Hence, . Let where —that is, is a diagonal matrix with repeated on the diagonal times. Then, since .

Now, observe that . If is invertible, then the implicit function theorem [31, Theorem C.40] implies that is a local diffeomorphism. Hence, it suffices to show that does not have an eigenvalue of . Indeed, letting be the spectral radius of a matrix , we know in general that for any square matrix and induced operator norm, so that . Of course, the spectral radius is the maximum absolute value of the eigenvalues, so the above implies that all eigenvalues of have absolute value less than .

Since is injective by the preceding argument, its inverse is well-defined and since is a local diffeomorphism in , it follows that is smooth in . Thus, is a diffeomorphism.
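The invertibility step admits a quick numerical sanity check on a quadratic game, where the map in question is affine; the game Jacobian and step sizes below are illustrative assumptions.

# Sanity check of the invertibility argument for g(x) = x - Gamma @ omega(x) on a
# quadratic game where omega(x) = J @ x. If rho(Gamma @ J) < 1, then I - Gamma @ J
# has no eigenvalue equal to one (or zero), so g is invertible.
import numpy as np

J = np.array([[1.0, 3.0], [3.0, 1.0]])                # game Jacobian of a quadratic game
L = np.linalg.norm(J, 2)                              # a Lipschitz constant for omega
Gamma = np.diag([0.5 / L, 0.5 / L])                   # per-agent step sizes below 1/L
rho = np.max(np.abs(np.linalg.eigvals(Gamma @ J)))
print(rho < 1.0)                                      # True: spectral radius below one
print(abs(np.linalg.det(np.eye(2) - Gamma @ J)) > 0)  # True: g is invertible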

Consider all critical points of the game—i.e. . For each , let be the open ball derived from Theorem C and let . Since , Lindelöf's lemma [29]—every open cover has a countable subcover—gives a countable subcover of . That is, for a countable set of critical points with , we have that .

Starting from some point , if gradient-based learning converges to a strict saddle point, then there exists a and index such that for all . Again, applying Theorem C and using that —which we note is obviously true if —we get that .

Using the fact that is invertible, we can iteratively construct the sequence of sets defined by and . Then we have that for all . The set contains all the initial points in such that gradient-based learning converges to a strict saddle.

Since is a strict saddle, has an eigenvalue greater than . This implies that the co-dimension of is strictly less than (i.e. ). Hence, has Lebesgue measure zero in .

Using again that is a diffeomorphism, it is locally Lipschitz, and locally Lipschitz maps are null-set preserving. Hence, has measure zero for all by induction, so that is a measure zero set since it is a countable union of measure zero sets.

For the class of potential games, agents employing a gradient-based learning scheme converge to differential Nash equilibria almost surely. Consider a potential game on open, convex and where each for . Let be a prior measure with support which is absolutely continuous with respect to the Lebesgue measure and assume exists. Then, under the assumption of Theorem 1, competitive gradient-based learning converges to non-degenerate differential Nash equilibria almost surely. Moreover, the non-degenerate differential Nash to which it converges is generically a local Nash equilibrium. Since the game admits a potential function ,

so that analysis of the gradient-based learning scheme reduces to analyzing gradient-based optimization of . Moreover, existence of a potential function also implies that so that is symmetric. Indeed, writing as the differential form and noting that for the differential operator , we have that

(8)

Symmetry of implies that all periodic orbits are equilibria—i.e. the dynamics do not possess any limit cycles. By Theorem 1, the set of initial points that converge to linearly unstable critical points is of measure zero. Since all the stable critical points of the dynamics are equilibria, with the assumption that exists for all , we have that where is a non-degenerate differential Nash equilibrium which, by Theorem 3, is generically a local Nash equilibrium.

The interesting thing to point out here is that the agents do not need to be doing gradient-based learning on to converge to Nash almost surely. That is, they do not need to know the function ; they simply need to follow the derivative of their own cost with respect to their own choice variable. The potential function is also generically a Morse function; Morse theory states that there is an open dense (in the topology) set of functions, called Morse functions, where the Hessian is non-degenerate. As such, the number of critical points is finite. Thus, Corollary 1 implies that competitive gradient-based learning converges generically to one of finitely many local Nash equilibria in potential games.
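The remark that agents need only follow their own gradients can be checked directly on a small assumed potential game: even though the individual costs differ from the potential, the stacked vector of each agent's own gradient coincides with the gradient of the potential, so simultaneous gradient play (with a common step size) traces exactly the descent path on the potential.

# An assumed potential game sharing the potential phi(x) = 0.5*x1**2 + 0.5*x2**2 + 0.25*x1*x2:
#   f1(x) = 0.5*x1**2 + 0.25*x1*x2 + x2**2       (extra term depending only on x2)
#   f2(x) = 0.5*x2**2 + 0.25*x1*x2 - 3.0*x1      (extra term depending only on x1)
# Each agent's gradient with respect to its own variable equals the matching component
# of grad(phi), even though f1, f2, and phi are all different functions.
import numpy as np

def omega(x):      # (D_1 f_1(x), D_2 f_2(x))
    return np.array([x[0] + 0.25 * x[1], x[1] + 0.25 * x[0]])

def grad_phi(x):   # gradient of the potential, unknown to the agents
    return np.array([x[0] + 0.25 * x[1], x[1] + 0.25 * x[0]])

x_play, x_desc, gamma = np.array([2.0, -1.0]), np.array([2.0, -1.0]), 0.1
for t in range(500):
    x_play = x_play - gamma * omega(x_play)       # agents follow their own gradients
    x_desc = x_desc - gamma * grad_phi(x_desc)    # gradient descent on the potential
print(np.allclose(x_play, x_desc), np.round(x_play, 6))  # True; both near the Nash at the origin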

Theorem 1 (and Corollary 1) is important since it suggests that gradient-play in multi-agent settings avoids strict saddles almost surely, even in the deterministic setting. For zero-sum games, this is a positive result since it suggests that eventually gradient-based learning algorithms will escape strict saddle points of the dynamics. For general-sum and potential games, as we noted in Section 2, strict saddle points can be local Nash equilibria. Therefore, this is a negative result since a potentially large subset of the local Nash equilibria in a game will be avoided almost surely under gradient-based learning.

Since games generally do not admit purely gradient flows, other types of limiting behavior such as limit cycles can occur in gradient-based learning dynamics. Theorem 1 says nothing about such other limiting behavior. In the stochastic setting, we state stronger results on avoidance of linearly unstable periodic orbits, which include limit cycles.

4.2 Stochastic Competitive Gradient-Based Learning

We now analyze the stochastic case in which agents are assumed to have an unbiased estimator for their gradient. The results in this section are significant as they allow us to extend the results from the deterministic setting to a setting where each agent builds an estimate of the gradient of their loss at the current set of strategies from potentially noisy observations of the environment. This setup allows us to analyze the limiting behavior of gradient-based learning schemes in multi-agent reinforcement learning, multi-armed bandits, generative adversarial networks, and online optimization. In particular, we extend our results from the previous section to show that with unbiased estimates of the gradient of their loss function with respect to their own choice variable, agents will almost surely not converge to linearly unstable critical points and cycles. We also construct a new class of games, namely Morse-Smale games, and show that for such games, agents using gradient-based learning algorithms will converge either to locally asymptotically stable equilibria (including, but not limited to, stable local Nash equilibria) or to stable limit cycles. This second result implies that gradient-based learning algorithms can converge to non-Nash locally asymptotically stable equilibria, which is again a negative result for gradient-based learning algorithms in games.

Each agent updates their strategy using

(9)

where is an unbiased estimator for and hence, we can write it as for some zero-mean, finite-variance stochastic process . Before developing our theoretical results for the stochastic case, let us comment on the different learning algorithms that fit into this framework.

4.2.1 Example Classes of Gradient-Based Learning Algorithms

The stochastic gradient-based learning setting we study is general enough to include a variety of commonly used multi-agent learning algorithms. The classes of algorithms we include are hardly an exhaustive list; indeed, many extensions and altogether different algorithms exist that can be considered members of this class.

In Table 1, we provide the gradient-based update rule for six different example classes of learning problems: (i) gradient play in non-cooperative continuous games, (ii) GANs, (iii) multi-agent policy gradient, (iv) individual Q-learning, (v) multi-agent gradient bandits, and (vi) multi-agent experts. We provide a detailed analysis of these different algorithms including the derivation of the gradient-based update rules along with some interesting numerical examples in Appendix A. In each of these cases, one can view an agent employing the given algorithm as building an unbiased estimate of their gradient from their observation of the environment.

For example, in policy gradient (see, e.g., [48, Chapter 13]), agents' costs are defined as functions of a parameter vector that parameterizes their policies . The parameters are agent 's choice variable; through the learning scheme they aim to tune the parameter by following the gradient of their loss function in order to converge to an optimal policy . Perhaps surprisingly, it is not necessary for agent to have access to or even in order for them to construct an unbiased estimate of the gradient of their loss with respect to their own choice variable, as long as they observe the sequence of actions, say , generated by all other agents. These actions are implicitly determined by the other agents' policies . Hence, in this case, if agent observes , where are the reward, action, and state of agent , then this is enough to construct an unbiased estimate of their gradient.
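As an illustration of the policy-gradient entry, the sketch below forms a REINFORCE-style score-function estimate of each agent's gradient in a one-shot two-player setting with Gaussian policies; the costs, policies, and step-size schedule are assumptions chosen for brevity rather than the exact setup referenced in Table 1. Each agent only needs its own realized cost and the sampled actions to form an unbiased estimate, matching the discussion above.

# REINFORCE-style unbiased gradient estimate for agent i in a one-shot game:
#   grad_{theta_i} E[f_i(a)] = E[ f_i(a) * grad_{theta_i} log pi_i(a_i; theta_i) ],
# with Gaussian policies a_i ~ N(theta_i, sigma^2). The costs below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

def costs(a):  # f_1 and f_2 evaluated at the sampled joint action a = (a_1, a_2)
    return np.array([(a[0] - 1.0) ** 2 + 0.5 * a[0] * a[1],
                     (a[1] + 1.0) ** 2 - 0.5 * a[0] * a[1]])

theta = np.array([0.0, 0.0])     # each agent's policy parameter (its choice variable)
for t in range(20_000):
    a = theta + sigma * rng.standard_normal(2)        # both agents sample actions
    score = (a - theta) / sigma ** 2                  # grad_{theta_i} log pi_i(a_i; theta_i)
    g_hat = costs(a) * score                          # per-agent unbiased gradient estimates
    gamma_t = 0.05 / (1.0 + 0.01 * t)                 # decreasing steps, as in Assumption 3
    theta = theta - gamma_t * g_hat                   # stochastic gradient play
print(np.round(theta, 2))  # near the stable critical point of the expected-cost game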

Problem Class Gradient Learning Rule
Gradient Play
GANs
Multi-Agent Policy Gradient
Individual Q-Learning
Multi-Agent Gradient Bandits
Multi-Agent Experts
Table 1: Example problem classes that fit into competitive gradient-based learning rules. Details on the derivation of these update rules as gradient-based learning schemes are provided in Appendix A.

4.2.2 Stochastic Gradient Results

Coming back to the analysis of

(10)

we make the following assumptions.

Assumption 2

The stochastic process satisfies the assumptions , and a.s., for , where is an increasing family of σ-fields—filtration, or history generated by the sequence of random variables—given by .

Assumption 3

For each , with , is Lipschitz for some constant , the step-sizes satisfy for all and and , and a.s.

Let and denotes the inner product. Consider a game on . Suppose each agent adopts a gradient-free learning algorithm that satisfies Assumptions 2 and 3. Further, suppose that for each , there exists a constant such that for every unit vector . Then competitive gradient-free learning converges to linearly unstable critical points of the game on a set of measure zero. The above theorem implies that the stochastic approximation dynamics in (10), describing the competitive gradient-free learning corresponding to a game, avoid critical points of the game corresponding to strict saddles. The proof follows directly from showing under the assumptions that (10) satisfies Theorem C.

We again point out that the set of linearly unstable critical points of the game includes a non-negligible subset of the local Nash equilibria. Thus, the previous theorem says that stochastic gradient-based learning will avoid a non-negligible subset of the local Nash equilibria of the game almost surely. We note that the assumption , can be interpreted as a requirement on the noise having a nonzero component in each direction without which it is possible to converge to unstable points [40].

Under the assumptions of Theorem 3, if (10) converges to a critical point, then that critical point is a locally asymptotically stable equilibrium of . In particular, not all points that are attracting under the flow of are local Nash equilibria.

As we did with potential games in the preceding section, we can state stronger results for certain nice classes of games. As we have noted, games not admitting potential functions may lead to limit cycles. Hence, we use the expanded theory in [3, 6] to show that stochastic gradient-based learning algorithms avoid repelling sets.

To do so, we need further assumptions on our underlying space—i.e. we need the underlying decision spaces of agents to be smooth, compact manifolds without boundary where before we simply required them to either be Euclidean space—i.e. —or some open, convex subset of satisfying . In particular, let be a smooth, compact manifold without boundary for each . The stochastic process which follows (10) is defined on —that is, for all . As before, it is natural to compare sample points to solutions of where we think of (10) as a noisy approximation. As in the previous section, the asymptotic behavior of can indeed be described by the asymptotic behavior of the flow generated by .

A non-stationary periodic orbit of is called a cycle. Let be a cycle of period . Denote by the flow corresponding to . For any , where is the set of characteristic multipliers. We say is hyperbolic if no element of is on the complex unit circle. Further, if is strictly inside the unit circle, is called linearly stable and, on the other hand, if has at least one element on the outside of the unit circle—that is, for has an eigenvalue with real part strictly greater than —then is called linearly unstable. The latter is the analog of linearly unstable critical points in the context of periodic orbits.

We denote by sample paths of the process (10) and is the limit set of any sequence which is defined in the usual way as all such that for some sequence . It was shown in [3] that under less restrictive assumptions than Assumptions 2 and 3, is contained in the chain recurrent set of and is a non-empty, compact and connected set invariant under the flow of . Consider a game where each is a smooth, compact manifold without boundary. Suppose each agent adopts a stochastic gradient-based learning algorithm that satisfies Assumptions 2 and 3 and is such that sample points for all . Further, suppose that for each , there exists a constant such that for every unit vector . Then competitive gradient-free learning converges to linearly unstable cycles on a set of measure zero—i.e. where is a sample path. As we noted, periodic orbits are not necessarily excluded from the limiting behavior of gradient-based learning in games. We leave out the proof of Theorem 3 since, other than some algebraic manipulation, it is a direct application of [6, Theorem 2.1] (which we provide in Theorem C in Appendix C).

The above theorem simply states that competitive stochastic gradient-based learning avoids linearly unstable cycles on a set of measure zero. Of course, we can state stronger results for a more restrictive class of games admitting gradient-like vector fields. Specifically, analogous to [6], we can consider Morse-Smale vector fields. We introduce a new class of games, which we call Morse-Smale games, that are a generalization of potential games. These are a very important class of games as they correspond to Morse-Smale vector fields, which are known to be generic; in the case that the joint strategy space is a compact manifold, this implies that Morse-Smale games are open and dense in the set of games.

A game with for some and where the strategy space is a smooth, compact manifold without boundary for each is a Morse-Smale game if the vector field corresponding to the differential is Morse-Smale—that is, the following hold: (i) all periodic orbits (i.e. equilibria and cycles) are hyperbolic and (i.e. the stable and unstable manifolds of intersect transversally), (ii) every forward and backward omega limit set is a periodic orbit, and (iii) has a global attractor. The conditions in the above definition ensure that there are only finitely many periodic orbits. The simplest example of a Morse-Smale vector field is a gradient flow. However, not all Morse-Smale vector fields are gradient flows and hence, not all Morse-Smale games are potential games. Consider the -player game with for each and . This is a Morse-Smale game that is not a potential game. Indeed, where is a dynamical system with a Morse-Smale vector field that is not a gradient vector field [11].

Essentially, in a neighborhood of a critical point for a Morse-Smale game, the game behavior can be described by a Morse function such that near critical points can be written as and, away from critical points, points in the same direction as —i.e. . Specializing to the class of Morse-Smale games, we have stronger convergence guarantees. Consider a Morse-Smale game on a smooth boundaryless compact manifold . Suppose Assumptions 2 and 3 hold and that is defined on . Let denote the set of periodic orbits in . Then and implies is linearly stable. Moreover, if the periodic orbit with is an equilibrium, then it is either a non-degenerate differential Nash equilibrium—which is generically a local Nash—or a non-Nash locally asymptotically stable equilibrium. The proof of Theorem 3 utilizes [6, Corollary 2.2] (which we provide in Corollary C in Appendix C).

If we further restrict the class of games to potential games, the above theorem implies convergence to Nash in almost surely. Consider the game on smooth boundaryless compact manifold admitting potential function . Under the assumptions of Theorem 3, competitive stochastic gradient-based learning converges to a non-degenerate differential Nash equilibrium almost surely. Moreover, the differential Nash to which it converges is generically a local Nash equilibrium.

[Proof of Corollary 3] Consider a potential game where each is a smooth, compact boundaryless manifold. Then for some , which implies that is a gradient flow and hence does not admit limit cycles. Let be the set of equilibrium points in . Under the assumptions of Theorem 3,