# Policy-Gradient Algorithms Have No Guarantees of Convergence in Continuous Action and State Multi-Agent Settings

We show by counterexample that policy-gradient algorithms have no guarantees of even local convergence to Nash equilibria in continuous action and state space multi-agent settings. To do so, we analyze gradient-play in N-player general-sum linear quadratic games. In such games the state and action spaces are continuous and the unique global Nash equilibrium can be found by solving coupled Riccati equations. Further, gradient-play in LQ games is equivalent to multi-agent policy gradient. We first prove that the only critical point of the gradient dynamics in these games is the unique global Nash equilibrium. We then give sufficient conditions under which policy gradient will avoid the Nash equilibrium, and generate a large number of general-sum linear quadratic games that satisfy these conditions. The existence of such games indicates that one of the most popular approaches to solving reinforcement learning problems in the classic reinforcement learning setting has no guarantee of convergence in multi-agent settings. Further, the ease with which we can generate these counterexamples suggests that such situations are not mere edge cases and are in fact quite common.


## 1 Introduction

Interest in multi-agent reinforcement learning has surged recently, and policy-gradient algorithms are championed for their potential scalability. Indeed, recent impressive successes of multi-agent reinforcement learning have made use of policy optimization algorithms such as multi-agent actor-critic (Lowe et al., 2017; Srinivasan et al., 2018; Jaderberg et al., 2019), multi-agent proximal policy optimization (Bansal et al., 2018), and even simple multi-agent policy gradients (Lanctot et al., 2017) in problems where the various agents have continuous state and action spaces.

Despite these successes, a theoretical understanding of these algorithms in multi-agent settings is still lacking. What is perhaps missing is a tractable yet sufficiently complex setting in which to study these algorithms. Recently, there has been much interest in analyzing the convergence and sample complexity of policy-gradient algorithms in the classic linear quadratic regulator (LQR) problem from optimal control (Kalman, 1960). The LQR problem is a particularly apt setting to study the properties of reinforcement learning algorithms due to the existence of an optimal policy which is a linear function of the state and which can be found by solving a Riccati equation. Indeed, the relative simplicity of the problem has allowed for new insights into the behavior of reinforcement learning algorithms in continuous action and state spaces (Dean et al., 2017; Fazel et al., 2018; Malik et al., 2019).

An extension of the LQR problem to the setting with multiple agents, known as a linear quadratic (LQ) game, has also been well studied in the literature on dynamic games and optimal control (Basar and Olsder, 1998). As the name suggests, an LQ game is a setting in which multiple agents attempt to optimally control a shared linear dynamical system subject to quadratic costs. Since the players have their own costs, the notion of ‘optimality’ in such games is a Nash equilibrium.

Like LQR for the classical single-agent setting, LQ games are an appealing setting in which to analyze the behavior of multi-agent reinforcement learning algorithms in continuous action and state spaces since they admit a unique global Nash equilibrium in the space of linear feedback policies. Moreover, this equilibrium can be found by solving a coupled set of Riccati equations. As such, LQ games are a natural benchmark problem on which to test policy-gradient algorithms in multi-agent settings. In the single-agent setting, it was recently shown that policy gradient has global convergence guarantees for the LQR problem (Fazel et al., 2018). These results have recently been extended to projected policy-gradient algorithms in zero-sum LQ games (Zhang et al., 2019).

#### Contributions.

We present a negative result, showing that policy gradient in general-sum LQ games does not enjoy even local convergence guarantees, unlike in LQR and zero-sum LQ games. In particular, we show that there exists an LQ game in which, if each player randomly initializes their policy and then uses a policy-gradient algorithm, the players almost surely fail to converge to any single set of policies (including the unique Nash equilibrium). Further, our numerical experiments indicate that LQ games in which this occurs may be quite common. We also observe empirically that when players fail to converge to the Nash equilibrium they do converge to stable limit cycles. These cycles do not seem to have any readily apparent relationship to the Nash equilibrium of the game.

#### Organization.

The paper is organized as follows. In Section 2, we introduce the setting of $N$-player general-sum LQ games and present previous results on the existence and uniqueness of the Nash equilibrium in such games. We show in Section 3 that policy gradient in the general class of LQ games that admit a unique feedback Nash equilibrium has no stationary points other than the Nash equilibrium. In Section 4, we give sufficient conditions under which policy gradient will almost surely avoid the Nash equilibrium. Given these theoretical results, we perform a random search and find a large number of 2-player LQ games that satisfy these sufficient conditions. These findings are presented in Section 5. We also present numerical experiments showing the existence of limit cycles in the gradient dynamics of general-sum LQ games. We conclude in Section 6 with a discussion of our findings.

## 2 Preliminaries

We consider $N$-player LQ games subject to a discrete-time dynamical system defined by

$$z(t+1) = A z(t) + \sum_{i=1}^{N} B_i u_i(t), \qquad z(0) = z_0 \sim \mathcal{D}_o, \tag{1}$$

where $z(t) \in \mathbb{R}^m$ is the state at time $t$, $\mathcal{D}_o$ is the initial state distribution, and $u_i(t) \in \mathbb{R}^{d_i}$ is the control input of player $i \in \{1,\ldots,N\}$. For LQ games, it is known that under reasonable assumptions, linear feedback policies for each player that constitute a Nash equilibrium exist and are unique (Basar and Olsder, 1998). Thus, we consider that each player $i$ searches for a linear feedback policy of the form $u_i(t) = -K_i z(t)$ that minimizes their loss, where $K_i \in \mathbb{R}^{d_i \times m}$. We use the notation $d = \sum_{i=1}^{N} d_i$, so that the combined dimension of the players’ parameterized policies is $md$.

As the name of the game implies, the players’ loss functions are quadratic functions given by

$$f_i(u_1,\ldots,u_N) = \mathbb{E}_{z_0 \sim \mathcal{D}_o}\left[\sum_{t=0}^{\infty} z(t)^T Q_i z(t) + u_i(t)^T R_i u_i(t)\right],$$

where $Q_i$ and $R_i$ are the cost matrices for the state and input, respectively.
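To make the loss concrete, the following is a minimal sketch (NumPy, not taken from the paper) that approximates $f_i$ for a single fixed initial state by rolling out the dynamics (1) under linear feedback $u_j(t) = -K_j z(t)$ and truncating the infinite sum at a finite horizon. The function name `rollout_cost`, the horizon, and all numerical values are illustrative assumptions.

```python
import numpy as np

def rollout_cost(i, A, Bs, Qs, Rs, Ks, z0, horizon=500):
    """Truncated cost of player i when every player j uses u_j(t) = -K_j z(t)."""
    z = np.asarray(z0, dtype=float)
    total = 0.0
    for _ in range(horizon):
        us = [-K @ z for K in Ks]                        # control input of each player
        total += z @ Qs[i] @ z + us[i] @ Rs[i] @ us[i]   # stage cost of player i
        z = A @ z + sum(B @ u for B, u in zip(Bs, us))   # dynamics (1)
    return float(total)

# Illustrative two-player game with a two-dimensional state (values are assumptions).
A = np.array([[0.5, 0.1], [0.2, 0.4]])
Bs = [np.array([[1.0], [1.0]]), np.array([[0.0], [1.0]])]
Qs = [np.diag([0.01, 1.0]), np.diag([1.0, 1.0])]
Rs = [np.array([[0.01]]), np.array([[1.0]])]
Ks = [np.array([[0.1, 0.0]]), np.array([[0.0, 0.1]])]
z0 = np.array([1.0, -1.0])

costs = [rollout_cost(i, A, Bs, Qs, Rs, Ks, z0) for i in range(2)]
print(costs)
```

The truncation is only meaningful when the closed-loop system is stable, since otherwise the partial sums grow without bound.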

###### Assumption 1

For each player $i \in \{1,\ldots,N\}$, the state and control cost matrices satisfy $Q_i \succ 0$ and $R_i \succ 0$.

We note that the players are coupled through the dynamics since $z(t)$ is constrained to obey the update equation given in (1). We focus on a setting in which all players randomly initialize their strategy and then perform gradient descent simultaneously on their own cost functions with respect to their individual control inputs. That is, the players use policy-gradient algorithms of the following form:

$$K_{i,n+1} = K_{i,n} - \gamma_i D_i f_i(K_{1,n},\ldots,K_{N,n}), \tag{2}$$

where $D_i f_i$ denotes the derivative of $f_i$ with respect to the $i$-th argument, and $\{\gamma_i\}_{i=1}^{N}$ are the step sizes of the players. We note that there is a slight abuse of notation here in the expression of $f_i$ as a function of the parameters $K_i$ as opposed to the control inputs $u_i$. To ensure there is no confusion between $t$ and $n$, we also point out that $n$ indexes the policy gradient algorithm iterations while $t$ indexes the time of the dynamical system.

To simplify notation, define

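$$\Sigma_K = \mathbb{E}_{z_0 \sim \mathcal{D}_o}\left[\sum_{t=0}^{\infty} z(t) z(t)^T\right],$$

where we use the subscript notation to denote the dependence on the collection of controllers $K = (K_1,\ldots,K_N)$. Define also the initial state covariance matrix

$$\Sigma_0 = \mathbb{E}_{z_0 \sim \mathcal{D}_o}\left[z(0) z(0)^T\right].$$

Direct computation verifies that for player $i$, $D_i f_i$ is given by:

$$D_i f_i(K_1,\ldots,K_N) = 2\left(R_i K_i - B_i^T P_i \bar{A}\right)\Sigma_K, \tag{3}$$

where $\bar{A} = A - \sum_{i=1}^{N} B_i K_i$ is the closed-loop dynamics given all players’ control inputs and, for given $(K_1,\ldots,K_N)$, the matrix $P_i$ is the unique positive definite solution to the Bellman equation defined as

$$P_i = \bar{A}^T P_i \bar{A} + K_i^T R_i K_i + Q_i, \qquad i \in \{1,\ldots,N\}. \tag{4}$$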
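As a sanity check on (3) and (4), here is a hedged NumPy sketch that evaluates the policy gradient of each player in closed form and takes one simultaneous update of the form (2). It relies on the standard identity $f_i = \mathrm{tr}(P_i \Sigma_0)$ for stabilizing policies, which is not stated explicitly in the text. The helper names (`dlyap`, `closed_loop`, `cost`, `policy_gradient`), the step size, and the numerical values are assumptions made for illustration.

```python
import numpy as np

def dlyap(M, W):
    """Solve X = M @ X @ M.T + W (requires the spectral radius of M to be below one)."""
    n = M.shape[0]
    vec_x = np.linalg.solve(np.eye(n * n) - np.kron(M, M), W.reshape(-1))
    return vec_x.reshape(n, n)

def closed_loop(A, Bs, Ks):
    """A_bar = A - sum_i B_i K_i."""
    return A - sum(B @ K for B, K in zip(Bs, Ks))

def cost(i, A, Bs, Qs, Rs, Ks, Sigma0):
    """f_i = trace(P_i Sigma_0), with P_i solving the Bellman equation (4)."""
    A_bar = closed_loop(A, Bs, Ks)
    P_i = dlyap(A_bar.T, Ks[i].T @ Rs[i] @ Ks[i] + Qs[i])
    return float(np.trace(P_i @ Sigma0))

def policy_gradient(i, A, Bs, Qs, Rs, Ks, Sigma0):
    """D_i f_i = 2 (R_i K_i - B_i^T P_i A_bar) Sigma_K, as in equation (3)."""
    A_bar = closed_loop(A, Bs, Ks)
    P_i = dlyap(A_bar.T, Ks[i].T @ Rs[i] @ Ks[i] + Qs[i])
    Sigma_K = dlyap(A_bar, Sigma0)
    return 2.0 * (Rs[i] @ Ks[i] - Bs[i].T @ P_i @ A_bar) @ Sigma_K

# Illustrative two-player game (all values are assumptions, not the paper's).
A = np.array([[0.5, 0.1], [0.2, 0.4]])
Bs = [np.array([[1.0], [1.0]]), np.array([[0.0], [1.0]])]
Qs = [np.diag([0.01, 1.0]), np.diag([1.0, 1.0])]
Rs = [np.array([[0.01]]), np.array([[1.0]])]
Ks = [np.array([[0.1, 0.0]]), np.array([[0.0, 0.1]])]
Sigma0 = np.eye(2)

grads = [policy_gradient(i, A, Bs, Qs, Rs, Ks, Sigma0) for i in range(2)]
gamma = 0.01  # hypothetical common step size
Ks_next = [K - gamma * g for K, g in zip(Ks, grads)]  # one simultaneous step of (2)
print(grads, Ks_next)
```

Both gradients are evaluated at the current joint policy before either player moves, which is what makes the update simultaneous rather than alternating.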

Given that the players may have different control objectives and do not engage in coordination or cooperation, the best they can hope to achieve is a Nash equilibrium. A feedback Nash equilibrium is a collection of policies $(K_1^*,\ldots,K_N^*)$ such that:

$$f_i(K_1^*,\ldots,K_i^*,\ldots,K_N^*) \le f_i(K_1^*,\ldots,K_i,\ldots,K_N^*), \qquad \forall\, K_i \in \mathbb{R}^{d_i \times m},$$

for each $i \in \{1,\ldots,N\}$. Under suitable assumptions on the cost matrices, the Nash equilibrium of an LQ game is known to exist and to be unique in the space of linear policies (Basar and Olsder, 1998). It can be found by solving coupled Riccati equations using the method of Lyapunov iterations; e.g., the method is outlined in Li and Gajic (1995) for continuous time LQ games, and an analogous procedure can be followed for discrete time. Convergence requires the following assumption.

###### Assumption 2

For at least one player $i \in \{1,\ldots,N\}$, $(A, B_i)$ is stabilizable.

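Assumption 2 is a necessary condition for the players to be able to stabilize the system. Indeed, the players’ costs are finite only if the closed-loop system is asymptotically stable. Since the system (1) evolves in discrete time, this means that $|\lambda| < 1$ for all $\lambda \in \mathrm{spec}(A - \sum_{i=1}^{N} B_i K_i)$, where $|\lambda|$ denotes the modulus of $\lambda$ and $\mathrm{spec}(M)$ is the spectrum of a matrix $M$.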
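For the discrete-time system (1), asymptotic stability of the closed loop is equivalent to the spectral radius of $A - \sum_i B_i K_i$ being strictly below one. A minimal sketch of that check follows; the function name `is_stabilizing` and the example matrices are hypothetical.

```python
import numpy as np

def is_stabilizing(A, Bs, Ks):
    """True if the closed loop A - sum_i B_i K_i has spectral radius below one."""
    A_bar = A - sum(B @ K for B, K in zip(Bs, Ks))
    return bool(np.max(np.abs(np.linalg.eigvals(A_bar))) < 1.0)

# Illustrative system with one unstable open-loop mode (values are assumptions).
A = np.array([[1.2, 0.0], [0.0, 0.5]])
Bs = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]

K_zero = [np.zeros((1, 2)), np.zeros((1, 2))]
K_feedback = [np.array([[0.5, 0.0]]), np.zeros((1, 2))]

print(is_stabilizing(A, Bs, K_zero), is_stabilizing(A, Bs, K_feedback))
```

In this example only player 1 can move the unstable mode, so the joint policy is stabilizing exactly when player 1's gain shifts that eigenvalue inside the unit circle.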

## 3 Uniqueness of Critical Points of the Gradient Dynamics in General-Sum LQ Games

Having introduced the class of games we analyze, we now comment on the critical points of gradient-play in general LQ games. Letting $x = (K_1,\ldots,K_N)$, the object of interest is the map $\omega$ that stacks the players’ individual gradients, defined as follows:

$$\omega(x) = \begin{bmatrix} D_1 f_1(K_1,\ldots,K_N) \\ \vdots \\ D_N f_N(K_1,\ldots,K_N) \end{bmatrix}.$$

Note that each $D_i f_i$ has been converted to an $md_i$-dimensional vector and each $K_i$ has also been vectorized. This is a slight abuse of notation and throughout we treat the $K_i$’s as both vectors and matrices; in general, the shape should be clear from context, and otherwise we make comments where necessary to clarify.

Critical points of gradient-play are strategies $x$ such that $\omega(x) = 0$. Recent work has shown that when players perform gradient descent on their own cost functions in general-sum games they may converge to critical points that are not Nash equilibria (Mazumdar et al., 2018). The following theorem shows that such non-Nash critical points cannot exist in the gradient dynamics of general-sum LQ games.

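###### Theorem 3

Consider the set of stabilizing policies $K^* = (K_1^*,\ldots,K_N^*)$ such that $\Sigma_{K^*} \succ 0$. If $D_i f_i(K_1^*,\ldots,K_N^*) = 0$ for each $i \in \{1,\ldots,N\}$, then $K^*$ is a Nash equilibrium.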

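Proof. We prove the statement by contradiction. Suppose the claim does not hold so that $\Sigma_{K^*} \succ 0$ and $D_i f_i(K_1^*,\ldots,K_N^*) = 0$ for each $i \in \{1,\ldots,N\}$, yet $K^*$ is not a Nash equilibrium. That is, without loss of generality, there exists a $\bar{K}_1$ such that:

$$f_1(\bar{K}_1, K_2^*,\ldots,K_N^*) < f_1(K_1^*, K_2^*,\ldots,K_N^*).$$

Now, fixing $(K_2^*,\ldots,K_N^*)$, player 1 can be seen as facing an LQR problem. Indeed, letting $(K_2^*,\ldots,K_N^*)$ be fixed, player 1 aims to find a ‘best response’ in the space of linear feedback policies of the form $u_1(t) = -K_1 z(t)$ with $K_1 \in \mathbb{R}^{d_1 \times m}$ that minimizes $f_1$ subject to the dynamics defined by:

$$z(t+1) = \left(A - \sum_{i=2}^{N} B_i K_i^*\right) z(t) + B_1 u_1(t).$$

Note that this system is necessarily stabilizable since $A - \sum_{i=1}^{N} B_i K_i^*$ is stable. Hence, the discrete algebraic Riccati equation for player 1’s LQR problem has a positive definite solution $P$ such that $R_1 + B_1^T P B_1 \succ 0$ since $R_1 \succ 0$ by assumption. Since $D_1 f_1(K_1^*,\ldots,K_N^*) = 0$ and $\Sigma_{K^*} \succ 0$, applying Corollary 4 of Fazel et al. (2018), we have that $K_1^*$ must be optimal for player 1’s LQR problem so that

$$f_1(K_1^*,\ldots,K_N^*) \le f_1(K, K_2^*,\ldots,K_N^*), \qquad \forall\, K \in \mathbb{R}^{d_1 \times m}.$$

In particular, the above inequality holds for $K = \bar{K}_1$, which leads to a contradiction.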

Theorem 3 shows that, just as in the single-player LQR setting and zero-sum LQ games, the critical points of gradient-play in $N$-player general-sum LQ games are all Nash equilibria. We note that the condition $\Sigma_{K^*} \succ 0$ can be satisfied by choosing an initial state distribution with a full-rank covariance matrix.

A simple consequence of Theorem 3 and the uniqueness of the Nash equilibrium given Assumptions 1 and 2 is that the gradient dynamics admit a unique critical point.

###### Corollary 3

Under Assumptions 1 and 2, if $\Sigma_0 \succ 0$, then the map $\omega$ admits a unique critical point.

Given that the unique critical point of the gradient dynamics in LQ games is the Nash equilibrium, the aim is to show, by constructing an example, that games in which the gradient dynamics avoid the Nash equilibrium do in fact exist. A sufficient condition for this would be to find a game in which gradient-play diverges from neighborhoods of Nash equilibria. It is demonstrated in Mazumdar et al. (2018) that there may be Nash equilibria that are not even locally attracting under the gradient dynamics in $N$-player general-sum games in which the players’ costs are sufficiently smooth (i.e., at least twice continuously differentiable). In games that admit such Nash equilibria, the agents could initialize arbitrarily close to the Nash equilibrium, simultaneously perform individual gradient descent with arbitrarily small step sizes, and still diverge.

The class of $N$-player LQ games we consider does not, however, satisfy the smoothness assumptions necessary to simply invoke the results in Mazumdar et al. (2018). Indeed, the cost functions are non-smooth and, in fact, are infinite whenever the players have strategies that do not stabilize the dynamics. Further, the set of stabilizing policies for a dynamical system is not even convex (Fazel et al., 2018). Despite these challenges, in the sequel we show that the negative convergence results in Mazumdar et al. (2018) extend to the general-sum LQ setting. In particular, we show that even with arbitrarily small step sizes, players using policy gradient in LQ games may still diverge from neighborhoods of the unique Nash equilibrium.

## 4 Policy Gradient Avoids Nash Equilibria that are Saddle Points of the Dynamics

Given that the Nash equilibrium is the unique critical point of the gradient dynamics in $N$-player LQ games, we now give sufficient conditions under which gradient-play has no guarantees of even local, much less global, convergence to a critical point. Towards this end, we first show that $\omega$ is sufficiently smooth on the set of stabilizing policies.

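###### Proposition 4

Let $S^{md} \subset \mathbb{R}^{md}$, with $d = \sum_{i=1}^{N} d_i$, be the subset of stabilizing $md$-dimensional (vectorized) matrices. Consider an $N$-player LQ game. The vector-valued map $\omega$ associated with the game is twice continuously differentiable on $S^{md}$, i.e., $\omega \in C^2(S^{md}, \mathbb{R}^{md})$.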

Using our notation, Lemma 6.5 in Zhang et al. (2019) shows for two-player zero-sum LQ games that $P_i$ and $\Sigma_K$ are continuously differentiable with respect to $K_1$ and $K_2$ when $\bar{A}$ is stable. This, in turn, implies that $\omega$ is continuously differentiable with respect to $K_1$ and $K_2$ when the closed-loop system is stable. The result follows by a straightforward application of the implicit function theorem (Abraham et al., 1988). We utilize the same proof technique here in extending the result to $N$-player general-sum LQ games and, in fact, the proof implies that $\omega$ has even stronger regularity properties. Since the proof follows the same techniques as in Zhang et al. (2019), we defer it to Appendix A.

Given that $\omega$ is continuously differentiable over the set of stabilizing joint policies, the following result gives sufficient conditions such that the set of initial conditions in a neighborhood of the Nash equilibrium from which gradient-play converges to the Nash equilibrium is of measure zero. This implies that the players will almost surely avoid the Nash equilibrium even if they initialize at random in a small ball around it.

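Let the Jacobian of the vector field $\omega$ be denoted by $D\omega$. Given a critical point $x^*$, let $\lambda_j$ be the eigenvalues of $D\omega(x^*)$, for $j \in \{1,\ldots,md\}$, where $d = \sum_{i=1}^{N} d_i$ and $\mathrm{Re}(\lambda_1) \le \cdots \le \mathrm{Re}(\lambda_{md})$. Recall that the state is dimension $m$.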

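###### Theorem 4

Suppose that $\Sigma_0 \succ 0$. Consider any $N$-player LQ game satisfying Assumptions 1 and 2 that admits a unique Nash equilibrium that is a saddle point of the policy gradient dynamics, i.e., LQ games for which the Jacobian of $\omega$ evaluated at the unique Nash equilibrium $x^*$ has eigenvalues $\lambda_j$ such that $\mathrm{Re}(\lambda_j) < 0$ for $j \in \{1,\ldots,\ell\}$ and $\mathrm{Re}(\lambda_j) > 0$ for $j \in \{\ell+1,\ldots,md\}$ for some $\ell$ such that $0 < \ell < md$. Then there exists a neighborhood $U$ of $x^*$ such that the set of initializations in $U$ from which policy gradient converges to $x^*$ has measure zero.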

Proof. The proof is made up of three parts: (i) we show the existence of an open convex neighborhood $U$ of $x^*$ on which $\omega$ is locally Lipschitz with constant $L$; (ii) we show that the map $g(x) = x - \Gamma\omega(x)$ is a diffeomorphism on $U$; and (iii) we invoke the stable manifold theorem to show that the set of initializations in $U$ on which policy gradient converges is measure zero.

#### (i) ω is locally Lipschitz.

Proposition 4 shows that $\omega$ is continuously differentiable on the set of stabilizing policies. Given Assumptions 1 and 2, the Nash equilibrium $x^*$ exists and is stabilizing. Thus, there must exist an open convex neighborhood $U$ of $x^*$ such that $\|D\omega(x)\| \le L$ for all $x \in U$, for some $L > 0$.

#### (ii) g is a diffeomorphism.

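By the preceding argument, $\omega$ is locally Lipschitz on $U$ with Lipschitz constant $L$. Consider the policy gradient algorithm with $\gamma_i < 1/L$ for each $i \in \{1,\ldots,N\}$. Let $\Gamma = \mathrm{blockdiag}(\Gamma_1,\ldots,\Gamma_N)$ where $\Gamma_i = \gamma_i I_{md_i}$; that is, $\Gamma_i$ is an $md_i \times md_i$ diagonal matrix with $\gamma_i$ repeated on the diagonal $md_i$ times. Now, we claim the mapping $g: x \mapsto x - \Gamma\omega(x)$ is a diffeomorphism on $U$. If we can show that $g$ is invertible on $U$ and a local diffeomorphism, then the claim follows. Let us first prove that $g$ is invertible.

Consider $x \neq y$ and suppose $g(x) = g(y)$ so that $y - x = \Gamma(\omega(y) - \omega(x))$. Since $\|\omega(y) - \omega(x)\| \le L\|y - x\|$ on $U$, we have $\|x - y\| \le L\|\Gamma\|\|y - x\| < \|y - x\|$ since $\|\Gamma\| = \max_i \gamma_i < 1/L$, which is a contradiction. Hence $g$ is injective on $U$.

Now, observe that $Dg = I - \Gamma D\omega(x)$. If $Dg$ is invertible, then the inverse function theorem (Abraham et al., 1988) implies that $g$ is a local diffeomorphism. Hence, it suffices to show that $\Gamma D\omega(x)$ does not have an eigenvalue equal to one. Indeed, letting $\rho(M)$ be the spectral radius of a matrix $M$, we know in general that $\rho(M) \le \|M\|$ for any square matrix $M$ and induced operator norm $\|\cdot\|$ so that $\rho(\Gamma D\omega(x)) \le \|\Gamma D\omega(x)\| \le \|\Gamma\| \sup_{x \in U}\|D\omega(x)\| \le \max_i \gamma_i L < 1$. Of course, the spectral radius is the maximum absolute value of the eigenvalues, so that the above implies that all eigenvalues of $\Gamma D\omega(x)$ have absolute value less than one.

Since $g$ is injective by the preceding argument, its inverse $g^{-1}$ is well-defined on $g(U)$ and since $g$ is a local diffeomorphism on $U$, it follows that $g^{-1}$ is smooth on $g(U)$. Thus, $g$ is a diffeomorphism.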

#### (iii) Local convergence occurs on a set of measure zero.

By Corollary 3, $x^*$ is unique. Let $B$ be the open ball derived from Theorem B in Appendix B.

Starting from $x_0 \in U$, if gradient-based learning converges to a strict saddle point, then there exists an $n_0$ such that $g^n(x_0) \in B$ for all $n \ge n_0$. Applying Theorem B (Appendix B), we get that $g^n(x_0) \in W^{cs}_{loc} \cap B$. Now, using the fact that $g$ is invertible, we can iteratively construct the sequence of sets defined by $W_1 = g^{-1}(W^{cs}_{loc} \cap B)$ and $W_{k+1} = g^{-1}(W_k \cap U)$. Then we have that $x_0 \in W_n$ for all $n \ge n_0$. The set $\bigcup_{k \ge 1} W_k$ contains all the initial points in $U$ such that gradient-based learning converges to a strict saddle.

Since $x^*$ is a strict saddle, $I - \Gamma D\omega(x^*)$ has an eigenvalue greater than one. This implies that the co-dimension of the unstable manifold is strictly less than $md$ so that $\dim(W^{cs}_{loc}) < md$. Hence, $W^{cs}_{loc} \cap B$ has Lebesgue measure zero in $\mathbb{R}^{md}$. We use again that $g$ is a diffeomorphism: $g^{-1}$ is continuously differentiable, so that it is locally Lipschitz, and locally Lipschitz maps are null-set preserving. Hence, $W_k$ has measure zero for all $k$ by induction so that $\bigcup_{k \ge 1} W_k$ is a measure-zero set since it is a countable union of measure-zero sets.

Theorem 4 gives sufficient conditions under which, with random initializations of $K_i$, policy gradient methods would almost surely avoid the unique critical point and therefore fail to converge to any single set of policies. Let each player’s initial strategy $K_{i,0}$ be sampled from a distribution $p_i$ for $i \in \{1,\ldots,N\}$, and let $p$ be the resulting joint distribution of $(K_{1,0},\ldots,K_{N,0})$.

###### Corollary 4

Suppose $\mathcal{D}_o$ is chosen such that $\Sigma_0 \succ 0$, and consider an $N$-player LQ game satisfying Assumptions 1 and 2 in which the Nash equilibrium is a saddle point of the policy gradient dynamics. If each player performs policy gradient with a random initial strategy such that the support of the joint initialization distribution is the neighborhood of the Nash equilibrium given by Theorem 4, they will almost surely avoid the Nash equilibrium.

Corollary 4 shows that even if the players randomly initialize in a neighborhood of the Nash equilibrium in an LQ game in which the Nash equilibrium is a saddle point of the joint gradient dynamics, they will almost surely avoid it. The proof follows directly from the fact that the set of initializations that converge to the Nash equilibrium is of measure zero in that neighborhood.
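The avoidance mechanism can be illustrated on a toy two-player quadratic game with scalar actions (not an LQ game, and not from the paper): $f_1(x,y) = \tfrac12 x^2 + 2xy$ and $f_2(x,y) = \tfrac12 y^2 + 2xy$. The origin is the unique Nash equilibrium, yet the Jacobian of $\omega(x,y) = (x + 2y,\, y + 2x)$ has eigenvalues $3$ and $-1$, so the equilibrium is a saddle of the gradient dynamics. The sketch below, under these assumptions and with a hypothetical step size, shows that a random initialization near the origin leaves it, while an initialization placed exactly on the stable eigenspace converges.

```python
import numpy as np

# Jacobian of omega(x, y) = (x + 2y, y + 2x) for the toy game described above.
J = np.array([[1.0, 2.0], [2.0, 1.0]])
gamma = 0.1  # hypothetical step size, below 1 / ||J||_2 = 1/3

def gradient_play(x0, steps=200):
    """Iterate the simultaneous update x <- x - gamma * omega(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - gamma * (J @ x)
    return x

rng = np.random.default_rng(0)
x_random = gradient_play(1e-3 * rng.standard_normal(2))   # generic point near the Nash
x_stable = gradient_play(1e-3 * np.array([1.0, 1.0]))     # on the stable eigenspace

print(np.linalg.norm(x_random), np.linalg.norm(x_stable))
```

The stable eigenspace here is the line spanned by $(1, 1)$, a set of measure zero in the plane, which is the low-dimensional analogue of the measure-zero set in Theorem 4.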

In the next section, we generate a large number of LQ games that satisfy the conditions of Corollary 4. Taken together, these theoretical and numerical results imply that policy-gradient algorithms have no guarantees of local, and consequently global, convergence in general-sum LQ games.

Theorem 4 gives us sufficient conditions under which policy gradient in general-sum LQ games does not even have local convergence guarantees, much less global convergence guarantees. We remark that this is very different from the single-player LQR setting, where policy gradient will converge from any initialization in a neighborhood of the optimal solution (Fazel et al., 2018). In zero-sum LQ games, the structure of the game also precludes any Nash equilibrium from satisfying the conditions of Theorem 4 (Mazumdar et al., 2018), meaning that local convergence is always guaranteed. In Zhang et al. (2019), the guarantee of local convergence is strengthened to that of global convergence for a class of projected policy gradient algorithms in zero-sum LQ games.

## 5 Generating Counterexamples

Since it is difficult to find a simple closed form for the Jacobian of $\omega$ due to the fact that the matrices $P_i$ implicitly depend on all the $K_i$, we perform random search to find instances of LQ games in which the Nash equilibrium is a strict saddle point of the gradient dynamics. For each LQ game we generate, we use the method of Lyapunov iterations to find the global Nash equilibrium of the LQ game and numerically approximate the Jacobian to machine precision. We then check whether the Nash equilibrium is a strict saddle. Surprisingly, such a simple search procedure finds a large number of LQ games in which policy gradient avoids the unique Nash equilibrium.
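The check in this procedure can be sketched as follows. This is an illustration only: it substitutes central finite differences for the auto-differentiation used in the experiments, and it applies the test to a small hypothetical vector field rather than to the gradient map of an LQ game. The names `numerical_jacobian`, `is_strict_saddle`, and `omega_toy` are assumptions.

```python
import numpy as np

def numerical_jacobian(omega, x, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of omega at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (omega(x + step) - omega(x - step)) / (2.0 * eps)
    return J

def is_strict_saddle(J, tol=1e-8):
    """True if the eigenvalues have nonzero real parts of both signs."""
    re = np.linalg.eigvals(J).real
    n_neg = int(np.sum(re < -tol))
    n_pos = int(np.sum(re > tol))
    return n_neg > 0 and n_pos > 0 and n_neg + n_pos == re.size

def omega_toy(x):
    """Stacked individual gradients of a hypothetical two-player game."""
    return np.array([x[0] + 2.0 * x[1] + x[0] ** 3, x[1] + 2.0 * x[0]])

J = numerical_jacobian(omega_toy, np.zeros(2))
print(np.round(J, 6), is_strict_saddle(J))
```

Finite differences are less accurate than auto-differentiation, so eigenvalues with real part close to zero should be treated with caution under this substitute.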

For simplicity, we focus on two-player LQ games where $m = 2$ and $d_1 = d_2 = 1$. Thus, each player $i \in \{1, 2\}$ has two parameters to learn, namely the two entries of $K_i \in \mathbb{R}^{1 \times 2}$.

In the remainder of this section, we detail our experimental setup and then present our findings.

### 5.1 Experimental setup

To search for examples of LQ games in which policy gradient avoids the Nash equilibrium, we fix $B_1$, $Q_1$, and $R_1$ and parametrize $B_2$, $Q_2$, and $R_2$ by $b$, $q$, and $r$, respectively. For various values of the parameters $b$, $q$, and $r$, we uniformly sample different dynamics matrices $A$ such that $(A, B_1, B_2)$ satisfies Assumption 2. Then, for each of the different LQ games we find the optimal feedback matrices $(K_1^*, K_2^*)$ using the method of Lyapunov iterations (i.e., a discrete time variant of the algorithm outlined in Li and Gajic (1995)), and then numerically approximate $D\omega(K_1^*, K_2^*)$ using auto-differentiation tools and check its eigenvalues. (We use auto-differentiation because finding an analytical expression for $D\omega$ is unduly arduous even in low dimensions due to the dependence of $P_i$ and $\Sigma_K$ on $K$, both of which are implicitly defined.)

The exact values of the matrices are defined as follows:

$$A \in \mathbb{R}^{2 \times 2}: a_{i,j} \sim \mathrm{Uniform}(0,1),\ i,j = 1,2, \quad B_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad B_2 = \begin{bmatrix} b \\ 1 \end{bmatrix}, \quad Q_1 = \begin{bmatrix} 0.01 & 0 \\ 0 & 1 \end{bmatrix}, \quad Q_2 = \begin{bmatrix} 1 & 0 \\ 0 & q \end{bmatrix}, \quad R_1 = 0.01, \quad R_2 = r.$$
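A hedged sketch of the sampling step is given below. It draws $A$ entrywise from $\mathrm{Uniform}(0,1)$, builds the remaining matrices from the definitions above, and keeps a game only if at least one pair $(A, B_i)$ passes a Popov-Belevitch-Hautus stabilizability test for discrete-time systems. The helper names, the number of games, the seed, and the values of $b$, $q$, and $r$ are placeholders chosen for illustration, not the values used in the experiments.

```python
import numpy as np

def is_stabilizable(A, B, tol=1e-9):
    """PBH test: rank [lam*I - A, B] = n for every eigenvalue of A with |lam| >= 1."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        if abs(lam) >= 1.0:
            pbh = np.hstack([lam * np.eye(n) - A, B])
            if np.linalg.matrix_rank(pbh, tol=tol) < n:
                return False
    return True

def sample_game(rng, b, q, r):
    """Draw one two-player LQ game with the fixed and parametrized matrices above."""
    A = rng.uniform(0.0, 1.0, size=(2, 2))
    Bs = [np.array([[1.0], [1.0]]), np.array([[b], [1.0]])]
    Qs = [np.diag([0.01, 1.0]), np.diag([1.0, q])]
    Rs = [np.array([[0.01]]), np.array([[r]])]
    return A, Bs, Qs, Rs

rng = np.random.default_rng(0)
games = []
while len(games) < 5:  # placeholder count
    A, Bs, Qs, Rs = sample_game(rng, b=0.0, q=0.1, r=0.1)  # placeholder parameters
    if any(is_stabilizable(A, B) for B in Bs):
        games.append((A, Bs, Qs, Rs))

print(len(games))
```

Each retained game would then be passed to an equilibrium solver and to the eigenvalue check on the Jacobian, neither of which is reproduced here.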

### 5.2 Numerical results

Using the setup outlined in the previous section we randomly generated LQ games to search for counterexamples. We first present results that show that these counterexamples may be quite common. We then use policy gradient in two of the LQ games we generated and highlight the existence of limit cycles and the fact that the players’ time-averaged strategies do not converge to the Nash equilibrium.

#### Avoidance of Nash in a nontrivial class of LQ games.

As can be seen in Figure 1, across the different parameter values we considered, we found that in to of randomly sampled LQ games, the unique global Nash equilibrium was a strict saddle point of the gradient dynamics and would therefore be avoided by policy gradient. Of particular interest, when , for all values of and that we tested at least of the LQ games had a global Nash equilibrium with the strict saddle property. In the worst case, around of the LQ games for the given values of , , and admitted such Nash equilibria.

These empirical observations imply that multi-agent policy gradient, even in the relatively straightforward setting of linear dynamics, linear policies, and quadratic costs, would fail to converge to the unique Nash equilibrium in up to one out of four such problems. This suggests that for more complicated cost functions, policy classes, and dynamics, Nash equilibria may often be avoided by policy gradient.

We remark that each point in Figure 1 represents the number of counterexamples found (out of ) for each parameter value, meaning that for and we were able to consistently generate around different examples of games where policy gradient almost surely avoids the only stationary point of the dynamics.

Note also that we were unable to find any counterexamples when was varied in and , . This suggests that depending on the structure of the dynamical system it may be possible to give convergence guarantees.

#### Convergence to Cycles.

Figures 2–4 show the payoffs and parameter values of the two players when they use policy gradient in two general-sum LQ games we identified as being counterexamples for convergence to the Nash equilibrium.

In the two games, we initialize both players in a ball of radius around their Nash equilibrium strategies and let them perform policy gradient with step size . We observe that in both games the players diverge from the Nash equilibrium and converge to limit cycles.

For the two games in Figures 25, the game parameters are such that , , and . The two matrices are defined as follows:

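$$\text{(i):}\ A = \begin{bmatrix} 0.588 & 0.028 \\ 0.570 & 0.056 \end{bmatrix}, \qquad \text{(ii):}\ A = \begin{bmatrix} 0.511 & 0.064 \\ 0.533 & 0.993 \end{bmatrix}. \tag{5}$$

The eigenvalues of the corresponding game Jacobian evaluated at the Nash equilibrium are as follows:

$$\text{(i):}\ \mathrm{spec}(D\omega(K_1^*, K_2^*)) = \{10.88,\ 2.02,\ -0.21,\ -0.06\}, \qquad \text{(ii):}\ \mathrm{spec}(D\omega(K_1^*, K_2^*)) = \{9.76,\ 0.54,\ -0.01 + 0.08j,\ -0.01 - 0.08j\}.$$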

Thus, these games do satisfy the conditions of Corollary 4 for the avoidance of Nash equilibria. We conclude this section by noting that, as shown in Figure 5, the players’ average payoffs do not necessarily converge to the Nash equilibrium payoffs.
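As a quick check of the reported spectra, the following sketch counts eigenvalues with negative and positive real part for games (i) and (ii). Only the eigenvalues themselves come from the text; the helper name `count_signs` is an assumption.

```python
import numpy as np

# Eigenvalues of the game Jacobian at the Nash equilibrium, as reported above.
spectra = {
    "(i)": np.array([10.88, 2.02, -0.21, -0.06], dtype=complex),
    "(ii)": np.array([9.76, 0.54, -0.01 + 0.08j, -0.01 - 0.08j]),
}

def count_signs(eigs):
    """Return (number with negative real part, number with positive real part)."""
    re = np.asarray(eigs).real
    return int(np.sum(re < 0.0)), int(np.sum(re > 0.0))

sign_counts = {name: count_signs(eigs) for name, eigs in spectra.items()}
for name, (n_neg, n_pos) in sign_counts.items():
    print(name, "negative:", n_neg, "positive:", n_pos)
```

In both games the real parts take both signs and none is zero, which is the saddle structure required by the avoidance result.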

## 6 Discussion

We have shown that in the relatively straightforward setting of $N$-player LQ games, agents performing policy gradient have no guarantees of local, and therefore global, convergence to the unique Nash equilibrium even if they randomly initialize their first policies in a small neighborhood of the Nash equilibrium. Since we also showed that the Nash equilibrium is the only critical point of the gradient dynamics, this means that, for this class of games, policy-gradient algorithms have no guarantees of convergence to any set of stationary policies.

Since linear dynamics, quadratic costs, and linear policies are a relatively simple setup compared to many recent deep multi-agent reinforcement learning problems (Bansal et al., 2018; Jaderberg et al., 2019), we believe that the issues of non-convergence are likely to be present in more complex scenarios involving more complex dynamics and parametrizations of the policies. This can be viewed as a cautionary note, but it also suggests that the algorithms that have yielded impressive results in multi-agent settings can be further improved by leveraging the underlying game-theoretic structure.

We remark that we only analyzed the deterministic policy gradient setting, though the findings extend to settings in which players construct unbiased estimates of their gradients (Sutton and Barto, 2017) and even actor-critic methods (Srinivasan et al., 2018). Indeed, all of these algorithms will suffer the same problems since they all seek to track the same limiting continuous-time dynamical system (Mazumdar et al., 2018).

Our numerical experiments also highlight the existence of limit cycles in the policy-gradient dynamics. Unlike in classical optimization settings in which oscillations are normally caused by the choice of step sizes, the cycles we highlight are behaviors that can occur even with arbitrarily small step sizes. They are a fundamental feature of learning in multi-agent settings (Mazumdar et al., 2018). We remark that there is no obvious link between the limit cycles that arise in the gradient dynamics of the LQ games and the Nash equilibrium of the game. Indeed, unlike with other game dynamics in simpler games, such as the well-studied replicator dynamics in bilinear games (Mertikopoulos et al., 2018) or multiplicative weights in rock-paper-scissors (Hommes and Ochea, 2012), the time average of the players’ strategies does not coincide with the Nash equilibrium. This may be due to the fact that the Nash equilibrium is a saddle point of the gradient dynamics and not simply marginally stable, though the issue warrants further investigation.

This paper highlights how algorithms developed for classical optimization or single-agent optimal control settings may not behave as expected in multi-agent and competitive environments. Algorithms and approaches that have provable convergence guarantees and performance in competitive settings, while retaining the scalability and ease of implementation of simple policy-gradient methods, are therefore a crucial and promising open area of research.

## Appendix A Proofs of Auxiliary Results

###### Proposition 4

Consider an $N$-player LQ game. The vector-valued map $\omega$ is twice continuously differentiable on the set of stabilizing policies; i.e., $\omega \in C^2$ on that set.

Proof. Following the proof technique of Zhang et al. (2019), we show the regularity of $\omega$ using the implicit function theorem (Abraham et al., 1988). In particular, we show that $\Sigma_K$ and $P_i$ for $i \in \{1,\ldots,N\}$ are $C^2$ with respect to each $K_i$ on the space of stabilizing matrices.

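For any stabilizing $(K_1,\ldots,K_N)$, $\Sigma_K$ is the unique solution to the following discrete-time Lyapunov equation:

$$\bar{A}\Sigma_K\bar{A}^T + \Sigma_0 = \Sigma_K, \tag{6}$$

where $\bar{A} = A - \sum_{i=1}^{N} B_i K_i$ and $\Sigma_0 = \mathbb{E}_{z_0 \sim \mathcal{D}_o}[z_0 z_0^T]$. Both sides of this expression can be vectorized. Indeed, using the same notation as in Zhang et al. (2019), let $\mathrm{vect}(\cdot)$ be the map that vectorizes its argument and let $\Psi$ be defined by

$$\Psi(\mathrm{vect}(\Sigma_K), K_1,\ldots,K_N) = [\bar{A} \otimes \bar{A}] \cdot \mathrm{vect}(\Sigma_K) + \mathrm{vect}(\Sigma_0).$$

Then, (6) can be written as

$$F(\mathrm{vect}(\Sigma_K), K_1,\ldots,K_N) = \Psi(\mathrm{vect}(\Sigma_K), K_1,\ldots,K_N) - \mathrm{vect}(\Sigma_K) = 0.$$

The map $F$ implicitly defines $\mathrm{vect}(\Sigma_K)$. Moreover, letting $I$ denote the appropriately sized identity matrix, we have that

$$\frac{\partial F(\mathrm{vect}(\Sigma_K), K_1,\ldots,K_N)}{\partial\, \mathrm{vect}^T(\Sigma_K)} = [\bar{A} \otimes \bar{A}] - I.$$

For stabilizing $(K_1,\ldots,K_N)$, this matrix is an isomorphism since $\mathrm{spec}(\bar{A})$ is inside the unit circle, so that no eigenvalue of $\bar{A} \otimes \bar{A}$ equals one. Thus, using the implicit function theorem, we conclude that $\mathrm{vect}(\Sigma_K)$ is $C^2$ in $(K_1,\ldots,K_N)$. As noted in Zhang et al. (2019), the proof for each $P_i$, $i \in \{1,\ldots,N\}$, is completely analogous. Since $P_i$ and $\Sigma_K$ are $C^2$ and $\omega$ is a polynomial function of these terms and of $(K_1,\ldots,K_N)$, the result of the proposition follows.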
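The vectorized form of (6) also gives a direct way to compute $\Sigma_K$ numerically: solve the linear system $(I - \bar{A} \otimes \bar{A})\,\mathrm{vect}(\Sigma_K) = \mathrm{vect}(\Sigma_0)$. The sketch below does this with NumPy for an illustrative stable $\bar{A}$. The function name and the matrix values are assumptions, and NumPy's row-major flattening is used, which leaves the Kronecker identity unchanged for products of the form $\bar{A} X \bar{A}^T$.

```python
import numpy as np

def solve_discrete_lyapunov(A_bar, Sigma0):
    """Solve A_bar @ S @ A_bar.T + Sigma0 = S by vectorizing both sides."""
    n = A_bar.shape[0]
    if np.max(np.abs(np.linalg.eigvals(A_bar))) >= 1.0:
        raise ValueError("closed-loop matrix is not stable")
    lhs = np.eye(n * n) - np.kron(A_bar, A_bar)   # invertible when A_bar is stable
    vec_s = np.linalg.solve(lhs, Sigma0.reshape(-1))
    return vec_s.reshape(n, n)

# Illustrative stable closed-loop matrix and initial covariance (assumed values).
A_bar = np.array([[0.5, 0.1], [0.0, 0.3]])
Sigma0 = np.eye(2)

Sigma_K = solve_discrete_lyapunov(A_bar, Sigma0)
residual = A_bar @ Sigma_K @ A_bar.T + Sigma0 - Sigma_K
print(Sigma_K, np.max(np.abs(residual)))
```

The stability check mirrors the isomorphism argument above: when the spectrum of $\bar{A}$ lies inside the unit circle, the linear system has a unique solution.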

## Appendix B Additional Mathematical Preliminaries and Results

The following theorem is the celebrated center manifold theorem from geometry. We utilize it in showing avoidance of saddle point equilibria of the dynamics.

###### Theorem B

[Center and Stable Manifolds (Shub, 1978, Theorem III.7), Smale (1967)] Let $x_0$ be a fixed point for the $C^r$ local diffeomorphism $\phi: U \to \mathbb{R}^p$ where $U$ is an open neighborhood of $x_0$ in $\mathbb{R}^p$ and $r \ge 1$. Let $E^s \oplus E^c \oplus E^u$ be the invariant splitting of $\mathbb{R}^p$ into generalized eigenspaces of $D\phi(x_0)$ corresponding to eigenvalues of absolute value less than one, equal to one, and greater than one. To the invariant subspace $E^s \oplus E^c$ there is an associated local $\phi$-invariant $C^r$ embedded disc $W^{cs}_{loc}$, called the local stable center manifold, of dimension $\dim(E^s \oplus E^c)$ and a ball $B$ around $x_0$ such that $\phi(W^{cs}_{loc}) \cap B \subset W^{cs}_{loc}$, and if $\phi^k(x) \in B$ for all $k \ge 0$, then $x \in W^{cs}_{loc}$.

## References

• Abraham et al. (1988) R. Abraham, J. E. Marsden, and T. Ratiu. Manifolds, Tensor Analysis, and Applications. Springer, 1988.
• Bansal et al. (2018) T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition. In International Conference on Learning Representations, 2018.
• Basar and Olsder (1998) T. Basar and G. Olsder. Dynamic Noncooperative Game Theory. Society for Industrial and Applied Mathematics, 2nd edition, 1998.
• Dean et al. (2017) S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. ArXiv e-prints, 2017.
• Fazel et al. (2018) M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, 2018.
• Hommes and Ochea (2012) C. H. Hommes and M. I. Ochea. Multiple equilibria and limit cycles in evolutionary games with logit dynamics. Games and Economic Behavior, 74, 2012.
• Jaderberg et al. (2019) M. Jaderberg, W. Czarnecki, I. Dunning, L. Marris, G. Lever, A. Garcia Castañeda, C. Beattie, N. C. Rabinowitz, A. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, and T. Graepel. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364, 2019.
• Kalman (1960) R. E. Kalman. Contributions to the theory of optimal control. Boletin de la Sociedad Matematica Mexicana, 5, 1960.
• Lanctot et al. (2017) M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Perolat, D. Silver, and T. Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems 30. 2017.
• Li and Gajic (1995) T. Li and Z. Gajic. Lyapunov iterations for solving coupled algebraic Riccati equations of Nash differential games and algebraic Riccati equations of zero-sum games. In New Trends in Dynamic Games and Applications, 1995.
• Lowe et al. (2017) R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30. 2017.
• Malik et al. (2019) D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In Proceedings of Machine Learning Research, 2019.
• Mazumdar et al. (2018) E. Mazumdar, L. J. Ratliff, and S. Sastry. On the convergence of gradient-based learning in continuous games. ArXiv e-prints, 2018.
• Mertikopoulos et al. (2018) P. Mertikopoulos, C. H. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, 2018.
• Shub (1978) M. Shub. Global Stability of Dynamical Systems. Springer-Verlag, 1978.
• Smale (1967) S. Smale. Differentiable dynamical systems. Bull. Amer. Math. Soc., 73, 1967.
• Srinivasan et al. (2018) S. Srinivasan, M. Lanctot, V. Zambaldi, J. Perolat, K. Tuyls, R. Munos, and M. Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems 31. 2018.
• Sutton and Barto (2017) R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT press, 2017.
• Zhang et al. (2019) K. Zhang, Z. Yang, and T. Basar. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games, 2019.