# Stable Opponent Shaping in Differentiable Games

A growing number of learning methods are actually games which optimise multiple, interdependent objectives in parallel -- from GANs and intrinsic curiosity to multi-agent RL. Opponent shaping is a powerful approach to improve learning dynamics in such games, accounting for the fact that the 'environment' includes agents adapting to one another's updates. Learning with Opponent-Learning Awareness (LOLA) is a recent algorithm which exploits this dynamic response and encourages cooperation in settings like the Iterated Prisoner's Dilemma. Although experimentally successful, we show that LOLA can exhibit 'arrogant' behaviour directly at odds with convergence. In fact, remarkably few algorithms have theoretical guarantees applying across all differentiable games. In this paper we present Stable Opponent Shaping (SOS), a new method that interpolates between LOLA and a stable variant named LookAhead. We prove that LookAhead locally converges and avoids strict saddles in all differentiable games, the strongest results in the field so far. SOS inherits these desirable guarantees, while also shaping the learning of opponents and consistently either matching or outperforming LOLA experimentally.


## 1 Introduction

#### Problem Setting.

While machine learning has traditionally focused on optimising single objectives, generative adversarial nets (GANs) (Goodfellow et al., 2014) have showcased the potential of architectures dealing with multiple interacting goals. They have since then proliferated substantially, including intrinsic curiosity (Pathak et al., 2017), imaginative agents (Racanière et al., 2017), synthetic gradients (Jaderberg et al., 2017), hierarchical reinforcement learning (RL) (Wayne & Abbott, 2014; Vezhnevets et al., 2017) and multi-agent RL in general (Busoniu et al., 2008).

These can effectively be viewed as differentiable games played by cooperating and competing agents – which may simply be different internal components of a single system, like the generator and discriminator in GANs. The difficulty is that each loss depends on all parameters, including those of other agents. While gradient descent on single functions has been widely successful, converging to local minima under rather mild conditions (Lee et al., 2017), its simultaneous generalisation can fail even in simple two-player, two-parameter zero-sum games. No algorithm has yet been shown to converge, even locally, in all differentiable games.

#### Related Work.

Convergence has been widely studied in convex $n$-player games, see especially Rosen (1965); Facchinei & Kanzow (2007). However, the recent success of non-convex games exemplified by GANs calls for a better understanding of this general class where comparatively little is known. Mertikopoulos & Zhou (2018) recently proved local convergence of no-regret learning to variationally stable equilibria, though under a number of regularity assumptions.

Conversely, a number of algorithms have been successful in the non-convex setting for restricted classes of games. These include policy prediction in two-player two-action bimatrix games (Zhang & Lesser, 2010); WoLF in two-player two-action games (Bowling & Veloso, 2001); AWESOME in repeated games (Conitzer & Sandholm, 2007); Optimistic Mirror Descent in two-player bilinear zero-sum games (Daskalakis et al., 2018) and Consensus Optimisation (CO) in two-player zero-sum games (Mescheder et al., 2017). An important body of work including Heusel et al. (2017); Nagarajan & Kolter (2017) has also appeared for the specific case of GANs.

Working towards bridging this gap, some of the authors recently proposed Symplectic Gradient Adjustment (SGA), see Balduzzi et al. (2018). This algorithm is provably ‘attracted’ to stable fixed points while ‘repelled’ from unstable ones in all differentiable games ($n$-player, non-convex). Nonetheless, these results are weaker than strict convergence guarantees. Moreover, SGA agents may act against their own self-interest by prioritising stability over individual loss. SGA was also discovered independently by Gemp & Mahadevan (2018), drawing on variational inequalities.

In a different direction, Learning with Opponent-Learning Awareness (LOLA) (Foerster et al., 2018) modifies the learning objective by predicting and differentiating through opponent learning steps. This is intuitively appealing and experimentally successful, encouraging cooperation in settings like the Iterated Prisoner’s Dilemma (IPD) where more stable algorithms like SGA defect. However, LOLA has no guarantees of converging or even preserving fixed points of the game.

#### Contribution.

We begin by constructing the first explicit tandem game where LOLA agents adopt ‘arrogant’ behaviour and converge to non-fixed points. We pinpoint the cause of failure and show that a natural variant named LookAhead (LA), discovered before LOLA by Zhang & Lesser (2010), successfully preserves fixed points. We then prove that LookAhead locally converges and avoids strict saddles in all differentiable games, filling a theoretical gap in multi-agent learning. This is enabled through a unified approach based on fixed-point iterations and dynamical systems. These techniques apply equally well to algorithms like CO and SGA, though this is not our present focus.

While LookAhead is theoretically robust, the shaping component endowing LOLA with a capacity to exploit opponent dynamics is lost. We solve this dilemma with an algorithm named Stable Opponent Shaping (SOS), trading between stability and exploitation by interpolating between LookAhead and LOLA. Using an intuitive and theoretically grounded criterion for this interpolation parameter, SOS inherits both strong convergence guarantees from LA and opponent shaping from LOLA.

On the experimental side, we show that SOS plays tit-for-tat in the IPD on par with LOLA, while all other methods mostly defect. We display the practical consequences of our theoretical guarantees in the tandem game, where SOS always outperforms LOLA. Finally we implement a more involved GAN setup, testing for mode collapse and mode hopping when learning Gaussian mixture distributions. SOS successfully spreads mass across all Gaussians, at least matching dedicated algorithms like CO, while LA is significantly slower and simultaneous gradient descent fails entirely.

## 2 Background

### 2.1 Differentiable games

We frame the problem of multi-agent learning as a game. Adapted from Balduzzi et al. (2018), the following definition insists only on differentiability for gradient-based methods to apply. This concept is strictly more general than stochastic games, whose parameters are usually restricted to action-state transition probabilities or functional approximations thereof.

###### Definition 1.

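A differentiable game is a set of $n$ players with parameters $\theta = (\theta^1, \ldots, \theta^n) \in \mathbb{R}^d$ and twice continuously differentiable losses $L^i : \mathbb{R}^d \to \mathbb{R}$, where $\theta^i \in \mathbb{R}^{d_i}$ for each $i$ and $\sum_i d_i = d$.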

Crucially, note that each loss is a function of all parameters. From the viewpoint of player $i$, parameters can be written as $\theta = (\theta^i, \theta^{-i})$ where $\theta^{-i}$ contains all other players’ parameters. We do not make the common assumption that each $L^i$ is convex as a function of $\theta^i$ alone, for any fixed opponent parameters $\theta^{-i}$, nor do we restrict $\theta$ to the probability simplex – though this restriction can be recovered via projection or sigmoid functions. If $n = 1$, the ‘game’ is simply to minimise a given loss function. In this case one can reach local minima by (possibly stochastic) gradient descent (GD). For arbitrary $n$, the standard solution concept is that of Nash equilibria.

###### Definition 2.

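A point $\bar\theta$ is a (local) Nash equilibrium if for each $i$, there are neighbourhoods $U_i$ of $\bar\theta^i$ such that $L^i(\theta^i, \bar\theta^{-i}) \geq L^i(\bar\theta)$ for all $\theta^i \in U_i$. In other words, each player’s strategy is a local best response to current opponent strategies.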

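We write $\nabla_i L^k = \partial L^k / \partial \theta^i$ and $\nabla_{ij} L^k = \partial^2 L^k / \partial \theta^j \partial \theta^i$ for any $i, j, k$. Define the simultaneous gradient of the game as the concatenation of each player’s gradient,

$$\xi = (\nabla_1 L^1, \ldots, \nabla_n L^n)^\top \in \mathbb{R}^d.$$

The $i$th component of $\xi$ is the direction of greatest increase in $L^i$ with respect to $\theta^i$. If each agent minimises their loss independently from others, they perform GD on their component $\nabla_i L^i$ with learning rate $\alpha_i$. Hence, the parameter update for all agents is given by $\theta \leftarrow \theta - \alpha \odot \xi$, where $\alpha = (\alpha_1, \ldots, \alpha_n)^\top$ and $\odot$ is element-wise multiplication. This is also called naive learning (NL), reducing to $\theta \leftarrow \theta - \alpha \xi$ if agents have the same learning rate. This is assumed for notational simplicity, though irrelevant to our results. The following example shows that NL can fail to converge.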

###### Example 1.

Consider $L^1 = xy$ and $L^2 = -xy$, where players control the $x$ and $y$ parameters respectively. The origin is a (global and unique) Nash equilibrium. The simultaneous gradient is $\xi = (y, -x)$ and cycles around the origin. Explicitly, a gradient step from $(x, y)$ yields

$$(x, y) \leftarrow (x, y) - \alpha (y, -x) = (x - \alpha y,\; y + \alpha x)$$

which has squared distance from the origin $(1 + \alpha^2)(x^2 + y^2) > x^2 + y^2$ for any $\alpha > 0$ and $(x, y) \neq 0$. It follows that agents diverge away from the origin for any $\alpha > 0$. The cause of failure is that $\xi$ is not the gradient of a single function, implying that each agent’s loss is inherently dependent on others. This results in a contradiction between the non-stationarity of each agent, and the optimisation of each loss independently from others. Failure of convergence in this simple two-player zero-sum game shows that gradient descent does not generalise well to differentiable games. We consider an alternative solution concept to Nash equilibria before introducing LOLA.
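The divergence in Example 1 can be checked numerically. The following is a minimal sketch, assuming a shared learning rate of 0.1 and an arbitrary starting point; the function name `naive_step` is illustrative and not taken from the paper.

```python
import numpy as np

def naive_step(theta, alpha):
    """One naive learning step on the game L1 = xy, L2 = -xy."""
    x, y = theta
    xi = np.array([y, -x])  # simultaneous gradient (dL1/dx, dL2/dy)
    return theta - alpha * xi

alpha = 0.1                      # assumed learning rate
theta = np.array([1.0, 0.0])     # assumed starting point
norms = [np.linalg.norm(theta)]
for _ in range(100):
    theta = naive_step(theta, alpha)
    norms.append(np.linalg.norm(theta))

# Each step multiplies the distance to the origin by sqrt(1 + alpha^2).
growth = np.sqrt(1.0 + alpha ** 2)
```

The recorded norms increase at every step by the factor `growth`, so no positive learning rate yields convergence in this game.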

### 2.2 Stable fixed points

Consider the game given by $L^1 = L^2 = xy$ where players control the $x$ and $y$ parameters respectively. The optimal solution is $(x, y) \to \pm(\infty, -\infty)$, since then $L^1 = L^2 \to -\infty$. However the origin is a global Nash equilibrium, while also a saddle point of $xy$. It is highly undesirable to converge to the origin in this game, since infinitely better losses can be reached in the anti-diagonal direction. In this light, Nash equilibria cannot be the right solution concept to aim for in multi-agent learning. To define stable fixed points, first introduce the ‘Hessian’ of the game as the block matrix

$$H = \nabla \xi = \begin{pmatrix} \nabla_{11} L^1 & \cdots & \nabla_{1n} L^1 \\ \vdots & \ddots & \vdots \\ \nabla_{n1} L^n & \cdots & \nabla_{nn} L^n \end{pmatrix} \in \mathbb{R}^{d \times d}.$$

This can equivalently be viewed as the Jacobian of the vector field $\xi$. Importantly, note that $H$ is not symmetric in general unless $n = 1$, in which case we recover the usual Hessian $H = \nabla^2 L$.

###### Definition 3.

A point $\bar\theta$ is a fixed point if $\xi(\bar\theta) = 0$. It is stable if $H(\bar\theta) \succeq 0$, unstable if $H(\bar\theta) \prec 0$ and a strict saddle if $H(\bar\theta)$ has an eigenvalue with negative real part.

The name ‘fixed point’ is coherent with GD, since $\xi(\bar\theta) = 0$ implies a fixed update $\bar\theta \leftarrow \bar\theta - \alpha \xi(\bar\theta) = \bar\theta$. Though Nash equilibria were shown to be inadequate above, it is not obvious that stable fixed points (SFPs) are a better solution concept. In Appendix A we provide intuition for why SFPs are both closer to local minima in the context of multi-loss optimisation, and more tractable for convergence proofs. Moreover, this definition is an improved variant on that in Balduzzi et al. (2018), assuming positive semi-definiteness only at $\bar\theta$ instead of holding in a neighbourhood. This makes the class of SFPs as large as possible, while sufficient for all our theoretical results.

Assuming invertibility of $H(\bar\theta)$ at SFPs is crucial to all convergence results in this paper. The same assumption is present in related work including Mescheder et al. (2017), and cannot be avoided. Even for single losses, a fixed point with singular Hessian can be a local minimum, maximum, or saddle point. Invertibility is thus necessary to ensure that SFPs really are ‘local minima’. This is omitted from now on. Finally note that unstable fixed points are a subset of strict saddles, making Theorem 6 both stronger and more general than results for SGA by Balduzzi et al. (2018).
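Definition 3 can be turned into a small numerical check. The sketch below assumes that definiteness of a non-symmetric matrix $H$ is read as the sign of $u^\top H u$, which depends only on the symmetric part of $H$; the helper name `classify_fixed_point` and the tolerance are illustrative choices.

```python
import numpy as np

def classify_fixed_point(H, tol=1e-9):
    """Classify a fixed point from the game Hessian H evaluated there."""
    H = np.asarray(H, dtype=float)
    # u^T H u only depends on the symmetric part of H.
    sym_eigs = np.linalg.eigvalsh((H + H.T) / 2)
    eigs = np.linalg.eigvals(H)
    return {
        "stable": bool(sym_eigs.min() >= -tol),         # H positive semi-definite
        "unstable": bool(sym_eigs.max() < -tol),        # H negative definite
        "strict_saddle": bool(eigs.real.min() < -tol),  # eigenvalue with negative real part
        "invertible": bool(abs(np.linalg.det(H)) > tol),
    }

# Origin of the zero-sum game L1 = xy, L2 = -xy (Example 1).
zero_sum = classify_fixed_point([[0, 1], [-1, 0]])
# Origin of the common-interest game L1 = L2 = xy.
common = classify_fixed_point([[0, 1], [1, 0]])
```

Under this reading the origin of Example 1 is a stable fixed point with invertible Hessian, while the origin of $L^1 = L^2 = xy$ is a strict saddle even though it is a Nash equilibrium.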

### 2.3 Learning with opponent-learning awareness (LOLA)

Accounting for nonstationarity, Learning with Opponent-Learning Awareness (LOLA) modifies the learning objective by predicting and differentiating through opponent learning steps (Foerster et al., 2018). For simplicity, if $n = 2$ then agent 1 optimises $L^1(\theta^1, \theta^2 + \Delta\theta^2)$ with respect to $\theta^1$, where $\Delta\theta^2$ is the predicted learning step for agent 2. Foerster et al. (2018) assume that opponents are naive learners, namely $\Delta\theta^2 = -\alpha_2 \nabla_2 L^2$. After first-order Taylor expansion, the loss is approximately given by $L^1 + \nabla_2 L^1 \cdot \Delta\theta^2$. By minimising this quantity, agent 1 learns parameters that align the opponent learning step $\Delta\theta^2$ with the direction of greatest decrease in $L^1$, exploiting opponent dynamics to further reduce one’s losses. Differentiating with respect to $\theta^1$, the adjustment is

$$\nabla_1 L^1 + (\nabla_{21} L^1)^\top \Delta\theta^2 + (\nabla_1 \Delta\theta^2)^\top \nabla_2 L^1.$$

By explicitly differentiating through $\Delta\theta^2$ in the rightmost term, LOLA agents actively shape opponent learning. This has proven effective in reaching cooperative equilibria in multi-agent learning, finding success in a number of games including tit-for-tat in the IPD. The middle term above was originally dropped by the authors because “LOLA focuses on this shaping of the learning direction of the opponent”. We choose not to eliminate this term, as it is also inherent in LOLA-DiCE (Foerster et al., 2018). Preserving both terms will in fact be key to developing stable opponent shaping.

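First we formulate $n$-player LOLA in vectorial form. Let $H_d$ and $H_o$ be the matrices of diagonal and anti-diagonal (off-diagonal) blocks of $H$, so that $H = H_d + H_o$. Also define $L = (L^1, \ldots, L^n)$ and the operator $\operatorname{diag}$ constructing a vector from the block matrix diagonal, namely $\operatorname{diag}(M)_i = M_{ii}$. The shaping term is written $\chi = \operatorname{diag}(H_o^\top \nabla L)$.

###### Proposition 1 (Appendix B).

$$\text{LOLA} = (I - \alpha H_o)\xi - \alpha \chi.$$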

While experimentally successful, LOLA fails to preserve fixed points of the game since

$$(I - \alpha H_o)\xi(\bar\theta) - \alpha \chi(\bar\theta) = -\alpha \chi(\bar\theta) \neq 0$$

in general. Even if $\bar\theta$ is a Nash equilibrium, the update $\bar\theta \leftarrow \bar\theta - \alpha\,\text{LOLA}$ can push them away despite parameters being optimal. This may worsen the losses for all agents, as in the game below.

###### Example 2 (Tandem).

Imagine a tandem controlled by agents facing opposite directions, who feed $x$ and $y$ force into their pedals respectively. Negative numbers correspond to pedalling backwards.

Moving coherently requires $x \approx -y$, embodied by a quadratic loss $(x + y)^2$. However it is easier for agents to pedal forwards, translated by linear losses $-2x$ and $-2y$. The game is thus given by $L^1(x, y) = (x + y)^2 - 2x$ and $L^2(x, y) = (x + y)^2 - 2y$. These sub-goals are incompatible, so agents cannot simply accelerate forwards. The SFPs are given by $\{x + y = 1\}$. Computing $\chi(x, 1 - x) = (4, 4) \neq 0$, none of these are preserved by LOLA. Instead, we show in Appendix C that LOLA can only converge to sub-optimal scenarios with worse losses for both agents, for any $\alpha$.

Intuitively, the root of failure is that LOLA agents try to shape opponent learning and enforce compliance by accelerating forwards, assuming a dynamic response from their opponent. The other agent does the same, so they become ‘arrogant’ and suffer by pushing strongly in opposite directions.
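The tandem game is small enough to verify these claims directly. The sketch below assumes the losses $L^1 = (x+y)^2 - 2x$ and $L^2 = (x+y)^2 - 2y$, a learning rate of 0.1 and a start at the fixed point $(0.5, 0.5)$; the closed forms for $\xi$ and $\chi$ are derived by hand for this game, and all names are illustrative.

```python
import numpy as np

H_O = np.array([[0.0, 2.0], [2.0, 0.0]])  # off-diagonal part of the tandem Hessian

def losses(theta):
    x, y = theta
    return np.array([(x + y) ** 2 - 2 * x, (x + y) ** 2 - 2 * y])

def xi(theta):
    s = theta.sum()
    return np.array([2 * s - 2, 2 * s - 2])  # (dL1/dx, dL2/dy)

def chi(theta):
    s = theta.sum()
    # chi_1 = d/dx(dL2/dy) * dL1/dy = 2 * 2(x + y); chi_2 is symmetric.
    return np.array([4 * s, 4 * s])

def lola(theta, alpha):
    return (np.eye(2) - alpha * H_O) @ xi(theta) - alpha * chi(theta)

alpha = 0.1                      # assumed learning rate
sfp = np.array([0.5, 0.5])       # a stable fixed point, since x + y = 1
theta = sfp.copy()
for _ in range(500):
    theta = theta - alpha * lola(theta, alpha)
```

Starting exactly at a stable fixed point, the LOLA update is non-zero and the iterates settle at $x + y = (1 - 2\alpha)/(1 - 4\alpha)$ for this learning rate, where both agents incur a higher loss than at the point they started from.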

## 3 Method

### 3.1 LookAhead

The shaping term prevents LOLA from preserving fixed points. Consider removing this component entirely, giving $(I - \alpha H_o)\xi$. This variant preserves fixed points, but what does it mean from the perspective of each agent? Note that LOLA optimises $L^1(\theta^1, \theta^2 + \Delta\theta^2)$ with respect to $\theta^1$, while $\Delta\theta^2$ is a function of $\theta^1$. In other words, we assume that our opponent’s learning step $\Delta\theta^2$ depends on our current optimisation with respect to $\theta^1$. This is inaccurate, since opponents cannot see our updated parameters until the next step. Instead, assume we optimise $L^1(\theta^1, \hat\theta^2 + \Delta\theta^2(\hat\theta^1, \hat\theta^2))$ where $\hat\theta^1, \hat\theta^2$ are the current parameters. After Taylor expansion, the gradient with respect to $\theta^1$ is given by

$$\nabla_1 L^1 + (\nabla_{21} L^1)^\top \Delta\theta^2$$

since $\Delta\theta^2(\hat\theta^1, \hat\theta^2)$ does not depend on $\theta^1$. In vectorial form, we recover the variant $(I - \alpha H_o)\xi$ since the shaping term corresponds precisely to differentiating through $\Delta\theta^2$. We name this LookAhead, which was discovered before LOLA by Zhang & Lesser (2010) though not explicitly named. Using the stop-gradient operator $\bot$ (implemented in TensorFlow as `stop_gradient` and in PyTorch as `detach`), this can be reformulated as optimising $L^1(\theta^1, \theta^2 + \bot\Delta\theta^2)$ where $\bot$ prevents gradient flowing from $\Delta\theta^2$ upon differentiation.

The main result of Zhang & Lesser (2010) is that LookAhead converges to Nash equilibria in the small class of two-player, two-action bimatrix games. We will prove local convergence to SFP and non-convergence to strict saddles in all differentiable games. On the other hand, by discarding the problematic shaping term, we also eliminated LOLA’s capacity to exploit opponent dynamics and encourage cooperation. This will be witnessed in the IPD, where LookAhead agents mostly defect.
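The difference between freezing and differentiating through the opponent step can be illustrated without an autodiff library. The following sketch uses the tandem game from Example 2 and central finite differences, which are accurate up to rounding for quadratic losses; freezing `delta` before differentiating plays the role of the stop-gradient operator. The function names, learning rate and evaluation point are assumptions made for illustration.

```python
def loss1(x, y):
    """Agent 1's tandem loss."""
    return (x + y) ** 2 - 2 * x

def opponent_step(x, y, alpha):
    """Naive learning step of agent 2: -alpha * dL2/dy with L2 = (x + y)^2 - 2y."""
    return -alpha * (2 * (x + y) - 2)

def grad_x(f, x, eps=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def lookahead_grad(x, y, alpha):
    delta = opponent_step(x, y, alpha)  # computed once, then held fixed
    return grad_x(lambda u: loss1(u, y + delta), x)

def lola_grad(x, y, alpha):
    # The opponent step is recomputed as u varies, so the derivative flows through it.
    return grad_x(lambda u: loss1(u, y + opponent_step(u, y, alpha)), x)

alpha = 0.1
x, y = 0.3, 0.7                 # a stable fixed point, since x + y = 1
g_la = lookahead_grad(x, y, alpha)
g_lola = lola_grad(x, y, alpha)
```

At this fixed point the frozen variant returns a vanishing gradient, whereas differentiating through the opponent step leaves a residual equal to $-\alpha\chi_1 = -0.4$, which is the term that pushes LOLA agents away.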

### 3.2 Stable opponent shaping (SOS)

We propose Stable Opponent Shaping (SOS), an algorithm preserving both advantages at once. Define the partial stop-gradient operator $\bot^p := p\bot + (1 - p)I$, where $I$ is the identity and $p$ stands for partial. A $p$-LOLA agent optimises the modified objective

$$L^1(\theta^1, \theta^2 + \bot^{1-p}\Delta\theta^2, \ldots, \theta^n + \bot^{1-p}\Delta\theta^n),$$

collapsing to LookAhead at $p = 0$ and LOLA at $p = 1$. The resulting gradient is given by

$$\xi_p := p\text{-LOLA} = (I - \alpha H_o)\xi - p\alpha\chi$$

with $\xi_0 = \text{LA}$. We obtain an algorithm trading between shaping and stability as a function of $p$. Note however that preservation of fixed points only holds if $p$ is infinitesimal, in which case $p$-LOLA is almost identical to LookAhead – losing the very purpose of interpolation. Instead we propose a two-part criterion for $p$ at each learning step, through which all guarantees descend.

First choose $p$ such that $\xi_p$ points in the same direction as LookAhead. This will not be enough to prove convergence itself, but prevents arrogant behaviour by ensuring convergence only to fixed points. Formally, the first criterion is given by $\langle \xi_p, \xi_0 \rangle \geq 0$. If $\langle -\alpha\chi, \xi_0 \rangle \geq 0$ then $\langle \xi_p, \xi_0 \rangle \geq 0$ automatically, so we choose $p = 1$ for maximal shaping. Otherwise choose

$$p = \min\left\{1, \frac{-a\|\xi_0\|^2}{\langle -\alpha\chi, \xi_0 \rangle}\right\}$$

with any hyperparameter $0 < a < 1$. This guarantees a positive inner product

$$\langle \xi_p, \xi_0 \rangle = p\langle -\alpha\chi, \xi_0 \rangle + \|\xi_0\|^2 \geq -a\|\xi_0\|^2 + \|\xi_0\|^2 = \|\xi_0\|^2(1 - a) > 0.$$

We complement this with a second criterion ensuring local convergence. The idea is to scale $p$ by a function of $\|\xi\|$ if $\|\xi\|$ is small enough, which certainly holds in neighbourhoods of fixed points. Let $0 < b < 1$ be a hyperparameter and take $p = \|\xi\|^2$ if $\|\xi\| < b$, otherwise $p = 1$. Choosing $p_1$ and $p_2$ according to these criteria, the two-part criterion is $p = \min\{p_1, p_2\}$. SOS is obtained by combining $p$-LOLA with this criterion, as summarised in Algorithm 1. Crucially, all theoretical results in the next section are independent from the choice of hyperparameters $a$ and $b$.
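The two-part criterion can be sketched in a few lines. This is an illustrative reading of the description above rather than the authors' Algorithm 1 verbatim: the function `sos_gradient`, the values $a = 0.5$, $b = 0.1$, $\alpha = 0.1$ and the starting point are assumptions, and the tandem game from Example 2 is used because its $\xi$, $H_o$ and $\chi$ have simple closed forms.

```python
import numpy as np

def sos_gradient(xi, H_o, chi, alpha, a=0.5, b=0.1):
    """Return the p-LOLA gradient xi_p and the chosen p for one SOS step."""
    xi_0 = (np.eye(len(xi)) - alpha * H_o) @ xi   # LookAhead gradient
    shaping = -alpha * chi
    inner = shaping @ xi_0
    # First criterion: keep a non-negative inner product with LookAhead.
    if inner >= 0:
        p1 = 1.0
    else:
        p1 = min(1.0, -a * (xi_0 @ xi_0) / inner)
    # Second criterion: shrink p near fixed points, where ||xi|| is small.
    xi_norm = np.linalg.norm(xi)
    p2 = xi_norm ** 2 if xi_norm < b else 1.0
    p = min(p1, p2)
    return xi_0 + p * shaping, p

# Tandem game: L1 = (x + y)^2 - 2x, L2 = (x + y)^2 - 2y.
H_O = np.array([[0.0, 2.0], [2.0, 0.0]])
alpha = 0.1
theta = np.array([-0.3, 0.2])   # assumed initialisation
for _ in range(500):
    s = theta.sum()
    xi = np.array([2 * s - 2, 2 * s - 2])
    chi = np.array([4 * s, 4 * s])
    xi_p, p = sos_gradient(xi, H_O, chi, alpha)
    theta = theta - alpha * xi_p
```

In this run $p$ decays towards zero as the iterates approach the line $x + y = 1$, so the update reduces to LookAhead near the fixed point while still using the shaping term far from it.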

## 4 Theoretical Results

Our central theoretical contribution is that LookAhead and SOS converge locally to SFP and avoid strict saddles in all differentiable games. Since the learning gradients involve second-order Hessian terms, our results assume thrice continuously differentiable losses (omitted hereafter). Losses which are $C^2$ but not $C^3$ are very degenerate, so this is a mild assumption. Statements made about SOS crucially hold for any hyperparameters $a, b \in (0, 1)$. See Appendices D and E for detailed proofs.

### 4.1 Local convergence to stable fixed points

Convergence is proved using Ostrowski’s Theorem. This reduces convergence of a gradient adjustment $g$ to positive stability (eigenvalues with positive real part) of $\nabla g$ at stable fixed points.

###### Theorem 2.

Let $H \succeq 0$ be invertible with symmetric diagonal blocks. Then there exists $\epsilon > 0$ such that $(I - \alpha H_o)H$ is positive stable for all $0 < \alpha < \epsilon$.

This type of result would usually be proved either by analytical means showing positive definiteness and hence positive stability, or direct eigenvalue analysis. We show in Appendix D that $(I - \alpha H_o)H$ is not necessarily positive definite, while there is no necessary relationship between eigenpairs of $H$ and $(I - \alpha H_o)H$. This makes our theorem all the more interesting and non-trivial. We use a similarity transformation trick to circumvent the dual obstacle, allowing for analysis of positive definiteness with respect to a new inner product. We obtain positive stability by invariance under change of basis.

###### Corollary 3.

LookAhead converges locally to stable fixed points for $\alpha > 0$ sufficiently small.
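Theorem 2 and Corollary 3 can be illustrated on the zero-sum game of Example 1, where naive learning diverges. The sketch below assumes $\alpha = 0.1$ and an arbitrary starting point; because the losses are bilinear, $\xi = H\theta$ exactly, so the LookAhead iteration is linear.

```python
import numpy as np

H = np.array([[0.0, 1.0], [-1.0, 0.0]])  # Hessian of L1 = xy, L2 = -xy
H_o = H.copy()                            # the diagonal blocks are zero here

def lookahead_jacobian(alpha):
    """Jacobian (I - alpha H_o) H of the LookAhead gradient at the origin."""
    return (np.eye(2) - alpha * H_o) @ H

alpha = 0.1
eigs = np.linalg.eigvals(lookahead_jacobian(alpha))

# LookAhead iteration: theta <- theta - alpha (I - alpha H_o) xi, with xi = H theta.
theta = np.array([1.0, 0.0])
for _ in range(2000):
    theta = theta - alpha * lookahead_jacobian(alpha) @ theta
final_norm = np.linalg.norm(theta)
```

Here $(I - \alpha H_o)H = H + \alpha I$, whose eigenvalues $\alpha \pm i$ have positive real part, and the iterates spiral slowly into the origin instead of away from it.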

Using the second criterion for $p$, we prove local convergence of SOS in all differentiable games despite the presence of a shaping term (unlike LOLA).

###### Theorem 4.

SOS converges locally to stable fixed points for $\alpha > 0$ sufficiently small.

Using the first criterion for $p$, we prove that SOS only converges to fixed points (unlike LOLA).

###### Proposition 5.

If SOS converges to $\bar\theta$ and $\alpha > 0$ is small then $\bar\theta$ is a fixed point of the game.

Now assume that $\theta$ is initialised randomly (or with arbitrarily small noise), as is standard in ML. Let $F(\theta) = \theta - \alpha\xi_p(\theta)$ be the SOS iteration. Using both the second criterion and the Stable Manifold Theorem from dynamical systems, we can prove that every strict saddle $\bar\theta$ has a neighbourhood $U$ such that $\{\theta \in U \mid F^n(\theta) \to \bar\theta \text{ as } n \to \infty\}$ has measure zero for $\alpha > 0$ sufficiently small. Since $\theta$ is initialised randomly, we obtain the following result.

###### Theorem 6.

SOS locally avoids strict saddles almost surely, for $\alpha > 0$ sufficiently small.

This also holds for LookAhead, and could be strengthened to global initialisations provided a strong boundedness assumption on $\|H\|$. This is trickier for SOS since $p(\theta)$ is not globally continuous. Altogether, our results for LookAhead and the correct criterion for $p$-LOLA lead to some of the strongest theoretical guarantees in multi-agent learning. Furthermore, SOS retains all of LOLA’s opponent shaping capacity while LookAhead does not, as shown experimentally in the next section.
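Saddle avoidance can be seen concretely for LookAhead on the game $L^1 = L^2 = xy$, whose origin is a strict saddle. The sketch below assumes $\alpha = 0.1$, 200 steps and a small Gaussian initialisation with a fixed seed; the helper name `lookahead_run` is illustrative.

```python
import numpy as np

H = np.array([[0.0, 1.0], [1.0, 0.0]])  # Hessian of the game L1 = L2 = xy
H_o = H.copy()                           # the diagonal blocks are zero here

def lookahead_run(theta, alpha=0.1, steps=200):
    """Iterate LookAhead; the losses are bilinear, so xi = H theta exactly."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(steps):
        xi = H @ theta
        theta = theta - alpha * (np.eye(2) - alpha * H_o) @ xi
    return theta

# Initialisations on the diagonal x = y form the measure-zero stable manifold.
on_manifold = lookahead_run([1.0, 1.0])

# A random initialisation near the saddle almost surely has an anti-diagonal component.
rng = np.random.default_rng(0)
random_init = rng.normal(scale=1e-3, size=2)
off_manifold = lookahead_run(random_init)
```

The diagonal component contracts by a factor $1 + \alpha^2 - \alpha$ per step while the anti-diagonal component expands by $1 + \alpha^2 + \alpha$, so only initialisations on the line $x = y$ are drawn into the saddle.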

## 5 Experiments and Discussion

We evaluate the performance of SOS in three differentiable games. We first showcase opponent shaping and superiority over LA/CO/SGA/NL in the Iterated Prisoner’s Dilemma (IPD). This leaves SOS and LOLA, which have differed only in theory up to now. We bridge this gap by showing that SOS always outperforms LOLA in the tandem game, avoiding arrogant behaviour by decaying $p$ while LOLA overshoots. Finally we test SOS on a more involved GAN learning task, with results similar to dedicated methods like Consensus Optimisation.

### 5.1 Experimental setup

#### IPD:

This game is an infinite sequence of the well-known Prisoner’s Dilemma, where the payoff is discounted by a factor $\gamma$ at each iteration. Agents are endowed with a memory of actions at the previous state. Hence there are 5 parameters for each agent $i$: the probability of cooperating at the start state or at each of the four states CC, CD, DC and DD given by the previous joint action. One Nash equilibrium is to always defect (DD). A better equilibrium, with lower loss, is named tit-for-tat (TFT), where each player begins by cooperating and then mimics the opponent’s previous action.
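The discounted loss of two memory-one policies can be computed in closed form from the induced Markov chain. The sketch below is a hedged illustration: the per-round losses (1 for mutual cooperation, 2 for mutual defection, 3 and 0 for the exploited and exploiting player) and the discount $\gamma = 0.96$ are assumptions chosen for the example, not values taken from the text above, and `ipd_losses` is a hypothetical helper name.

```python
import numpy as np

def ipd_losses(pi1, pi2, gamma=0.96):
    """Normalised discounted losses for two memory-one IPD policies.

    pi1, pi2: probabilities of cooperating in states (start, CC, CD, DC, DD),
    where the first letter is agent 1's previous action.
    """
    pi1, pi2 = np.asarray(pi1, float), np.asarray(pi2, float)
    # Probability of each joint action (CC, CD, DC, DD) from every state.
    joint = np.stack([pi1 * pi2, pi1 * (1 - pi2),
                      (1 - pi1) * pi2, (1 - pi1) * (1 - pi2)], axis=1)
    p0, P = joint[0], joint[1:]   # first-round distribution, 4x4 transition matrix
    # Discounted occupancy: sum_t gamma^t p0^T P^t = p0^T (I - gamma P)^{-1}.
    occupancy = np.linalg.solve((np.eye(4) - gamma * P).T, p0)
    loss1 = np.array([1.0, 3.0, 0.0, 2.0])  # assumed per-round losses for agent 1
    loss2 = np.array([1.0, 0.0, 3.0, 2.0])  # agent 2 sees CD and DC swapped
    scale = 1 - gamma                        # normalise to a per-round average
    return scale * occupancy @ loss1, scale * occupancy @ loss2

always_defect = [0, 0, 0, 0, 0]
tft_1 = [1, 1, 0, 1, 0]   # agent 1 cooperates iff agent 2 cooperated last round
tft_2 = [1, 1, 1, 0, 0]   # agent 2 cooperates iff agent 1 cooperated last round

dd = ipd_losses(always_defect, always_defect)
tft = ipd_losses(tft_1, tft_2)
```

With these assumed payoffs, mutual defection gives each agent a normalised loss of 2 while tit-for-tat against itself gives 1. Since the loss is a differentiable function of the policy parameters, the IPD fits the differentiable game framework.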

We run 300 training episodes for SOS, LA, CO, SGA and NL. The parameters are initialised following a unit-variance normal distribution around a fixed probability of cooperation. The learning rate $\alpha$ and discount factor $\gamma$ follow Foerster et al. (2018). For SOS, the hyperparameter $a$ is set to a robust and arbitrary middle ground, while $b$ is intentionally small to avoid poor SFP.

#### Tandem:

Though local convergence is guaranteed for SOS, it is possible that SOS diverges from poor initialisations. This turns out to be impossible in the tandem game since the Hessian is globally positive semi-definite. We show this explicitly by running 300 training episodes for SOS and LOLA. Parameters are initialised following a normal distribution around the origin. We found performance to be robust to the hyperparameters $a$ and $b$.

#### Gaussian mixtures:

We reproduce a setup from Balduzzi et al. (2018). The game is to learn a Gaussian mixture distribution using GANs. Data is sampled from a highly multimodal distribution designed to probe the tendency to collapse onto a subset of modes during training – see ground truth in Appendix F. The generator and discriminator networks each have 6 ReLU layers of 384 neurons, with 2 and 1 output neurons respectively. Learning rates are chosen by grid search at iteration 8k, with $a$ and $b$ for SOS chosen following the same reasoning as in the IPD.

### 5.2 Results and discussion

#### IPD:

Results are given in Figure 2. Parameters in part (A) are the end-run probabilities of cooperating for each memory state, encoded in different colours. Only 50 runs are shown for visibility. Losses at each step are displayed in part (B), averaged across 300 episodes with shaded deviations.

SOS and LOLA mostly succeed in playing tit-for-tat, displayed by the accumulation of points in the correct corners of (A) plots. For instance, CC and CD points are mostly in the top right and left corners so agent 2 responds to cooperation with cooperation. Agents also cooperate at the start state, represented by points all hidden in the top right corner. The tit-for-tat strategy is further indicated by the losses in part (B), which lie close to the tit-for-tat value. On the other hand, most points for LA/CO/SGA/NL are accumulated at the bottom left, so agents mostly defect. This results in poor losses, demonstrating the limited effectiveness of recent proposals like SGA and CO. Finally note that trained parameters and losses for SOS are almost identical to those for LOLA, displaying equal capacity in opponent shaping while also inheriting convergence guarantees and outperforming LOLA in the next experiment.

#### Tandem:

Results are given in Figure 3. SOS always succeeds in decreasing $p$ to reach the correct equilibria. LOLA fails to preserve fixed points, overshooting with worse average losses. The criterion for SOS is shown in action in part (B), with $p$ decaying to avoid overshooting. This illustrates that purely theoretical guarantees translate into practical outperformance. Note that SOS even gets away from the LOLA fixed points if initialised there (not shown), converging to improved losses using the alignment criterion with LookAhead.

#### Gaussian mixtures:

The generator distribution and KL divergence are given at 2k, 4k, 6k, 8k iterations for NL, CO and SOS in Figure 4. Results for SGA, LOLA and LA are in Appendix F. SOS achieves convincing results by spreading mass across all Gaussians, as do CO/SGA/LOLA. LookAhead is significantly slower, while NL fails through mode collapse and hopping. Only visual inspection was used for comparison by Balduzzi et al. (2018), while KL divergence gives stronger numerical evidence here. SOS and CO are slightly superior to others with reference to this metric. However CO is aimed specifically toward two-player zero-sum GAN optimisation, while SOS is widely applicable with strong theoretical guarantees in all differentiable games.

## 6 Conclusion

Theoretical results in machine learning have significantly helped understand the causes of success and failure in applications, from optimisation to architecture. While gradient descent on single losses has been studied extensively, algorithms dealing with interacting goals are proliferating, with little grasp of the underlying dynamics. The analysis behind CO and SGA has been helpful in this respect, though lacking either in generality or convergence guarantees. The first contribution of this paper is to provide a unified framework and fill this theoretical gap with robust convergence results for LookAhead in all differentiable games. Capturing stable fixed points as the correct solution concept was essential for these techniques to apply.

Furthermore, we showed that opponent shaping is both a powerful approach leading to experimental success and cooperative behaviour – while at the same time preventing LOLA from preserving fixed points in general. This conundrum is solved through a robust interpolation between LookAhead and LOLA, giving birth to SOS through a robust criterion. This was partially enabled by choosing to preserve the ‘middle’ term in LOLA, and using it to inherit stability from LookAhead. This results in convergence guarantees stronger than all previous algorithms, but also in practical superiority over LOLA in the tandem game. Moreover, SOS fully preserves opponent shaping and outperforms SGA, CO, LA and NL in the IPD by encouraging tit-for-tat policy instead of defecting. Finally, SOS convincingly learns Gaussian mixtures on par with the dedicated CO algorithm.

## 7 Acknowledgements

This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). It was also supported by the Oxford-Google DeepMind Graduate Scholarship.

## References

• Balduzzi et al. (2018) D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The Mechanics of n-Player Differentiable Games. ICML, 2018.
• Bowling & Veloso (2001) M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, pp. 1021–1026. Morgan Kaufmann Publishers Inc., 2001.
• Busoniu et al. (2008) L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, March 2008.
• Conitzer & Sandholm (2007) V. Conitzer and T. Sandholm. AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents. Machine Learning, 67(1):23–43, May 2007.
• Daskalakis et al. (2018) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with Optimism. ICLR, 2018.
• Facchinei & Kanzow (2007) Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems. 4OR, 5(3), Sep 2007.
• Foerster et al. (2018) J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with Opponent-Learning Awareness. AAMAS, 2018.
• Foerster et al. (2018) J. N. Foerster, G. Farquhar, M. Al-Shedivat, T. Rocktäschel, E. P. Xing, and S. Whiteson. DiCE: The Infinitely Differentiable Monte-Carlo Estimator. ICML, 2018.
• Gemp & Mahadevan (2018) I. Gemp and S. Mahadevan. Global Convergence to the Equilibrium of GANs using Variational Inequalities. ArXiv e-prints, 2018.
• Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. NIPS, 2014.
• Heusel et al. (2017) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS, 2017.
• Jaderberg et al. (2017) M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. ICML, 2017.
• Lee et al. (2016) J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient Descent Only Converges to Minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1246–1257, 2016.
• Lee et al. (2017) J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order Methods Almost Always Avoid Saddle Points. ArXiv e-prints, 2017.
• Mertikopoulos & Zhou (2018) Panayotis Mertikopoulos and Zhengyuan Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, Mar 2018.
• Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The Numerics of GANs. NIPS, 2017.
• Nagarajan & Kolter (2017) V. Nagarajan and J. Kolter. Gradient descent GAN optimization is locally stable. NIPS, 2017.
• Ortega & Rheinboldt (2000) J. Ortega and W. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, 2000.
• Panageas & Piliouras (2017) I. Panageas and G. Piliouras. Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions. In ITCS 2017, volume 67 of Leibniz International Proceedings in Informatics, pp. 2:1–2:12, 2017.
• Pathak et al. (2017) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven Exploration by Self-supervised Prediction. ICML, 2017.
• Racanière et al. (2017) S. Racanière, T. Weber, D. P. Reichert, L. Buesing, A. Guez, D. Jimenez Rezende, A. Puigdomènech Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Hassabis, D. Silver, and D. Wierstra. Imagination-Augmented Agents for Deep Reinforcement Learning. NIPS, 2017.
• Rosen (1965) J.B. Rosen. Existence and Uniqueness of Equilibrium Points for Concave N-Person Games. Econometrica, 33, Jul 1965.
• Vezhnevets et al. (2017) A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. ICML, 2017.
• Wayne & Abbott (2014) G. Wayne and L. F. Abbott. Hierarchical control using networks trained with higher-level forward models. Neural Computation, 26(10):2163–2193, 2014.
• Zhang & Lesser (2010) C. Zhang and V. Lesser. Multi-Agent Learning with Policy Prediction. AAAI Conference on Artificial Intelligence, 2010.

## Appendix A Stable Fixed Points

In the main text we showed that Nash equilibria are inadequate in multi-agent learning, exemplified by the simple game given by , where the origin is a global Nash equilibrium but a saddle point of the losses. It is not however obvious that SFP are a better solution concept. We begin by pointing out that for single losses, invertibility and symmetry of the Hessian imply positive definiteness at SFP. These are exactly local minima of detected by the second partial derivative test, namely those points provably attainable by gradient descent.

To emphasise this, note that gradient descent does not converge locally to all local minima. This can be seen by considering the example and the local (global) minimum . There is no neighbourhood for which gradient descent converges to , since initialising at will always converge to for appropriate learning rates, with almost surely. This occurs precisely because the Hessian is singular at . Though a degenerate example, this suggests an important difference to make between the ideal solution concept (local minima) and that for which local convergence claims are possible to attain (local minima with invertible ).

Accordingly, the definition of SFP is the immediate generalisation of ‘fixed points with positive semi-definite Hessian’, or in other words, ‘second-order-tractable local minima’. It is important to impose only positive semi-definiteness to keep the class as large as possible, despite strict positive definiteness holding for single losses due to symmetry. Imposing strict positivity would for instance exclude the origin in the cyclic game , a point certainly worthy of convergence.

Note also that imposing a weaker condition than would be incorrect. Invertibility aside, local convergence of gradient descent on single functions cannot be guaranteed if , since such points are strict saddles. These are almost always avoided by gradient descent, as proven by Lee et al. (2016) and Panageas & Piliouras (2017). It is thus necessary to impose as a minimal requirement in optimisation methods attempting to generalise gradient descent.

###### Remark A.1.

A matrix is positive semi-definite iff the same holds for its symmetric part , so SFP could equivalently be defined as . This is the original formulation given by part of the authors (Balduzzi et al., 2018), who also imposed the extra requirement in a neighbourhood of . After discussion we decided to drop this assumption, pointing out that it is 1) more restrictive, 2) superficial to all theoretical results and 3) weakens the analogy with tractable local minima. The only thing gained by imposing semi-positivity in a neighbourhood is that SFP become a subset of Nash equilibria.

Regarding unstable fixed points and strict saddles, note that implies in a neighbourhood, hence being equivalent to the definition in Balduzzi et al. (2018). It follows also that unstable points are a subset of strict saddles: if then all eigenvalues are negative since any eigenpair satisfies

$$0 > \mathrm{Re}(v^{\intercal}Hv) = \mathrm{Re}(\lambda v^{\intercal}v) = \mathrm{Re}(\lambda).$$

We introduced strict saddles in this paper as a generalisation of unstable FP, which are more difficult to handle but nonetheless tractable using dynamical systems. The name is chosen by analogy to the definition in Lee et al. (2016) for single losses.

## Appendix B LOLA Vectorial Form

###### Proposition B.1.

The LOLA gradient adjustment is

$$\mathrm{LOLA} = (I - \alpha H_o)\,\xi - \alpha\,\mathrm{diag}(H_o^{\intercal}\nabla L)$$

under the usual assumption of equal learning rates.

###### Proof.

Recall the modified objective

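$$L_1(\theta_1,\ \theta_2 - \alpha\nabla_2 L_2,\ \ldots,\ \theta_n - \alpha\nabla_n L_n)$$

for agent $1$, and so on for each agent. First-order Taylor expansion yields

$$L_1 - \alpha\sum_{j\neq 1}(\nabla_j L_1)^{\intercal}\nabla_j L_j$$

and similarly for each agent. Differentiating with respect to $\theta_i$, the adjustment for player $i$ is

$$
\begin{aligned}
\mathrm{LOLA}_i &= \nabla_i\Big[L_i - \alpha\sum_{j\neq i}(\nabla_j L_i)^{\intercal}\nabla_j L_j\Big] \\
&= \nabla_i L_i - \alpha\sum_{j\neq i}\Big[(\nabla_{ji}L_i)^{\intercal}\nabla_j L_j + (\nabla_{ji}L_j)^{\intercal}\nabla_j L_i\Big] \\
&= \nabla_i L_i - \alpha\sum_{j\neq i}\nabla_{ij}L_i\,\nabla_j L_j - \alpha\sum_{j\neq i}(\nabla_{ji}L_j)^{\intercal}\nabla_j L_i \\
&= \xi_i - \alpha\sum_j (H_o)_{ij}\,\xi_j - \alpha\sum_j (H_o^{\intercal})_{ij}(\nabla L)_{ji} \\
&= \xi_i - \alpha(H_o\xi)_i - \alpha(H_o^{\intercal}\nabla L)_{ii} \\
&= \big[\xi - \alpha H_o\xi - \alpha\,\mathrm{diag}(H_o^{\intercal}\nabla L)\big]_i
\end{aligned}
$$

and thus

$$\mathrm{LOLA} = (I - \alpha H_o)\,\xi - \alpha\,\mathrm{diag}(H_o^{\intercal}\nabla L)$$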

as required. ∎
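As an illustrative special case (not stated in the original text), take two players and $i = 1$ in the third line of the derivation. Each sum then has the single term $j = 2$:

$$
\mathrm{LOLA}_1 = \nabla_1 L_1 \;-\; \alpha\,\nabla_{12}L_1\,\nabla_2 L_2 \;-\; \alpha\,(\nabla_{21}L_2)^{\intercal}\,\nabla_2 L_1 .
$$

The middle term is the first component of $\alpha H_o\xi$, and the last term is the first component of $\alpha\,\mathrm{diag}(H_o^{\intercal}\nabla L)$, matching the vectorial form above.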

## Appendix C Tandem Game

We provide a more detailed exposition of the tandem game in this section, including computation of fixed points for NL/LOLA and corresponding losses. Recall that the game is given by

$$L_1(x,y) = (x+y)^2 - 2x \quad\text{and}\quad L_2(x,y) = (x+y)^2 - 2y.$$

Intuitively, both agents want $x \approx -y$, since $(x+y)^2$ is the leading loss, but each would also prefer positive $x$ and $y$. These are incompatible, so the agents must not be ‘arrogant’ and instead make concessions. The fixed points are given by

$$\xi = 2(x+y-1)\begin{pmatrix}1\\1\end{pmatrix} = 0,$$

namely any pair $(x, 1-x)$. The corresponding losses are $L_1 = 1 - 2x = -L_2$, summing to $0$ for any $x$. We have

$$H = 2\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix} \succeq 0$$

everywhere, so all fixed points are SFP. LOLA fails to preserve these, since

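$$\chi = \mathrm{diag}(H_o^{\intercal}\nabla L) = 4\,\mathrm{diag}\left[\begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}\begin{pmatrix}x+y-1 & x+y\\ x+y & x+y-1\end{pmatrix}\right] = 4(x+y)\begin{pmatrix}1\\1\end{pmatrix}$$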

which is non-zero for any SFP . Instead, LOLA can only converge to points such that

$$\mathrm{LOLA} = \xi - \alpha H_o\xi - \alpha\chi = 0.$$

We solve this explicitly as follows:

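$$
\begin{aligned}
\mathrm{LOLA} &= 2(x+y-1)\begin{pmatrix}1\\1\end{pmatrix} - 4\alpha(x+y-1)\begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}\begin{pmatrix}1\\1\end{pmatrix} - 4\alpha(x+y)\begin{pmatrix}1\\1\end{pmatrix} \\
&= 2\big[(1-4\alpha)(x+y) - (1-2\alpha)\big]\begin{pmatrix}1\\1\end{pmatrix}.
\end{aligned}
$$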

The fixed points for LOLA are thus pairs such that

$$x + y = \frac{1-2\alpha}{1-4\alpha},$$

noting that for all . This leads to worse losses

$$L_1 = \left(\frac{1-2\alpha}{1-4\alpha}\right)^2 - 2x \;>\; 1 - 2x = L_1(x, 1-x)$$

for agent 1 and similarly for agent 2. In particular, losses always sum to something greater than $0$. This becomes negligible as the learning rate becomes smaller, but is always positive nonetheless. Taking $\alpha$ arbitrarily small is not a viable solution since convergence will in turn be arbitrarily slow. LOLA is thus not a strong algorithm candidate for all differentiable games.
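The fixed-point computation above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names (`tandem_step`, `run`, `losses`), the starting point, the step count and the choice $\alpha = 0.1$ are all illustrative assumptions. The `"lookahead"` branch simply drops the $\chi$ term, following the LookAhead update $\theta - \alpha(I - \alpha H_o)\xi$ analysed in Appendix D.

```python
def tandem_step(x, y, alpha, rule="lola"):
    """One simultaneous update on the tandem game
    L1 = (x+y)^2 - 2x, L2 = (x+y)^2 - 2y.

    By symmetry every vector below has two equal components,
    so only one scalar per vector is stored.
    """
    s = x + y
    xi = 2.0 * (s - 1.0)      # xi = 2(x+y-1)(1,1)
    ho_xi = 4.0 * (s - 1.0)   # H_o xi, with H_o = 2[[0,1],[1,0]]
    chi = 4.0 * s             # chi = diag(H_o^T grad L) = 4(x+y)(1,1)
    if rule == "naive":
        g = xi
    elif rule == "lookahead":
        g = xi - alpha * ho_xi
    elif rule == "lola":
        g = xi - alpha * ho_xi - alpha * chi
    else:
        raise ValueError(f"unknown rule: {rule}")
    return x - alpha * g, y - alpha * g


def run(rule, alpha=0.1, steps=500, x=0.3, y=-0.2):
    """Iterate the chosen update rule from an (arbitrary) starting point."""
    for _ in range(steps):
        x, y = tandem_step(x, y, alpha, rule)
    return x, y


def losses(x, y):
    """Return (L1, L2) for the tandem game."""
    s2 = (x + y) ** 2
    return s2 - 2.0 * x, s2 - 2.0 * y


if __name__ == "__main__":
    for rule in ("naive", "lookahead", "lola"):
        x, y = run(rule)
        print(rule, "x+y =", x + y, "loss sum =", sum(losses(x, y)))
```

With these settings the naive and LookAhead iterates settle on the line $x + y = 1$, while the LOLA iterates settle on $x + y = (1-2\alpha)/(1-4\alpha)$, where the two losses sum to a strictly positive value.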

## Appendix D Convergence Proofs

We use Ostrowski’s theorem as a unified framework for proving local convergence of gradient-based methods. This is a standard result on fixed-point iterations, adapted from Ortega & Rheinboldt (2000, 10.1.3). We also invoke and prove a topological result of our own (D.9) at the end of this section. This is useful in deducing local convergence, though not central to intuition.

###### Theorem D.1 (Ostrowski).

Let be continuously differentiable on an open subset , and assume is a fixed point. If all eigenvalues of are strictly in the unit circle of , then there is an open neighbourhood of such that for all , the sequence converges to . Moreover, the rate of convergence is at least linear in .

###### Definition D.2.

A matrix is called positive stable if all its eigenvalues have positive real part.

Recall the simultaneous gradient $\xi$ and the Hessian $H$ defined for differentiable games. Let $X$ be any matrix with continuously differentiable entries.

###### Corollary D.3.

Assume $\bar{x}$ is a fixed point of a differentiable game such that $XH(\bar{x})$ is positive stable. Then the iterative procedure

$$F(x) = x - \alpha X\xi(x)$$

converges locally to $\bar{x}$ for $\alpha > 0$ sufficiently small.

###### Proof.

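By definition of fixed points, $\xi(\bar{x}) = 0$ and so

$$\nabla[X\xi](\bar{x}) = \nabla X(\bar{x})\,\xi(\bar{x}) + X(\bar{x})\,\nabla\xi(\bar{x}) = XH(\bar{x})$$

is positive stable by assumption, namely has eigenvalues $a_k + i b_k$ with $a_k > 0$. It follows that

$$\nabla F(\bar{x}) = I - \alpha\nabla[X\xi](\bar{x})$$

has eigenvalues $1 - \alpha a_k - i\alpha b_k$, which are in the unit circle for small $\alpha$. More precisely,

$$|1 - \alpha a_k - i\alpha b_k|^2 < 1 \iff 1 - 2\alpha a_k + \alpha^2 a_k^2 + \alpha^2 b_k^2 < 1 \iff 0 < \alpha < \frac{2a_k}{a_k^2 + b_k^2}$$

which is always possible for $a_k > 0$. Hence $\nabla F(\bar{x})$ has eigenvalues in the unit circle for $0 < \alpha < \min_k 2a_k/(a_k^2 + b_k^2)$, and we are done by Ostrowski’s Theorem since $\bar{x}$ is a fixed point of $F$. ∎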
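The step-size bound in this proof is easy to evaluate once the eigenvalues of $XH(\bar{x})$ are known. The sketch below is illustrative only: the helper names `max_stable_step` and `in_unit_circle` and the example spectrum are assumptions, not taken from the paper. It computes $\min_k 2a_k/(a_k^2+b_k^2)$ and checks directly whether every $1 - \alpha\lambda_k$ lies strictly inside the unit circle.

```python
def max_stable_step(eigenvalues):
    """Return min_k 2*a_k / (a_k^2 + b_k^2) over eigenvalues a_k + i*b_k.

    Raises ValueError if some eigenvalue has non-positive real part,
    i.e. if the spectrum is not that of a positive stable matrix.
    """
    bounds = []
    for lam in eigenvalues:
        lam = complex(lam)
        if lam.real <= 0:
            raise ValueError("spectrum is not positive stable")
        bounds.append(2.0 * lam.real / (lam.real ** 2 + lam.imag ** 2))
    return min(bounds)


def in_unit_circle(eigenvalues, alpha):
    """Check |1 - alpha * lambda_k| < 1 for every eigenvalue."""
    return all(abs(1.0 - alpha * complex(lam)) < 1.0 for lam in eigenvalues)


if __name__ == "__main__":
    spectrum = [1 + 2j, 1 - 2j, 0.5]       # hypothetical eigenvalues of XH
    bound = max_stable_step(spectrum)      # limited by the complex pair
    print("bound:", bound)
    print("just below:", in_unit_circle(spectrum, 0.99 * bound))
    print("just above:", in_unit_circle(spectrum, 1.01 * bound))
```

Eigenvalues with a large imaginary part relative to their real part give the tightest bound, which is why strongly rotational games force small learning rates.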

We apply this corollary to LookAhead, which is given by

$$F(\theta) = \theta - \alpha X\xi(\theta)$$

where $X = I - \alpha H_o$. It is thus sufficient to prove the following result.

###### Theorem D.4.

Let invertible with symmetric diagonal blocks. Then there exists such that is positive stable for all .

###### Remark D.5.

Note that may fail to be positive definite, though true in the case of matrices. This no longer holds in higher dimensions, exemplified by the Hessian

$$H = \begin{pmatrix} 9 & -4 & -3 & -3 \\ -2 & 1 & 2 & 1 \\ -3 & 0 & 1 & 0 \\ -3 & 1 & 2 & 1 \end{pmatrix}.$$

By direct computation (symbolic in ), one can show that always has positive eigenvalues for small , whereas its symmetric part always has a negative eigenvalue with magnitude in the order of . This implies that and in turn is not positive definite. As such, an analytical proof of the theorem involving bounds on the corresponding bilinear form will fail.

This makes the result all the more interesting, but more involved. Central to the proof is a similarity transformation proving positive definiteness with respect to a different inner product, a novel technique we have not found in the multi-agent learning literature.
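The failure of positive definiteness for this $H$ can be confirmed on a single test vector, using exact rational arithmetic. This sketch is an illustration rather than part of the original argument. It assumes every diagonal block is $1 \times 1$ (four players with one scalar parameter each, the only block structure for which the diagonal blocks of this matrix are symmetric), so that $H_o$ is $H$ with its diagonal zeroed. The vector `u = (3, 12, 2, -5)` and the helper names are our own choices.

```python
from fractions import Fraction

H = [[9, -4, -3, -3],
     [-2, 1, 2, 1],
     [-3, 0, 1, 0],
     [-3, 1, 2, 1]]


def matvec(A, v):
    """Matrix-vector product for lists of numbers."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def quad_form_G(u, alpha):
    """Return u^T G u for G = (I - alpha * H_o) H.

    H_o is taken to be H with its diagonal set to zero
    (assumption: all diagonal blocks are 1x1).
    """
    n = len(H)
    H_o = [[H[i][j] if i != j else 0 for j in range(n)] for i in range(n)]
    Hu = matvec(H, u)
    HoHu = matvec(H_o, Hu)
    Gu = [Hu[i] - alpha * HoHu[i] for i in range(n)]
    return sum(ui * gi for ui, gi in zip(u, Gu))


if __name__ == "__main__":
    u = [3, 12, 2, -5]
    for alpha in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
        print(alpha, quad_form_G(u, alpha))
```

For this vector $u^{\intercal}Hu = 0$ and $u^{\intercal}H_oHu = 66$, so the quadratic form equals $-66\alpha$, which is negative for every $\alpha > 0$.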

###### Proof.

We cannot study the eigenvalues of directly, since there is no necessary relationship between eigenpairs of and . In the aim of using analytical tools, the trick is to find a positive definite matrix which is similar to , thus sharing the same positive eigenvalues. First define

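$$G_1 = (I + \alpha H_d)H \quad\text{and}\quad G_2 = -\alpha H^2,$$

where $H_d$ is the sub-matrix of diagonal blocks, and rewrite

$$G = (I - \alpha H_o)H = (I - \alpha(H - H_d))H = (I + \alpha H_d)H - \alpha H^2 = G_1 + G_2.$$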

Note that is block diagonal with symmetric blocks , so is symmetric and positive definite for all . In particular its principal square root

$$M = (I + \alpha H_d)^{1/2}$$

is unique and invertible. Now note that

$$M^{-1}G_1M = M^{-1}M^2HM = M^{\intercal}HM,$$

which is positive semi-definite since

$$u^{\intercal}M^{\intercal}HMu = (Mu)^{\intercal}H(Mu) \geq 0$$

for all non-zero . In particular provides a similarity transformation which eliminates from while simultaneously delivering positive semi-definiteness. We can now prove that

$$M^{-1}GM = M^{-1}G_1M + M^{-1}G_2M$$

is positive definite, establishing positive stability of by similarity. Let where is the vector space dimension, namely . Recall that the -sphere is the space of unit vectors in . Take any and consider the quantity

$$u^{\intercal}M^{-1}GMu.$$

First note that a Taylor expansion of in yields

$$M = (I + \alpha H_d)^{1/2} = I + O(\alpha)$$

and

$$M^{-1} = (I + \alpha H_d)^{-1/2} = I + O(\alpha).$$

This implies in turn that

$$u^{\intercal}M^{-1}GMu = u^{\intercal}Gu + O(\alpha).$$

There are two cases to distinguish. If then

$$u^{\intercal}M^{-1}GMu = u^{\intercal}Gu + O(\alpha) = u^{\intercal}G_1u + O(\alpha) = u^{\intercal}Hu + O(\alpha) > 0$$

for sufficiently small. Otherwise, and consider decomposing into symmetric and antisymmetric parts and , so that . By antisymmetry of we have and hence . Now implies , so by Cholesky decomposition of there exists a matrix such that . In particular implies , and in turn . Since is invertible and , we have and so . It follows in particular that

$$-\alpha u^{\intercal}H^2u = -\alpha u^{\intercal}(S^{\intercal} - A^{\intercal})(S + A)u = \alpha u^{\intercal}A^{\intercal}Au = \alpha\|Au\|^2 > 0.$$

Using positive semi-definiteness of ,

$$
\begin{aligned}
u^{\intercal}M^{-1}GMu &= u^{\intercal}M^{-1}G_1Mu + u^{\intercal}M^{-1}G_2Mu \\
&\geq -\alpha u^{\intercal}M^{-1}H^2Mu \\
&= -\alpha u^{\intercal}H^2u + O(\alpha^2) \\
&= \alpha\|Au\|^2 + O(\alpha^2) > 0
\end{aligned}
$$

for small enough. We conclude that for any there is such that

$$u^{\intercal}M^{-1}GMu > 0$$

for all , where is a function with compact. By the topological result D.9, this can be extended uniformly with some such that

$$u^{\intercal}M^{-1}GMu > 0$$

for all and . It follows that is positive definite for all and thus is positive stable for in the same range, by similarity. ∎

###### Corollary D.6.

LookAhead converges locally to stable fixed points for sufficiently small.

###### Proof.

For any SFP we have $H \succeq 0$ and $H$ invertible by definition, with diagonal blocks symmetric by twice continuous differentiability. We are done by the result above and Corollary D.3. ∎

We now prove that local convergence results descend to SOS. The following lemma establishes the crucial claim that our criterion for is in neighbourhoods of fixed points. This is necessary to invoke analytical arguments including Ostrowski’s Theorem, and would be untrue globally.

###### Lemma D.7.

If is a fixed point and is sufficiently small then in a neighbourhood of .

###### Proof.

First note that , so there is a (bounded) neighbourhood of such that for all , for any choice of hyperparameter . In particular by definition of the second criterion. We want to show that near , or equivalently . Since in , it remains only to show that

$$\frac{-a\|\xi_0\|^2}{\langle -\alpha\chi,\ \xi_0\rangle} \geq \|\xi(\theta)\|^2$$

in some neighbourhood of , for any choice of hyperparameter . Now by boundedness of and continuity of , there exists such that for all and bounded . It follows by Cauchy-Schwarz that

 −a∥ξ0∥2⟨−α\raisebox0.8pt$χ$,ξ