# Fast and Furious Learning in Zero-Sum Games: Vanishing Regret with Non-Vanishing Step Sizes

We show for the first time, to our knowledge, that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. This phenomenon, that we coin "fast and furious" learning in games, sets a new benchmark about what is possible both in max-min optimization as well as in multi-agent systems. Our analysis does not depend on introducing a carefully tailored dynamic. Instead we focus on the most well studied online dynamic, gradient descent. Similarly, we focus on the simplest textbook class of games, two-agent two-strategy zero-sum games, such as Matching Pennies. Even for this simplest of benchmarks the best known bound for total regret, prior to our work, was the trivial one of O(T), which is immediately applicable even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space we prove a regret bound of Θ(√(T)) matching the well known optimal bound for adaptive step sizes in the online setting. This guarantee holds for all fixed step-sizes without having to know the time horizon in advance and adapt the fixed step-size accordingly. As a corollary, we establish that even with fixed learning rates the time-average of mixed strategies, utilities converge to their exact Nash equilibrium values.

There are no comments yet.

## Authors

• 6 publications
• 54 publications
• ### Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games

We establish that algorithmic experiments in zero-sum games "fail misera...
05/21/2019 ∙ by Yun Kuen Cheung, et al. ∙ 0

• ### Learning Zero-sum Stochastic Games with Posterior Sampling

In this paper, we propose Posterior Sampling Reinforcement Learning for ...
09/08/2021 ∙ by Mehdi Jafarnia-Jahromi, et al. ∙ 0

• ### Optimistic and Adaptive Lagrangian Hedging

In online learning an algorithm plays against an environment with losses...
01/23/2021 ∙ by Ryan D'Orazio, et al. ∙ 0

• ### Finite-Time Last-Iterate Convergence for Multi-Agent Learning in Games

We consider multi-agent learning via online gradient descent (OGD) in a ...
02/23/2020 ∙ by Tianyi Lin, et al. ∙ 7

• ### Learning in time-varying games

In this paper, we examine the long-term behavior of regret-minimizing ag...
09/10/2018 ∙ by Benoit Duvocelle, et al. ∙ 0

• ### Finite Regret and Cycles with Fixed Step-Size via Alternating Gradient Descent-Ascent

Gradient descent is arguably one of the most popular online optimization...
07/09/2019 ∙ by James P. Bailey, et al. ∙ 0

• ### Multi-Agent Learning in Network Zero-Sum Games is a Hamiltonian System

Zero-sum games are natural, if informal, analogues of closed physical sy...
03/05/2019 ∙ by James P. Bailey, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The performance of online learning algorithms such as online gradient descent in adversarial, adaptive settings is a classic staple of optimization and game theory, e.g,

[10, 17, 37]. Arguably, the most well known results in this space are the following:

• Sublinear regret of is achievable in adversarial settings but only after employing a carefully chosen sequence of shrinking step-sizes or if the time horizon is finite and known in advance and the fixed learning rate is selected accordingly.

• Sublinear regret algorithms “converge” to Nash equilibria in zero-sum games.

Despite the well established nature of these results recent work has revealed some surprising insights that come to challenge the traditional ways of thinking in this area. Specifically, in the case of zero-sum games what is referred to as “convergence” to equilibrium, is the fact that when both agent apply regret-minimizing algorithms, both the time-average of the mixed strategy profiles as well as the utilities of the agents converge approximately to their Nash equilibrium values, where the approximation error can become arbitrarily close to zero by choosing a sufficiently small step-size. Naturally, this statement does not imply that the day-to-day behavior converges to equilibria. In fact, the actual realized behavior is antithetical to convergence to equilibrium. [1] showed that Nash equilibria are repelling in zero-sum games for all follow-the-regularized-leader dynamics. As seen in Figure 1 the dynamics spiral outwards away from the equilibrium.

These novel insights about the geometry of learning dynamics in zero-sum games suggest a much richer and not well understood landscape of coupled strategic behaviors. They also raise the tantalizing possibility that we may be able to leverage this knowledge to prove tighter regret bounds in games. In fact, a series of recent papers has focused on beating the “black-box” regret bounds using a combination of tailored dynamics and adaptive step-sizes, e.g, [12, 30, 33, 16] but so far no new bounds have been proven for the classic setting of fixed learning rates. Interestingly, [16] explicitly examine the case of fixed learning rates to show that learning achieves sublinear “approximate regret” where the algorithm compares itself against times the performance of the best action with hindsight. In contrast, our aim is to show sublinear regret for fixed using the standard notion of regret.

Intuitively, non-equilibration and more generally this emergent behavioral complexity seem like harbingers of bad news in terms of system performance as well as of significant analytical obstacles. This pessimism seems especially justified given recent results about the behavior of online dynamics with fixed step-sizes in other small games (e.g. two-by-two coordination/congestion games), where their behavior can be shown to become provably chaotic ([25, 11]). Nevertheless, we show that we can leverage this geometric information to provide the first to our knowledge sublinear regret guarantees for online gradient descent with fixed step-size in games. Instability of Nash equilibria is not an obstacle, but in fact may be leveraged as a tool, for proving low regret.

Our results. We study the dynamics of gradient descent with fixed step size in two-strategy, two-player games. We leverage a deep understanding of the geometry of its orbits to prove the first sublinear regret bounds despite the constant learning rate. We show that the player strategies are repelled away from the Nash equilibrium. More specifically, regardless of the choice of the initial condition there are only a finite number of iterations where both players select mixed strategies (Theorem 1). We prove a worst-case regret bound of for arbitrarily learning without prior knowledge of (Theorem 3) matching the well known optimal bound for adaptive learning rates. An immediate corollary of our results is that time-average of the mixed strategy profiles as well as the utilities of the agents converge to their exact Nash equilibrium values (and not to approximations thereof) (Corollary 4). Finally, we present a matching lower bound of (Theorem 5) establishing that our regret analysis is tight.

To obtain the upper bound, we establish a tight understanding of the geometry of the trajectories in the dual space, i.e., the trajectories of the payoff vectors. We show there exists a linear transformation of the payoff vectors that rotate around the Nash equilibrium. Moreover, the distance between the Nash equilibrium and these transformed utility vectors increases by a constant in each rotation (Lemma

8). In addition, the time to complete a rotation is proportional to the distance between the Nash equilibrium and the transformed payoff vectors (Lemma 9). Together, these results imply a quadratic relationship between the number of iterations and the number of rotations completed establishing the regret bound. We establish the lower bound by exactly tracking the strategies and regret for a single game.

## 2 Preliminaries

A two-player game consists of two players where each player has strategies to select from. Player can either select a pure strategy or a mixed strategy . A strategy is fully mixed if .

The most commonly studied class of games is zero-sum games. In a zero-sum game, there is a payoff matrix where player 1 receives utility and player receives utility resulting in the following optimization problem:

 maxx1∈X1minx2∈X2x1⋅Ax2 (Two-Player Zero-Sum Game)

The solution to this saddle problem is the Nash equilibrium . If player 1 selects her Nash equilibria , then she guarantees her utility is independent of what strategy player selects. is referred to as the value of the game.

### 2.1 Online Learning in Continuous Time

In many applications of game theory, players know neither the payoff matrix nor the Nash equilibria. In such settings, players select their strategies adaptively. The most common way to do this in continuous time is by using a follow-the-regularized-leader (FTRL) algorithm. Given a strongly convex regularizer, a learning rate , and an initial payoff vector , players select their strategies at time according to

 y1(T) =y1(0)+∫T0Ax2(t)dt (Player 1 Payoff Vector) y2(T) =y2(0)−∫T0A⊺x1(t)dt (Player 2 Payoff Vector) xi(T) =argmaxxi≥0:∑j∈[ni]xij=1{yi(T)⋅xi−hi(xi)η} (Continuous FTRL)

In this paper, we are primarily interested in the regularizer resulting in the Gradient Descent algorithm:

 xi(t) =argmaxxi≥0:∑j∈[ni]xij=1{yi(t)⋅xi−||xi||222η} (Continuous Gradient Descent)

Continuous time FTRL learning in games has an interesting number of properties including time-average converge to the set of coarse correlated equilibria at a rate of in general games [21] and thus to Nash equilibria in zero-sum games. These systems can also exhibit interesting recurrent behavior e.g. periodicity [28, 23], Poincaré recurrence [21, 29, 27] and limit cycles [19]. These systems have formal connections to Hamiltonian dynamics (i.e. energy perserving systems) [2]. All of these types of recurrent behavior are special cases of chain recurrence [26, 24].

### 2.2 Online Learning in Discrete Time

In most settings, players update their strategies iteratively in discrete time steps. The most common class of online learning algorithms is again the family of follow-the-regularized-leader algorithms.

 yT1 =y01+T−1∑t=1Axt2 (Player 1 Payoff Vector) yT2 =y02−T−1∑t=1A⊺xt1 (Player 2 Payoff Vector) xti =argmaxxi≥0:∑j∈[ni]xij=1{yti⋅xi−hi(xi)η} (FTRL) xti =argmaxxi≥0:∑j∈[ni]xij=1{yti⋅xi−||xi||222η} (Gradient Descent)

where corresponds to the learning rate. In Lemma 6 of Appendix A, we show (FTRL) is the first order approximation of (Continuous FTRL).

These algorithms again have interesting properties in zero-sum games. The time-average strategy converges to a -approximate Nash equilibrium [10]. On the contrary, Bailey and Piliouras show that the day-to-day behavior diverges away from interior Nash equilibria [1]. For notational simplicity we do not introduce different learning rates but all of our proofs immediately carry over to this setting.

### 2.3 Regret in Online Learning

The most common way of analyzing an online learning algorithm is by examining its regret. The regret at time/iteration is the difference between the accumulated utility gained by the algorithm and the total utility of the best fixed action with hindsight. Formally for player 1,

 Regret1(T) =maxx1∈X1{∫T0x1⋅Ax2(t)dt}−∫T0x1(t)⋅Ax2(t)dt (1) Regret1(T) =maxx1∈X1{T∑t=0x1⋅Axt2}−T∑t=0xt1⋅Axt2 (2)

for continuous and discrete time respectively.

In the case of (Continuous FTRL) it is possible to show rather strong regret guarantees. Specifically, [21] establishes that even for non-zero-sum games. In contrast, (FTRL) only guarantees for a fixed learning rate. In this paper, we utilize the geometry of (Gradient Descent) to show in 2x2 zero-sum games ().

## 3 The Geometry of Gradient Descent

###### Theorem 1.

Let be a 2x2 game that has a unique fully mixed Nash equilibrium where strategies are updated according to (Gradient Descent). For any non-equilibrium initial strategies, there exists a such that is on the boundary for all .

Theorem 1 strengthens the result for (Gradient Descent) in 2x2 games from [1]. Specifically, [1] show that strategies come arbitrarily close to the boundary infinitely often when updated with any version of (FTRL). This is accomplished by closely studying the geometry of the player strategies. We strengthen this result for (Gradient Descent) in 2x2 games by focusing on the geometry of the payoff vectors. The proof of Theorem 1 relies on many of the tools developed in Section 4 for Theorem 3 and is deferred to Appendix F.
The first step to understanding the trajectories of the dynamics of (Gradient Descent), is characterizing the solution to (Gradient Descent). To streamline the discussion and presentation of results, we defer the proof of Lemma 2 to Appendix B.

###### Lemma 2.

The solution to (Gradient Descent) is given by

 xtij =⎧⎪⎨⎪⎩0for j∉Siη(ytij−∑k∈Siytik|Si|)+1|Si|for j∈Si. (3)

where is found using Algorithm 1.

### 3.1 Convex Conjugate of the Regularizer

Our analysis primarily takes place in the space of payoff vectors. The payoff vector is a formal dual of the strategy obtained via

 h∗(yti)=maxxi≥0:∑j∈[ni]xij=1{yti⋅xi−hi(xi)η} (4)

which is known as the convex conjugate or Fenchel Coupling of and is closely related to the Bregman Divergence. In [21] it is shown that the “energy” is conserved in (Continuous FTRL). By Lemma 6, (FTRL) is the first order approximation of (Continuous FTRL). The energy is convex, and therefore the energy will be non-decreasing in (FTRL). [1] capitalized on this non-decreasing energy to show that strategies come arbitrarily close to the boundary infinitely often in (FTRL).

In a similar fashion, we precisely compute to better understand the dynamics of (Gradient Descent). We deviate slightly from traditional analysis of (FTRL) and embed the learning rate into the regularizer . Formally, define . Through the maximizing argument [18], we have

 h∗i(yti) =yti⋅xti−||xti||222η. (5)

From Lemma 2,

 yti⋅xti =∑j∈Siytij⎛⎝η⎛⎝ytij−∑k∈Siytik|Si|⎞⎠+1|Si|⎞⎠ (6) =η∑j∈Si(ytij)2−η∑j∈Si∑k∈Siytijytik|Si|+∑j∈Siytij|Si| (7)

and

 ||xti||222η =∑j∈Si(η(ytij−∑k∈Siytik|Si|)+1|Si|)22η (8) =η2∑j∈Si(ytij)2−η∑j∈Si∑k∈Siytijytij|Si|+η2(∑j∈Siytij)2|Si|+12η1|Si|. (9)

Therefore,

 h∗(yti)i =yti⋅xti−||xti||222η (10) =η2∑j∈Si(ytij)2+∑j∈Siytij|Si|−η2(∑j∈Siytij)2|Si|−12η1|Si|. (11)

### 3.2 Selecting the Right Dual Space in 2x2 Games

Since is a strongly smooth function in the simplex, we expect for to be strongly convex [18] – at least when it’s corresponding dual variable is positive. However, (11) is not strongly convex for all . This is because cannot appear anywhere in . Rather, is contained to a space dual to the domain .

There are many non-intersecting dual spaces for the payoff vectors that yield strategies . [21] informally define a dual space when they focus the analysis on the vector . Similarly, we define a dual space that will be convenient for showing our results in 2x2 zero-sum games. Consider the payoff matrix

 A=[abcd] (12)

Without loss of generality, we may assume , , and is singular, i.e., (see Appendix C for details). Denote as

 Δyt1 =yt+11−yt1 (13) =Axt2 (14) =[(a−b)xt21+b(c−d)xt21+d] (15)

Therefore

since is singular. When increases by , increases by . Thus, the vector describes the span of the dual space . Moreover, (FTRL) is invariant to constant shifts in the payoff vector and therefore we may assume . By induction,

 [d−c,a−b]⋅yt1 =[d−c,a−b]⋅(yt−11+Δyt−11) (17) =[d−c,a−b]⋅yt−11=0 (18)

This conveniently allows us to express in terms of ,

 yt12 =c−da−byt11. (19)

Symmetrically,

 yt22 =b−da−cyt21. (20)

Combining these relationships with Lemma 2 yields

 xt11 =⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩0if η(1−c−da−b)yt112+12≤01if η(1−c−da−b)yt112+12≥1η(1−c−da−b)yt112+12% otherwise (21) xt21 =⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩0if η(1−b−da−c)yt212+12≤01if η(1−b−da−c)yt212+12≥1η(1−b−da−c)yt212+12% otherwise (22)

The selection of this dual space also allows us to employ a convenient variable substitution to plot and on the same graph.

 zt1 =η(1−c−da−b)yt112+12 (23) zt2 =η(1−b−da−c)yt212+12 (24)

The strategy can now be expressed as

 xti1 =⎧⎪⎨⎪⎩0 if zti≤01 if zti≥1zti otherwise (25)

Moreover, (11) can be rewritten as

 h∗i(yt1)=¯h∗1(zt1) =⎧⎪⎨⎪⎩α10zt1−β10if zt1≤0α11zt1−β11if zt1≥1γ1(zt1)2+α1zt1−β1otherwise (26) h∗i(yt2)=¯h∗2(zt2) =⎧⎪⎨⎪⎩α20zt2−β20if zt2≤0α21zt2−β21if zt2≥1γ2(zt2)2+α2zt2−β2otherwise (27)

where and . Both of these expressions are obviously strongly convex when the corresponding player strategy is in . The full details of these reduction can be found in Appendix D. With this notation, is simply the projection of onto the unit square as shown in Figure 2.

## 4 Θ(√T) Regret in 2x2 Zero-Sum Games

###### Theorem 3.

Let be a 2x2 game that has a unique fully mixed Nash equilibrium. When is updated according to (Gradient Descent), .

It is well known that if an algorithm admits sublinear regret in zero-sum games, then the time-average play converges to a Nash equilibirum. Thus, Theorem 3 immediately results in the following corollary.

###### Corollary 4.

Let be a 2x2 game that has a unique fully mixed Nash equilibrium. When is updated according to (Gradient Descent), average strategy converges to as .

###### Proof of Theorem 3.

The result is simple if . Neither player strategy will ever change. Since player 1’s opponent is playing the fully mixed , player 1’s utility is constant independent of what strategy is selected and therefore the regret is always . Now consider .

The main details of the proof are captured in Figure 3. Specifically in Section E.1, we establish break points and analyze the impact strategies have on the regret. The strategies are contained in adjacent red and green sections as shown in Figure 3.

Next in Section E.2, we show that there exists iterations where for each partitioning, . Specifically, we show that consecutive payoff vectors appear in a red section of Figure 3. The remaining points all appear in a green section and the corresponding player strategies are equivalent. This implies

 tj+1−1∑t=tj(xt+11−xt1)⋅Axt2 =∑t∈[tj,tj+1−1]:xt+11≠xt1(xt+11−xt1)⋅Axt2 (28) ∈∑t∈[tj,tj+1−1]:xt+11≠xt1O(1) (29) ∈O(1) (30)

Denote as the total energy of the system in iteration . In Section E.3, we show this energy increases linearly in each partition, i.e., . In Section E.4, we also show that the size of each partition is proportional to the energy in the system at the beginning of that partition, i.e., . Combining these two, . Therefore and where is the total number of partitions. Finally, it is well known ([10]) that the regret of player in zero-sum games through iterations is bounded by

 Regret1(T) ≤O(1)+T∑t=0(xt+11−xt1)⋅Axt2 (31) ≤O(1)+t0−1∑t=0(xt+11−xt1)⋅Axt2+k∑i=1ti−1∑t=ti−1(xt+11−xt1)⋅Axt2 (32) ∈O(1)+k∑i=1O(1) (33) ∈O(√T) (34)

completing the proof of the theorem. ∎

Next, we provide a game and initial conditions that has regret establishing that the bound in Theorem 3 is tight.

###### Theorem 5.

Consider the game Matching Pennies with learning rate and initial conditions . Then player 1’s regret is when strategies are updated with (Gradient Descent).

The proof follows similarly to the proof of Theorem 3 by exactly computing the regret in every iteration of (Gradient Descent). The full details appear in Section G.

## 5 Related Work

The study of learning dynamics in game theory has a long history dating back work of [7] and [31] on fictitious play in zero-sum games, which followed shortly after von Neumann’s seminal work on zero-sum games ([34, 35]). Some good reference books are the following: [17, 10, 37]. The classic results about time-average convergence of no-regret dynamics have been successfully generalized to include multiplayer extensions of network constant-sum games by [15, 9, 8].

Non-equilibrating dynamics in algorithmic game theory. In recent years the algorithmic game theory community has produced several interesting non-equilibrium results. These proofs are typically based on ad-hoc techniques, and results in this area typically revolve around specific examples of games with a handful of agents and strategies. [13] show that multiplicative weights update (MWU) does not converge even in a time-average sense in the case of a specific 3x3 game. [19] establish non-convergence for a continuous-time variant of MWU, known as the replicator dynamic, for a 2x2x2 game and show that as a result the system social welfare converges to states that dominate all Nash equilibria. [25, 11] prove the existence of Li-Yorke chaos in MWU dynamics of 2x2 potential games. Our result add a new chapter in this area with new detailed understanding of the non-equilibrium trajectories of gradient descent in two-by-two zero-sum games and their implications to regret.

Connections to continuous time dynamics in game theory. From the perspective of evolutionary game theory, which typically studies continuous time dynamics, numerous nonconvergence results are known but again typically for small games, e.g., [32]. [29] and [27] show that replicator dynamics, the continuous time version of MWU exhibit a specific type of near periodic behavior, which is known as Poincaré recurrence. Recently, [21] show how to generalize this recurrent behavior for replicator to more general continuous time variants of FTRL dynamics. [20] show that these arguments can also be adapted in the case of dynamically evolving games. Cycles arise also in team competition ([28]) as well as in network competition ([23]). The papers in this category combine delicate arguments such as volume preservation and the existence of constants of motions (“energy preservation”) for the dynamics to establish cyclic behavior. In the case of discrete time dynamics, such as the multiplicative weights or gradient descent, the system trajectories are first order approximations of the above motion and these conservation arguments are no longer valid. Instead as we have seen in this paper the “energy” is not preserved but increases over time at a predictable rate that allows us to prove tight bounds on the regret. Finally, [26] have put forward a program for linking game theory and topology of dynamical systems.

Fast regret minimization in games. It is widely known that the “black-box” average regret rate of it is achieved by MWU with suitably shrinking step size without making any assumptions about its environment. Recently, several authors have focused instead on obtaining stronger regret guarantees for systems of learning algorithms in games. [12] and [30] develop no-regret dynamics with a regret minimization rate when played against each other in two-player zero-sum games. [33] further analyze a recency biased variant of FTRL in more general games and showed a regret minimization rate. The social welfare converges at a rate of , a result which was extended to standard versions of FRTL dynamics by [16].

Learning in zero-sum games and applications to Artificial Intelligence.

A stream of recent papers proves positive results about convergence to equilibria in (mostly bilinear) zero-sum games for suitable adapted variants of first-order methods and then apply these techniques to Generative Adversarial Networks (GANs), showing improved performance (e.g. [14]) [4] exploit conservation laws of learning dynamics in zero-sum games (e.g. [29, 21]) to develop new algorithms for training GANs that add a new component to the dynamic that aims at minimizing this energy function. Different energy shrinking techniques for convergence in GANs (non-convex saddle point problems) exploit connections to variational inequalities and employ mirror descent techniques with an extra gradient step ([22]). Game theoretic inspired techniques such as time-averaging seem to work well in practice for a wide range of architectures ([36]).

Finally, the emergence of cycles in zero-sum competition lies at the core of some of the most exciting problems in creating artificial agents for complex environments such as Starcraft, where even evaluating the strength of an individual agent is a non-trivial task ([5]). Recent approaches are inspired by the emergence of cyclic behavior to introduce algorithms that aim at game-theoretic niching ([3]).

## 6 Conclusion

We present the first, to our knowledge, proof of sublinear regret for the most classic FTRL dynamic, online gradient descent, in two-by-two zero-sum games. Our proof techniques leverage geometric information and hinge upon the fact that FTRL dynamics, although are typically referred to as “converging” to Nash equilibria in zero-sum games, diverge away from them. We strongly believe that these techniques, which we are just introducing, are far from being fully mined. Although several novel ideas will be required, we are fairly confident that these sublinear regret bounds carry over to much more general classes of FTRL dynamics as well as to large (zero-sum) games.

## Acknowledgements

James P. Bailey and Georgios Piliouras acknowledge SUTD grant SRG ESD 2015 097, MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07.

## References

• [1] Bailey, J. P., and Piliouras, G. Multiplicative weights update in zero-sum games. In ACM Conference on Economics and Computation (2018).
• [2] Bailey, J. P., and Piliouras, G. Multi-Agent Learning in Network Zero-Sum Games is a Hamiltonian System. arXiv e-prints (Mar 2019), arXiv:1903.01720.
• [3] Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., Perolat, J., Jaderberg, M., and Graepel, T. Open-ended learning in symmetric zero-sum games. arXiv preprint arXiv:1901.08106 (2019).
• [4] Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The Mechanics of n-Player Differentiable Games. In ICML (2018).
• [5] Balduzzi, D., Tuyls, K., Pérolat, J., and Graepel, T. Re-evaluating evaluation. In NIPS (2018).
• [6] Bertsekas, D. P. Nonlinear programming. Athena scientific Belmont, 1999.
• [7] Brown, G. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation, T.C. Koopmans (Ed.), New York: Wiley. (1951).
• [8] Cai, Y., Candogan, O., Daskalakis, C., and Papadimitriou, C. Zero-sum polymatrix games: A generalization of minmax. Mathematics of Operations Research 41, 2 (2016), 648–655.
• [9] Cai, Y., and Daskalakis, C. On minmax theorems for multiplayer games. In ACM-SIAM Symposium on Discrete Algorithms (2011), SODA, pp. 217–234.
• [10] Cesa-Bianchi, N., and Lugoisi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
• [11] Chotibut, T., Falniowski, F., Misiurewicz, M., and Piliouras, G. Family of chaotic maps from game theory. arXiv preprint arXiv:1807.06831 (2018).
• [12] Daskalakis, C., Deckelbaum, A., and Kim, A. Near-optimal no-regret algorithms for zero-sum games. In Proceedings of the Twenty-second Annual ACM-SIAM Symposium on Discrete Algorithms (Philadelphia, PA, USA, 2011), SODA ’11, Society for Industrial and Applied Mathematics, pp. 235–254.
• [13] Daskalakis, C., Frongillo, R., Papadimitriou, C., Pierrakos, G., and Valiant, G. On learning algorithms for Nash equilibria. Symposium on Algorithmic Game Theory (SAGT) (2010), 114–125.
• [14] Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training gans with optimism. In ICLR (2018).
• [15] Daskalakis, C., and Papadimitriou, C. On a network generalization of the minmax theorem. In ICALP (2009), pp. 423–434.
• [16] Foster, D. J., Lykouris, T., Sridharan, K., and Tardos, E. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems (2016), pp. 4727–4735.
• [17] Fudenberg, D., and Levine, D. K. The Theory of Learning in Games. MIT Press Books. The MIT Press, 1998.
• [18] Kakade, S. M., Shalev-shwartz, S., and Tewari, A. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization, 2009.
• [19] Kleinberg, R., Ligett, K., Piliouras, G., and Tardos, É. Beyond the Nash equilibrium barrier. In Symposium on Innovations in Computer Science (ICS) (2011).
• [20] Mai, T., Panageas, I., Ratcliff, W., Vazirani, V. V., and Yunker, P. Cycles in Zero Sum Differential Games and Biological Diversity. In ACM EC (2018).
• [21] Mertikopoulos, P., Papadimitriou, C., and Piliouras, G. Cycles in adversarial regularized learning. In ACM-SIAM Symposium on Discrete Algorithms (2018).
• [22] Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Mirror descent in saddle-point problems: Going the extra (gradient) mile. ArXiv e-prints (July 2018).
• [23] Nagarajan, S. G., Mohamed, S., and Piliouras, G. Three body problems in evolutionary game dynamics: Convergence, periodicity and limit cycles. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (2018), International Foundation for Autonomous Agents and Multi-agent Systems, pp. 685–693.
• [24] Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. -rank: Multi-agent evaluation by evolution. arXiv preprint arXiv:1903.01373 (2019).
• [25] Palaiopanos, G., Panageas, I., and Piliouras, G. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems (2017), pp. 5872–5882.
• [26] Papadimitriou, C., and Piliouras, G. From nash equilibria to chain recurrent sets: An algorithmic solution concept for game theory. Entropy 20, 10 (2018).
• [27] Piliouras, G., Nieto-Granda, C., Christensen, H. I., and Shamma, J. S. Persistent patterns: Multi-agent learning beyond equilibrium and utility. In AAMAS (2014), pp. 181–188.
• [28] Piliouras, G., and Schulman, L. J. Learning dynamics and the co-evolution of competing sexual species. In ITCS (2018).
• [29] Piliouras, G., and Shamma, J. S. Optimization despite chaos: Convex relaxations to complex limit sets via poincaré recurrence. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms (2014), SIAM, pp. 861–873.
• [30] Rakhlin, S., and Sridharan, K. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (2013), pp. 3066–3074.
• [31] Robinson, J. An iterative method of solving a game. Annals of Mathematics 54 (1951), 296–301.
• [32] Sandholm, W. H. Population Games and Evolutionary Dynamics. MIT Press, 2010.
• [33] Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In Proceedings of the 28th International Conference on Neural Information Processing Systems (Cambridge, MA, USA, 2015), NIPS’15, MIT Press, pp. 2989–2997.
• [34] von Neumann, J. Zur theorie der gesellschaftsspiele. Mathematische Annalen 100 (1928), 295–300.
• [35] von Neumann, J., and Morgenstern, O. Theory of Games and Economic Behavior. Princeton University Press, 1944.
• [36] Yazıcı, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Piliouras, G., and Chandrasekhar, V. The Unusual Effectiveness of Averaging in GAN Training. ArXiv e-prints (June 2018).
• [37] Young, H. P. Strategic learning and its limits. Oxford Univ. Press, 2004.

## Appendix A First Order Approximation of (Continuous FTRL)

###### Lemma 6.

(FTRL) is the first order approximation of (Continuous FTRL).

###### Proof.

The first order approximation of is

 ^y1(t) =y1(t−1)+ddty1(t−1) (35) =y1(t−1)+Ax1(t−1) (36)

and

 ^x1(t) =argmaxx1∈X1{x⋅^y1(t)−h1(x1)η} (37)

Inductively, and as defined in (FTRL) completing the proof of the lemma. ∎

## Appendix B Optimal Solution to (Gradient Descent)

The KKT optimality conditions (see [6]) for (Gradient Descent) are given by

 xti =η(yti−λti⋅1+uti) (Critical Point) xti ≥0 (Non-negativity) ni∑j=1xtij =1 (Primal Feasibility) uti ≥0 (Dual Feasibility) uti⋅xti =0 (Complimentary Slackness)

where and .

Let be the set of where . By (Complimentary Slackness), for all . Therefore, (Critical Point) becomes

 xtij ={0for j∉Siη(ytij−λti)for j∈Si. (38)

Substituting (38) into (Primal Feasibility) yields

 1 =ni∑j=1xtij (39) =∑j∈Siη(ytij−λti) (40)

and . Therefore

 xtij =⎧⎪⎨⎪⎩0for j∉Siη(ytij−∑k∈Siytik|Si|)+1|Si|for j∈Si. (41)

The variable represents that the constraint is unenforced. Enforcing constraints never improves the objective value of an optimization problem and therefore is a maximal set where (41) is feasible. Moreover, it is straightforward to show that if then . Thus, greedily removing the lowest valued from until (41) is feasible yields the optimal solution to (Gradient Descent).

## Appendix C Payoff Matrix Assumptions

The payoff matrix is in the form

 A=[abcd] (42)

In this paper, we make three assumptions about : , and . In order, we show that we may make these assumption without loss of generality.

In 2x2 games, if there is a unique fully mixed Nash equilibrium, then it is straight forward to show that player 2’s equilibrium is

 xNE2=(d−ba+d−b−c,a−ca+d−b−c) (43)

and therefore , and when there is a unique fully mixed Nash equilibrium. Similarly, by analyzing player 1’s Nash equilibrium, and . Now consider the payoff matrix