1. Introduction
Zero-sum games are at the heart of game theory and an important concept in a range of areas of applied mathematics. The development of efficient methods for finding equilibria in such games is therefore a central problem in algorithmic game theory [10]. The problem is often cast as finding an optimum of a sequential decision problem, and much of the literature in mathematical programming and algorithmic game theory focuses on the setting of convex-concave objective functions. With the rise of generative adversarial networks (GANs) [34], along with other machine learning methods that rely on optimising more than one objective function, there is renewed interest in the study of algorithms for finding equilibria in continuous games without convexity-concavity assumptions. As an example, Goodfellow et al. [34] originally cast the GAN training problem as follows: find a Nash equilibrium of the zero-sum game
Models of this type are notoriously difficult to train [48, 5, 6, 9], while also having enormous empirical success in a number of areas [39, 30, 40, 52, 53, 18]. This has sparked an interest in developing computational methods for finding equilibria in general minimax games, as well as a surge in work on the theoretical underpinnings of GAN training; see, e.g., [37, 25, 19, 17, 35, 49] and references therein.
In the absence of a convexity-concavity assumption, finding (pure) equilibrium points in minimax games is a challenging problem [21]. Much of the existing work on computational methods concerns algorithms that come with some type of local convergence. For work along these lines, with a focus on the machine learning context, see, e.g., [36, 20, 8, 42, 3, 41, 29, 47, 38, 31, 50].
Motivated by the need for efficient methods for training GANs, the authors of [37] shift the focus from pure to mixed strategies and reformulate the zero-sum game that defines the training as a multi-agent optimisation problem in the space of probability measures. Building on this work, in [25] the authors analyse general minimax games, including GANs, and associated mixed Nash equilibria by connecting them to Langevin dynamics. They construct implementable training dynamics, test them numerically, and provide a theoretical analysis based on gradient flows in the space of probability measures [4]. When using a system of interacting particles, as in [37, 25], it is crucial to understand the properties of the system as the number of particles grows. Along with convergence of the particle system, e.g., law of large numbers and central limit results, an important aspect is to understand the fluctuations from such limits. In this note we therefore consider the training dynamics of [25] from a large deviations perspective. Under mild regularity conditions on the payoff function of the underlying game, we establish a large deviation principle (LDP) for the empirical measure of an interacting particle system corresponding to the Langevin descent-ascent dynamics of [25]. We show how slightly different versions of the convergence results of [25], for the interacting particle systems and the associated Nikaidô-Isoda error, follow from this LDP. To prove the LDP, we use the fact that the particle dynamics corresponding to the strategies of the two players can be formulated in a way that fits the large deviation results of [16]. In addition to yielding the results in this note, building on the rather general results of [16] also sets the stage for future work on dynamics with more general diffusion terms.
Large deviations have proven to be extremely useful in the analysis and design of Monte Carlo methods, in particular in the rare-event context; for some examples see [7, 15, 28, 27, 24, 13]. Whereas tools from large deviations theory, and their connections to stochastic control, have been used extensively in the Monte Carlo setting, they are largely unexplored in the analysis of machine learning methods. Moreover, starting with the work [1], it has been established that there is a close link between gradient flows on the space of probability measures and large deviations for particle systems. In a sense, an LDP identifies the most natural gradient flow formulation for a given PDE; see [2, 26, 46] and references therein. With the increased use of particle systems and gradient flows in the analysis of machine learning methods, it is natural to introduce the large deviation framework in this setting. Our work can be seen as a first step in this direction, by establishing a relevant LDP in the setting of zero-sum games. This opens up the possibility for more detailed analysis of computational methods for finding mixed Nash equilibria, including for the training of GANs, based on the associated rate function.
For future work, we are interested in extending the results to variants of mirror-descent-like algorithms, as appearing in, e.g., [37, 25, 14], and computing the corresponding rate functions for specific examples of minimax games appearing in the training of GANs. The latter will be used to investigate the impact different parameters have on convergence and to compare different methods (i.e., particle dynamics). A stochastic version of what in [25] is referred to as Wasserstein-Fisher-Rao dynamics is of particular interest.
Notation
For a space , we take to denote the space of probability measures on and to denote the space of continuous functions from into . We take to be the space of real-valued, bounded functions on . If is a metric space, with metric , we take to be the set of functions that are bounded and Lipschitz on , with norm
where is the Lipschitz constant of . The dual bounded-Lipschitz metric on is defined as
and metrises the topology of weak convergence on . Another family of metrics on that will be used is the Wasserstein metrics: for , the Wasserstein metric on the space of probability measures on with finite th moment is defined as
where is the collection of all couplings of and . Convergence in is equivalent to weak convergence plus convergence of the first moments; see [51] for a more detailed discussion of the Wasserstein metrics.
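As a concrete numerical illustration (not part of the original text), the Wasserstein-1 distance between two empirical measures on the real line can be computed with SciPy; the sample data below are invented for the example.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two empirical measures on the real line, given by i.i.d. samples.
mu_samples = rng.normal(loc=0.0, scale=1.0, size=2000)
nu_samples = rng.normal(loc=0.5, scale=1.0, size=2000)

# Wasserstein-1 distance between the two empirical measures; for two
# point masses delta_a and delta_b this reduces to |a - b|.
w1 = wasserstein_distance(mu_samples, nu_samples)
print(f"W1 between the empirical measures: {w1:.3f}")
```

For these two Gaussian samples the distance is close to the mean shift, consistent with weak convergence plus convergence of first moments characterising convergence in W1.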
Lastly, for an element of a product space , we use to denote the component of and to denote the component.
2. Definitions and model setup
The setting of interest in this paper is two-player zero-sum games and methods for finding mixed Nash equilibria for such games. To define the relevant concepts and quantities, let and , for some ; for convenience we set . We focus on the Euclidean setup here, although generalisation to Riemannian manifolds, the setting used in [25], can also be considered. A two-player zero-sum game consists of a set of two players, with parameters and , and a function that gives the payoff for player one. That is, for , the payoff for player one, the player, is and the payoff of player two, the player, is . To ease comparison with [25] we also refer to as the loss of the game.
A pure Nash equilibrium point (NE) [43] is a pair of strategies such that
(2.1) 
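As a quick sanity check of the pure-equilibrium concept (using an invented finite example, not from the text), a brute-force search confirms that matching pennies has no pure Nash equilibrium: at every pure strategy pair, one of the players can improve by deviating unilaterally.

```python
import numpy as np

# Matching pennies: ell[i, j] is player one's loss; player one minimises
# and player two maximises. (Payoff values invented for the illustration.)
ell = np.array([[1.0, -1.0],
                [-1.0, 1.0]])

def is_pure_ne(i, j):
    # (i, j) is a pure NE if neither player can improve unilaterally:
    # i must minimise ell(., j) and j must maximise ell(i, .).
    return ell[i, j] <= ell[:, j].min() and ell[i, j] >= ell[i, :].max()

no_pure_ne = not any(is_pure_ne(i, j) for i in range(2) for j in range(2))
print(f"matching pennies has a pure NE: {not no_pure_ne}")
```

The uniform mixed strategies do form an equilibrium for this game, illustrating why mixed Nash equilibria exist in greater generality than pure ones.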
A mixed Nash equilibrium point (MNE) [44, 33]
is a pair of probability distributions
such that, for all , ,
Similar to [25], to simplify the notation we take to denote the expected loss: for in and ,
The definition of an MNE then becomes: for all ,
Comparing this to (2.1) shows that an MNE for the loss is a pure Nash equilibrium point with respect to
. For a given loss function
, neither Nash equilibria nor MNEs are guaranteed to exist for continuous games. However, MNEs exist in greater generality (see, e.g., [33]) and are more viable for computational methods. The aim of [37, 25] is to construct efficient computational methods for finding approximations of an MNE for a given loss function . A common way to quantify the accuracy of such an approximation , used also in [25], is the Nikaidô and Isoda (NI) error [45]:
(2.2) 
A pair of distributions is called an MNE if it satisfies . Note that for an MNE , .
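To make the NI error concrete, consider a finite zero-sum game where mixed strategies are probability vectors and the expected loss is a bilinear form in the strategies. The sketch below (with an invented matching-pennies loss matrix and helper function `ni_error`) computes the NI error as the best-response gap for the two players; for a finite game the extrema over mixed strategies are attained at pure strategies.

```python
import numpy as np

def ni_error(A, mu, nu):
    """Nikaidô-Isoda error for a finite zero-sum game with loss matrix A.

    Player one (mu) minimises mu^T A nu, player two (nu) maximises it.
    NI(mu, nu) = max over nu' of L(mu, nu') minus min over mu' of L(mu', nu);
    both extrema are attained at pure strategies for a finite game.
    """
    best_reply_for_nu = np.max(mu @ A)  # best column response to mu
    best_reply_for_mu = np.min(A @ nu)  # best row response to nu
    return best_reply_for_nu - best_reply_for_mu

# Matching pennies: no pure NE, unique mixed NE at the uniform strategies.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
print(ni_error(A, uniform, uniform))               # 0.0: an MNE
print(ni_error(A, np.array([1.0, 0.0]), uniform))  # 1.0: not an equilibrium
```

The NI error is always nonnegative and vanishes exactly at an MNE, which is what makes it a useful accuracy measure for approximation schemes.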
As mentioned in Section 1, although existence is less of an issue when we consider MNEs instead of pure NEs, finding an MNE for a given game is typically a difficult task. In [25] the authors propose an approach for approximating MNEs associated with based on a mean-field dynamics. Whereas the mean-field dynamics are defined in terms of a PDE, actual computations are based on the following interacting particle system: for , let and be independent Wiener processes on and , respectively, take and for some initial distributions and , and consider the system of coupled SDEs
(2.3) 
In [25] this is referred to as the Langevin ascent-descent dynamics, or entropic regularisation (Algorithm 1 therein). The dynamics also resemble those of Algorithm 4 in [37], therein referred to as approximate infinite-dimensional mirror descent.
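As a rough illustration of particle dynamics of the kind described above, the sketch below runs an Euler-Maruyama discretisation in which each x-particle descends, and each y-particle ascends, the empirical mean loss against the opposing population, with additive noise. The toy loss, the function `simulate`, and all parameter values (`beta`, `dt`, `steps`) are invented for the example and are not the exact coefficients of (2.3); see [25] for the precise dynamics.

```python
import numpy as np

def simulate(grad_x_loss, grad_y_loss, n, beta=10.0, dt=1e-2, steps=500, seed=0):
    """Euler-Maruyama sketch of coupled Langevin descent-ascent dynamics.

    x-particles descend, and y-particles ascend, the empirical mean loss;
    both populations receive Wiener increments scaled by sqrt(2/beta).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # strategies of player one (1-d for simplicity)
    y = rng.normal(size=n)  # strategies of player two
    sigma = np.sqrt(2.0 / beta)
    for _ in range(steps):
        # Mean-field interaction: each particle feels the empirical measure
        # of the opposing population through the averaged gradient.
        gx = np.array([grad_x_loss(xi, y).mean() for xi in x])
        gy = np.array([grad_y_loss(x, yi).mean() for yi in y])
        x = x - dt * gx + sigma * np.sqrt(dt) * rng.normal(size=n)
        y = y + dt * gy + sigma * np.sqrt(dt) * rng.normal(size=n)
    return x, y

# Toy loss ell(x, y) = sin(x) * cos(y) (invented for the example).
gx = lambda xi, y: np.cos(xi) * np.cos(y)   # d ell / dx at (xi, y_j), all j
gy = lambda x, yi: -np.sin(x) * np.sin(yi)  # d ell / dy at (x_i, yi), all i
x, y = simulate(gx, gy, n=200)
print(f"mean x-particle: {x.mean():.3f}, mean y-particle: {y.mean():.3f}")
```

The empirical measures of the two populations at each time step play the role of the measures defined in (3.1) below.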
In [25] the authors go on to prove a series of results for the limit and about the Wasserstein gradient flow associated with the corresponding mean-field dynamics. First, they show that the empirical measure of the particle system (2.3) converges, with respect to the metric and uniformly in time, to a solution of the gradient flow (2.4),
(2.4) 
where
It is also shown that the mean absolute error of the NI error associated with the empirical measure converges to 0 (uniformly in time), that if the solution of (2.4) converges in time, then the limit can be characterised as a certain fixed point, and that this limit is an MNE for given that is above a specified threshold.
3. An LDP and a.s. convergence of the NI error
Consider the coupled systems of SDEs (2.3). We view this as describing the dynamics of particles over the (arbitrary) time interval . The empirical measures for the two collections of particles are defined as
(3.1) 
These measures are viewed as elements in and , respectively, where we equip and with the supremum norm. For , the corresponding marginals are denoted and , which belong to and , respectively. Henceforth we make the following assumption on the loss .
Assumption. The loss function is continuous and bounded, and is Lipschitz and bounded.
Under this assumption, the regularity properties of carry over to the coefficients appearing in an alternative form of (2.3), ensuring that these are regular enough for existing large deviation results to apply. It is clear that the assumptions can be weakened in different directions, in particular the assumption that and are bounded. However, the focus of this note is not to obtain the most general results possible, which is why we are content with this more convenient assumption for now. Note also that in [25] the underlying parameter spaces are assumed to be compact, which under their assumptions on implicitly ensures boundedness of both and .
In order to state the main large deviation result, we first introduce some notation from [16], in which the LDP is established on an augmented space (see [16] for a more thorough description). Let , the trajectory space for each particle , and , the trajectory space for the Wiener process, both equipped with the maximum norm. Let denote the space of deterministic relaxed controls on and the corresponding subset of elements with finite first moment. is equipped with the topology generated by the metric applied to normalised versions of the elements in : For any , consider the distance , where is the measure (see [16] and references therein for further details related to these spaces).
Within the space , let denote the collection of measures that satisfy:
(i) ,
(ii) is a weak solution of the SDE
(3.2)
where is a standard Wiener process defined on some probability space and under the corresponding probability measure the triplet has distribution ,
(iii) is the law of the random variable .
The main result in [16] states that the sequence of empirical measures satisfies the LDP with rate function (see Remark 3.2 in [16])
(3.3) 
where
and the triple is the canonical process on (equipped with its Borel σ-algebra) and a.s. satisfies
(3.4) 
This is a control-type formulation of the rate function, where acts as the control in the limit equation for .
We are now ready to state the main large deviation result. The key observation is that, under Assumption 3, the LDP for the pair of empirical measures can be shown using the results from [16] described in the previous paragraphs.
For each , define . Assume that , for some , as . Then, the family of empirical measures satisfies an LDP, on , with speed and rate function given in (3.3).
Moreover, under the same conditions and with viewed as an element of , satisfies an LDP with speed and rate function
(3.5) 
where the supremum is taken over real Schwartz functions on and, for an element , is defined in (3). Before we give a proof, we comment briefly on the result. First, those familiar with large deviation theory will recognise this as a Dawson-Gärtner-type result. Whereas the original results by Dawson and Gärtner [22] can be applied in the current setting, with different assumptions on and , their results require using a certain inductive topology on the space of probability measures; see, e.g., [16, 32] for more on this. The results in [22] are also restricted to diffusion coefficients that are non-degenerate and independent of the empirical measures , whereas for future work we are interested in dynamics where the latter does not hold. We have therefore opted for a more flexible approach to the large deviation result also in this simpler setting. Lastly, we expect that the alternative formulation (3.5) of the rate function will be beneficial for studying the performance of the proposed algorithms, similar to how a parallel form of the rate function for small-noise diffusions has been used in the context of importance sampling [27, 15].
Proof of Theorem 3.
The proof follows from the LDP in [16] once we express the particle dynamics in an appropriate way and verify that the assumptions of [16] hold. For the first part, we define a two-component particle system, with components and : set , where . With this definition, we identify as the empirical measure of :
(3.6) 
The empirical measures and , and their marginals, are now the marginals of and , e.g., . To ease the notation we set , to be the corresponding marginal and analogously for , .
With the dynamics (2.3) for and , the dynamics for the “new” system can be expressed as, for ,
(3.7) 
To express this in a more standard form, similar to the interacting particle systems treated in [16], we define the function as , with
(3.8) 
If we also define the process on as , a Wiener process on , then the dynamics for the s can be expressed as follows: for , is a solution of the SDE
(3.9) 
This SDE is of the form considered in [16]. It remains to check that the assumptions used therein for proving the LDP also hold in our setting. As mentioned in [16], it suffices that the drift be uniformly Lipschitz (weaker conditions than this, and than the one imposed in this paper, can suffice as well; see the comment before the proof and [16] for a more extensive discussion). We now prove that this holds under Assumption 3.
Take to be such that and let be the Lipschitz constant of . Then, and are both in . For any and , we have for that
(3.10) 
For the first term on the righthand side of (3.10), we have
Considering now the second term on the righthand side of (3.10), we use the Lipschitz property of :
The calculations for are completely analogous and we conclude that
This shows that is Lipschitz.
Because the diffusion coefficient is constant, the global Lipschitz property of is enough for an application of [16, Theorem 3.1].
It remains to move from the rate function (3.3) to the calculus-of-variations form (3.5). Starting with the control formulation, we can use the contraction principle to obtain an LDP for on . Let denote the corresponding rate function. Standard calculations suggest that it can be rewritten as (3.5). With the generator associated with (3.4), and the corresponding (formal) adjoint, with corresponding to the case , the PDE characterisation of the dynamics (3.4) is
(3.11) 
which is interpreted in the weak sense. We now take to be the Schwartz space of real distributions on and, for and , define the norm by
where the supremum is taken over real Schwartz test functions on such that . Based on the characterisation (3.11), with , we have the representation
for a heuristic description of how to go from (3.11) and the control formulation (3.3) to this version of the rate function. In the recent works [11, 12] the authors give, to the best of our knowledge, the first rigorous proof of this type of equivalence between the different types of rate functions, for both moderate and large deviations, in the setting of multiscale processes. The above expression for is precisely (3.5), the prescribed form of the rate function. ∎
Theorem 3 establishes the relevant LDP by appealing to results in [16]. A first byproduct is a law-of-large-numbers-type convergence of the empirical measures defined in (3.6). The sequence of empirical measures converges, as , almost surely in to , the solution of (3.11) with :
(3.12) 
Corollary 3 is a version of the first part of Theorem 3 in [25]. To see this, insert the definition (3.8) of into (3.12): the resulting equation is precisely the entropy-regularised gradient flow (2.4). More generally, the LDP for a particle system identifies a natural candidate for the gradient flow structure of the limit as : both the dissipation mechanism and the entropy functional (see [4]) can be identified from the LDP; see [1, 2] and subsequent work by Peletier and co-authors. In [25] the gradient flow is used to propose the dynamics used for finding MNEs. For other particle dynamics aimed at the same task, this type of large deviation analysis can be used to identify the correct gradient flow to use for further analysis. This will be the subject of future work on mirror-descent-like algorithms.
In [25], to establish the convergence to the gradient flow, the authors work in the topology on and , consider the convergence of the marginals for and conclude that the convergence is uniform in . Here, the result follows from the LDP in Theorem 3, by noting that as in (3.12) satisfies . A standard argument using the Borel-Cantelli lemma then shows that is indeed the limit of as .
The next result, which also follows from Theorem 3, corresponds to the second part of Theorem 3 in [25], modulo the different topologies and mode of convergence being used. In order to have results that are uniform over , and ease notation, we define the map, for ,
The sequence converges almost surely in to , where is the same as in Corollary 3.
Proof.
As a first step we establish that the map is continuous for each . In Lemma 2 in [25] it is shown that the NI error, defined in (2.2), is a Lipschitz map on when the distance is used to define the topology on . Therein the underlying state spaces are also assumed compact. Removing compactness, Assumption 3 is enough for the main ideas of the proof to be used also for the topology of weak convergence on and . First, similar to [25], we note that under Assumption 3, the function , for any , is continuous, bounded and Lipschitz.
With this observation, we now adapt the arguments from Lemma 2 in [25] to establish the continuity of the NI error. Instead of , we work with the dual bounded-Lipschitz metric, and rather than the Lipschitz constant of we use an upper bound on the bounded Lipschitz norm of the function . We note that this upper bound can be taken independent of due to the assumptions on . Using this property, for any two , the steps used in [25, Lemma 2] lead to the upper bound
Using the analogous argument for the map , for any , we have
for any . The continuity of the NI error now follows from an application of the triangle inequality.
With the desired continuity established, we can now show the claimed convergence of . From Corollary 3 we have convergence of the marginals of to those of the solution of (3.12). Combined with the continuity obtained in the previous paragraphs, for each , with probability one we have
To obtain the convergence uniformly in , we employ again the argument used to show that the NI error is a continuous function on . Repeating the steps outlined above, similar to those in [25], we arrive at, for any ,
a.s., where is an upper bound on the norm of (for any ) and (for any ). By the convergence , for a given , there is a such that a.s. for all . This also implies the same property for the marginals: a.s., for any , if . Therefore, since the constant does not depend on or ,
a.s. for all . For a given , pick , so that the inequality becomes
a.s. for all . This shows the claimed convergence. ∎
In addition to the convergence in Proposition 3, as a consequence of Theorem 3 we also obtain the corresponding LDP for the NI error. The sequence satisfies an LDP on , with speed and rate function, for ,
(3.13) 
The proof is a direct consequence of combining Theorem 3, the continuity of the map and the contraction principle [23]. Note that an LDP for the NI error at a fixed time follows by applying the contraction principle once more.
Acknowledgments
The authors are grateful to Z. W. Bezemek and K. Spiliopoulos for useful discussions of a first version of the paper, in particular regarding the relation between the different formulations of the rate functions in Theorem 3.
The research of VN and PN was supported by Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The research of PN was also supported by the Swedish Research Council (VR201807050).
References
 [1] S. Adams, N. Dirr, M. A. Peletier, and J. Zimmer. From a large-deviations principle to the Wasserstein gradient flow: A new micro-macro passage. Commun. Math. Phys., 307:791–815, 2011.
 [2] S. Adams, N. Dirr, M. A. Peletier, and J. Zimmer. Large deviations and gradient flows. Phil. Trans. R. Soc. A., 371:20120341, 2013.

 [3] L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. In The 22nd International Conference on Artificial Intelligence and Statistics, 486–495, 2019.
 [4] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, 2005.
 [5] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
 [6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, 214–223, 2017.
 [7] S. Asmussen and P. W. Glynn. Stochastic simulation: Algorithms and analysis. Stochastic modelling and applied probability. Springer, New York, NY, first edition, 2007.
 [8] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In Proceedings of the 35th International Conference on Machine Learning, 354–363, 2018.
 [9] S. A. Barnett. Convergence problems with generative adversarial networks (GANs). Preprint, arXiv:1806.11382, 2018.
 [10] T. Başar and G. J. Olsder. Dynamic noncooperative game theory. Society for Industrial and Applied Mathematics, 1998.
 [11] Z. Bezemek and K. Spiliopoulos. Large deviations for interacting multiscale particle systems. Preprint, arXiv:2011.03032, 2020.
 [12] Z. Bezemek and K. Spiliopoulos. Moderate deviations for fully coupled multiscale weakly interacting particle systems. Preprint, arXiv:2202.08403, 2022.
 [13] J. Bierkens, P. Nyquist, and M. C. Schlottke. Large deviations for the empirical measure of the zigzag process. Ann. Appl. Probab., 31(6):2811–2843, 2021.

 [14] A. Borovykh, N. Kantas, P. Parpas, and G. A. Pavliotis. On stochastic mirror descent with interacting particles: Convergence properties and variance reduction. Physica D: Nonlinear Phenomena, 418:132844, 2021.
 [15] A. Budhiraja and P. Dupuis. Analysis and approximation of rare events, volume 94 of Probability Theory and Stochastic Modelling. Springer, New York, 2019. Representations and weak convergence methods.
 [16] A. Budhiraja, P. Dupuis, and M. Fischer. Large deviations for weakly interacting processes via weak convergence methods. Ann. Probab., 40(1):74–102, 2012.
 [17] H. Cao and X. Guo. Approximation and convergence of GANs training: an SDE approach. Preprint, arXiv:2006.02047, 2020.
 [18] A. Clark, J. Donahue, and K. Simonyan. Adversarial video generation on complex datasets. Preprint, arXiv:1907.06571, 2019.

 [19] G. Conforti, A. Kazeykina, and Z. Ren. Game on random environment, mean-field Langevin system, and neural networks. Math. Oper. Res., to appear, 2022.
 [20] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.

 [21] C. Daskalakis, S. Skoulakis, and M. Zampetakis. The complexity of constrained min-max optimization. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 1466–1478, 2021.
 [22] D. A. Dawson and J. Gärtner. Large deviations from the McKean-Vlasov limit for weakly interacting diffusions. Stochastics, 20(4):247–308, 1987.
 [23] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Stochastic Modelling and Applied Probability. Springer, New York, NY, second edition, 1998.
 [24] J. Doll, P. Dupuis, and P. Nyquist. A large deviations analysis of certain qualitative properties of parallel tempering and infinite swapping algorithms. Appl. Math. Optim., 78(1):103–144, 2018.
 [25] C. DomingoEnrich, S. Jelassi, A. Mensch, G. Rotskoff, and J. Bruna. A meanfield analysis of twoplayer zerosum games. In Advances in Neural Information Processing Systems, 33:20215–20226, 2020.
 [26] M. H. Duong, M. A. Peletier, and J. Zimmer. GENERIC formalism of a Vlasov-Fokker-Planck equation and connection to large-deviation principles. Nonlinearity, 26:2951–2971, 2013.
 [27] P. Dupuis, K. Spiliopoulos, and X. Zhou. Escaping from an attractor: importance sampling and rest points I. Ann. Appl. Probab., 25(5):2909–2958, 2015.
 [28] P. Dupuis and H. Wang. Subsolutions of an Isaacs equation and efficient schemes for importance sampling. Math. Oper. Res., 32(3):723–757, 2007.
 [29] E. Mazumdar, M. I. Jordan, and S. S. Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. Preprint, arXiv:1901.00838, 2019.
 [30] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts. GANSynth: Adversarial neural audio synthesis. Preprint, arXiv:1902.08710, 2019.
 [31] T. Fiez, L. Ratliff, E. Mazumdar, E. Faulkner, and A. Narang. Global convergence to local minmax equilibrium in classes of nonconvex zero-sum games. In Advances in Neural Information Processing Systems, 34:29049–29063, 2021.
 [32] M. Fischer. On the form of the large deviation rate function for the empirical measures of weakly interacting systems. Bernoulli, 20(4):1765 – 1801, 2014.
 [33] I. L. Glicksberg. A further generalization of the Kakutani fixed point theorem, with application to Nash equilibrium points. Proc. Amer. Math. Soc., 3:170–174, 1952.
 [34] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 27, 2014.
 [35] X. Guo and O. Mounjid. GANs training: A game and stochastic control approach. Preprint, arXiv:2112.00222, 2021.
 [36] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two timescale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 30, 2017.
 [37] Y.P. Hsieh, C. Liu, and V. Cevher. Finding mixed Nash equilibria of generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning, 2810–2819, 2019.
 [38] C. Jin, P. Netrapalli, and M. I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In Proceedings of the 37th International Conference on Machine Learning, 4880–4889, 2020.
 [39] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila. Aliasfree generative adversarial networks. in Advances in Neural Information Processing Systems, 34, 2021.

 [40] S. W. Kim, Y. Zhou, J. Philion, A. Torralba, and S. Fidler. Learning to simulate dynamic environments with GameGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1231–1240, 2020.
 [41] E. Mazumdar and L. J. Ratliff. Local Nash equilibria are isolated, strict local Nash equilibria in 'almost all' zero-sum continuous games. In 58th IEEE Conference on Decision and Control, CDC, 6899–6904, IEEE, 2019.
 [42] P. Mertikopoulos, B. Lecouat, H. Zenati, C.S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddlepoint problems: Going the extra(gradient) mile. In International Conference on Learning Representations, 2019.
 [43] J. Nash. Non-cooperative games. Ann. of Math. (2), 54:286–295, 1951.
 [44] J. F. Nash, Jr. Equilibrium points in n-person games. Proc. Nat. Acad. Sci. U.S.A., 36:48–49, 1950.
 [45] H. Nikaidô and K. Isoda. Note on noncooperative convex games. Pacific J. Math., 5:807–815, 1955.
 [46] M. Peletier, N. Gavish, and P. Nyquist. Large deviations and gradient flows for the Brownian one-dimensional hard-rod system. Potential Analysis, 2021.
 [47] A. Raghunathan, A. Cherian, and D. Jha. Game theoretic optimization via gradientbased NikaidôIsoda function. In Proceedings of the 36th International Conference on Machine Learning, 5291–5300, 2019.
 [48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 29, 2016.
 [49] S. Sidheekh, A. Aimen, and N. C. Krishnan. On characterizing GAN convergence through proximal duality gap. In Proceedings of the 38th International Conference on Machine Learning, 9660–9670, 2021.
 [50] I. Tsaknakis and M. Hong. Finding firstorder Nash equilibria of zerosum games with the regularized NikaidôIsoda function. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 1189–1197, 2021.
 [51] C. Villani. Optimal transport: Old and new. SpringerVerlag, Berlin, 2009.

 [52] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5505–5514, 2018.
 [53] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 5907–5915, 2017.