1 Introduction
The classical problem of finding Nash equilibria in multiplayer games has been a focus of intense research in computer science, control theory, economics, and mathematics (Basar and Olsder, 1998; Nisan et al., 2007; Daskalakis et al., 2009). Some connections have been made between this extensive literature and machine learning (see, e.g., Cesa-Bianchi and Lugosi, 2006; Banerjee and Peng, 2003; Foerster et al., 2017), but these connections have focused principally on decision-making by single agents and multiple agents, and not on the burgeoning pattern-recognition side of machine learning, with its focus on large data sets and simple gradient-based algorithms for prediction and inference. This gap has begun to close in recent years, due to new formulations of learning problems as involving competition between subsystems that are construed as adversaries
(Goodfellow et al., 2014), the need to robustify learning systems against actual adversaries (Xu et al., 2009) and against mismatch between assumptions and data-generating mechanisms (Yang, 2011; Giordano et al., 2018), and an increasing awareness that real-world machine-learning systems are often embedded in larger economic systems or networks (Jordan, 2018). These emerging connections bring significant algorithmic and conceptual challenges to the fore. Indeed, while gradient-based learning has been a major success in machine learning, both in theory and in practice, work on gradient-based algorithms in game theory has often highlighted their limitations. For example, gradient-based approaches are known to be difficult to tune and train
(Daskalakis et al., 2017; Mescheder et al., 2017; Hommes and Ochea, 2012; Balduzzi et al., 2018), and recent work has shown that gradient-based learning will almost surely avoid a subset of the local Nash equilibria in general-sum games (Mazumdar and Ratliff). Moreover, there is no shortage of work showing that gradient-based algorithms can converge to limit cycles or even diverge in game-theoretic settings (Benaïm and Hirsch, 1999; Hommes and Ochea, 2012; Daskalakis et al., 2017; Mertikopoulos et al., 2018b). These drawbacks have led to a renewed interest in approaches to finding the Nash equilibria of zero-sum games, or equivalently, to solving saddle-point problems. Recent work has attempted to use second-order information to reduce oscillations around equilibria and speed up convergence to fixed points of the gradient dynamics (Mescheder et al., 2017; Balduzzi et al., 2018). Other recent approaches have attempted to tackle the problem from the variational-inequality perspective, also with an eye toward reducing oscillatory behaviors (Mertikopoulos et al., 2018a; Gidel et al., 2018).
None of these approaches, however, addresses a fundamental issue that arises in zero-sum games. As we will discuss, the set of attracting fixed points of the gradient dynamics in zero-sum games can include critical points that are not Nash equilibria. In fact, any saddle point of the underlying function that does not satisfy the particular alignment condition of a Nash equilibrium is a candidate attracting equilibrium for the gradient dynamics. Further, as we show, such points are attracting for a variety of recently proposed adjustments to gradient-based algorithms, including consensus optimization (Mescheder et al., 2017), the symplectic gradient adjustment (Balduzzi et al., 2018), and a two-timescale version of simultaneous gradient descent (Heusel et al., 2017). Moreover, we show by counterexample that these algorithms can all converge to non-Nash stationary points.
We present a new gradient-based algorithm for finding the local Nash equilibria of two-player zero-sum games and prove that the only stationary points to which the algorithm can converge are local Nash equilibria. Our algorithm makes essential use of the underlying structure of zero-sum games. To obtain our theoretical results we work in continuous time, via an ordinary differential equation (ODE), and our algorithm is obtained by discretizing the ODE. While a naive discretization would require a matrix inversion and would be computationally burdensome, ours is a two-timescale discretization that avoids matrix inversion entirely and has computational complexity similar to that of other gradient-based algorithms.
The paper is organized as follows. In Section 2 we define our notation and the problem we address. In Section 3 we define the limiting ODE that we would like our algorithm to follow and show that it has the desirable property that its only limit points are local Nash equilibria of the game. In Section 4 we introduce local symplectic surgery, a two-timescale procedure that asymptotically tracks the limiting ODE, and show that it can be implemented efficiently. Finally, in Section 5 we present two numerical examples to validate the algorithm. The first is a toy example with three local Nash equilibria and one non-Nash fixed point. We show that simultaneous gradient descent and other recently proposed algorithms for zero-sum games can converge to any of the four points, depending on the initialization, while the proposed algorithm converges only to the local Nash equilibria. The second example is a small generative adversarial network (GAN), where we show that the proposed algorithm converges to a suitable solution within a similar number of steps as simultaneous gradient descent.
2 Preliminaries
We consider a two-player game in which one player tries to minimize a function $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ with respect to their decision variable $x$, and the other player aims to maximize $f$ with respect to their decision variable $y$, where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$. We write such a game as $\{(f, -f)\}$, since the second player can be seen as minimizing $-f$. We assume that neither player knows anything about the critical points of $f$, but that both players follow the rules of the game. Such a situation arises naturally when training machine learning algorithms (e.g., training generative adversarial networks or in multi-agent reinforcement learning). Without restricting $f$, and assuming both players are noncooperative, the best they can hope to achieve is a local Nash equilibrium; i.e., a point $(x^*, y^*)$ that satisfies
$$f(x^*, y) \leq f(x^*, y^*) \leq f(x, y^*)$$
for all $x$ and $y$ in neighborhoods of $x^*$ and $y^*$ respectively. Such equilibria are locally optimal for both players with respect to their own decision variable, meaning that neither player has an incentive to unilaterally deviate from such a point. As was shown in Ratliff et al. (2013), generically, local Nash equilibria will satisfy slightly stronger conditions, namely they will be differential Nash equilibria (DNE):
A strategy $(x^*, y^*)$ is a differential Nash equilibrium if:

$D_x f(x^*, y^*) = 0$ and $D_y f(x^*, y^*) = 0$;

$D_{xx} f(x^*, y^*) \succ 0$, and $-D_{yy} f(x^*, y^*) \succ 0$.
Here $D_x f$ and $D_y f$ denote the partial derivatives of $f$ with respect to $x$ and $y$ respectively, and $D_{xx} f$ and $D_{yy} f$ denote the matrices of second derivatives of $f$ with respect to $x$ and $y$. Both differential and local Nash equilibria in two-player zero-sum games are, by definition, special saddle points of the function $f$ that satisfy a particular alignment condition with respect to the players' decision variables. Indeed, the definition of differential Nash equilibria, which holds for almost all local Nash equilibria in a formal mathematical sense, makes this condition clear: the directions of positive and negative curvature of the function at a local Nash equilibrium must be aligned with the minimizing and maximizing players' decision variables respectively.
We note that the key difference between local and differential Nash equilibria is that $D_{xx} f$ and $-D_{yy} f$ are required to be definite instead of semidefinite. This distinction simplifies our analysis while still allowing our results to hold for almost all continuous games.
2.1 Issues with gradient-based algorithms in zero-sum games
Having introduced local Nash equilibria as the solution concept of interest, we now consider how to find such solutions, and in particular we highlight some issues with gradient-based algorithms in zero-sum continuous games. The most common method of finding local Nash equilibria in such games is to have both players randomly initialize their variables and then follow their respective gradients. That is, at each step $k$, each agent updates their variable as follows:
$$x_{k+1} = x_k - \gamma_k D_x f(x_k, y_k), \qquad y_{k+1} = y_k + \gamma_k D_y f(x_k, y_k),$$
where $\{\gamma_k\}$ is a sequence of step sizes. The minimizing player performs gradient descent on their cost while the maximizing player ascends their gradient. We refer to this algorithm as simultaneous gradient descent (simGD). To simplify the notation, we let $z = (x, y)$, and define the vector-valued function $\omega : \mathbb{R}^{m+n} \to \mathbb{R}^{m+n}$ as:
$$\omega(z) = \begin{bmatrix} D_x f(x, y) \\ -D_y f(x, y) \end{bmatrix}.$$
In this notation, the simGD update is given by:
$$z_{k+1} = z_k - \gamma_k\, \omega(z_k). \tag{1}$$
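As a concrete toy illustration of the simGD update (our own example; the cost $f(x, y) = xy + 0.1x^2 - 0.1y^2$ and all constants are hypothetical, not from the text):

```python
# Hypothetical cost for illustration: f(x, y) = x*y + 0.1*x**2 - 0.1*y**2.
# The minimizing player descends D_x f; the maximizing player ascends D_y f.
def dxf(x, y):
    return y + 0.2 * x   # D_x f

def dyf(x, y):
    return x - 0.2 * y   # D_y f

def sim_gd(x, y, step=0.05, iters=2000):
    for _ in range(iters):
        # simultaneous update: both players move using the current (x, y)
        x, y = x - step * dxf(x, y), y + step * dyf(x, y)
    return x, y

x, y = sim_gd(1.0, 1.0)   # spirals into the critical point at the origin
```

The spiraling approach to the origin here previews the oscillatory behavior discussed below: the coupling term $xy$ contributes a rotational component to the vector field.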
Since (1) is in the form of a discrete-time dynamical system, it is natural to examine its limiting behavior through the lens of dynamical systems theory. Intuitively, given a properly chosen sequence of step sizes, (1) should have the same limiting behavior as the continuous-time flow:
$$\dot{z} = -\omega(z). \tag{2}$$
We can analyze this flow in neighborhoods of equilibria by studying the Jacobian matrix of $\omega$, denoted $J(z)$:
$$J(z) = \begin{bmatrix} D_{xx} f(z) & D_{xy} f(z) \\ -D_{yx} f(z) & -D_{yy} f(z) \end{bmatrix}. \tag{3}$$
We remark that the diagonal blocks of $J(z)$ are always symmetric and that $D_{yx} f(z) = (D_{xy} f(z))^\top$. Thus $J(z)$ can be written as the sum of a block symmetric matrix $S(z)$ and a block antisymmetric matrix $A(z)$, where:
$$S(z) = \begin{bmatrix} D_{xx} f(z) & 0 \\ 0 & -D_{yy} f(z) \end{bmatrix}, \qquad A(z) = \begin{bmatrix} 0 & D_{xy} f(z) \\ -(D_{xy} f(z))^\top & 0 \end{bmatrix}.$$
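To make the decomposition concrete, here is a small numpy check (our own sketch; the scalar-player quadratic game and its coefficients are hypothetical):

```python
import numpy as np

# Hypothetical quadratic game f(x, y) = x*y + 0.5*a*x**2 - 0.5*d*y**2
# with scalar players, so omega(z) = (D_x f, -D_y f) has Jacobian:
a, d = 1.0, 2.0
J = np.array([[a,    1.0],   # [ D_xx f    D_xy f ]
              [-1.0, d  ]])  # [ -D_yx f  -D_yy f ]

S = 0.5 * (J + J.T)   # symmetric part: diag(D_xx f, -D_yy f)
A = 0.5 * (J - J.T)   # antisymmetric part, built from D_xy f

assert np.allclose(J, S + A)
assert np.allclose(S, np.diag([a, d]))            # each player's own curvature
assert np.allclose(A, [[0.0, 1.0], [-1.0, 0.0]])  # the rotational component
```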
Given the structure of the Jacobian, we can now draw links between differential Nash equilibria and equilibrium concepts in dynamical systems theory. We focus on hyperbolic critical points of $\omega$.
A strategy $z$ is a critical point of $\omega$ if $\omega(z) = 0$. It is a hyperbolic critical point if $\mathrm{Re}(\lambda_i(J(z))) \neq 0$ for all $i$, where $\mathrm{Re}(\lambda_i(J(z)))$ denotes the real part of the $i$-th eigenvalue of $J(z)$. It is well known that hyperbolic critical points are generic among critical points of smooth dynamical systems (see, e.g., Sastry, 1999), meaning that our focus on hyperbolic critical points is not very restrictive. Of particular interest are locally asymptotically stable equilibria of the dynamics. A strategy $z^*$ is a locally asymptotically stable equilibrium (LASE) of the continuous-time dynamics (2) if $\omega(z^*) = 0$ and $\mathrm{Re}(\lambda_i(J(z^*))) > 0$ for all $i$. LASE have the desirable property that they are locally exponentially attracting under the flow of (2). This implies that a properly discretized version of (2) will also converge exponentially fast in a neighborhood of such points. LASE are the only attracting hyperbolic equilibria. Thus, making statements about all the LASE of a certain continuous-time dynamical system allows us to characterize all of its attracting hyperbolic equilibria.
As shown in Ratliff et al. (2013) and Nagarajan and Kolter (2017), the fact that all differential Nash equilibria are critical points of $\omega$, coupled with the structure of $J(z)$ in zero-sum games, guarantees that all differential Nash equilibria of the game are LASE of the gradient dynamics. However, the converse is not true. The structure present in zero-sum games is not enough to ensure that the differential Nash equilibria are the only LASE of the gradient dynamics. Even when $D_{xx} f$ or $-D_{yy} f$ is indefinite at a critical point of $\omega$, the Jacobian can still have eigenvalues with strictly positive real parts.
Consider a matrix having the form:
$$J = \begin{bmatrix} a & b \\ -b & d \end{bmatrix},$$
where $a > 0$ and $d < 0$. These conditions imply that $J$ cannot be the Jacobian of $\omega$ at a local Nash equilibrium, since its symmetric part is indefinite. However, if $a + d > 0$ and $b^2 > -ad$, both of the eigenvalues of $J$ will have strictly positive real parts, and such a point could still be a LASE of the gradient dynamics.
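Numerically, a minimal sketch of this situation (with hypothetical entries $a = 2$, $d = -1$, $b = 2$ chosen by us to satisfy the conditions):

```python
import numpy as np

# a*d < 0 rules out a differential Nash equilibrium, yet the trace a + d
# and the determinant a*d + b**2 are both positive, so both eigenvalues
# of J have strictly positive real parts: zdot = -J z is attracting.
a, d, b = 2.0, -1.0, 2.0
J = np.array([[a, b], [-b, d]])

eigs = np.linalg.eigvals(J)
assert a * d < 0                 # indefinite symmetric part: non-Nash
assert np.all(eigs.real > 0)     # ...but attracting for the gradient flow
```

For a $2 \times 2$ matrix, positive trace and positive determinant suffice for both eigenvalues to have positive real part, which is why the off-diagonal (antisymmetric) entry $b$ can stabilize an indefinite diagonal.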
Such points, which we refer to as non-Nash LASE of (2), are what makes guarantees on the convergence of algorithms in zero-sum games particularly difficult to obtain. Non-Nash LASE are not locally optimal for both players, and may not even be optimal for one of the players. By definition, at least one of the two players has a direction in which they could move to unilaterally decrease their cost. Such points arise solely due to the gradient dynamics, and persist even in other gradient-based dynamics suggested in the literature. In Appendix B, we show that three recent algorithms for finding local Nash equilibria in zero-sum continuous games—consensus optimization, symplectic gradient adjustment, and a two-timescale version of simGD—are susceptible to converging to such points and therefore have no guarantees of convergence to local Nash equilibria. We note that such points can be very common, since every saddle point of $f$ that is not a local Nash equilibrium is a candidate non-Nash LASE of the gradient dynamics. Further, local minima or maxima of $f$ could also be non-Nash LASE of the gradient dynamics.
To understand how non-Nash equilibria can be attracting under the flow of (2), we again analyze the Jacobian of $\omega$. At such points, the symmetric matrix $S(z)$ must have both positive and negative eigenvalues. The sum of $S(z)$ with $A(z)$, however, has eigenvalues with strictly positive real part. Thus, the antisymmetric matrix $A(z)$ can be seen as stabilizing such points.
Previous gradient-based algorithms for zero-sum games have also pinpointed the matrix $A(z)$ as a source of problems in zero-sum games; however, they focus on a different issue. Consensus optimization (Mescheder et al., 2017) and the symplectic gradient adjustment (Balduzzi et al., 2018) both seek to adjust the gradient dynamics to reduce oscillatory behaviors in neighborhoods of stable equilibria. Since the matrix $A(z)$ is antisymmetric, it has only imaginary eigenvalues. If it dominates $S(z)$, then the eigenvalues of $J(z)$ can have a large imaginary component. This leads to oscillations around equilibria that have been shown empirically to slow down convergence (Mescheder et al., 2017). Both of these adjustments rely on tunable hyperparameters to achieve their goals. Their effectiveness is therefore highly reliant on the choice of hyperparameter. Further, as shown in Appendix B, neither of the adjustments is able to rule out convergence to non-Nash equilibria.
A second promising line of research into theoretically sound methods of finding the Nash equilibria of zero-sum games has approached the issue from the perspective of variational inequalities (Mertikopoulos et al., 2018a; Gidel et al., 2018). In Mertikopoulos et al. (2018a), extragradient methods were used to solve coherent saddle-point problems and reduce oscillations when converging to saddle points. In such problems, however, all saddle points of the function are assumed to be local Nash equilibria, and thus the issue of converging to non-Nash equilibria is assumed away. Similarly, by assuming that $\omega$ is monotone, as in the theoretical treatment of the averaging scheme proposed in Gidel et al. (2018), the cost function is implicitly assumed to be convex-concave. This in turn implies that the Jacobian satisfies the conditions for a Nash equilibrium everywhere. The behavior of these approaches in more general zero-sum games with less structure (such as the training of GANs) is therefore not well understood. Moreover, since the approach relies on averaging the gradients, it does not fundamentally change the nature of the critical points of simGD.
In the following sections we propose an algorithm whose only LASE are the differential Nash equilibria of the game. We also show that, regardless of the choice of hyperparameter, the Jacobian of the new dynamics at LASE has real eigenvalues, which means that the dynamics cannot exhibit oscillatory behaviors around differential Nash equilibria.
3 Constructing the limiting differential equation
In this section we define the continuoustime flow that our discretetime algorithm should ideally follow.
Assumption 1 (Lipschitz assumptions on $\omega$ and $J$)
Assume that $\omega$ and $J$ are $L_1$-Lipschitz and $L_2$-Lipschitz respectively. Finally, assume that all critical points of $\omega$ are hyperbolic.
We do not require $J(z)$ to be invertible everywhere, but only at the critical points of $\omega$.
Now, consider the continuous-time flow:
$$\dot{z} = -\left(\omega(z) - A(z)\big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top \omega(z)\right), \tag{4}$$
where $\lambda : \mathbb{R}^{m+n} \to [0, \infty)$ is such that $\lambda(z) > 0$ for all $z$ with $\omega(z) \neq 0$, and $\lambda(z) = 0$ whenever $\omega(z) = 0$.
The function $\lambda$ ensures that, even when $J(z)$ is not invertible everywhere, the inverse matrix in (4) exists. The vanishing condition ensures that the Jacobian of the adjustment term is exactly $-A(z)$ at differential Nash equilibria.
The dynamics introduced in (4) can be seen as an adjusted version of the gradient dynamics in which the adjustment term only allows trajectories to approach critical points of $\omega$ along the players' axes. If a critical point is not locally optimal for one of the players (i.e., it is a non-Nash critical point), then that player can push the dynamics out of a neighborhood of that point. The mechanism is easier to see if we assume $J(z)$ is invertible everywhere and set $\lambda \equiv 0$. This results in the following dynamics:
$$\dot{z} = -\left(\omega(z) - A(z) J(z)^{-1} \omega(z)\right). \tag{5}$$
In this simplified form we can see that the Jacobian of the adjustment term is approximately $-A(z)$ when $\omega(z)$ is small. This approximation is exact at critical points of $\omega$. Adding this adjustment term to $\omega$ exactly cancels out the rotational part of the vector field contributed by the antisymmetric matrix $A(z)$ in a neighborhood of critical points. Since we identified $A(z)$ as the source of oscillatory behaviors and non-Nash equilibria in Section 2, this adjustment addresses both of these issues. The following theorem establishes this formally.
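The cancellation in the simplified dynamics (5) can be checked numerically; this sketch (our own, reusing a hypothetical indefinite 2x2 Jacobian) evaluates the linearization at a critical point:

```python
import numpy as np

# Hypothetical Jacobian of omega at a non-Nash critical point (omega(z*) = 0).
J = np.array([[2.0, 2.0], [-2.0, -1.0]])
S = 0.5 * (J + J.T)
A = 0.5 * (J - J.T)

# Jacobian of h(z) = omega(z) - A J^{-1} omega(z) at the critical point;
# all terms multiplied by omega(z*) vanish there.
Dh = J - A @ np.linalg.inv(J) @ J

assert np.allclose(Dh, S)   # the rotational part A is surgically removed
# S = diag(2, -1) is indefinite, so this point is a saddle of the new
# dynamics rather than an attractor: the maximizing player can escape.
assert np.any(np.linalg.eigvals(Dh).real < 0)
```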
Under Assumption 1, and provided that $\omega(z)$ is never an eigenvector of $A(z)\big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top$ with eigenvalue $1$, the continuous-time dynamical system (4) satisfies:

$z^*$ is a LASE of (4) if and only if $z^*$ is a differential Nash equilibrium of the game $\{(f, -f)\}$.

If $z^*$ is a critical point of $\omega$, then the Jacobian of the dynamics (4) at $z^*$ has real eigenvalues.
We first show that, writing $h(z)$ for the adjusted vector field in (4):
$$h(z) = 0 \iff \omega(z) = 0.$$
Clearly, $\omega(z) = 0 \implies h(z) = 0$. To show the converse, we assume that $h(z) = 0$ but $\omega(z) \neq 0$. This implies that:
$$\omega(z) = A(z)\big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top \omega(z).$$
Since we assumed that this cannot be true, we must have $\omega(z) = 0$.
Having shown that, under our assumptions, the critical points of $h$ are the same as those of $\omega$, we now note that the Jacobian of $h$ at a critical point $z$ must have the form:
$$Dh(z) = J(z) - A(z)\big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top J(z) + (\text{terms multiplied by } \omega(z)).$$
By assumption, at critical points, $J(z)$ is invertible and $\lambda(z) = 0$. Given that $\omega(z) = 0$, the terms that include $\omega(z)$ disappear, and the adjustment term contributes only a factor of $-A(z)$ to the Jacobian of $h$ at a critical point. This exactly cancels out the antisymmetric part of the Jacobian of $\omega$. The Jacobian of $h$ is therefore symmetric at critical points of $\omega$, and it has only positive eigenvalues exactly when $D_{xx} f(z) \succ 0$ and $-D_{yy} f(z) \succ 0$.
Since these are also the conditions for differential Nash equilibria, all differential Nash equilibria of the game must be LASE of (4). Further, non-Nash LASE of (2) cannot be LASE of (4), since by definition either $D_{xx} f$ or $-D_{yy} f$ is indefinite at such points. To show the second part of the theorem, we simply note that $Dh(z)$ must be symmetric at all critical points, which in turn implies that it has only real eigenvalues.
The continuous-time dynamical system (4) therefore solves both of the problems we highlighted in Section 2, for any choice of the function $\lambda$ that satisfies our assumptions. The assumption that $\omega(z)$ is never an eigenvector of $A(z)\big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top$ with an eigenvalue of $1$ ensures that the adjustment does not create new critical points. In high dimensions this assumption is mild, since the scenario is extremely specific, and it is also possible to remove this assumption entirely by adding a time-varying term to (4) while still retaining the theoretical guarantees. We show this in Appendix A. Theorem 3 shows that the only attracting hyperbolic equilibria of the limiting ordinary differential equation (ODE) are the differential Nash equilibria of the game. Also, since $Dh(z)$ is symmetric at critical points of $\omega$, if either $D_{xx} f$ or $-D_{yy} f$ has at least one negative eigenvalue, then such a point is a linearly unstable equilibrium of (4), and is therefore almost surely avoided when the algorithm is randomly initialized (Benaïm and Hirsch, 1995; Sastry, 1999).
Theorem 3 also guarantees that the continuous-time dynamics do not oscillate near critical points. Oscillatory behaviors, as outlined in Mescheder et al. (2017), are known to slow down convergence of the discretized version of the process. Reducing oscillations near critical points is the main goal of consensus optimization (Mescheder et al., 2017) and the symplectic gradient adjustment (Balduzzi et al., 2018). However, for both algorithms, the extent to which they are able to reduce the oscillations depends on the choice of hyperparameter. The proposed dynamics achieve this for any $\lambda$ that satisfies our assumptions. We close this section by noting that one can premultiply the adjustment term by a positive scalar-valued function $c(z)$ while still retaining the theoretical properties described in Theorem 3. Such a function can be used to ensure that the dynamics closely track a trajectory of simGD except in neighborhoods of critical points. For example, if the matrix $J(z)$ is ill-conditioned, such a term can be used to ensure that the adjustment does not dominate the underlying gradient dynamics. In Section 5 we give an example of such a damping function.
4 Twotimescale approximation
Given the limiting ODE, we could perform a straightforward Euler discretization to obtain a discrete-time update of the form:
$$z_{k+1} = z_k - \gamma_k \left(\omega(z_k) - A(z_k)\big(J(z_k)^\top J(z_k) + \lambda(z_k) I\big)^{-1} J(z_k)^\top \omega(z_k)\right).$$
However, due to the matrix inversion, such a discrete-time update would be prohibitively expensive to implement in high-dimensional parameter spaces like those encountered when training GANs. To solve this problem, we now introduce a two-timescale approximation to the continuous-time dynamics that has the same limiting behavior but is much faster to compute at each iteration than the simple discretization. Since this procedure serves to exactly remove the symplectic part $A(z)$ of the Jacobian in neighborhoods of hyperbolic critical points, we refer to this two-timescale procedure as local symplectic surgery (LSS). In Appendix A we derive the two-timescale update rule for the time-varying version of the limiting ODE and show that it has the same properties.
The two-timescale approximation to (4) is given by:
$$v_{k+1} = v_k - \alpha_k\, h_1(z_k, v_k), \qquad z_{k+1} = z_k - \beta_k\, h_2(z_k, v_k), \tag{6}$$
where $h_1$ and $h_2$ are defined as:
$$h_1(z, v) = \big(J(z)^\top J(z) + \lambda(z) I\big)v - J(z)^\top \omega(z), \qquad h_2(z, v) = \omega(z) - A(z) v,$$
and the sequences of step sizes $\{\alpha_k\}$ and $\{\beta_k\}$ satisfy the following assumptions:
Assumption 2 (Assumptions on the step sizes)
The sequences $\{\alpha_k\}$ and $\{\beta_k\}$ satisfy:

$\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$;

$\sum_k \beta_k = \infty$ and $\sum_k \beta_k^2 < \infty$;

$\beta_k / \alpha_k \to 0$.
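One concrete step-size choice satisfying these conditions (our own example, not a prescription from the text) is $\alpha_k = k^{-0.6}$ and $\beta_k = k^{-0.9}$:

```python
# alpha_k = k**-0.6: not summable (exponent <= 1), square-summable (1.2 > 1).
# beta_k = k**-0.9: likewise not summable but square-summable.
# beta_k / alpha_k = k**-0.3 -> 0, so v runs on the faster timescale.
def alpha(k):
    return k ** -0.6

def beta(k):
    return k ** -0.9

ratios = [beta(k) / alpha(k) for k in (1, 10, 100, 1000)]
# the ratio k**-0.3 decreases monotonically toward zero
```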
We note that $h_1$ is Lipschitz continuous in $v$, uniformly in $z$, under Assumption 1.
The $v$ process performs gradient descent on a regularized least-squares objective, where the regularization is governed by $\lambda$. Since the $v$ process is on a faster timescale, the intuition is that it will first converge to $v^*(z) = \big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top \omega(z)$, after which $z_k$ will track the limiting ODE in (4). In the next section we show that this behavior holds even in the presence of noise.
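The two-timescale mechanism can be sketched on a quadratic game (our own construction: $\omega(z) = Jz$ with a constant, invertible, hypothetical $J$, and $\lambda \equiv 0$ for simplicity):

```python
import numpy as np

# Constant Jacobian whose symmetric part diag(2, -1) is indefinite,
# so the critical point z = 0 is a non-Nash LASE of plain simGD.
J = np.array([[2.0, 2.0], [-2.0, -1.0]])
A = 0.5 * (J - J.T)

z = np.array([0.5, 0.5])      # slow iterate of the adjusted process
z_gd = z.copy()               # plain simGD iterate, for comparison
v = np.zeros(2)               # fast iterate, tracks (J^T J)^{-1} J^T omega
alpha, beta = 0.1, 0.01       # beta << alpha: v equilibrates first

for _ in range(2000):
    omega = J @ z
    v = v - alpha * (J.T @ (J @ v - omega))  # h1 step: least-squares descent
    z = z - beta * (omega - A @ v)           # h2 step: adjusted slow dynamics
    z_gd = z_gd - beta * (J @ z_gd)          # simGD step

# simGD is drawn into the non-Nash point at the origin, while the
# two-timescale process escapes along the maximizing player's direction.
assert np.linalg.norm(z_gd) < 0.1
assert np.linalg.norm(z) > 1.0
```

Once $v$ has equilibrated, the slow update follows $\omega - A v^* $, whose linearization at the origin is the indefinite symmetric part of $J$, so the non-Nash point repels the iterates.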
The key benefit of the two-timescale process is that $h_1$ and $h_2$ can be computed efficiently, since neither requires a matrix inversion. In fact, as we show in Appendix C, the computation can be done with Jacobian-vector products with the same order of complexity as that of simGD, consensus optimization, and the symplectic gradient adjustment. This insight gives rise to the procedure outlined in Algorithm 1.
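To illustrate the Jacobian-vector-product idea (our own sketch, not Algorithm 1 itself; the function $f$ below is hypothetical), the product $J(z)v$ can be formed without ever materializing $J$, here via a central finite difference of $\omega$; an autodiff framework would supply the same product, and the transpose product, exactly:

```python
import numpy as np

def omega(z):
    # Hypothetical f(x, y) = x*y + x**2 - y**2, so omega = (D_x f, -D_y f).
    x, y = z
    return np.array([y + 2 * x, -(x - 2 * y)])

def jvp(z, v, eps=1e-6):
    # Approximates J(z) @ v using only evaluations of omega.
    return (omega(z + eps * v) - omega(z - eps * v)) / (2 * eps)

z = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
J = np.array([[2.0, 1.0], [-1.0, 2.0]])   # J written out, for checking only

assert np.allclose(jvp(z, v), J @ v, atol=1e-4)
```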
4.1 Long-term behavior of the two-timescale approximation
We now show that LSS asymptotically tracks the limiting ODE even in the presence of noise. This implies that the algorithm has the same limiting behavior as (4). In particular, our setup allows us to treat the case where one only has access to unbiased estimates of $\omega$ and $J$ at each iteration. This is the setting most likely to be encountered in practice, for example when training GANs in a minibatch setting. We assume that we have access to estimators $\hat{\omega}$ and $\hat{J}$ such that $\mathbb{E}[\hat{\omega}(z)] = \omega(z)$ and $\mathbb{E}[\hat{J}(z)] = J(z)$.
To place this in the form of classical two-timescale stochastic approximation processes, we write each estimator as the sum of its mean and a zero-mean noise process, $M^{(1)}$ and $M^{(2)}$ respectively. This results in the following two-timescale process:
$$v_{k+1} = v_k - \alpha_k\big(h_1(z_k, v_k) + M^{(1)}_{k+1}\big), \qquad z_{k+1} = z_k - \beta_k\big(h_2(z_k, v_k) + M^{(2)}_{k+1}\big). \tag{7}$$
We assume that the noise processes satisfy the following standard conditions (Benaïm, 1999; Borkar, 2008):
Assumption 3
Assumptions on the noise: Define the filtration $\mathcal{F}_k = \sigma\big(v_0, z_0, M^{(1)}_\ell, M^{(2)}_\ell,\ \ell \leq k\big)$ for $k \geq 0$. We assume that:

$M^{(1)}_{k+1}$ and $M^{(2)}_{k+1}$ are conditionally independent given $\mathcal{F}_k$ for $k \geq 0$;

$\mathbb{E}\big[M^{(i)}_{k+1} \mid \mathcal{F}_k\big] = 0$ for $i = 1, 2$ and $k \geq 0$;

$\mathbb{E}\big[\|M^{(i)}_{k+1}\|^2 \mid \mathcal{F}_k\big] \leq c_i\big(1 + \|v_k\|^2 + \|z_k\|^2\big)$ almost surely for some positive constants $c_1$ and $c_2$.
Given our assumptions on the estimators, cost function, and step sizes, we now show that (7) asymptotically tracks a trajectory of the continuous-time dynamics almost surely. Since $h_1$ and $h_2$ are not uniformly Lipschitz continuous in both $z$ and $v$, we cannot directly invoke results from the literature. Instead, we adapt the proof of Theorem 2 in Chapter 6 of Borkar (2008) to show that $\|v_k - v^*(z_k)\| \to 0$ almost surely. We then invoke Proposition 4.1 from Benaïm (1999) to show that $z_k$ asymptotically tracks (4). We note that this approach holds only on the event $\{\sup_k (\|v_k\| + \|z_k\|) < \infty\}$. Thus, if the stochastic approximation process remains bounded, then under our assumptions we are sure to track a trajectory of the limiting ODE.
On the event $\{\sup_k (\|v_k\| + \|z_k\|) < \infty\}$, $\|v_k - v^*(z_k)\| \to 0$ almost surely.
We first rewrite (7) as:
$$v_{k+1} = v_k - \alpha_k\big(h_1(z_k, v_k) + M^{(1)}_{k+1}\big), \qquad z_{k+1} = z_k - \alpha_k\Big(\varepsilon_k + \tfrac{\beta_k}{\alpha_k} M^{(2)}_{k+1}\Big),$$
where $\varepsilon_k = (\beta_k / \alpha_k)\, h_2(z_k, v_k)$. By assumption, $\beta_k / \alpha_k \to 0$. Since $h_2$ is locally Lipschitz continuous, it is bounded on the event $\{\sup_k (\|v_k\| + \|z_k\|) < \infty\}$. Thus, $\varepsilon_k \to 0$ almost surely.
From Lemma 1 in Chapter 6 of Borkar (2008), the above processes, on the event $\{\sup_k (\|v_k\| + \|z_k\|) < \infty\}$, converge almost surely to internally chain-transitive invariant sets of $\dot{v} = -h_1(z, v)$, $\dot{z} = 0$. Since, for fixed $z$, $h_1$ is a Lipschitz continuous function of $v$ whose flow has a globally asymptotically stable equilibrium at $v^*(z)$, the claim follows.
Having shown that $\|v_k - v^*(z_k)\| \to 0$ almost surely, we now show that $z_k$ will asymptotically track a trajectory of the limiting ODE. Let us first define $\Phi_t(z)$, for $t \geq 0$, to be the trajectory of (4) starting at $z$ at time $0$.
We first rewrite the $z$ process as:
$$z_{k+1} = z_k - \beta_k\big(h_2(z_k, v^*(z_k)) + \delta_k + M^{(2)}_{k+1}\big),$$
where $\delta_k = h_2(z_k, v_k) - h_2(z_k, v^*(z_k))$. We note that, from Lemma 4.1, $\delta_k \to 0$ almost surely. Since $h_2(z, v^*(z)) = h(z)$, we can write this process as:
$$z_{k+1} = z_k - \beta_k\big(h(z_k) + \delta_k + M^{(2)}_{k+1}\big),$$
where $\delta_k \to 0$ almost surely. Since $h$ is continuously differentiable, it is locally Lipschitz, and on the event $\{\sup_k \|z_k\| < \infty\}$ it is bounded. It thus induces a continuous, globally integrable vector field, and therefore satisfies the assumptions of Proposition 4.1 in Benaïm (1999). Further, by assumption, the sequence of step sizes and the martingale difference sequence satisfy the assumptions of Proposition 4.2 in Benaïm (1999). Invoking Propositions 4.1 and 4.2 in Benaïm (1999) gives the desired result.
Theorem 4.1 guarantees that LSS asymptotically tracks a trajectory of the limiting ODE. The approximation will therefore avoid the non-Nash equilibria of the gradient dynamics. Further, the only locally asymptotically stable points for LSS must be the differential Nash equilibria of the game.
5 Numerical Examples
We now present two numerical examples that illustrate the performance of both the limiting ODE and LSS. The first is a zero-sum game played over a function in $\mathbb{R}^2$ that allows us to observe the behavior of the limiting ODE around both local Nash and non-Nash equilibria. In the second example we use LSS to train a small generative adversarial network (GAN) to learn a mixture of eight Gaussians. Further numerical experiments and comments are provided in Appendix D.
5.1 2D example
For the first example, we consider the game based on the following function in $\mathbb{R}^2$:
This function is a fourth-order polynomial scaled by an exponential to ensure that it is bounded. The gradient dynamics associated with this function have four LASE. By evaluating the Jacobian of $\omega$ at these points, we find that three of the LASE are local Nash equilibria. These are denoted by 'x' in Figure 1. The fourth LASE is a non-Nash equilibrium, which is denoted with a star. In Figure 1, we plot the sample paths of both simGD and our limiting ODE from the same initial positions, shown with red dots. We clearly see that simGD converges to all four LASE, depending on the initialization. Our algorithm, on the other hand, converges only to the local Nash equilibria. When initialized close to the non-Nash equilibrium, it diverges from the simGD path and ends up converging to a local Nash equilibrium.
This numerical example also allows us to study the behavior of our algorithm around LASE. By focusing on a local Nash equilibrium, as in Figure 1B, we observe that the limiting ODE approaches it directly even when simGD displays oscillatory behaviors. This empirically validates the second part of Theorem 3.
In Figure 2 we empirically validate that LSS asymptotically tracks the limiting ODE. Before the fast timescale has converged, the process tracks the gradient dynamics. Once it has converged, however, the process closely tracks the limiting ODE, which leads it to converge only to the local Nash equilibria. This behavior highlights an issue with the two-timescale approach: since the non-Nash equilibria of the gradient dynamics are saddle points of the new dynamics, they can slow down convergence. However, the process will eventually escape such points (Benaïm, 1999).
In our numerical experiments we also make use of a damping function $c(z)$ as described in Section 3, so that the limiting ODE is given by:
$$\dot{z} = -\left(\omega(z) - c(z)\, A(z)\big(J(z)^\top J(z) + \lambda(z) I\big)^{-1} J(z)^\top \omega(z)\right).$$
For the two-timescale process, since there is no noise, we use constant step sizes $\alpha$ and $\beta$ with $\beta \ll \alpha$ in the update (6).
5.2 Generative adversarial network
We now train a generative adversarial network with LSS. Both the discriminator and generator are fully connected neural networks with four hidden layers of 16 neurons each. The tanh activation function is used since it satisfies the smoothness assumptions on our functions. For the latent space, we use a 16-dimensional zero-mean Gaussian. The ground-truth distribution is a mixture of eight Gaussians with their modes uniformly spaced around the unit circle. In Figure 3, we show the progression of the generator at four checkpoints during training for a GAN initialized with the same weights and biases and then trained with (A) simGD and (B) LSS. We see empirically that, in this example, LSS converges to the true distribution while simGD quickly suffers mode collapse, showing how the adjusted dynamics can lead to convergence to better equilibria. Further numerical experiments are shown in Appendix D.
We caution that convergence rate per se is not necessarily a reasonable metric on which to compare performance in the GAN setting or in other game-theoretic settings. Competing algorithms may converge faster than our method when used to train GANs, but only because the competitors could be converging quickly to a non-Nash equilibrium, which is not desirable. Indeed, the optimal solution is known to be a local Nash equilibrium for GANs (Goodfellow et al., 2014; Nagarajan and Kolter, 2017). LSS may initially move towards a non-Nash equilibrium, while subsequently escaping the neighborhood of such points before converging. This leads to a slower convergence rate, but a better-quality solution.
6 Discussion
We have introduced local symplectic surgery, a new two-timescale algorithm for finding the local Nash equilibria of two-player zero-sum continuous games. We have established that the algorithm comes with the guarantee that the only hyperbolic critical points to which it can converge are the local Nash equilibria of the underlying game. This significantly improves upon previous methods for finding such points, which, as shown in Appendix B, cannot give such guarantees. We have analyzed the asymptotic properties of the proposed algorithm and have shown that it can be implemented efficiently. Altogether, these results show that the proposed algorithm yields limit points with game-theoretic relevance, while ruling out oscillations near those equilibria and having a per-iteration complexity similar to that of existing methods that do not come with the same guarantees. Our numerical examples allow us to observe these properties empirically.
It is important to emphasize that our analysis has been limited to neighborhoods of equilibria; the proposed algorithm can in principle converge to limit cycles elsewhere in the space. These are hard to rule out completely. Moreover, some of these limit cycles may actually have some game-theoretic relevance (Hommes and Ochea, 2012; Benaim and Hirsch, 1997). Another limitation of our analysis is that we have assumed the existence of local Nash equilibria. Showing that they exist, and finding them, is very hard to do in general. Our algorithm will converge to local Nash equilibria, but it may diverge when the game does not admit equilibria or when the algorithm never enters the region of attraction of any equilibrium. Thus, divergence of our algorithm is not a certificate that no equilibria exist. Such caveats, however, are the same as those for other gradient-based approaches for finding local Nash equilibria.
Another drawback of our approach is the use of second-order information. Though the two-timescale approximation does not need access to the full Jacobian of the gradient dynamics, the update does involve computing Jacobian-vector products. This is similar to other recently proposed approaches, but it will be inherently slower to compute than purely first- or zeroth-order methods. Bridging this gap while retaining similar theoretical properties remains an interesting avenue for further research.
In all, we have shown that some of the inherent flaws of gradient-based methods in zero-sum games can be overcome by designing algorithms that take advantage of the game-theoretic setting. Indeed, by using the structure of local Nash equilibria, we designed an algorithm with significantly stronger theoretical support than existing approaches.
References
 Balduzzi et al. (2018) D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, 2018.
 Banerjee and Peng (2003) B. Banerjee and J. Peng. Adaptive policy gradient in multiagent learning. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, 2003.
 Basar and Olsder (1998) T. Basar and G. Olsder. Dynamic Noncooperative Game Theory. Society for Industrial and Applied Mathematics, 2nd edition, 1998.
 Benaïm (1999) M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII, pages 1–68. Springer Berlin Heidelberg, 1999.
 Benaïm and Hirsch (1995) M. Benaïm and M. Hirsch. Dynamics of Morse-Smale urn processes. Ergodic Theory and Dynamical Systems, 15(6), 1995.
 Benaim and Hirsch (1997) M. Benaim and M. Hirsch. Learning processes, mixed equilibria and dynamical systems arising from repeated games. Games and Economic Behavior, 1997.
 Benaïm and Hirsch (1999) M. Benaïm and M. Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29:36–72, 1999.
 Borkar (2008) V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
 Daskalakis et al. (2009) C. Daskalakis, P. Goldberg, and C. Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39:195–259, 2009.
 Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, UK, 2006.
 Daskalakis et al. (2017) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. arXiv:1711.00141, 2017.
 Foerster et al. (2017) J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponent-learning awareness. CoRR, abs/1709.04326, 2017.
 Gidel et al. (2018) G. Gidel, H. Berard, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial nets. CoRR, abs/1802.10551, 2018. URL http://arxiv.org/abs/1802.10551.
 Giordano et al. (2018) R. Giordano, T. Broderick, and M. I. Jordan. Covariances, robustness, and variational Bayes. Journal of Machine Learning Research, 2018.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv:1406.2661, 2014.
 Heusel et al. (2017) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, 2017.

 Hommes and Ochea (2012) C. H. Hommes and M. I. Ochea. Multiple equilibria and limit cycles in evolutionary games with logit dynamics. Games and Economic Behavior, 74(1):434–441, 2012.
 Jordan (2018) M. I. Jordan. Artificial intelligence: The revolution hasn't happened yet. Medium, 2018.
 Mazumdar and Ratliff E. Mazumdar and L. J. Ratliff. On the convergence of gradient-based learning in continuous games. arXiv e-prints.
 Mertikopoulos et al. (2018a) P. Mertikopoulos, H. Zenati, B. Lecouat, C. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddlepoint problems: Going the extra (gradient) mile. CoRR, abs/1807.02629, 2018a.
 Mertikopoulos et al. (2018b) P. Mertikopoulos, C. H. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, 2018b.
 Mescheder et al. (2017) L. M. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems 30, 2017.
 Nagarajan and Kolter (2017) V. Nagarajan and Z. Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems 30. 2017.
 Nisan et al. (2007) N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani. Algorithmic Game Theory. Cambridge University Press, Cambridge, UK, 2007.
 Ratliff et al. (2013) L. J. Ratliff, S. A. Burden, and S. S. Sastry. Characterization and computation of local Nash equilibria in continuous games. In Proceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing, pages 917–924, Oct 2013.
 Sastry (1999) S. S. Sastry. Nonlinear Systems. Springer New York, 1999.

 Xu et al. (2009) H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485–1510, 2009.
 Yang (2011) L. Yang. Active learning with a drifting distribution. In Advances in Neural Information Processing Systems, 2011.
Appendix A Time-varying adjustment
In this section we analyze a slightly different version of (4) that allows us to remove the assumption that the game gradient is never an eigenvector of the adjustment matrix with the corresponding eigenvalue. Though this assumption is relatively mild, since intuitively it will be rare for the gradient to be exactly an eigenvector of the adjustment matrix, we show that by adding a third term to (4) we can remove it entirely while retaining our theoretical guarantees. The new dynamics are constructed by adding a time-varying term that vanishes only when the game gradient is zero. This guarantees that the only critical points of the limiting dynamics are the critical points of the game gradient. The analysis of these dynamics is slightly more involved and requires generalizing the definition of a LASE to handle time-varying dynamics. We first define an equilibrium of a potentially time-varying dynamical system as a point at which the vector field vanishes for all times. We can now generalize the definition of a LASE to the time-varying setting.
A strategy is a locally uniformly asymptotically stable equilibrium of the time-varying continuous-time dynamics if it is an equilibrium of the dynamics for all times, the linearization of the dynamics about it is time-invariant, and every eigenvalue of that linearization has strictly negative real part.
Locally uniformly asymptotically stable equilibria, under this definition, also have the property that they are locally exponentially attracting under the flow of the dynamics. Further, since the linearization around a locally uniformly asymptotically stable equilibrium is time-invariant, we can still invoke converse Lyapunov theorems like those presented in Sastry (1999) when deriving the non-asymptotic bounds.
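For intuition, when the linearization is time-invariant, local exponential stability of an equilibrium can be certified by the eigenvalues of the Jacobian; the damped-oscillator system below is an assumed toy example, not the dynamics of this appendix.

```python
import numpy as np

# Toy illustration (assumed example, not the paper's dynamics): an
# equilibrium of xdot = F(x) with a time-invariant linearization is
# locally exponentially stable when every eigenvalue of the Jacobian
# DF at the equilibrium has strictly negative real part.

def F(x):
    # Damped oscillator with an equilibrium at the origin.
    return np.array([x[1], -x[0] - 0.5 * x[1]])

def numerical_jacobian(F, x, eps=1e-6):
    # Column-by-column central finite differences.
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (F(x + e) - F(x - e)) / (2.0 * eps)
    return J

J = numerical_jacobian(F, np.zeros(2))
eigs = np.linalg.eigvals(J)
stable = bool(np.all(eigs.real < 0.0))
print(eigs, stable)
```

Here the eigenvalues are a complex-conjugate pair with real part -0.25, so the origin is locally exponentially stable.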
Having defined equilibria and a generalization of LASE for time-varying systems, we now introduce a time-varying version of the continuous-time ODE presented in Section 3, which allows us to remove the assumption that the game gradient is never an eigenvector of the adjustment matrix with the corresponding eigenvalue. The limiting ODE is given by:
(8) 
where the first term is as described in Section 3 and the second is the time-varying adjustment. We require that the time-varying adjustment be bounded, that it equal zero only at the critical points of the game gradient, and, most importantly, that at any point that is not a critical point of the game gradient it be changing in time. An example of a time-varying adjustment that satisfies these requirements is:
(9) 
for suitable choices of the constants.
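For concreteness, one family that would satisfy the stated requirements is a sketch of the form beta * sin(gamma * t) * ||omega(x)||^2, where omega denotes the game gradient; the specific form, the constants beta and gamma, and the toy gradient below are illustrative assumptions, not the paper's equation (9).

```python
import numpy as np

# Illustrative assumption (not the paper's (9)): a time-varying
# adjustment of the form beta * sin(gamma * t) * ||omega(x)||^2.
# It is bounded in t for fixed x, vanishes exactly at critical points
# of omega, and is non-constant in t everywhere else.

beta, gamma = 0.1, 2.0

def omega(x):
    # Toy game gradient with its only critical point at the origin.
    return np.array([x[0], -x[1]])

def adjustment(t, x):
    w = omega(x)
    return beta * np.sin(gamma * t) * float(np.dot(w, w))

x = np.array([1.0, 1.0])
# Changing in time away from critical points of omega...
print(adjustment(0.0, x), adjustment(0.5, x))
# ...and identically zero at the critical point, for every t.
print(adjustment(0.3, np.zeros(2)))
```

The two printed behaviors are exactly the two conditions in the text: the term cannot cancel a nonzero gradient for all times, and it introduces no new critical points.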
These conditions, as the next theorem shows, allow us to guarantee that the only locally uniformly asymptotically stable equilibria of the dynamics are the differential Nash equilibria of the game.
Under Assumption 1, the continuous-time dynamical system (8) satisfies:

A point is a locally uniformly asymptotically stable equilibrium of (8) if and only if it is a differential Nash equilibrium of the game.

If a point is an equilibrium of (8), then the Jacobian of the dynamics at that point is time-invariant and has real eigenvalues.
We first show that the critical points of the time-varying dynamics coincide with those of the original dynamics. By construction, every critical point of the original dynamics is a critical point of the time-varying dynamics. To show the converse, assume that there exists a point that is an equilibrium of the time-varying dynamics but is not a critical point of the original dynamics. At such a point the time-varying adjustment must exactly cancel the original vector field for all times. Since the original vector field is constant in time, taking the derivative of both sides with respect to time implies that the adjustment is constant in time at a point that is not a critical point. By assumption this cannot be true. Thus we have a contradiction, and the two sets of critical points coincide.
Having shown that the critical points of the time-varying dynamics are the same as those of the original dynamics, we now examine the Jacobian of the time-varying dynamics at critical points. Under the same development as in the proof of Theorem 3, the Jacobian consists of the Jacobian of the original dynamics plus a third term arising from the time-varying adjustment. Again, by construction the time-varying adjustment vanishes at critical points, so the third term disappears and the Jacobian reduces to that of the original dynamics. The proof now follows from that of Theorem 3.
We have shown that adding a time-varying term to the original adjusted dynamics allows us to remove the eigenvector assumption on the adjustment term. As in Section 3, we can now construct a two-timescale process that asymptotically tracks (8). We assume that the time-varying term is a deterministic function of the trajectory of an ODE:
with a fixed initial condition. We assume that the vector field of this ODE is Lipschitz-continuous and that the resulting time-varying term is continuous and bounded. Note that, under our assumptions, the trajectory of this ODE is well defined for all times.
The form of the time-varying term introduced in (9) can be written in terms of the output of a linear dynamical system with a fixed initial condition.
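As one way such a realization could look (an assumed sketch, not the paper's construction): a sinusoidal time factor sin(gamma * t) is the first coordinate of a linear system zdot = A z generated by a rotation matrix.

```python
import numpy as np

# Assumed realization (a sketch, not the paper's construction): the
# time factor sin(gamma * t) is the first coordinate of the linear
# system zdot = A z with A a rotation generator and z(0) = (0, 1),
# since then z(t) = (sin(gamma * t), cos(gamma * t)).

gamma = 2.0
A = np.array([[0.0, gamma], [-gamma, 0.0]])

def integrate(z0, t_final, steps=20000):
    # Explicit Euler integration of zdot = A z.
    dt = t_final / steps
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = z + dt * (A @ z)
    return z

zT = integrate([0.0, 1.0], t_final=1.0)
print(zT[0], np.sin(gamma * 1.0))  # the two should nearly agree
```

Generating the time factor as the state of an autonomous ODE is what lets the whole construction be analyzed as a single (augmented) dynamical system.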
Given this setup, the continuous-time dynamics can be written as:
(10) 
where:
Having made this further assumption on the time-varying term, we now introduce the two-timescale process that asymptotically tracks (10). This process is given by:
(11) 
where
Proceeding as in Section 3, we decompose the noisy updates into their expected values plus martingale difference sequences satisfying Assumption 3. We note that the process generating the time-varying term is deterministic.
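The two-timescale structure can be sketched generically (in the spirit of Borkar, 2008); the functions G and H, the noise level, and the step-size schedules below are assumptions for illustration, not the paper's Algorithm 2. The fast variable uses the larger step size, so it tracks its equilibrium for the current value of the slow variable.

```python
import numpy as np

# Generic two-timescale stochastic approximation (illustrative
# assumptions throughout; this is not the paper's Algorithm 2).
# Fast variable v tracks the root of G(x, .); slow variable x then
# follows H(., v). Step sizes satisfy b_k / a_k -> 0, so v sees x as
# quasi-static while x sees v as already equilibrated.

rng = np.random.default_rng(0)

def G(x, v):
    # Fast dynamics: drives v toward v_star(x) = 2 * x.
    return 2.0 * x - v

def H(x, v):
    # Slow dynamics: with v ~= 2 * x, drives x toward 0.
    return -v

x, v = 1.0, 0.0
for k in range(1, 20001):
    a_k = 1.0 / k ** 0.6   # fast (larger) step size
    b_k = 1.0 / k          # slow step size; b_k / a_k -> 0
    noise = rng.normal(scale=0.01, size=2)
    v = v + a_k * (G(x, v) + noise[0])
    x = x + b_k * (H(x, v) + noise[1])

# v should have locked onto 2 * x, and both should settle near zero.
print(x, v)
```

The separation of step sizes is what justifies analyzing the coupled stochastic process through the limiting ODE it asymptotically tracks.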
This two-timescale process gives rise to the time-varying version of local symplectic surgery (TVLSS) outlined in Algorithm 2.