1 Introduction
In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form,
(1.1)  min_{θ ∈ ℝ^p} max_{ω ∈ ℝ^q} f(θ, ω).
The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack a principled understanding at both the computational and the algorithmic level
¹We refer readers to Huszár (2017)'s concise post on this issue.

GAN optimization is a special case of (1.1). GANs were developed for learning a complex, multimodal probability distribution from samples, by learning a generator function that transforms a fixed input distribution into one matching the target. Ignoring parameter regularization, the value function corresponding to a GAN is of the form,

(1.2)  f(θ, ω) = E_{X∼ν} h(D_ω(X)) + E_{Z∼μ} h(−D_ω(G_θ(Z))),

where θ and ω parametrize the generator function G_θ and the discriminator function D_ω, respectively, X is drawn from the target distribution ν, and Z from the input distribution μ. The original GAN (Goodfellow et al., 2014), for example, corresponds to choosing h = log σ, where σ is the sigmoid function; Wasserstein GAN (Arjovsky et al., 2017) considers h(t) = t; f-GAN (Nowozin et al., 2016) proposes choices of h built from the Fenchel dual of the divergence being minimized. Recently, several attempts have been made to understand whether GANs learn the target distribution in the statistical sense (Liu et al., 2017; Arora and Zhang, 2017; Liang, 2017; Arora et al., 2017).

Optimization of GANs (and of value functions of the form (1.1) at large) is hard, both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead pursue the more modest goal of searching for a local saddle point from which no player has an incentive to deviate locally.
For smooth value functions, the above conditions are equivalent to the following solution concept:
Definition 1 (Local Nash Equilibrium).
(θ*, ω*) is called a local Nash equilibrium if

1. ∇_θ f(θ*, ω*) = 0, ∇²_θθ f(θ*, ω*) ⪰ 0;

2. ∇_ω f(θ*, ω*) = 0, ∇²_ωω f(θ*, ω*) ⪯ 0.
Here we use ∇²_θω f to denote the off-diagonal term ∂²f/∂θ∂ω, and name it the interaction term throughout the paper. ∇f denotes the gradient of f, and ∇²f its Hessian.
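As a quick numerical illustration (not part of the original analysis), the interaction term of a toy quadratic game f(θ, ω) = aθ²/2 + bθω − cω²/2 can be recovered by a mixed central finite difference; the coefficients below are hypothetical, chosen purely for illustration.

```python
# Toy quadratic game f(theta, omega) = a*theta^2/2 + b*theta*omega - c*omega^2/2,
# with hypothetical coefficients chosen purely for illustration.
a, b, c = 1.0, 2.0, 1.0
f = lambda t, w: 0.5 * a * t ** 2 + b * t * w - 0.5 * c * w ** 2

# Mixed central finite difference approximating the interaction term
# d^2 f / (dtheta domega); for this game it equals b exactly.
eps = 1e-4
t0, w0 = 0.3, -0.7
interaction = (f(t0 + eps, w0 + eps) - f(t0 + eps, w0 - eps)
               - f(t0 - eps, w0 + eps) + f(t0 - eps, w0 - eps)) / (4 * eps ** 2)
print(abs(interaction - b) < 1e-6)  # True
```

For this quadratic game the purely-θ and purely-ω terms cancel in the mixed difference, so the estimate recovers the interaction coefficient b essentially exactly.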
In practice, discrete-time dynamical systems are employed to numerically approach the saddle points of f, as is the case in GANs (Goodfellow et al., 2014) and in primal-dual methods for nonlinear optimization (Singh et al., 2000). The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system,
(1.3)  θ_{t+1} = θ_t − η ∇_θ f(θ_t, ω_t),   ω_{t+1} = ω_t + η ∇_ω f(θ_t, ω_t),
where η > 0 is the step size, or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, whose asymptotic convergence has been established in Singh et al. (2000); Cherukuri et al. (2017); Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities, owing to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, which we hope to make more informative in practice:
Non-asymptotic convergence speed. In practice, one is concerned with a finite step size, which is typically subject to extensive hyperparameter tuning. A detailed characterization of the convergence speed, together with theoretical insight into the choice of learning rate, would therefore be valuable.

A simple, unified analysis for modified saddle point dynamics. Several fixes to GAN optimization have been put forth by independent researchers, each modifying the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) based on very different insights. A unified analysis that reveals the deeper connections among these proposals helps to better understand saddle point dynamics at large.
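To make the SGA update (1.3) concrete, the following minimal Python sketch runs SGA on a toy strongly convex-concave quadratic game (our own illustrative choice, not an example from the paper); the iterates spiral inward to the equilibrium at the origin.

```python
# SGA (Eqn. (1.3)) on the toy strongly convex-concave game
# f(theta, omega) = theta^2/2 + theta*omega - omega^2/2 (illustrative choice):
# the min player descends df/dtheta, the max player ascends df/domega,
# both using the current iterate simultaneously.
grad_theta = lambda t, w: t + w   # df/dtheta
grad_omega = lambda t, w: t - w   # df/domega

eta = 0.2                          # step size (learning rate), illustrative
t, w = 1.0, 1.0
for _ in range(200):
    t, w = t - eta * grad_theta(t, w), w + eta * grad_omega(t, w)
print(abs(t) + abs(w) < 1e-3)  # True: iterates spiral in to the equilibrium (0, 0)
```

For this game each step multiplies the distance to the equilibrium by |1 − η + iη| < 1, which is the cycling-inward behavior analyzed in Section 2.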
In this paper, we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely Optimistic Mirror Descent (OMD), Consensus Optimization (CO), and the Predictive Method (PM). More concretely, we provide the following theoretical contributions concerning the crucial effect of the off-diagonal interaction term in two-player games:

Stable case: curse of the interaction term. Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent on strongly convex functions. In addition, we quantitatively isolate the slowdown in the convergence rate of two-player SGA, compared with single-player gradient descent, due to the presence of the off-diagonal interaction term in the two-player game.

Unstable case: blessing of the interaction term. For unstable Nash equilibria, SGA diverges for any nonzero learning rate. We discover a unified non-asymptotic analysis that encompasses three proposed modified dynamics: OMD, CO, and PM. The analysis shows that all of these algorithms, at a high level, share the same idea of utilizing the curvature introduced by the interaction term. Unlike the slow sublinear rate of convergence experienced by single-player gradient descent on non-strongly convex functions², OMD, CO, and PM effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bilinear games.

²In fact, Nesterov (2013) constructed a convex but non-strongly convex function on which all first-order methods suffer a slow sublinear rate of convergence (in the optimization literature, a linear rate refers to exponential convergence speed).
The organization of the paper is as follows. In Section 2 we consider the (admittedly idealized) situation in which, locally, the value function satisfies strict convexity/concavity. We show non-asymptotic exponential convergence of SGA to Nash equilibria, and identify an optimized learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that the proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4 we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems on value functions of the GAN form (1.2), under objective evaluation metrics. Detailed proofs are deferred to Section 5.
2 Stable Case: Non-asymptotic Local Convergence
In this section we establish the non-asymptotic convergence of the SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can be pictured intuitively as cycling inwards toward the saddle point, with the distance to the saddle point of interest contracting exponentially. First, let us introduce the notion of a stable equilibrium.
Definition 2 (Stable Local Nash Equilibrium).
(θ*, ω*) is called a stable local Nash equilibrium if

1. ∇_θ f(θ*, ω*) = 0, ∇²_θθ f(θ*, ω*) ≻ 0;

2. ∇_ω f(θ*, ω*) = 0, ∇²_ωω f(θ*, ω*) ≺ 0.
The above notion of stability is stronger than Definition 1, in the sense that ∇²_θθ f(θ*, ω*) and −∇²_ωω f(θ*, ω*) have smallest eigenvalues bounded away from 0.
Assumption 1 (Local Strong Convexity-Concavity).
Assume that f is smooth and twice differentiable, and let (θ*, ω*) be a stable local Nash equilibrium as in Definition 2. Assume that for some μ > 0 there exists an open neighborhood of (θ*, ω*) on which the following strong convexity-concavity condition holds: ∇²_θθ f ⪰ μI and ∇²_ωω f ⪯ −μI.
It will prove convenient to introduce some notation before stating the main theorem. Define the following blockwise abbreviation for the matrix of second derivatives,
(2.1)  H = ( ∇²_θθ f   ∇²_θω f
             ∇²_ωθ f   ∇²_ωω f ),
and define the relevant conditioning quantities via
(2.2)
where λ_max(·) and λ_min(·) denote the largest and smallest eigenvalues of a matrix, respectively.
Theorem 2.1 (Exponential Convergence: SGA).
Remark 2.1.
It is interesting to compare the convergence speed of the saddle point dynamics with that of conventional gradient descent in one variable, for a strongly convex function. We remind the reader that to obtain an ε-minimizer of a strongly convex function, gradient descent requires a number of iterations governed by the condition number of the corresponding Hessian block, depending on whether we optimize with respect to θ or ω. It is now evident that, due to the presence of the interaction term ∇²_θω f, convergence of two-player SGA to a saddle point can be significantly slower than convergence of single-player gradient descent: the effective condition number governing SGA is lower-bounded by the condition number governing either single-player problem, so the convergence of SGA is slower than that of conventional GD. We emphasize that, for saddle point convergence, the slowdown effect of the interaction term is explicit in our non-asymptotic analysis.
The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be understood as follows. The presence of the off-diagonal antisymmetric component in Eqn. (2.1) means that the linear operator associated with the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, with an explicit choice of learning rate, the distance to the stable Nash equilibrium shrinks exponentially fast. Local exponential stability in the infinitesimal/asymptotic case of vanishing step size was studied in Nagarajan and Kolter (2017) (Theorem 3.1 therein), by showing that the Jacobian matrix of a particular form of GAN objective is Hurwitz (all of its eigenvalues have strictly negative real parts). Our result differs in two respects: (1) we provide non-asymptotic convergence, with specific guidance on the choice of learning rate; (2) our analysis proceeds by bounding singular values (which, for a general matrix, differ from the eigenvalue moduli), instead of working with the complex eigenvalues directly, and this simple technique generalizes to the three modified saddle point dynamics discussed in the next section.
3 Unstable Case: Local Bilinear Problem
Oscillation and instability of SGA occur when the problem is non-strongly convex-concave, as in the bilinear game (or, more precisely, whenever the value function is at least linear in one player). This observation was first made using a very simple linear game in Salimans et al. (2016). More generally, as a consequence of Theorem 2.1, this phenomenon occurs when the local Nash equilibrium is not stable.
Let us consider the extreme case in which the diagonal Hessian blocks vanish. In this case we show, via a novel unified non-asymptotic analysis, that (1) an improved analysis of Optimistic Mirror Descent (OMD), introduced in Daskalakis et al. (2017), (2) a modified version of the Predictive Method (PM), motivated by Yadav et al. (2017), and (3) Consensus Optimization (CO), introduced in Mescheder et al. (2017), all fix the oscillation problem and provide exponential convergence to unstable Nash equilibria. Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics so as to utilize the curvature generated by the interaction term, which we refer to as the "blessing" of the interaction term, in contrast with its "slowdown effect" in the strongly convex-concave case. Once again, as alluded to in the introduction, this fast linear-rate convergence result (for the non-strongly convex-concave two-player game) should be compared with the significantly slower sublinear convergence rate of all first-order methods in convex but non-strongly convex optimization (one player); the latter was proved by a lower bound argument in (Nesterov, 2013, Theorem 2.1.7). We are now ready to state, informally, the main result proved in this section.
Theorem 3.1 (Informal: Unstable Case).
In the bilinear game, all three modified dynamics enjoy a last-iterate exponential convergence guarantee.
We begin with the minimal unstable bilinear game, explaining why oscillation occurs when the game is non-strongly convex-concave, and revealing the new features of the modified dynamics. The bilinear game can be motivated by taking the lowest-order truncation of the Taylor expansion of a general smooth two-player game around a Nash equilibrium (θ*, ω*),

(3.1)  f(θ, ω) ≈ (θ − θ*)ᵀ ∇²_θω f(θ*, ω*) (ω − ω*),

assuming that the diagonal Hessian blocks vanish. Now consider the simple bilinear game f(θ, ω) = θᵀAω. With the SGA dynamics defined in (1.3), one can easily verify that

‖θ_{t+1}‖² + ‖ω_{t+1}‖² ≥ (1 + η² σ_min(A)²) (‖θ_t‖² + ‖ω_t‖²).
Therefore, the continuous limit cycles around a sphere, while for any practical learning rate η > 0 the distance to the Nash equilibrium can increase exponentially instead of converging. Per Theorem 2.1 and the discussion above, instability of SGA occurs only when the local game is approximately bilinear. From now on we focus on the simplest unstable form of the game, the bilinear game, to isolate the main idea behind fixing the instability.
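The divergence is easy to reproduce numerically; a minimal sketch of SGA on the scalar bilinear game f(θ, ω) = θω (the step size is an illustrative choice):

```python
# SGA on the minimal bilinear game f(theta, omega) = theta*omega.
# The continuous-time flow cycles on a circle, but for any step size eta > 0
# each discrete step multiplies the distance to the equilibrium
# by sqrt(1 + eta^2) > 1, so the iterates spiral outward.
eta = 0.1
t, w = 1.0, 0.0
radii = []
for _ in range(100):
    t, w = t - eta * w, w + eta * t   # simultaneous update
    radii.append((t * t + w * w) ** 0.5)
print(radii[-1] > radii[0] > 0.999)  # True: the distance grows monotonically
```

This is the discrete analogue of the inequality above with σ_min(A) = 1.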
3.1 (Improved) Optimistic Mirror Descent
Daskalakis et al. (2017) employed Optimistic Mirror Descent (OMD), motivated by online learning, to address the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for bilinear games. Although last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), exponential convergence was, to the best of our knowledge, not known. Roughly speaking, (Daskalakis et al., 2017, Theorem 1) asserts a last-iterate guarantee only for a small enough learning rate (under some additional assumptions), with convergence established only in the limit of vanishing step size. In contrast, we prove that with a properly chosen fixed step size, the distance of the last iterate of OMD to the equilibrium shrinks exponentially in the number of iterations. This improved theory also explains the exponential convergence observed in simulations.
Lemma 3.1 (Exponential Convergence: OMD).
Consider the bilinear game f(θ, ω) = θᵀAω. Assume A is square and full rank. Then the OMD dynamics,

(3.2)  θ_{t+1} = θ_t − 2η ∇_θ f(θ_t, ω_t) + η ∇_θ f(θ_{t−1}, ω_{t−1}),
       ω_{t+1} = ω_t + 2η ∇_ω f(θ_t, ω_t) − η ∇_ω f(θ_{t−1}, ω_{t−1}),

with a learning rate chosen according to the singular values of A, obtains an ε-minimizer once the number of iterations is of order log(1/ε), under an additional technical assumption on the initialization.
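As an illustration of the lemma, the following sketch runs OMD in the extrapolated form of Eqn. (3.2) on the scalar bilinear game f(θ, ω) = θω; the step size 0.1 is our own illustrative choice, not the optimized rate from the lemma.

```python
# OMD (Eqn. (3.2)) on the scalar bilinear game f(theta, omega) = theta*omega:
# step with twice the current gradient minus the previous gradient.
# Step size 0.1 is illustrative, not the optimized rate of the lemma.
eta = 0.1
t_prev, w_prev = 1.0, 0.0   # iterate at step k-1
t, w = 1.0, 0.0             # iterate at step k
for _ in range(2000):
    t_next = t - 2 * eta * w + eta * w_prev
    w_next = w + 2 * eta * t - eta * t_prev
    t_prev, w_prev, t, w = t, w, t_next, w_next
print(abs(t) + abs(w) < 1e-3)  # True: the last iterate converges
```

In contrast to plain SGA on the same game, the extrapolated step makes the per-step contraction factor strictly smaller than 1, so the last iterate converges rather than spiraling outward.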
3.2 (Modified) Predictive Methods
From a very different, ODE-based motivation, Yadav et al. (2017) proposed the Predictive Method (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predicted future location, and then perform the update. In this section we propose and analyze a modified version of the predictive method (for simultaneous gradient updates), inspired by Yadav et al. (2017).
Consider the following modified PM dynamics,
(3.3) 
Lemma 3.2 (Exponential Convergence: PM).
Consider the bilinear game f(θ, ω) = θᵀAω. Assume A is square and full rank. Fix some prediction parameter. Then the PM dynamics in Eqn. (3.3), with a suitably chosen learning rate, obtains an ε-minimizer once the number of iterations is of order log(1/ε), under an additional technical assumption on the initialization.
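The following Python sketch illustrates the predictive idea with an extragradient-style look-ahead on the scalar bilinear game; this is our own illustrative stand-in, not necessarily the exact update of Eqn. (3.3).

```python
# An extragradient-style "predictive" step on f(theta, omega) = theta*omega
# (an illustrative stand-in, not necessarily the exact Eqn. (3.3) update):
# first form a look-ahead point, then update using gradients at that point.
eta = 0.3
t, w = 1.0, 0.0
for _ in range(500):
    # predicted future location under plain SGA
    t_pred, w_pred = t - eta * w, w + eta * t
    # update the current iterate with gradients evaluated at the prediction
    t, w = t - eta * w_pred, w + eta * t_pred
print(abs(t) + abs(w) < 1e-6)  # True: the look-ahead makes the cycle contract
```

Evaluating the gradient at the predicted point injects exactly the curvature of the interaction term into the update, which is what turns the outward spiral of SGA into a contraction.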
3.3 Consensus Optimization
Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The authors' motivation is to add an extra potential vector field (the consensus part) on top of the existing curl vector field (for SGA on the bilinear problem), in order to attract the dynamics toward the critical points.
Mescheder et al. (2017); Nagarajan and Kolter (2017) analyzed the infinitesimal-flow version of consensus optimization and intuitively showed that it pushes the real parts of the eigenvalues away from 0, ensuring asymptotic convergence. In this section we provide a simple convergence analysis of the discretized dynamics, in the same spirit as the previous sections. An upshot of the analysis is that it sheds light on possible choices of learning rate.

Recall that the regularization term defining consensus optimization is given by,
(3.4)  R(θ, ω) = ½ ‖∇_θ f(θ, ω)‖² + ½ ‖∇_ω f(θ, ω)‖².
We are now ready to state the lemma. Surprisingly, we find that consensus optimization coincides with the modified predictive method for the bilinear game.
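For the scalar bilinear game f(θ, ω) = θω, the regularizer (3.4) reduces to R = (θ² + ω²)/2, and the resulting CO update contracts; a minimal sketch (step size and consensus weight are illustrative choices):

```python
# Consensus Optimization on f(theta, omega) = theta*omega. Here the
# regularizer (3.4) is R = (theta^2 + omega^2)/2, so each player's update
# adds a -gamma * grad R correction to plain SGA (eta, gamma illustrative).
eta, gamma = 0.2, 1.0
t, w = 1.0, 0.0
for _ in range(100):
    t, w = t - eta * (w + gamma * t), w + eta * (t - gamma * w)
print(abs(t) + abs(w) < 1e-6)  # True: the consensus term attracts the iterates
```

The consensus correction shifts the eigenvalues of the update operator from 1 ± iη to (1 − ηγ) ± iη, whose modulus is below 1 for suitable η and γ, which is precisely the attraction toward critical points described above.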
4 Experiments
Lucic et al. (2017) recently conducted a large-scale study of GANs and found that improvements from algorithmic changes mostly disappeared after accounting for hyperparameter tuning and the randomness of initialization. They conclude that "future GAN research should be based on more systematic and objective evaluation procedures." Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-the-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment. We focus on the Wasserstein GAN formulation, so that the value function is given by
(4.1)  f(θ, ω) = E_{X∼ν} D_ω(X) − E_{Z∼μ} D_ω(G_θ(Z)),

where the discriminator D_ω is a multilayer neural network with rectifier nonlinearities, and the input distribution μ was chosen to be standard Gaussian noise. Following Gulrajani et al. (2017), we impose the 1-Lipschitz constraint on the discriminator network using the two-sided gradient penalty term introduced in (Gulrajani et al., 2017, Eqn. (3)). The consensus optimization loss is defined with respect to the value function as in (3.4), without the gradient penalty. The combined loss functions of the discriminator and generator are, respectively,
(4.2)  
(4.3) 
The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then held fixed throughout. To make close contact with our theoretical formalism, we optimized the above loss functions on an alternating schedule using vanilla gradient descent with a fixed learning rate. To ensure reproducibility, all algorithms were independently implemented on top of the TFGAN and Keras frameworks.
4.1 Learning Covariance of Multivariate Gaussian
Consider the problem of learning the covariance matrix of a multivariate Gaussian distribution with non-degenerate covariance. Note that the learning problem is well-specified if we choose the generator function to be a simple linear transformation of the latent space. Although the GAN approach is clearly overkill for this simple density estimation problem, we find the example illuminating because it affords some analytical tractability for the otherwise intractable general GAN value function (1.2). Specifically, if we choose the discriminator function to be a neural network with one hidden layer of rectifier units, and set the biases to zero, then the explicit functional forms of the discriminator and generator are, respectively,
(4.4)
in terms of the discriminator and generator parameters. If, moreover, we express the covariance matrix in factored form, then the value function can be expressed in closed form as,
(4.5) 
The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, solving the first-order conditions for a Nash equilibrium does not by itself imply that the generator covariance matches the target covariance; the conclusion depends on the rank of the matrix involved.
The evaluation of the different optimization algorithms involved comparing the target density and the analytical generator density after a fixed number of training iterations (Fig. 1). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices. The covariance learning experiments were conducted in the well-specified and overparametrized regimes, using a fixed number of hidden units for the discriminator network. We also performed a head-to-head comparison of simultaneous and alternating update methods on this distribution and found a negligible difference (Fig. 2).
4.2 Mixture of Gaussians
In practical applications, GANs are typically trained on the empirical distribution of samples drawn from an idealized multimodal probability distribution. To capture the notion of a multimodal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance of small width. In contrast to previous visual evaluations, we estimate the Wasserstein-1 distance between the target density and the distribution implied by the trained generator network. The estimate is obtained by solving a linear program that computes the exact Wasserstein-1 distance between two sample estimates, and it approaches the population quantity as the number of samples grows.

The mixture-of-Gaussians experiments used a 2-dimensional Gaussian as input. Both the generator and discriminator networks consisted of fully connected hidden layers. The estimate of the Wasserstein-1 distance was computed from a fixed sample size after training. It is clear from Fig. 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig. 1) shows that the exponential separation between consensus optimization and competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings call for a deeper understanding of the global structure of the landscape, which is not captured by our local stability analysis.
5 Technical Proofs
Proof of Theorem 2.1.
Define the line interpolation between two points,
Then the SGA dynamics can be written as
Assume that one can prove that, for a proper choice of the learning rate, the largest singular value of the update operator is bounded above by 1,
Then, by convexity of the operator norm, the SGA dynamics is locally contracting, because
Let us analyze the singular values of the update operator, assuming the strong convexity-concavity condition of Assumption 1. Abbreviate the Hessian blocks accordingly. The largest singular value of the update operator is the square root of the largest eigenvalue of the corresponding symmetric matrix. It is clear that when the learning rate is 0, the largest eigenvalue of the above matrix is 1. Observe
If we choose the learning rate to be
then we find
In this case,
Therefore, to obtain an ε-minimizer, one requires a number of steps equal to
∎
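The singular-value bound at the heart of the proof can be sanity-checked numerically on a toy quadratic game (the Hessian blocks below are illustrative values, not from the paper):

```python
import numpy as np

# Toy check of the contraction step: for a strongly convex-concave quadratic
# game with Hessian blocks A (theta), C (omega) and interaction block B
# (all illustrative values), the SGA update operator is I - eta * M with
# M = [[A, B], [-B^T, C]], and its largest singular value falls below 1.
A = np.array([[2.0]])
C = np.array([[2.0]])
B = np.array([[3.0]])
M = np.block([[A, B], [-B.T, C]])
eta = 0.1
op = np.eye(2) - eta * M
print(np.linalg.norm(op, 2) < 1.0)  # True: one SGA step contracts distances
```

Bounding the largest singular value of this operator, rather than the moduli of its complex eigenvalues, is exactly the simple technique emphasized in Section 2.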
Proof of Lemma 3.1.
Recall that the OMD dynamics iteratively updates
Define the following matrices
(5.1)  
(5.2) 
It is easy to verify that these matrices commute. The commutative property follows from a singular value decomposition argument: letting A = UΣVᵀ be the SVD of A (Σ diagonal), one finds the stated identity. Using the above equality, the commutative property follows.
Now we have the following relations for OMD,