In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form
$$\min_{\theta \in \mathbb{R}^p} \max_{\omega \in \mathbb{R}^q} U(\theta, \omega). \qquad (1.1)$$
The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack principled understanding at both the computational and the algorithmic level (we refer the reader to Huszár (2017)'s concise post on this issue). GAN optimization is a special case of (1.1), which has been developed for learning a complex and multi-modal probability distribution from samples, by learning a generator function that transforms an input distribution to match the target distribution. Ignoring the parameter regularization, the value function corresponding to a GAN is of the form,
where parametrizes the generator function and discriminator function , respectively. The original GAN (Goodfellow et al., 2014), for example, corresponds to choosing , where
is the sigmoid function; Wasserstein GAN (Arjovsky et al., 2017) considers ; $f$-GAN (Nowozin et al., 2016) proposes to use , where $f^*$ denotes the Fenchel dual of $f$. Recently, several attempts have been made to understand whether GANs learn the target distribution in the statistical sense (Liu et al., 2017; Arora and Zhang, 2017; Liang, 2017; Arora et al., 2017).
Optimization of GANs (and value functions of the form (1.1) at large) is hard, both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead resort to the more modest problem of searching for a local saddle point such that no player has the incentive to deviate locally
For smooth value functions, the above conditions are equivalent to the following solution concept:
Definition 1 (Local Nash Equilibrium).
is called a local Nash equilibrium if
Here we use to denote the off-diagonal term , and name it the interaction term throughout the paper. denotes the gradient , and for the Hessian .
In practice, discrete-time dynamical systems are employed to numerically approach the saddle points of the value function, as is the case in GANs (Goodfellow et al., 2014), and in primal-dual methods for non-linear optimization (Singh et al., 2000). The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system,
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta U(\theta_t, \omega_t), \qquad \omega_{t+1} = \omega_t + \eta \nabla_\omega U(\theta_t, \omega_t),$$
where $\theta_t$ and $\omega_t$ are the iterates of the minimizing and maximizing players, $U$ is the value function, and $\eta > 0$ is the step size or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, the asymptotic convergence of which has been established in Singh et al. (2000); Cherukuri et al. (2017); Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities due to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, in two directions that we hope will make it more informative in practice:
Non-asymptotic convergence speed. In practice, one is concerned with a finite step size, which is typically subject to extensive hyperparameter tuning. Detailed characterizations of the convergence speed, and theoretical insights on the choice of learning rate, can be helpful.
Unified simple analysis for modified saddle point dynamics. Several attempts to fix GAN optimization have been put forth by independent researchers, which modify the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) using very different insights. A unified analysis that reveals the deeper connections amongst these proposals would help to better understand saddle point dynamics at large.
In this paper, we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely, Optimistic Mirror Descent (OMD), Consensus Optimization (CO), and Predictive Method (PM). More concretely, we provide the following theoretical contributions about the crucial effect of the off-diagonal interaction term in two-player games:
Stable case: curse of the interaction term. Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent for strongly-convex functions. In addition, we quantitatively isolate the slow-down in the convergence rate of two-player SGA compared to single-player gradient descent, due to the presence of the off-diagonal interaction term for the two-player game.
Unstable case: blessing of the interaction term. For unstable Nash equilibria, SGA diverges for any non-zero learning rate. We discover a unified non-asymptotic analysis that encompasses three proposed modified dynamics: OMD, CO, and PM. The analysis shows that all these algorithms, at a high level, share the same idea of utilizing the curvature introduced by the interaction term. Unlike the slow sub-linear rate of convergence experienced by single-player gradient descent for non-strongly convex functions, OMD, CO, and PM effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. (In fact, Nesterov (2013) constructed a convex but non-strongly convex function on which all first-order methods suffer a slow sub-linear rate of convergence; in the optimization literature, a linear rate refers to exponential convergence speed.) The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bi-linear games.
The organization of the paper is as follows. In Section 2 we consider the (admittedly idealized) situation when locally, the value function satisfies strict convexity/concavity. We show non-asymptotic exponential convergence to Nash equilibria for SGA, and identify an optimized learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that the proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4 we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems using value functions of GAN form (1.2), under objective evaluation metrics. Detailed proofs are deferred to Section 5.
2 Stable Case: Non-asymptotic Local Convergence
In this section we will establish the non-asymptotic convergence of SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can be intuitively pictured as cycling inwards to these saddle points, where the distance to the saddle point of interest is exponentially contracting. First, let’s introduce the notion of stable equilibrium.
Definition 2 (Stable Local Nash Equilibrium).
is called a stable local Nash equilibrium if
The above notion of stability is stronger than Definition 1, in the sense that the diagonal Hessian blocks of the two players (with the sign flipped for the maximizing player) have smallest eigenvalues bounded away from $0$.
Assumption 1 (Local Strong Convexity-Concavity).
Consider that is smooth and twice differentiable, and let be a stable local Nash equilibrium as in Definition 2. Assume that for some , there exists an open neighborhood near such that for all , the following strong convexity-concavity condition holds,
It will prove convenient to introduce some notation before introducing the main theorem. Let us define the following block-wise abbreviation for the matrix of second derivatives,
and define as
where denote the largest and smallest eigenvalue of matrix .
Theorem 2.1 (Exponential Convergence: SGA).
It is interesting to compare the convergence speed of the saddle point dynamics to conventional gradient descent in one variable, for a strongly-convex function. We remind the reader that to obtain an $\epsilon$-minimizer for a strongly-convex function, one needs the following number of iterations of gradient descent,
depending on whether we are optimizing with respect to or , respectively. It is now evident that due to the presence of , the convergence of two-player SGA to a saddle-point can be significantly slower than convergence of single-player gradient descent. Note the following inequality,
which follows from the fact that the LHS is lower-bounded by
Therefore the convergence of SGA is slower than that in the conventional GD
We would like to emphasize that for the saddle point convergence, the slow-down effect of the interaction term is explicit in our non-asymptotic analysis.
The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be seen in the following way. The presence of the off-diagonal anti-symmetric component in Eqn. (2.1) means that the associated linear operator of the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, due to the explicit choice of learning rate, the distance to the stable Nash equilibrium shrinks exponentially fast. The local exponential stability in the infinitesimal/asymptotic case of vanishing learning rate has already been studied by Nagarajan and Kolter (2017) (Theorem 3.1 therein), who show that the Jacobian matrix of a particular form of GAN objective is Hurwitz (all of its eigenvalues have strictly negative real parts). Our result differs in two respects: (1) we provide non-asymptotic convergence, with specific guidance on the choice of learning rate; (2) our analysis proceeds by bounding singular values (which, for a general matrix, differ from the moduli of the eigenvalues) rather than working with the complex eigenvalues, and this simple technique generalizes to three other modified saddle point dynamics which we discuss in the next section.
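The singular-value argument can be illustrated numerically on a quadratic game. The following is a minimal sketch under our own assumptions, not the statement of Theorem 2.1: the value function is taken to be $U(\theta,\omega) = \tfrac12\theta^\top A\theta + \theta^\top C\omega - \tfrac12\omega^\top B\omega$ with hypothetical matrices `A`, `B`, `C`, and the step size $\eta = \alpha/\|M\|^2$ comes from the elementary bound $\|I-\eta M\|^2 \le 1 - 2\eta\alpha + \eta^2\|M\|^2$, which need not coincide with the learning rate chosen in the paper.

```python
import numpy as np

# Hypothetical quadratic game U = 0.5 th'A th + th'C om - 0.5 om'B om.
A = np.array([[2.0, 0.0], [0.0, 1.0]])   # Hessian block of the minimizing player
B = np.array([[1.5, 0.0], [0.0, 1.0]])   # negated Hessian block of the maximizing player
C = np.array([[0.0, 3.0], [-3.0, 1.0]])  # off-diagonal interaction term

# SGA on this game is the linear map z -> (I - eta * M) z with z = (theta, omega).
M = np.block([[A, C], [-C.T, B]])

# alpha: smallest eigenvalue of the diagonal blocks (strong convexity-concavity).
alpha = min(np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(B).min())
L2 = np.linalg.norm(M, 2) ** 2           # squared largest singular value of M

eta = alpha / L2                         # assumed step size (minimizes the bound)
K = np.eye(4) - eta * M
sigma_max = np.linalg.norm(K, 2)         # largest singular value of the SGA operator
bound = np.sqrt(1.0 - alpha**2 / L2)     # guaranteed contraction factor per step

# Run the dynamics and record the distance to the equilibrium at the origin.
z = np.ones(4)
dist0 = np.linalg.norm(z)
for _ in range(200):
    z = K @ z
dist_final = np.linalg.norm(z)
print(sigma_max, bound, dist_final / dist0)
```

Increasing the entries of `C` in this sketch enlarges $\|M\|$ and pushes the contraction factor toward 1, which is one way to see the slow-down attributed to the interaction term.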
3 Unstable Case: Local Bi-Linear Problem
Oscillation and instability for SGA occur when the problem is non-strongly convex-concave, as in the bi-linear game (or more precisely, when it is at least linear in one player). This observation was first pointed out using a very simple linear game in Salimans et al. (2016). More generally, as a result of Theorem 2.1, this phenomenon occurs when the local Nash equilibrium is non-stable,
Let's consider the extreme case in which the diagonal Hessian blocks vanish, so that only the interaction term remains. In this case, we will show that (1) Optimistic Mirror Descent (OMD) introduced in Daskalakis et al. (2017), for which we give an improved analysis, (2) a modified version of Predictive Methods (PM) motivated by Yadav et al. (2017), and (3) Consensus Optimization (CO) introduced in Mescheder et al. (2017) can all fix the oscillation problem and provide exponential convergence to unstable Nash equilibria, using a novel unified non-asymptotic analysis. Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics to utilize the curvature generated by the interaction term, which we refer to as the “blessing” of the interaction term, to contrast with the “slow-down effect” of the interaction term in the strongly convex-concave case. Once again, as alluded to in the introduction, this fast linear rate of convergence (in the non-strongly convex-concave two-player game) should be compared to the significantly slower sub-linear convergence rate for all first-order methods in convex but non-strongly convex optimization (one player). The latter was proved by a lower bound argument in (Nesterov, 2013, Theorem 2.1.7). We are ready to state informally the main result proved in this section.
Theorem 3.1 (Informal: Unstable Case).
All three modified dynamics, in the bi-linear game, enjoy a last-iterate exponential convergence guarantee.
We will start with the minimal unstable bilinear game, to explain why oscillation can happen when the game is non-strongly convex-concave, and to reveal and understand the new features of the modified dynamics. This bilinear game can be motivated by considering the lowest-order truncation of the Taylor expansion of a general smooth two-player game around a Nash equilibrium,
assuming that . Now consider the simple bi-linear game . With the SGA dynamics defined in (1), one can easily verify that
Therefore, the continuous limit is cycling around a sphere, while with any practical learning rate, the distance to the Nash equilibrium can increase exponentially instead of converging. Per Theorem 2.1 and the discussion above, instability for SGA only occurs when the local game is approximately bi-linear. From now on, we will focus on the simplest unstable form of the game, the bi-linear game, to isolate the main idea behind fixing the instability problem.
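The growth of the distance under SGA can be checked directly. The sketch below assumes the simplest bi-linear game $U(\theta,\omega)=\theta^\top\omega$ (our choice for illustration), for which a one-line computation gives $\|\theta_{t+1}\|^2+\|\omega_{t+1}\|^2=(1+\eta^2)(\|\theta_t\|^2+\|\omega_t\|^2)$; the function name and the initial point are hypothetical.

```python
import numpy as np

def sga_bilinear(theta, omega, eta, steps):
    """Run SGA on U(theta, omega) = theta . omega and return the distances
    to the equilibrium (0, 0) after each step."""
    dists = [np.sqrt(theta @ theta + omega @ omega)]
    for _ in range(steps):
        # grad_theta U = omega (descent), grad_omega U = theta (ascent);
        # both updates use the current iterate (simultaneous update).
        theta, omega = theta - eta * omega, omega + eta * theta
        dists.append(np.sqrt(theta @ theta + omega @ omega))
    return np.array(dists)

eta = 0.1
dists = sga_bilinear(np.array([1.0, 0.0]), np.array([0.0, 1.0]), eta, steps=100)
ratios = dists[1:] / dists[:-1]   # per-step growth factor of the distance
print(ratios[0], np.sqrt(1 + eta**2))
```

Every per-step ratio equals $\sqrt{1+\eta^2} > 1$ regardless of the initial point, so in this sketch shrinking the learning rate only slows the divergence and never removes it.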
3.1 (Improved) Optimistic Mirror Descent
Daskalakis et al. (2017) employed Optimistic Mirror Descent (OMD), motivated by online learning, to solve the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for bi-linear games. We note that although the last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), the exponential convergence is not known, to the best of our knowledge. Roughly speaking, for an initialization at a given distance from the equilibrium, (Daskalakis et al., 2017, Theorem 1) asserts that after sufficiently many iterations the last iterate of OMD is within a small distance of the equilibrium for a small enough learning rate (under some additional assumptions), and the convergence only happens in the limit of vanishing step size. In contrast, we will prove that with a properly chosen fixed step size, the distance of the last iterate of OMD to the equilibrium decays exponentially in the number of iterations. This improved theory also explains the exponential convergence found in simulations.
Lemma 3.1 (Exponential Convergence: OMD).
Consider a bi-linear game Assume and is full rank. Then the OMD dynamics,
with the learning rate
obtains an -minimizer such that , as long as
under the assumption that .
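The last-iterate behaviour of OMD can be observed in a small simulation. This is a sketch under stated assumptions: the game is $U(\theta,\omega)=\theta^\top C\omega$ with a hypothetical full-rank matrix `C`, the update is the common OMD form that uses twice the current gradient minus the previous gradient, and the step size $0.25/\sigma_{\max}(C)$ is an illustrative choice rather than the learning rate of Lemma 3.1.

```python
import numpy as np

def omd_bilinear(C, theta0, omega0, eta, steps):
    """OMD on U(theta, omega) = theta' C omega; returns distances to (0, 0)."""
    theta_prev, omega_prev = theta0.copy(), omega0.copy()  # convention: z_{-1} = z_0
    theta, omega = theta0.copy(), omega0.copy()
    dists = [np.sqrt(theta @ theta + omega @ omega)]
    for _ in range(steps):
        # Gradient step with 2x the current gradient minus the previous gradient.
        theta_next = theta - 2 * eta * (C @ omega) + eta * (C @ omega_prev)
        omega_next = omega + 2 * eta * (C.T @ theta) - eta * (C.T @ theta_prev)
        theta_prev, omega_prev = theta, omega
        theta, omega = theta_next, omega_next
        dists.append(np.sqrt(theta @ theta + omega @ omega))
    return np.array(dists)

C = np.array([[2.0, 1.0], [0.0, 1.0]])   # hypothetical full-rank interaction matrix
eta = 0.25 / np.linalg.norm(C, 2)        # assumed step size, scaled by sigma_max(C)
omd_dists = omd_bilinear(C, np.array([1.0, -1.0]), np.array([0.5, 1.0]), eta, 2000)
print(omd_dists[0], omd_dists[-1])
```

Plotting `np.log(omd_dists)` against the iteration count gives an approximately straight line in this setting, which is the signature of a linear (exponential) rate.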
3.2 (Modified) Predictive Methods
Motivated by a very different perspective from ordinary differential equations (ODEs), Yadav et al. (2017) proposed Predictive Methods (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predicted future location, then perform the update. In this section, we propose and analyze a modified version of the predictive method (for simultaneous gradient updates), inspired by Yadav et al. (2017).
Consider the following modified PM dynamic
Lemma 3.2 (Exponential Convergence: PM).
Consider a bi-linear game Assume and is full rank. Fix some . Then the PM dynamics in Eqn. (3.2) with learning rate
obtains an -minimizer such that , as long as
under the assumption that .
3.3 Consensus Optimization
Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The authors' motivation is to add an extra potential vector field (consensus part) on top of the current curl vector field (for the SGA of the bi-linear problem), in order to attract the dynamics to the critical points. Mescheder et al. (2017); Nagarajan and Kolter (2017) analyzed the infinitesimal flow version of consensus optimization, and intuitively showed that it pushes the real part of the eigenvalue away from $0$, to ensure asymptotic convergence. In this section, we provide a simple convergence analysis of the discretized dynamics, of the same flavor as the previous section. An upshot of the analysis is that it sheds light on possible choices of learning rate.
Recall that the regularization term defining consensus optimization is given by,
We are ready to state the Lemma. Surprisingly, we find that the consensus optimization coincides with the modified predictive method for the bi-linear game.
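The coincidence of the two dynamics on a bi-linear game can be checked numerically. The sketch below rests on our own assumptions about the forms involved: the game is $U(\theta,\omega)=\theta^\top C\omega$, the predictive step uses a step size `gamma` followed by a gradient step of size `eta` at the predicted point, and the consensus regularizer is half the squared norm of the gradient of $U$, weighted by `gamma`. All names and numerical values are hypothetical.

```python
import numpy as np

def pm_step(C, theta, omega, eta, gamma):
    """Modified predictive method: predict with step gamma, then update with step eta."""
    theta_half = theta - gamma * (C @ omega)      # predicted theta
    omega_half = omega + gamma * (C.T @ theta)    # predicted omega
    # Gradients of U evaluated at the predicted location.
    return theta - eta * (C @ omega_half), omega + eta * (C.T @ theta_half)

def co_step(C, theta, omega, eta, gamma):
    """Consensus optimization with R = 0.5 * (|C omega|^2 + |C' theta|^2)."""
    grad_R_theta = C @ (C.T @ theta)
    grad_R_omega = C.T @ (C @ omega)
    # Both players additionally descend on gamma * R.
    theta_new = theta - eta * (C @ omega + gamma * grad_R_theta)
    omega_new = omega + eta * (C.T @ theta - gamma * grad_R_omega)
    return theta_new, omega_new

C = np.array([[2.0, 1.0], [0.0, 1.0]])
eta = gamma = 0.5 / np.linalg.norm(C, 2)          # assumed step sizes
theta, omega = np.array([1.0, -1.0]), np.array([0.5, 1.0])

pm_out = pm_step(C, theta, omega, eta, gamma)
co_out = co_step(C, theta, omega, eta, gamma)
same_update = all(np.allclose(a, b) for a, b in zip(pm_out, co_out))

start = np.sqrt(theta @ theta + omega @ omega)
for _ in range(500):
    theta, omega = pm_step(C, theta, omega, eta, gamma)
end = np.sqrt(theta @ theta + omega @ omega)
print(same_update, end / start)
```

Expanding either update gives $\theta_{t+1}=\theta_t-\eta C\omega_t-\eta\gamma CC^\top\theta_t$ and $\omega_{t+1}=\omega_t+\eta C^\top\theta_t-\eta\gamma C^\top C\omega_t$, so the extra damping term arises entirely from the interaction matrix.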
Lucic et al. (2017) recently conducted a large-scale study of GANs and found that improvements from algorithmic changes mostly disappeared after taking into account hyper-parameter tuning and randomness of initialization. They conclude that “future GAN research should be based on more systematic and objective evaluation procedures.” Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-the-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment. We focus on the Wasserstein GAN formulation so that the value function is given by
is a multi-layer neural network with hidden layers and rectifier non-linearities and the input distribution was chosen to be standard Gaussian noise. Following Gulrajani et al. (2017), we impose the 1-Lipschitz constraint on the discriminator network using the two-sided gradient penalty term introduced in (Gulrajani et al., 2017, Eqn. (3)). The consensus optimization loss is defined with respect to the value function as in (3.4) without including the gradient penalty. The combined loss functions of the discriminator and generator are respectively,
The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then held fixed throughout. In order to make close contact with our theoretical formalism, we optimize the above loss functions on an alternating schedule using vanilla gradient descent with a fixed learning rate. To ensure reproducibility, all algorithms were independently implemented on top of the TFGAN and Keras frameworks.
4.1 Learning Covariance of Multivariate Gaussian
Consider the problem of learning the covariance matrix of a multivariate Gaussian distribution with non-degenerate covariance. Note that the learning problem is well-specified if we choose the generator function to be a simple linear transformation of the latent space. Although the GAN approach is clearly overkill for this simple density estimation problem, we find this example illuminating because it affords some analytical tractability for the otherwise intractable general GAN value function (1.2). Specifically, if we choose the discriminator function to be a neural network with a hidden layer of rectifier units, and set biases to zero, then the explicit functional forms of discriminator and generator are respectively,
where and are the discriminator and generator parameters, respectively. If, moreover, we express the covariance matrix as , then the value function can be expressed in closed form as,
The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, if one solves for the condition of being a Nash equilibrium, one does not conclude that . The result depends on the rank of the matrix .
The evaluation of different optimization algorithms involved comparing the target density and the analytical generator density after training iterations (Fig. 1). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices . The covariance learning experiments were conducted in the well-specified and over-parametrized regime (, ) using hidden units for the discriminator network. We also performed a head-to-head comparison of simultaneous and alternating update methods on this distribution and found negligible difference (Fig. 2).
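The evaluation metric for this experiment is simple to compute. The sketch below assumes a linear generator $z \mapsto Vz$ acting on standard Gaussian noise, so that the generated covariance is $VV^\top$; the symbol `V`, the function name, and the example matrices are our own assumptions and are not taken from the experiments.

```python
import numpy as np

def covariance_error(Sigma, V):
    """Frobenius norm between the target covariance and the covariance V V'
    of a linear generator z -> V z driven by standard Gaussian noise."""
    return np.linalg.norm(Sigma - V @ V.T, ord="fro")

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical target covariance

# A generator that reproduces the target exactly (Cholesky factor of Sigma).
err_exact = covariance_error(Sigma, np.linalg.cholesky(Sigma))

# An untrained generator initialized at the identity.
err_identity = covariance_error(Sigma, np.eye(2))

# Over-parametrized case: latent dimension (3) larger than data dimension (2).
V_wide = np.hstack([np.linalg.cholesky(Sigma), np.zeros((2, 1))])
err_wide = covariance_error(Sigma, V_wide)
print(err_exact, err_identity, err_wide)
```

Because the metric depends on `V` only through `V @ V.T`, many different generators attain zero error, which is worth keeping in mind when reading the metric in the over-parametrized regime.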
4.2 Mixture of Gaussians
In practical applications, GANs are typically trained using the empirical distribution of the samples, where the samples are drawn from an idealized multi-modal probability distribution. To capture the notion of a multi-modal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance. In contrast to previous visual-based evaluations, we estimate the Wasserstein-1 distance between the target density and the distribution of the random variable implied by the trained generator network. The estimate is obtained by solving a linear program which computes the exact Wasserstein-1 distance between the empirical (sample) versions of the two distributions, and approaches the population version as the number of samples grows.
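The sample-based Wasserstein-1 estimate can be sketched as a small transport linear program. This is an illustrative implementation under our own assumptions, not the evaluation code used for the experiments: it relies on `scipy.optimize.linprog` with the HiGHS solver, uses uniform weights on both samples, and the component width `0.05`, sample size, and function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def sample_octagon_mixture(n, width, rng):
    """Draw n points from 8 Gaussians centered on a regular octagon
    inscribed in the unit circle, each with covariance width^2 * I."""
    angles = 2 * np.pi * np.arange(8) / 8
    means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    idx = rng.integers(0, 8, size=n)
    return means[idx] + width * rng.standard_normal((n, 2))

def wasserstein1(x, y):
    """Exact Wasserstein-1 distance between two uniform empirical measures,
    solved as a linear program over the transport plan pi (flattened row-major)."""
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    row_sums = np.kron(np.eye(n), np.ones((1, m)))   # sum_j pi_ij = 1/n
    col_sums = np.kron(np.ones((1, n)), np.eye(m))   # sum_i pi_ij = 1/m
    A_eq = np.vstack([row_sums, col_sums])
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(0)
x = sample_octagon_mixture(30, width=0.05, rng=rng)
shift = np.array([0.3, -0.4])            # translating a sample by s costs exactly |s|
w_shift = wasserstein1(x, x + shift)
w_self = wasserstein1(x, x)
print(w_shift, w_self)
```

The dense constraint matrix has `n * m` columns, so this formulation is only practical for modest sample sizes; larger samples would call for a sparse or specialized transport solver.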
The experiments with the mixture of Gaussians used 2 dimensional Gaussian as input (). Both the generator and discriminator networks consisted of hidden layers with units per hidden layer. The estimate of the Wasserstein-1 distance was calculated using a sample size of after training for iterations. It is clear from Fig. 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig. 1) shows that the exponential separation between consensus optimization and competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings demand deeper understanding of the global structure of the landscape, which is not captured by our local stability analysis.
5 Technical Proofs
Proof of Theorem 2.1.
Define the line interpolation between two points,
Then the SGA dynamics can be written as
Assume that one can prove for some , and , with a proper choice of , the largest singular value is bounded above by 1,
Then due to convexity of the operator norm, the dynamics of SGA is contracting locally because,
Let’s analyze the singular values of
assuming , . Abbreviate
The largest singular value of
is the square root of the largest eigenvalue of the following symmetric matrix
It is clear that when , the largest eigenvalue of the above matrix is . Observe
If we choose to be
then we find
In this case,
Therefore, to obtain an -minimizer one requires a number of steps equal to
Proof of Lemma 3.1.
Recall that the OMD dynamics iteratively updates
Define the following matrices
It is easy to verify that
The commutative property
follows from a singular value decomposition argument: Lettingbe the SVD of ( diagonal) one finds,
Using the above equality, the commutative property follows
Now we have the following relations for OMD,