1 Introduction
†Equal contribution.

Recent advances in machine learning are largely driven by the success of gradient-based optimization methods for the training process. A common learning paradigm is empirical risk minimization, where a (potentially non-convex) objective that depends on the data is minimized. However, some recently introduced approaches require the joint minimization of several objectives. For example, actor-critic methods can be written as a bilevel optimization problem (Pfau and Vinyals, 2016) and generative adversarial networks (GANs) (Goodfellow et al., 2014) use a two-player game formulation.
Games generalize the standard optimization framework by introducing different objective functions for different optimizing agents, known as players. We are commonly interested in finding a local Nash equilibrium: a set of parameters from which no player can (locally and unilaterally) improve its objective function. Games with differentiable objectives often proceed by simultaneous or alternating gradient steps on the players' objectives. Even though the dynamics of gradient-based methods are well understood for minimization problems, new issues appear in multi-player games. For instance, some stable stationary points of the dynamics may not be (local) Nash equilibria (Adolphs et al., 2018).
Motivated by a decreasing trend of momentum values in the GAN literature (see Fig. 1), we study the effect of two particular algorithmic choices: (i) the choice between simultaneous and alternating updates, and (ii) the choice of step-size and momentum value. The idea behind our approach is that a momentum term combined with the alternating gradient method can be used to manipulate the natural oscillatory behavior of adversarial games. We summarize our main contributions as follows:

We show in §4 that, for general dynamics, when the eigenvalues of the Jacobian have a large imaginary part, negative momentum can improve the local convergence properties of the gradient method.

We confirm the benefits of negative momentum for training GANs with the notoriously ill-behaved saturating loss on both toy settings and real datasets.
Outline.
§2 describes the fundamentals of the analytic setup that we use. §3 provides a formulation for the optimal step-size, and discusses the constraints and intuition behind it. §4 presents our theoretical results and guarantees on negative momentum. §5 studies the properties of alternating and simultaneous methods with negative momentum on a bilinear smooth game. §6 contains experimental results on toy and real datasets. Finally, in §7, we review some of the existing work on smooth game optimization as well as GAN stability and convergence.
2 Background
Notation
In this paper, scalars are lower-case letters (e.g., $\lambda$), vectors are lower-case bold letters (e.g., $\boldsymbol{\omega}$), matrices are upper-case bold letters (e.g., $\mathbf{A}$) and operators are upper-case letters (e.g., $F$). The spectrum of a square matrix $\mathbf{A}$ is denoted by $\operatorname{Sp}(\mathbf{A})$, and its spectral radius is defined as $\rho(\mathbf{A}) := \max\{|\lambda| : \lambda \in \operatorname{Sp}(\mathbf{A})\}$. We respectively note $\sigma_{\min}(\mathbf{A})$ and $\sigma_{\max}(\mathbf{A})$ the smallest and the largest positive singular values of $\mathbf{A}$. The identity matrix of $\mathbb{R}^{n \times n}$ is written $\mathbf{I}_n$. We use $\Re(z)$ and $\Im(z)$ to respectively denote the real and imaginary part of a complex number $z$. $O$, $\Omega$ and $\Theta$ stand for the standard asymptotic notations. Finally, all the omitted proofs can be found in §D.
Game theory formulation of GANs
Generative adversarial networks consist of a discriminator $D_\varphi$ and a generator $G_\theta$. In this game, the discriminator's objective is to tell real from generated examples, while the generator's goal is to produce examples sufficiently close to real examples to confuse the discriminator.
From a game theory point of view, GAN training is a differentiable two-player game: the discriminator $D_\varphi$ aims at minimizing its cost function $\mathcal{L}^{(\varphi)}$ and the generator $G_\theta$ aims at minimizing its own cost function $\mathcal{L}^{(\theta)}$. Using the same formulation as the one in Mescheder et al. (2017) and Gidel et al. (2018), the GAN objective has the following form,
$$\min_{\theta \in \Theta} \; \mathcal{L}^{(\theta)}(\theta, \varphi) \qquad \text{and} \qquad \min_{\varphi \in \Phi} \; \mathcal{L}^{(\varphi)}(\theta, \varphi). \tag{1}$$
Given such a game setup, GAN training consists of finding a local Nash equilibrium, i.e., a state in which neither the discriminator nor the generator can improve its respective cost by a small change in its parameters. In order to analyze the dynamics of gradient-based methods near a Nash equilibrium, we look at the gradient vector field,
$$v(\theta, \varphi) := \begin{pmatrix} \nabla_\theta \mathcal{L}^{(\theta)}(\theta, \varphi) \\ \nabla_\varphi \mathcal{L}^{(\varphi)}(\theta, \varphi) \end{pmatrix}, \tag{2}$$
and its associated Jacobian $\nabla v(\theta, \varphi)$,
$$\nabla v(\theta, \varphi) := \begin{pmatrix} \nabla^2_\theta \mathcal{L}^{(\theta)}(\theta, \varphi) & \nabla_\varphi \nabla_\theta \mathcal{L}^{(\theta)}(\theta, \varphi) \\ \nabla_\theta \nabla_\varphi \mathcal{L}^{(\varphi)}(\theta, \varphi) & \nabla^2_\varphi \mathcal{L}^{(\varphi)}(\theta, \varphi) \end{pmatrix}. \tag{3}$$
Games in which $\mathcal{L}^{(\theta)} = -\mathcal{L}^{(\varphi)}$ are called zero-sum games, and (1) can be reformulated as a min-max problem. This is the case for the original min-max GAN formulation, but not the case for the non-saturating loss (Goodfellow et al., 2014) which is commonly used in practice.
For a zero-sum game, we note $\mathcal{L} := \mathcal{L}^{(\theta)} = -\mathcal{L}^{(\varphi)}$. When the matrices $\nabla^2_\theta \mathcal{L}$ and $\nabla^2_\varphi \mathcal{L}$ are zero, the Jacobian is antisymmetric and has pure imaginary eigenvalues. We call games with pure imaginary eigenvalues purely adversarial games. This is the case in the simple bilinear game $\mathcal{L}(\theta, \varphi) = \theta^\top \varphi$. This game can be formulated as a GAN where the true distribution is a Dirac on 0, the generator is a Dirac on $\theta$ and the discriminator is linear. This setup was extensively studied in 2D by Gidel et al. (2018).
Conversely, when the off-diagonal block $\nabla_\varphi \nabla_\theta \mathcal{L}^{(\theta)}$ is zero and the matrices $\nabla^2_\theta \mathcal{L}^{(\theta)}$ and $\nabla^2_\varphi \mathcal{L}^{(\varphi)}$ are symmetric and positive definite, the Jacobian is symmetric and has real positive eigenvalues. We call games with real positive eigenvalues purely cooperative games. This is the case, for example, when the objective function is separable, such as $\mathcal{L}(\theta, \varphi) = f(\theta) + g(\varphi)$ where $f$ and $g$ are two convex functions. In that case, the optimization can be reformulated as two separate minimizations of $f$ and $g$ with respect to their respective parameters.
These notions of adversarial and cooperative games can be related to the notions of potential games (Monderer and Shapley, 1996) and Hamiltonian games recently introduced by Balduzzi et al. (2018): a game is a potential game (resp. Hamiltonian game) if its Jacobian is symmetric (resp. antisymmetric). Our definition of cooperative game is a bit more general than the definition of potential game since some non-symmetric matrices may have positive eigenvalues. Similarly, the notion of adversarial game generalizes the Hamiltonian games since some non-antisymmetric matrices may have pure imaginary eigenvalues: for instance, any $2\times 2$ matrix with zero trace and positive determinant, such as $\begin{pmatrix}1 & -2\\ 1 & -1\end{pmatrix}$, has pure imaginary eigenvalues without being antisymmetric.
In this work, we are interested in games in between purely adversarial games and purely cooperative ones, i.e., games which have eigenvalues with non-negative real part (cooperative component) and non-zero imaginary part (adversarial component). For $\theta, \varphi \in \mathbb{R}$, a simple class of such games is parametrized by $\alpha, \gamma > 0$,
$$\min_{\theta} \max_{\varphi} \; \frac{\alpha}{2}\big(\theta^2 - \varphi^2\big) + \gamma\,\theta\varphi, \tag{4}$$
whose Jacobian at the origin has eigenvalues $\alpha \pm i\gamma$.
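As a quick numerical check of this classification, the sketch below builds a (hypothetical) $2\times 2$ Jacobian with cooperative component $\alpha$ on the diagonal and adversarial component $\gamma$ off the diagonal, and verifies that its eigenvalues are $\alpha \pm i\gamma$:

```python
import numpy as np

# Illustrative 2x2 Jacobian of a game in between the purely cooperative and
# purely adversarial regimes: alpha is the cooperative (real) component and
# gamma the adversarial (imaginary) component; eigenvalues are alpha +/- i*gamma.
def game_jacobian(alpha, gamma):
    return np.array([[alpha, gamma], [-gamma, alpha]], dtype=float)

adversarial = np.linalg.eigvals(game_jacobian(0.0, 1.0))  # pure imaginary pair
cooperative = np.linalg.eigvals(game_jacobian(1.0, 0.0))  # real positive
mixed = np.linalg.eigvals(game_jacobian(1.0, 2.0))        # 1 +/- 2i
```

Setting $\gamma = 0$ recovers the purely cooperative case (real positive eigenvalues), and $\alpha = 0$ the purely adversarial one (pure imaginary eigenvalues).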
Simultaneous Gradient Method.
Let us consider the dynamics of the simultaneous gradient method. It is defined as the repeated application of the operator
$$F_\eta(\theta, \varphi) := (\theta, \varphi) - \eta\, v(\theta, \varphi), \tag{5}$$
where $\eta$ is the learning rate. Now, for brevity, we write the joint parameters $\omega := (\theta, \varphi)$. For $t \in \mathbb{N}$, let $\omega_t$ be the point of the sequence computed by the gradient method,
$$\omega_{t+1} = F_\eta(\omega_t) = \omega_t - \eta\, v(\omega_t). \tag{6}$$
Then, if the gradient method converges, its limit point $\omega^*$ is a fixed point of $F_\eta$, and if moreover $\nabla v(\omega^*)$ is positive-definite, then $\omega^*$ is a local Nash equilibrium. Interestingly, some of the stable stationary points of the gradient dynamics may not be local Nash equilibria (Adolphs et al., 2018); to the best of our knowledge, there is no first-order method alleviating this issue. In this work, we focus on the local convergence properties near the stationary points of the gradient dynamics. In the following, $\omega^*$ is a stationary point of the gradient dynamics (i.e., a point such that $v(\omega^*) = 0$).
3 Tuning the Step-Size
Under certain conditions on a fixed point operator, linear convergence is guaranteed in a neighborhood around a fixed point.
Theorem 1 (Prop. 4.4.1 Bertsekas (1999)).
If the spectral radius $\rho_{\max} := \rho\big(\nabla F_\eta(\omega^*)\big) < 1$, then, for $\omega_0$ in a neighborhood of $\omega^*$, the distance of $\omega_t$ to the stationary point $\omega^*$ converges at a linear rate of $\mathcal{O}\big((\rho_{\max} + \epsilon)^t\big)$, for any $\epsilon > 0$.
From the definition in (5), we have:
$$\nabla F_\eta(\omega^*) = \mathbf{I}_n - \eta\, \nabla v(\omega^*), \quad \text{so} \quad \operatorname{Sp}\big(\nabla F_\eta(\omega^*)\big) = \big\{\, 1 - \eta\lambda \;:\; \lambda \in \operatorname{Sp}\big(\nabla v(\omega^*)\big) \,\big\}. \tag{7}$$
If the eigenvalues of $\nabla v(\omega^*)$ all have a positive real part, then for small enough $\eta$, the eigenvalues of $\nabla F_\eta(\omega^*)$ are inside a convergence circle of radius $\rho < 1$, as illustrated in Fig. 3. Thm. 1 then guarantees the existence of an optimal step-size $\eta_{\text{best}}$ which yields a non-trivial convergence rate $\rho_{\text{best}} < 1$. Thm. 2 gives analytic bounds on the optimal step-size $\eta_{\text{best}}$, and lower-bounds the best convergence rate we can expect.
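The existence of $\eta_{\text{best}}$ is easy to observe numerically. The sketch below scans the spectral radius $\rho(\eta) = \max_k |1 - \eta\lambda_k|$ over a grid of step-sizes; the three eigenvalues are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical eigenvalues of grad v at the stationary point, all with a
# positive real part; the one with large imaginary part limits the rate.
lambdas = np.array([1 + 3j, 2 + 0.5j, 4 + 0j])

def rho(eta):
    # spectral radius of I - eta * grad v, via the spectrum mapping of Eq. (7)
    return np.max(np.abs(1 - eta * lambdas))

etas = np.linspace(1e-4, 1.0, 10001)
rhos = np.array([rho(e) for e in etas])
eta_best, rho_best = etas[np.argmin(rhos)], rhos.min()
```

For this spectrum the scan finds a non-trivial rate $\rho_{\text{best}} < 1$, attained at a step-size no smaller than $\min_k \Re(1/\lambda_k)$, where $\rho(\eta)$ stops being monotonically decreasing.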
Theorem 2.
If the eigenvalues $\lambda_1, \dots, \lambda_m$ of $\nabla v(\omega^*)$ all have a positive real part, then the best step-size $\eta_{\text{best}}$, which minimizes the spectral radius $\rho$ of $\nabla F_\eta(\omega^*)$, is the solution of a (convex) quadratic-by-parts problem, and satisfies,
$$1 - \min_{1 \leq k \leq m} \Re(\lambda_k)\,\Re(1/\lambda_k) \;\leq\; \rho_{\text{best}}^2 \;\leq\; 1 - \Re(1/\lambda_1)\,\min_{1 \leq k \leq m} \Re(\lambda_k), \tag{8}$$
$$\Re(1/\lambda_1) \;\leq\; \eta_{\text{best}}, \tag{9}$$
$$\eta_{\text{best}} \;\leq\; 2\,\Re(1/\lambda_1), \tag{10}$$
where the eigenvalues are sorted such that $\Re(1/\lambda_1) \leq \cdots \leq \Re(1/\lambda_m)$. Particularly, when a single eigenvalue is limiting, we are in the case of the top plot of Fig. 3 and $\eta_{\text{best}} = \Re(1/\lambda_1)$.
When $\nabla v(\omega^*)$ is positive-definite, the best step-size $\eta_{\text{best}}$ is attained either because of one or because of several limiting eigenvalues. We illustrate and interpret these two cases in Fig. 3. In multivariate convex optimization, the optimal step-size depends on the extreme eigenvalues and their ratio, the condition number. Unfortunately, the notion of condition number does not trivially extend to games, but Thm. 2 seems to indicate that the real parts of the inverses of the eigenvalues play an important role in the dynamics of smooth games. We think that a notion of condition number might be meaningful for such games and we propose an illustrative example to discuss this point in §B. Note that when the eigenvalues are positive real numbers belonging to $[\mu, L]$, (8) provides the standard bound $1 - \mu/L$ obtained with a step-size $\eta = 1/L$ (see §D.2 for details).
Note that, in (9), we have $\eta_{\text{best}} \geq \Re(1/\lambda_1)$ because the eigenvalues are sorted such that $\Re(1/\lambda_1) \leq \cdots \leq \Re(1/\lambda_m)$. In (8), we can see that if the Jacobian of $v$ has an almost purely imaginary eigenvalue $\lambda_k$, then $\Re(\lambda_k)\,\Re(1/\lambda_k)$ is close to 0, and thus the convergence rate of the gradient method may be arbitrarily close to 1. Zhang and Mitliagkas (2017) provide an analysis of the momentum method for quadratics, showing that momentum can actually help to better condition the model. One interesting point from their work is that the best conditioning is achieved when the added momentum makes the Jacobian eigenvalues turn from positive reals into complex conjugate pairs. Our goal is to use momentum to wrangle game dynamics into convergence by manipulating the eigenvalues of the Jacobian.
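The turning point described by Zhang and Mitliagkas (2017) is easy to reproduce: for a real eigenvalue $\lambda$, the two eigenvalues of the momentum-augmented operator become a complex conjugate pair once $(1 - \eta\lambda + \beta)^2 < 4\beta$. The particular values below are illustrative assumptions:

```python
import numpy as np

# For a real eigenvalue lam, positive momentum beta turns the two eigenvalues
# of the augmented operator [[1 - eta*lam + beta, -beta], [1, 0]] from real
# into a complex conjugate pair once (1 - eta*lam + beta)^2 < 4*beta.
lam, eta = 2.0, 0.1

def aug_eigs(beta):
    M = np.array([[1 - eta * lam + beta, -beta], [1.0, 0.0]])
    return np.linalg.eigvals(M)

real_case = aug_eigs(0.01)    # (0.81)^2 = 0.6561 > 0.04  -> two real eigenvalues
complex_case = aug_eigs(0.9)  # (1.7)^2 = 2.89 < 3.6      -> complex conjugate pair
```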
4 Negative Momentum
As shown in (8), the presence of eigenvalues with a large imaginary part can restrict us to small step-sizes and lead to slow convergence rates. In order to improve convergence, we add a negative momentum term to the update rule. Informally, one can think of negative momentum as friction that can damp oscillations. The new momentum term leads to a modification of the parameter update operator of (5). We use a similar state augmentation as Zhang and Mitliagkas (2017) to form a compound state $x_t := (\omega_t, \omega_{t-1})$. The update rule (5) turns into the following,
$$x_{t+1} = F_{\eta, \beta}(x_t), \tag{11}$$
where
$$F_{\eta, \beta}(\omega_t, \omega_{t-1}) := \big(\omega_t - \eta\, v(\omega_t) + \beta(\omega_t - \omega_{t-1}),\; \omega_t\big), \tag{12}$$
in which $\beta$ is the momentum parameter. Therefore, the Jacobian of $F_{\eta, \beta}$ has the following form,
$$\nabla F_{\eta, \beta}(\omega_t, \omega_{t-1}) = \begin{pmatrix} (1+\beta)\mathbf{I}_n - \eta\, \nabla v(\omega_t) & -\beta\, \mathbf{I}_n \\ \mathbf{I}_n & \mathbf{0}_n \end{pmatrix}. \tag{13}$$
Note that for $\beta = 0$, we recover the gradient method.
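As a concrete sketch, the augmented update (11)–(12) takes only a few lines; the quadratic vector field used in the demo below is an illustrative assumption, not the paper's GAN setup:

```python
import numpy as np

# Heavy-ball iteration omega_{t+1} = omega_t - eta*v(omega_t) + beta*(omega_t - omega_{t-1}),
# i.e. the augmented operator F_{eta,beta}; a negative beta acts as friction.
def heavy_ball(v, omega0, eta, beta, steps):
    prev, cur = omega0, omega0
    for _ in range(steps):
        cur, prev = cur - eta * v(cur) + beta * (cur - prev), cur
    return cur

# Illustrative vector field: gradient of the quadratic 0.5 * ||omega||^2.
v = lambda omega: omega
omega = heavy_ball(v, np.array([1.0, -2.0]), eta=0.1, beta=-0.2, steps=200)
```

With $\beta = 0$ the loop reduces to plain gradient steps, as noted above.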
In some situations, if $\beta$ is adjusted properly, negative momentum can improve the convergence rate to a local stationary point by pushing the eigenvalues of the Jacobian towards the origin. In the following theorem, we provide an explicit expression for the eigenvalues of $\nabla F_{\eta, \beta}$.
Theorem 3.
The eigenvalues of $\nabla F_{\eta, \beta}(\omega^*)$ are
$$\mu_\pm(\beta, \eta, \lambda) := (1 - \eta\lambda + \beta)\,\frac{1 \pm \sqrt{1 - \Delta}}{2}, \qquad \lambda \in \operatorname{Sp}\big(\nabla v(\omega^*)\big), \tag{14}$$
where $\Delta := 4\beta/(1 - \eta\lambda + \beta)^2$ and $\sqrt{\cdot}$ is the complex square root with positive real part (if $1 - \Delta$ is a negative real number, we set $\sqrt{1-\Delta} := i\sqrt{\Delta - 1}$). Moreover, we have the following Taylor approximations,
$$\mu_+(\beta, \eta, \lambda) = 1 - \eta\lambda - \frac{\eta\lambda}{1 - \eta\lambda}\,\beta + O(\beta^2), \tag{15}$$
$$\mu_-(\beta, \eta, \lambda) = \frac{\beta}{1 - \eta\lambda} + O(\beta^2). \tag{16}$$
When $\beta$ is small enough, $\sqrt{1 - \Delta}$ is a complex number close to 1. Consequently, $\mu_+$ is close to the original eigenvalue $1 - \eta\lambda$ of the gradient dynamics, and $\mu_-$, the eigenvalue introduced by the state augmentation, is close to 0. We formalize this intuition by providing the first-order approximation of both eigenvalues in (15) and (16).
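Thm. 3 can be checked numerically one eigenvalue at a time: for a scalar $\lambda$, the augmented Jacobian (13) reduces to the $2\times 2$ block $\begin{pmatrix}1-\eta\lambda+\beta & -\beta\\ 1 & 0\end{pmatrix}$, whose eigenvalues must match the closed form $\mu_\pm$ (the particular $\lambda$, $\eta$, $\beta$ below are arbitrary test values):

```python
import numpy as np

# Closed-form mu_+/- from Thm. 3: the two roots of mu^2 - (1 - eta*lam + beta) mu + beta = 0.
def momentum_eigs(lam, eta, beta):
    a = 1 - eta * lam + beta
    disc = np.sqrt(a * a - 4 * beta + 0j)
    return (a + disc) / 2, (a - disc) / 2

lam, eta, beta = 1 + 2j, 0.2, -0.1
# Per-eigenvalue block of the augmented Jacobian (13).
M = np.array([[1 - eta * lam + beta, -beta], [1, 0]], dtype=complex)
numeric = np.sort_complex(np.linalg.eigvals(M))
closed = np.sort_complex(np.array(momentum_eigs(lam, eta, beta)))
```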
In Fig. 4, we illustrate the effects of negative momentum on a game described in (4). Negative momentum shifts the original eigenvalues (trajectories in light red) by pushing them to the left towards the origin (trajectories in light blue).
Since our goal is to minimize the largest magnitude among the eigenvalues of $\nabla F_{\eta, \beta}$, which are computed in Thm. 3, we want to understand the effect of $\beta$ on the eigenvalues of potentially large magnitude. Let $\lambda \in \operatorname{Sp}(\nabla v(\omega^*))$; we define the (squared) magnitude that we want to optimize,
$$\rho_{\eta, \lambda}(\beta) := \max\big\{\, |\mu_+(\beta, \eta, \lambda)|^2,\; |\mu_-(\beta, \eta, \lambda)|^2 \,\big\}. \tag{17}$$
We study the local behavior of $\rho_{\eta, \lambda}$ for small $|\beta|$. The following theorem shows that a well-suited (small) negative $\beta$ decreases $\rho_{\eta, \lambda}$, which corresponds to faster convergence.
Theorem 4.
For any $\lambda \in \operatorname{Sp}(\nabla v(\omega^*))$ and any $\eta$ such that $0 < |1 - \eta\lambda| < 1$, writing $z := 1 - \eta\lambda$,
$$\rho'_{\eta, \lambda}(0) = -2\,\Re\!\left[\frac{\bar z}{z}\,(1 - z)\right].$$
Particularly, we have $\rho'_{\eta, \lambda}(0) > 0$ if and only if $\cos(2\arg z) < |z| \cos(\arg z)$, and in that case a small negative momentum decreases $\rho_{\eta, \lambda}$.
As we have seen previously in Fig. 3 and Thm. 2, only a few limiting eigenvalues slow down the convergence. Thm. 4 is a local result showing that a small negative momentum can improve the magnitude of the limiting eigenvalues in the following cases: when there is only one limiting eigenvalue (since in that case the optimal step-size is $\eta_{\text{best}} = \Re(1/\lambda_1)$), or when there are several limiting eigenvalues and the intersection of the sets of admissible step-sizes provided by Thm. 4 is not empty. We point out that we do not provide any guarantees on whether this intersection is empty or not, but note that when the absolute value of the argument of $\lambda_1$ is large enough, (10) ensures that the optimal step-size is admissible.
Since our result is local, it does not provide any guarantees for large negative values of $\beta$. Nevertheless, we numerically optimized (17) with respect to $\eta$ and $\beta$ and found that, for any non-imaginary fixed eigenvalue $\lambda$, the optimal momentum is negative and the associated optimal step-size is larger than the optimal step-size without momentum. Another interesting aspect of negative momentum is that it admits larger step-sizes (see Fig. 4 and 5).
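The numerical optimization of (17) mentioned above can be sketched with a simple grid search; the eigenvalue $\lambda = 1 + 2i$ and the grid ranges are illustrative assumptions:

```python
import numpy as np

# Squared magnitude rho(beta) = max(|mu_+|^2, |mu_-|^2) from Eq. (17),
# scanned over step-size eta and momentum beta for one fixed eigenvalue.
lam = 1 + 2j

def rho2(eta, beta):
    a = 1 - eta * lam + beta
    disc = np.sqrt(a * a - 4 * beta + 0j)
    mu_p, mu_m = (a + disc) / 2, (a - disc) / 2
    return max(abs(mu_p) ** 2, abs(mu_m) ** 2)

etas = np.linspace(0.01, 2.0, 200)
betas = np.linspace(-1.0, 1.0, 201)   # betas[100] == 0.0 (no momentum)
grid = np.array([[rho2(e, b) for b in betas] for e in etas])
i, j = np.unravel_index(np.argmin(grid), grid.shape)
eta_opt, beta_opt = etas[i], betas[j]
```

For this eigenvalue the minimizer lands at a negative momentum and beats the best momentum-free rate $\sin^2(\arg\lambda) = 0.8$.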
For a game with purely imaginary eigenvalues, Thm. 3 shows that, when $\lambda \in i\mathbb{R}$, the eigenvalue $\mu_+ = 1 - \eta\lambda + O(\beta)$ lies outside the unit circle for any $\eta > 0$, since $|1 - \eta\lambda|^2 = 1 + \eta^2|\lambda|^2 > 1$. Consequently, a small momentum $\beta$ cannot push $\mu_+$ into the unit circle, and the convergence guarantees of Thm. 1 do not apply. In other words, the analysis above provides convergence rates only for games without any pure imaginary eigenvalue. It excludes the purely adversarial bilinear example (the cooperative component set to zero in Eq. 4), which is discussed in the next section.
5 Bilinear Smooth Games
In this section we analyze the dynamics of a purely adversarial game described by,
$$\min_{\theta \in \mathbb{R}^d} \max_{\varphi \in \mathbb{R}^p} \; \theta^\top A \varphi + \theta^\top b + \varphi^\top c, \qquad A \in \mathbb{R}^{d \times p}. \tag{18}$$
The first-order stationary condition for this game characterizes the solutions as
$$A\varphi^* = -b \quad \text{and} \quad A^\top \theta^* = -c. \tag{19}$$
If $-b$ (resp. $-c$) does not belong to the column space of $A$ (resp. $A^\top$), the game (18) admits no equilibrium. In the following, we assume that an equilibrium does exist for this game. Consequently, there exist $\theta^*$ and $\varphi^*$ such that $A\varphi^* = -b$ and $A^\top\theta^* = -c$. Using the translations $\theta \leftarrow \theta - \theta^*$ and $\varphi \leftarrow \varphi - \varphi^*$, we can assume, without loss of generality, that $b = 0$, $c = 0$ and $(\theta^*, \varphi^*) = (0, 0)$. We provide upper and lower bounds on the squared distance to the equilibrium,
$$\Delta_t := \|\theta_t - \theta^*\|^2 + \|\varphi_t - \varphi^*\|^2, \tag{20}$$
where $(\theta^*, \varphi^*)$ is the projection of $(\theta_0, \varphi_0)$ onto the solution space. We show in §C, Lem. 2 that, for our methods of interest, this projection has a simple formulation that only depends on the initialization $(\theta_0, \varphi_0)$.
We aim to understand the difference between the dynamics of simultaneous steps and alternating steps. Practitioners have been widely using the latter instead of the former when optimizing GANs despite the rich optimization literature on simultaneous methods.
5.1 Simultaneous gradient descent
We define this class of methods with momentum using the following update rules,
$$\theta_{t+1} = \theta_t - \eta A\varphi_t + \beta(\theta_t - \theta_{t-1}), \qquad \varphi_{t+1} = \varphi_t + \eta A^\top\theta_t + \beta(\varphi_t - \varphi_{t-1}). \tag{21}$$
In our simple setting, the operator defined by (21) is linear. One way to study the asymptotic properties of the sequence $(\theta_t, \varphi_t)$ is to compute the eigenvalues of this operator. The following proposition characterizes these eigenvalues.
Proposition 1.
The eigenvalues of the simultaneous-method operator (21) are the roots of the fourth-order polynomials
$$\big(x^2 - (1+\beta)x + \beta\big)^2 + \eta^2\sigma^2 x^2 = 0, \tag{22}$$
one for each singular value $\sigma$ of $A$.
Interestingly, these roots only depend on the product $\eta\sigma$, meaning that rescaling $A$ by a factor $s$ and $\eta$ by $1/s$ does not change the eigenvalues of the operator and consequently the asymptotic dynamics of the iterates $(\theta_t, \varphi_t)$. The magnitude of the eigenvalues described in (22) characterizes the asymptotic behavior of the iterates of the simultaneous method (21). We report the maximum magnitude of these roots for a given $\eta\sigma$ and for a grid of step-sizes and momentum values in Fig. 7. We observe that they are always larger than 1, which reflects a diverging behavior. The following theorem provides an analytical rate of divergence.
Theorem 5.
For any $\eta > 0$ and any momentum $\beta$, the iterates of the simultaneous method (21) diverge as
$$\Delta_t \geq \rho^t\, \Delta_0 \quad \text{for some } \rho > 1; \text{ in particular, for } \beta = 0 \text{ one can take } \rho = 1 + \eta^2 \sigma_{\min}(A)^2.$$
This theorem states that the iterates of the simultaneous method (21) diverge geometrically for any choice of step-size and momentum. Interestingly, this geometric divergence implies that even a uniform averaging of the iterates (standard in game optimization to ensure convergence (Freund et al., 1999)) cannot alleviate it.
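The divergence of simultaneous steps is easy to reproduce on a small instance of (18); the $2\times 2$ matrix and initialization below are illustrative assumptions. Expanding one step shows the cross terms cancel, so the squared distance can only grow:

```python
import numpy as np

# Simultaneous gradient steps on the bilinear game theta^T A phi (b = c = 0):
# ||theta'||^2 + ||phi'||^2 = ||theta||^2 + ||phi||^2
#                             + eta^2 * (||A phi||^2 + ||A^T theta||^2),
# so the squared distance to the equilibrium (0, 0) increases at every step.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
eta = 0.1
theta, phi = np.array([1.0, 0.0]), np.array([0.0, 1.0])
dists = [theta @ theta + phi @ phi]
for _ in range(200):
    # the right-hand side uses the OLD values of both players: simultaneous update
    theta, phi = theta - eta * (A @ phi), phi + eta * (A.T @ theta)
    dists.append(theta @ theta + phi @ phi)
```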
5.2 Alternating gradient descent
Alternating gradient methods take advantage of the fact that the iterates $\theta_t$ and $\varphi_t$ are computed sequentially, plugging the value of $\theta_{t+1}$ (instead of $\theta_t$ for the simultaneous update rule) into the update of $\varphi$,
$$\theta_{t+1} = \theta_t - \eta A\varphi_t + \beta(\theta_t - \theta_{t-1}), \qquad \varphi_{t+1} = \varphi_t + \eta A^\top\theta_{t+1} + \beta(\varphi_t - \varphi_{t-1}). \tag{23}$$
This slight change between (21) and (23) significantly shifts the eigenvalues of the Jacobian. We first characterize them with the following proposition.
Proposition 2.
The eigenvalues of the alternating-method operator (23) are the roots of the fourth-order polynomials
$$\big(x^2 - (1+\beta)x + \beta\big)^2 + \eta^2\sigma^2 x^3 = 0, \tag{24}$$
one for each singular value $\sigma$ of $A$.
In the same way as in (22), these roots only depend on the product $\eta\sigma$. The only difference is that the monomial with coefficient $\eta^2\sigma^2$ is of degree 2 in (22) and of degree 3 in (24). This difference is major since, for well-chosen values of negative momentum, the eigenvalues described in Prop. 2 lie in the unit disk (see Fig. 7). As a consequence, the iterates of the alternating method with no momentum are bounded, and they do converge if we add some well-chosen negative momentum:
Theorem 6.
If we set $\beta = -\tfrac{1}{2}$ and $\eta \leq \tfrac{1}{2\,\sigma_{\max}(A)}$, then the iterates of the alternating method (23) converge linearly,
$$\Delta_t \leq \big(1 - c\,\eta^2\sigma_{\min}(A)^2\big)^{t}\, \Delta_0 \quad \text{for some absolute constant } c > 0. \tag{25}$$
If we set $\beta = 0$ and $\eta \leq \tfrac{1}{\sigma_{\max}(A)}$, then there exists $c > 0$ such that for any $t$, $\Delta_t \leq c$.
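The contrast between zero and negative momentum can be checked by looking directly at the roots of the characteristic polynomial of Prop. 2, expanded here with `np.polymul` (the value $\eta\sigma = 0.2$ is an illustrative assumption):

```python
import numpy as np

# Characteristic polynomial of the alternating method per singular value sigma:
#   (x^2 - (1+beta)x + beta)^2 + (eta*sigma)^2 * x^3 = 0        (cf. Prop. 2)
def alternating_roots(eta_sigma, beta):
    quad = np.array([1.0, -(1.0 + beta), beta])
    quartic = np.polymul(quad, quad)   # (x^2 - (1+beta)x + beta)^2
    quartic[1] += eta_sigma ** 2       # add the (eta*sigma)^2 x^3 monomial
    return np.roots(quartic)

rho_neg = np.abs(alternating_roots(0.2, -0.5)).max()  # negative momentum
rho_zero = np.abs(alternating_roots(0.2, 0.0)).max()  # no momentum
```

With $\beta = -1/2$ all four roots lie strictly inside the unit disk (convergence), while with $\beta = 0$ the largest magnitude sits exactly on the unit circle (bounded, non-converging iterates).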
Our results from this section, namely Thm. 5 and Thm. 6, are summarized in Fig. 2, and demonstrate how alternating steps can improve the convergence properties of the gradient method for bilinear smooth games. Moreover, combining them with negative momentum can, surprisingly, lead to a linearly convergent method. The conjecture provided in Fig. 2 (divergence of the alternating method with positive momentum) is backed up by the results provided in Fig. 5 and §A.1.
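Finally, simulating the alternating updates (23) with $\beta = -1/2$ shows the linear convergence directly; the matrix, the initialization and the step-size choice $\eta = 0.2/\sigma_{\max}(A)$ are illustrative assumptions:

```python
import numpy as np

# Alternating steps with negative momentum beta = -1/2 on the bilinear game
# theta^T A phi; eta is an assumed safe fraction of 1/sigma_max(A).
A = np.array([[2.0, 1.0], [1.0, 3.0]])
eta, beta = 0.2 / np.linalg.norm(A, 2), -0.5
theta, phi = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta_prev, phi_prev = theta.copy(), phi.copy()
for _ in range(20000):
    theta_new = theta - eta * (A @ phi) + beta * (theta - theta_prev)
    # alternating: the phi update already uses the freshly computed theta_new
    phi_new = phi + eta * (A.T @ theta_new) + beta * (phi - phi_prev)
    theta_prev, phi_prev, theta, phi = theta, phi, theta_new, phi_new
dist = theta @ theta + phi @ phi
```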
6 Experiments and Discussion
Min-Max Bilinear Game
Fashion-MNIST and CIFAR-10
In our third set of experiments (Fig. 6), we use negative momentum in a GAN setup on CIFAR-10 (Krizhevsky and Hinton, 2009) and Fashion-MNIST (Xiao et al., 2017) with the saturating loss and alternating steps. We use residual networks for both the generator and the discriminator, with no batch normalization. Following the same architecture as Gulrajani et al. (2017), each residual block is made of two convolution layers with ReLU activation functions. Up-sampling and down-sampling layers are respectively used in the generator and the discriminator. We experiment with different values of momentum on the discriminator and a constant value of 0.5 for the momentum of the generator. We observe that using a negative value can generally result in samples of higher quality and better inception scores. Intuitively, using negative momentum only on the discriminator slows down the learning process of the discriminator and allows for a better flow of the gradient to the generator. Note that we provide an additional experiment on a mixture of Gaussians in §A.2.
7 Related Work
Optimization
From an optimization point of view, a lot of work has been done to understand momentum and its variants (Polyak, 1964; Qian, 1999; Nesterov, 2013; Sutskever et al., 2013). Some recent studies have emphasized the importance of momentum tuning in deep learning, such as Sutskever et al. (2013), Kingma and Ba (2015), and Zhang and Mitliagkas (2017); however, none of them consider using negative momentum. Among recent work, using robust control theory, Lessard et al. (2016) study optimization procedures and cover a variety of algorithms including momentum methods. Their analysis is global and they establish worst-case bounds for smooth and strongly-convex functions. Mitliagkas et al. (2016) considered negative momentum in the context of asynchronous single-objective minimization. They show that asynchronous-parallel dynamics 'bleed' into optimization updates, introducing momentum-like behavior into SGD. They argue that algorithmic momentum and asynchrony-induced momentum add up to create an effective 'total momentum' value. They conclude that to attain the optimal (positive) effective momentum in an asynchronous system, one would have to reduce algorithmic momentum to small or sometimes negative values. This differs from our work, where we show that for games the optimal effective momentum may be negative. Ghadimi et al. (2015) analyze momentum and provide global convergence properties for functions with Lipschitz-continuous gradients. However, all the results mentioned above are restricted to minimization problems. The purpose of our work is to understand how momentum influences game dynamics, which are intrinsically different from minimization dynamics.
GANs as games
A lot of recent work has attempted to make GAN training easier with new optimization methods. Daskalakis et al. (2018) extrapolate the next value of the gradient using previous history and Gidel et al. (2018) explore averaging and introduce a variant of the extragradient algorithm.
Balduzzi et al. (2018) develop new methods to understand the dynamics of general games: they decompose second-order dynamics into two components using the Helmholtz decomposition and use the fact that the optimization of Hamiltonian games is well understood. This differs from our work since we do not consider any decomposition of the Jacobian but focus on the manipulation of its eigenvalues. Recently, Liang and Stokes (2018) provided a unifying theory of non-asymptotic local convergence for smooth two-player games. They also provide theory for choosing the right step-size required for convergence.
From another perspective, Odena et al. (2018) show that in a GAN setup, the average conditioning of the Jacobian of the generator becomes ill-conditioned during training. They propose Jacobian clamping to improve the inception score and Fréchet Inception Distance. Mescheder et al. (2017) discuss how the eigenvalues of the Jacobian govern the local convergence properties of GANs. They argue that the presence of eigenvalues with zero real part and large imaginary part results in oscillatory behavior, but they do not provide results on the optimal step-size or on the impact of momentum. Nagarajan and Kolter (2017) also analyze the local stability of GANs as an approximated continuous dynamical system. They show that during the training of a GAN, the eigenvalues of the Jacobian of the corresponding vector field are pushed away from one along the real axis.
8 Conclusion
In this paper, we show analytically and empirically that alternating updates with negative momentum form the only method, within our study parameters (Fig. 2), that converges in bilinear smooth games. We study the effects of using negative values of momentum in a GAN setup both theoretically and experimentally. We show that, for a large class of adversarial games, negative momentum can improve the convergence rate of gradient-based methods by shifting the eigenvalues of the Jacobian appropriately into a smaller convergence disk. We found that, in simple yet intuitive examples, using negative momentum makes convergence to the Nash equilibrium easier. Our experiments support the use of negative momentum for saturating losses on mixtures of Gaussians, as well as on other tasks using CIFAR-10 and Fashion-MNIST. Altogether, fully stabilizing learning in GANs requires a deep understanding of the underlying highly non-linear dynamics. We believe our work is a step towards a better understanding of these dynamics. We encourage deep learning researchers and practitioners to include negative values of momentum in their hyperparameter search.
We believe that our results explain the decreasing trend in momentum values used for training GANs over the past few years, reported in Fig. 1. Some of the most successful papers use zero momentum (Arjovsky et al., 2017; Gulrajani et al., 2017) for architectures that would otherwise call for high momentum values in a non-adversarial setting.
Acknowledgments
This research was partially supported by the Canada CIFAR AI Chair Program, the FRQNT nouveaux chercheurs program (2019-NC-257943), the Canada Excellence Research Chair in "Data Science for Real-time Decision-making", the NSERC Discovery Grant RGPIN-2017-06936, a Google Focused Research Award and an IVADO grant. The authors would like to thank NVIDIA Corporation for providing the NVIDIA DGX-1 used for this research. The authors are also grateful to Frédéric Bastien, Florian Bordes, Adam Beberg, Cam Moore and Nithya Natesan for their support.
Bibliography
 Adolphs et al. (2018) L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. arXiv preprint arXiv:1805.05751, 2018.
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 Balduzzi et al. (2018) D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In ICML, 2018.
 Bertsekas (1999) D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 Daskalakis et al. (2018) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
 Denton et al. (2015) E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
 Freund et al. (1999) Y. Freund, R. E. Schapire, et al. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 1999.
 Ghadimi et al. (2015) E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the heavy-ball method for convex optimization. In ECC, 2015.
 Gidel et al. (2018) G. Gidel, H. Berard, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Lessard et al. (2016) L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.
 Liang and Stokes (2018) T. Liang and J. Stokes. Interaction matters: A note on nonasymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
 Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
 Mirza and Osindero (2014) M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 Mitliagkas et al. (2016) I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré. Asynchrony begets momentum, with an application to deep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
 Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
 Monderer and Shapley (1996) D. Monderer and L. S. Shapley. Potential games. Games and economic behavior, 1996.
 Nagarajan and Kolter (2017) V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In NIPS, 2017.
 Nesterov (2013) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Odena et al. (2018) A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel, and I. Goodfellow. Is generator conditioning causally related to GAN performance? In ICML, 2018.
 Pfau and Vinyals (2016) D. Pfau and O. Vinyals. Connecting generative adversarial networks and actorcritic methods. arXiv preprint arXiv:1610.01945, 2016.
 Polyak (1964) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
 Qian (1999) N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 1999.
 Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
 Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Zhang (2006) F. Zhang. The Schur complement and its applications. Springer Science & Business Media, 2006.
 Zhang and Mitliagkas (2017) J. Zhang and I. Mitliagkas. Yellowfin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.

 Zhu et al. (2017) J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
Bibliography
 Adolphs et al. (2018) L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. arXiv preprint arXiv:1805.05751, 2018.
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 Balduzzi et al. (2018) D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of nplayer differentiable games. In ICML, 2018.
 Bertsekas (1999) D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 Daskalakis et al. (2018) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
 Denton et al. (2015) E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
 Freund et al. (1999) Y. Freund, R. E. Schapire, et al. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 1999.
 Ghadimi et al. (2015) E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the heavyball method for convex optimization. In ECC, 2015.
 Gidel et al. (2018) G. Gidel, H. Berard, P. Vincent, and S. LacosteJulien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein GANs. In NIPS, 2017.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Lessard et al. (2016) L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.
 Liang and Stokes (2018) T. Liang and J. Stokes. Interaction matters: A note on nonasymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
 Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
 Mirza and Osindero (2014) M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 Mitliagkas et al. (2016) I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré. Asynchrony begets momentum, with an application to deep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
 Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
 Monderer and Shapley (1996) D. Monderer and L. S. Shapley. Potential games. Games and economic behavior, 1996.
 Nagarajan and Kolter (2017) V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In NIPS, 2017.
 Nesterov (2013) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Odena et al. (2018) A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel, and I. Goodfellow. Is generator conditioning causally related to gan performance? In ICML, 2018.
 Pfau and Vinyals (2016) D. Pfau and O. Vinyals. Connecting generative adversarial networks and actorcritic methods. arXiv preprint arXiv:1610.01945, 2016.
 Polyak (1964) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
 Qian (1999) N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 1999.
 Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Zhang (2006) F. Zhang. The Schur complement and its applications. Springer Science & Business Media, 2006.
 Zhang and Mitliagkas (2017) J. Zhang and I. Mitliagkas. Yellowfin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.

Zhu et al. (2017) J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
Appendix A Additional Figures
A.1 Maximum magnitude of the eigenvalues of gradient descent with negative momentum on a bilinear objective
In Figure 7, we numerically compute (using the formulas provided in Propositions 1 and 2) the maximum magnitude of the eigenvalues of gradient descent with negative momentum on a bilinear objective, as a function of the step size and the momentum value. Notice that, on the one hand, for the simultaneous gradient method no choice of step size and momentum yields a maximum magnitude smaller than 1, so the algorithm diverges. On the other hand, for the alternating gradient method there exists a sweet spot where the maximum magnitude of the eigenvalues of the operator is smaller than 1, ensuring that this method converges linearly (since the Jacobian of a bilinear min-max problem is constant).
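The quantity plotted in Figure 7 can be reproduced on a scalar instance. Below is a minimal sketch (ours, not the paper's code; the scalar game $f(x, y) = xy$ and the specific values $\eta = 0.5$, $\beta = -0.5$ are illustrative choices) that builds the linear operator of gradient descent-ascent with heavy-ball momentum and compares the spectral radius of simultaneous versus alternating updates:

```python
import numpy as np

def momentum_operator(eta, beta, alternating):
    """Linear operator on the state (x_t, y_t, x_{t-1}, y_{t-1}) for
    gradient descent-ascent with heavy-ball momentum on f(x, y) = x * y."""
    if alternating:
        # y is updated using the freshly computed x_{t+1}
        top = np.array([
            [1 + beta,         -eta,               -beta,        0.0],
            [eta * (1 + beta),  1 + beta - eta**2, -eta * beta, -beta],
        ])
    else:
        # both players step from the same iterate
        top = np.array([
            [1 + beta, -eta,      -beta,  0.0],
            [eta,       1 + beta,  0.0,  -beta],
        ])
    bottom = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0]])  # carries (x_t, y_t) forward
    return np.vstack([top, bottom])

def spectral_radius(M):
    """Maximum magnitude of the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

eta, beta = 0.5, -0.5  # step size, *negative* momentum
rho_sim = spectral_radius(momentum_operator(eta, beta, alternating=False))
rho_alt = spectral_radius(momentum_operator(eta, beta, alternating=True))
print(f"simultaneous: {rho_sim:.3f}  alternating: {rho_alt:.3f}")
```

With these values the simultaneous operator has spectral radius above 1 (divergence) while the alternating one drops below 1 (linear convergence), matching the sweet spot visible in Figure 7.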
A.2 Mixture of Gaussians
In this set of experiments (Fig. 8), we evaluate the effect of using negative momentum for a GAN with the saturating loss and alternating steps. The data in this experiment comes from eight Gaussian distributions whose means are spread uniformly around the unit circle. The goal is to force the generator to produce 2-D samples covering all 8 distributions. Although this looks like a simple task, many GANs fail to generate diverse samples in this setup, so this experiment shows whether the algorithm prevents mode collapse or not.

We use a fully connected network with 4 hidden ReLU layers, each with 256 hidden units. The latent code of the generator is an 8-dimensional multivariate Gaussian. The model is trained for 100,000 iterations with stochastic gradient descent, using zero, positive, and negative momentum values. We observe that negative momentum considerably improves the results compared to positive or zero momentum.

Appendix B Discussion on Momentum and Conditioning
In this section, we analyze the effect of the conditioning of the problem on the optimal value of momentum. Consider the following formulation as an extension of the bilinear min-max game discussed in §5, Eq. 4,
(26) 
where is a square diagonal positive-definite matrix,
(27) 
and its condition number is the ratio of its largest to its smallest diagonal entry. Thus, we can rewrite the vector field and the Jacobian in terms of these quantities,
(28) 
The corresponding eigenvalues of the Jacobian are,
(29) 
For simplicity, in the following we will note for .
Using Thm. (3), the eigenvalues of are,
(30) 
where and is the complex square root of with positive real part.
Hence the spectral radius of can be explicitly formulated as a function of and ,
(31) 
In Figure 9, we numerically compute the optimal momentum value that minimizes the spectral radius as a function of the step-size, for several condition numbers. To balance the game between the adversarial part and the cooperative part, we normalize the matrix so that the sum of its diagonal elements is fixed. It can be seen that there is a competition between the type of the game (adversarial versus cooperative) and the conditioning of the matrix. In a more cooperative regime, increasing the condition number results in more positive values of momentum, which is consistent with the intuition that cooperative games are almost minimization problems, for which the optimal momentum value is known (Polyak, 1964). Interestingly, even if the condition number is large, when the game is adversarial enough the optimal momentum value is negative. This experimental setting suggests the existence of a multidimensional condition number taking into account both the difficulties introduced by the ill-conditioning of the matrix and the adversarial component of the game.
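For reference, in the purely cooperative (minimization) limit, the classical heavy-ball tuning for an $L$-smooth, $\mu$-strongly convex objective with condition number $\kappa = L/\mu$ is (Polyak, 1964):

```latex
\beta^\star = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2,
\qquad
\eta^\star = \left( \frac{2}{\sqrt{L} + \sqrt{\mu}} \right)^2,
```

so the optimal momentum is always positive and tends to 1 as $\kappa \to \infty$; the experiment above shows how a sufficiently adversarial component pushes this optimum below zero.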
Appendix C Lemmas and Definitions
Recall that the spectral radius of a matrix is the maximum magnitude of its eigenvalues.
(32)   $\rho(A) := \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } A \}$
For a symmetric matrix, this is equal to the spectral norm, which is the operator norm induced by the vector 2-norm. However, we are dealing with general matrices, so these two values may be different. The spectral radius is always at most the spectral norm, but it is not a norm itself, as illustrated by the example below:
$\rho\!\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = 0 \quad \text{while} \quad \left\lVert \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \right\rVert_2 = 1,$
where we used the fact that the spectral norm is the largest singular value, i.e., the square root of the largest eigenvalue of $A^\top A$.
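This gap between spectral radius and spectral norm can be checked numerically; a minimal NumPy sketch on a nilpotent matrix:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])  # nilpotent: both eigenvalues are zero

rho = np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius
nrm = np.linalg.norm(A, 2)                  # spectral norm (largest singular value)
print(rho, nrm)
```

Since $\rho(A) = 0$ for this nonzero matrix, the spectral radius violates definiteness and therefore cannot be a norm.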
In this section we will introduce three lemmas that we will use in the proofs of §D.
The first lemma is about the determinant of a block matrix.
Lemma 1.
Let $A, B, C, D$ be four matrices such that $C$ and $D$ commute. Then
(33)   $\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(AD - BC),$
where $\det(M)$ is the determinant of $M$.
Proof.
See (Zhang, 2006, Section 0.3). ∎
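As a quick numerical sanity check of Lemma 1 (our sketch; the $3 \times 3$ sizes and the diagonal choice for $C$ and $D$, which guarantees that they commute, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
# Diagonal matrices always commute, so the lemma's hypothesis holds.
C = np.diag(rng.standard_normal(3))
D = np.diag(rng.standard_normal(3))

M = np.block([[A, B], [C, D]])      # 6x6 block matrix
lhs = np.linalg.det(M)
rhs = np.linalg.det(A @ D - B @ C)  # det(AD - BC)
print(np.isclose(lhs, rhs))
```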
The second lemma is about the iterates of the simultaneous and the alternating methods introduced in §5 for the bilinear game. It shows that we can pick a subspace where the iterates will remain.
Lemma 2.
Proof of Lemma 2.
Let us start with the simultaneous updates (21).
Let $A = U \Sigma V^\top$ be the SVD of $A$, where $U$ and $V$ are orthogonal matrices and
(35)   $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0),$
where $r$ is the rank of $A$ and $\sigma_1, \ldots, \sigma_r$ are the (positive) singular values of $A$. The update rules (21) imply that,
(36) 
Consequently, for any and we have that,
(37) 
Since the solutions of (18) satisfy the following first-order conditions:
(38) 
one can choose the pair in (37) to be a pair of solutions of (18) satisfying these conditions. By induction, using (36), we have that for any initialization there exists a pair of solutions such that, for any iteration,
(39) 
Consequently,
(40) 
The proof for the alternating updates (23) is the same, since we only use the fact that the iterates stay in the span of interest. ∎
Lemma 3.
Let $M \in \mathbb{R}^{n \times n}$ and let $(x_t)$ be a sequence such that $x_{t+1} = M x_t$. Then we have three cases of interest for the spectral radius $\rho(M)$:

If $\rho(M) < 1$ and $M$ is diagonalizable, then $\|x_t\| = O(\rho(M)^t \, \|x_0\|) \to 0$.

If $\rho(M) > 1$, then there exists an initialization $x_0$ such that $\|x_t\| \to \infty$.

If $\rho(M) = 1$ and $M$ is diagonalizable, then $\|x_t\| = O(\|x_0\|)$ remains bounded.
Proof.
Throughout this proof, we write $\|\cdot\|$ for the norm of a vector:

If :
We have for and any ,