1 Introduction
Many of the core applications encountered in the field of machine learning are framed as minimizing a differentiable objective. Often, the method of choice to optimize such a function is gradient descent (or a related method), which consists in simply stepping in the negative gradient direction. This algorithm has become the de facto standard due to its simplicity and the existence of convergence guarantees, even for functions that are not necessarily convex.
On the other hand, the rise of generative models, particularly the GAN framework from Goodfellow et al. (2014), has triggered significant interest in the machine learning community for optimizing minimax objectives of the form
(1)  $\min_{u} \max_{v} \; M(u, v).$
The latter objective is a considerably more challenging problem to solve in a general setting as it requires optimizing multiple objectives jointly. The typical notion of optimality used for such games is the concept of Nash equilibrium (NE), where no player can improve their objective by unilaterally changing their strategy. As discussed in (Jin et al., 2019), finding a global NE is NP-hard in a general setting where the minimax objective is not convex-concave. Instead, one has to settle for a local NE or a different type of local optimality.
In practice, minimax problems are still solved using gradient-based algorithms, especially gradient descent-ascent (GDA), which simply alternates between a gradient descent step for the min player and a gradient ascent step for the max player. There are (at least) three problems with GDA in GANs and games in general: (i) the potential existence of cycles implies there are no convergence guarantees (Mescheder et al., 2018), (ii) even when gradient descent converges, the rate may be too slow in practice because the recurrent dynamics require extremely small learning rates (Mescheder et al., 2018; Gidel et al., 2018b) and (iii) since there is no single objective, there is no way to measure progress (Balduzzi et al., 2018). Recently, Grnarova et al. (2019) addressed (iii) by proposing the duality gap (DG) as a metric for evaluating GANs that naturally arises from the game-theoretic aspect. The authors show that the metric correlates highly with the performance and quality of GANs. Concretely, the evolution of the DG is shown to track convergence of the algorithm to an optimum. In this work, we go further and propose to use the DG as a training objective in order to address (i) and (ii). In fact, we argue that the duality gap is the objective that should be minimized for training a GAN.
Intuitively, there are several reasons for this. First, instead of looking for a Nash equilibrium of the minimax problem, one can interpret the optimization of the DG as a minimization problem whose solution obeys the typical criticality condition used in optimization.
This is illustrated in Fig. 1, which shows the landscape of a minimax objective for a nonconvex-nonconcave game and the landscape of the corresponding duality gap function. Concretely, a pure NE (if it exists) turns into a global minimum and the new goal becomes to converge to a local or global minimum as in standard optimization, where both players are jointly minimizing the same objective. This has multiple advantages: (i) one can simply rely on minimization methods for nonconvex functions, for which stronger convergence guarantees exist (Jain and Kar, 2017; Dauphin et al., 2014; Nesterov and Polyak, 2006), and (ii) it suppresses the competition aspect between players, which is responsible for cyclic behaviors and the lack of stability near stationary points. We support the latter claim in the experimental section, where we demonstrate that GDA might converge to undesired solutions, which are turned into maxima or saddle points when optimizing the duality gap, and thus become easy to avoid. Lastly, moving to the standard setting of optimization gives us access to train and validation curves, which enables practitioners to track progress and avoid overfitting.
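The cyclic behavior attributed to GDA above can be reproduced on a very small example. The following is a minimal sketch, assuming the bilinear toy game M(x, y) = x * y (a standard illustration chosen here, not an example taken from this paper) and simultaneous rather than alternating updates. The names `gda_step` and `run_gda` and the step size are hypothetical. Each update multiplies the squared distance to the equilibrium (0, 0) by (1 + lr^2), so the iterates spiral outward instead of converging.

```python
import math


def gda_step(x, y, lr):
    """One simultaneous GDA step on M(x, y) = x * y (x minimizes, y maximizes)."""
    grad_x = y  # dM/dx
    grad_y = x  # dM/dy
    return x - lr * grad_x, y + lr * grad_y


def run_gda(x0, y0, lr=0.1, steps=100):
    """Return the distance to the equilibrium (0, 0) before and after every step."""
    x, y = x0, y0
    dists = [math.hypot(x, y)]
    for _ in range(steps):
        x, y = gda_step(x, y, lr)
        dists.append(math.hypot(x, y))
    return dists


dists = run_gda(1.0, 1.0)
print(f"distance at start: {dists[0]:.3f}, after 100 steps: {dists[-1]:.3f}")
```

Shrinking the learning rate slows the outward drift but does not remove it, which is one way to read problem (ii) above.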
In order to verify the validity of our approach, we derive a convergence rate under a similar set of assumptions as the ones typically found in the GAN literature for min-max objectives (Goodfellow et al., 2014; Nowozin et al., 2016)^1. Concretely, we prove that the limit point of our algorithm is guaranteed to be a point of zero divergence and we also derive a rate of convergence. We then check the validity of our theoretical results on a wide range of (assumption-free) practical problems. Finally, we empirically demonstrate that changing the nature of the problem from an adversarial game-theoretic setting to an optimization setting yields a training algorithm that is more robust to the choice of hyperparameters. For instance, it avoids having to choose different learning rates for each player, which is known to be an effective technique for GANs (Heusel et al., 2017).
Footnote 1: Note that these prior works assume convexity in the function space. Since one cannot typically optimize in the function space, we naturally place our assumptions on the parameter space.
In summary, we make the following contributions:


We propose a new objective function for training GANs without relying on a complex minimax structure and we derive a rate of convergence.

We prove that adaptive optimization methods are suitable for training our objective and that they exploit certain properties of the objective that allow for faster rates of convergence.

We propose a simple practical algorithm that turns the game theoretic formulation of GAN (as well as WGAN, BEGAN etc.) into an optimization problem.

We validate our approach empirically on several toy and real datasets and show it exhibits desirable convergence and stability properties, while attaining improved sample fidelity.
2 Duality Gap as a Training Objective
We now introduce some key game-theoretic concepts that will be necessary to present our algorithm.
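A zero-sum game consists of two players $P_1$ and $P_2$ who choose a decision from their respective decision sets $\mathcal{K}_1$ and $\mathcal{K}_2$. A game objective $M : \mathcal{K}_1 \times \mathcal{K}_2 \to \mathbb{R}$ defines the utilities of the players. Concretely, upon choosing a pure strategy $(u, v) \in \mathcal{K}_1 \times \mathcal{K}_2$ the utility of $P_1$ is $-M(u, v)$, while the utility of $P_2$ is $M(u, v)$. The goal of either $P_1$/$P_2$ is to maximize their worst case utilities; thus,
(2)  $\min_{u \in \mathcal{K}_1} \max_{v \in \mathcal{K}_2} M(u, v)$ (goal of $P_1$), $\quad \max_{v \in \mathcal{K}_2} \min_{u \in \mathcal{K}_1} M(u, v)$ (goal of $P_2$).
The above formulation raises the question of whether there exists a solution $(u^*, v^*)$ to which both players may jointly converge. The latter only occurs if there exists $(u^*, v^*)$ such that neither $P_1$ nor $P_2$ may increase their utility by unilateral deviation. Such a solution is a pure equilibrium, and is formally defined as follows,
$\max_{v \in \mathcal{K}_2} M(u^*, v) = \min_{u \in \mathcal{K}_1} M(u, v^*) = M(u^*, v^*).$
This notion of equilibrium gives rise to the natural performance measure of a given pure strategy:
Definition 1 (Duality Gap).
For a given pure strategy $(u, v) \in \mathcal{K}_1 \times \mathcal{K}_2$ we define,
(3)  $\mathrm{DG}(u, v) := \max_{v' \in \mathcal{K}_2} M(u, v') - \min_{u' \in \mathcal{K}_1} M(u', v).$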
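To make Definition 1 concrete, the sketch below evaluates the duality gap of a small convex-concave game by brute force. The game M(u, v) = 0.5 u^2 + u v - 0.5 v^2, the box [-3, 3] used as the decision set of both players, and the names `M`, `GRID` and `duality_gap` are all assumptions made for illustration. For this game the gap works out to u^2 + v^2 whenever the best responses lie inside the box, so it is nonnegative and vanishes exactly at the pure equilibrium (0, 0).

```python
def M(u, v):
    """Toy convex-concave objective: u is the min player, v the max player."""
    return 0.5 * u * u + u * v - 0.5 * v * v


# Discretized decision set [-3, 3] shared by both players (an assumption).
GRID = [i / 100 for i in range(-300, 301)]


def duality_gap(u, v, grid=GRID):
    """max over v' of M(u, v') minus min over u' of M(u', v), by grid search."""
    worst_case_for_u = max(M(u, v_alt) for v_alt in grid)
    best_case_against_v = min(M(u_alt, v) for u_alt in grid)
    return worst_case_for_u - best_case_against_v


print(duality_gap(1.0, 0.5))  # gap at a non-equilibrium point
print(duality_gap(0.0, 0.0))  # gap at the pure equilibrium
```

Grid search is only viable for such low-dimensional toy problems; it stands in for the inner maximization and minimization, which in practice have to be approximated.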
2.1 Properties of the Duality Gap
In the context of GANs, Grnarova et al. (2019) showed that as long as the generator's distribution is not equal to the true distribution, the duality gap is always positive. In particular, the duality gap is at least as large as the Jensen-Shannon divergence between the true and fake distributions (which is always nonnegative)^2. Furthermore, if the generator outputs the true distribution, then there exists a discriminator such that the duality gap is zero.
Footnote 2: Similarly, the DG is larger than other divergences for other minimax GAN formulations, such as WGAN (Grnarova et al., 2019).
Building on the property of the duality gap as an upper bound on the divergence between the data and the model distributions, it appears intuitive that one could train a generative model by optimizing the duality gap to act as a surrogate objective function. Doing so has the advantage of reducing the training objective to a standard optimization problem, therefore bypassing the game theoretic formulation of the original GAN problem. This however raises several questions about the practicality of such an approach, its rate of convergence and its empirical performance compared to existing approaches. We set out to answer these questions next. Formally, the problem we consider is:
(4)  $\min_{u, v} \; \mathrm{DG}(u, v).$
There are several types of convergence guarantees that are desirable and commonly found in the GAN literature, including (i) stability around stationary points as in (Mescheder et al., 2018), and (ii) global convergence guarantees as in (Goodfellow et al., 2014; Nowozin et al., 2016). We start by deriving the second type of guarantee in Section 2.2. We then present a practical algorithm whose stability is discussed in Section 2.3.
2.2 Theoretical guarantees
In this section we analyze the performance of first-order methods for optimizing the DG objective (Eq. 4). We prove that standard adaptive methods (i.e., AdaGrad) yield faster convergence compared to standard primal-dual methods such as gradient descent-ascent and extragradient, which are commonly used for GANs.
We develop our analysis under an assumption known as realizability, which can be enforced during training using techniques introduced in prior work, e.g. (Dumoulin et al., 2016). We prove that, in a stochastic setting, AdaGrad (Duchi et al., 2011) converges to the optimum of Eq. (4) at a rate of $O(1/T)$, where $T$ is the number of gradient updates (proportional to the number of stochastic samples). This directly translates to an approximate pure equilibrium for the original minimax problem, which substantially improves over the rate of $O(1/\sqrt{T})$ for the stochastic minimax setting (Juditsky et al., 2011). Even under this realizability assumption, we are not aware of a similar result for minimax primal-dual algorithms such as stochastic gradient descent-ascent and extragradient.
Minimax problems & Realizability.
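Formally, we consider stochastic minimax problems:
(5)  $\min_{u} \max_{v} \; \mathbb{E}_{z \sim P}\left[ M(u, v; z) \right].$
We make the following assumption,
Assumption 1 (Realizability).
There exists a pure equilibrium $(u^*, v^*)$ of $\mathbb{E}_{z \sim P}[M(\cdot, \cdot; z)]$ such that $(u^*, v^*)$ is also a pure equilibrium of $M(\cdot, \cdot; z)$ for any $z \in \mathrm{supp}(P)$, where $\mathrm{supp}(P)$ is the support of $P$.
Let $\mathrm{DG}_z(u, v)$ be the duality gap defined as in Eq. (4) for $M(\cdot, \cdot; z)$. Then realizability implies that the original minimax problem is equivalent to solving the following stochastic minimization problem,
(6)  $\min_{u, v} \; \mathbb{E}_{z \sim P}\left[ \mathrm{DG}_z(u, v) \right].$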
Remark: In Assumption 1 we assume perfect realizability which might be too restrictive. In the appendix we extend our discussion to the case where realizability only holds approximately; in this case we show an approximate equivalence between the formulations of Equations (5) and (6).
Realizability in the context of GANs. Training GANs is equivalent to solving a stochastic minimax problem with the following objective,
(7)  $M(u, v; z) = \log D_v(z_{\mathrm{real}}) + \log\left(1 - D_v(G_u(z_{\mathrm{noise}}))\right),$
where $u$ and $v$ are the respective weights of the generator $G_u$ and discriminator $D_v$, and the source of randomization is a vector $z = (z_{\mathrm{real}}, z_{\mathrm{noise}})$, where $z_{\mathrm{real}}$ corresponds to random samples associated to the true data, and $z_{\mathrm{noise}}$ is the noise that is injected in the generator. Commonly, $z_{\mathrm{real}}$ is distributed uniformly over the true samples, and $z_{\mathrm{noise}}$ is a standard random normal vector. The joint distribution $P$ over $(z_{\mathrm{real}}, z_{\mathrm{noise}})$ can be any distribution such that the marginal distribution of $z_{\mathrm{real}}$ (respectively $z_{\mathrm{noise}}$) is uniform (respectively Normal)^3. While the most natural choice for $P$ is to sample $z_{\mathrm{real}}$ and $z_{\mathrm{noise}}$ independently, we shall allow them to depend on each other.
Footnote 3: This is since $M$ is separated into two additive terms, each of which depends on either $z_{\mathrm{real}}$ or $z_{\mathrm{noise}}$.
Thus, in the context of GANs, realizability implies that there exists a generator-discriminator pair $(u^*, v^*)$ such that for any sample $z \in \mathrm{supp}(P)$, $(u^*, v^*)$ is also a pure equilibrium of $M(\cdot, \cdot; z)$. A natural case where realizability applies is when the following assumption holds,
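Assumption 2 ("Perfect" generator).
There exists $u^*$ such that $G_{u^*}(z_{\mathrm{noise}}) = z_{\mathrm{real}}$ for any $(z_{\mathrm{real}}, z_{\mathrm{noise}}) \in \mathrm{supp}(P)$.
This means that there exists a "perfect" generator $G_{u^*}$ that, for any pair of true sample $z_{\mathrm{real}}$ and matching noise injection $z_{\mathrm{noise}}$, perfectly reproduces $z_{\mathrm{real}}$ from $z_{\mathrm{noise}}$. In this case the optimal discriminator outputs $1/2$. As seen in Fig. 6, the above assumption holds for well trained GANs in practice, and the discriminator output is quite concentrated around $1/2$ for both real and generated samples.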
The question of sampling from the joint distribution of matching (true data, noise injection) pairs was recently addressed in (Dumoulin et al., 2016), where the authors devise a novel GAN architecture that produces these pairs throughout the training. In our experiments we use their architecture to sample matching pairs dependently. Additionally, Dumoulin et al. (2016) also show that our "perfect generator" assumption holds approximately in practice (see Figures 2-4 therein).
Fast convergence for DG optimization under realizability.
We have seen that realizability implies we can turn a stochastic minimax problem as in Eq. (5) into a stochastic minimization problem as in Eq. (6). Notably, this enables the use of standard optimization algorithms to train a generative model. It is well known that under realizability, SGD achieves a convergence rate of $O(1/T)$ for stochastic convex optimization problems (Moulines and Bach, 2011; Needell et al., 2014), which improves over the standard $O(1/\sqrt{T})$ rate. However, in the realizable case, SGD requires prior knowledge about the smoothness of the problem, which is usually unknown in advance. Moreover, in practice one often uses adaptive methods such as AdaGrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2015) for training generative models. Next we demonstrate that a simple version of AdaGrad can in fact achieve the fast rate of $O(1/T)$ without knowing the smoothness constant of the objective, $L$.
Our goal is to solve stochastic optimization problems as in Eq. (5), and we assume that each $M(\cdot, \cdot; z)$ is convex-concave. Under realizability, this translates to solving a stochastic convex optimization problem as in Eq. (6). To solve this we consider the following update,
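(8)  (simplified AdaGrad)  $x_{t+1} = \Pi_{\mathcal{K}}\left( x_t - \eta_t g_t \right), \quad \eta_t = D \Big/ \sqrt{2 \sum_{\tau=1}^{t} \|g_\tau\|^2},$
where $x_t := (u_t, v_t)$, $\Pi_{\mathcal{K}}$ denotes the projection onto the decision set $\mathcal{K}$, $D$ is the diameter of $\mathcal{K}$, and $g_t$ is an unbiased estimate of $\nabla \mathrm{DG}(x_t)$. Note that $\eta_t$ does not depend on the smoothness constant $L$. Next we show that this AdaGrad variant ensures an $O(1/T)$ rate under realizability. Let $\nabla \mathrm{DG}_z(u, v)$ be the concatenation of the gradients of $\mathrm{DG}_z$ w.r.t. $u$ and $v$.
Lemma 1.
Consider a stochastic optimization problem in the form of Eq. (6). Further assume that $\mathrm{DG}_z$ is $L$-smooth^4 and convex for every $z \in \mathrm{supp}(P)$, and the realizability assumption holds, i.e., there exists $x^* \in \mathcal{K}$ such that $\mathrm{DG}_z(x^*) = 0$ for every $z \in \mathrm{supp}(P)$. Then applying AdaGrad to this objective ensures an $O(L D^2 / T)$ convergence rate, where $D$ is the diameter of $\mathcal{K}$.
Footnote 4: $L$-smoothness means that the gradients of $\mathrm{DG}_z$ are $L$-Lipschitz, i.e., $\|\nabla \mathrm{DG}_z(x) - \nabla \mathrm{DG}_z(y)\| \le L \|x - y\|$ for all $x, y \in \mathcal{K}$.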
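As an illustration of the setting of Lemma 1, the sketch below runs a scalar-step-size AdaGrad with projection on a realizable stochastic least-squares problem. Everything specific here is an assumption for illustration: the step size `diameter / sqrt(2 * sum of squared gradient norms)` is the common "AdaGrad-norm" form and may differ in constants from the update analyzed in this paper, the objective is a stand-in for the duality gap, and the names `project_to_ball`, `realizable_grad` and `adagrad_norm` are hypothetical. Note that the step size never uses the smoothness constant.

```python
import math
import random


def project_to_ball(x, radius):
    """Euclidean projection onto the origin-centered ball of the given radius."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)
    return [xi * radius / norm for xi in x]


def realizable_grad(x, x_star, rng):
    """Stochastic gradient of f_a(x) = 0.5 * (a . (x - x_star))**2 for a random a.

    Every sample objective f_a is minimized at x_star (realizability).
    """
    a = [rng.uniform(-1.0, 1.0) for _ in x]
    residual = sum(ai * (xi - si) for ai, xi, si in zip(a, x, x_star))
    return [residual * ai for ai in a]


def adagrad_norm(grad_fn, x0, diameter, steps, rng):
    """Projected AdaGrad with a single adaptive step size; returns the averaged iterate."""
    x = list(x0)
    avg = [0.0] * len(x)
    sq_sum = 0.0
    for t in range(1, steps + 1):
        # Running average of the points at which gradients are queried.
        avg = [a + (xi - a) / t for a, xi in zip(avg, x)]
        g = grad_fn(x, rng)
        sq_sum += sum(gi * gi for gi in g)
        if sq_sum > 0.0:
            lr = diameter / math.sqrt(2.0 * sq_sum)
            x = project_to_ball([xi - lr * gi for xi, gi in zip(x, g)], diameter / 2.0)
    return avg


rng = random.Random(0)
x_star = [0.5, -0.3]
x_avg = adagrad_norm(lambda x, r: realizable_grad(x, x_star, r),
                     x0=[-1.5, 1.0], diameter=4.0, steps=2000, rng=rng)
print("distance of averaged iterate to x_star:", math.dist(x_avg, x_star))
```

Because all sample objectives share the minimizer, the stochastic gradients shrink as the iterate approaches it, so the accumulated squared norm stays bounded and the step size does not decay to zero. This is the intuition behind the faster rate under realizability.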
2.3 Algorithm
In the previous section we established convergence guarantees for DG that are on par with the ones found in the GAN literature, e.g. (Goodfellow et al., 2014; Nowozin et al., 2016). However, computing the DG exactly requires finding the worst-case max/min players (discriminator/generator) in Eq. 4. In practice, this is not always feasible or computationally efficient, and we therefore estimate the worst-case players using gradient-based optimization with a finite number of steps, denoted by $k$ (see Alg. 1). Similar procedures are used in the GAN literature, where each player only solves its own loss approximately, potentially using more steps as suggested in (Goodfellow et al., 2014; Arjovsky et al., 2017) or using an unrolling procedure (Metz et al., 2016). To speed up the optimization, we initialize the networks using the parameters of the adversary at the particular step being evaluated. As discussed in (Grnarova et al., 2019), this initialization scheme not only speeds up optimization, but also ensures that the practical DG is nonnegative. Next we demonstrate that the approximation of the DG enjoys the desired properties and still leads to learning good generative models.
The effect of $k$ for approximating the DG has been analyzed in (Grnarova et al., 2019). In summary, even for small values of $k$, the approximate DG accurately reflects the (true) DG and the quality of the generative model. This is further supported by (Schäfer et al., 2019), which shows that it takes only a few update steps for the discriminator to pick up on the quality of the generator.
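The procedure described above can be sketched on a toy problem as follows. This is a minimal illustration, not a transcription of Alg. 1: it assumes the bilinear game M(u, v) = u * v, plain gradient steps for the inner loops, and an outer gradient step that treats the estimated worst-case players as constants. The function name `approx_dg_step` and all step sizes are hypothetical.

```python
import math


def approx_dg_step(u, v, k=10, inner_lr=0.1, outer_lr=0.1):
    """One outer step on the k-step approximation of the duality gap of M(u, v) = u * v."""
    # Worst-case max player: k ascent steps on M(u, .), initialized at the current v.
    v_worst = v
    for _ in range(k):
        v_worst += inner_lr * u  # dM/dv = u
    # Worst-case min player: k descent steps on M(., v), initialized at the current u.
    u_worst = u
    for _ in range(k):
        u_worst -= inner_lr * v  # dM/du = v
    # Approximate duality gap at the current point (u, v).
    dg = u * v_worst - u_worst * v
    # Gradient of the gap w.r.t. (u, v), holding the worst-case players fixed.
    grad_u = v_worst
    grad_v = -u_worst
    return u - outer_lr * grad_u, v - outer_lr * grad_v, dg


u, v = 1.0, 1.0
for step in range(100):
    u, v, dg = approx_dg_step(u, v)
print(f"distance to equilibrium after 100 steps: {math.hypot(u, v):.2e}")
```

In this toy game the approximate gap equals k * inner_lr * (u^2 + v^2), which is nonnegative because the inner loops start from the current parameters, and both players descend on it jointly.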
Landscape. As mentioned, the DG converts the setting from an adversarial minimax game, where the goal is to find a (local) NE, to an optimization setting. In Fig. 2 we show the minimax landscape of a game with 3 NE and one bad stationary point (Mazumdar et al., 2019) and its transformation when the objective changes to DG. We show both the true (theoretical) DG, as well as approximations for various values of $k$, and demonstrate they are able to closely approximate the true landscape, especially around the critical points. In particular, the DG value is the lowest for the NE, and is high for the bad stationary point across all approximations.
Convergence. In Fig. 3 we empirically demonstrate that using DG as an objective in a convex-concave setting yields convergence to the solution of the game, both for the true and the approximate DG (similar examples for the nonconvex-concave setting can be found in the appendix). Since this example is commonly analyzed in the literature, we follow with a stability analysis of the practical algorithm.
2.4 Stability behavior
Finally, we analyze the stability behavior of optimizing the duality gap for the game shown in Fig. 3. We provide a brief summary and give a detailed derivation in Appendix C.
SGD
The SGD updates are:
The eigenvalues of the system are
and , which leads either to oscillations or divergent behavior depending on the value of c.
Duality gap
The duality gap function can be defined as with updates:
The eigenvalues of the above system are
We therefore have stability as long as , which requires the number of updates to be . As can be seen by examining the DG updates with a finite , there is an additional term that appears in the updates that "contracts" the dynamics and leads to convergence, even for .
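The contraction effect described above can be checked numerically on a stand-in problem. The sketch below is our own derivation under stated assumptions and is not the analysis of Appendix C: it assumes the bilinear game M(x, y) = c * x * y, simultaneous gradient updates, and a k-step approximate duality gap whose worst-case players are held fixed in the outer gradient. Under these assumptions the outer update is linear with eigenvalues 1 - outer_lr * (k * inner_lr * c^2 ± i c), while simultaneous gradient descent-ascent has eigenvalues 1 ± i * lr * c. The function names are hypothetical.

```python
def gda_spectral_radius(c, lr):
    """Modulus of the eigenvalues 1 +/- i*lr*c of simultaneous GDA on M(x, y) = c*x*y."""
    return abs(complex(1.0, lr * c))


def dg_spectral_radius(c, k, inner_lr, outer_lr):
    """Modulus of the eigenvalues of the approximate-DG update on M(x, y) = c*x*y."""
    contraction = outer_lr * k * inner_lr * c * c  # extra real term introduced by the k inner steps
    return abs(complex(1.0 - contraction, outer_lr * c))


print("GDA radius:      ", gda_spectral_radius(c=1.0, lr=0.1))
print("DG radius (k=10):", dg_spectral_radius(c=1.0, k=10, inner_lr=0.1, outer_lr=0.1))
print("DG radius (k=0): ", dg_spectral_radius(c=1.0, k=0, inner_lr=0.1, outer_lr=0.1))
```

A radius below one means the linear dynamics contract toward the equilibrium. In this stand-in, the term proportional to k is what moves the eigenvalues inside the unit circle, and with k = 0 the update has the same spectral radius as GDA.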
3 Related work
Stabilizing GAN training has been an active research area, with work proposing new objectives (Arjovsky et al., 2017), adding regularizers (Gulrajani et al., 2017; Roth et al., 2017) or designing better architectures (Radford et al., 2015). In terms of optimization, the rotational dynamics of games have been pointed out as a cause of the complexity (Mescheder et al., 2018; Balduzzi et al., 2018), which was recently demonstrated empirically as well (Berard et al., 2019). To overcome oscillations, various works have explored iterate or model averaging (Gidel et al., 2018a; Grnarova et al., 2017) or used second-order information (Balduzzi et al., 2018; Wang et al., 2019). Other alternatives include training using optimism (Daskalakis et al., 2017), which extrapolates the next value of the gradient, and the variant of extragradient proposed by Gidel et al. (2018a), which anticipates the opponent's actions, although it was recently shown to break in simple settings (Chavdarova et al., 2019). Razaviyayn et al. (2020) give a survey of recent advances in min-max optimization.
The idea of duality for improving GAN training has been explored in simple settings; e.g., Li et al. (2017) show stabilization using the dual of linear discriminators. Farnia and Tse (2018) rely on duality to give a different interpretation to GANs with constrained discriminators, and Gemici et al. (2018) use the dual formulation of WGANs to train the decoder. A recent approach (Chen et al., 2018) uses a type of Lagrangian duality in order to derive an objective to train GANs, although it is not directly relatable to the original GAN as it is based on an ad hoc assumption of a finite set. Finally, Grnarova et al. (2019) proposed to use the duality gap, although not for optimization, but as an evaluation metric to monitor the training progress.
4 Experiments
We turn to an empirical evaluation of our theory for a wide range of problems commonly discussed in the literature (Berard et al., 2019; Metz et al., 2016; Wang et al., 2019).
4.1 Convergence analysis
As previously discussed, gradient descent-ascent (GDA) and many related gradient-based algorithms exhibit undesirable stability properties when used for solving games. Most of these instabilities can in fact be observed on simple toy problems, where we start our investigation before moving on to more complex problems. First, we demonstrate two fundamental properties of our algorithm: (i) convergence to (local) minima (which correspond to (local) Nash equilibria in the original game objective), while GDA either diverges or goes into limit cycles, and (ii) avoiding convergence to bad critical points to which GDA is attracted.
To that end, we compare DG to a variety of existing optimization algorithms: (i) Gradient Descent Ascent (GDA), (ii) Optimistic Gradient Descent Ascent (OGDA) (Daskalakis et al., 2017), (iii) Extragradient (EG) (Korpelevich, 1976), (iv) Symplectic Gradient Adjustment (SGA) (Balduzzi et al., 2018), (v) Consensus Optimization (CO) (Mescheder et al., 2018), (vi) Unrolled SGDA (Metz et al., 2016) and (vii) Follow-the-Ridge (FR) (Wang et al., 2019) on three simple low-dimensional problems (Fig. 4). These functions were proposed in (Wang et al., 2019):
The first two functions (see Fig. 4, left and middle panels) are two-dimensional quadratic problems while the third function (Fig. 4, right) has a more complicated landscape due to the sixth-order polynomial being scaled by an exponential. The first function has a local (and global) minimax at (0, 0). In Fig. 4 (left) it can be seen that only DG, FR, SGA and CO converge to it, while other methods diverge. For the second function, (0, 0) is not a local minimax (it is a local min for the max player); yet all algorithms except for DG, Unrolled GD and FR converge to this undesired stationary point. Finally, for the polynomial function (right), (0, 0) is again a local minimax, but most methods cycle around the equilibrium. Again, DG and FR are able to avoid the oscillating behaviour and converge to the correct solution.
Overall it can be observed that most existing algorithms fail on even simple toy examples, an observation made in previous works as well (e.g. (Wang et al., 2019; Mescheder et al., 2018; Gidel et al., 2018a)). In contrast, we observe a positive behavior for the DG objective, both around good as well as bad stationary points. The reason for this is the "correction" term that appears due to the updates. For an analytical treatment and further intuition, see Appendix D.1.1.
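The statements about (0, 0) for the two quadratic problems can be checked with a second-order test. The sketch below assumes the two quadratics are g1(x, y) = -3x^2 - y^2 + 4xy and g2(x, y) = 3x^2 + y^2 + 4xy, with x the min player and y the max player; these forms are recalled from Wang et al. (2019) and should be treated as an assumption, since the formulas are not reproduced in the text above. The test is the standard sufficient condition for a strict local minimax at a stationary point (negative curvature in y and a positive Schur complement), and the helper names are hypothetical.

```python
def g1(x, y):
    return -3 * x * x - y * y + 4 * x * y


def g2(x, y):
    return 3 * x * x + y * y + 4 * x * y


def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference estimates of (f_xx, f_xy, f_yy) at (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h * h)
    return fxx, fxy, fyy


def is_strict_local_minimax(f, x, y):
    """Second-order sufficient test; assumes (x, y) is already a stationary point."""
    fxx, fxy, fyy = hessian_2d(f, x, y)
    if fyy >= 0:  # y is not at a strict local maximum of f(x, .)
        return False
    return fxx - fxy * fxy / fyy > 0  # Schur complement must be positive


print("g1 at (0, 0):", is_strict_local_minimax(g1, 0.0, 0.0))
print("g2 at (0, 0):", is_strict_local_minimax(g2, 0.0, 0.0))
```

Under the assumed forms, the test accepts (0, 0) for the first function and rejects it for the second, where the max player sits at a local minimum, in line with the description above.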
4.2 Generative Adversarial Networks
We now turn to the problem of training GANs using DG as an objective function. As discussed in Sec. 2.2, we sample pairs such that they satisfy the condition , for some . This can be achieved by using an existing GAN architecture with an encoder such as BiGAN (Donahue et al., 2016), ALI (Dumoulin et al., 2016) and BigBiGAN (Donahue and Simonyan, 2019). In a nutshell, the encoder
provides an inverse mapping by projecting data back into the latent space. The discriminator then no longer classifies a datapoint alone as real or generated, but instead receives as input a pair of a datapoint and a corresponding latent vector 
and . In (Donahue and Simonyan, 2019), the authors demonstrate that a certain reconstruction ensures the condition holds with . We exploit such a construction for optimizing the DG objective. Halfway through the training, we start training with pairs and in order to achieve the computation of the DG with pairs that satisfy the aforementioned condition. The number of update steps is chosen from and we optimize using Adam. All training details and additional baselines can be found in Appendix D.
4.2.1 Mixture of Gaussians
We first evaluate 5 different algorithms (GDA, ALI, EG, CO and DG) on a Gaussian mixture with the original GAN saturating loss in a setting shown to be difficult for standard GDA (Wang et al., 2019) (Fig. 6). The model does not converge with GDA or EG and suffers from mode collapse. Only DG and CO are able to learn the data distribution, and DG converges to the solution quickly. We also include results for ALI that show its instability on this toy problem. In particular, the stability of all models can be seen from the progression of the DG metric throughout the training (last row in Fig. 6). For DG, we see that the gap quickly goes to zero.
GAN based  |  |  WGAN based  |  |
Model  |  IS  |  Model  |  IS  |
GAN (Adam)  |  8.58 ± 0.006  |  WGAN  |  7.41 ± 0.029  |
GAN (ExtraAdam)  |  8.80 ± 0.021  |  |  |
GAN (SGD)  |  8.19 ± 0.017  |  |  |
GAN (RMSProp)  |  8.69 ± 0.013  |  |  |
BiGAN/ALI  |  8.65 ± 0.081  |  WGAN GP  |  9.28 ± 0.004  |
DG  |  9.265 ± 0.021  |  DG WGAN  |  9.59 ± 0.014  |
4.2.2 Generating images
We showed that when training by optimizing DG, the algorithm (i) exhibits desirable convergence properties and (ii) is more stable. Next we look at how this leads to improvements in generating real images.
Exploring the optimization landscape. We train a GAN on MNIST with a DCGAN architecture (Radford et al., 2015) and spectral normalization. The adversarial losses we consider are: (i) variants of the GAN objective (optimized with different algorithms including SGD, Adam, RMSProp and ExtraAdam (Gidel et al., 2018a)) and (ii) several variants of WGAN (Arjovsky et al., 2017), including WGAN with gradient penalty, WGAN-GP (Gulrajani et al., 2017). Note that our optimization setting can be applied to any minimax formulation of GANs by optimizing the DG for the specific game objective (GAN, WGAN etc.). We denote DG applied to the GAN objective as DG GAN and to the WGAN objective as DG WGAN, and report the Inception Score (IS) (Salimans et al., 2016) in Tab. 1, computed using an MNIST classifier. The optimization setup improves the scores for both objectives. In fact, there is a gap in the scores between the standard adversarial setting and our optimization setting throughout the entire training (see Fig. 5).
In addition, we follow the setup from (Berard et al., 2019) and extensively investigate the rotational and convergence behaviours of the models (see Fig. 13). Overall, while GANs exhibit rotations and converge to points that are a saddle point, instead of a local minimum, for the generator, DG is more stable, converges to local NEs and avoids the recurrent dynamics, which is also reflected in the improved IS scores.
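For reference, the Inception Score reported above is the exponential of the average KL divergence between each sample's predicted class distribution and the marginal class distribution. The sketch below computes it from a list of per-sample class probabilities; the function name `inception_score` and the input format are assumptions, and the classifier that would produce these probabilities (an MNIST classifier in the experiments above) is not included.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over samples.

    probs: list of per-sample class-probability lists, each summing to 1.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # Terms with zero probability contribute nothing to the KL divergence.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0.0)
    return math.exp(kl_sum / n)


# Confident predictions spread evenly over 10 classes give the maximal score.
one_hot = [[1.0 if c == i else 0.0 for c in range(10)] for i in range(10)]
# Uninformative predictions give the minimal score.
uniform = [[0.1] * 10 for _ in range(10)]
print(inception_score(one_hot), inception_score(uniform))
```

With 10 classes the score lies between 1 and 10, which gives a scale for reading the values in Tab. 1.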
Fréchet Inception Distance (FID).
In Tab. 2 we further compare DG to various GAN algorithms on different datasets (ranging from simple to advanced complexity): MMGAN (saturating GAN), NSGAN (non-saturating GAN), LSGAN (Mao et al., 2017), WGAN, WGAN-GP, DRAGAN (Kodali et al., 2017), BEGAN (Berthelot et al., 2017), SNGAN (Miyato et al., 2018), ExtraAdam (Gidel et al., 2018b) and DCGAN. FID (Heusel et al., 2017) is computed using 10K samples across 10 different runs using the features from the Inception Net for all datasets except CelebA, for which we use VGG. We again see that minimizing the DG translates to practical improvement through obtaining better (or comparable) results.
Alg/FID  MNIST  FMNIST  CIFAR10  CelebA 
MMGAN  9.8 ± 0.9  29.6 ± 1.6  72.7 ± 3.6  65.6 ± 4.2 
NSGAN  6.8 ± 0.5  26.5 ± 1.6  58.5 ± 1.9  55.0 ± 3.3 
LSGAN  7.8 ± 0.6  30.7 ± 2.2  87.1 ± 47.5  53.9 ± 2.8 
WGAN  6.7 ± 0.4  21.5 ± 1.6  55.2 ± 2.3  41.3 ± 2.0 
WGAN GP  20.3 ± 5.0  24.5 ± 2.1  55.8 ± 0.9  30.0 ± 1.0 
DRAGAN  7.6 ± 0.4  27.7 ± 1.2  69.8 ± 2.0  42.3 ± 3.0 
BEGAN  13.1 ± 1.0  13.1 ± 1.0  71.4 ± 1.6  38.9 ± 0.9 
SNGAN  6.5 ± 0.2  24.3 ± 1.2  19.22 ± 2.4  46.2 ± 3.4 
ExtraAdam  6.6 ± 0.8  24.1 ± 0.8  17.31 ± 1.8  40.1 ± 2.8 
DCGAN  7.8 ± 0.3  29.01 ± 0.4  21.12 ± 2.0  54.6 ± 3.2 
DG (k=10)  6.0 ± 0.6  20.03 ± 1.1  13.66 ± 2.2  30.1 ± 2.0 
Training BigBiGANs by optimizing DG. As previously discussed, moving from an adversarial to an optimization setting presents several benefits, which we highlight in the following experiment. We train a low-resolution BigBiGAN (Donahue and Simonyan, 2019) on MNIST, FashionMNIST and CIFAR10 and report the corresponding FID for conditional and unconditional models trained via DG and the BigBiGAN objective. Tab. 3 shows that the FID scores for DG are substantially improved, consistently across all datasets and settings. In fact, one can demonstrate increased benefits of the optimization setting by further training the final converged models obtained with the adversarial loss. This yields significant improvements, as shown in the last rows of Table 3 (see GAN+DG).
Model  |  CIFAR U  |  MNIST U  |  FMNIST U  |
DG  |  22.14  |  6.11  |  20.33  |
GAN  |  24.41  |  10.56  |  22.55  |
GAN+DG  |  22.26  |  6.76  |  20.41  |
Model  |  CIFAR C  |  MNIST C  |  FMNIST C  |
DG  |  21.45  |  6.05  |  20.21  |
GAN  |  23.00  |  22.03  |  25.86  |
GAN+DG  |  21.42  |  9.91  |  21.19  |
DG again exhibits more stable behaviour throughout the training, as can be seen in Fig. 7a from the progression of the DG when trained on MNIST. There is a consistent gap in terms of DG between the models, which can be reduced by further training the BigBiGAN with the DG objective. We observe similar behaviour across the different datasets (additional plots in App. D). Moreover, we show the progression of the MNIST IS and MNIST FID for the two models, which again point to DG being more stable (Fig. 7b and c).
Apart from improved stability and performance, this experiment also demonstrates another property of optimizing without the competitive aspect of the minimax objective. Since the DG becomes lower when either player gets better (irrespective of the other player), there is no competition between the two players as in standard GAN training. Instead, in GANs one player is trying to maximize, while the other is trying to minimize the objective, so in order for one player to improve their utility, they need to do better with respect to their opponent. This ultimately means that one needs to carefully adjust the learning rates of both players; if one player is dominant, the game may become unstable and stop at a suboptimal point (Heusel et al., 2017). In contrast, optimizing the DG (i.e. both players have the same goal), relaxes the need for tuning learning rates (Fig. 7 d).
5 Conclusion
Training GANs is a notoriously hard problem due to the min-max nature of the GAN objective. Practitioners often face a difficult hyperparameter tuning step, having to ensure that neither player overpowers the other. In this work, we proposed an alternative objective function based on the theoretically motivated concept of duality. We proved convergence of this algorithm under assumptions commonly used in the literature and we further supported our claim on a wide range of problems. Empirically, we have seen that optimizing the duality gap yields a more stable algorithm. An interesting direction for future work would be to loosen the convexity assumption on one side of the game, as in (Nouiehed et al., 2019). Further relaxations have not yet been shown to be possible and are still an open question. Finally, one could explore alternative methods for optimizing the DG, such as second-order methods.
References
 Lastiterate convergence rates for minmax optimization. arXiv preprint arXiv:1906.02027. Cited by: Appendix C.

Investigating under and overfitting in wasserstein generative adversarial networks
. arXiv preprint arXiv:1910.14137. Cited by: §D.4.  Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.3, §3, §4.2.2.
 The mechanics of nplayer differentiable games. arXiv preprint arXiv:1802.05642. Cited by: item (iv), §1, §3, §4.1.
 A closer look at the optimization landscapes of generative adversarial networks. arXiv preprint arXiv:1906.04848. Cited by: §D.3, §D.3, §3, §4.2.2, §4.
 Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §4.2.2.

Reducing noise in gan training with variance reduced extragradient
. In Advances in Neural Information Processing Systems, pp. 391–401. Cited by: §3.  Training generative adversarial networks via primaldual subgradient methods: a lagrangian perspective on gan. arXiv preprint arXiv:1802.01765. Cited by: §3.
 Training gans with optimism. arXiv preprint arXiv:1711.00141. Cited by: item (ii), §3, §4.1.
 Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in neural information processing systems, pp. 2933–2941. Cited by: §1.
 Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §4.2.
 Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551. Cited by: §D.4, §4.2.2, §4.2.
 Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12 (Jul), pp. 2121–2159. Cited by: §2.2, §2.2.
 Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §2.2, §2.2, §4.2.
 A convex duality framework for gans. In Advances in Neural Information Processing Systems, pp. 5248–5258. Cited by: §3.
 Primal-dual wasserstein gan. arXiv preprint arXiv:1805.09575. Cited by: §3.
 A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551. Cited by: §3, §4.1, §4.2.2.
 Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740. Cited by: §1, §4.2.2.
 Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §1, §2.1, §2.3.
 An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269. Cited by: §3.
 A domain agnostic measure for monitoring and evaluating gans. In Advances in Neural Information Processing Systems, pp. 12069–12079. Cited by: §D.4, §1, §2.1, §2.2, §2.3, §2.3, §3, footnote 2.
 Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §3, §4.2.2.
 Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §D.4, §1, §4.2.2, §4.2.2.
 Non-convex optimization for machine learning. arXiv preprint arXiv:1712.07897. Cited by: §1.
 What is local optimality in nonconvex-nonconcave minimax optimization?. arXiv preprint arXiv:1902.00618. Cited by: §1.
 First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure. Optimization for Machine Learning 30 (9), pp. 149–183. Cited by: §2.2.
 Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR’15), San Diego. Cited by: §2.2.
 On convergence and stability of gans. arXiv preprint arXiv:1705.07215. Cited by: §4.2.2.
 The extragradient method for finding saddle points and other problems. Matecon 12, pp. 747–756. Cited by: item (iii), §4.1.
 Online to offline conversions, universality and adaptive minibatch sizes. In Advances in Neural Information Processing Systems, pp. 1613–1622. Cited by: Appendix B.
 Dualing gans. In Advances in Neural Information Processing Systems, pp. 5606–5616. Cited by: §3.
 Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: Table 2.

 Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794–2802. Cited by: §4.2.2.
 On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838. Cited by: §2.3.
 Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: Appendix C, item (v), §1, §2.1, §3, §4.1, §4.1.
 Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163. Cited by: item (vi), §2.3, §4.1, §4.
 Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §D.3, §4.2.2.

 Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459. Cited by: §2.2.
 Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in neural information processing systems, pp. 1017–1025. Cited by: §2.2.
 Cubic regularization of newton method and its global performance. Mathematical Programming 108 (1), pp. 177–205. Cited by: §1.
 Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pp. 14934–14942. Cited by: §5.
 f-gan: training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279. Cited by: §1, §2.1, §2.3.
 Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §D.3, §3, §4.2.2.
 Non-convex min-max optimization: applications, challenges, and recent theoretical advances. IEEE Signal Processing Magazine 37 (5), pp. 55–66. Cited by: §3.
 Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pp. 2018–2028. Cited by: §3.
 Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §4.2.2.
 Implicit competitive regularization in gans. arXiv preprint arXiv:1910.05852. Cited by: §D.4, §2.3.
 On solving minimax optimization locally: a follow-the-ridge approach. arXiv preprint arXiv:1910.07512. Cited by: item (vii), §D.1.1, §D.1.1, §D.1, §D.2, §3, §4.1, §4.1, §4.2.1, §4.
Appendix
Appendix A Approximate Realizability
Recall that in Section 2.2 we showed that, under the perfect realizability assumption (Assumption 1), the original stochastic minimax problem of Eq. (5) is equivalent to the stochastic DG problem of Eq. (6). This equivalence justifies using the stochastic DG formulation to solve the original problem.
Here we relax the assumption of perfect realizability and show that under an approximate realizability assumption one can still relate the solution of the stochastic DG problem to the original minimax objective.
First let us define approximate realizability with respect to the stochastic minimax problem (Eq. (5)),
Assumption 3 (Approximate Realizability).
There exists a solution , and , such that is an approximate equilibrium of for any , i.e.,
where supp() is the support of .
Note that taking the approximation error to be zero in the above definition recovers the perfect realizability assumption (Assumption 1).
Clearly, under Assumption 3, the original stochastic minimax problem is no longer equivalent to the stochastic DG minimization problem. Nevertheless, we show next that in this case solving the stochastic DG minimization problem still yields a solution that is approximately optimal with respect to the original minimax problem.
Lemma 2.
Proof.
Let be the duality gap of the original minimax problem. Our goal is to show that . Indeed,
where we have used the definition of as the minimizer of , as well as the realizability assumption. ∎
Appendix B Proof of Lemma 1
Before we prove the lemma, we denote , and for any we denote,
Next we rewrite the problem formulation and our assumptions using the above notation.
Our objective can now be written as follows,
We assume that each ’s is convex and smooth meaning,
where for any we denote,
We further assume there exists such,
We analyze the following version of AdaGrad,
where is an unbiased gradient estimate of , which uses a sample . We also assume that are i.i.d. samples from . Note that is the orthogonal projection onto , defined as . Finally, denotes the diameter of , i.e., .
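As an illustration of this kind of update, here is a minimal sketch of projected AdaGrad with a single scalar step size. It is not the paper's code: the Euclidean-ball domain, the step size `diameter / sqrt(2 * sum of squared gradient norms)`, the realizable least-squares problem, and all names (`project_ball`, `adagrad_norm`, `sample_grad`, `full_loss`) are assumptions made for this example.

```python
import numpy as np

def project_ball(x, radius):
    """Orthogonal projection onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def adagrad_norm(grad_fn, x0, radius, steps, rng):
    """Projected AdaGrad with a scalar step size; returns the averaged iterate."""
    diameter = 2.0 * radius           # diameter of the (assumed) ball domain
    x = project_ball(np.asarray(x0, dtype=float), radius)
    sq_sum = 0.0                      # running sum of squared gradient norms
    avg = np.zeros_like(x)
    for t in range(steps):
        g = grad_fn(x, rng)           # unbiased stochastic gradient at x
        sq_sum += float(g @ g)
        if sq_sum > 0.0:
            step = diameter / np.sqrt(2.0 * sq_sum)
            x = project_ball(x - step * g, radius)
        avg += (x - avg) / (t + 1)    # running mean of the updated iterates
    return avg

# Realizable least squares: every sample loss 0.5 * (a_i @ x - b_i)**2
# is minimized by the same x_star, which lies inside the ball.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
x_star = rng.normal(size=5)
x_star /= np.linalg.norm(x_star)
b = A @ x_star

def sample_grad(x, rng):
    i = rng.integers(len(A))          # draw one sample uniformly at random
    return (A[i] @ x - b[i]) * A[i]

def full_loss(x):
    return 0.5 * float(np.mean((A @ x - b) ** 2))

x_bar = adagrad_norm(sample_grad, np.zeros(5), radius=2.0, steps=5000, rng=rng)
initial_loss, final_loss = full_loss(np.zeros(5)), full_loss(x_bar)
```

The step size needs no knowledge of the smoothness constant. Under realizability the gradient norms shrink as the iterates approach the shared minimizer, so the accumulated sum stays bounded and the step size does not decay to zero.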
Next we restate the lemma using these notations and provide the proof.
Lemma 3 (Lemma 1, Restated).
Consider a stochastic optimization problem of the form . Further assume that is smooth and convex for any in the support of , and that the realizability assumption holds, i.e., there exists such that for any in the support of . Then applying AdaGrad to this objective ensures an $O(LD^2/T)$ convergence rate, where $L$ is the smoothness constant and $D$ is the diameter of the feasible set.
Proof of Lemma 3.
Step 1.
Our first step is to bound the second moment of the gradient estimates. To do so we shall require the following lemma regarding smooth functions (see proof in Appendix B.1).
Lemma 4.
Let be a smooth function and let be its global minimum, i.e. . Then the following holds,
Applying the above lemma, and using realizability immediately implies,
And therefore,
(9) 
Step 2. Standard analysis of AdaGrad gives the following bound (see, e.g., [Levy, 2017]),
Using the above together with Jensen’s inequality with respect to the concave function , and together with Eq. (9) gives,
Rearranging the above immediately gives,
Taking and using the above together with Jensen’s inequality implies,
which establishes an $O(1/T)$ rate for this case.
∎
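The two steps above can be summarized as a short chain of inequalities. This is a sketch in assumed notation, which may differ from the paper's: $f(x) = \mathbb{E}_z f(x; z)$ is the objective, $g_t$ the stochastic gradient at $x_t$, $x^*$ the realizable minimizer, $L$ the smoothness constant, $D$ the diameter, and $C$ an unspecified absolute constant in the AdaGrad bound.

```latex
% Step 1: Lemma 4 applied to each f(.; z), using realizability
% (x^* minimizes every f(.; z)), then taking expectations:
\mathbb{E}\|g_t\|^2 \le 2L\,\mathbb{E}\big[f(x_t) - f(x^*)\big].

% Step 2: AdaGrad bound, then Jensen's inequality for the concave
% square root, then Step 1:
\sum_{t=1}^{T} \mathbb{E}\big[f(x_t) - f(x^*)\big]
  \le C D\, \mathbb{E}\sqrt{\sum_{t=1}^{T} \|g_t\|^2}
  \le C D \sqrt{\sum_{t=1}^{T} \mathbb{E}\|g_t\|^2}
  \le C D \sqrt{2L \sum_{t=1}^{T} \mathbb{E}\big[f(x_t) - f(x^*)\big]}.

% Rearranging (divide by the square root of the left-hand side, then square):
\sum_{t=1}^{T} \mathbb{E}\big[f(x_t) - f(x^*)\big] \le 2 C^2 L D^2.

% Averaging, \bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t, and Jensen's
% inequality for the convex function f:
\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \le \frac{2 C^2 L D^2}{T}.
```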
B.1 Proof of Lemma 4
Proof.
The smoothness of means that the following holds ,
Taking we get,
Thus:
where in the last inequality we used which holds since is the global minimum. ∎
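For reference, the standard argument behind this lemma can be written out as follows. The symbols $F$ (the smooth function), $L$ (its smoothness constant), $x^*$ (its global minimizer) and $u$ (the displacement) are assumed notation and may differ from the paper's.

```latex
% L-smoothness of F, for all x and u:
F(x + u) \le F(x) + \nabla F(x)^\top u + \frac{L}{2}\|u\|^2.

% Taking u = -\frac{1}{L}\nabla F(x):
F\!\left(x - \tfrac{1}{L}\nabla F(x)\right) \le F(x) - \frac{1}{2L}\|\nabla F(x)\|^2.

% Thus:
\|\nabla F(x)\|^2 \le 2L\big(F(x) - F(x + u)\big) \le 2L\big(F(x) - F(x^*)\big),
% where the last inequality uses F(x^*) \le F(x + u).
```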
Appendix C Bilinear Games
Throughout the paper we use several bilinear toy games. Concretely, Fig. 1 is created using the function:
for .
The functions used for Fig. 3 and Fig. 9 are:

Convex-concave:
(10) 
Nonconvex-nonconcave:
(11) (12)
as suggested in [Abernethy et al., 2019].
Fig. 8 also shows the behavior of the algorithm for different values of .
In Fig. 3 we empirically demonstrate that using DG as an objective in a convex-concave setting (Fig. 3 (a) and (b)), and a simple nonconvex-nonconcave setting (Fig. 9 (c)), yields convergence to the solution of the game.
Analysing the updates.
The SGD updates for the game in Eq. (10) are:
Concretely,
The eigenvalues of the system are and , which leads either to oscillations or divergent behavior depending on the value of c [Mescheder et al., 2018].
The duality gap function can be defined as and the corresponding updates are:
Finally, the two combined lead to updates of the form:
The algorithm converges for all . As can be seen, when optimizing the practical DG there is an additional term that appears in the updates that "contracts" everything and leads to convergence, even for .
To study stability, we need to find the eigenvalues of the above matrix which we denote by . We set
(13) 
The eigenvalues are
(14) 
In order to get stability, we need the modulus of each eigenvalue to be smaller than one, i.e.
(15) 
which holds for .
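The stability argument can be checked numerically on a small example. The sketch below rests on assumptions that may differ from Eq. (10) and from the paper's practical DG: the game is taken to be f(x, y) = c·x·y, and each worst-case player is approximated by one gradient step and then held fixed while the DG surrogate is differentiated. The names `gda_matrix`, `dg_matrix`, `spectral_radius` and `inner_lr` are hypothetical.

```python
import numpy as np

def gda_matrix(c, lr):
    """Linear map of simultaneous gradient descent-ascent on f(x, y) = c*x*y."""
    return np.array([[1.0, -lr * c],
                     [lr * c, 1.0]])

def dg_matrix(c, lr, inner_lr):
    """Linear map of one descent step on a one-step 'practical DG' surrogate."""
    # Worst-case opponents from one gradient step (then treated as constants):
    #   y_w = y + inner_lr * c * x,   x_w = x - inner_lr * c * y
    # Descent on f(x, y_w) - f(x_w, y) gives
    #   x <- x - lr * c * y_w,        y <- y + lr * c * x_w
    shrink = 1.0 - lr * inner_lr * c**2   # the extra contracting term
    return np.array([[shrink, -lr * c],
                     [lr * c, shrink]])

def spectral_radius(m):
    """Largest eigenvalue modulus; below one means the iteration contracts."""
    return float(np.max(np.abs(np.linalg.eigvals(m))))

c, lr = 1.0, 0.1
rho_gda = spectral_radius(gda_matrix(c, lr))
rho_dg = spectral_radius(dg_matrix(c, lr, inner_lr=lr))
```

In this assumed setting the GDA eigenvalues are 1 ± i·lr·c, with modulus above one for any positive learning rate. The surrogate DG step shifts the real part to 1 - lr·inner_lr·c², which pulls the modulus below one for sufficiently small step sizes.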
Appendix D Experiments
In the following we give experimental details such as hyperparameters, architectures and additional baselines.
D.1 Toy problems
The three low dimensional problems we consider are:
as suggested in [Wang et al., 2019]. The algorithms we compare with are:

Gradient Descent Ascent (GDA)

Optimistic Gradient Descent Ascent (OGDA) [Daskalakis et al., 2017]

Extragradient (EG) [Korpelevich, 1976]

Symplectic Gradient Adjustment (SGA) [Balduzzi et al., 2018]

Consensus Optimization (CO) [Mescheder et al., 2018]

Unrolled SGDA [Metz et al., 2016]

Follow-the-Ridge (FR) [Wang et al., 2019]
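To make the first three baselines concrete, the following sketch implements GDA, EG and OGDA in their textbook forms. It runs them on the bilinear game f(x, y) = x·y, which is an assumption for illustration and not one of the three toy problems above; the function names, learning rate and step count are likewise hypothetical.

```python
import numpy as np

def field(z):
    """Update direction for f(x, y) = x * y: (df/dx, -df/dy)."""
    x, y = z
    return np.array([y, -x])

def gda(z, lr, steps):
    """Simultaneous gradient descent-ascent."""
    for _ in range(steps):
        z = z - lr * field(z)
    return z

def extragradient(z, lr, steps):
    """Extragradient: extrapolate, then update with the look-ahead gradient."""
    for _ in range(steps):
        z_half = z - lr * field(z)
        z = z - lr * field(z_half)
    return z

def ogda(z, lr, steps):
    """Optimistic GDA: twice the current gradient minus the previous one."""
    prev = field(z)
    for _ in range(steps):
        cur = field(z)
        z = z - 2.0 * lr * cur + lr * prev
        prev = cur
    return z

z0 = np.array([1.0, 1.0])
dist_gda = np.linalg.norm(gda(z0, 0.1, 2000))
dist_eg = np.linalg.norm(extragradient(z0, 0.1, 2000))
dist_ogda = np.linalg.norm(ogda(z0, 0.1, 2000))
```

On this game the unique equilibrium is the origin, so the distances above measure convergence directly: plain GDA spirals outward, while the extrapolation in EG and the gradient correction in OGDA both turn the rotation into a slow contraction.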
The learning rate for all algorithms is set to of . In addition,