1 Introduction
Since their introduction in [16], generative adversarial networks (GANs) have achieved great success in many tasks of learning the distribution of observed samples. Unlike traditional approaches to distribution learning, GANs view the learning problem as a zero-sum game between the following two players: 1) a generator $G$ aiming to generate real-like samples from a random noise input, and 2) a discriminator $D$ trying to distinguish $G$'s generated samples from real training data. This game is commonly formulated through a minimax optimization problem as follows:
$$\min_{G\in\mathcal{G}} \; \max_{D\in\mathcal{D}} \; V(G, D) \tag{1.1}$$
Here, $\mathcal{G}$ and $\mathcal{D}$ are respectively the generator and discriminator function sets, commonly chosen as two deep neural nets, and $V(G,D)$ denotes the minimax objective for generator $G$ and discriminator $D$, capturing how dissimilar the generated samples and training data are.
GAN optimization problems are commonly solved by alternating gradient methods, which under proper regularization have resulted in state-of-the-art generative models for various benchmark datasets. However, GAN minimax optimization has led to several theoretical and empirical challenges in the machine learning literature. Training GANs is widely known as a challenging optimization task, requiring an exhaustive hyperparameter and architecture search and demonstrating unstable behavior. While a few regularization schemes have achieved empirical success in training GANs
[41, 2, 18, 33], still little is known about the conditions under which GAN minimax optimization can be successfully solved by first-order optimization methods.

To understand the minimax optimization in GANs, one needs to first answer the following question: what is the proper notion of equilibrium in the GAN zero-sum game? In other words, what are the optimality criteria in the GAN's minimax optimization problem? A classical notion of equilibrium in the game theory literature is the Nash equilibrium, a state in which no player can raise its individual gain by choosing a different strategy. According to this definition, a Nash equilibrium $(G^*, D^*)$ for the GAN minimax problem (1.1) must satisfy the following for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:
$$V(G^*, D) \,\le\, V(G^*, D^*) \,\le\, V(G, D^*) \tag{1.2}$$
As a well-known result, for a generator expressive enough to reproduce the distribution of observed samples, a Nash equilibrium exists at the generator producing the data distribution [15]. However, such a Nash equilibrium would be of little interest from a learning perspective, since the trained generator merely overfits the empirical distribution of training samples [3]. More importantly, state-of-the-art GAN architectures [18, 33, 49, 7] commonly restrict the generator function through various means of regularization, such as batch or spectral normalization. Such regularization mechanisms do not allow the generator to produce the empirical distribution of observed data points. Since the realizability assumption does not apply to such regularized GANs, the existence of Nash equilibria is not guaranteed in their minimax problems.
The above discussion motivates studying the equilibrium of GAN zero-sum games in non-realizable settings, where the generator cannot express the empirical distribution of training data. Here, a natural question is whether a Nash equilibrium still exists for the GAN minimax problem. In this work, we focus on this question and demonstrate through several theoretical and numerical results that:


Nash equilibrium may not exist in GAN zerosum games.
We provide theoretical examples of well-known GAN formulations, including the vanilla GAN [16], the Wasserstein GAN (WGAN) [2], the $f$-GAN [38], and the second-order Wasserstein GAN (W2GAN) [13], for which no local Nash equilibria exist in their minimax optimization problems. We further perform numerical experiments on widely-used GAN architectures which suggest that empirically successful GAN training may converge to non-Nash equilibrium solutions.
Next, we focus on characterizing a new notion of equilibrium for GAN problems. To achieve this goal, we consider the Nash equilibrium of a new zero-sum game whose objective function is given by the following proximal operator applied to the minimax objective $V(G,D)$, with respect to a norm $\|\cdot\|$ on discriminator functions:
$$V^{(\lambda)}_{\mathrm{prox}}(G, D) \,:=\, \max_{\widetilde{D}\in\mathcal{D}} \; V(G, \widetilde{D}) - \frac{\lambda}{2}\,\big\|\widetilde{D} - D\big\|^2 \tag{1.3}$$
We refer to the Nash equilibrium of the new zero-sum game as a proximal equilibrium. Given the inherent sequential nature of GAN problems, where the generator moves first followed by the discriminator, we consider a Stackelberg game for their representation and focus on the subgame perfect equilibrium (SPE) of the game as the right notion of equilibrium for such problems [22]. We prove that the proximal equilibrium of Wasserstein GANs provides an SPE for the GAN problem. This result applies to both the first-order and second-order Wasserstein GANs. In these cases, we show that a proximal equilibrium exists at the optimal generator minimizing the Wasserstein distance to the data distribution.
Inspired by these theoretical results, we propose a proximal approach for training GANs, which we call proximal training, by changing the original minimax objective to the proximal objective in (1.3). In addition to preserving the optimal solution to the GAN minimax problem, proximal training can further enjoy the existence of Nash equilibrium solutions in the new minimax objective. We discuss numerical results supporting the proximal training approach and the role of proximal equilibrium solutions in various GAN problems.
2 Related Work
Understanding the minimax optimization in modern machine learning applications, including GANs, has been a subject of great interest in the machine learning literature. A large body of recent works [8, 37, 34, 45, 50, 30, 14, 47, 27] has analyzed the convergence properties of first-order optimization methods in solving different classes of minimax games.
In a related work, [22] proposes a new notion of local optimality, called local minimax, designed for general sequential machine learning games. Compared to the notion of local minimax, the proximal equilibrium proposed in our work gives a notion of global optimality, which as we show directly applies to Wasserstein GANs. [22] also provides examples of minimax problems where Nash equilibria do not exist; however, the examples do not represent GAN minimax problems. Some recent works [27, 26, 48] have analyzed the convergence of different optimization methods to local minimax solutions.
In another related work, [9] analyzes the stable points of the gradient descent ascent (GDA) and optimistic GDA [8] algorithms, proving that these stable points form strict supersets of the local saddle points. Regarding the stability of GAN algorithms, [36] proves that the GDA algorithm is locally stable for the vanilla and regularized Wasserstein GAN problems. [13] shows the GDA algorithm is globally stable for W2GANs with linear generator and quadratic discriminator functions.
Regarding the equilibrium in GANs, [3] studies the Nash equilibrium of GAN minimax games in realizable settings. Also, [3, 21] develop methods for finding mixed-strategy Nash equilibria. In contrast, our results focus on pure strategies in non-realizable settings. [12] empirically studies the equilibrium of GAN problems regularized via the gradient penalty, reporting positive results on the stability of regularized GANs. However, our focus is on the existence of pure Nash equilibrium solutions. [35]
suggests a moment matching GAN formulation using the Sobolev norm. As a different direction, we use the Sobolev norm to analyze equilibrium in GANs. Finally, developing GAN architectures with improved equilibrium and stability properties has been studied in several recent works
[41, 4, 19, 32, 40, 24, 31, 51].

3 An Initial Experiment on Equilibrium in GANs
To examine empirically whether the Nash equilibrium exists in GAN problems, we performed a simple numerical experiment. In this experiment, we applied three standard GAN implementations, namely the Wasserstein GAN with weight clipping (WGAN-WC) [2], the improved Wasserstein GAN with gradient penalty (WGAN-GP) [18], and the spectrally-normalized vanilla GAN (SNGAN) [33], to the two benchmark datasets MNIST [25] and CelebA [29]. We used the convolutional architecture of the DCGAN [39], optimized with the Adam [23] or RMSprop [20] (only for WGAN-WC) optimizers.

We performed each of the GAN experiments for 200,000 generator iterations to reach a solution pair $(G_{\theta^*}, D_{w^*})$, with $\theta^*$ and $w^*$ denoting the trained generator and discriminator parameters at the end of the 200,000 iterations. Our goal is to examine whether this solution pair represents a Nash equilibrium or not. To do this, we fixed the trained discriminator and kept optimizing the generator, i.e., we continued optimizing the generator $G_\theta$ without changing the discriminator $D_{w^*}$. Here we solved the following optimization problem, initialized at $\theta^*$ and run with the default first-order optimizer for the generator function for 10,000 iterations:
$$\min_{\theta} \; V(G_{\theta}, D_{w^*}) \tag{3.1}$$
If the pair $(G_{\theta^*}, D_{w^*})$ were in fact a Nash equilibrium, it would give a local saddle point of the minimax optimization, and the above optimization could not make the objective any smaller than its initial value. Also, the image samples generated by the generator should have improved, or at least preserved, their initial quality during this optimization, since the discriminator $D_{w^*}$ would be the optimal discriminator against all generator functions.
Despite the above predictions, we observed that none of these statements holds in reality, in any of the six experiments over the three standard GAN implementations and the two datasets. The optimization objective decreased rapidly from the beginning of the optimization, and the pictures sampled from the generator completely lost their quality over this optimization. Figures 1(a) and 1(b) show the objective in the SNGAN experiments over the 10,000 steps of the above optimization. These figures also demonstrate the SNGAN-generated samples before and during the optimization, showing the significant drop in the quality of the generated pictures. We defer the results for the WGAN-WC and WGAN-GP problems to the Appendix.
The results of the above experiments show that practical GAN training may not converge to local Nash equilibrium solutions. After fixing the trained discriminator, the trained generator can be further optimized using a first-order optimization method to reach smaller values of the generator objective. More importantly, this optimization not only fails to improve the quality of the generator's output samples, but completely disrupts the trained generator. As demonstrated in these experiments, simultaneous optimization of the two players is in fact necessary for proper convergence and stability in GAN minimax optimization. The above experiments suggest that practical GAN solutions are not local Nash equilibria. In the upcoming sections, we review some standard GAN formulations and then show that there are examples of GAN minimax problems for which no Nash equilibrium exists. Those theoretical results further support our observations in the above experiments.
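The failure mode above can be reproduced in a minimal toy model (our own illustration, not the paper's experiment): a 1-D linear "WGAN" trained by simultaneous gradient descent-ascent, after which the discriminator is frozen and the generator alone keeps being minimized. The objective then drops well below its trained value, so the trained pair is not a saddle point:

```python
# A minimal 1-D "linear WGAN": data X ~ N(mu, 1), generator G(z) = theta + z,
# discriminator D(x) = w * x.  The minimax objective reduces to
#   V(theta, w) = E[D(X)] - E[D(G(Z))] = w * (mu - theta).
mu = 2.0

def V(theta, w):
    return w * (mu - theta)

# Phase 1: simultaneous gradient descent-ascent (the usual GAN training loop).
theta, w, lr = 0.0, 1.0, 0.05
for _ in range(500):
    g_theta, g_w = -w, mu - theta          # dV/dtheta, dV/dw at current point
    theta, w = theta - lr * g_theta, w + lr * g_w

v_trained = V(theta, w)

# Phase 2: freeze the discriminator and keep minimizing over theta alone,
# as in the Section 3 experiment.  If (theta, w) were a Nash equilibrium,
# the objective could not decrease below v_trained.
theta2 = theta
for _ in range(500):
    theta2 -= lr * (-w)                    # gradient step on theta at fixed w

v_after = V(theta2, w)
print(v_trained, v_after)
```

Since $V$ is bilinear, any fixed nonzero discriminator slope $w$ lets the generator drive the objective to $-\infty$, mirroring the collapse of sample quality observed above.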
4 Review of GAN Formulations
4.1 Vanilla GAN & f-GAN
Consider samples $x_1, \dots, x_n$ observed independently from a distribution $P_X$. Our goal is to find a generator function $G \in \mathcal{G}$ that maps a random noise input $Z$, drawn from a known distribution $P_Z$, to an output $G(Z)$ distributed as $P_X$; i.e., we aim to match the probability distributions $P_{G(Z)}$ and $P_X$. To find such a generator function, [16] proposes the following minimax problem, commonly referred to as the vanilla GAN problem:
$$\min_{G\in\mathcal{G}} \; \max_{D\in\mathcal{D}} \; \mathbb{E}_{P_X}\big[\log D(X)\big] + \mathbb{E}_{P_Z}\big[\log\big(1 - D(G(Z))\big)\big] \tag{4.1}$$
Here $\mathcal{G}$ and $\mathcal{D}$ represent the sets of generator and discriminator functions, respectively. In this formulation, the discriminator is optimized to map real samples from $P_X$ to larger values than the values assigned to generated samples from $P_{G(Z)}$.
As shown in [16], the above minimax problem, for an unconstrained $\mathcal{D}$ containing all real-valued functions, reduces to the following divergence minimization problem:
$$\min_{G\in\mathcal{G}} \; 2\,\mathrm{JSD}\big(P_X, P_{G(Z)}\big) - \log 4 \tag{4.2}$$
where $\mathrm{JSD}(P, Q)$ denotes the Jensen-Shannon (JS) divergence, defined in terms of the KL divergence as $\mathrm{JSD}(P, Q) := \frac{1}{2}\mathrm{KL}\big(P \,\|\, \frac{P+Q}{2}\big) + \frac{1}{2}\mathrm{KL}\big(Q \,\|\, \frac{P+Q}{2}\big)$.
f-GANs extend the vanilla GAN problem by generalizing the JS divergence to a general $f$-divergence. For a convex function $f$ with $f(1) = 0$, the $f$-divergence corresponding to $f$ is defined as
$$d_f(P, Q) \,:=\, \mathbb{E}_{Q}\Big[f\Big(\frac{p(X)}{q(X)}\Big)\Big] \tag{4.3}$$
where $p$ and $q$ denote the density functions of $P$ and $Q$. Notice that the JS divergence is a special case of the $f$-divergence with $f(u) = u\log u - (u+1)\log\frac{u+1}{2}$. [38] shows that generalizing the divergence minimization (4.2) to minimizing an $f$-divergence results in the following minimax problem, called the $f$-GAN:
$$\min_{G\in\mathcal{G}} \; \max_{D\in\mathcal{D}} \; \mathbb{E}_{P_X}\big[D(X)\big] - \mathbb{E}_{P_Z}\big[f^{*}\big(D(G(Z))\big)\big] \tag{4.4}$$
where $f^{*}$ denotes the Fenchel conjugate of $f$, defined as $f^{*}(t) := \sup_{u} \{ut - f(u)\}$. The discriminator space implied by the $f$-divergence minimization is the set of all functions, but a similar interpretation further applies to a constrained $\mathcal{D}$ [28, 10]. Several examples of $f$-GANs have been formulated and discussed in [38].
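As a small self-contained check of this variational representation (our own sketch, not from [38]), take the KL divergence, $f(u) = u\log u$ with Fenchel conjugate $f^{*}(t) = \exp(t-1)$. The bound $\mathbb{E}_P[T(X)] - \mathbb{E}_Q[f^{*}(T(X))]$ is tight at $T^{*}(x) = 1 + \log\frac{p(x)}{q(x)}$, and on a discretized pair of Gaussians it recovers $\mathrm{KL}(\mathcal{N}(0,1)\,\|\,\mathcal{N}(1,1)) = 0.5$:

```python
import numpy as np

# Variational (f-GAN style) lower bound for the KL divergence, f(u) = u*log(u),
# whose Fenchel conjugate is f*(t) = exp(t - 1).  On a discrete grid the bound
#   E_P[T(x)] - E_Q[f*(T(x))]
# is tight at the optimal "discriminator" T*(x) = 1 + log(p(x)/q(x)).
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # P = N(0, 1)
q = np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)  # Q = N(1, 1)

kl_direct = np.sum(p * np.log(p / q)) * dx            # analytically 0.5 here

T_star = 1 + np.log(p / q)                            # optimal T
kl_variational = np.sum(p * T_star) * dx - np.sum(q * np.exp(T_star - 1)) * dx

print(kl_direct, kl_variational)
```

Restricting $T$ to a smaller class (e.g., a neural network) only loosens the bound, which is the sense in which a constrained $\mathcal{D}$ still yields a divergence-like objective.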
4.2 Wasserstein GANs
To resolve GAN training issues, [2] proposes to formulate the GAN problem by minimizing optimal transport costs, which, unlike $f$-divergences, change continuously with the input distributions. Given a cost $c(x, x')$ for transporting $x$ to $x'$, the optimal transport cost $W_c(P, Q)$ is defined as
$$W_c(P, Q) \,:=\, \inf_{M \in \Pi(P, Q)} \; \mathbb{E}_{M}\big[c(X, X')\big] \tag{4.5}$$
where $\Pi(P, Q)$ denotes the set of all joint distributions on $(X, X')$ whose marginals are distributed as $P$ and $Q$, respectively. An important special case is the first-order Wasserstein distance ($W_1$ distance) corresponding to the cost $c(x, x') = \|x - x'\|_2$. In this special case, the Kantorovich-Rubinstein duality shows
$$W_1(P, Q) \,=\, \max_{\|D\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_P\big[D(X)\big] - \mathbb{E}_Q\big[D(X)\big]. \tag{4.6}$$
Here $\mathbb{E}_P$ denotes the expected value with respect to distribution $P$, and $\|D\|_{\mathrm{Lip}}$ denotes the Lipschitz constant of function $D$, defined as the smallest $L$ satisfying $|D(x) - D(x')| \le L\,\|x - x'\|_2$ for every $x, x'$. Formulating a GAN problem minimizing the $W_1$ distance, [2] states the Wasserstein GAN (WGAN) problem as follows:
$$\min_{G\in\mathcal{G}} \; \max_{\|D\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{P_X}\big[D(X)\big] - \mathbb{E}_{P_Z}\big[D(G(Z))\big] \tag{4.7}$$
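The Kantorovich-Rubinstein duality (4.6) can be checked numerically in one dimension, where the optimal coupling simply matches sorted samples (quantiles). In this sketch (our own illustration), $Q$ is $P$ shifted by a constant, so the 1-Lipschitz function $D(x) = -x$ attains the dual optimum:

```python
import numpy as np

# 1-D check of the Kantorovich-Rubinstein duality (4.6): the primal W1 between
# empirical measures equals the value attained by a 1-Lipschitz "discriminator".
# For Q obtained by shifting P by +2, D(x) = -x is optimal.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + 2.0                      # transport every point by +2  ->  W1 = 2

# Primal: in 1-D the optimal coupling matches sorted samples (quantiles).
w1_primal = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Dual: sup over 1-Lipschitz D of E_P[D] - E_Q[D]; D(x) = -x attains it here.
w1_dual = np.mean(-x) - np.mean(-y)

print(w1_primal, w1_dual)        # both equal 2 (up to floating point)
```

The continuity of $W_1$ in such shifts, in contrast to the jump behavior of $f$-divergences between distributions with disjoint supports, is precisely the motivation given in [2].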
The above Wasserstein GAN problem can be generalized to a general optimal transport cost with an arbitrary cost function $c$. The generalization is as follows:
$$\min_{G\in\mathcal{G}} \; \max_{D \,:\, c\text{-concave}} \; \mathbb{E}_{P_X}\big[D(X)\big] - \mathbb{E}_{P_Z}\big[D^{c}(G(Z))\big] \tag{4.8}$$
where the $c$-transform is defined as $D^{c}(x') := \sup_{x} \{D(x) - c(x, x')\}$, and a function is called $c$-concave if it is the $c$-transform of some valid function. In particular, the optimal transport GAN formulation with the quadratic cost $c(x, x') = \frac{1}{2}\|x - x'\|_2^2$ results in the second-order Wasserstein GAN (W2GAN) problem, which has been studied in several recent works [13, 42, 43, 44].
5 Existence of Nash Equilibrium Solutions in GANs
Consider a general GAN minimax problem (1.1) with a minimax objective $V(G, D)$. As discussed in the previous section, the optimal generator is defined to minimize the GAN's target divergence to the data distribution. The following proposition is a well-known result regarding the Nash equilibrium of the GAN game in realizable settings, where there exists a generator producing the data distribution.
Proposition 1.
Assume that generator $G^* \in \mathcal{G}$ results in the distribution of the data, i.e., we have $P_{G^*(Z)} = P_X$. Then, for each of the GAN problems discussed in Section 4 there exists a constant discriminator function $D^*$ which together with $G^*$ results in a Nash equilibrium of the GAN game, and hence satisfies the following for every $G \in \mathcal{G}$ and $D \in \mathcal{D}$:
$$V(G^*, D) \,\le\, V(G^*, D^*) \,\le\, V(G, D^*).$$
Proof.
This proposition is well-known for the vanilla GAN [17]. In the Appendix, we provide a proof for general $f$-GANs and Wasserstein GANs. ∎
The above proposition shows that in a realizable setting, with a generator function generating the distribution of observed samples, a Nash equilibrium exists at that optimal generator. However, the realizability assumption in this proposition does not always hold in real GAN experiments. For example, in the GAN experiments discussed in Section 3, we observed that the divergence estimate never reached the zero value, because of the regularization applied to the generator function. Therefore, the Nash equilibrium described in Proposition 1 does not apply to the trained generator and discriminator in such GAN experiments.

Here, we address the question of the existence of Nash equilibrium solutions in non-realizable settings, where no generator can produce the data distribution. Do Nash equilibria always exist in non-realizable GAN zero-sum games? The following theorem shows that the answer is in general no. Note that $\sigma_{\max}(\cdot)$ in this theorem denotes the maximum singular value, i.e., the spectral norm.
Theorem 1.
Consider a GAN minimax problem for learning a normally distributed $X \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$ with zero mean and scalar covariance matrix $\sigma^2 I$, where $\sigma > 0$. In the GAN formulation, we use a linear generator function $G(z) = Wz$, where the weight matrix $W$ is spectrally regularized to satisfy $\sigma_{\max}(W) \le r$ for some $r < \sigma$. Suppose that the Gaussian latent vector $Z \sim \mathcal{N}(\mathbf{0}, I)$ is normally distributed with zero mean and identity covariance matrix. Then,

For the $f$-GAN problem corresponding to an $f$ with a nondecreasing derivative over its domain and an unconstrained discriminator, where the dimensions of the data and latent vectors match, the $f$-GAN minimax problem has no Nash equilibrium solutions.

For the W2GAN problem with the discriminator trained over $c$-concave functions, where $c(x, x') = \frac{1}{2}\|x - x'\|_2^2$ is the quadratic cost, the W2GAN minimax problem has no Nash equilibrium solutions. Also, when the discriminator is restricted to quadratic functions, the W2GAN problem has no local Nash equilibria.

For the WGAN problem with a multi-dimensional $X$ and a discriminator trained over 1-Lipschitz functions, the WGAN minimax problem has no Nash equilibria.
Proof.
We defer the proof to the Appendix. Note that the condition on the $f$-GAN holds for all the $f$-GAN examples in [38], including the vanilla GAN. ∎
The above theorem shows that under the stated assumptions the GAN zero-sum game does not have Nash equilibrium solutions. Consequently, the optimal divergence-minimizing generative model does not result in a Nash equilibrium. In contrast to Theorem 1, the following remark shows that the GAN zero-sum game in a non-realizable case may still have Nash equilibrium solutions, provided that Theorem 1's assumptions do not hold.
Remark 1.
Proof.
We defer the proof to the Appendix. ∎
The above remark explains that the phenomenon shown in Theorem 1 does not always hold in nonrealizable GAN settings. As a result, we need other notions of equilibrium which consistently explain optimality in GAN games.
6 Proximal Equilibrium: A Relaxation of Nash Equilibrium
To define a proper notion of equilibrium for GANs, note that due to the sequential nature of GAN games, the equilibrium notion should be flexible enough to allow, to some extent, the optimization of the discriminator around the equilibrium solution. This property is in fact consistent with the stability behavior observed for first-order GAN training methods, where the alternating first-order method stabilizes around a certain solution. To this end, we consider the following objective for a GAN problem with minimax objective $V(G, D)$:
$$V^{(\lambda)}_{\mathrm{prox}}(G, D) \,:=\, \max_{\widetilde{D}\in\mathcal{D}} \; V(G, \widetilde{D}) - \frac{\lambda}{2}\,\big\|\widetilde{D} - D\big\|^2 \tag{6.1}$$
The above definition represents the application of a proximal operator to $V$, which further optimizes the original objective in the proximity of discriminator $D$. To keep the maximization variable $\widetilde{D}$ close to $D$, we penalize the distance between the two functions in the proximal optimization. Here the distance is measured using a norm $\|\cdot\|$ on the discriminator function space.
To extend the notion of Nash equilibrium to general minimax problems, we propose considering the Nash equilibria of the defined proximal objective $V^{(\lambda)}_{\mathrm{prox}}$.
Definition 1.
We call $(G^*, D^*)$ a $\lambda$-proximal equilibrium for $V$ if it represents a Nash equilibrium for $V^{(\lambda)}_{\mathrm{prox}}$, i.e., for every $G\in\mathcal{G}$ and $D\in\mathcal{D}$:
$$V^{(\lambda)}_{\mathrm{prox}}(G^*, D) \,\le\, V^{(\lambda)}_{\mathrm{prox}}(G^*, D^*) \,\le\, V^{(\lambda)}_{\mathrm{prox}}(G, D^*) \tag{6.2}$$
The next proposition provides necessary and sufficient conditions, in terms of the original objective $V$, for proximal equilibrium solutions.
Proposition 2.
$(G^*, D^*)$ is a $\lambda$-proximal equilibrium if and only if for every $G\in\mathcal{G}$ and $\widetilde{D}\in\mathcal{D}$ we have
$$V(G^*, \widetilde{D}) - \frac{\lambda}{2}\,\big\|\widetilde{D} - D^*\big\|^2 \,\le\, V(G^*, D^*) \,\le\, \max_{D\in\mathcal{D}} V(G, D).$$
Therefore, if $(G^*, D^*)$ is a $\lambda$-proximal equilibrium, it gives a global minimax solution, i.e., it minimizes the worst-case objective $\max_{D\in\mathcal{D}} V(G, D)$, with $G^*$ being its optimal solution.
Proof.
We defer the proof to the Appendix. ∎
The following result shows that the proximal equilibria provide a hierarchy of equilibrium solutions across different values of $\lambda$.
Proposition 3.
Define $\mathrm{PE}_{\lambda}(V)$ to be the set of the $\lambda$-proximal equilibria for $V$. Then, if $0 \le \lambda_1 \le \lambda_2$,
$$\mathrm{PE}_{\lambda_2}(V) \,\subseteq\, \mathrm{PE}_{\lambda_1}(V). \tag{6.3}$$
Proof.
We defer the proof to the Appendix. ∎
Note that as $\lambda$ approaches infinity, $V^{(\lambda)}_{\mathrm{prox}}$ tends to the original objective $V$, implying that $\mathrm{PE}_{\infty}(V)$ is the set of $V$'s Nash equilibria. In contrast, for $\lambda = 0$ the proximal objective becomes the worst-case objective $\max_{\widetilde{D}\in\mathcal{D}} V(G, \widetilde{D})$. As a result, $\mathrm{PE}_{0}(V)$ is the set of global minimax solutions described in Proposition 2.
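The two limits can be seen numerically on a toy game that is concave in the discriminator variable (our own sketch, with a plain Euclidean penalty standing in for the function-space norm):

```python
import numpy as np

# Proximal objective (6.1) on a toy game concave in the discriminator variable,
#   V(theta, w) = theta**2 - (w - theta)**2,
# computed by grid search over w~.  Checks the two limits discussed above:
# lambda -> inf recovers V itself; lambda -> 0 recovers the worst-case
# objective max_w V(theta, w) = theta**2 (attained at w = theta).
def V(theta, w):
    return theta**2 - (w - theta) ** 2

def V_prox(theta, w, lam, grid=np.linspace(-10, 10, 200_001)):
    return np.max(V(theta, grid) - 0.5 * lam * (grid - w) ** 2)

theta, w = 1.5, -0.5
print(V_prox(theta, w, lam=1e4))   # ~ V(theta, w) = 2.25 - 4 = -1.75
print(V_prox(theta, w, lam=1e-4))  # ~ max_w V = theta**2 = 2.25
```

Between the two extremes, intermediate $\lambda$ interpolates between trusting the current discriminator and re-solving the inner maximization from scratch, which is exactly the hierarchy of Proposition 3.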
Concerning the proximal optimization problem in (6.1), the following proposition shows that if the original minimax objective is a smooth function of the discriminator parameters, the proximal optimization can be solved efficiently and therefore one can efficiently compute the gradient of the proximal objective.
Proposition 4.
Consider the maximization problem in the definition of the proximal objective (6.1), where the generator and discriminator are parameterized by vectors $\theta$ and $w$, respectively. Suppose that


For the considered discriminator norm $\|\cdot\|$, the penalty $\frac{1}{2}\|D_w - \widetilde{D}\|^2$ is $\mu$-strongly convex in $w$ for any fixed function $\widetilde{D}$.

For every $\theta$, the GAN minimax objective $V(G_\theta, D_w)$ is $L$-smooth in $w$, i.e., for any $w_1, w_2$:
$$\big\|\nabla_w V(G_\theta, D_{w_1}) - \nabla_w V(G_\theta, D_{w_2})\big\|_2 \,\le\, L\,\|w_1 - w_2\|_2.$$
Under the above assumptions, if $\lambda > L/\mu$, the maximization objective in (6.1) is strongly concave. Then, the maximization problem has a unique solution $\widetilde{w}^*(\theta, w)$, and if $V$ is differentiable with respect to $\theta$ we have
$$\nabla_\theta V^{(\lambda)}_{\mathrm{prox}}(\theta, w) \,=\, \nabla_\theta V\big(\theta, \widetilde{w}^*(\theta, w)\big). \tag{6.4}$$
Proof.
We defer the proof to the Appendix. ∎
The above proposition suggests that under the mentioned assumptions, one can efficiently compute the optimal solution of the proximal maximization through a first-order optimization method. The assumptions require the smoothness of the GAN minimax objective with respect to the discriminator parameters, which can be imposed by applying norm-based regularization tools to neural network discriminators.
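The envelope-theorem gradient formula (6.4) can be verified numerically on a toy objective (our own sketch, with the inner maximizer available in closed form because the problem is quadratic in the discriminator variable):

```python
import numpy as np

# Numerical check of the envelope-theorem gradient formula (6.4): the gradient
# of the proximal objective w.r.t. the generator parameter theta equals the
# partial gradient of V evaluated at the inner maximizer w~*(theta, w).
def V(theta, w):
    return np.sin(theta) * w - w**2      # smooth, strongly concave in w

def inner_argmax(theta, w, lam):
    # maximize V(theta, w~) - lam/2 * (w~ - w)**2, a quadratic in w~
    return (np.sin(theta) + lam * w) / (2 + lam)

def V_prox(theta, w, lam):
    wt = inner_argmax(theta, w, lam)
    return V(theta, wt) - 0.5 * lam * (wt - w) ** 2

theta, w, lam, eps = 0.8, 0.3, 1.0, 1e-6
fd_grad = (V_prox(theta + eps, w, lam) - V_prox(theta - eps, w, lam)) / (2 * eps)
envelope_grad = np.cos(theta) * inner_argmax(theta, w, lam)  # dV/dtheta at w~*
print(fd_grad, envelope_grad)
```

The finite-difference gradient of the proximal objective matches the partial gradient of $V$ at the inner maximizer, so no differentiation through the inner optimization is needed.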
7 Proximal Equilibrium in Wasserstein GANs
As shown earlier, GAN minimax games may not have any Nash equilibria in non-realizable settings. As a result, we seek a different notion of equilibrium that remains applicable to GAN problems. Here, we show that the proposed proximal equilibrium provides such an equilibrium notion for Wasserstein GAN problems.
To define a proper proximal operator for defining proximal equilibria in Wasserstein GAN problems, we use the second-order Sobolev semi-norm averaged over the underlying distribution of the data. Given the underlying distribution $P_X$, we define the Sobolev semi-norm as
$$\|D\|_{P_X} \,:=\, \sqrt{\mathbb{E}_{P_X}\big[\|\nabla D(X)\|_2^2\big]}. \tag{7.1}$$
The above semi-norm is induced by the following semi-inner product, and therefore leads to a semi-Hilbert space of functions:
$$\langle D_1, D_2 \rangle_{P_X} \,:=\, \mathbb{E}_{P_X}\big[\nabla D_1(X)^\top \nabla D_2(X)\big]. \tag{7.2}$$
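The semi-norm (7.1) is straightforward to estimate by Monte Carlo over samples from $P_X$. In this sketch (our own illustration) the discriminator is linear, so its gradient is constant and the estimate is exact; note also that any constant discriminator has semi-norm zero, which is why (7.1) is only a semi-norm:

```python
import numpy as np

# Monte Carlo estimate of the Sobolev semi-norm (7.1),
#   ||D||_{P_X}^2 = E_{X ~ P_X}[ ||grad D(X)||_2^2 ],
# for a linear "discriminator" D(x) = a.x whose gradient is the constant a,
# so the semi-norm equals ||a||_2 regardless of P_X.
rng = np.random.default_rng(1)
a = np.array([3.0, 4.0])
X = rng.normal(size=(100_000, 2))      # samples from P_X = N(0, I)

grads = np.tile(a, (X.shape[0], 1))    # grad of a.x is a at every sample
sobolev_sq = np.mean(np.sum(grads**2, axis=1))
print(np.sqrt(sobolev_sq))             # -> 5.0, i.e. ||a||_2
```

For a neural network discriminator, the per-sample gradients would come from automatic differentiation, but the averaging over the data distribution is identical.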
Throughout our discussion, we consider a parameterized set of generators $\{G_\theta : \theta \in \Theta\}$. For a GAN minimax objective $V$, we define $D^*(\theta)$ to be the optimal discriminator function for the parameterized generator $G_\theta$:
$$D^*(\theta) \,\in\, \operatorname*{argmax}_{D\in\mathcal{D}} \; V(G_\theta, D) \tag{7.3}$$
The following theorem shows that the Wasserstein distanceminimizing generator function in the secondorder Wasserstein GAN problem satisfies the conditions of a proximal equilibrium based on the Sobolev seminorm defined in (7.1).
Theorem 2.
Proof.
We defer the proof to the Appendix. ∎
The above theorem shows that while, as demonstrated in Theorem 1, the W2GAN problem may have no local Nash equilibrium solutions, a proximal equilibrium exists for the W2GAN problem and is attained at the Wasserstein-distance-minimizing generator. The next theorem extends this result to the first-order Wasserstein GAN (WGAN) problem.
Theorem 3.
Consider the WGAN problem (4.7) minimizing the first-order Wasserstein distance. For each $\theta$, define $\Delta_\theta(\cdot)$ to be the magnitude of the resulting optimal transport map from $P_X$ to $P_{G_\theta(Z)}$. (As shown in the proof, such a mapping exists under mild regularity assumptions.) Given these definitions, assume that


the generator set $\mathcal{G}$ is a convex set,

for every $\theta$ and $x$, $\Delta_\theta(x) \ge \kappa$ holds for a constant $\kappa > 0$.
Then, for every $\lambda \le \kappa$, the Wasserstein-distance-minimizing generator function provides a $\lambda$-proximal equilibrium with respect to the Sobolev semi-norm in (7.1).
Proof.
We defer the proof to the Appendix. ∎
The above theorem shows that if the magnitude of the optimal transport map is everywhere lower-bounded by $\kappa$, then the Wasserstein-distance-minimizing generator in the WGAN problem yields a proximal equilibrium.
8 Proximal Training
As shown for Wasserstein GAN problems, given the defined Sobolev semi-norm and a small enough $\lambda$, the proximal objective $V^{(\lambda)}_{\mathrm{prox}}$ will possess a Nash equilibrium solution. This result motivates performing the minimax optimization for the proximal objective instead of the original objective $V$. Therefore, we propose proximal training, in which we solve the following minimax optimization problem:
$$\min_{\theta} \; \max_{w} \; V^{(\lambda)}_{\mathrm{prox}}(\theta, w) \tag{8.1}$$
with the proximal operator defined according to the Sobolev semi-norm in (7.1).
In order to take the gradient of $V^{(\lambda)}_{\mathrm{prox}}$ with respect to $\theta$, Proposition 4 suggests first solving the proximal optimization and then computing the gradient of the original objective with the discriminator parameterized by the optimal solution of the proximal optimization.
Algorithm 1 summarizes the two main steps of proximal training. At every iteration, the discriminator is optimized with an additive Sobolev-norm penalty, forcing the discriminator to remain in the proximity of the current discriminator. Next, the generator is optimized using a gradient descent method, with the gradient evaluated at the optimal discriminator solving the proximal optimization. The step-size parameter can be adaptively selected at every iteration. In practice, we can solve the proximal maximization problem via a first-order optimization method run for a certain number of iterations. Assuming the conditions of Proposition 4 hold, the proximal optimization amounts to maximizing a strongly-concave objective, which can be solved at a linear rate by first-order optimization methods.
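As a hedged sketch of the two steps above (our own toy illustration, not the paper's implementation), consider a linear "WGAN" $V(\theta, w) = w(\mu - \theta)$ with a quadratic penalty standing in for the Sobolev-norm penalty. The inner maximization is solvable in closed form, and the generator then steps at the proximal discriminator. Plain simultaneous gradient descent-ascent diverges on this game, while the proximal iterates stay bounded:

```python
import numpy as np

# Sketch of proximal training (Algorithm 1) on the toy linear WGAN
#   V(theta, w) = w * (mu - theta),
# with a quadratic proximal penalty as a stand-in for the Sobolev-norm
# penalty.  The inner maximization is solved in closed form; the generator
# then takes a gradient step at the proximal discriminator.
mu, lr, lam = 2.0, 0.1, 1.0

def run(proximal, steps=2000):
    theta, w = 0.0, 1.0
    for _ in range(steps):
        if proximal:
            w = w + (mu - theta) / lam   # argmax_w~ of V - lam/2 (w~ - w)^2
            theta = theta + lr * w       # descent step: dV/dtheta = -w
        else:                            # plain simultaneous GDA
            theta, w = theta + lr * w, w + lr * (mu - theta)
    return np.hypot(theta - mu, w)       # distance to the equilibrium (mu, 0)

print(run(proximal=True), run(proximal=False))
```

On this bilinear game the proximal iterates remain in a bounded neighborhood of the equilibrium $(\mu, 0)$, whereas simultaneous GDA spirals outward, which is consistent with the stability behavior the proximal objective is designed to promote.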
9 Numerical Experiments
To test the theoretical results of this work, we performed several experiments using [18]'s implementation of Wasserstein GANs, with the code available at the paper's GitHub repository. In addition, we used the implementations of [33, 11] for applying spectral regularization to the discriminator network. In the experiments, we used the DCGAN 4-layer CNN architecture for both the discriminator and generator functions [39] and ran each experiment for 200,000 generator iterations with 5 discriminator updates per generator update. We used the RMSprop optimizer [20] for WGAN experiments with weight clipping or spectral normalization, and the Adam optimizer [23] for the other experiments.
9.1 Proximal Equilibrium in Wasserstein and Lipschitz GANs
We examined whether the solutions found by Wasserstein and Lipschitz vanilla GANs represent proximal equilibria. Toward this goal, we performed experiments similar to those of Section 3 for the WGAN-WC [2], WGAN-GP [18], and SNGAN [33] problems over the MNIST and CelebA datasets. In Section 3, we observed that after fixing the trained discriminator, the GAN's minimax objective kept decreasing when we optimized only the generator. In the new experiments, we similarly fixed the trained discriminator $D_{w^*}$ resulting from the 200,000 training iterations, but instead of optimizing the GAN minimax objective we optimized the proximal objective defined by the Sobolev semi-norm (7.1). Thus, we solved the following optimization problem, initialized at the parameters $\theta^*$ of the trained generator:
$$\min_{\theta} \; V^{(\lambda)}_{\mathrm{prox}}(\theta, w^*) \tag{9.1}$$
We computed the gradient of the above proximal objective by applying the Adam optimizer for a fixed number of steps to approximate the solution to the proximal optimization (6.1), which at every iteration was initialized at the current discriminator parameters. Figures 2(a) and 2(b) show that in the SNGAN experiments the original minimax objective underwent only minor changes, compared to the results in Section 3, and the quality of the generated samples did not change significantly during the optimization. We defer the similar numerical results of the WGAN-WC and WGAN-GP experiments to the Appendix. These numerical results suggest that while Wasserstein and Lipschitz GANs may not converge to local Nash equilibrium solutions, as shown in Section 3, the solutions they find can still represent a local proximal equilibrium.
9.2 Proximal Training Improves Lipschitz GANs
Table 1: Inception scores on CIFAR-10 for ordinary (non-proximal) and proximal training of WGAN-WC and SNGAN with DIM = 64 and DIM = 128.
We applied the proximal training of Algorithm 1 to the WGAN-WC and SNGAN problems. To compute the gradient of the proximal minimax objective, we solved the maximization problem in the first step of Algorithm 1's loop by applying several steps of Adam optimization, initialized at the discriminator parameters at that iteration. Applying proximal training to the MNIST, CIFAR-10, and CelebA datasets, we qualitatively observed slightly better generated pictures. We defer the generated samples to the Appendix.
To quantitatively compare proximal and ordinary non-proximal GAN training, we measured the Inception scores of the samples generated in the CIFAR-10 experiments. As shown in Table 1, proximal training results in an improved Inception score. In this table, DIM stands for the dimension parameter of the DCGAN's CNN networks.
References
 [1] (2013) A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pp. 1–155. Cited by: §C.6.
 [2] (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1, §1, §3, §4.2, §9.1.
 [3] (2017) Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 224–232. Cited by: §1, §2.
 [4] (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §2.
 [5] (1997) Nonlinear programming. Journal of the Operational Research Society 48 (3), pp. 334–334. Cited by: §C.5.
 [6] (2004) Convex optimization. Cambridge university press. Cited by: §C.2.
 [7] (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
 [8] (2017) Training GANs with optimism. arXiv preprint arXiv:1711.00141. Cited by: §2, §2.
 [9] (2018) The limit points of (optimistic) gradient descent in minmax optimization. In Advances in Neural Information Processing Systems, pp. 9236–9246. Cited by: §2.
 [10] (2018) A convex duality framework for GANs. In Advances in Neural Information Processing Systems, pp. 5248–5258. Cited by: §4.1.
 [11] (2019) Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, Cited by: §9.
 [12] (2017) Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446. Cited by: §2.
 [13] (2017) Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793. Cited by: §1, §2, §4.2, footnote 2.
 [14] (2019) Convergence of learning dynamics in stackelberg games. arXiv preprint arXiv:1906.01217. Cited by: §2.
 [15] (2016) Deep learning. Vol. 1, MIT Press. Cited by: §1.
 [16] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §1, §4.1, §4.1.
 [17] (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §5.
 [18] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §1, §1, §3, §9.1, §9.
 [19] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §2.
 [20] (2012) Neural networks for machine learning lecture 6a overview of minibatch gradient descent. 14 (8). Cited by: §3, §9.
 [21] (2018) Finding mixed Nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002. Cited by: §2.
 [22] (2019) Minmax optimization: stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618. Cited by: §1, §2.
 [23] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3, §9.
 [24] (2017) On convergence and stability of GANs. arXiv preprint arXiv:1705.07215. Cited by: §2.

 [25] (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §3.
 [26] (2019) SGD learns one-layer networks in WGANs. arXiv preprint arXiv:1910.07030. Cited by: §2.
 [27] (2019) On gradient descent ascent for nonconvexconcave minimax problems. arXiv preprint arXiv:1906.00331. Cited by: §2, §2.
 [28] (2017) Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545–5553. Cited by: §4.1.

 [29] (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). Cited by: §3.
 [30] (2019) On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838. Cited by: §2.
 [31] (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §2.
 [32] (2017) The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §2.
 [33] (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §1, §1, §3, §9.1, §9.
 [34] (2019) A unified analysis of extragradient and optimistic gradient methods for saddle point problems: proximal point approach. arXiv preprint arXiv:1901.08511. Cited by: §2.
 [35] (2017) Sobolev gan. arXiv preprint arXiv:1711.04894. Cited by: §2.
 [36] (2017) Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pp. 5585–5595. Cited by: §2.
 [37] (2019) Solving a class of nonconvex minmax games using iterative first order methods. In Advances in Neural Information Processing Systems, pp. 14905–14916. Cited by: §2.
 [38] (2016) Fgan: training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279. Cited by: §1, §4.1, §5.
 [39] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §3, §9.
 [40] (2017) Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pp. 2018–2028. Cited by: §2.
 [41] (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §1, §2.
 [42] (2018) Improving gans using optimal transport. arXiv preprint arXiv:1803.05573. Cited by: §4.2.
 [43] (2018) On the convergence and robustness of training gans with regularized optimal transport. In Advances in Neural Information Processing Systems, pp. 7091–7101. Cited by: §4.2.
 [44] (2019) 2wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197. Cited by: §4.2.
 [45] (2019) Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pp. 12659–12670. Cited by: §2.
 [46] (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §C.2, §C.7, footnote 2.
 [47] (2019) On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pp. 6586–6595. Cited by: §2.
 [48] (2020) On solving minimax optimization locally: a followtheridge approach. In International Conference on Learning Representations, Cited by: §2.
 [49] (2018) Selfattention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1.
 [50] (2019) Policy optimization provably converges to nash equilibria in zerosum linear quadratic games. In Advances in Neural Information Processing Systems, pp. 11598–11610. Cited by: §2.
 [51] (2019) Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687. Cited by: §2.
Appendix A Numerical Results for Section 3
Here, we provide the complete numerical results for the experiments discussed in Section 3 of the main text. Complementing the plots shown in Section 3 for the SNGAN implementation, we present the same plots for the Wasserstein GAN with weight clipping (WGAN-WC) and with gradient penalty (WGAN-GP). Figures 2(a)–3(b) repeat the experiments of Figures 1 and 2 in the main text for the WGAN-WC and WGAN-GP problems. These plots suggest that a similar result also holds for the WGAN-WC and WGAN-GP problems, where the objective value and the quality of the generated samples decreased during the generator optimization. For a larger set of the samples generated in the main text's Figures 1 and 2 and in Figures 2(a)–3(b), we refer the reader to Figures 4(a)–6(b).
Appendix B Numerical Results for Section 9
Here, we present the complete numerical results for the experiments of Section 9 in the main text. Figures 7(a)–8(b) repeat the results of the main text's Figures 3 and 4 for the WGAN-WC and WGAN-GP problems. Except for the WGAN-GP experiment on the CelebA dataset, we observed that the objective value and the quality of the generated samples did not significantly decrease over the generator optimization. Even for the WGAN-GP experiment on the CelebA data, the decrease in the objective value was three times smaller when minimizing the proximal objective than when minimizing the original objective. These experiments suggest that the Wasserstein and Lipschitz GAN problems can converge to local proximal equilibrium solutions. We also show a larger group of generated samples at the initial and final iterations of the main text's Figures 3 and 4 and of Figures 7(a)–8(b) in Figures 9(a)–11(b).
For the proximal training experiments, Figures 13–14(b) show the samples generated by the SNGAN and WGAN-WC models proximally trained on the CIFAR-10 and CelebA data, with the results for the baseline regular training at the top of each figure and the results for proximal training at the bottom. We observed a somewhat improved sample quality achieved by proximal training, which is further supported by the inception scores reported for the CIFAR-10 experiments in the main text.
Appendix C Proofs
C.1 Proof of Proposition 1
Proof for GANs:
Consider the following GAN minimax problem corresponding to the convex function $f$:

(C.1) $\min_{G \in \mathcal{G}} \max_{D \in \mathcal{D}} \; \mathbb{E}\big[D(X)\big] - \mathbb{E}_{Z}\big[f^*\big(D(G(Z))\big)\big]$

Due to the realizability assumption, given $G^* \in \mathcal{G}$ we assume that the data distribution and the generative model are identical, i.e., $P_{G^*(Z)} = P_X$. Then, the minimax objective for $G^*$ reduces to

(C.2) $\max_{D \in \mathcal{D}} \; \mathbb{E}\big[D(X) - f^*\big(D(X)\big)\big]$

The above objective decouples across outcomes $x$. As a result, the maximizing discriminator will be a constant function $D^*(x) \equiv d^*$, where the constant value follows from the optimization problem:

(C.3) $d^* = \operatorname*{argmax}_{d} \; d - f^*(d)$

Note that the objective $d - f^*(d)$ is a concave function of $d$ whose derivative is zero at $d^* = f'(1)$, because the Fenchel conjugate of a convex $f$ satisfies $(f^*)'\big(f'(t)\big) = t$, and hence $(f^*)'\big(f'(1)\big) = 1$.

So far we have proved that the constant function $D^*(x) \equiv f'(1)$ provides the optimal discriminator for the generator $G^*$. Therefore, for every discriminator $D \in \mathcal{D}$ we have

(C.4) $V(G^*, D) \le V(G^*, D^*)$

where $V$ denotes the GAN's minimax objective. Moreover, note that for a constant $D^*$ the value of the minimax objective does not change with the generator $G$. As a result, for every $G \in \mathcal{G}$

(C.5) $V(G, D^*) = V(G^*, D^*)$

Then, (C.4) and (C.5) collectively prove that for every $G \in \mathcal{G}$ and $D \in \mathcal{D}$ we have

$V(G^*, D) \le V(G^*, D^*) \le V(G, D^*),$

which completes the proof for GANs.
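For concreteness, the constant-discriminator optimum $d^* = f'(1)$ can be checked numerically. The sketch below (an illustration, not part of the proof) uses the KL choice $f(t) = t\log t$, for which $f^*(d) = e^{d-1}$ and $f'(1) = \log 1 + 1 = 1$:

```python
import numpy as np

# KL choice: f(t) = t*log(t), so f*(d) = exp(d - 1) and f'(1) = log(1) + 1 = 1.
# In the realizable case the reduced objective over constant discriminators
# is d - f*(d); its maximizer should be d* = f'(1) = 1.
d_grid = np.linspace(-3.0, 3.0, 120001)
objective = d_grid - np.exp(d_grid - 1.0)  # concave in d
d_star = float(d_grid[np.argmax(objective)])
print(d_star)  # close to 1.0
```

The same check works for any strictly convex $f$ with a known conjugate; only the two lambda-like expressions change.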
Proof for Wasserstein GANs:
Consider a general Wasserstein GAN problem with a cost function $c$ satisfying $c(x, x) = 0$ for every $x$. Notice that this property holds for all Wasserstein distance measures corresponding to the cost function $c(x, x') = \|x - x'\|^p$ for $p \ge 1$. The generalized Wasserstein GAN minimax problem is as follows:

(C.6) $\min_{G \in \mathcal{G}} \max_{D \in \mathcal{D}} \; \mathbb{E}\big[D(X)\big] + \mathbb{E}_{Z}\big[D^c\big(G(Z)\big)\big]$

where $D^c(x') := \inf_x \{\, c(x, x') - D(x) \,\}$ denotes the $c$-transform of $D$. Due to the realizability assumption, a generator function $G^* \in \mathcal{G}$ results in the data distribution such that $P_{G^*(Z)} = P_X$. Then, the above minimax objective for $G^*$ reduces to

(C.7) $\max_{D \in \mathcal{D}} \; \mathbb{E}\big[D(X) + D^c(X)\big]$

Since the cost is assumed to take a zero value given identical inputs, we have
$D^c(x) = \inf_{x'} \{\, c(x', x) - D(x') \,\} \le c(x, x) - D(x) = -D(x).$
As a result, $D(x) + D^c(x) \le 0$ holds for every $x$. Hence, the objective in (C.7) will be non-positive and takes its maximum zero value for any constant function $D^*(x) \equiv d$, which by definition satisfies concavity. Therefore, letting $V(G, D)$ denote the GAN minimax objective, for every $D \in \mathcal{D}$ we have

(C.8) $V(G^*, D) \le V(G^*, D^*)$

We also know that for a constant discriminator $D^*$ the value of the minimax objective is independent of the generator function. Therefore, for every $G \in \mathcal{G}$ we have

(C.9) $V(G, D^*) = V(G^*, D^*)$

As a result, (C.8) and (C.9) together show that for every $G \in \mathcal{G}$ and $D \in \mathcal{D}$

(C.10) $V(G^*, D) \le V(G^*, D^*) \le V(G, D^*)$

which makes the proof complete for Wasserstein GANs.
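The two facts driving this proof, that $D(x) + D^c(x) \le 0$ pointwise and that any constant discriminator attains the zero value, can be verified on a small discrete example. The sketch below is illustrative only; the support points, the cost $c(x, x') = |x - x'|$, and the sampled discriminators are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Realizable case on a discrete support: the generator reproduces P_X = p.
x = np.linspace(-1.0, 1.0, 50)
p = rng.random(50)
p /= p.sum()
cost = np.abs(x[:, None] - x[None, :])  # c(x, x') = |x - x'|, zero on the diagonal

def c_transform(D):
    # D^c(x') = min_x { c(x, x') - D(x) }
    return (cost - D[:, None]).min(axis=0)

def reduced_objective(D):
    # E_{P_X}[D(X)] + E_{P_X}[D^c(X)], the reduced objective in the realizable case
    return float(p @ D + p @ c_transform(D))

random_vals = [reduced_objective(rng.normal(size=50)) for _ in range(200)]
constant_val = reduced_objective(np.full(50, 3.7))  # any constant discriminator

print(max(random_vals) <= 1e-9, abs(constant_val) <= 1e-9)  # both should be True
```

Every sampled discriminator gives a non-positive value, while the constant one gives exactly zero, matching (C.8) and (C.9).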
C.2 Proof of Theorem 1 & Remark 1
Proof for GANs:
Lemma 1.
Consider two random vectors $X, Y$ with probability density functions $p_X, p_Y$, respectively. Suppose that $p_X, p_Y$ are non-zero everywhere. Then, considering the following variational representation of the $f$-divergence $D_f(P_X \,\|\, P_Y)$,

(C.11) $D_f(P_X \,\|\, P_Y) = \max_{D} \; \mathbb{E}\big[D(X)\big] - \mathbb{E}\big[f^*\big(D(Y)\big)\big]$

the optimal solution $D^*$ will satisfy

(C.12) $D^*(x) = f'\!\left(\frac{p_X(x)}{p_Y(x)}\right)$

Proof.
Let us rewrite the divergence's variational representation as

$\max_{D} \int \Big[\, p_X(x)\, D(x) - p_Y(x)\, f^*\big(D(x)\big) \,\Big] \,\mathrm{d}x = \int \max_{d} \Big[\, p_X(x)\, d - p_Y(x)\, f^*(d) \,\Big] \,\mathrm{d}x,$

where the last equality holds since the maximization objective decouples across $x$ values. It can be seen that the inside optimization problem for each $x$ is maximizing a concave objective in $d$, in which by setting the derivative to zero we obtain

(C.13) $(f^*)'\big(D(x)\big) = \frac{p_X(x)}{p_Y(x)}$

As a property of the Fenchel conjugate of a convex $f$, we know $(f^*)' = (f')^{-1}$, which combined with the above equation implies that

(C.14) $D^*(x) = f'\!\left(\frac{p_X(x)}{p_Y(x)}\right)$

The above result completes Lemma 1's proof. ∎
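Lemma 1 can be sanity-checked numerically for discrete distributions. With the KL choice $f(t) = t\log t$ (so $f'(t) = \log t + 1$ and $f^*(d) = e^{d-1}$), plugging $D^*(x) = f'(p_X(x)/p_Y(x))$ into the variational objective should recover $\mathrm{KL}(P_X \| P_Y)$ exactly, and perturbing $D^*$ should only decrease the objective. A minimal sketch with arbitrary everywhere-positive densities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two everywhere-positive discrete densities on a shared support of 20 points.
p_x = rng.random(20) + 0.1
p_x /= p_x.sum()
p_y = rng.random(20) + 0.1
p_y /= p_y.sum()

# KL choice: f(t) = t*log(t), f'(t) = log(t) + 1, f*(d) = exp(d - 1).
f_prime = lambda t: np.log(t) + 1.0
f_star = lambda d: np.exp(d - 1.0)

def variational(D):
    # E_{p_X}[D(X)] - E_{p_Y}[f*(D(Y))], the objective in (C.11)
    return float(p_x @ D - p_y @ f_star(D))

d_opt = f_prime(p_x / p_y)                      # Lemma 1's optimal discriminator (C.12)
kl = float(np.sum(p_x * np.log(p_x / p_y)))     # D_f(P_X || P_Y) for the KL choice

gap = abs(variational(d_opt) - kl)
beaten = any(
    variational(d_opt + rng.normal(scale=0.1, size=20)) > variational(d_opt) + 1e-12
    for _ in range(200)
)
print(gap < 1e-9, beaten)  # the gap should vanish; no perturbation should win
```

The per-coordinate objective is strictly concave in $D(x)$, so no perturbed discriminator exceeds the value at `d_opt`.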
Consider the GAN problem with the generator function specified in the theorem:

(C.15) $\min_{W} \max_{D \in \mathcal{D}} \; \mathbb{E}\big[D(X)\big] - \mathbb{E}_{Z}\big[f^*\big(D(WZ)\big)\big]$

Notice that if the generator matrix $W$ were not full-rank, the maximized discriminator objective would be achieved by assigning an infinite value to the points not included in the rank-constrained support set of the generator. This would not result in a solution to the GAN problem, because we assume that the dimensions of the data variable $X$ and the latent variable $Z$ match each other, and hence there exists a full-rank $W$ with a finite maximized objective, i.e., divergence value. Therefore, in a Nash equilibrium of the GAN problem, the solution $W$ must be full-rank and invertible.
Lemma 1 results in the following equation for the optimal discriminator given the generator parameters $W$:

$D^*_W(x) = f'\!\left(\frac{p_X(x)}{p_{G_W}(x)}\right)$

where $p_{G_W}$ denotes the density of the generative model. As a result, the function appearing in the GAN's minimax objective will be

$f^*\big(D^*_W(x)\big) = f^*\!\left(f'\!\left(\frac{p_X(x)}{p_{G_W}(x)}\right)\right)$

Claim: $f^*\big(D^*_W(x)\big)$ is a strictly convex function of $x$.

To show this claim, note that the following log-likelihood ratio between the two Gaussian densities is a strongly-convex quadratic function of $x$, since we have assumed that the spectral norm of $W$ is bounded as stated in the theorem:

$q_W(x) := \log\frac{p_X(x)}{p_{G_W}(x)}$

For simplicity, we denote the above strongly-convex function by $q_W$ and define the function $h$ as

$h(u) := f^*\big(f'(e^u)\big)$

According to the above definitions, $f^*\big(D^*_W(x)\big) = h\big(q_W(x)\big)$ is the composition of $h$ and the strongly-convex $q_W$. Note that $h$ is a monotonically increasing function, since defining $t = e^u$ we have

(C.16) $h'(u) = e^{2u} f''(e^u) = t^2 f''(t) > 0,$

which follows from the equality $(f^*)'\big(f'(t)\big) = t$ that is a consequence of the definition of the Fenchel conjugate, implying that $(f^*)' = (f')^{-1}$ for the convex $f$. Note that $f''(t) > 0$ holds everywhere, because $f$ is assumed to be strictly convex. This proves that $h$ is strictly increasing. Furthermore, $h$ is a convex function, because $h'$ is non-decreasing due to the assumption that $t^2 f''(t)$ is non-decreasing over $t > 0$. As a result, $h$ is an increasing convex function.

Therefore, $f^*\big(D^*_W(x)\big)$ is a composition of a strongly-convex $q_W$ and an increasing convex $h$. Therefore, as a well-known result in convex optimization [6], the claim is true and $f^*\big(D^*_W(x)\big)$ is a strictly convex function of $x$.

We showed that the claim is true for every feasible $W$. Now, we prove that the pair $\big(W, D^*_W\big)$ will not be a local Nash equilibrium for any feasible $W$. If the pair $\big(W_0, D^*_{W_0}\big)$ was a local Nash equilibrium, $W_0$ would be a local minimum for the following minimax objective where the discriminator is fixed to be $D^*_{W_0}$:

(C.17) $\mathbb{E}\big[D^*_{W_0}(X)\big] - \mathbb{E}_{Z}\Big[f^*\big(D^*_{W_0}(WZ)\big)\Big]$

However, as shown earlier, for any feasible $W_0$, $f^*\big(D^*_{W_0}(x)\big)$ is a strictly convex function of $x$, which in turn shows that (C.17) is a strictly concave function of the variable $W$. This consequence proves that the objective has no local minima for the unconstrained variable $W$. Due to the shown contradiction, a pair with the form $\big(W, D^*_W\big)$ cannot be a local Nash equilibrium in the parameters. Consequently, the minimax problem has no pure Nash equilibrium solutions, since in a pure Nash equilibrium the discriminator is by definition optimal against the choice of generator.
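The monotonicity and convexity used in the composition argument can be illustrated concretely. For the KL choice $f(t) = t\log t$, the identity $(f^*)'(f'(t)) = t$ gives $f^*(f'(t)) = t$, so the composed map $u \mapsto f^*(f'(e^u))$ equals $e^u$, which is strictly increasing and convex, consistent with (C.16). A small numerical sketch (illustrative only, under this specific choice of $f$):

```python
import numpy as np

# KL choice: f(t) = t*log(t), f'(t) = log(t) + 1, f*(d) = exp(d - 1).
# The composed function h(u) = f*(f'(exp(u))) should be strictly increasing
# and convex; for this f it simplifies to exp(u).
u = np.linspace(-2.0, 2.0, 2001)
h = np.exp((np.log(np.exp(u)) + 1.0) - 1.0)  # f*(f'(e^u)), written out step by step

increasing = bool((np.diff(h) > 0).all())          # first differences positive
convex = bool((np.diff(h, n=2) > 0).all())         # second differences positive
matches_exp = bool(np.allclose(h, np.exp(u)))      # h(u) = exp(u) for the KL choice
print(increasing, convex, matches_exp)
```

Other choices of $f$ satisfying the theorem's assumptions (strict convexity and $t^2 f''(t)$ non-decreasing) would pass the same monotonicity and convexity checks with the corresponding `f_prime` and `f_star`.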
Proof for W2GANs:
Consider the W2GAN problem with the assumed generator function:

(C.18) $\min_{W} \max_{D} \; \mathbb{E}\big[D(X)\big] + \mathbb{E}_{Z}\big[D^c(WZ)\big]$

where the $c$-transform $D^c(x') := \inf_x \{\, c(x, x') - D(x) \,\}$ is defined for the quadratic cost function $c(x, x') = \frac{1}{2}\|x - x'\|^2$. Similar to the GAN case, define $D^*_W$ to be the optimal discriminator for the generator function parameterized by $W$.
According to Brenier's theorem [46], the optimal transport map from the Gaussian data distribution to the Gaussian generative model will be the gradient of a convex potential. As a well-known result regarding the second-order optimal transport map between two zero-mean Gaussian distributions, the optimal transport will be a linear transformation $x \mapsto A_W x$. This result shows that

(C.19) $x - \nabla D^*_W(x) = A_W x$

Note that the $c$-transform for the cost $c(x, x') = \frac{1}{2}\|x - x'\|^2$ satisfies $D^c(x') = \frac{1}{2}\|x'\|^2 - \big(\frac{1}{2}\|\cdot\|^2 - D\big)^*(x')$, where $(\cdot)^*$ denotes the Fenchel conjugate. For a general convex quadratic function $g(x) = \frac{1}{2} x^\top A x$ we have $g^*(y) = \frac{1}{2} y^\top A^{\dagger} y$, where $A^{\dagger}$ denotes $A$'s Moore–Penrose pseudo-inverse. Therefore, for the $c$-transform of the optimal discriminator we will have