# GANs May Have No Nash Equilibria

Generative adversarial networks (GANs) represent a zero-sum game between two machine players, a generator and a discriminator, designed to learn the distribution of data. While GANs have achieved state-of-the-art performance in several benchmark learning tasks, GAN minimax optimization still poses great theoretical and empirical challenges. GANs trained using first-order optimization methods commonly fail to converge to a stable solution where the players cannot improve their objective, i.e., the Nash equilibrium of the underlying game. Such issues raise the question of the existence of Nash equilibrium solutions in the GAN zero-sum game. In this work, we show through several theoretical and numerical results that indeed GAN zero-sum games may not have any local Nash equilibria. To characterize an equilibrium notion applicable to GANs, we consider the equilibrium of a new zero-sum game with an objective function given by a proximal operator applied to the original objective, a solution we call the proximal equilibrium. Unlike the Nash equilibrium, the proximal equilibrium captures the sequential nature of GANs, in which the generator moves first followed by the discriminator. We prove that the optimal generative model in Wasserstein GAN problems provides a proximal equilibrium. Inspired by these results, we propose a new approach, which we call proximal training, for solving GAN problems. We discuss several numerical experiments demonstrating the existence of proximal equilibrium solutions in GAN minimax problems.

## Authors

• 6 publications
• 20 publications
• ### Beyond Local Nash Equilibria for Adversarial Networks

Save for some special cases, current training methods for Generative Adv...
06/18/2018 ∙ by Frans A. Oliehoek, et al. ∙ 4

• ### DO-GAN: A Double Oracle Framework for Generative Adversarial Networks

In this paper, we propose a new approach to train Generative Adversarial...
02/17/2021 ∙ by Aye Phyu Phyu Aung, et al. ∙ 8

• ### Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields

Generative adversarial networks (GANs) evolved into one of the most succ...
08/29/2017 ∙ by Thomas Unterthiner, et al. ∙ 0

• ### Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step

Generative adversarial networks (GANs) are a family of generative models...
10/23/2017 ∙ by William Fedus, et al. ∙ 0

• ### Negative Momentum for Improved Game Dynamics

Games generalize the optimization paradigm by introducing different obje...
07/12/2018 ∙ by Gauthier Gidel, et al. ∙ 10

• ### Geometric GAN

Generative Adversarial Nets (GANs) represent an important milestone for ...
05/08/2017 ∙ by Jae Hyun Lim, et al. ∙ 0

• ### GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Generative Adversarial Networks (GANs) excel at creating realistic image...
06/26/2017 ∙ by Martin Heusel, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Since their introduction in [16], generative adversarial networks (GANs) have gained great success in many tasks of learning the distribution of observed samples. Unlike the traditional approaches to distribution learning, GANs view the learning problem as a zero-sum game between the following two players: 1) generator aiming to generate real-like samples from a random noise input, 2) discriminator trying to distinguish ’s generated samples from real training data. This game is commonly formulated through a minimax optimization problem as follows:

 minG∈GmaxD∈DV(G,D). (1.1)

Here, and are respectively the generator and discriminator function sets, commonly chosen as two deep neural nets, and denotes the minimax objective for generator and discriminator capturing how dissimilar the generated samples and training data are.

GAN optimization problems are commonly solved by alternating gradient methods, which under proper regularization have resulted in state-of-the-art generative models for various benchmark datasets. However, GAN minimax optimization has led to several theoretical and empirical challenges in the machine learning literature. Training GANs is widely known as a challenging optimization task requiring an exhaustive hyper-parameter and architecture search and demonstrating an unstable behavior. While a few regularization schemes have achieved empirical success in training GANs

[41, 2, 18, 33], still little is known about the conditions under which GAN minimax optimization can be successfully solved by first-order optimization methods.

To understand the minimax optimization in GANs, one needs to first answer the following question: What is the proper notion of equilibrium in the GAN zero-sum game? In other words, what are the optimality criteria in the GAN’s minimax optimization problem? A classical notion of equilibrium in the game theory literature is the

Nash equilibrium, a state in which no player can raise its individual gain by choosing a different strategy. According to this definition, a Nash equilibrium for the GAN minimax problem (1.1) must satisfy the following for every and :

 V(G∗,D)≤V(G∗,D∗)≤V(G,D∗). (1.2)

As a well-known result, for a generator expressive enough to reproduce the distribution of observed samples, Nash equilibrium exists for the generator producing the data distribution [15]. However, such a Nash equilibrium would be of little interest from a learning perspective, since the trained generator merely overfits the empirical distribution of training samples [3]. More importantly, state-of-the-art GAN architectures [18, 33, 49, 7] commonly restrict the generator function through various means of regularization such as batch or spectral normalization. Such regularization mechanisms do not allow the generator to produce the empirical distribution of observed data-points. Since the realizability assumption does not apply to such regularized GANs, the existence of Nash equilibria will not be guaranteed in their minimax problems.

The above discussion motivates studying the equilibrium of GAN zero-sum games in the non-realizable settings where the generator cannot express the empirical distribution of training data. Here, a natural question is whether a Nash equilibrium still exists for the GAN minimax problem. In this work, we focus on this question and demonstrate through several theoretical and numerical results that:

• [leftmargin=*]

• Nash equilibrium may not exist in GAN zero-sum games.

We provide theoretical examples of well-known GAN formulations including the vanilla GAN [16], Wasserstein GAN (WGAN) [2], -GAN [38], and the second-order Wasserstein GAN (W2GAN) [13] where no local Nash equilibria exist in their minimax optimization problems. We further perform numerical experiments on widely-used GAN architectures which suggest that an empirically successful GAN training may converge to non-Nash equilibrium solutions.

Next, we focus on characterizing a new notion of equilibrium for GAN problems. To achieve this goal, we consider the Nash equilibrium of a new zero-sum game where the objective function is given by the following proximal operator applied to the minimax objective with respect to a norm on discriminator functions:

 Vprox(G,D)\coloneqqmax˜D∈DV(G,˜D)−∥∥˜D−D∥∥2. (1.3)

We refer to the Nash equilibrium of the new zero-sum game as the proximal equilibrium. Given the inherent sequential nature of GAN problems where the generator moves first followed by the discriminator, we consider a Stackelberg game for its representation and focus on the subgame perfect equilibrium (SPE) of the game as the right notion of equilibrium for such problems [22]. We prove that the proximal equilibrium of Wasserstein GANs provides an SPE for the GAN problem. This result applies to both the first-order and second-order Wasserstein GANs. In these cases, we show a proximal equilibrium exists for the optimal generator minimizing the distance to the data distribution.

Inspired by these theoretical results, we propose a proximal approach for training GANs, which we call proximal training, by changing the original minimax objective to the proximal objective in (1.3). In addition to preserving the optimal solution to the GAN minimax problem, proximal training can further enjoy the existence of Nash equilibrium solutions in the new minimax objective. We discuss numerical results supporting the proximal training approach and the role of proximal equilibrium solutions in various GAN problems.

## 2 Related Work

Understanding the minimax optimization in modern machine learning applications including GANs has been a subject of great interest in the machine learning literature. A large body of recent works [8, 37, 34, 45, 50, 30, 14, 47, 27] have analyzed the convergence properties of first-order optimization methods in solving different classes of minimax games.

In a related work, [22] proposes a new notion of local optimality, called local minimax, designed for general sequential machine learning games. Compared to the notion of local minimax, the proximal equilibrium proposed in our work gives a notion of global optimality, which as we show directly applies to Wasserstein GANs. [22] also provides examples of minimax problems where Nash equilibria do not exist; however, the examples do not represent GAN minimax problems. Some recent works [27, 26, 48] have analyzed the convergence of different optimization methods to local minimax solutions.

In another related work, [9] analyzes the stable points of the gradient descent ascent (GDA) and optimistic GDA [8] algorithms, proving that they will give strict supersets of the local saddle points. Regarding the stability of GAN algorithms, [36] proves that the GDA algorithm will be locally stable for the vanilla and regularized Wasserstein GAN problems. [13] shows the GDA algorithm is globally stable for W2GANs with linear generator and quadratic discriminator functions.

Regarding the equilibrium in GANs, [3] studies the Nash equilibrium of GAN minimax games in realizable settings. Also, [3, 21] develop methods for finding mixed strategy Nash equilibria. On the other hand, our results focus on the pure strategies in non-realizable settings. [12] empirically studies the equilibrium of GAN problems regularized via the gradient penalty, reporting positive results on the stability of regularized GANs. However, our focus is on the existence of pure Nash equilibrium solutions. [35]

suggests a moment matching GAN formulation using the Sobolev norm. As a different direction, we use the Sobolev norm to analyze equilibrium in GANs. Finally, developing GAN architectures with improved equilibrium and stability properties has been studied in several recent works

[41, 4, 19, 32, 40, 24, 31, 51].

## 3 An Initial Experiment on Equilibrium in GANs

To examine whether the Nash equilibrium exists in GAN problems empirically, we performed a simple numerical experiment. In this experiment, we applied three standard GAN implementations including the Wasserstein GAN with weight-clipping (WGAN-WC) [2], the improved Wasserstein GAN with gradient penalty (WGAN-GP) [18], and the spectrally-normalized vanilla GAN (SN-GAN) [33], to the two benchmark MNIST [25] and CelebA [29] databases. We used the convolutional architecture of the DC-GAN [39] optimized with the Adam [23]

or RMSprop

[20] (only for WGAN-WC) optimizers.

We performed each of the GAN experiments for 200,000 generator iterations to reach with and denoting the trained generator and discriminator parameters at the end of the 200,000 iterations. Our goal is to examine whether the solution pair represents a Nash equilibrium or not. To do this, we fixed the trained discriminator and kept optimizing the generator, i.e. continuing optimizing the generator without changing the discriminator . Here we solved the following optimization problem initialized at using the default first-order optimizer for the generator function for 10,000 iterations:

 minθV(Gθ,Dwfinal). (3.1)

If the pair was in fact a Nash equilibrium, it would give a local saddle point to the minimax optimization and the above optimization could not make the objective any smaller than its initial value. Also, the image samples generated by the generator should have improved or at least preserved their initial quality during this optimization, since the discriminator would be the optimal discriminator against all generator functions.

Despite the above predictions, we observed that none of the mentioned statements hold in reality for any of the six experiments with the three standard GAN implementations and the two datasets. The optimization objective decreased rapidly from the beginning of the optimization, and the pictures sampled from the generator completely lost their quality over this optimization. Figures 0(a), 0(b) show the objective for the SN-GAN experiments over the 10,000 steps of the above optimization. These figures also demonstrate the SN-GAN generated samples before and during the optimization, which shows the significant drop in the quality of generated pictures. We defer the results for the WGAN-WC and WGAN-GP problems to the Appendix.

The results of the above experiments show that practical GAN experiments may not converge to local Nash equilibrium solutions. After fixing the trained discriminator, the trained generator can be further optimized using a first-order optimization method to reach smaller values of the generator objective. More importantly, this optimization not only does not improve the quality of the generator’s output samples, but also totally disturbs the trained generator. As demonstrated in these experiments, simultaneous optimization of the two players is in fact necessary for the proper convergence and stability behavior in GAN minimax optimization. The above experiments suggest that practical GAN solutions are not local Nash equilibrium. In the upcoming sections, we review some standard GAN formulations and then show that there are examples of GAN minimax problems for which no Nash equilibrium exists. Those theoretical results will further support our observations in the above experiments.

## 4 Review of GAN Formulations

### 4.1 Vanilla GAN & f-Gan

Consider samples observed independently from distribution . Our goal is to find a generator function where maps a random noise input from a known to an output distributed as

, i.e., we aim to match the probability distributions

and . To find such a generator function, [16] proposes the following minimax problem which is commonly referred to as the vanilla GAN problem:

 minG∈GmaxD∈DE[log(D(X))]+E[log(1−D(G(Z)))]. (4.1)

Here and represent the set of generator and discriminator functions, respectively. In this formulation, the discriminator is optimized to map real samples from to larger values than the values assigned to generated samples from .

As shown in [16], the above minimax problem for an unconstrained containing all real-valued functions reduces to the following divergence minimization problem:

 minG∈GJSD(PX,PG(Z)), (4.2)

where denotes the Jensen-Shannon (JS) divergence defined in terms of KL-divergence as

 JSD(P,Q)\coloneqq12KL(P,P+Q2)+12KL(Q,P+Q2).

-GANs extend the vanilla GAN problem by generalizing the JS-divergence to a general -divergence. For a convex function with , the -divergence corresponding to is defined as

 df(P,Q)\coloneqq∫p(x)f(q(x)p(x))dx. (4.3)

Notice that the JS-divergence is a special case of -divergence with . [38] shows that generalizing the divergence minimization (4.2) to minimizing a -divergence results in the following minimax problem called -GAN:

 minG∈GmaxD∈DE[D(X)]−E[f∗(D(G(Z)))], (4.4)

where denotes the Fenchel-conjugate to defined as . The space implied by the f-divergence minimization will be the set of all functions, but a similar interpretation further applies to a constrained [28, 10]. Several examples of -GANs have been formulated and discussed in [38].

### 4.2 Wasserstein GANs

To resolve GAN training issues, [2] proposes to formulate a GAN problem by minimizing the optimal transport costs which unlike -divergences change continuously with the input distributions. Given a transportation cost for transporting to , the optimal transport cost is defined as

 Wc(P,Q)=infM∈Π(P,Q)EM[c(X,X′)] (4.5)

where

denotes the set of all joint distributions on

with marginally distributed as , respectively. An important special case is the first-order Wasserstein distance (-distance) corresponding to . In this special case, the Kantorovich-Rubinstein duality shows

 W1(P,Q)=max∥D∥Lip≤1EP[D(X)]−EQ[D(X)]. (4.6)

Here denotes the expected value with respect to distribution and denotes the Lipschitz constant of function which is defined as the smallest satisfying for every . Formulating a GAN problem minimizing the -distance, [2] states the Wasserstein GAN (WGAN) problem as follows:

 minG∈Gmax∥D∥Lip≤1E[D(X)]−E[D(G(Z))]. (4.7)

The above Wasserstein GAN problem can be generalized to a general optimal transport cost with arbitrary cost function . The generalization is as follows:

 minG∈GmaxDc-concaveE[D(X)]−E[Dc(G(Z))], (4.8)

where the c-transform is defined as and a function is called c-concave if it is the c-transform of some valid function. In particular, the optimal transport GAN formulation with the quadratic cost results in the second-order Wasserstein GAN (W2GAN) problem which has been studied in several recent works [13, 42, 43, 44].

## 5 Existence of Nash Equilibrium Solutions in GANs

Consider a general GAN minimax problem (1.1) with a minimax objective . As discussed in the previous section, the optimal generator is defined to minimize the GAN’s target divergence to the data distribution. The following proposition is a well-known result regarding the Nash equilibrium of the GAN game in realizable settings where there exists a generator producing the data distribution.

###### Proposition 1.

Assume that generator results in the distribution of data, i.e., we have . Then, for each of the GAN problems discussed in Section 4 there exists a constant discriminator function which together with results in a Nash equilibrium to the GAN game, and hence satisfies the following for every and :

 V(G∗,D)≤V(G∗,Dconstant)≤V(G,Dconstant).
###### Proof.

This proposition is well-known for the vanilla GAN [17]. In the Appendix, we provide a proof for general -GANs and Wasserstein GANs. ∎

The above proposition shows that in a realizable setting with a generator function generating the distribution of observed samples, a Nash equilibrium exists for that optimal generator. However, the realizability assumption in this proposition does not always hold in real GAN experiments. For example, in the GAN experiments discussed in Section 3

, we observed that the divergence estimate never reached the zero value because of regularizing the generator function. Therefore, the Nash equilibrium described in Proposition

1 does not apply to the trained generator and discriminator in such GAN experiments.

Here, we address the question of the existence of Nash equilibrium solutions for non-realizable settings, where no generator can produce the data distribution. Do Nash equilibria always exist in non-realizable GAN zero-sum games? The following theorem shows that the answer is in general no. Note that

in this theorem denotes the maximum singular value, i.e., the spectral norm.

###### Theorem 1.

Consider a GAN minimax problem for learning a normally distributed

with zero mean and scalar covariance matrix where . In the GAN formulation, we use a linear generator function where the weight matrix is spectrally-regularized to satisfy

. Suppose that the Gaussian latent vector is normally distributed as

with zero mean and identity covariance matrix. Then,

• [leftmargin=*]

• For the -GAN problem corresponding to an with non-decreasing over and an unconstrained discriminator where the dimensions of data and latent match, the f-GAN minimax problem has no Nash equilibrium solutions.

• For the W2GAN problem with discriminator trained over -concave functions, where is the quadratic cost, the W2GAN minimax problem has no Nash equilibrium solutions. Also, given a quadratic discriminator parameterized by , the W2GAN problem has no local Nash equilibria.

• For the WGAN problem with -dimensional and a discriminator trained over 1-Lipschitz functions, the WGAN minimax problem has no Nash equilibria.

###### Proof.

We defer the proof to the Appendix. Note that the condition on the -GAN holds for all -GAN examples in [38] including the vanilla GAN. ∎

The above theorem shows that under the stated assumptions the GAN zero-sum game does not have Nash equilibrium solutions. Consequently, the optimal divergence-minimizing generative model does not result in a Nash equilibrium. In contrast to Theorem 1, the following remark shows that the GAN zero-sum game in a non-realizable case may have Nash equilibrium solutions, of course if Theorem 1’s assumptions do not hold.

###### Remark 1.

Consider the same setting as in Theorem 1. However, unlike Theorem 1 suppose that and where stands for the minimum singular value. Then, for the WGAN and W2GAN problems described in Theorem 1, the Wasserstein distance-minimizing generator results in a Nash equilibrium.

###### Proof.

We defer the proof to the Appendix. ∎

The above remark explains that the phenomenon shown in Theorem 1 does not always hold in non-realizable GAN settings. As a result, we need other notions of equilibrium which consistently explain optimality in GAN games.

## 6 Proximal Equilibrium: A Relaxation of Nash Equilibrium

To define a proper notion of equilibrium for GANs, note that due to the sequential nature of GAN games the equilibrium notion should be flexible to allow to some extent the optimization of the discriminator around the equilibrium solution. This property is in fact consistent with the stability feature observed for the first-order GAN training methods where the alternating first-order method stabilizes around a certain solution. To this end, we consider the following objective for a GAN problem with minimax objective :

 Vproxλ(G,D)\coloneqqmax˜D∈DV(G,˜D)−λ2∥∥˜D−D∥∥2. (6.1)

The above definition represents the application of a proximal operator to , which further optimizes the original objective in the proximity of discriminator . To keep the function variable close to , we penalize the distance among the two functions in the proximal optimization. Here the distance is measured using a norm on the discriminator function space.

To extend the notion of Nash equilibrium to general minimax problems, we propose considering the Nash equilibria of the defined .

###### Definition 1.

We call a -proximal equilibrium for if it represents a Nash equilibrium for , i.e. for every and

 Vproxλ(G∗,D)≤Vproxλ(G∗,D∗)≤Vproxλ(G,D∗). (6.2)

The next proposition provides necessary and sufficient conditions in terms of the original objective for the proximal equilibrium solutions.

###### Proposition 2.

is a -proximal equilibrium if and only if for every and we have

 V(G∗,D)≤V(G∗,D∗)≤max˜D∈DV(G,˜D)−λ2∥∥˜D−D∗∥∥2

Therefore, if is a -proximal equilibrium it will give a global minimax solution, i.e., minimizes the worst-case objective, , with being its optimal solution.

###### Proof.

We defer the proof to the Appendix. ∎

The following result shows the proximal equilibria provide a hierarchy of equilibrium solutions for different values.

###### Proposition 3.

Define to be the set of the -proximal equilibria for . Then, if ,

 PEλ2(V)⊆PEλ1(V). (6.3)
###### Proof.

We defer the proof to the Appendix. ∎

Note that as approaches infinity, tends to the original , implying that is the set of ’s Nash equilibria. In contrast, for the proximal objective becomes the worst-case objective . As a result, is the set of global minimax solutions described in Proposition 2.

Concerning the proximal optimization problem in (6.1), the following proposition shows that if the original minimax objective is a smooth function of the discriminator parameters, the proximal optimization can be solved efficiently and therefore one can efficiently compute the gradient of the proximal objective.

###### Proposition 4.

Consider the maximization problem in the definition of proximal objective (6.1) where generator and discriminator are parameterized by vectors , respectively. Suppose that

• [leftmargin=*]

• For the considered discriminator norm , is -strongly convex in for any function , i.e. for any :

 ∥∥∇w∥Dw−D∥2−∇w∥Dw′−D∥2∥∥2≥η1∥∥w−w′∥∥2,
• For every , The GAN minimax objective is -smooth in , i.e. i.e. for any :

 ∥∥∇wV(Gθ,Dw)−∇wV(Gθ,Dw′)∥∥2≤η2∥w−w′∥2.

Under the above assumptions, if , the maximization objective in (6.1) is -strongly concave. Then, the maximization problem has a unique solution and if is differentiable with respect to we have

 ∇θVproxλ(Gθ,Dw)=∇θV(Gθ,Dw∗). (6.4)
###### Proof.

We defer the proof to the Appendix. ∎

The above proposition suggests that under the mentioned assumptions, one can efficiently compute the optimal solution to the proximal maximization through a first-order optimization method. The assumptions require the smoothness of the GAN minimax objective with respect to the discriminator parameters, which can be imposed by applying norm-based regularization tools to neural network discriminators.

## 7 Proximal Equilibrium in Wasserstein GANs

As shown earlier, GAN minimax games may not have any Nash equilibria in non-realizable settings. As a result, we seek for a different notion of equilibrium which remains applicable to GAN problems. Here, we show the proposed proximal equilibrium provides such an equilibrium notion for Wasserstein GAN problems.

To define a proper proximal operator for defining proximal equilibria in Wasserstein GAN problems, we use the second-order Sobolev semi-norm averaged over the underlying distribution of data. Given the underlying distribution , we define the Sobolev semi-norm as

 ∥∥D∥∥˙H1\coloneqq√EPX[∥∥∇xD(X)∥∥22]. (7.1)

The above semi-norm is induced by the following semi-inner product and therefore leads to a semi-Hilbert space of functions:

 ⟨D1,D2⟩˙H1\coloneqqEPX[∇D1(X)T∇D2(X)]. (7.2)

Throughout our discussion, we consider a parameterized set of generators . For a GAN minimax objective , we define to be the optimal discriminator function for the parameterized generator :

 Dθ\coloneqqargmaxD∈DV(Gθ,D). (7.3)

The following theorem shows that the Wasserstein distance-minimizing generator function in the second-order Wasserstein GAN problem satisfies the conditions of a proximal equilibrium based on the Sobolev semi-norm defined in (7.1).

###### Theorem 2.

Consider the second-order Wasserstein GAN problem (4.8) with a quadratic cost . Suppose that the set of optimal discriminators is convex. Then, for the Wasserstein distance-minimizing generator will provide a -proximal equilibrium with respect to the Sobolev norm in (7.1).

###### Proof.

We defer the proof to the Appendix. ∎

The above theorem shows that while, as demonstrated in Theorem 1, the W2GAN problem may have no local Nash equilibrium solutions, the proximal equilibrium exists for the W2GAN problem and holds at the Wasserstein-distance minimizing generator . The next theorem extends this result to the first-order Wasserstein GAN (WGAN) problem.

###### Theorem 3.

Consider the WGAN problem (4.7) minimizing the first-order Wasserstein distance. For each , define to be the magnitude of the resulted optimal transport map from to , i.e. shares the same distribution with .111Note that as shown in the proof such a mapping exists under mild regularity assumptions. Given these definitions, assume that

• [leftmargin=*]

• is a convex set,

• for every and , holds for constant .

Then, for the Wasserstein distance-minimizing generator function provides an -proximal equilibrium with respect to the Sobolev norm in (7.1).

###### Proof.

We defer the proof to the Appendix. ∎

The above theorem shows that if the magnitude of optimal transport map is everywhere lower-bounded by , then the Wasserstein distance-minimizing generator in the WGAN problem yields a -proximal equilibrium.

## 8 Proximal Training

As shown for Wasserstein GAN problems, given the defined Sobolev norm and a small enough the proximal objective will possess a Nash equilibrium solution. This result motivates performing the minimax optimization for the proximal objective instead of the original objective . Therefore, we propose proximal training in which we solve the following minimax optimization problem:

 minGθ∈GmaxDw∈DVproxλ(Gθ,Dw), (8.1)

with the proximal operator defined according to the Sobolev norm in (7.1).

In order to take the gradient of with respect to , Proposition 4 suggests solving the proximal optimization followed by computing the gradient of the original objective where the discriminator is parameterized with the optimal solution to the proximal optimization.

Algorithm 1 summarizes the main two steps of proximal training. At every iteration, the discriminator is optimized with an additive Sobolev norm penalty forcing the discriminator to remain in the proximity of the current discriminator. Next, the generator is optimized using a gradient descent method with the gradient evaluated at the optimal discriminator solving the proximal optimization. The stepsize parameter can be adaptively selected at every iteration . In practice, we can solve the proximal maximization problem via a first-order optimization method for a certain number of iterations. Assuming the conditions of Proposition 4 hold, the proximal optimization leads to the maximization of a strongly-concave objective which can be solved linearly fast through first-order optimization methods.

## 9 Numerical Experiments

To experiment the theoretical results of this work, we performed several experiments using the [18]’s implementation of Wasserstein GANs with the code available at the paper’s Github repository. In addition, we used the implementations of [33, 11] for applying spectral regularization to the discriminator network. In the experiments, we used the DC-GAN 4-layer CNN architecture for both the discriminator and generator functions [39] and ran each experiment for 200,000 generator iterations with 5 discriminator updates per generator update. We used the RMSprop optimzier [20] for WGAN experiments with weight clipping or spectral normalization and the Adam optimizer [23] for the other experiments.

### 9.1 Proximal Equilibrium in Wasserstein and Lipschitz GANs

We examined whether the solutions found by Wasserstein and Lipschitz vanilla GANs represent proximal equilibria. Toward this goal, we performed similar experiments to Section 3’s experiments for the WGAN-WC [2], WGAN-GP [18], and SN-GAN [33] problems over the MNIST and CelebA datasets. In Section 3, we observed that after fixing the trained discriminator the GAN’s minimax objective kept decreasing when we optimized only the generator . In the new experiments, we similarly fixed the trained discriminator resulted from the 200,000 training iterations, but instead of optimizing the GAN minimax objective we optimized the proximal objective defined by the norm (7.1) with . Thus, we solved the following optimization problem initialized at which denotes the parameters of the trained generator:

 minθVproxλ=0.1(Gθ,Dwfinal). (9.1)

We computed the gradient of the above proximal objective by applying the Adam optimizer for steps to approximate the solution to the proximal optimization (6.1) which at every iteration was initialized at . Figures 1(a) and 1(b) show that in the SN-GAN experiments the original minimax objective had only minor changes, compared to the results in Section 3, and the quality of generated samples did not change significantly during the optimization. We defer the similar numerical results of the WGAN-WC and WGAN-GP experiments to the Appendix. These numerical results suggest that while Wasserstein and Lipschitz GANs may not converge to local Nash equilibrium solutions as shown in Section 3, their found solutions can still represent a local proximal equilibrium.

### 9.2 Proximal Training Improves Lipschitz GANs

We applied the proximal training in Algorithm 1 to the WGAN-WC and SN-GAN problems. To compute the gradient of the proximal minimax objective, we solved the maximization problem in the Algorithm 1’s first step in the for loop by applying steps of Adam optimization initialized at the discriminator parameters at that iteration. Applying the proximal training to MNIST, CIFAR-10, and CelebA datasets, we qualitatively observed slightly visually better generated pictures. We postpone the generated samples to the Appendix.

To quantitatively compare the proximal and ordinary non-proximal GAN training, we measured the Inception scores of the samples generated in the CIFAR-10 experiments. As shown in Table 1, proximal training results in an improved inception score. In this table, DIM stands for the dimension parameter of the DC-GAN’s CNN networks.

## References

• [1] L. Ambrosio and N. Gigli (2013) A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pp. 1–155. Cited by: §C.6.
• [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1, §1, §3, §4.2, §9.1.
• [3] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017) Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 224–232. Cited by: §1, §2.
• [4] D. Berthelot, T. Schumm, and L. Metz (2017) Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §2.
• [5] D. P. Bertsekas (1997) Nonlinear programming. Journal of the Operational Research Society 48 (3), pp. 334–334. Cited by: §C.5.
• [6] S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge university press. Cited by: §C.2.
• [7] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
• [8] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng (2017) Training gans with optimism. arXiv preprint arXiv:1711.00141. Cited by: §2, §2.
• [9] C. Daskalakis and I. Panageas (2018) The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pp. 9236–9246. Cited by: §2.
• [10] F. Farnia and D. Tse (2018) A convex duality framework for gans. In Advances in Neural Information Processing Systems, pp. 5248–5258. Cited by: §4.1.
• [11] F. Farnia, J. Zhang, and D. Tse (2019) Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, Cited by: §9.
• [12] W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow (2017) Many paths to equilibrium: gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446. Cited by: §2.
• [13] S. Feizi, F. Farnia, T. Ginart, and D. Tse (2017) Understanding gans: the lqg setting. arXiv preprint arXiv:1710.10793. Cited by: §1, §2, §4.2, footnote 2.
• [14] T. Fiez, B. Chasnov, and L. J. Ratliff (2019) Convergence of learning dynamics in stackelberg games. arXiv preprint arXiv:1906.01217. Cited by: §2.
• [15] I. Goodfellow, Y. Bengio, and A. Courville (2016) Vol. 1, MIT Press. Cited by: §1.
• [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §1, §4.1, §4.1.
• [17] I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §5.
• [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §1, §1, §3, §9.1, §9.
• [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §2.
• [20] G. Hinton, N. Srivastava, and K. Swersky (2012) Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 14 (8). Cited by: §3, §9.
• [21] Y. Hsieh, C. Liu, and V. Cevher (2018) Finding mixed nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002. Cited by: §2.
• [22] C. Jin, P. Netrapalli, and M. I. Jordan (2019) Minmax optimization: stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618. Cited by: §1, §2.
• [23] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3, §9.
• [24] N. Kodali, J. Abernethy, J. Hays, and Z. Kira (2017) On convergence and stability of gans. arXiv preprint arXiv:1705.07215. Cited by: §2.
• [25] Y. LeCun (1998)

The mnist database of handwritten digits

.
http://yann. lecun. com/exdb/mnist/. Cited by: §3.
• [26] Q. Lei, J. D. Lee, A. G. Dimakis, and C. Daskalakis (2019) SGD learns one-layer networks in wgans. arXiv preprint arXiv:1910.07030. Cited by: §2.
• [27] T. Lin, C. Jin, and M. I. Jordan (2019) On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331. Cited by: §2, §2.
• [28] S. Liu, O. Bousquet, and K. Chaudhuri (2017) Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545–5553. Cited by: §4.1.
• [29] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In

Proceedings of International Conference on Computer Vision (ICCV)

,
Cited by: §3.
• [30] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry (2019) On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838. Cited by: §2.
• [31] L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §2.
• [32] L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §2.
• [33] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §1, §1, §3, §9.1, §9.
• [34] A. Mokhtari, A. Ozdaglar, and S. Pattathil (2019) A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: proximal point approach. arXiv preprint arXiv:1901.08511. Cited by: §2.
• [35] Y. Mroueh, C. Li, T. Sercu, A. Raj, and Y. Cheng (2017) Sobolev gan. arXiv preprint arXiv:1711.04894. Cited by: §2.
• [36] V. Nagarajan and J. Z. Kolter (2017) Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pp. 5585–5595. Cited by: §2.
• [37] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn (2019) Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pp. 14905–14916. Cited by: §2.
• [38] S. Nowozin, B. Cseke, and R. Tomioka (2016) F-gan: training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271–279. Cited by: §1, §4.1, §5.
• [39] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §3, §9.
• [40] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann (2017) Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pp. 2018–2028. Cited by: §2.
• [41] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §1, §2.
• [42] T. Salimans, H. Zhang, A. Radford, and D. Metaxas (2018) Improving gans using optimal transport. arXiv preprint arXiv:1803.05573. Cited by: §4.2.
• [43] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee (2018) On the convergence and robustness of training gans with regularized optimal transport. In Advances in Neural Information Processing Systems, pp. 7091–7101. Cited by: §4.2.
• [44] A. Taghvaei and A. Jalali (2019) 2-wasserstein approximation via restricted convex potentials with application to improved training for gans. arXiv preprint arXiv:1902.07197. Cited by: §4.2.
• [45] K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh (2019) Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pp. 12659–12670. Cited by: §2.
• [46] C. Villani (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §C.2, §C.7, footnote 2.
• [47] Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu (2019) On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pp. 6586–6595. Cited by: §2.
• [48] Y. Wang, G. Zhang, and J. Ba (2020) On solving minimax optimization locally: a follow-the-ridge approach. In International Conference on Learning Representations, Cited by: §2.
• [49] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §1.
• [50] K. Zhang, Z. Yang, and T. Basar (2019) Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, pp. 11598–11610. Cited by: §2.
• [51] Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu, and Z. Zhang (2019) Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687. Cited by: §2.

## Appendix A Numerical Results for Section 3

Here, we provide the complete numerical results for the experiments discussed in Section 3 of the main text. Regarding the plots shown in Section 3 for the SN-GAN implementation, here we present the same plots for the Wasserstein GAN with weight clipping (WGAN-WC) and with gradient penalty (WGAN-GP) problems. Figures 2(a)-3(b) repeat the experiments of Figures 1,2 in the main text for the WGAN-WC and WGAN-GP problems. These plots suggest that a similar result also holds for the WGAN-WC and WGAN-GP problems, where the objective and the generated samples’ quality were decreasing during the generator optimization. For a larger set of generated samples in the main text’s Figures 1,2 and Figures 2(a)-3(b), we refer the readers to Figures 4(a)-6(b).

## Appendix B Numerical Results for Section 9

Here, we present the complete numerical results for the experiments of Section 9 in the main text. Figures 7(a)-8(b) demonstrate the results of the main text’s Figures 3,4 for the WGAN-WC and WGAN-GP problems. Here, except the WGAN-GP experiment on the CelebA dataset, we observed that the objective and the generated samples’ quality did not significantly decrease over the generator optimization. Even for the WGAN-GP experiment on the CelebA data, we observed that the objective value decreased three times less than in minimizing the original objective rather than the proximal objective. These experiments suggest that the Wasserstein and Lipschitz GAN problems can converge to local proximal equilibrium solutions. We also show a larger group of generated samples at the beginning and final iterations of Figures 3,4 in the main text and Figures 7(a)-8(b) in Figures 9(a)-11(b).

For the proximal training experiments, Figures 13-14(b) show the samples generated by the SN-GAN and WGAN-WC proximally trained on CIFAR-10 and CelebA data with the results for the baseline regular training on the top of the figure and the results for proximal training on the bottom. We observed a somewhat improved quality achieved by proximal training, which was further supported by the inception scores for the CIFAR-10 experiments reported in the main text.

## Appendix C Proofs

### c.1 Proof of Proposition 1

Proof for -GANs:

Consider the following -GAN minimax problem corresponding to the convex function :

 minG∈GmaxDE[D(X)]−E[f∗(D(G(Z)))]. (C.1)

Due to the realizability assumption, given we assume that the data distribution and the generative model are identical, i.e., . Then, the minimax objective for reduces to

 (C.2)

The above objective decouples across outcomes. As a result, the maximizing discriminator will be a constant function where the constant value follows from the optimization problem:

 f′(1)=argmaxu∈Ru−f∗(u). (C.3)

Note that the objective is a concave function of whose derivative is zero at , because the Fenchel-conjugate of a convex satisfies .

So far we have proved that the constant function provides the optimal discriminator for generator . Therefore, for every discriminator we have

 V(G∗,D)≤V(G∗,Dconstant), (C.4)

where denotes the -GAN’s minimax objective. Moreover, note that for a constant the value of the minimax objective does not change with generator . As a result, for every

 V(G,Dconstant)=V(G∗,Dconstant). (C.5)

Then, (C.4) and (C.5) collectively prove that for every and we have

 V(G∗,D)≤V(G∗,Dconstant)≤V(G,Dconstant),

which completes the proof for -GANs.

Proof for Wasserstein GANs:

Consider a general Wasserstein GAN problem with a cost function satisfying for every . Notice that this property holds for all Wasserstein distance measures corresponding to cost function for . The generalized Wasserstein GAN minimax problem is as follows:

 minG∈GmaxD\rmc-concaveE[D(X)]−E[Dc(G(Z))]. (C.6)

Due to the realizability assumption, a generator function results in the data distribution such that . Then, the above minimax objective for reduces to

 EPX[D(X)−Dc(X)]. (C.7)

Since the cost is assumed to take a zero value given identical inputs, we have:

 Dc(x): =maxx′D(x′)−c(x,x′) ≥D(x)−c(x,x) =D(x).

As a result, holds for every . Hence, the objective in (C.7) will be non-positive and takes its maximum zero value for any constant function , which by definition satisfies -concavity. Therefore, letting denote the GAN minimax objective, for every we have

 V(G∗,D)≤V(G∗,Dconstant). (C.8)

We also know that for a constant discriminator the value of the minimax objective is independent from the generator function. Therefore, for every we have

 V(G∗,Dconstant)=V(G,Dconstant). (C.9)

As a result, (C.8) and (C.9) together show that for every and

 V(G∗,D)≤V(G∗,Dconstant)≤V(G,Dconstant), (C.10)

which makes the proof complete for Wasserstein GANs.

### c.2 Proof of Theorem 1 & Remark 1

Proof for -GANs:

###### Lemma 1.

Consider two random vectors

, respectively. Suppose that are non-zero everywhere. Then, considering the following variational representation of ,

 df(P,Q)=maxDE[D(X)]−E[f∗(D(˜X))], (C.11)

the optimal solution will satisfy

 D∗(x)=f′(p(x)q(x)). (C.12)
###### Proof.

Let us rewrite the -divergence’s variational representation as

 df(P,Q) =maxDE[D(X)]−E[f∗(D(˜X))] =maxD∫[p(x)D(x)−q(x)f∗(D(x))]dx =∫maxD(x)[p(x)D(x)−q(x)f∗(D(x))]dx

where the last equality holds, since the maximization objective decouples across values. It can be seen that the inside optimization problem for each is maximizing a concave objective in which by setting the derivative to zero we obtain

 f∗′(D∗(x))=p(x)q(x). (C.13)

As a property of the Fenchel-conjugate of a convex , we know which combined with the above equation implies that

 D∗(x)=f′(p(x)q(x)). (C.14)

The above result completes Lemma 1’s proof. ∎

Consider the -GAN problem with the generator function specified in the theorem:

 minW,u:∥W∥2≤1maxDE[D(X)]−E[f∗(D(WZ+u))]. (C.15)

Note that and . Notice that if was not full-rank, the maximized discriminator objective would be achieved by a assigning an infinity value to the points not included in the rank-constrained support set of generator . This will not result in a solution to the -GAN problem, because we assume that the dimensions of and match each other and hence there exists a full-rank with a finite maximized objective, i.e. -divergence value. Therefore, in a Nash equilibrium of the -GAN problem, the solution must be full-rank and invertible.

Lemma 1 results in the following equation for the optimal discriminator given generator parameters :

 D∗W,u(x) =f′(√det(WWT)σ2kexp{12xT((WWT)−1−σ−2I)x−uT(WWT)−1/2x+uT(WWT)−1u}).

As a result, the function appearing in the -GAN’s minimax objective will be

 f∗(D∗W,u(x))= −

Claim: is a strictly convex function of .

To show this claim, note that the following expression is a strongly-convex quadratic function of , since we have assumed that the spectral norm of is bounded as :

 12xT((WTW)−1−σ−2I)x−uT(WTW)−1/2xuT(WTW)−1u.

For simplicity, we denote the above strongly-convex function with and define the function as

 h(y):=f∗(f′(√det(WWT)σ2k×ey)).

According to the above definitions, is the composition of and strongly-convex . Note that is a monotonically increasing function, since defining we have

 h′(y)=(cey)2f′′(cey)≥0, (C.16)

which follows from the equality

 f∗(f′(z)):=supu{uf′(z)−f(u)}=zf′(z)−f(z)

that is a consequence of the definition of Fenchel-conjugate, implying that for the convex . Note that holds everywhere, because is assumed to be strictly convex. This proves that is strictly increasing. Furthermore, is a convex function, because is non-decreasing due to the assumption that is non-decreasing over . As a result, is an increasing convex function.

Therefore, is a composition of a strongly-convex and an increasing convex . Therefore, as a well-known result in convex optimization [6], the claim is true and is a strictly convex function of .

We showed that the claim is true for every feasible . Now, we prove that the pair will not be a local Nash equilibrium for any feasible . If the pair was a local Nash equilibrium, would be a local minimum for the following minimax objective where is fixed to be :

 E[D∗(X)]−E[f∗(D∗(WZ+u))]. (C.17)

However, as shown earlier, for any feasible , is a strictly-convex function of , which in turn shows that (C.17) is a strictly-concave function of variable . This consequence proves that the objective has no local minima for the unconstrained variable . Due to the shown contradiction, a pair with the form cannot be a local Nash equilibrium in parameters . Consequently, the minimax problem has no pure Nash equilibrium solutions, since in a pure Nash equilibrium the discriminator will be by definition optimal against the choice of generator.

Proof for W2GANs:

Consider the W2GAN problem with the assumed generator function:

 minW,u:∥W∥2≤1maxD\rmc% -concaveE[D(X)]−E[Dc(WZ+u)], (C.18)

where the c-transform is defined for the quadratic cost function . Similar to the -GAN case, define to be the optimal discriminator for the generator function parameterized by . Note that and .

According to the Brenier’s theorem [46], the optimal transport from the Gaussian data distribution to the Gaussian generative model will be

 ψopt(x)=x−∇xD∗W,u(x).\lx@notefootnoteNoticethechangeofvariable$D(x)=12∥x∥2−ψ(x)$comparedtotheformulationdiscussedat\@@cite[cite][\@@bibrefvillani2008optimal,feizi2017understanding]whicharebasedonthefunction$ψ$.

As a well-known result regarding the second-order optimal transport map between two Gaussian distributions, the optimal transport will be a linear transformation as

. This result shows that

 ∇xD∗W,u(x)=(I−1σ(WWT)1/2)x−u. (C.19)

Note that the c-transform for cost satisfies where denotes ’s Fenchel-conjugate. For general convex quadratic function we have where denotes ’s Moore Penrose pseudoinverse. Therefore, for the c-transform of the optimal discriminator we will have

 ∇xD