# Local Stability and Performance of Simple Gradient Penalty mu-Wasserstein GAN

Wasserstein GAN (WGAN) is a model that minimizes the Wasserstein distance between the data distribution and the sample distribution. Recent studies have proposed methods for stabilizing the WGAN training process and enforcing the Lipschitz constraint. In this study, we prove the local stability of optimizing the simple gradient penalty μ-WGAN (SGP μ-WGAN) under suitable assumptions regarding the equilibrium and the penalty measure μ. The concept of measure-valued differentiation is employed to deal with the derivative of the penalty terms, which is helpful for handling abstract singular measures with lower-dimensional support. Based on this analysis, we claim that penalizing the data manifold or the sample manifold is the key to regularizing the original WGAN with a gradient penalty. Experimental results obtained with unintuitive penalty measures that satisfy our assumptions are also provided to support our theoretical results.

## 1 Introduction

Deep generative models reached a turning point after generative adversarial networks (GANs) were proposed by [3]. GANs are capable of modeling data with complex structures. For example, DCGAN can sample realistic images using a convolutional neural network (CNN) structure [13]. GANs have been applied with good results to many computer vision tasks, such as super-resolution, image translation, and text-to-image generation [8, 7, 18, 14].

However, despite these successes, GANs suffer from training instability and mode collapse. GANs often fail to converge, which can result in unrealistic fake samples. Furthermore, even if GANs successfully synthesize realistic data, the fake samples may exhibit little variability. These problems are attributed to the Jensen–Shannon divergence and the low dimensionality of the data manifold.

Common remedies are to inject instance noise and to use different divergences. The injection of instance noise into real and fake samples during training was proposed by [17], and its positive effect on distributions with low-dimensional support was shown analytically by [1] to act as a regularizing factor based on the Wasserstein distance. In f-GAN, the f-divergence between the target and generator distributions was suggested, which generalizes the divergence between two distributions [12]. In addition, a gradient penalty term related to the Sobolev integral probability metric (IPM) between the data distribution and the sample distribution was suggested by [10].

The Wasserstein GAN (WGAN) is known to resolve the problems of generic GANs by selecting the Wasserstein distance as the divergence [2]. However, WGAN often fails even on simple examples because the Lipschitz constraint on the discriminator is rarely achieved during optimization with weight clipping. Thus, mimicking the Lipschitz constraint on the discriminator by using a gradient penalty was proposed by [4].

Noise injection and regularization with a gradient penalty appear to be equivalent. The addition of instance noise in f-GAN can be approximated by adding a zero-centered gradient penalty [15]. Accordingly, regularizing GANs with a simple gradient penalty term was suggested by [9], who provided a proof of its stability.

Based on a theoretical analysis of the convergence, [11] proved the local exponential stability of the gradient-based optimization dynamics in GANs by treating the simultaneous gradient descent algorithm with a dynamic system approach. These previous studies were useful because they showed that the local behavior of GANs can be explained using dynamic system tools and the eigenvalues of the related Jacobian. The alternating gradient descent algorithm and the optimal step size for discrete updating were also studied by [9].

In this study, we aim to prove the convergence property of the simple gradient penalty $\mu$-Wasserstein GAN (SGP $\mu$-WGAN) dynamic system under general gradient penalty measures $\mu$. To the best of our knowledge, our study is the first theoretical approach to GAN stability analysis that deals with abstract singular penalty measures. In addition, measure-valued differentiation [5] is applied to take the derivative of an integral with respect to a parametric measure, which is helpful for handling an abstract measure and its integral in our proof.

The main contributions of this study are as follows.

• We prove the regularizing effect and local stability of the dynamic system for a general penalty measure under suitable assumptions. The assumptions are stated in both a tractable strong version and an intractable weak version. To prove the main theorem, we also introduce the concept of measure-valued differentiation to handle the parametric measure.

• Based on the proof of the stability, we explain the reason for the success of previous penalty measures. We claim that the support of a penalty measure will be strongly related to the stability, where the weight on the limiting penalty measure might affect the speed of convergence.

• We experimentally examined the general convergence results by applying two test penalty measures to several examples. The proposed test measures are unintuitive but they still satisfy the assumptions and similar convergence results were obtained in the experiment.

## 2 Preliminaries

First, we introduce our notation and basic measure-theoretic concepts. Second, we define our SGP $\mu$-WGAN optimization problem and treat this problem as a continuous dynamic system. The preliminary measure-theoretic concepts are required to justify that the dynamic system changes in a sufficiently smooth manner as the parameters change, so that the linearization theorem can be used. They are also important for dealing with the parametric measure and its derivative. The problem setting with a simple gradient penalty term is also discussed. The squared gradient norm is used to build a differentiable dynamic system, and the simple gradient penalty term is used to apply soft regularization in place of a hard constraint. The continuous dynamic system approach, the so-called ODE method, is used to analyze the GAN optimization problem with the simultaneous gradient descent algorithm, as described by [11].

### 2.1 Notations and Preliminaries Regarding Measure Theory

$D(x;\psi)$ is a discriminator function with parameter $\psi$, and the generator function has parameter $\theta$. $p_d$ is the distribution of real data and $p_\theta$ is the distribution of the generated samples in the sample space, which is induced from the generator function and a known initial distribution in the latent space. $\|\cdot\|$ denotes the Euclidean norm if no special subscript is present.

The concept of weak convergence for finite measures is used to ensure the continuity of the integral term over the measure in the dynamic system, which must be checked before applying the theorems related to stability. Throughout this study, we assume that the measures in the sample space are all finite and bounded.

###### Definition 1.

For a set of finite measures $\{\mu_i\}$ on a metric space $X$ equipped with its Borel $\sigma$-algebra, $\{\mu_i\}$ is referred to as bounded if there exists some $M > 0$ such that for all $i$,

$$\mu_i(X) \le M$$

For instance, $M$ can be set as 1 if the $\mu_i$ are probability measures on $X$. Assuming that the penalty measures are bounded, the Portmanteau theorem offers an equivalent definition of weak convergence for finite measures. This definition is important for ensuring that the integrals over the parameter-dependent measures in the dynamic system change continuously.

###### Definition 2.

(Portmanteau Theorem) For a bounded sequence of finite measures $\{\mu_n\}$ on the Euclidean space with its $\sigma$-field of Borel subsets, $\mu_n$ converges weakly to $\mu$ if and only if for every continuous bounded function $\phi$ on that space, its integrals with respect to $\mu_n$ converge to $\int \phi \, d\mu$, i.e.,

$$\mu_n \to \mu \iff \int \phi \, d\mu_n \to \int \phi \, d\mu$$
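As a small numerical illustration of this characterization, the sketch below uses a sequence that is not taken from the paper: $\mu_n$ is assumed to be the uniform probability measure on the atoms $\{1/n, \dots, n/n\}$, which converges weakly to the Lebesgue measure on $[0,1]$. The test function `phi` and the helper name `integral_under_mu_n` are hypothetical choices made only for this example.

```python
import numpy as np

def integral_under_mu_n(phi, n):
    """Integral of phi against mu_n, the uniform probability measure on {1/n, ..., n/n}."""
    atoms = np.arange(1, n + 1) / n
    return float(np.mean(phi(atoms)))

phi = np.cos            # a continuous bounded test function (illustrative choice)
limit = np.sin(1.0)     # integral of cos over [0, 1] with respect to the Lebesgue measure

# The gap between the integrals shrinks as n grows, as weak convergence requires.
errors = [abs(integral_under_mu_n(phi, n) - limit) for n in (10, 100, 1000, 10000)]
print(errors)
```

Each $\mu_n$ is purely discrete while the limit is absolutely continuous, which is why convergence is tested through integrals of continuous bounded functions rather than through densities.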

The most challenging problem in our analysis with the general penalty measure is taking the derivative of the integral when the measure itself depends on the variable of differentiation. If the penalty measure is either absolutely continuous or discrete, then the integral is easy to handle. However, for a singular penalty measure, dealing with the integral term is not an easy task. Therefore, we introduce the concept of a weak derivative of a probability measure, following [5]. The weak derivative of a measure is useful for handling a parametric measure that is not absolutely continuous and has low-dimensional support.

###### Definition 3.

(Weak Derivatives of a Probability Measure) Consider the Euclidean space and its $\sigma$-field of Borel subsets. The probability measure $P_\theta$ is called weakly differentiable at $\theta$ if a signed finite measure $P'_\theta$ exists where

$$\frac{d}{d\theta}\int \phi(x)\,dP_\theta = \lim_{\Delta \to 0}\frac{1}{\Delta}\left\{\int \phi(x)\,dP_{\theta+\Delta} - \int \phi(x)\,dP_\theta\right\} = \int \phi(x)\,dP'_\theta$$

is satisfied for every continuous bounded function $\phi$. For a multidimensional parameter $\theta$, this can be defined in a similar manner.

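We can show that the positive part and negative part of $P'_\theta$ have the same mass by putting $\phi \equiv 1$ and applying the Hahn–Jordan decomposition to $P'_\theta$. Therefore, the triple $(c_\theta, P^+_\theta, P^-_\theta)$ is called a weak derivative of $P_\theta$, where $P^{\pm}_\theta$ are probability measures and $P'_\theta$ is rewritten as:

$$P'_\theta = c_\theta P^+_\theta - c_\theta P^-_\theta$$

Therefore,

$$\frac{d}{d\theta}\int \phi(x)\,dP_\theta = \int \phi(x)\,dP'_\theta = c_\theta\left(\int \phi(x)\,dP^+_\theta - \int \phi(x)\,dP^-_\theta\right)$$

holds for every continuous bounded function $\phi$. It is known that the triple representing $P'_\theta$ is not unique, because other triples can represent the same signed measure.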

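For a general finite measure $Q_\theta = M(\theta)P_\theta$, a normalizing coefficient $M(\theta)$ can be introduced. The product rule for differentiation can also be applied, in a manner similar to ordinary calculus.

$$\frac{d}{d\theta}\int \phi(x;\theta)\,dP_\theta = \int \nabla_\theta \phi(x;\theta)\,dP_\theta + \int \phi(x;\theta)\,dP'_\theta$$

Therefore, for the general finite measure $Q_\theta$, its derivative can be represented as below.

$$Q'_\theta = M'(\theta)P_\theta + M(\theta)P'_\theta = M'(\theta)P_\theta + c_\theta M(\theta)P^+_\theta - c_\theta M(\theta)P^-_\theta$$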
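The triple representation can be checked numerically on a standard textbook case that does not appear in the paper. Assuming $P_\theta$ is the uniform distribution on $[0,\theta]$, differentiating $\frac{1}{\theta}\int_0^\theta \phi(x)\,dx$ gives the triple $(c_\theta, P^+_\theta, P^-_\theta) = (1/\theta, \delta_\theta, P_\theta)$. The sketch below compares this formula with a central finite difference; the test function `phi`, the helper `expect_uniform`, and the values of `theta` and `delta` are hypothetical choices.

```python
import numpy as np

def expect_uniform(phi, theta, n=200_001):
    """E[phi(X)] for X ~ Uniform[0, theta], computed with the trapezoidal rule."""
    x = np.linspace(0.0, theta, n)
    y = phi(x)
    h = theta / (n - 1)
    integral = h * (y.sum() - 0.5 * (y[0] + y[-1]))
    return integral / theta

def phi(x):
    """A continuous bounded test function (illustrative choice)."""
    return np.sin(3.0 * x) + 1.0 / (1.0 + x ** 2)

theta, delta = 1.5, 1e-4

# Direct derivative of the expectation with respect to theta (central difference).
finite_difference = (expect_uniform(phi, theta + delta)
                     - expect_uniform(phi, theta - delta)) / (2.0 * delta)

# Weak-derivative formula: c_theta * (integral under P+ minus integral under P-),
# with c_theta = 1/theta, P+ = Dirac mass at theta, P- = Uniform[0, theta].
c_theta = 1.0 / theta
mvd_estimate = c_theta * (phi(theta) - expect_uniform(phi, theta))

print(finite_difference, mvd_estimate)
```

Here the positive part is a singular (Dirac) measure even though $P_\theta$ itself has a density, which illustrates why the derivative is defined through integrals against test functions.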

### 2.2 Problem Setting as a Dynamic System

Previous work by [9] showed that the dynamic system of WGAN-GP is not necessarily stable at equilibrium by demonstrating that the sequence of parameters is not a Cauchy sequence. This is mainly due to the gradient-norm term in the dynamic system, whose derivative is not defined where the gradient vanishes. WGAN-GP therefore has a penalty term that can lead to a discontinuity in its dynamic system.

These problems can be avoided by using the squared value of the gradient's norm, $\|\nabla_x D\|^2$, which is a differentiable function. In contrast to WGAN-GP, which penalizes the deviation of the discriminator's gradient norm from 1 in a pointwise manner, recent gradient penalty methods such as the simple gradient penalty employed by [9] and the Sobolev GAN use the average of the squared gradient norm over the penalty area.

This advantage of the squared gradient term makes the dynamic system differentiable, and we define the WGAN problem with the square of the gradient's norm as a simple gradient penalty. (In this study, we prefer to use the expectation notation for a finite measure: if $\mu$ has mass $M$ and $\bar{\mu}$ denotes $\mu$ normalized to a probability measure, then $E_\mu[f]$ is understood as $\int f\,d\mu = M\,E_{\bar{\mu}}[f]$.) This simple gradient penalty can be treated as soft regularization based on the size of the discriminator's gradient, especially in the case where $\mu$ is a probability measure [15]. It is convenient to determine whether the system is stable by observing the spectrum of the Jacobian matrix. In the following, we define the SGP $\mu$-WGAN optimization problem (SGP-form) with a simple gradient penalty term on the penalty measure $\mu$.

###### Definition 4.

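The WGAN optimization problem with a simple gradient penalty term $\|\nabla_x D\|^2$, penalty measure $\mu$, and penalty weight hyperparameter $\rho$ is given as follows, where the penalty term is only introduced to update the discriminator.

$$\begin{aligned} \max_\psi &: \; E_{p_d}[D(x;\psi)] - E_{p_\theta}[D(x;\psi)] - \frac{\rho}{2} E_\mu\left[\|\nabla_x D(x;\psi)\|^2\right] \\ \min_\theta &: \; E_{p_d}[D(x;\psi)] - E_{p_\theta}[D(x;\psi)] \end{aligned}$$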
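The two objectives can be estimated from samples as in the following minimal sketch. It is not the implementation used in the experiments: the toy discriminator $D(x;\psi) = \tanh(\psi^T x)$ is a hypothetical choice made so that the input gradient has a closed form without an autodiff library, and the `mass` argument assumes that a finite penalty measure of mass $M$ is handled by scaling the Monte Carlo average.

```python
import numpy as np

def D(x, psi):
    """Toy discriminator D(x; psi) = tanh(<psi, x>) for x of shape (n, d)."""
    return np.tanh(x @ psi)

def grad_x_D(x, psi):
    """Closed-form input gradient: (1 - tanh^2(<psi, x>)) * psi, shape (n, d)."""
    scale = 1.0 - np.tanh(x @ psi) ** 2
    return scale[:, None] * psi[None, :]

def generator_objective(psi, x_real, x_fake):
    """Estimate of E_pd[D] - E_ptheta[D]; the generator minimizes this (no penalty)."""
    return D(x_real, psi).mean() - D(x_fake, psi).mean()

def discriminator_objective(psi, x_real, x_fake, x_penalty, rho, mass=1.0):
    """Estimate of E_pd[D] - E_ptheta[D] - (rho / 2) * E_mu[||grad_x D||^2].

    x_penalty holds samples from the normalized penalty measure; `mass` rescales
    the average when mu is a finite measure that is not a probability measure.
    """
    squared_norms = np.sum(grad_x_D(x_penalty, psi) ** 2, axis=1)
    penalty = mass * squared_norms.mean()
    return generator_objective(psi, x_real, x_fake) - 0.5 * rho * penalty
```

Only `discriminator_objective` contains the penalty, matching the statement that the penalty term is introduced solely for the discriminator update. Which samples are passed as `x_penalty` determines the penalty measure.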

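According to [11] and many other optimization studies, the simultaneous gradient descent algorithm for GAN updating can be viewed as an autonomous dynamic system of the discriminator parameters and generator parameters, which we denote as $\psi$ and $\theta$. As a result, the related dynamic system is given as follows.

$$\begin{aligned} \dot{\psi} &= E_{p_d}[\nabla_\psi D] - E_{p_\theta}[\nabla_\psi D] - \frac{\rho}{2}\nabla_\psi E_\mu\left[\nabla_x^T D \, \nabla_x D\right] \\ \dot{\theta} &= \nabla_\theta E_{p_\theta}[D] \end{aligned}$$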

## 3 Toy Examples

We investigate two examples considered in previous studies by [9] and [11]. We then generalize the results to a finite measure case. The first example is the univariate Dirac GAN, which was introduced by [9].

###### Definition 5.

(Dirac GAN) The Dirac GAN comprises a linear discriminator $D(x;\psi) = \psi x$, data distribution $p_d = \delta_0$, and sample distribution $p_\theta = \delta_\theta$.

The Dirac GAN with a gradient penalty on an arbitrary probability measure is known to be globally convergent [9]. We argue that this result can be generalized to the case of a finite penalty measure.

###### Lemma 1.

Consider the Dirac GAN problem with SGP form . Suppose that some small exists such that its finite penalty measure with mass satisfies either

• for or

• and for .

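Then, the SGP $\mu$-WGAN optimization dynamics with this penalty measure are locally stable at the origin, and the basin of attraction is an open ball with radius $R$, which is given as follows.

$$R = \max\left\{\eta \ge 0 \;\middle|\; 2M(\psi,\theta) + \psi\nabla_\psi M(\psi,\theta) \ge 0 \text{ for all } (\psi,\theta) \text{ such that } \psi^2 + \theta^2 \le \eta^2\right\}$$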
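The behavior described by the lemma can be illustrated with a short simulation. The sketch below assumes the simplest case of a constant mass $M$, for which the Dirac GAN system from Appendix A reduces to $\dot{\psi} = -\theta - \rho M \psi$, $\dot{\theta} = \psi$, and integrates it with explicit Euler steps. The step size, initial point, and function name are hypothetical choices.

```python
import numpy as np

def simulate_dirac_gan(rho, mass=1.0, psi0=1.0, theta0=1.0, step=0.01, n_steps=5000):
    """Explicit Euler on psi' = -theta - rho * mass * psi, theta' = psi.

    Returns the distance of (psi, theta) from the origin after every step.
    """
    psi, theta = psi0, theta0
    radii = [np.hypot(psi, theta)]
    for _ in range(n_steps):
        # Simultaneous update: both right-hand sides use the old (psi, theta).
        psi, theta = psi + step * (-theta - rho * mass * psi), theta + step * psi
        radii.append(np.hypot(psi, theta))
    return np.array(radii)

penalized = simulate_dirac_gan(rho=1.0)     # simple gradient penalty with constant mass
unpenalized = simulate_dirac_gan(rho=0.0)   # plain WGAN dynamics on the same problem

print(penalized[-1], unpenalized[-1])
```

With the penalty, the trajectory spirals into the origin. Without it, the continuous system only rotates around the origin, and the explicit Euler discretization slowly drifts outward, so the unpenalized run never approaches the equilibrium.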

Motivated by this example, we can extend this idea to the other toy example given by [11], where WGAN fails to converge to the equilibrium points .

###### Lemma 2.

Consider the toy example where and the ideal equilibrium points are given by . For a finite measure on which is independent of , suppose that with for . The dynamic system is locally stable near the desired equilibrium , where the spectrum of the Jacobian at is given by .

## 4 Main Convergence Theorem

We establish the convergence property of WGAN with a simple gradient penalty on an arbitrary penalty measure $\mu$ for the realizable case, in which a generator parameter $\theta^*$ with $p_{\theta^*} = p_d$ exists. In subsection 4.1, we provide the necessary assumptions on which our main convergence theorem relies. In subsection 4.2, we give the main convergence theorem with a sketch of the proof. A more rigorous analysis is given in the Appendix.

### 4.1 Assumptions

The first assumption is made regarding the equilibrium condition for GANs, where we state the ideal conditions for the discriminator parameter and generator parameter. As the parameters converge to the ideal equilibrium, the sample distribution converges to the real data distribution and the discriminator cannot distinguish the generated sample and the real data.

###### Assumption 1.

as and on and its small open neighborhood, i.e., implies . For simplicity, we denote as .

The second assumption ensures that the higher order terms cannot affect the stability of the SGP $\mu$-WGAN. In the Appendix, we consider a case where the WGAN fails to converge when Assumption 2 is not satisfied. Compared with the previous study by [11], the conditions for the discriminator parameter are slightly modified.

###### Assumption 2.

are locally constant along the nullspace of the Hessian matrix.

The third assumption allows us to extend our results to discrete probability distribution cases, as described by [9].

###### Assumption 3.

such that on .

The fourth assumption indicates that there are no other “bad” equilibrium points near , which justifies the projection along the axis perpendicular to the null space.

###### Assumption 4.

A bad equilibrium does not exist near the desired equilibrium point. Thus, is an isolated equilibrium or there exist such that all equilibrium points in satisfy the other assumptions.

The last assumption is related to the necessary conditions for the penalty measure. A calculation of the gradient penalty based on samples from the data manifold and generator manifold, or the interpolation of both, was introduced in recent studies [4, 15, 9]. First, we propose strong conditions for the penalty measure.

###### Assumption 5.

The finite penalty measure $\mu$ satisfies the following:

1. and is independent of the discriminator parameter .

2. such that for .

The assumption given above means that the support of the penalty measure should approach the support of the data manifold smoothly as the parameters approach the equilibrium. Thus, the gradient penalty should be evaluated based on the data manifold and some open neighborhood near the equilibrium. However, the penalty measure from WGAN-GP with a simple gradient penalty still reaches equilibrium without satisfying Assumption 5c. Therefore, we suggest Assumption 6, which is a weak version of Assumption 5. Assumption 6a is technically required to take the derivative of the integral with respect to the parameters. (This condition is needed to handle the derivative of the measure in a convenient manner using the weak formulation. Even if the measure is not weakly differentiable, it may still be possible to differentiate the integral: a parametric measure can vary continuously without having a weak derivative, while the integral of a function against it remains differentiable when the function itself is differentiable at the relevant point.)

###### Assumption 6.

(Weak version of Assumption 5) The finite penalty measure satisfies the following.

1. , where only depends on . Near the equilibrium, can be weakly differentiated twice with respect to . In addition, its mass is a twice-differentiable function of and bounded by near the equilibrium.

2. is positive definite or .

3. such that for , where .

In summary, the gradient penalty regularization term with any penalty measure where the support approaches in a smooth manner works well and this main result can explain the regularization effect of previously proposed penalty measures such as , , , and their mixtures.

### 4.2 Main Convergence Theorem

According to the modified assumptions given above, we prove that the related dynamic system is locally stable near the equilibrium. The tools used for analyzing stability are mainly based on those described by [11]. Our main contributions comprise proposing the necessary conditions for the penalty measure and proving the local stability for all penalty measures that satisfy Assumption 6.

###### Theorem 1.

Suppose that our SGP $\mu$-WGAN optimization problem with equilibrium point $(\psi^*, \theta^*)$ satisfies the assumptions given above. Then, the related dynamic system is locally stable at the equilibrium.

A detailed proof of the main convergence theorem is given in the Appendix. A sketch of the proof is given in three steps. First, the undesired terms in the Jacobian matrix of the system at the equilibrium are cancelled out. Next, the Jacobian matrix at equilibrium is given by $J = \begin{bmatrix} -\rho Q & -R \\ R^T & 0 \end{bmatrix}$, where $Q = E_{\mu^*}[\nabla^T_{\psi x} D \, \nabla_{\psi x} D]$ and $R = \nabla_\theta E_{p_\theta}[\nabla_\psi D]\big|_{\theta=\theta^*}$. The system is locally stable when both $Q$ and $R^T R$ are positive definite, that is, when $R$ also has full column rank. Finally, we complete the proof by dealing with zero eigenvalues, showing that the stability of the projected system implies the stability of the original system.
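The linear-algebra fact behind the second step can be checked numerically. The sketch below builds a block matrix with blocks $-\rho Q$, $-R$, $R^T$, and $0$ from randomly generated stand-ins for $Q$ and $R$ (hypothetical matrices, not derived from any discriminator) and inspects the largest real part of its eigenvalues, first with a full column rank $R$ and then with a rank-deficient one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_psi, n_theta, rho = 4, 2, 1.0

A = rng.normal(size=(n_psi, n_psi))
Q = A.T @ A + np.eye(n_psi)             # symmetric positive definite by construction
R = rng.normal(size=(n_psi, n_theta))   # full column rank for generic random entries

def jacobian(Q, R, rho):
    """Assemble the block matrix [[-rho Q, -R], [R^T, 0]]."""
    zeros = np.zeros((R.shape[1], R.shape[1]))
    return np.block([[-rho * Q, -R], [R.T, zeros]])

# Q positive definite and R full column rank: every eigenvalue has negative real part.
max_real_part = np.linalg.eigvals(jacobian(Q, R, rho)).real.max()

# Rank-deficient R (second column is a multiple of the first): a zero eigenvalue
# appears, which is the situation handled by projecting the system.
R_deficient = np.column_stack([R[:, 0], 2.0 * R[:, 0]])
max_real_part_deficient = np.linalg.eigvals(jacobian(Q, R_deficient, rho)).real.max()

print(max_real_part, max_real_part_deficient)
```

In the rank-deficient case, any vector $(0, b)$ with $Rb = 0$ lies in the kernel of the block matrix, so linearization alone cannot decide stability along that direction.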

Our analysis mainly focuses on WGAN, which is the simplest case of the general GAN minimax optimization

$$\max_\psi \min_\theta : \; E_{p_d}[f(D(x;\psi))] + E_{p_\theta}[f(-D(x;\psi))]$$

with $f(x) = x$. A similar approach remains valid for general GANs with a concave function $f$ satisfying suitable conditions.

## 5 Experimental Results

We claim that every penalty measure that satisfies the assumptions can regularize the WGAN and generate similar results to the recently proposed gradient penalty methods. Several penalty measures were tested based on two-dimensional problems (mixture of 8 Gaussians, mixture of 25 Gaussians, and swissroll), MNIST and CIFAR-10 datasets using a simple gradient penalty term. In the comparisons with WGAN, the recently proposed penalty measures and our test penalty measures used the same network settings and hyperparameters. The penalty measures and its detailed sampling methods are listed in Table 1, where , and . indicates fixed anchor point in .

By setting the previously proposed WGAN with weight-clipping[2] and WGAN-GP[4] as the baseline models, SGP -WGAN was examined with various penalty measures comprising three recently proposed measures and two artificially generated measures. and were suggested by [9] and was introduced from the WGAN-GP. We analyzed the artificial penalty measures and as the test penalty measures.

The experiments were conducted based on the implementation of [4]. The hyperparameters, generator/discriminator structures, and related TensorFlow implementations can be found at https://github.com/igul222/improved_wgan_training [4]. Only the loss function was modified slightly, from a non-zero-centered gradient penalty to a simple penalty. For the CIFAR-10 image generation tasks, the inception score [16] and FID [6] were used as benchmark scores to evaluate the generated images.

### 5.1 2D Examples and MNIST

We checked the convergence of for the 2D examples (8 Gaussians, swissroll data, and 25 Gaussians) and MNIST digit generation for the SGP-WGANs with five penalty measures. MNIST and 25 Gaussians were trained over 200K iterations, the 8 Gaussians were trained for 30K iterations, and the Swiss Roll data were trained for 100K iterations. The anchor for was set as for the 2D examples and 784 gray pixels for MNIST. We only present the results obtained for the MNIST dataset with the penalty measures comprising and in Figure 1. The others are presented in the Appendix.

### 5.2 CIFAR-10

DCGAN and ResNet architectures were tested on the CIFAR-10 dataset. The generators were trained for 200K iterations. The anchor for during CIFAR-10 generation was set as fixed random pixels. The WGAN, WGAN-GP, and five penalty measures were evaluated based on the inception score and FID, as shown in Table 2, which are useful tools for scoring the quality of generated images. The images generated from and with ResNet are shown in Figure 2. The others are presented in the Appendix.

## 6 Conclusion

In this study, we proved the local stability of simple gradient penalty $\mu$-WGAN optimization for a general class of finite measures $\mu$. This proof provides insight into the success of regularization with previously proposed penalty measures. We explored previously proposed analyses based on various gradient penalty methods. Furthermore, our theoretical approach was supported by experiments using unintuitive penalty measures. In future research, our work can be extended to the alternating gradient descent algorithm and its related optimal hyperparameters. Stability at non-realizable equilibrium points is another important topic in the stability of GANs. The optimal penalty measure for achieving the best convergence speed can also be investigated using spectral theory, which would provide a mathematical analysis of GAN stability with precise information on the rate of convergence.

#### Acknowledgments

We thank Dr Seok Hyun Hong for fruitful discussions about this study. Hyung Ju Hwang was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1E1A1A03070105).

## References

• [1] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
• [2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214–223, 2017.
• [3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
• [4] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
• [5] B. Heidergott and F. J. Vázquez-Abad. Measure-valued differentiation for markov chains. Journal of Optimization Theory and Applications, 136:187–209, 2008.
• [6] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
• [7] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5967–5976, 2017.
• [8] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 105–114, 2017.
• [9] Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In Proceedings of the 35th International Conference on Machine Learning, pages 3478–3487, 2018.
• [10] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. In International Conference on Learning Representations, 2018.
• [11] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
• [12] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
• [13] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
• [14] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1060–1069, 2016.
• [15] Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2015–2025, 2017.
• [16] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
• [17] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. International Conference on Learning Representations, 2017.
• [18] Han Zhang, Tao Xu, and Hongsheng Li. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5908–5916, 2017.

## Appendix A : Proof of Lemmas based on toy examples

###### Proof of Lemma 1.

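The related dynamic system of the Dirac GAN in SGP form can be written as follows.

$$\begin{aligned} \dot{\psi} &= -\theta - \frac{\rho}{2}\nabla_\psi E_{\mu_{\psi,\theta}}[\psi^2] \\ \dot{\theta} &= \psi \end{aligned}$$

First, the only equilibrium point is given by $(\psi, \theta) = (0, 0)$ from

$$\begin{aligned} 0 &= -\theta - \frac{\rho}{2}\left(2\psi M(\psi,\theta) + \psi^2 \nabla_\psi M(\psi,\theta)\right) \\ 0 &= \psi \end{aligned}$$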

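The corresponding Jacobian matrix for the dynamic system is written as:

$$J = \begin{bmatrix} Z & -1 \\ 1 & 0 \end{bmatrix}$$

where

$$Z = -\frac{\rho}{2}\nabla_{\psi\psi} E_{\mu_{\psi,\theta}}[\psi^2]\Big|_{\psi=0,\theta=0}$$

Since $\psi^2$ does not depend on $x$, this can be rewritten as:

$$\begin{aligned} Z &= -\frac{\rho}{2}\nabla_{\psi\psi}\left(\psi^2 E_{\mu_{\psi,\theta}}[1]\right) = -\frac{\rho}{2}\left(2M(\psi,\theta) + 4\psi\nabla_\psi M(\psi,\theta) + \psi^2 M_{\psi\psi}(\psi,\theta)\right)\Big|_{\psi=0,\theta=0} \\ &= -\rho M(0,0) \end{aligned}$$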

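Therefore, if $M(0,0) > 0$, then the given system is locally stable because the eigenvalues of its linearized system have negative real parts. If $M(0,0) = 0$, then the stability of the system cannot be proved by the linearization theorem. In this case, we consider the following Lyapunov function.

$$L(\psi(t),\theta(t)) = \psi(t)^2 + \theta(t)^2$$

By differentiating $L$ with respect to $t$, we obtain

$$\begin{aligned} \dot{L} &= 2(\psi\psi' + \theta\theta') = -\rho\psi\nabla_\psi\left(\psi^2 M(\psi,\theta)\right) = -\rho\psi\left(2\psi M(\psi,\theta) + \psi^2\nabla_\psi M(\psi,\theta)\right) \\ &= -\rho\psi^2\left(2M(\psi,\theta) + \psi\nabla_\psi M(\psi,\theta)\right) \le 0 \end{aligned}$$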

Clearly, and the equality holds iff . In addition, since and from the assumption. Furthermore, it is clear that if , then for all because the Lyapunov function (square of the distance between the origin and ) always decreases as . Therefore, the given system is stable according to the Lyapunov stability theorem.

Again, we can check that if is a probability measure, then the system is globally stable, as shown by [9]. The basin of attraction is given by the whole plane since , so for every . ∎

###### Proof of Lemma 2.

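From the general setup of the SGP $\mu$-WGAN optimization problem, the dynamic system corresponding to the simple-GAN in Definition 6 can be written as follows.

$$\begin{aligned} \dot{\psi} &= \frac{1}{3} - \frac{\theta^2}{3} - 4\rho\psi E_\mu[x^2] \\ \dot{\theta} &= \frac{2\psi\theta}{3} \end{aligned}$$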

If we let , then the Jacobian matrix at the equilibrium is given by . Therefore, the given system is locally stable when . ∎
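The linearization in this proof can be reproduced numerically from the dynamic system stated above. The sketch below is an illustration under assumptions: `eta` stands for the second moment $E_\mu[x^2]$ of the penalty measure, the values `rho = 1.0` and `eta = 0.5` are hypothetical, and the equilibrium $(\psi, \theta) = (0, 1)$ is the point where the right-hand side of the stated system vanishes.

```python
import numpy as np

def vector_field(state, rho, eta):
    """Right-hand side of the Lemma 2 system; eta plays the role of E_mu[x^2]."""
    psi, theta = state
    return np.array([1.0 / 3.0 - theta ** 2 / 3.0 - 4.0 * rho * psi * eta,
                     2.0 * psi * theta / 3.0])

def numerical_jacobian(f, state, eps=1e-6):
    """Central-difference Jacobian of f at state (columns are partial derivatives)."""
    state = np.asarray(state, dtype=float)
    columns = []
    for i in range(state.size):
        shift = np.zeros_like(state)
        shift[i] = eps
        columns.append((f(state + shift) - f(state - shift)) / (2.0 * eps))
    return np.column_stack(columns)

rho, eta = 1.0, 0.5
equilibrium = np.array([0.0, 1.0])

J = numerical_jacobian(lambda s: vector_field(s, rho, eta), equilibrium)
eigenvalues = np.linalg.eigvals(J)

# Without any penalty weight on x^2 (eta = 0) the eigenvalues are purely imaginary.
J_no_penalty = numerical_jacobian(lambda s: vector_field(s, rho, 0.0), equilibrium)
eigenvalues_no_penalty = np.linalg.eigvals(J_no_penalty)

print(eigenvalues, eigenvalues_no_penalty)
```

For these values the Jacobian has entries $-4\rho\,\mathtt{eta}$, $-2/3$, $2/3$, and $0$, so both eigenvalues have negative real parts whenever $\rho\,\mathtt{eta} > 0$, while setting `eta` to zero leaves a pure rotation around the equilibrium.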

## Appendix B : Proof of Lemma related with Assumption 2

###### Lemma 3.

Consider the Dirac-GAN setup and SGP -WGAN optimization system with a slightly changed discriminator function . The system does not converge to but for any point with , the system has equilibrium points on the whole -axis and it violates Assumption 2.

###### Proof of Lemma 3.

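For the SGP $\mu$-WGAN optimization problem, the dynamic system can be written as follows.

$$\begin{aligned} \dot{\psi} &= -\theta^2 - \frac{4}{3}\rho\psi\theta^2 \\ \dot{\theta} &= 2\psi\theta \end{aligned}$$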

and implies that , so the -axis is the set of all equilibrium points. By drawing the nullclines and in the -plane, it is clear that no solution curve converges to with , as shown in Figure 3.

## Appendix C : Proof of the Main Convergence Theorem

###### Proof.

Let us first consider the Jacobian matrix at the equilibrium. (Regarding notation: for a real-valued function, we treat the first derivative as a column vector instead of a row vector, and it is regarded as the matrix of the total derivative. Second derivatives are matrices, and the transpose notation is used for them in the same manner.)

$$J = \begin{bmatrix} E_{p_d}[\nabla_{\psi\psi} D] - E_{p_{\theta^*}}[\nabla_{\psi\psi} D] - \frac{\rho}{2}\nabla_{\psi\psi} E_\mu\left[\|\nabla_x D\|^2\right] & -\nabla_{\theta\psi} E_{p_\theta}[D] - \frac{\rho}{2}\nabla_{\theta\psi} E_\mu\left[\|\nabla_x D\|^2\right] \\ \nabla_{\psi\theta} E_{p_\theta}[D] & \nabla_{\theta\theta} E_{p_\theta}[D] \end{bmatrix}$$

First, Assumption 1 implies that since as . From Assumption 3, is locally zero near the equilibrium , which implies that

$$K_{GG} = \nabla_{\theta\theta} E_{p_\theta}[D(x;\psi^*)]\Big|_{\theta=\theta^*} = 0$$

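We still need to evaluate the two penalty terms in the Jacobian, denoted $I$ and $II$ below. According to Assumption 6a, the first and second weak derivatives of the penalty measure with respect to the parameters exist as finite signed measures at the equilibrium (they are regarded as a row vector and a matrix of finite signed measures, respectively). Therefore, the expectations given above can be rewritten as below.

$$\begin{aligned} I &= \nabla_{\psi\psi}\int_{\mathrm{supp}(\mu_{\psi,\theta})} \|\nabla_x D\|^2 \, d\mu_{\psi,\theta} \\ II &= \nabla_{\theta\psi}\int_{\mathrm{supp}(\mu_{\psi,\theta})} \|\nabla_x D\|^2 \, d\mu_{\psi,\theta} \\ &= \nabla_\theta\left(\int_{\mathrm{supp}(\mu_{\psi,\theta})} 2\left(\nabla^T_{\psi x} D \, \nabla_x D\right) d\mu_{\psi,\theta} + \int_{\mathrm{supp}(\mu_{\psi,\theta})} \|\nabla_x D\|^2 \, d\mu'_{\psi,\theta}\right) \end{aligned}$$

where

$$K_0(x;\psi) = \left[\sum_k \frac{\partial^3}{\partial\psi_i \, \partial\psi_j \, \partial x_k} D(x;\psi) \, \frac{\partial}{\partial x_k} D(x;\psi)\right]_{ij}$$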

From Assumption 6c and the fact that the weak derivative of vanishes outside of , on for all with and on the outside of , which leads to the desired results:

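$$\begin{aligned} I &= \int_{\mathrm{supp}(\mu^*)} 2\left(\nabla^T_{\psi x} D(x;\psi^*) \, \nabla_{\psi x} D(x;\psi^*)\right) d\mu^* \\ II &= 0 \end{aligned}$$

After cancelling the undesired terms, the Jacobian matrix at the equilibrium is given as:

$$J = \begin{bmatrix} -\rho Q & -R \\ R^T & 0 \end{bmatrix}$$

where

$$\begin{aligned} Q &= E_{\mu^*}\left[\nabla^T_{\psi x} D \, \nabla_{\psi x} D\right] \\ R &= \nabla_\theta E_{p_\theta}[\nabla_\psi D]\Big|_{\theta=\theta^*} \end{aligned}$$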

From the definition of $Q$, it is easy to check that $Q$ is at least positive semi-definite. It is known that for a negative definite matrix $-\rho Q$ and a full column rank matrix $R$, the block matrix $J$ is Hurwitz, i.e., all eigenvalues of the matrix have a negative real part. Therefore, if $Q$ is positive definite and $R$ has full column rank, the proof is complete. We now consider the complementary case.

Suppose that or have some zero eigenvalues. Let and with and , where and are the eigenvectors of and that correspond to non-zero eigenvalues. First, we assume that and are not empty. We can show that is also an equilibrium point for a sufficiently small and by using the techniques given by [11]. If the system does not update at the equilibrium point and its small neighborhood is perturbed along and , then it is reasonable to project the system orthogonal to and .

First, we assume that . By Assumption 2, for , which implies that for and . Thus, we obtain

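$$E_{\mu_{\psi^*+\xi v,\theta^*}}\left[\nabla^T_{\psi x} D(x;\psi^*+\xi v) \, \nabla_x D(x;\psi^*+\xi v)\right] = 0$$

and

$$\int_{\mathrm{supp}(\mu^*)} \|\nabla_x D(x;\psi^*+\xi v)\|^2 \, d\mu'_{\psi^*+\xi v,\theta^*} = 0$$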

By Assumption 4, since . By adding these equations, we obtain

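$$\begin{aligned} \dot{\psi} &= E_{p_d}[\nabla_\psi D(x;\psi^*+\xi v)] - E_{p_{\theta^*}}[\nabla_\psi D(x;\psi^*+\xi v)] \\ &\quad - \frac{\rho}{2}\int_{\mathrm{supp}(\mu_{\psi^*+\xi v,\theta^*})} 2\nabla^T_{\psi x} D(x;\psi^*+\xi v) \, \nabla_x D(x;\psi^*+\xi v) \, d\mu_{\psi^*+\xi v,\theta^*} \\ &= 0 \end{aligned}$$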