Deep generative models reached a turning point after generative adversarial networks (GANs) were proposed by Goodfellow et al. [3]. GANs are capable of modeling data with complex structures. For example, DCGAN [13] can sample realistic images using a convolutional neural network (CNN) structure [8, 7, 18, 14].
However, despite these successes, GANs are affected by training instability and mode collapse problems. GANs often fail to converge, which can result in unrealistic fake samples. Furthermore, even if GANs successfully synthesize realistic data, the fake samples exhibit little variability. This problem is due to Jensen–Shannon divergence and the low dimensionality of the data manifold.
A common solution to this problem is injecting instance noise and adopting different divergences. The injection of instance noise into real and fake samples during training was proposed by Sønderby et al. [17], and its positive impact on the low-dimensional support of the data distribution was shown to act as a regularizing factor based on the Wasserstein distance, as demonstrated analytically by Arjovsky and Bottou [1]. In $f$-GAN [12], the $f$-divergence between the target and generator distributions was suggested, which generalizes the divergence between two distributions. In addition, a gradient penalty term related to the Sobolev IPM (integral probability metric) between the data distribution and the sample distribution was suggested by Mroueh et al. [10].
The Wasserstein GAN (WGAN) [2] is known to resolve the problems of generic GANs by selecting the Wasserstein distance as the divergence. However, WGAN often fails on simple examples because the Lipschitz constraint on the discriminator is rarely achieved during optimization with weight clipping. Thus, mimicking the Lipschitz constraint on the discriminator by using a gradient penalty was proposed by Gulrajani et al. [4].
Noise injection and regularizing with a gradient penalty appear to be equivalent. The addition of instance noise in $f$-GAN can be approximated by adding a zero-centered gradient penalty. Thus, regularizing GANs with a simple gradient penalty term was suggested by Mescheder et al. [9], who provided a proof of its stability.
Based on a theoretical analysis of the convergence, Nagarajan and Kolter [11] proved the local exponential stability of the gradient-based optimization dynamics in GANs by treating the simultaneous gradient descent algorithm with a dynamic system approach. These previous studies were useful because they showed that the local behavior of GANs can be explained using dynamic system tools and the eigenvalues of the related Jacobian. Alternating gradient descent and the optimal step size for discrete updates have also been studied.
In this study, we aim to prove the convergence property of the simple gradient penalty $\mu$-Wasserstein GAN (SGP $\mu$-WGAN) dynamic system under general gradient penalty measures $\mu$. To the best of our knowledge, our study is the first theoretical approach to GAN stability analysis that deals with an abstract singular penalty measure. In addition, measure-valued differentiation is applied to take the derivative of an integral with respect to a parametric measure, which is helpful for handling an abstract measure and its integral in our proof.
The main contributions of this study are as follows.
- We prove the regularizing effect and local stability of the dynamic system for a general penalty measure under suitable assumptions. The assumptions are written as both a tractable strong version and an intractable weak version. To prove the main theorem, we also introduce the measure-valued differentiation concept to handle the parametric measure.
- Based on the proof of the stability, we explain the reason for the success of previous penalty measures. We claim that the support of a penalty measure is strongly related to the stability, whereas the weight on the limiting penalty measure might affect the speed of convergence.
- We experimentally examine the general convergence results by applying two test penalty measures to several examples. The proposed test measures are unintuitive, but they still satisfy the assumptions, and similar convergence results were obtained in the experiments.
First, we introduce our notations and basic measure-theoretic concepts. Second, we define our SGP $\mu$-WGAN optimization problem and treat this problem as a continuous dynamic system. Preliminary measure-theoretic concepts are required to justify that the dynamic system changes in a sufficiently smooth manner as the parameter changes, so that it is possible to use the linearization theorem. They are also important for dealing with the parametric measure and its derivative. The problem setting with a simple gradient term is also discussed. The squared gradient size is used to build a differentiable dynamic system, and the simple gradient penalty term is used to apply soft regularization in place of a hard constraint. The continuous dynamic system approach, the so-called ODE method, is used to analyze the GAN optimization problem with the simultaneous gradient descent algorithm, as described by Nagarajan and Kolter [11].
2.1 Notations and Preliminaries Regarding Measure Theory
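$D(x;\psi): \mathcal{X} \to \mathbb{R}$ is a discriminator function with its parameter $\psi$ and $G(z;\theta): \mathcal{Z} \to \mathcal{X}$ is a generator function with its parameter $\theta$. $p_d$ is the distribution of real data and $p_g = p_\theta$ is the distribution of the generated samples in $\mathcal{X}$, which is induced from the generator function $G(z;\theta)$ and a known initial distribution $p_{latent}(z)$ in the latent space $\mathcal{Z}$. $\|\cdot\|$ denotes the Euclidean norm if no special subscript is present.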
The concept of weak convergence for finite measures is used to ensure the continuity of the integral term over the measure in the dynamic system, which must be checked before applying the theorems related to stability. Throughout this study, we assume that the measures in the sample space are all finite and bounded.
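For a set of finite measures $\{\mu_i\}_{i \in I}$ in the metric space $(\mathbb{R}^n, d)$ with metric $d$ and Borel $\sigma$-algebra $\mathcal{B}$, $\{\mu_i\}_{i \in I}$ is referred to as bounded if there exists some $M > 0$ such that for all $i \in I$,
$$\mu_i(\mathbb{R}^n) \le M.$$
For instance, $M$ can be set as 1 if $\{\mu_i\}_{i \in I}$ are probability measures on $\mathbb{R}^n$. Assuming that the penalty measures are bounded, the Portmanteau theorem offers an equivalent definition of weak convergence for finite measures. This definition is important for ensuring that the integrals over $p_\theta$ and $\mu$ in the dynamic system change continuously.
(Portmanteau Theorem) For a bounded sequence of finite measures $\{\mu_k\}_{k \in \mathbb{N}}$ on the Euclidean space $\mathbb{R}^n$ with a $\sigma$-field of Borel subsets $\mathcal{B}$, $\mu_k$ converges weakly to $\mu$ if and only if for every continuous bounded function $\phi$ on $\mathbb{R}^n$, its integrals with respect to $\mu_k$ converge to $\int \phi \, d\mu$, i.e.,
$$\int \phi \, d\mu_k \to \int \phi \, d\mu \quad \text{as } k \to \infty.$$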
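As a small illustration of this criterion, the sketch below takes the Dirac measures at $1/k$, which converge weakly to the Dirac measure at 0. For a continuous bounded test function the integrals converge, whereas for a discontinuous indicator they do not, which is why the definition is restricted to continuous bounded functions. The function names are chosen only for illustration.

```python
import math

def integral_against_dirac(phi, point):
    """Integral of phi with respect to the Dirac measure located at `point`."""
    return phi(point)

def continuous_bounded(x):
    # A continuous bounded test function.
    return math.cos(x) / (1.0 + x * x)

def indicator_nonpositive(x):
    # Bounded but discontinuous at 0, so it is not an admissible test function.
    return 1.0 if x <= 0.0 else 0.0

ks = [1, 10, 100, 1000, 10000]

# Gap between the integral under the Dirac measure at 1/k and under the limit measure.
limit_cont = integral_against_dirac(continuous_bounded, 0.0)
gaps_cont = [abs(integral_against_dirac(continuous_bounded, 1.0 / k) - limit_cont)
             for k in ks]

limit_ind = integral_against_dirac(indicator_nonpositive, 0.0)
gaps_ind = [abs(integral_against_dirac(indicator_nonpositive, 1.0 / k) - limit_ind)
            for k in ks]

print("continuous test function gaps:", gaps_cont)
print("indicator gaps:", gaps_ind)
```

The gaps for the continuous function shrink as $k$ grows, while the indicator keeps a constant gap even though the measures themselves converge weakly.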
The most challenging problem in our analysis with the general penalty measure is taking the derivative of the integral, where the measure depends on the variable that we want to differentiate. If our penalty measure is either absolutely continuous or discrete, then it is easy to deal with the integral. However, in the case of singular penalty measure, dealing with the integral term is not an easy task. Therefore, we introduce the concept of a weak derivative of a probability measure in the following. The weak derivative of a measure is useful for handling a parametric measure that is not absolutely continuous with low dimensional support.
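(Weak Derivatives of a Probability Measure) Consider the Euclidean space $\mathbb{R}^n$ and its $\sigma$-field of Borel subsets $\mathcal{B}$. The probability measure $\mu_\theta$ is called weakly differentiable at $\theta$ if a signed finite measure $\mu'_\theta$ exists where
$$\lim_{\Delta\theta \to 0} \frac{1}{\Delta\theta}\left[\int \phi \, d\mu_{\theta + \Delta\theta} - \int \phi \, d\mu_\theta\right] = \int \phi \, d\mu'_\theta$$
is satisfied for every continuous bounded function $\phi$ on $\mathbb{R}^n$. For a multidimensional parameter $\theta$, this can be defined in a similar manner.
We can show that the positive part and negative part of $\mu'_\theta$ have the same mass by putting $\phi \equiv 1$ and applying the Hahn–Jordan decomposition to $\mu'_\theta$. Therefore, the following triple $(c_\theta, \mu_\theta^+, \mu_\theta^-)$ is called a weak derivative of $\mu_\theta$, where $\mu_\theta^\pm$ are probability measures and $\mu'_\theta$ is rewritten as:
$$\int \phi \, d\mu'_\theta = c_\theta\left[\int \phi \, d\mu_\theta^+ - \int \phi \, d\mu_\theta^-\right],$$
which holds for every continuous bounded function $\phi$ on $\mathbb{R}^n$. It is known that the representation $(c_\theta, \mu_\theta^+, \mu_\theta^-)$ of $\mu'_\theta$ is not unique because, for any probability measure $\nu$, $(2c_\theta, \tfrac{1}{2}(\mu_\theta^+ + \nu), \tfrac{1}{2}(\mu_\theta^- + \nu))$ is another representation of $\mu'_\theta$.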
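For the general finite measure $\mu_\theta$, a normalizing coefficient $M(\theta) = \mu_\theta(\mathbb{R}^n)$ can be introduced, so that $\mu_\theta = M(\theta)\bar{\mu}_\theta$ with a probability measure $\bar{\mu}_\theta$. The product rule for differentiation can also be applied in a similar manner to calculus.
Therefore, for the general finite measure $\mu_\theta$, its derivative can be represented as below.
$$\mu'_\theta = M'(\theta)\,\bar{\mu}_\theta + M(\theta)\,\bar{\mu}'_\theta$$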
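To make the triple representation concrete, here is a minimal numerical sketch for the uniform distribution on $[0, \theta]$, a standard textbook example that is not taken from this paper. Differentiating $\frac{1}{\theta}\int_0^\theta \phi(x)\,dx$ with respect to $\theta$ gives $\frac{1}{\theta}\left[\phi(\theta) - \frac{1}{\theta}\int_0^\theta \phi(x)\,dx\right]$, so one weak derivative is the triple $(1/\theta, \delta_\theta, U[0,\theta])$. The code compares this formula with a central finite difference; the function names and the test function are illustrative assumptions.

```python
import numpy as np

def uniform_integral(phi, theta, n=200001):
    """Approximate the integral of phi w.r.t. U[0, theta] with the trapezoid rule."""
    x = np.linspace(0.0, theta, n)
    y = phi(x)
    dx = theta / (n - 1)
    area = dx * (y.sum() - 0.5 * (y[0] + y[-1]))
    return area / theta

def weak_derivative_integral(phi, theta):
    """Integral of phi w.r.t. the weak derivative, using the triple
    (c, mu_plus, mu_minus) = (1/theta, Dirac at theta, U[0, theta])."""
    c = 1.0 / theta
    return c * (phi(np.array(theta)) - uniform_integral(phi, theta))

def finite_difference(phi, theta, h=1e-4):
    """Central finite difference of theta -> integral of phi w.r.t. U[0, theta]."""
    return (uniform_integral(phi, theta + h) - uniform_integral(phi, theta - h)) / (2 * h)

phi = lambda x: np.cos(x)   # continuous bounded test function
theta = 1.5
triple_value = float(weak_derivative_integral(phi, theta))
fd_value = float(finite_difference(phi, theta))
print(triple_value, fd_value)
```

Here the negative part of the triple is the original measure itself and the positive part is a point mass at the moving endpoint, which shows how a weak derivative can involve a singular measure even when $\mu_\theta$ is absolutely continuous.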
2.2 Problem Setting as a Dynamic System
Previous work by Mescheder et al. [9] showed that the dynamic system of WGAN-GP is not necessarily stable at equilibrium by demonstrating that the sequence of parameters is not a Cauchy sequence. This is mainly due to the term $\|\nabla_x D\|$ in the dynamic system, whose derivative is not defined at zero. WGAN-GP thus has a penalty term that can lead to a discontinuity in its dynamic system.
These problems can be avoided by using the squared value of the gradient’s norm $\|\nabla_x D\|^2$, which is a differentiable function. In contrast to the WGAN-GP, which penalizes the deviation of the discriminator’s gradient norm from 1 in a pointwise manner, recent gradient penalty methods such as the simple gradient penalty of Mescheder et al. [9] and the Sobolev GAN [10] use the average of the squared gradient norm over the penalty area.
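The difference in smoothness can be seen in a one-dimensional sketch. Assuming a linear discriminator $D(x) = \psi x$, so that the gradient with respect to $x$ is simply $\psi$, the code below compares one-sided finite differences of the WGAN-GP style term $(|\psi| - 1)^2$ and of the squared term $\psi^2$ at $\psi = 0$. The function names are illustrative and this is not the authors' implementation.

```python
def gp_penalty(psi):
    # WGAN-GP style term for D(x) = psi * x: (|dD/dx| - 1)^2
    return (abs(psi) - 1.0) ** 2

def simple_penalty(psi):
    # Simple gradient penalty for D(x) = psi * x: |dD/dx|^2
    return psi ** 2

def one_sided_slopes(f, at=0.0, h=1e-6):
    """Backward and forward finite-difference slopes of f at `at`."""
    right = (f(at + h) - f(at)) / h
    left = (f(at) - f(at - h)) / h
    return left, right

gp_left, gp_right = one_sided_slopes(gp_penalty)
sgp_left, sgp_right = one_sided_slopes(simple_penalty)
print("WGAN-GP term slopes at 0:", gp_left, gp_right)
print("squared term slopes at 0:", sgp_left, sgp_right)
```

The two slopes of the WGAN-GP style term disagree at zero, so its derivative is not defined there, while the slopes of the squared term agree.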
This advantage of the squared gradient term makes the dynamic system differentiable, and we define the WGAN problem with the square of the gradient’s norm as a simple gradient penalty. This simple gradient penalty can be treated as soft regularization based on the size of the discriminator’s gradient, especially in the case where $\mu$ is a probability measure. It is convenient to determine whether the system is stable by observing the spectrum of the Jacobian matrix. In the following, the SGP $\mu$-WGAN optimization problem (SGP form) is defined with a simple gradient penalty term on the penalty measure $\mu$.
Footnote 1: In this study, we prefer to use the expectation notation on the finite measure, which can be understood as follows. Suppose that $\mu = M\bar{\mu}$, where $\bar{\mu}$ is normalized to a probability measure. Then, $E_{\mu}[f] = \int f \, d\mu = M E_{\bar{\mu}}[f]$.
The WGAN optimization problem with a simple gradient penalty term $\|\nabla_x D\|^2$, penalty measure $\mu$, and penalty weight hyperparameter $\rho > 0$ is given as follows, where the penalty term is only introduced to update the discriminator.
According to Nagarajan and Kolter [11] and many other studies of optimization problems, the simultaneous gradient descent algorithm for GAN updating can be viewed as an autonomous dynamic system of the discriminator parameters and generator parameters, which we denote as $\psi$ and $\theta$. As a result, the related dynamic system is given as follows.
3 Toy Examples
We investigate two examples considered in previous studies by Mescheder et al. [9] and Nagarajan and Kolter [11]. We then generalize the results to a finite measure case. The first example is the univariate Dirac GAN, which was introduced by Mescheder et al. [9].
(Dirac GAN) The Dirac GAN comprises a linear discriminator $D(x;\psi) = \psi x$, data distribution $p_d = \delta_0$, and sample distribution $p_\theta = \delta_\theta$.
The Dirac-GAN with a gradient penalty with an arbitrary probability measure is known to be globally convergent. We argue that this result can be generalized to a finite penalty measure case.
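For intuition, the Dirac GAN dynamics can be simulated directly. The sketch below assumes the sign convention in which the discriminator maximizes $E_{p_d}[D] - E_{p_\theta}[D] - \frac{\rho}{2}E_\mu[\|\nabla_x D\|^2]$ and a penalty measure with constant mass $M$, which gives $\dot{\psi} = -\theta - \rho M \psi$ and $\dot{\theta} = \psi$. The convention, the constant mass, the forward-Euler discretization, and all names are assumptions made for illustration rather than details taken from the text.

```python
import numpy as np

def simulate_dirac_gan(rho_mass, psi0=0.5, theta0=0.5, step=0.01, n_steps=5000):
    """Forward-Euler integration of
        d(psi)/dt   = -theta - rho_mass * psi
        d(theta)/dt = psi
    where rho_mass stands for rho * M (penalty weight times penalty mass).
    Returns the final distance of (psi, theta) from the origin."""
    psi, theta = psi0, theta0
    for _ in range(n_steps):
        d_psi = -theta - rho_mass * psi
        d_theta = psi
        psi, theta = psi + step * d_psi, theta + step * d_theta
    return float(np.hypot(psi, theta))

def jacobian_eigenvalues(rho_mass):
    """Eigenvalues of the (constant) Jacobian of the linear system above."""
    jac = np.array([[-rho_mass, -1.0], [1.0, 0.0]])
    return np.linalg.eigvals(jac)

start = float(np.hypot(0.5, 0.5))
dist_no_penalty = simulate_dirac_gan(rho_mass=0.0)
dist_penalty = simulate_dirac_gan(rho_mass=1.0)
print("initial distance:", start)
print("no penalty:", dist_no_penalty, "with penalty:", dist_penalty)
print("eigenvalues with penalty:", jacobian_eigenvalues(1.0))
```

Without the penalty the linearization has purely imaginary eigenvalues and the discrete trajectory does not approach the origin; with $\rho M > 0$ both eigenvalues have negative real parts and the trajectory contracts.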
Consider the Dirac GAN problem with SGP form . Suppose that some small exists such that its finite penalty measure with mass satisfies either
and for .
Then, the SGP -WGAN optimization dynamics with are locally stable at the origin and the basin of attraction is open ball with radius . Its radius is given as follows.
Motivated by this example, we can extend this idea to the other toy example given by Nagarajan and Kolter [11], where WGAN fails to converge to the equilibrium points.
Consider the toy example where and the ideal equilibrium points are given by . For a finite measure on which is independent of , suppose that with for . The dynamic system is locally stable near the desired equilibrium , where the spectrum of the Jacobian at is given by .
4 Main Convergence Theorem
We propose the convergence property of WGAN with a simple gradient penalty on an arbitrary penalty measure $\mu$ for a realizable case: $\theta^*$ with $p_d = p_{\theta^*}$ exists. In subsection 4.1, we provide the necessary assumptions on which our main convergence theorem relies. In subsection 4.2, we give the main convergence theorem with a sketch of the proof. A more rigorous analysis is given in the Appendix.
The first assumption is made regarding the equilibrium condition for GANs, where we state the ideal conditions for the discriminator parameter and generator parameter. As the parameters converge to the ideal equilibrium, the sample distribution converges to the real data distribution and the discriminator cannot distinguish the generated sample and the real data.
as and on and its small open neighborhood, i.e., implies . For simplicity, we denote as .
The second assumption ensures that the higher order terms cannot affect the stability of the SGP $\mu$-WGAN. In the Appendix, we consider a case where the WGAN fails to converge when Assumption 2 is not satisfied. Compared with the previous study by Nagarajan and Kolter [11], the conditions for the discriminator parameter are slightly modified.
are locally constant along the nullspace of the Hessian matrix.
The third assumption allows us to extend our results to discrete probability distribution cases, as described by.
such that on .
The fourth assumption indicates that there are no other “bad” equilibrium points near , which justifies the projection along the axis perpendicular to the null space.
A bad equilibrium does not exist near the desired equilibrium point. Thus, is an isolated equilibrium or there exist such that all equilibrium points in satisfy the other assumptions.
The last assumption is related to the necessary conditions for the penalty measure. A calculation of the gradient penalty based on samples from the data manifold and generator manifold or the interpolation of both was introduced in recent studies[4, 15, 9]. First, we propose strong conditions for the penalty measure.
The finite penalty measure $\mu$ satisfies the following:
and is independent of the discriminator parameter .
such that for .
The assumption given above means that the support of the penalty measure should approach the support of the data manifold smoothly as $\theta \to \theta^*$. Thus, the gradient penalty should be evaluated based on the data manifold and some open neighborhood near the equilibrium. However, the penalty measure from WGAN-GP with a simple gradient penalty still reaches equilibrium without satisfying Assumption 5c. Therefore, we suggest Assumption 6, which is a weak version of Assumption 5. Assumption 6a is technically required to take the derivative of the integral with respect to $\theta$.
Footnote 2: This condition is technically required to handle the derivative of the measure in a convenient manner using the weak formulation. Even if the measure is not differentiable, it may be possible to differentiate the integral. For instance, the Dirac measure $\delta_\theta$ is continuous but does not have a weak derivative. However, it is still possible to differentiate $\int f \, d\delta_\theta = f(\theta)$ if the function $f$ is differentiable at $\theta$.
(Weak version of Assumption 5) The finite penalty measure satisfies the following.
, where only depends on . Near the equilibrium, can be weakly differentiated twice with respect to . In addition, its mass is a twice-differentiable function of and bounded by near the equilibrium.
is positive definite or .
such that for , where .
In summary, the gradient penalty regularization term with any penalty measure whose support approaches the support of $p_d$ in a smooth manner works well, and this main result can explain the regularization effect of previously proposed penalty measures such as $\mu_{GP}$, $p_d$, $p_\theta$, and their mixtures.
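As a rough sketch of how points from such penalty measures might be drawn in practice, the code below samples from the data distribution, from the generator distribution, from uniform interpolations between paired real and generated samples (the WGAN-GP style choice), or from an equal-weight mixture of data and generator samples. The function name `sample_penalty_points`, the `kind` labels, and the batch layout are hypothetical; the sampling methods actually used in the experiments are those listed in Table 1.

```python
import numpy as np

def sample_penalty_points(x_real, x_fake, kind, rng):
    """Draw points at which the squared gradient norm would be evaluated.

    x_real, x_fake: arrays of shape (batch, dim) from the data and the generator.
    kind: 'data', 'generator', 'interpolation', or 'mixture' (illustrative labels).
    """
    if kind == "data":
        return x_real
    if kind == "generator":
        return x_fake
    if kind == "interpolation":
        # One uniform coefficient per pair, broadcast across feature dimensions.
        alpha = rng.uniform(0.0, 1.0, size=(x_real.shape[0], 1))
        return alpha * x_real + (1.0 - alpha) * x_fake
    if kind == "mixture":
        # Each row comes from the data or the generator with probability 1/2.
        pick_real = rng.random(size=(x_real.shape[0], 1)) < 0.5
        return np.where(pick_real, x_real, x_fake)
    raise ValueError(f"unknown penalty measure kind: {kind}")

rng = np.random.default_rng(0)
x_real = rng.normal(loc=2.0, size=(8, 2))    # placeholder "real" batch
x_fake = rng.normal(loc=-2.0, size=(8, 2))   # placeholder "generated" batch
x_hat = sample_penalty_points(x_real, x_fake, "interpolation", rng)
x_mix = sample_penalty_points(x_real, x_fake, "mixture", rng)
print(x_hat.shape, x_mix.shape)
```

As the generator distribution approaches the data distribution, all four choices concentrate on the data manifold, which is the behavior the assumptions on the support describe.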
4.2 Main Convergence Theorem
According to the modified assumptions given above, we prove that the related dynamic system is locally stable near the equilibrium. The tools used for analyzing stability are mainly based on those described by Nagarajan and Kolter [11]. Our main contributions comprise proposing the necessary conditions for the penalty measure and proving the local stability for all penalty measures that satisfy Assumption 6.
Suppose that our SGP $\mu$-WGAN optimization problem with equilibrium point $(\psi^*, \theta^*)$ satisfies the assumptions given above. Then, the related dynamic system is locally stable at the equilibrium.
A detailed proof of the main convergence theorem is given in the Appendix. A sketch of the proof is given in three steps. First, the undesired terms in the Jacobian matrix of the system at the equilibrium are cancelled out. Next, the Jacobian matrix at equilibrium is given by , where and . The system is locally stable when both and are positive definite. We can complete the proof by dealing with zero eigenvalues by showing that and the projected system’s stability implies the original system’s stability.
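The stability criterion used in this sketch can be checked numerically. The code below assumes the block form $J = \begin{pmatrix} -Q & P \\ -P^T & 0 \end{pmatrix}$ with $Q$ symmetric positive definite and $P$ of full column rank, and verifies on a random instance that every eigenvalue of $J$ has a negative real part. It also shows that a rank-deficient $P$ produces a zero eigenvalue, which is the situation handled by the projection argument. The matrix sizes and the random construction are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def block_jacobian(q, p):
    """Assemble J = [[-Q, P], [-P^T, 0]]."""
    m = p.shape[1]
    return np.block([[-q, p], [-p.T, np.zeros((m, m))]])

def max_real_eigenvalue(j):
    """Largest real part among the eigenvalues of j."""
    return float(np.max(np.linalg.eigvals(j).real))

rng = np.random.default_rng(0)
n_disc, n_gen = 5, 3          # illustrative parameter dimensions

a = rng.normal(size=(n_disc, n_disc))
q = a @ a.T + 0.1 * np.eye(n_disc)         # symmetric positive definite
p = rng.normal(size=(n_disc, n_gen))       # full column rank almost surely

stable_max = max_real_eigenvalue(block_jacobian(q, p))

# Rank-deficient P: a zero column leaves a zero eigenvalue, so the
# linearization alone no longer decides stability.
p_deficient = p.copy()
p_deficient[:, 0] = 0.0
deficient_max = max_real_eigenvalue(block_jacobian(q, p_deficient))

print("full rank P:", stable_max)
print("rank-deficient P:", deficient_max)
```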
Our analysis mainly focuses on WGAN, which is the simplest case of the general GAN minimax optimization with $f(x) = x$. A similar approach is still valid for general GANs with a concave function $f$ with $f''(x) < 0$ and $f'(0) \neq 0$.
5 Experimental Results
We claim that every penalty measure that satisfies the assumptions can regularize the WGAN and generate similar results to the recently proposed gradient penalty methods. Several penalty measures were tested based on two-dimensional problems (mixture of 8 Gaussians, mixture of 25 Gaussians, and swissroll), MNIST and CIFAR-10 datasets using a simple gradient penalty term. In the comparisons with WGAN, the recently proposed penalty measures and our test penalty measures used the same network settings and hyperparameters. The penalty measures and its detailed sampling methods are listed in Table 1, where , and . indicates fixed anchor point in .
|Penalty||Penalty term||Penalty measure, sampling method|
By setting the previously proposed WGAN with weight clipping and WGAN-GP as the baseline models, SGP $\mu$-WGAN was examined with various penalty measures comprising three recently proposed measures and two artificially generated measures. $p_d$ and $p_\theta$ were suggested by Mescheder et al. [9], and $\mu_{GP}$ was introduced from the WGAN-GP. We analyzed the two artificial penalty measures as the test penalty measures.
The experiments were conducted based on the implementation of Gulrajani et al. [4]. The hyperparameters, generator/discriminator structures, and related TensorFlow implementations can be found at https://github.com/igul222/improved_wgan_training. Only the loss function was modified slightly from a non-zero-centered gradient penalty to a simple penalty. For the CIFAR-10 image generation tasks, the inception score and FID were used as benchmark scores to evaluate the generated images.
5.1 2D Examples and MNIST
We checked the convergence of for the 2D examples (8 Gaussians, swissroll data, and 25 Gaussians) and MNIST digit generation for the SGP-WGANs with five penalty measures. MNIST and 25 Gaussians were trained over 200K iterations, the 8 Gaussians were trained for 30K iterations, and the Swiss Roll data were trained for 100K iterations. The anchor for was set as for the 2D examples and 784 gray pixels for MNIST. We only present the results obtained for the MNIST dataset with the penalty measures comprising and in Figure 1. The others are presented in the Appendix.
DCGAN and ResNet architectures were tested on the CIFAR-10 dataset. The generators were trained for 200K iterations. The anchor for during CIFAR-10 generation was set as fixed random pixels. The WGAN, WGAN-GP, and five penalty measures were evaluated based on the inception score and FID, as shown in Table 2, which are useful tools for scoring the quality of generated images. The images generated from and with ResNet are shown in Figure 2. The others are presented in the Appendix.
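For reference, the FID introduced by Heusel et al. compares Gaussian fits of feature statistics from real and generated images through the Fréchet distance $\|m_1 - m_2\|^2 + \mathrm{Tr}(C_1 + C_2 - 2(C_1 C_2)^{1/2})$. The sketch below computes this quantity from two feature arrays with NumPy only. In practice the features come from a pretrained Inception network, which is omitted here, and the random arrays are placeholders rather than experimental data.

```python
import numpy as np

def frechet_distance(feat_a, feat_b):
    """Fréchet distance between Gaussian fits of two feature sets (rows = samples)."""
    m_a, m_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    c_a = np.cov(feat_a, rowvar=False)
    c_b = np.cov(feat_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues of
    # C_a C_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(c_a @ c_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(np.sum((m_a - m_b) ** 2)
                 + np.trace(c_a) + np.trace(c_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(2000, 4))      # placeholder features
shift = np.array([1.0, 0.0, 0.0, 0.0])

same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, real_feats + shift)
print(same, shifted)
```

Identical feature sets give a distance of approximately zero, and a pure mean shift contributes exactly its squared length, so lower values indicate generated features that are statistically closer to the real ones.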
|WGAN (see footnote 3)||48.7||-||-|
Footnote 3: WGAN failed to generate images for the ResNet architecture.
In this study, we proved the local stability of simple gradient penalty $\mu$-WGAN optimization for a general class of finite measures $\mu$. This proof provides insight into the success of regularization with previously proposed penalty measures. We explored previously proposed analyses based on various gradient penalty methods. Furthermore, our theoretical approach was supported by experiments using unintuitive penalty measures. In future research, our work can be extended to the alternating gradient descent algorithm and its related optimal hyperparameters. Stability at non-realizable equilibrium points is another important topic in the stability of GANs. The optimal penalty measure for achieving the best convergence speed could also be investigated using spectral theory, which provides a mathematical analysis of GAN stability with precise information on the convergence.
We thank Dr Seok Hyun Hong for fruitful discussions about this study. Hyung Ju Hwang was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1E1A1A03070105).
- [1] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
- [2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214–223, 2017.
- [3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [4] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
- [5] B. Heidergott and F. J. Vázquez-Abad. Measure-valued differentiation for Markov chains. Journal of Optimization Theory and Applications, 136:187–209, 2008.
- [6] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
- [7] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5967–5976, 2017.
- [8] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 105–114, 2017.
- [9] Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning, pages 3478–3487, 2018.
- [10] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. In International Conference on Learning Representations, 2018.
- [11] Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
- [12] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
- [13] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
- [14] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1060–1069, 2016.
- [15] Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2015–2025, 2017.
- [16] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
- [17] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. In International Conference on Learning Representations, 2017.
- [18] Han Zhang, Tao Xu, and Hongsheng Li. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5908–5916, 2017.
Appendix A : Proof of Lemmas based on toy examples
Proof of Lemma 1.
The related dynamic system of can be written as follows.
First, the only equilibrium point is given by from
The corresponding Jacobian matrix for the dynamic system is written as:
does not depend on , so this can be rewritten as:
Therefore, if , then the given system is locally stable because the eigenvalues of its linearized system have negative real parts. If , then the stability of the system cannot be proved by the linearization theorem. In this case, we consider the following Lyapunov function.
By differentiating with , we obtain
Clearly, and the equality holds iff . In addition, since and from the assumption. Furthermore, it is clear that if , then for all because the Lyapunov function (square of the distance between the origin and ) always decreases as . Therefore, the given system is stable according to the Lyapunov stability theorem.
Again, we can check that if is a probability measure, then the system is globally stable, as shown by . The basin of attraction is given by the whole plane since , so for every . ∎
Proof of Lemma 2.
From the general setup of the SGP $\mu$-WGAN optimization problem, the dynamic system corresponding to the simple-GAN in Definition 6 can be written as follows.
If we let , then the Jacobian matrix at the equilibrium is given by . Therefore, the given system is locally stable when . ∎
Appendix B : Proof of Lemma related to Assumption 2
Consider the Dirac-GAN setup and SGP -WGAN optimization system with a slightly changed discriminator function . The system does not converge to but for any point with , the system has equilibrium points on the whole -axis and it violates Assumption 2.
Appendix C : Proof of the Main Convergence Theorem
Let us consider the Jacobian matrix at the first equilibrium.
Footnote 4: In standard notation, is the matrix. For a real-valued function , we consider the first derivative as the column vector instead of the row vector. is considered to be the matrix (column vector) of the total derivative. For the second derivative, is the matrix. The transpose notation is used in a similar manner to the matrix.
First, Assumption 1 implies that since as . From Assumption 3, is locally zero near the equilibrium , which implies that
We still need to evaluate and . According to Assumption 6a, finite signed measures and exist555 and will be considered as row vector( matrix) and matrix of finite signed measures respectively. and ., so they are the first and second weak derivatives of with respect to the parameter at . Therefore, the expectations given above can be rewritten as below.
From Assumption 6c and the fact that the weak derivative of vanishes outside of , on for all with and on the outside of , which leads to the desired results:
After cancelling the undesired terms, the Jacobian matrix at the equilibrium is given as:
From the definition of , it is easy to check that is at least positive semi-definite. It is known that for a negative definite matrix and full column rank matrix , the block matrix is Hurwitz, i.e., all eigenvalues of the matrix have a negative real part. Therefore, if is positive definite and is full column rank, the proof is complete. We consider the complementary case.
Suppose that or have some zero eigenvalues. Let and with and , where and
are the eigenvectors ofand that correspond to non-zero eigenvalues. First, we assume that and are not empty. We can show that is also an equilibrium point for a sufficiently small and by using the techniques given by . If the system does not update at the equilibrium point and its small neighborhood is perturbed along and , then it is reasonable to project the system orthogonal to and .
First, we assume that . By Assumption 2, for , which implies that for and . Thus, we obtain
By Assumption 4, since . By adding these equations, we obtain
Therefore, the point with is an equilibrium point. According to Assumption 4, is an equilibrium discriminator for , and thus is already an optimal discriminator for .
Suppose that . By Assumption 2, for , and thus for . Furthermore, Assumption 3 gives for a sufficiently close , which implies that for . Finally,
since and on