1 From Divergence to GAN
1.1 Divergence
Most Generative Adversarial Networks (GANs, Goodfellow et al. (2014)) are based on a certain form of probability divergence. A divergence is a function of two variables that satisfies the following definition:
Definition 1
If $D[p, q]$ is a function of two variables that satisfies the following properties:

$D[p, q] \geq 0$;

$D[p, q] = 0 \Leftrightarrow p = q$.

We say $D[p, q]$ is a divergence between $p$ and $q$.
Compared with the axiomatic definition of a distance, a divergence does not require symmetry or the triangle inequality. A divergence keeps only the fundamental property needed to measure the difference between $p$ and $q$.
1.2 Dual Form
If $p(x)$ and $q(x)$ represent two probability distributions, $D[p, q]$ becomes a functional and we call it a probability divergence. For example, we have the Jensen-Shannon divergence (JS divergence):

(1) $JS(p, q) = \frac{1}{2}\int p(x)\log\frac{2p(x)}{p(x)+q(x)}\,dx + \frac{1}{2}\int q(x)\log\frac{2q(x)}{p(x)+q(x)}\,dx$
In most cases, we can find a dual form for a probability divergence (Nowozin et al., 2016). For example, the dual form of the JS divergence is

(2) $JS(p, q) = \frac{1}{2}\max_{T}\Big(\mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \mathbb{E}_{x\sim q(x)}[\log(1-\sigma(T(x)))]\Big) + \log 2$

Here $\sigma(\cdot)$ is the sigmoid function. The dual form converts the integral of the original divergence into an expectation over samples, which allows us to estimate it by the Monte Carlo method. That is essential to GANs. The dual form always contains a max operation, which means a divergence is the supremum of a family of functions.
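To make the sampling estimate concrete, here is an illustrative sketch (not from the paper): it plugs a fixed discriminator, the closed-form log-density ratio of two toy 1-D Gaussians, into a Monte Carlo estimate of the dual objective (2). The distributions and all names here are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy 1-D Gaussians p = N(0, 1), q = N(3, 1).
mu_p, mu_q = 0.0, 3.0

def T(x):
    # For these Gaussians, log p(x) - log q(x) is linear in x; this is
    # the optimal discriminator for objective (2).
    return (mu_q**2 - mu_p**2) / 2 - (mu_q - mu_p) * x

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Monte Carlo estimate of the dual objective (2) for this fixed T.
n = 200_000
xp = rng.normal(mu_p, 1.0, n)
xq = rng.normal(mu_q, 1.0, n)
js_est = 0.5 * (np.mean(np.log(sigmoid(T(xp))))
                + np.mean(np.log(1.0 - sigmoid(T(xq))))) + np.log(2)
print(js_est)
```

Since the JS divergence is bounded above by $\log 2$, the estimate should land in $[0, \log 2)$ up to sampling noise.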
1.3 Min-Max Game
With the help of the dual form of a probability divergence, we can train a generator to produce the distribution we are interested in by playing a min-max game. For example, using the JS divergence we have

(3) $\min_G \max_T \; \mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \mathbb{E}_{z\sim q(z)}[\log(1-\sigma(T(G(z))))]$

For a fixed $T$, the goal of $G$ is

(4) $\min_G \; \mathbb{E}_{z\sim q(z)}[\log(1-\sigma(T(G(z))))]$

However, this loss is not always good for optimization, so we usually use an equivalent loss, such as the non-saturating loss $\min_G \mathbb{E}_{z\sim q(z)}[-\log\sigma(T(G(z)))]$. Namely, we may adjust the loss of the generator for better optimization, rather than playing the original min-max game.
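A one-line calculus check (illustrative, not from the paper) shows why the adjusted generator loss helps: when the discriminator confidently rejects a fake sample, the gradient of the original loss vanishes while the gradient of the non-saturating loss does not.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# A fake sample the discriminator confidently rejects: T(G(z)) is very negative.
t = -10.0

# d/dt of the original (saturating) loss log(1 - sigmoid(t)):
grad_saturating = -sigmoid(t)
# d/dt of the non-saturating loss -log(sigmoid(t)):
grad_non_saturating = sigmoid(t) - 1.0

print(grad_saturating)      # tiny: almost no learning signal
print(grad_non_saturating)  # close to -1: strong learning signal
```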
2 Divergence in Dual Space
2.1 Steps to GAN
From the above discussion, we can see that constructing a GAN usually involves three steps:

choose a probability divergence;

convert it into a dual form;

play a minmax game.
But only the last two steps matter in practice. The first step is a purely theoretical starting point and is not essential for a GAN. Therefore, a natural thought is: why not analyze the properties of a divergence, or even construct a new divergence, directly in dual space? The following content demonstrates that this is a very simple approach to building and understanding GANs.
2.2 SGAN
We start from the Standard GAN (SGAN, Goodfellow et al. (2014)) as an example to show how we can achieve this goal. From appendix A.1, we have Lemma 1:
Lemma 1
The following defines a probability divergence:

(5) $D[p, q] = \max_T \; \mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \mathbb{E}_{x\sim q(x)}[\log(1-\sigma(T(x)))] + 2\log 2$
It is worth highlighting that we prove $D[p, q]$ is a probability divergence (namely, that it satisfies Definition 1) directly in dual space, without needing the original definition of the JS divergence. Getting rid of the original definition of a divergence allows us to seek more powerful divergences in dual space.
Now we have a divergence defined by Lemma 1, so we can train a generator by minimizing $D[p, q]$, which results in the min-max game of SGAN.
The difficulty arises when there is almost no intersection between $p$ and $q$. For example, consider $p(x) = \delta(x - \alpha)$ and $q(x) = \delta(x - \beta)$ with $\alpha \neq \beta$. Now we have

(6) $D[p, q] = \max_T \; \log\sigma(T(\alpha)) + \log(1-\sigma(T(\beta))) + 2\log 2$

Because there is no constraint on $T$, we can let $T(\alpha) \to +\infty$ and $T(\beta) \to -\infty$ to obtain the maximum value of the above formula, that is

(7) $D[p, q] = 2\log 2$

So if there is almost no intersection between $p$ and $q$, their divergence is the constant $2\log 2$, whose gradients are zero. In this situation, the generator cannot improve via gradient descent. And we know this situation happens with very high probability (Arjovsky & Bottou, 2017). Therefore it is hard to train a good generative model under the framework of SGAN.
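A small numeric check of (6)-(7) (illustrative, not from the paper): however confident the discriminator becomes, the objective saturates at the constant $2\log 2$, and the distance between $\alpha$ and $\beta$ never enters the value.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgan_objective(t_alpha, t_beta):
    # Objective (6): log sigma(T(alpha)) + log(1 - sigma(T(beta))) + 2 log 2.
    return np.log(sigmoid(t_alpha)) + np.log(1 - sigmoid(t_beta)) + 2 * np.log(2)

# As the discriminator grows more confident, the value approaches 2 log 2,
# no matter how far apart alpha and beta are: their distance never appears.
for t in [1.0, 5.0, 10.0]:
    print(sgan_objective(t, -t))
print(2 * np.log(2))  # the constant ceiling, with zero gradient w.r.t. alpha, beta
```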
2.3 WGAN
We turn to a new kind of divergence given by Lemma 2:
Lemma 2
The following defines a probability divergence:

(8) $W[p, q] = \max_{T,\,\|T\|_L \leq 1} \; \mathbb{E}_{x\sim p(x)}[T(x)] - \mathbb{E}_{x\sim q(x)}[T(x)]$

here

(9) $\|T\|_L \leq 1 \;\Leftrightarrow\; |T(x_1) - T(x_2)| \leq d(x_1, x_2) \;\;\forall x_1, x_2$

and $d(x_1, x_2)$ is any distance metric of $x_1, x_2$. The L2 distance $\|x_1 - x_2\|_2$ is frequently used.
The proof is in appendix A.2. Interestingly, the proof is very simple compared with the corresponding duality proof for the Wasserstein distance.
Compared with $D[p, q]$, $W[p, q]$ can reasonably measure the difference between $p$ and $q$ even when they have almost no intersection. Let us again consider $p(x) = \delta(x - \alpha)$ and $q(x) = \delta(x - \beta)$:

(11) $W[p, q] = \max_{T,\,\|T\|_L \leq 1} \; T(\alpha) - T(\beta)$

The constraint $\|T\|_L \leq 1$ means $T(\alpha) - T(\beta) \leq d(\alpha, \beta)$. So we have

(12) $W[p, q] = d(\alpha, \beta)$

The result is not a constant and its gradients are not zero, so WGAN does not usually suffer from gradient vanishing.
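A one-dimensional sketch of (11)-(12) (illustrative, not from the paper): with an optimal 1-Lipschitz critic, the divergence equals the gap $|\alpha - \beta|$ and has a non-zero gradient with respect to the generator's output.

```python
import numpy as np

# For p = delta(x - alpha), q = delta(x - beta) in 1-D with d(x1, x2) = |x1 - x2|,
# an optimal 1-Lipschitz critic is T(x) = sign(alpha - beta) * x.
def w_div(alpha, beta):
    T = lambda x: np.sign(alpha - beta) * x
    return T(alpha) - T(beta)  # equals |alpha - beta| = d(alpha, beta)

print(w_div(0.0, 3.0))  # 3.0 -- grows with the gap, unlike the constant 2 log 2

# Finite-difference gradient w.r.t. alpha has magnitude 1, not 0:
eps = 1e-6
print((w_div(0.0 + eps, 3.0) - w_div(0.0, 3.0)) / eps)
```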
2.4 WGAN-GP
The essential problem of WGAN is how to constrain $T$ to satisfy $\|T\|_L \leq 1$, which currently has several solutions: weight clipping (WC, Arjovsky et al. (2017)), gradient penalty (GP, Gulrajani et al. (2017)) and spectral normalization (SN, Miyato et al. (2018)).
Weight clipping is often unstable and has been abandoned in most cases. Spectral normalization is a better operation, useful not only for WGANs but also for many other GANs; however, it constrains $T$ to a tiny subspace of all 1-Lipschitz functions, wasting some of the modeling power of $T$.
It seems the best approach now is gradient penalty. Gradient penalty replaces the constraint $\|T\|_L \leq 1$ with a constraint on the norm of the gradients $\|\nabla_x T(x)\|$, implemented as a penalty term:

(13) $\min_G \max_T \; \mathbb{E}_{x\sim p(x)}[T(x)] - \mathbb{E}_{x\sim q(x)}[T(x)] - \lambda\,\mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} T(\hat{x})\| - 1)^2\big]$

where $\lambda$ is a hyperparameter and $\hat{x}$ is a random linear interpolation of a real sample and a fake sample. This is called WGAN-GP. WGAN-GP works well in many cases but it is just an empirical trick. Recently some research has revealed some irrationality of WGAN-GP (Wu et al., 2017). Another disadvantage of WGAN-GP is its slow speed, since calculating the exact gradients requires heavier computation.
3 GAN with Quadratic Potential
3.1 A Quadratic Divergence
From the discussion of SGAN and WGAN, we can see that an ideal divergence should not place any constraint on $T$ (like SGAN) and should give a reasonable measurement of the difference between $p$ and $q$ even when they have almost no intersection (like WGAN).
Here we propose such a divergence:
Lemma 3
The following defines a probability divergence:

(14) $L[p, q] = \max_T \; \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\Big[T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda\, d(x_r, x_f)}\Big]$

$\lambda > 0$ is a hyperparameter and $d(x_r, x_f)$ is any distance metric of $x_r, x_f$. The L2 distance is frequently used.
The proof is shown in appendix A.3. It looks like a conventional divergence plus a quadratic potential, so we call it the quadratic potential divergence (QP-div).
Now we check $L[p, q]$ with $p(x) = \delta(x - \alpha)$, $q(x) = \delta(x - \beta)$ to demonstrate that QP-div behaves reasonably on two extreme distributions:

(15) $L[p, q] = \max_T \; T(\alpha, \beta) - T(\beta, \alpha) - \frac{(T(\alpha, \beta) - T(\beta, \alpha))^2}{2\lambda\, d(\alpha, \beta)}$

Let $t = T(\alpha, \beta) - T(\beta, \alpha)$; this is only a maximum value problem of the quadratic function $t - \frac{t^2}{2\lambda d(\alpha, \beta)}$. We know its maximum is $\frac{\lambda d(\alpha, \beta)}{2}$, so

(16) $L[p, q] = \frac{\lambda}{2}\, d(\alpha, \beta)$

We can see that $L[p, q]$ has a similar property to $W[p, q]$, but with no constraints on $T$. This is very friendly to practice.
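The maximum-value problem in (15)-(16) can be checked numerically (the values of $\lambda$ and $d(\alpha, \beta)$ below are illustrative):

```python
import numpy as np

# Maximize t - t^2 / (2 * lam * d) over t, cf. (15)-(16); toy values.
lam, d = 1.0, 3.0
t = np.linspace(-10, 10, 200_001)
vals = t - t**2 / (2 * lam * d)

print(t[np.argmax(vals)])  # maximizer t* = lam * d = 3.0
print(vals.max())          # maximum value lam * d / 2 = 1.5
```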
3.2 From QP-div to GAN-QP
In theory, we can play a min-max game on QP-div to train a generative model:

(17) $\min_G \max_T \; \mathbb{E}_{x_r\sim p(x),\, x_f = G(z),\, z\sim q(z)}\Big[T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda\, d(x_r, x_f)}\Big]$

However, the full objective is not a good loss for the generator because there is a $d(x_r, x_f)$ in the denominator. The generator wants to minimize the objective, which will correspondingly minimize $d(x_r, x_f)$. And we know any ready-to-use distance may not be a perfect metric between two samples.
We find that using $T(x_r, x_f) - T(x_f, x_r)$ as the loss of the generator is enough. This results in the following generative model, called GAN with Quadratic Potential (GAN-QP):

(18) $\begin{aligned} T &= \arg\max_T \; \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\Big[T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda\, d(x_r, x_f)}\Big] \\ G &= \arg\min_G \; \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\big[T(x_r, x_f) - T(x_f, x_r)\big] \end{aligned}$
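The two losses in (18) can be sketched as plain functions of the critic's outputs (numpy; the batch values below are made up for illustration, and in a real model `t_rf`, `t_fr` would come from a network):

```python
import numpy as np

def ganqp_d_loss(t_rf, t_fr, d_rf, lam=1.0):
    # Discriminator maximizes the objective of (18); equivalently it
    # minimizes the negation. t_rf = T(x_r, x_f), t_fr = T(x_f, x_r),
    # d_rf = d(x_r, x_f), all per sample pair.
    diff = t_rf - t_fr
    return -np.mean(diff - diff**2 / (2 * lam * d_rf))

def ganqp_g_loss(t_rf, t_fr):
    # Generator minimizes only T(x_r, x_f) - T(x_f, x_r):
    # no d(x_r, x_f) appears in a denominator.
    return np.mean(t_rf - t_fr)

# Toy critic outputs for a batch of 3 real/fake pairs (made-up numbers):
t_rf = np.array([0.8, 1.2, 0.5])
t_fr = np.array([-0.3, 0.1, 0.0])
d_rf = np.array([2.0, 3.0, 1.5])
print(ganqp_d_loss(t_rf, t_fr, d_rf), ganqp_g_loss(t_rf, t_fr))
```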
Further discussion can be found in appendix B.
4 Experiments
4.1 Experimental Details
Our experiments are mainly conducted on the CelebA HQ dataset (Karras et al., 2017). Our basic setup follows DCGAN (Radford et al., 2015), is implemented in Keras (Chollet et al., 2015), and is available in my repository (https://github.com/bojone/ganqp). We use the Adam optimizer (Kingma & Ba, 2014), with the same constant learning rate for both $T$ and $G$. We train GAN-QP with two D steps per G step. In (18), the discriminator is a model that takes both a real sample and a fake sample as inputs. But in our experiments, we find that using just one sample as input already yields good performance. In other words, $T(x_r, x_f) = T(x_r)$ is enough. We tried an architecture taking both samples as inputs, but there was no obvious improvement. So the final loss we use is

(19) $\begin{aligned} T &= \arg\max_T \; \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\Big[T(x_r) - T(x_f) - \frac{(T(x_r) - T(x_f))^2}{2\lambda\, d(x_r, x_f)}\Big] \\ G &= \arg\min_G \; \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\big[T(x_r) - T(x_f)\big] \end{aligned}$

The hyperparameter $\lambda$ is set according to the input size (20), where $w$, $h$ and $c$ are the width, height and number of channels of the input images. We test both the L1 norm and the L2 norm in our experiments; they show no statistically significant difference.
The quantitative index we use to evaluate a GAN is the Frechet Inception Distance (FID, Heusel et al. (2017)), which we also reimplement in Keras.
4.2 Basic Comparison
Firstly, we compare GAN-QP with WGAN-GP, WGAN-SN (WGAN with spectral normalization), SGAN-SN (SGAN with spectral normalization) and LSGAN-SN (LSGAN with spectral normalization) at 128x128 resolution. The $\lambda$ we use in WGAN-GP is 10. Batch Normalization is removed from the discriminator of WGAN-GP, and the other hyperparameters are the same as for GAN-QP. Each experiment is repeated twice to obtain a more reliable conclusion. The comparison is shown in Figure 1 and Table 1. We can see that there is no obvious difference between GAN-QP-L1 and GAN-QP-L2, which means GAN-QP is robust to the choice of distance metric. The best two results come from GAN-QP and SGAN-SN; the worst is WGAN-GP. Generally, the FID curves of WGAN-SN and SGAN-SN are smoother while GAN-QP's is more jittery, but GAN-QP reaches the same best performance as SGAN-SN.
GAN-QP-L1 / L2  WGAN-GP  WGAN-SN  SGAN-SN  LSGAN-SN
Best FID  45.0 / 44.7  55.5  47.8  44.5  45.8
Speed  1x / 1x  1.5x  1x  1x  1x
4.3 Higher Resolution
At 128x128 resolution, SGAN-SN and GAN-QP have the same best performance. If we turn to 256x256 resolution, GAN-QP achieves a better FID than SGAN-SN (Table 2). It even works well at 512x512 resolution (Figure 2).
GAN-QP  SGAN-SN
Best FID  22.7  27.9
5 Conclusion
In this paper, we demonstrate that we can explore probability divergences directly in dual space, which is more convenient and flexible. Starting from this idea, we discover a novel divergence called QP-div, which has excellent characteristics: it requires neither the 1-Lipschitz constraint nor an exact gradient penalty. As a concrete example, we construct a new GAN framework equipped with QP-div: GAN-QP. The experiments demonstrate the stability and superiority of GAN-QP.
References
Arjovsky, M. & Bottou, L. (2017). Towards Principled Methods for Training Generative Adversarial Networks.
Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN.
Chollet, F. et al. (2015). Keras. https://keras.io.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. Advances in Neural Information Processing Systems, 2672-2680.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
Karras, T., Aila, T., Laine, S. & Lehtinen, J. (2017). Progressive Growing of GANs for Improved Quality, Stability, and Variation.
Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization.
Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z. & Smolley, S. P. (2016). Least Squares Generative Adversarial Networks.
Miyato, T., Kataoka, T., Koyama, M. & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks.
Nowozin, S., Cseke, B. & Tomioka, R. (2016). f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization.
Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
Wu, J., Huang, Z., Thoma, J., Acharya, D. & Van Gool, L. (2017). Wasserstein Divergence for GANs.
Appendix A Detailed Derivation
A.1 $D[p, q]$ is a divergence
Firstly, it is trivial to see that $D[p, q]$ is non-negative, since we can always let $T(x) \equiv 0$:

(21) $D[p, q] \geq \mathbb{E}_{x\sim p(x)}\big[\log\tfrac{1}{2}\big] + \mathbb{E}_{x\sim q(x)}\big[\log\tfrac{1}{2}\big] + 2\log 2 = 0$

Next, we show $D[p, p] = 0$, which is also simple:

(22) $D[p, p] = \max_T \; \mathbb{E}_{x\sim p(x)}\big[\log\sigma(T(x)) + \log(1-\sigma(T(x)))\big] + 2\log 2$

It is not difficult to prove that the maximum value of $\log\sigma(t) + \log(1-\sigma(t))$ is $-2\log 2$, attained at $t = 0$, so we have

(23) $D[p, p] = 0$

Finally, we show $D[p, q] > 0$ if $p \neq q$ (strictly, pointwise $p \neq q$ is not enough; the sufficient condition is that $p$ and $q$ differ on a set of positive measure). Let

(24) $T(x) = \log\frac{p(x)}{q(x)}$

we have $\sigma(T(x)) = \frac{p(x)}{p(x) + q(x)}$, so

(25) $D[p, q] \geq \mathbb{E}_{x\sim p(x)}\Big[\log\frac{2p(x)}{p(x)+q(x)}\Big] + \mathbb{E}_{x\sim q(x)}\Big[\log\frac{2q(x)}{p(x)+q(x)}\Big] = 2\,JS(p, q)$

Because the JS divergence is positive when $p \neq q$, we have

(26) $D[p, q] > 0$
A.2 $W[p, q]$ is a divergence
Firstly, it is trivial to see that $W[p, q]$ is non-negative, since we can always let $T(x) \equiv 0$:

(27) $W[p, q] \geq \mathbb{E}_{x\sim p(x)}[0] - \mathbb{E}_{x\sim q(x)}[0] = 0$

Next, $W[p, p] = 0$ is also trivial. So we only need to show $W[p, q] > 0$ if $p \neq q$. It is actually not difficult: since $p \neq q$, we can choose some bounded Lipschitz function $T_0$ with

(28) $\mathbb{E}_{x\sim p(x)}[T_0(x)] - \mathbb{E}_{x\sim q(x)}[T_0(x)] > 0$

and rescale it by a positive constant so that $\|T_0\|_L \leq 1$; rescaling keeps the expectation difference positive. That means $W[p, q] > 0$.
A.3 $L[p, q]$ is a divergence
Firstly, it is trivial to see that $L[p, q]$ is non-negative, since we can always let $T \equiv 0$:

(29) $L[p, q] \geq 0$

Next, for $q = p$, the linear term vanishes by symmetry of the sampling, leaving

(30) $L[p, p] = \max_T \; -\,\mathbb{E}_{(x_1, x_2)\sim p(x_1)p(x_2)}\Big[\frac{(T(x_1, x_2) - T(x_2, x_1))^2}{2\lambda\, d(x_1, x_2)}\Big]$

Obviously, the maximum value is zero. So $L[p, p] = 0$.
Finally, if $p \neq q$, we let

(31) $T(x_r, x_f) = \varepsilon\,\mathrm{sign}\big(p(x_r)q(x_f) - p(x_f)q(x_r)\big)$

now we have

(32) $L[p, q] \geq \varepsilon A - \varepsilon^2 B$

where $A = \iint \big|p(x_r)q(x_f) - p(x_f)q(x_r)\big|\,dx_r\,dx_f > 0$ and $B \geq 0$ collects the quadratic term. If $B = 0$, then

(33) $L[p, q] \geq \varepsilon A > 0 \quad \text{for any } \varepsilon > 0$

else if $B > 0$, we can define

(34) $\varepsilon = \frac{A}{2B}$

then

(35) $L[p, q] \geq \frac{A^2}{4B} > 0$

Therefore $L[p, q] > 0$.
Appendix B Analysis of GAN-QP
B.1 Optimum Solution of $T$
Lemma 4
The optimum solution of (14) satisfies

(36) $T(x_r, x_f) - T(x_f, x_r) = \lambda\, d(x_r, x_f)\,\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)}$

The proof is a basic application of the variational method. Write $s(x_r, x_f) = T(x_r, x_f) - T(x_f, x_r)$ and symmetrize the objective of (14):

(37) $\frac{1}{2}\iint \Big[\big(p(x_r)q(x_f) - p(x_f)q(x_r)\big)\,s(x_r, x_f) - \big(p(x_r)q(x_f) + p(x_f)q(x_r)\big)\,\frac{s(x_r, x_f)^2}{2\lambda\, d(x_r, x_f)}\Big]\,dx_r\,dx_f$

The derivative of the expression in square brackets with respect to $s$ must be identically equal to zero at the optimum. Therefore

(38) $s(x_r, x_f) = \lambda\, d(x_r, x_f)\,\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)}$
B.2 Adaptive Lipschitz Constraint
From (38), since the ratio $\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)}$ lies in $[-1, 1]$, it is not difficult to prove that the optimum $T$ satisfies

(39) $|T(x_r, x_f) - T(x_f, x_r)| \leq \lambda\, d(x_r, x_f)$

In other words, the optimum $T$ satisfies a Lipschitz constraint automatically. Therefore we can say QP-div is a divergence with an adaptive Lipschitz constraint.
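A quick numeric sanity check of the bound (39), with random positive values standing in for the two density products (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, d = 2.0, 1.5
a = rng.uniform(0.01, 1.0, 1000)  # stands for p(x_r) q(x_f)
b = rng.uniform(0.01, 1.0, 1000)  # stands for p(x_f) q(x_r)

# Optimal T difference from (38); its magnitude never exceeds lam * d.
s = lam * d * (a - b) / (a + b)
print(np.abs(s).max(), "<=", lam * d)
```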
B.3 The Divergence of the Generator
We use $T(x_r, x_f) - T(x_f, x_r)$ rather than the whole objective of (17) as the loss of the generator in GAN-QP. And we have solved for the optimum $T$ in Lemma 4. Substituting (38), we can see that the ultimate goal of the generator is to minimize

(40) $L_G[p, q] = \frac{\lambda}{2}\iint d(x_r, x_f)\,\frac{\big(p(x_r)q(x_f) - p(x_f)q(x_r)\big)^2}{p(x_r)q(x_f) + p(x_f)q(x_r)}\,dx_r\,dx_f$

Now we have Lemma 5:
Lemma 5
The quantity

(41) $L_G[p, q] = \frac{\lambda}{2}\iint d(x_r, x_f)\,\frac{\big(p(x_r)q(x_f) - p(x_f)q(x_r)\big)^2}{p(x_r)q(x_f) + p(x_f)q(x_r)}\,dx_r\,dx_f$

is also a probability divergence of $p, q$.
Actually Lemma 5 is a consequence of the Cauchy-Schwarz inequality. Firstly we let

(42) $f(x_r, x_f) = \frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{\sqrt{p(x_r)q(x_f) + p(x_f)q(x_r)}}$

Then by the Cauchy-Schwarz inequality we have

(43) $\Big(\iint \big|p(x_r)q(x_f) - p(x_f)q(x_r)\big|\,dx_r\,dx_f\Big)^2 \leq \iint d(x_r, x_f)\,f(x_r, x_f)^2\,dx_r\,dx_f \cdot \iint \frac{p(x_r)q(x_f) + p(x_f)q(x_r)}{d(x_r, x_f)}\,dx_r\,dx_f$

So

(44) $L_G[p, q] \geq \frac{\lambda}{2}\cdot\frac{\Big(\iint \big|p(x_r)q(x_f) - p(x_f)q(x_r)\big|\,dx_r\,dx_f\Big)^2}{\iint \frac{p(x_r)q(x_f) + p(x_f)q(x_r)}{d(x_r, x_f)}\,dx_r\,dx_f}$

The two sides are equal if and only if $f \equiv 0$, which means $p(x_r)q(x_f) = p(x_f)q(x_r)$ for all $x_r, x_f$, i.e. $p = q$. Therefore $L_G[p, q]$ is a probability divergence, which means lowering it actually lowers the difference between $p$ and $q$. The divergence is weighted by $d(x_r, x_f)$, forcing the generator to focus on sample pairs with larger distance, which is in line with our intuition.
B.4 Performance with No Intersection
We have shown that the QP-div in Lemma 3 works well even when there is no intersection between $p$ and $q$. But now we use $T(x_r, x_f) - T(x_f, x_r)$ as the loss of the generator, corresponding to the new divergence $L_G[p, q]$. Therefore we have to check the performance of $L_G$ with $p(x) = \delta(x - \alpha)$, $q(x) = \delta(x - \beta)$. That is very easy:

(45) $L_G[p, q] = \frac{\lambda}{2}\big(d(\alpha, \beta) + d(\beta, \alpha)\big)$

We know $d(\alpha, \beta) = d(\beta, \alpha)$ for a distance metric, so the result is $\lambda\, d(\alpha, \beta)$, actually a reasonable measurement.
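The delta-distribution case (45) can be verified with a discrete stand-in: represent $p = \delta$ at $\alpha$ and $q = \delta$ at $\beta$ as one-hot vectors and evaluate (41) as a finite sum (the values below are illustrative):

```python
import numpy as np

lam = 2.0
p = np.array([1.0, 0.0])   # delta at alpha (index 0)
q = np.array([0.0, 1.0])   # delta at beta (index 1)
d = np.array([[0.0, 3.0],
              [3.0, 0.0]])  # d(alpha, beta) = 3

# Discrete version of (41): (lam/2) * sum d_ij (p_i q_j - p_j q_i)^2 / (p_i q_j + p_j q_i),
# with 0/0 terms treated as 0.
num = (np.outer(p, q) - np.outer(q, p)) ** 2
den = np.outer(p, q) + np.outer(q, p)
ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
L_G = 0.5 * lam * np.sum(d * ratio)
print(L_G)  # lam * d(alpha, beta) = 6.0
```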
B.5 Robustness of $\lambda$
Equation (38) shows that $\lambda$ is just a scale factor for the optimum $T$. That means GAN-QP is insensitive to the hyperparameter $\lambda$, which is different from WGAN-GP. We only need to choose a suitable $\lambda$ to make the loss more readable (not very large and not very small).
Appendix C Future Work
C.1 A Conjecture
Inspired by the form of QP-div (14), it may be extended as:
Conjecture 1
If

(46) $D[p, q] = \max_T \; \mathbb{E}_{x\sim p(x)}[f(T(x))] + \mathbb{E}_{x\sim q(x)}[g(T(x))]$

is a probability divergence of $p$ and $q$, so is

(47) $D_\lambda[p, q] = \max_T \; \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\Big[f(T(x_r)) + g(T(x_f)) - \frac{(T(x_r) - T(x_f))^2}{2\lambda\, d(x_r, x_f)}\Big]$

for some $\lambda > 0$.
Conjecture 1 means we can use

(48) $-\,\frac{(T(x_r) - T(x_f))^2}{2\lambda\, d(x_r, x_f)}$

as a penalty term on any other GAN's discriminator to enhance the original GAN.
C.2 Example
For example, we can enhance SGAN by adding the quadratic potential term (48) to its discriminator loss.