 # GAN-QP: A Novel GAN Framework without Gradient Vanishing and Lipschitz Constraint

The standard GAN (SGAN) is prone to vanishing gradients. WGAN is a significant improvement: it imposes a 1-Lipschitz constraint on the discriminator to prevent gradient vanishing. Is there a GAN with neither gradient vanishing nor a 1-Lipschitz constraint on the discriminator? We find one, called GAN-QP. Constructing a new Generative Adversarial Network (GAN) framework usually involves three steps: 1. choose a probability divergence; 2. convert it into a dual form; 3. play a min-max game. In this article, we demonstrate that the first step is not necessary. We can analyse the properties of a divergence, and even construct new divergences, directly in the dual space. As a reward, we obtain a simpler alternative to WGAN: GAN-QP. We demonstrate that GAN-QP performs better than WGAN in theory and practice.


## 1 From Divergence to GAN

### 1.1 Divergence

Most Generative Adversarial Networks (GANs, Goodfellow et al. (2014)) are based on a certain form of probability divergence. A divergence is a function of two variables that satisfies the following definition:

###### Definition 1

If $D[p, q]$ is a function of two variables $p, q$ that satisfies the following properties:

1. $D[p, q] \geq 0$;

2. $D[p, q] = 0 \Leftrightarrow p = q$.

We say $D[p, q]$ is a divergence between $p$ and $q$.

Compared with the axiomatic definition of a distance, a divergence does not need to satisfy symmetry or the triangle inequality. A divergence keeps only the fundamental property needed to measure the difference between $p$ and $q$.

### 1.2 Dual Form

If $p(x), q(x)$ represent two probability distributions, $D[p(x), q(x)]$ becomes a functional and we call it a probability divergence. For example, we have the Jensen-Shannon divergence (JS divergence):

$$JS[p(x), q(x)] = \frac{1}{2}\int p(x)\log\frac{p(x)}{\frac{1}{2}[p(x)+q(x)]}dx + \frac{1}{2}\int q(x)\log\frac{q(x)}{\frac{1}{2}[p(x)+q(x)]}dx \tag{1}$$

In most cases, we can find a dual form for a probability divergence (Nowozin et al., 2016). For example, the dual form of the JS divergence is

$$\begin{aligned} JS[p(x), q(x)] &= \max_T \frac{1}{2}\int p(x)\log\sigma(T(x))dx + \frac{1}{2}\int q(x)\log(1-\sigma(T(x)))dx + \log 2 \\ &= \max_T \frac{1}{2}\mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \frac{1}{2}\mathbb{E}_{x\sim q(x)}[\log(1-\sigma(T(x)))] + \log 2 \end{aligned} \tag{2}$$

Here $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function. The dual form converts the integral of the original divergence into expectations over samples, which allows us to estimate it by the Monte Carlo method. That is essential to GANs. The dual form always has a max operation, which means a divergence is the supremum of a family of functions.
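As a small sanity check of the dual form, the sketch below compares the primal JS divergence (1) with the dual objective (2) on a discrete toy example. The distributions `p` and `q` are made-up values, and the critic $T = \log(p/q)$, for which $\sigma(T) = p/(p+q)$, is assumed to be the maximiser; this is an illustration, not the paper's code.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def js_divergence(p, q):
    """Primal JS divergence (1) for discrete distributions with positive entries."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = 0.5 * (pi + qi)
        total += 0.5 * pi * math.log(pi / m) + 0.5 * qi * math.log(qi / m)
    return total

def js_dual_objective(p, q, T):
    """Dual objective (2) evaluated at a given critic T (one value per outcome)."""
    total = math.log(2.0)
    for pi, qi, ti in zip(p, q, T):
        s = sigmoid(ti)
        total += 0.5 * pi * math.log(s) + 0.5 * qi * math.log(1.0 - s)
    return total

# Hypothetical toy distributions over four outcomes
p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]

T_opt = [math.log(pi / qi) for pi, qi in zip(p, q)]  # gives sigma(T) = p / (p + q)
T_zero = [0.0] * len(p)                              # a deliberately poor critic

primal = js_divergence(p, q)
dual_at_opt = js_dual_objective(p, q, T_opt)    # matches the primal value
dual_at_zero = js_dual_objective(p, q, T_zero)  # a lower bound (zero here)
```

In a GAN the sums over outcomes are replaced by averages over mini-batches of real and generated samples, which is what makes the dual form usable in practice.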

### 1.3 Min-Max Game

With the help of the dual form of a probability divergence, we can train a generator to produce the distribution we are interested in by playing a min-max game. For example, using (2) we have

$$G, T = \mathop{\arg\min}_G \mathop{\arg\max}_T \mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \mathbb{E}_{x=G(z), z\sim q(z)}[\log(1-\sigma(T(x)))] \tag{3}$$

For a fixed $T$, the goal of $G$ is

$$\begin{aligned} G &= \mathop{\arg\min}_G \mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \mathbb{E}_{x=G(z), z\sim q(z)}[\log(1-\sigma(T(x)))] \\ &= \mathop{\arg\min}_G \mathbb{E}_{x=G(z), z\sim q(z)}[\log(1-\sigma(T(x)))] \end{aligned} \tag{4}$$

However, the loss $\log(1-\sigma(T(x)))$ is not always good for optimization, so we usually use an equivalent loss, such as $-\log\sigma(T(x))$ or $-T(x)$. Namely, we may adjust the loss of the generator for better optimization, rather than playing the original min-max game.
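To illustrate why the original generator loss can be poor for optimization, the sketch below compares the derivative of three candidate losses with respect to the critic output $t = T(G(z))$. The two substitutes, $-\log\sigma(t)$ and $-t$, are assumed here as commonly used alternatives; the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_saturating(t):
    """d/dt of log(1 - sigmoid(t)), the generator loss in (4)."""
    return -sigmoid(t)

def grad_non_saturating(t):
    """d/dt of -log(sigmoid(t))."""
    return -(1.0 - sigmoid(t))

def grad_linear(t):
    """d/dt of -t."""
    return -1.0

# When the discriminator confidently rejects a fake sample, t is very negative.
for t in (-10.0, -2.0, 0.0, 2.0):
    print(t, grad_saturating(t), grad_non_saturating(t), grad_linear(t))
```

For very negative $t$ the first derivative is close to zero, so the generator receives almost no signal exactly when it is doing worst, whereas the other two losses keep a derivative of magnitude close to one.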

## 2 Divergence in Dual Space

### 2.1 Steps to GAN

From the above discussion, we can see that to construct a GAN usually includes three steps:

1. choose a probability divergence;

2. convert it into a dual form;

3. play a min-max game.

But only the last two steps are useful in practice. The first step is a theoretical concept and is not very important for a GAN. Therefore, a natural thought is: why not analyse the properties of a divergence, and even construct new divergences, directly in the dual space? The following sections demonstrate that this is a very simple approach to building and understanding GANs.

### 2.2 SGAN

We start from the Standard GAN (SGAN, Goodfellow et al. (2014)) as an example to show how we can achieve this goal. From Appendix A.1, we have Lemma 1:

###### Lemma 1

The following defines a probability divergence

$$D[p(x), q(x)] = \max_T \frac{1}{2}\mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \frac{1}{2}\mathbb{E}_{x\sim q(x)}[\log(1-\sigma(T(x)))] + \log 2 \tag{5}$$

It is worth highlighting that we prove (5) is a probability divergence (namely, that it satisfies Definition 1) in the dual space, without needing the original definition (1). Getting rid of the original definition of the divergence allows us to seek more powerful divergences in the dual space.

Now we have a divergence defined by Lemma 1, so we can train a generator by minimizing $D[p(x), q(x)]$, which results in the min-max game (3).

The difficulty arises when there is almost no intersection between $p(x)$ and $q(x)$. For example, consider $p(x) = \delta(x-\alpha)$ and $q(x) = \delta(x-\beta)$ with $\alpha \neq \beta$. Now we have

$$D[p(x), q(x)] = \max_T \frac{1}{2}\log\sigma(T(\alpha)) + \frac{1}{2}\log(1-\sigma(T(\beta))) + \log 2 \tag{6}$$

Because there is no constraint on $T$, we can let $T(\alpha) \to +\infty$ and $T(\beta) \to -\infty$ to obtain the maximum value of the above formula, that is

$$D[p(x), q(x)] = \log 2 \tag{7}$$

So if there is almost no intersection between $p(x)$ and $q(x)$, their divergence is the constant $\log 2$, whose gradient is zero. In this situation, the generator cannot improve via gradient descent. And we know this situation happens with very high probability (Arjovsky & Bottou, 2017). Therefore it is hard to train a good generative model under the SGAN framework.
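The saturation can be seen numerically. The sketch below evaluates the two-point objective for two point masses along the path $T(\alpha) = c$, $T(\beta) = -c$ (a path chosen here purely for illustration) and shows the value climbing towards $\log 2$ without ever depending on where the two points are.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def sgan_two_point_objective(t_alpha, t_beta):
    """SGAN dual objective for two point masses; only T(alpha), T(beta) matter."""
    # log(1 - sigmoid(t)) equals log(sigmoid(-t))
    return 0.5 * log_sigmoid(t_alpha) + 0.5 * log_sigmoid(-t_beta) + math.log(2.0)

# Push T(alpha) up and T(beta) down along a hypothetical path c -> infinity.
values = [sgan_two_point_objective(c, -c) for c in (0.0, 1.0, 5.0, 20.0)]
```

Neither $\alpha$ nor $\beta$ appears as an argument of the objective, which is another way of saying that moving the generated point gives the generator no gradient.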

These conclusions generalize in parallel to any kind of $f$-GAN (Nowozin et al., 2016), including LSGAN (Mao et al., 2016), and all of them suffer from the same difficulty.

### 2.3 WGAN

We turn to a new kind of divergence by Lemma 2:

###### Lemma 2

The following defines a probability divergence

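$$W[p(x), q(x)] = \max_{T, \|T\|_L \leq 1} \mathbb{E}_{x\sim p(x)}[T(x)] - \mathbb{E}_{x\sim q(x)}[T(x)] \tag{8}$$

Here

$$\|T\|_L = \max_{x \neq y} \frac{|T(x) - T(y)|}{d(x, y)} \tag{9}$$

and $d(x, y)$ is any distance metric of $x, y$. A norm-induced distance such as $\|x - y\|_2$ is frequently used.

The proof is in Appendix A.2. Interestingly, the proof is very simple compared with the corresponding proof for the $f$-divergence.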

Now we can play a new min-max game:

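$$G, T = \mathop{\arg\min}_G \mathop{\arg\max}_{T, \|T\|_L \leq 1} \mathbb{E}_{x\sim p(x)}[T(x)] - \mathbb{E}_{x=G(z), z\sim q(z)}[T(x)] \tag{10}$$

That is what we call WGAN (Arjovsky et al., 2017).

Compared with $D[p(x), q(x)]$, $W[p(x), q(x)]$ can reasonably measure the difference between $p(x)$ and $q(x)$ even when they have almost no intersection. Let us consider $p(x) = \delta(x-\alpha)$ and $q(x) = \delta(x-\beta)$:

$$W[p(x), q(x)] = \max_{T, \|T\|_L \leq 1} T(\alpha) - T(\beta) \tag{11}$$

The constraint $\|T\|_L \leq 1$ means $|T(\alpha) - T(\beta)| \leq d(\alpha, \beta)$. So we have

$$W[p(x), q(x)] = d(\alpha, \beta) \tag{12}$$

The result is not a constant and its gradient is not zero, so WGAN usually does not suffer from gradient vanishing.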
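A one-dimensional sketch of the same calculation, assuming $d(x, y) = |x - y|$ and the particular 1-Lipschitz critic $T(x) = |x - \beta|$ (one critic that attains the bound, chosen here for illustration). The finite-difference slope shows that the generator still receives a non-zero gradient with respect to the position of its point mass.

```python
def critic(x, beta):
    """A 1-Lipschitz critic on the real line: T(x) = |x - beta|."""
    return abs(x - beta)

def wgan_two_point(alpha, beta):
    """Value of T(alpha) - T(beta) for the critic above."""
    return critic(alpha, beta) - critic(beta, beta)

def finite_diff(f, x, eps=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

beta = 0.0
w_at_3 = wgan_two_point(3.0, beta)                              # equals d(3, 0)
slope = finite_diff(lambda a: wgan_two_point(a, beta), 3.0)     # close to 1
```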

### 2.4 WGAN-GP

The essential problem of WGAN is how to enforce $\|T\|_L \leq 1$ on $T$, for which there are currently several solutions: weight clipping (WC, Arjovsky et al. (2017)), gradient penalty (GP, Gulrajani et al. (2017)) and spectral normalization (SN, Miyato et al. (2018)).

Weight clipping is unstable and has been abandoned in most cases. Spectral normalization is a better operation, not only for WGANs but also for many other GANs, but it confines $T$ to a tiny subspace of $\|T\|_L \leq 1$, wasting the modeling power of $T$.

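Gradient penalty currently seems to be the best approach. It replaces $\|T\|_L$ with the norm of the gradient $\|\nabla_x T(x)\|$ and implements the constraint as a penalty term:

$$\begin{aligned} T &= \mathop{\arg\max}_T \mathbb{E}_{x\sim p(x)}[T(x)] - \mathbb{E}_{x\sim q(x)}[T(x)] - \lambda \mathbb{E}_{x\sim p_q(x)}\left[(\|\nabla_x T(x)\| - 1)^2\right] \\ G &= \mathop{\arg\min}_G \mathbb{E}_{x=G(z), z\sim q(z)}[-T(x)] \end{aligned} \tag{13}$$

where $\lambda > 0$ is a hyperparameter and $p_q(x)$ is the random linear interpolation of $p(x)$ and $q(x)$. It is called WGAN-GP.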
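The penalty term can be sketched in a few lines. The version below is only an illustration: it estimates the gradient by central differences instead of automatic differentiation (which a real implementation would use), and the two linear critics and the sample batches are hypothetical.

```python
import math
import random

def numerical_gradient(T, x, eps=1e-5):
    """Central-difference gradient of a scalar critic T at point x (a list)."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((T(hi) - T(lo)) / (2.0 * eps))
    return grad

def gradient_penalty(T, real_batch, fake_batch, lam=10.0, rng=random):
    """Monte Carlo estimate of lam * E[(||grad T(x)|| - 1)^2] on interpolates."""
    total = 0.0
    for xr, xf in zip(real_batch, fake_batch):
        eps = rng.random()  # uniform mixing weight in [0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(xr, xf)]
        g = numerical_gradient(T, x_hat)
        norm = math.sqrt(sum(gi * gi for gi in g))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)

# Hypothetical linear critics: the gradient of w.x is w everywhere.
unit_critic = lambda x: 0.6 * x[0] + 0.8 * x[1]    # ||w|| = 1, no penalty
steep_critic = lambda x: 3.0 * x[0] + 4.0 * x[1]   # ||w|| = 5, penalised

real = [[1.0, 2.0], [0.5, -1.0]]
fake = [[-1.0, 0.0], [2.0, 2.0]]
gp_unit = gradient_penalty(unit_critic, real, fake, rng=random.Random(0))
gp_steep = gradient_penalty(steep_critic, real, fake, rng=random.Random(0))
```

Note that the penalty needs the gradient of the critic with respect to its input at every interpolated sample, which is the extra computation that the text attributes to WGAN-GP.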

WGAN-GP works well in many cases, but it is just an empirical trick. Recent research has revealed some unreasonable aspects of WGAN-GP (Wu et al., 2017). Another disadvantage of WGAN-GP is its slow speed, because calculating the exact gradients requires heavier computation.

## 3 GAN with Quadratic Potential

From the discussion of SGAN and WGAN, we can see that an ideal divergence should not place any constraints on $T$ (like SGAN) and should give a reasonable measurement of the difference between $p(x)$ and $q(x)$ even when they have almost no intersection (like WGAN).

Here we propose such a divergence:

###### Lemma 3

The following defines a probability divergence

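$$L[p(x), q(x)] = \max_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda d(x_r, x_f)}\right] \tag{14}$$

Here $\lambda > 0$ is a hyperparameter and $d$ is any distance metric of $x_r, x_f$. A norm-induced distance such as $\|x_r - x_f\|_2$ is frequently used.

The proof is shown in Appendix A.3. It looks like a conventional divergence plus a quadratic potential, so we call it the quadratic potential divergence (QP-div).

Now we check $L[p(x), q(x)]$ with $p(x) = \delta(x-\alpha)$ and $q(x) = \delta(x-\beta)$ to demonstrate that QP-div behaves reasonably for two extreme distributions:

$$L[p(x), q(x)] = \max_T T(\alpha, \beta) - T(\beta, \alpha) - \frac{(T(\alpha, \beta) - T(\beta, \alpha))^2}{2\lambda d(\alpha, \beta)} \tag{15}$$

Let $z = T(\alpha, \beta) - T(\beta, \alpha)$; this is just the problem of maximizing the quadratic function $z - \frac{z^2}{2\lambda d(\alpha, \beta)}$. Its maximum is $\frac{1}{2}\lambda d(\alpha, \beta)$, attained at $z = \lambda d(\alpha, \beta)$, so

$$L[p(x), q(x)] = \frac{1}{2}\lambda d(\alpha, \beta) \tag{16}$$

We can see that $L[p(x), q(x)]$ has a similar property to $W[p(x), q(x)]$, but with no constraints on $T$. This makes it very friendly in practice.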
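The quadratic maximisation can be checked numerically. The sketch below writes the two-point objective as a function of $z = T(\alpha, \beta) - T(\beta, \alpha)$ and compares the closed-form maximum with a brute-force grid search; the values of `lam` and `dist` are arbitrary illustrative choices.

```python
def qp_two_point_objective(z, lam, dist):
    """Two-point QP-div objective with z = T(alpha, beta) - T(beta, alpha)."""
    return z - z * z / (2.0 * lam * dist)

def qp_two_point_max(lam, dist):
    """Closed-form maximiser and maximum of the concave quadratic."""
    z_star = lam * dist
    return z_star, qp_two_point_objective(z_star, lam, dist)

lam, dist = 2.0, 3.0                      # hypothetical lambda and d(alpha, beta)
z_star, best = qp_two_point_max(lam, dist)

# Brute-force check on a grid of candidate z values in [-20, 20]
grid = [i * 0.01 for i in range(-2000, 2001)]
grid_best = max(qp_two_point_objective(z, lam, dist) for z in grid)
```

Because the maximiser is finite, the critic has no incentive to diverge, in contrast with the SGAN case where the optimum is only reached in the limit of unbounded $T$.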

### 3.2 From QP-div to GAN-QP

In theory, we can play a min-max game on QP-div to train a generative model

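$$\begin{aligned} G, T &= \mathop{\arg\min}_G \mathop{\arg\max}_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}[L(x_r, x_f)] \\ L(x_r, x_f) &= T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda d(x_r, x_f)} \end{aligned} \tag{17}$$

However, $L(x_r, x_f)$ is not a good loss for the generator because there is a $d(x_r, x_f)$ in the denominator. The generator wants to minimize $-\frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda d(x_r, x_f)}$, which minimizes $d(x_r, x_f)$ correspondingly. And we know that a ready-to-use distance may not be a perfect metric between two samples.

We find that using $T(x_r, x_f) - T(x_f, x_r)$ as the loss of the generator is enough. That results in the following generative model, called GAN with Quadratic Potential (GAN-QP):

$$\begin{aligned} T &= \mathop{\arg\max}_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}[L(x_r, x_f)] \\ G &= \mathop{\arg\min}_G \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}[T(x_r, x_f) - T(x_f, x_r)] \end{aligned} \tag{18}$$

Further discussion can be found in Appendix B.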

## 4 Experiments

### 4.1 Experimental Details

Our experiments are mainly conducted on the CelebA HQ dataset (Karras et al., 2017). Our basic setup follows DCGANs (Radford et al., 2015), and is implemented in Keras (Chollet et al., 2015) and available in my repository. We use the Adam optimizer (Kingma & Ba, 2014) with a constant learning rate for both the generator and the discriminator. We train GAN-QP with two D steps per G step.

In (18), the discriminator $T(x_r, x_f)$ is a model that takes both a real sample and a fake sample as inputs. But in our experiments, we find that using just one sample as input already gives good performance. In other words, $T(x_r, x_f) \equiv T(x_r)$ is enough. We also tried architectures that take both samples as input, but there was no obvious improvement. So the final loss we use is

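$$\begin{aligned} T &= \mathop{\arg\max}_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[T(x_r) - T(x_f) - \frac{(T(x_r) - T(x_f))^2}{2\lambda d(x_r, x_f)}\right] \\ G &= \mathop{\arg\min}_G \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}[T(x_r) - T(x_f)] \end{aligned} \tag{19}$$

The hyperparameter $\lambda$ is

$$\lambda = \begin{cases} \dfrac{10}{w \times h \times ch}, & \text{when } d(x_r, x_f) = \|x_r - x_f\|_1 \\[2ex] \dfrac{10}{\sqrt{w \times h \times ch}}, & \text{when } d(x_r, x_f) = \|x_r - x_f\|_2 \end{cases} \tag{20}$$

where $w$, $h$ and $ch$ are the width, height and number of channels of the input images. We test both the L1 norm and the L2 norm in our experiments, but there is no statistically significant difference between them.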
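The final losses can be written down compactly. The following is a framework-free sketch, not the Keras implementation from the repository: it assumes the critic scores have already been computed, reads (20) as $\lambda = 10/(w h\, ch)$ for the L1 distance and $10/\sqrt{w h\, ch}$ for the L2 distance, and uses hypothetical function names. Both losses are expressed as quantities to minimise, so the discriminator loss is the negated objective of (19).

```python
import math

def qp_lambda(width, height, channels, norm="l1"):
    """Scale factor lambda for an image of the given shape, following (20)."""
    n = width * height * channels
    if norm == "l1":
        return 10.0 / n
    if norm == "l2":
        return 10.0 / math.sqrt(n)
    raise ValueError("norm must be 'l1' or 'l2'")

def distance(x, y, norm="l1"):
    """L1 or L2 distance between two flattened samples."""
    diffs = [a - b for a, b in zip(x, y)]
    if norm == "l1":
        return sum(abs(v) for v in diffs)
    return math.sqrt(sum(v * v for v in diffs))

def gan_qp_losses(t_real, t_fake, x_real, x_fake, lam, norm="l1"):
    """Batch-averaged discriminator and generator losses for (19).

    t_real, t_fake: critic scores T(x_r), T(x_f), one float per pair.
    x_real, x_fake: the paired samples as flat lists of floats
                    (assumed distinct, so the distance is non-zero).
    """
    d_terms, g_terms = [], []
    for tr, tf, xr, xf in zip(t_real, t_fake, x_real, x_fake):
        delta = tr - tf
        dist = distance(xr, xf, norm)
        d_terms.append(-(delta - delta * delta / (2.0 * lam * dist)))
        g_terms.append(delta)
    n = len(d_terms)
    return sum(d_terms) / n, sum(g_terms) / n

# Tiny worked example: one pair of 2x2 single-channel "images"
lam = qp_lambda(2, 2, 1, "l1")                       # 10 / 4 = 2.5
d_loss, g_loss = gan_qp_losses([3.0], [1.0],
                               [[1.0, 1.0, 1.0, 1.0]],
                               [[0.0, 0.0, 0.0, 0.0]], lam)
```

Unlike the gradient penalty, the quadratic term only needs the critic outputs and a pairwise distance between samples, so no gradient with respect to the input is required.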

The quantitative index we use to evaluate a GAN is the Frechet Inception Distance (FID, Heusel et al. (2017)). We also re-implement it in Keras.

### 4.2 Basic Comparison

Firstly, we compared GAN-QP with WGAN-GP, WGAN-SN (WGAN with spectral normalization), SGAN-SN (SGAN with spectral normalization) and LSGAN-SN (LSGAN with spectral normalization) at 128x128 resolution. The $\lambda$ we use in WGAN-GP is 10. Batch Normalization is removed from the discriminator of WGAN-GP, and the other hyperparameters are the same as for GAN-QP. Each experiment is repeated twice to obtain more reliable conclusions. The comparison is shown in Figure 1 and Table 1.

We can see that there is no obvious difference between GAN-QP-L1 and GAN-QP-L2, which means GAN-QP is robust to the choice of distance metric. The best two results come from GAN-QP and SGAN-SN. The worst is WGAN-GP. Generally, the FID curves of WGAN-SN and SGAN-SN are smoother, while that of GAN-QP fluctuates more. But GAN-QP matches SGAN-SN for the best performance.

Figure 1: FID comparison of GAN-QP with L1/L2 distance and WGAN-GP, WGAN-SN, SGAN-SN, LSGAN-SN. WGAN-GP is generally worse than the others. The best two are GAN-QP and SGAN-SN.

### 4.3 Higher Resolution

At 128x128 resolution, SGAN-SN and GAN-QP share the best performance. If we turn to 256x256 resolution, we can see that GAN-QP achieves a better FID than SGAN-SN (Table 2). It even works well at 512x512 resolution (Figure 2).

Figure 2: Random samples from GAN-QP at 512x512 resolution. The final FID is 26.64. Training takes 2 days on one GTX 1080 Ti.

## 5 Conclusion

In this paper, we demonstrate that we can explore probability divergences directly in the dual space, which is more convenient and flexible. Starting from this idea, we find a novel divergence called QP-div, which has excellent characteristics: it requires neither the 1-Lipschitz constraint nor an extra gradient penalty. As a concrete example, we construct a new GAN framework equipped with QP-div: GAN-QP. The experiments demonstrate the stability and superiority of GAN-QP.

## References

• Arjovsky, M. & Bottou, L. (2017). Towards Principled Methods for Training Generative Adversarial Networks.
• Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN.
• Chollet, F. et al. (2015). Keras.
• Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S. & Bengio, Y. (2014). Generative Adversarial Networks. Advances in Neural Information Processing Systems, 3, 2672-2680.
• Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs.
• Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
• Karras, T., Aila, T., Laine, S. & Lehtinen, J. (2017). Progressive Growing of GANs for Improved Quality, Stability, and Variation.
• Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. Computer Science.
• Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z. & Smolley, S. P. (2016). Least Squares Generative Adversarial Networks.
• Miyato, T., Kataoka, T., Koyama, M. & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks.
• Nowozin, S., Cseke, B. & Tomioka, R. (2016). f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization.
• Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Computer Science.
• Wu, J., Huang, Z., Thoma, J., Acharya, D. & Van Gool, L. (2017). Wasserstein Divergence for GANs.

## Appendix A Detailed Derivation

### A.1 (5) is a divergence

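Firstly, it is trivial to see that $D[p(x), q(x)]$ is nonnegative, since we can always let $T(x) \equiv 0$:

$$\begin{aligned} D[p(x), q(x)] &= \max_T \frac{1}{2}\mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \frac{1}{2}\mathbb{E}_{x\sim q(x)}[\log(1-\sigma(T(x)))] + \log 2 \\ &\geq \frac{1}{2}\mathbb{E}_{x\sim p(x)}[\log\sigma(0)] + \frac{1}{2}\mathbb{E}_{x\sim q(x)}[\log(1-\sigma(0))] + \log 2 \\ &= 0 \end{aligned} \tag{21}$$

Next, we show $D[p(x), p(x)] = 0$, which is also simple:

$$\begin{aligned} D[p(x), p(x)] &= \max_T \frac{1}{2}\mathbb{E}_{x\sim p(x)}[\log\sigma(T(x)) + \log(1-\sigma(T(x)))] + \log 2 \\ &= \max_T \frac{1}{2}\mathbb{E}_{x\sim p(x)}[\log(\sigma(T(x))(1-\sigma(T(x))))] + \log 2 \end{aligned} \tag{22}$$

It is not difficult to prove that the maximum value of $\sigma(T)(1-\sigma(T))$ is $\frac{1}{4}$, attained at $T = 0$, so we have

$$D[p(x), p(x)] = \frac{1}{2}\mathbb{E}_{x\sim p(x)}\left[\log\frac{1}{4}\right] + \log 2 = 0 \tag{23}$$

Finally, we show $D[p(x), q(x)] > 0$ if $p(x) \neq q(x)$. (Strictly, $p(x) \neq q(x)$ at isolated points is not enough; the two distributions must differ on a set of positive measure.) Let

$$\sigma(T(x)) = \frac{p(x)}{p(x) + q(x)} = \lambda(x) \tag{24}$$

we have

$$\begin{aligned} D[p(x), q(x)] &= \frac{1}{2}\int\left[p(x)\log\frac{p(x)}{p(x)+q(x)} + q(x)\log\frac{q(x)}{p(x)+q(x)}\right]dx + \log 2 \\ &= \int\left(\frac{p(x)+q(x)}{2}\right)\left[\lambda(x)\log\lambda(x) + (1-\lambda(x))\log(1-\lambda(x))\right]dx + \log 2 \end{aligned} \tag{25}$$

Because $p(x) \neq q(x)$, we have $\lambda(x) \neq \frac{1}{2}$, and we know $\lambda\log\lambda + (1-\lambda)\log(1-\lambda) > -\log 2$ if $\lambda \neq \frac{1}{2}$. Therefore,

$$D[p(x), q(x)] > \int\left(\frac{p(x)+q(x)}{2}\right)(-\log 2)dx + \log 2 = 0 \tag{26}$$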

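### A.2 (8) is a divergence

Firstly, it is trivial to see that $W[p(x), q(x)]$ is nonnegative, since we can always let $T(x) \equiv 0$:

$$W[p(x), q(x)] \geq \mathbb{E}_{x\sim p(x)}[0] - \mathbb{E}_{x\sim q(x)}[0] = 0 \tag{27}$$

Next, $W[p(x), p(x)] = 0$ is also trivial. So we only need to show $W[p(x), q(x)] > 0$ if $p(x) \neq q(x)$. It is actually not difficult because we only need to let $T_0(x) = \text{sign}(p(x) - q(x))$:

$$\mathbb{E}_{x\sim p(x)}[T_0(x)] - \mathbb{E}_{x\sim q(x)}[T_0(x)] = \int (p(x) - q(x)) \cdot \text{sign}(p(x) - q(x))dx > 0 \tag{28}$$

That means $W[p(x), q(x)] > 0$.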

### A.3 (14) is a divergence

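Firstly, it is trivial to see that $L[p(x), q(x)]$ is nonnegative, since we can always let $T(x_r, x_f) \equiv 0$:

$$L[p(x), q(x)] \geq \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[0 - 0 - \frac{(0 - 0)^2}{2\lambda d(x_r, x_f)}\right] = 0 \tag{29}$$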

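Next, we have

$$L[p(x), p(x)] = \max_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)p(x_f)}\left[-\frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda d(x_r, x_f)}\right] \tag{30}$$

Obviously, the maximum value is zero. So $L[p(x), p(x)] = 0$.

Finally, if $p(x) \neq q(x)$, we let

$$T_0(x_r, x_f) = \text{sign}(p(x_r)q(x_f) - p(x_f)q(x_r)) \tag{31}$$

now we have

$$\begin{aligned} \gamma_1 &= \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}[T_0(x_r, x_f) - T_0(x_f, x_r)] \\ &= \iint p(x_r)q(x_f)[T_0(x_r, x_f) - T_0(x_f, x_r)]dx_r dx_f \\ &= \iint [p(x_r)q(x_f) - p(x_f)q(x_r)]T_0(x_r, x_f)dx_r dx_f \\ &= \iint [p(x_r)q(x_f) - p(x_f)q(x_r)] \cdot \text{sign}(p(x_r)q(x_f) - p(x_f)q(x_r))dx_r dx_f > 0 \\ \gamma_2 &= \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[\frac{(T_0(x_r, x_f) - T_0(x_f, x_r))^2}{2\lambda d(x_r, x_f)}\right] \geq 0 \end{aligned} \tag{32}$$

If $\gamma_2 = 0$, then

$$\mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[T_0(x_r, x_f) - T_0(x_f, x_r) - \frac{(T_0(x_r, x_f) - T_0(x_f, x_r))^2}{2\lambda d(x_r, x_f)}\right] = \gamma_1 - \gamma_2 > 0 \tag{33}$$

else if $\gamma_2 > 0$, we can define

$$T(x_r, x_f) = \frac{\gamma_1}{2\gamma_2} \cdot T_0(x_r, x_f) \tag{34}$$

then

$$\mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda d(x_r, x_f)}\right] = \left(\frac{\gamma_1}{2\gamma_2}\right)\gamma_1 - \left(\frac{\gamma_1}{2\gamma_2}\right)^2\gamma_2 = \frac{\gamma_1^2}{4\gamma_2} > 0 \tag{35}$$

Therefore $L[p(x), q(x)] > 0$.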

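## Appendix B Analysis of GAN-QP

### B.1 Optimum Solution of (14)

###### Lemma 4

The optimum solution $T$ of (14) satisfies

$$\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)} = \frac{T(x_r, x_f) - T(x_f, x_r)}{\lambda d(x_r, x_f)} \tag{36}$$

The proof is a basic application of the variational method:

$$\begin{aligned} &\delta\iint p(x_r)q(x_f)\left[T(x_r, x_f) - T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))^2}{2\lambda d(x_r, x_f)}\right]dx_r dx_f \\ =& \iint p(x_r)q(x_f)\left[\delta T(x_r, x_f) - \delta T(x_f, x_r) - \frac{(T(x_r, x_f) - T(x_f, x_r))(\delta T(x_r, x_f) - \delta T(x_f, x_r))}{\lambda d(x_r, x_f)}\right]dx_r dx_f \\ =& \iint \left[p(x_r)q(x_f) - p(x_f)q(x_r) - (p(x_r)q(x_f) + p(x_f)q(x_r))\frac{T(x_r, x_f) - T(x_f, x_r)}{\lambda d(x_r, x_f)}\right]\delta T(x_r, x_f)dx_r dx_f \end{aligned} \tag{37}$$

The expression in square brackets must be identically zero. Therefore

$$\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)} = \frac{T(x_r, x_f) - T(x_f, x_r)}{\lambda d(x_r, x_f)} \tag{38}$$

From (38), it is not difficult to prove that the optimum $T$ satisfies

$$-1 \leq \frac{T(x_r, x_f) - T(x_f, x_r)}{\lambda d(x_r, x_f)} \leq 1 \tag{39}$$

In other words, the optimum $T$ satisfies a Lipschitz constraint automatically. Therefore we can say that QP-div is a divergence with an adaptive Lipschitz constraint.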
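Lemma 4 can be checked on a small discrete example. The sketch below assumes three points on the real line with $d(x, y) = |x - y|$ and made-up probabilities; it builds the antisymmetric difference $\Delta_{ij} = T(x_i, x_j) - T(x_j, x_i)$ from the stationarity condition, then verifies the automatic Lipschitz bound and that perturbing $\Delta$ lowers the objective.

```python
def optimal_delta(p, q, points, lam):
    """Delta[i][j] = T(x_i, x_j) - T(x_j, x_i) at the optimum of the QP-div."""
    n = len(points)
    delta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a, b = p[i] * q[j], p[j] * q[i]
                d = abs(points[i] - points[j])
                delta[i][j] = lam * d * (a - b) / (a + b)
    return delta

def qp_objective(p, q, points, lam, delta):
    """Discrete version of the QP-div expectation for an antisymmetric delta."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i != j:  # delta is zero on the diagonal, so those terms vanish
                d = abs(points[i] - points[j])
                total += p[i] * q[j] * (delta[i][j] - delta[i][j] ** 2 / (2.0 * lam * d))
    return total

# Hypothetical toy problem
points = [0.0, 1.0, 3.0]
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
lam = 0.5

delta = optimal_delta(p, q, points, lam)
best = qp_objective(p, q, points, lam, delta)

# Antisymmetric perturbation of one pair; the objective should decrease.
perturbed = [row[:] for row in delta]
perturbed[0][1] += 0.1
perturbed[1][0] -= 0.1
worse = qp_objective(p, q, points, lam, perturbed)
```

Doubling `lam` doubles every entry of `delta`, which is the discrete counterpart of the statement that $\lambda$ only rescales the optimal critic.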

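### B.3 The Divergence of Generator

We use $T(x_r, x_f) - T(x_f, x_r)$ rather than the whole $L(x_r, x_f)$ as the loss of the generator in (18), and we have found the optimum solution of $T$ in Lemma 4. We can then see that the quantity the generator ultimately minimizes is

$$\lambda\iint p(x_r)q(x_f)\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)}d(x_r, x_f)dx_r dx_f \tag{40}$$

Now we have Lemma 5:

###### Lemma 5

$$\tilde{L}[p(x), q(x)] = \iint p(x_r)q(x_f)\frac{p(x_r)q(x_f) - p(x_f)q(x_r)}{p(x_r)q(x_f) + p(x_f)q(x_r)}d(x_r, x_f)dx_r dx_f \tag{41}$$

is also a probability divergence of $p(x), q(x)$.

Lemma 5 is a consequence of the Cauchy–Schwarz inequality. Firstly we let

$$\mu(x_r, x_f) = \frac{d(x_r, x_f)}{p(x_r)q(x_f) + p(x_f)q(x_r)} > 0 \tag{42}$$

Then by the Cauchy–Schwarz inequality we have

$$\begin{aligned} &\left(\iint \mu(x_r, x_f)p^2(x_r)q^2(x_f)dx_r dx_f\right)^2 \\ =& \left(\iint \left(\sqrt{\mu(x_r, x_f)}p(x_r)q(x_f)\right)^2 dx_r dx_f\right)\left(\iint \left(\sqrt{\mu(x_f, x_r)}p(x_f)q(x_r)\right)^2 dx_f dx_r\right) \\ \geq& \left(\iint \mu(x_r, x_f)p(x_r)q(x_f)p(x_f)q(x_r)dx_r dx_f\right)^2 \end{aligned} \tag{43}$$

So

$$\tilde{L}[p(x), q(x)] = \iint \mu(x_r, x_f)p(x_r)q(x_f)\left(p(x_r)q(x_f) - p(x_f)q(x_r)\right)dx_r dx_f \geq 0 \tag{44}$$

The two sides are equal if and only if $p(x_r)q(x_f) = p(x_f)q(x_r)$, which means $p(x) = q(x)$. Therefore $\tilde{L}[p(x), q(x)]$ is a probability divergence, which means that lowering $\tilde{L}[p(x), q(x)]$ actually lowers the difference between $p(x)$ and $q(x)$. The divergence is weighted by $d(x_r, x_f)$, forcing the generator to focus on the sample pairs with larger distance, which is in line with our intuition.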
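The divergence of Lemma 5 is easy to evaluate for discrete distributions. The sketch below is an illustration under assumptions: points on the real line, $d(x, y) = |x - y|$, and made-up probability vectors. It checks that the value is zero for identical distributions, positive otherwise, and equal to the distance between the points when both distributions are point masses.

```python
def generator_divergence(p, q, points):
    """Discrete analogue of the generator divergence with d(x, y) = |x - y|."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # d(x, x) = 0, so the diagonal contributes nothing
            a, b = p[i] * q[j], p[j] * q[i]
            if a + b == 0.0:
                continue  # the pair has zero probability in both orderings
            total += a * (a - b) / (a + b) * abs(points[i] - points[j])
    return total

points = [0.0, 1.0, 3.0]
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

same = generator_divergence(p, p, points)        # identical distributions
different = generator_divergence(p, q, points)   # distinct distributions
# Two point masses at alpha = 0 and beta = 3
two_points = generator_divergence([1.0, 0.0], [0.0, 1.0], [0.0, 3.0])
```

Pairing the terms $(i, j)$ and $(j, i)$ gives $d_{ij}(a - b)^2/(a + b)$ with $a = p_i q_j$ and $b = p_j q_i$, which makes the nonnegativity visible without the Cauchy–Schwarz argument.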

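### B.4 Performance while No Intersection

We have shown that the QP-div in Lemma 3 also works well when there is no intersection between $p(x)$ and $q(x)$. But now we use $T(x_r, x_f) - T(x_f, x_r)$ as the loss of the generator, corresponding to the new divergence $\tilde{L}[p(x), q(x)]$. Therefore we have to check the behaviour of $\tilde{L}[p(x), q(x)]$ with $p(x) = \delta(x-\alpha)$ and $q(x) = \delta(x-\beta)$. That is very easy:

$$\begin{aligned} &\tilde{L}[\delta(x-\alpha), \delta(x-\beta)] \\ =& \iint \delta(x_r-\alpha)\delta(x_f-\beta)\frac{\delta(x_r-\alpha)\delta(x_f-\beta) - \delta(x_f-\alpha)\delta(x_r-\beta)}{\delta(x_r-\alpha)\delta(x_f-\beta) + \delta(x_f-\alpha)\delta(x_r-\beta)}d(x_r, x_f)dx_r dx_f \\ =& \frac{\delta(0)\delta(0) - \delta(\beta-\alpha)\delta(\alpha-\beta)}{\delta(0)\delta(0) + \delta(\beta-\alpha)\delta(\alpha-\beta)}d(\alpha, \beta) \end{aligned} \tag{45}$$

We know $\delta(x) = 0$ for $x \neq 0$, so $\delta(\beta-\alpha) = 0$ when $\alpha \neq \beta$ and the result is $d(\alpha, \beta)$, which is a reasonable measurement.

### B.5 Robustness of λ

Equation (40) shows that $\lambda$ is just a scale factor for $\tilde{L}[p(x), q(x)]$. That means GAN-QP is insensitive to the hyperparameter $\lambda$, unlike WGAN-GP. We only need to choose a suitable $\lambda$ to make the loss more readable (neither very large nor very small).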

## Appendix C Future Work

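### C.1 A Conjecture

Inspired by the form of QP-div (14), it may be extended as

###### Conjecture 1

If

$$\mathop{\arg\max}_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}[f(T(x_r, x_f))] \tag{46}$$

is a probability divergence of $p(x)$ and $q(x)$, then so is

$$\mathop{\arg\max}_T \mathbb{E}_{(x_r, x_f)\sim p(x_r)q(x_f)}\left[f(T(x_r, x_f)) - \frac{f^2(T(x_r, x_f))}{2\lambda d(x_r, x_f)}\right] \tag{47}$$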

for some .

Conjecture 1 means we can use

$$\frac{f^2(T(x_r, x_f))}{2\lambda d(x_r, x_f)} \tag{48}$$

as a penalty term for any other GAN's discriminator to enhance the original GAN.

### C.2 Example

For example, we can enhance SGAN with a quadratic potential term:

 T=argmaxTE(xr,xf)∼p(xr)q(xf)[f(xr,xf)−