 # From GAN to WGAN

This paper explains the math behind the generative adversarial network (GAN) model and why it is hard to train. Wasserstein GAN is intended to improve GAN training by adopting a smooth metric for measuring the distance between two probability distributions.


## 1 Introduction

Generative adversarial network (GAN) has shown great results in many generative tasks to replicate rich real-world content such as images, human language, and music. It is inspired by game theory: two models, a generator and a critic, compete with each other while making each other stronger at the same time. However, it is rather challenging to train a GAN model, as people face issues like training instability or failure to converge.

Here I would like to explain the math behind the generative adversarial network framework, why it is hard to train, and finally introduce a modified version of GAN intended to solve the training difficulties.

## 2 Kullback–Leibler and Jensen–Shannon Divergence

Before we start examining GANs closely, let us first review two metrics for quantifying the similarity between two probability distributions.

(1) KL (Kullback–Leibler) Divergence measures how one probability distribution $p$ diverges from a second expected probability distribution $q$.

$$D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} \, dx$$

$D_{KL}$ achieves the minimum zero when $p(x) = q(x)$ everywhere.

It is noticeable according to the formula that KL divergence is asymmetric. In cases where $p(x)$ is close to zero, but $q(x)$ is significantly non-zero, the effect of $q$ is disregarded. It could cause buggy results when we just want to measure the similarity between two equally important distributions.

(2) Jensen–Shannon Divergence is another measure of similarity between two probability distributions, bounded by $[0, 1]$ (when the logarithm is base 2). JS divergence is symmetric and smoother. Check this post if you are interested in reading more about the comparison between KL divergence and JS divergence.

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p + q}{2}\right)$$

Figure 1: Given two Gaussian distributions, $p$ with mean=0 and std=1 and $q$ with mean=1 and std=1. The average of the two distributions is labeled as $m = (p + q)/2$. KL divergence $D_{KL}$ is asymmetric but JS divergence $D_{JS}$ is symmetric.

Some (gan2015train) believe that one reason behind GANs’ big success is switching the loss function from asymmetric KL divergence in the traditional maximum-likelihood approach to symmetric JS divergence. We will discuss more on this point in the next section.
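To make the asymmetry of KL and the symmetry of JS concrete, here is a minimal sketch for discrete distributions. The two example distributions `p` and `q` are made-up values chosen only for illustration, and the natural logarithm is assumed.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a term with p(x) = 0 contributes nothing
        if qi == 0.0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

def js_divergence(p, q):
    """D_JS(p || q) computed through the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical example distributions over three outcomes.
p = [0.1, 0.4, 0.5]
q = [0.8, 0.1, 0.1]

print(kl_divergence(p, q), kl_divergence(q, p))  # two different values
print(js_divergence(p, q), js_divergence(q, p))  # the same value twice
```

With the natural logarithm the JS divergence is bounded by $\log 2$ rather than 1; dividing by $\log 2$ converts it to the base-2 scale.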

## 3 Generative Adversarial Network (GAN)

GAN consists of two models:

• A discriminator $D$ estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.

• A generator $G$ outputs synthetic samples given a noise variable input $z$ ($z$ brings in potential output diversity). It is trained to capture the real data distribution so that its generated samples can be as real as possible, or in other words, can trick the discriminator into offering a high probability.

These two models compete against each other during the training process: the generator is trying hard to trick the discriminator, while the critic model is trying hard not to be cheated. This interesting zero-sum game between two models motivates both to improve their functionalities.

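Given:

• $p_z$: the data distribution over the noise input $z$.

• $p_g$: the generator’s distribution over data $x$.

• $p_r$: the data distribution over real samples $x$.

On one hand, we want to make sure the discriminator $D$’s decisions over real data are accurate by maximizing $\mathbb{E}_{x \sim p_r(x)}[\log D(x)]$. Meanwhile, given a fake sample $G(z)$, $z \sim p_z(z)$, the discriminator is expected to output a probability, $D(G(z))$, close to zero by maximizing $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$.

On the other hand, the generator is trained to increase the chances of $D$ producing a high probability for a fake example, thus to minimize $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$.

When combining both aspects together, $D$ and $G$ are playing a minimax game in which we should optimize the following loss function:

$$\begin{aligned}
\min_G \max_D L(D, G) &= \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\
&= \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{x \sim p_g(x)}[\log(1 - D(x))]
\end{aligned}$$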

### 3.1 What is the Optimal Value for D?

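Now we have a well-defined loss function. Let’s first examine what is the best value for $D$.

$$L(G, D) = \int_x \big( p_r(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \big) \, dx$$

Since we are interested in what is the best value of $D(x)$ to maximize $L(G, D)$, let us label

$$\tilde{x} = D(x), \quad A = p_r(x), \quad B = p_g(x)$$

And then what is inside the integral (we can safely ignore the integral because $x$ is sampled over all the possible values) is given below. The factor $1/\ln 10$ comes from reading $\log$ as base 10; the maximizer is the same for any base.

$$\begin{aligned}
f(\tilde{x}) &= A \log \tilde{x} + B \log(1 - \tilde{x}) \\
\frac{d f(\tilde{x})}{d \tilde{x}} &= A \frac{1}{\ln 10} \frac{1}{\tilde{x}} - B \frac{1}{\ln 10} \frac{1}{1 - \tilde{x}} \\
&= \frac{1}{\ln 10} \left( \frac{A}{\tilde{x}} - \frac{B}{1 - \tilde{x}} \right) \\
&= \frac{1}{\ln 10} \frac{A - (A + B)\tilde{x}}{\tilde{x}(1 - \tilde{x})}
\end{aligned}$$

Thus, setting $\frac{d f(\tilde{x})}{d \tilde{x}} = 0$, we get the best value of the discriminator: $D^*(x) = \tilde{x}^* = \frac{A}{A + B} = \frac{p_r(x)}{p_r(x) + p_g(x)} \in [0, 1]$. Once the generator is trained to its optimum, $p_g$ gets very close to $p_r$. When $p_g = p_r$, $D^*(x)$ becomes $1/2$.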

### 3.2 What is the Global Optimal?

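When both $G$ and $D$ are at their optimal values, we have $p_g = p_r$ and $D^*(x) = 1/2$ and the loss function becomes:

$$\begin{aligned}
L(G, D^*) &= \int_x \big( p_r(x) \log(D^*(x)) + p_g(x) \log(1 - D^*(x)) \big) \, dx \\
&= \log \frac{1}{2} \int_x p_r(x) \, dx + \log \frac{1}{2} \int_x p_g(x) \, dx \\
&= -2 \log 2
\end{aligned}$$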

### 3.3 What does the Loss Function Represent?

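According to the formula listed in Sec. 2, JS divergence between $p_r$ and $p_g$ can be computed as:

$$\begin{aligned}
D_{JS}(p_r \| p_g) &= \frac{1}{2} D_{KL}\left(p_r \,\Big\|\, \frac{p_r + p_g}{2}\right) + \frac{1}{2} D_{KL}\left(p_g \,\Big\|\, \frac{p_r + p_g}{2}\right) \\
&= \frac{1}{2}\left(\log 2 + \int_x p_r(x) \log \frac{p_r(x)}{p_r(x) + p_g(x)} \, dx\right) + \frac{1}{2}\left(\log 2 + \int_x p_g(x) \log \frac{p_g(x)}{p_r(x) + p_g(x)} \, dx\right) \\
&= \frac{1}{2}\big(\log 4 + L(G, D^*)\big)
\end{aligned}$$

Thus,

$$L(G, D^*) = 2 D_{JS}(p_r \| p_g) - 2 \log 2$$

Essentially the loss function of GAN quantifies the similarity between the generative data distribution $p_g$ and the real sample distribution $p_r$ by JS divergence when the discriminator is optimal. The best $G^*$ that replicates the real data distribution leads to the minimum $L(G^*, D^*) = -2 \log 2$, which is aligned with the equations above.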
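As a quick numerical sanity check of the relation between the GAN loss at the optimal discriminator and the JS divergence, the sketch below evaluates both sides on two small discrete distributions. The distributions are made-up values with strictly positive probabilities (an assumption that avoids $\log 0$), and the natural logarithm is used; the identity holds for any base as long as it is used consistently.

```python
import math

def gan_loss_at_optimal_d(p_r, p_g):
    """L(G, D*) for discrete p_r, p_g, with D*(x) = p_r(x) / (p_r(x) + p_g(x))."""
    loss = 0.0
    for r, g in zip(p_r, p_g):
        d_star = r / (r + g)
        loss += r * math.log(d_star) + g * math.log(1.0 - d_star)
    return loss

def js_divergence(p, q):
    """D_JS(p || q) for strictly positive discrete distributions."""
    js = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2.0
        js += 0.5 * pi * math.log(pi / mi) + 0.5 * qi * math.log(qi / mi)
    return js

# Hypothetical distributions standing in for p_r and p_g.
p_r = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]

lhs = gan_loss_at_optimal_d(p_r, p_g)
rhs = 2.0 * js_divergence(p_r, p_g) - 2.0 * math.log(2.0)
print(lhs, rhs)  # the two numbers agree up to floating-point error
```

When the two inputs are identical, `gan_loss_at_optimal_d` returns $-2 \log 2$, the global minimum discussed in Sec. 3.2.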

Other Variations of GAN: There are many variations of GANs in different contexts or designed for different tasks. For example, for semi-supervised learning, one idea is to update the discriminator to output real class labels, $1, \dots, K-1$, as well as one fake class label $K$. The generator model aims to trick the discriminator into outputting a classification label smaller than $K$.

## 4 Problems in GANs

Although GAN has shown great success in realistic image generation, the training is not easy; the process is known to be slow and unstable.

### 4.1 Hard to Achieve Nash Equilibrium

salimans2016nips discussed the problem with GAN’s gradient-descent-based training procedure. Two models are trained simultaneously to find a Nash equilibrium of a two-player non-cooperative game. However, each model updates its cost independently, without regard to the other player in the game. Updating both models concurrently by gradient descent cannot guarantee convergence.

Let’s check out a simple example to better understand why it is difficult to find a Nash equilibrium in a non-cooperative game. Suppose one player takes control of $x$ to minimize $f_1(x) = xy$, while at the same time the other player constantly updates $y$ to minimize $f_2(y) = -xy$.

Because $\frac{\partial f_1}{\partial x} = y$ and $\frac{\partial f_2}{\partial y} = -x$, we update $x$ with $x - \eta \cdot y$ and $y$ with $y + \eta \cdot x$ simultaneously in one iteration, where $\eta$ is the learning rate. Once $x$ and $y$ have different signs, every following gradient update causes huge oscillation and the instability gets worse in time, as shown in Fig. 3.

Figure 3: A simulation of our example for updating $x$ to minimize $xy$ and updating $y$ to minimize $-xy$. The learning rate $\eta = 0.1$. With more iterations, the oscillation grows more and more unstable.
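The simulation described in the example can be sketched in a few lines. The starting point $(1, 1)$ and the number of steps are assumptions made for illustration; only the learning rate 0.1 comes from the figure caption.

```python
def simulate(x0=1.0, y0=1.0, eta=0.1, steps=100):
    """Simultaneous gradient steps: x minimizes x*y while y minimizes -x*y."""
    x, y = x0, y0
    history = [(x, y)]
    for _ in range(steps):
        grad_x = y    # d(x*y)/dx
        grad_y = -x   # d(-x*y)/dy
        # Both players move at once, each using the other's old value.
        x, y = x - eta * grad_x, y - eta * grad_y
        history.append((x, y))
    return history

history = simulate()
radii = [(x * x + y * y) ** 0.5 for x, y in history]
print(radii[0], radii[-1])  # the distance from the equilibrium (0, 0) keeps growing
```

Each simultaneous step multiplies $x^2 + y^2$ by $1 + \eta^2$, so the iterates spiral away from the equilibrium at $(0, 0)$ instead of converging to it.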

### 4.2 Low Dimensional Supports

arjovsky2017 thoroughly discussed the problem of the supports of $p_r$ and $p_g$ lying on low dimensional manifolds and how it contributes to the instability of GAN training.

The dimensions of many real-world datasets, as represented by $p_r$, only appear to be artificially high. They have been found to concentrate in a lower dimensional manifold. This is actually the fundamental assumption for Manifold Learning. Thinking of real-world images, once the theme or the contained object is fixed, the images have a lot of restrictions to follow, e.g., a dog should have two ears and a tail, and a skyscraper should have a straight and tall body. These restrictions keep images away from the possibility of having a high-dimensional free form.

$p_g$ lies in a low dimensional manifold, too. Whenever the generator is asked to generate a much larger image like 64x64 given a small dimension, such as 100, noise variable input $z$, the distribution of colors over these 4096 pixels has been defined by the small 100-dimension random number vector and can hardly fill up the whole high dimensional space.

Because both $p_g$ and $p_r$ rest in low dimensional manifolds, they are almost certainly going to be disjoint (see Fig. 4). When they have disjoint supports, we are always capable of finding a perfect discriminator that separates real and fake samples 100% correctly (arjovsky2017).

Figure 4: Low dimensional manifolds in high dimension space can hardly have overlaps. (Left) Two lines in a three-dimension space. (Right) Two surfaces in a three-dimension space.

### 4.3 Vanishing Gradient

When the discriminator is perfect, we are guaranteed to have $D(x) = 1, \forall x \in p_r$ and $D(x) = 0, \forall x \in p_g$. Therefore the loss function $L$ falls to zero and we end up with no gradient to update the generator during learning iterations. Fig. 5 demonstrates an experiment in which, as the discriminator gets better, the gradient vanishes fast.

Figure 5: First, a DCGAN is trained for 1, 10 and 25 epochs. Then, with the generator fixed, a discriminator is trained from scratch and the gradients are measured with the original cost function. We see the gradient norms decay quickly (in log scale), in the best case 5 orders of magnitude after 4000 discriminator iterations. (Image source: arjovsky2017).

As a result, training a GAN faces a dilemma:

• If the discriminator behaves badly, the generator does not have accurate feedback and the loss function cannot represent the reality.

• If the discriminator does a great job, the gradient of the loss function drops down to close to zero and the learning becomes super slow or even jammed.

This dilemma can clearly make GAN training very tough.

### 4.4 Mode Collapse

During the training, the generator may collapse to a setting where it always produces the same outputs. This is a common failure case for GANs, commonly referred to as Mode Collapse. Even though the generator might be able to trick the corresponding discriminator, it fails to learn to represent the complex real-world data distribution and gets stuck in a small space with extremely low variety.

Figure 6: A DCGAN model is trained with an MLP network with 4 layers, 512 units and ReLU activation function, configured to lack a strong inductive bias for image generation. The results show a significant degree of mode collapse. (Image source: wgan2017).

### 4.5 Lack of a Proper Evaluation Metric

Generative adversarial networks are not born with a good objective function that can inform us of the training progress. Without a good evaluation metric, it is like working in the dark: there is no good sign to tell when to stop and no good indicator to compare the performance of multiple models.

## 5 Improved GAN Training

The following suggestions are proposed to help stabilize and improve the training of GANs.

The first five methods are practical techniques to achieve faster convergence of GAN training (salimans2016nips). The last two are proposed in arjovsky2017 to solve the problem of disjoint distributions.

(1) Feature Matching

Feature matching suggests optimizing the discriminator to inspect whether the generator’s output matches expected statistics of the real samples. In such a scenario, the new loss function is defined as $\| \mathbb{E}_{x \sim p_r} f(x) - \mathbb{E}_{z \sim p_z(z)} f(G(z)) \|_2^2$, where $f(x)$ can be any computation of statistics of features, such as mean or median.
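A minimal sketch of a feature matching loss, assuming the statistic is the batch mean of a feature vector and the distance is the squared L2 norm. The feature values below are made up; in practice they would come from an intermediate layer of the discriminator.

```python
def mean_vector(rows):
    """Per-feature mean of a batch given as a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between the mean feature vectors of two batches."""
    mu_real = mean_vector(real_features)
    mu_fake = mean_vector(fake_features)
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_fake))

# Hypothetical features f(x) for two real samples and f(G(z)) for two fake ones.
real_features = [[1.0, 2.0], [3.0, 4.0]]
fake_features = [[0.0, 2.0], [2.0, 2.0]]

print(feature_matching_loss(real_features, fake_features))  # 2.0
```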

(2) Minibatch Discrimination

With minibatch discrimination, the discriminator is able to digest the relationship between training data points in one batch, instead of processing each point independently.

In one minibatch, we approximate the closeness between every pair of samples, $c(x_i, x_j)$, and get the overall summary of one data point by summing up how close it is to other samples in the same batch, $o(x_i) = \sum_j c(x_i, x_j)$. Then $o(x_i)$ is explicitly added to the input of the model.

(3) Historical Averaging

For both models, add $\| \Theta - \frac{1}{t} \sum_{i=1}^{t} \Theta_i \|^2$ into the loss function, where $\Theta$ is the model parameter and $\Theta_i$ is how the parameter is configured at the past training time $i$. This additional piece penalizes the training speed when $\Theta$ is changing too dramatically in time.
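The historical averaging term can be sketched as the squared distance between the current parameters and the mean of their past values. The parameter snapshots below are hypothetical, and parameters are represented as plain lists for simplicity.

```python
def historical_average_penalty(theta, past_thetas):
    """|| theta - mean(past_thetas) ||^2 for parameter vectors given as lists."""
    t = len(past_thetas)
    average = [sum(col) / t for col in zip(*past_thetas)]
    return sum((a - b) ** 2 for a, b in zip(theta, average))

# Hypothetical parameter snapshots from three earlier training steps.
past = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

print(historical_average_penalty([1.0, 1.0], past))  # 0.0, same as the average
print(historical_average_penalty([4.0, 1.0], past))  # 9.0, far from the average
```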

(4) One-sided Label Smoothing

When feeding the discriminator, instead of providing 1 and 0 labels, use softened values such as 0.9 and 0.1. It is shown to reduce the networks’ vulnerability to adversarial examples.

(5) Virtual Batch Normalization (VBN)

Each data sample is normalized based on a fixed batch ("reference batch") of data rather than within its minibatch. The reference batch is chosen once at the beginning and stays the same through the training.
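A simplified sketch of the idea, assuming per-feature mean and standard deviation are computed once from the reference batch and then reused for every sample. The reference batch values and the `eps` constant are illustrative assumptions, and learned scale and shift parameters are left out.

```python
import math

def batch_stats(batch):
    """Per-feature mean and standard deviation of a batch (list of rows)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, stds

def virtual_batch_norm(x, ref_means, ref_stds, eps=1e-5):
    """Normalize one sample with the statistics of the fixed reference batch."""
    return [(v - m) / (s + eps) for v, m, s in zip(x, ref_means, ref_stds)]

# Hypothetical reference batch, chosen once and kept fixed during training.
reference_batch = [[0.0, 10.0], [2.0, 30.0]]
ref_means, ref_stds = batch_stats(reference_batch)

print(virtual_batch_norm([1.0, 20.0], ref_means, ref_stds))  # [0.0, 0.0]
```

Because the statistics never change, the normalized value of a sample does not depend on which other samples happen to share its minibatch.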

(6) Adding Noises

Based on the discussion in Sec. 4.2, we now know $p_r$ and $p_g$ are disjoint in a high dimensional space and that this causes the problem of vanishing gradient. To artificially "spread out" the distribution and to create higher chances for two probability distributions to have overlaps, one solution is to add continuous noise onto the inputs of the discriminator $D$.

(7) Use Better Metric of Distribution Similarity

The loss function of the vanilla GAN measures the JS divergence between the distributions of $p_r$ and $p_g$. This metric fails to provide a meaningful value when two distributions are disjoint.

Wasserstein metric is proposed to replace JS divergence because it has a much smoother value space. See more in the next section.

## 6 Wasserstein GAN (WGAN)

### 6.1 What is Wasserstein Distance?

Wasserstein Distance is a measure of the distance between two probability distributions. It is also called Earth Mover’s distance, short for EM distance, because informally it can be interpreted as the minimum energy cost of moving and transforming a pile of dirt in the shape of one probability distribution to the shape of the other distribution. The cost is quantified by: the amount of dirt moved × the moving distance.

Let us first look at a simple case where the probability domain is discrete. For example, suppose we have two distributions $P$ and $Q$, each has four piles of dirt and both have ten shovelfuls of dirt in total. The numbers of shovelfuls in each dirt pile are assigned as follows:

$$P_1 = 3, \; P_2 = 2, \; P_3 = 1, \; P_4 = 4$$

$$Q_1 = 1, \; Q_2 = 2, \; Q_3 = 4, \; Q_4 = 3$$

In order to change $P$ to look like $Q$, as illustrated in Fig. 7, we:

• First move 2 shovelfuls from $P_1$ to $P_2$ => $(P_1, Q_1)$ match up.

• Then move 2 shovelfuls from $P_2$ to $P_3$ => $(P_2, Q_2)$ match up.

• Finally move 1 shovelful from $Q_3$ to $Q_4$ => $(P_3, Q_3)$ and $(P_4, Q_4)$ match up.

Figure 7: Step-by-step plan of moving dirt between piles in $P$ and $Q$ to make them match.

If we label the cost to pay to make $P_i$ and $Q_i$ match as $\delta_i$, we would have $\delta_{i+1} = \delta_i + P_{i+1} - Q_{i+1}$ and in the example:

$$\begin{aligned}
\delta_0 &= 0 \\
\delta_1 &= 0 + 3 - 1 = 2 \\
\delta_2 &= 2 + 2 - 2 = 2 \\
\delta_3 &= 2 + 1 - 4 = -1 \\
\delta_4 &= -1 + 4 - 3 = 0
\end{aligned}$$

Finally the Earth Mover’s distance is $W = \sum |\delta_i| = 5$.
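The running-cost computation from the dirt pile example can be written as a short function. This is a sketch that only covers the one-dimensional case with piles at unit spacing and equal total mass, as in the example.

```python
def earth_mover_distance(p, q):
    """EM distance between two 1-D histograms with unit spacing and equal mass."""
    delta = 0   # dirt that still has to be carried over to the next pile
    total = 0   # accumulated cost, the sum of |delta_i|
    for p_i, q_i in zip(p, q):
        delta += p_i - q_i
        total += abs(delta)
    return total

P = [3, 2, 1, 4]
Q = [1, 2, 4, 3]

print(earth_mover_distance(P, Q))  # 5
```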

When dealing with the continuous probability domain, the distance formula becomes:

$$W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\| x - y \|]$$

In the formula above, $\Pi(p_r, p_g)$ is the set of all possible joint probability distributions between $p_r$ and $p_g$. One joint distribution $\gamma \in \Pi(p_r, p_g)$ describes one dirt transport plan, same as the discrete example above, but in the continuous probability space. Precisely, $\gamma(x, y)$ states the percentage of dirt that should be transported from point $x$ to $y$ so as to make $x$ follow the same probability distribution as $y$. That’s why the marginal distribution over $x$ adds up to $p_g$, $\sum_x \gamma(x, y) = p_g(y)$ (once we finish moving the planned amount of dirt from every possible $x$ to the target $y$, we end up with exactly what $y$ has according to $p_g$), and vice versa $\sum_y \gamma(x, y) = p_r(x)$.

When treating $x$ as the starting point and $y$ as the destination, the total amount of dirt moved is $\gamma(x, y)$ and the traveling distance is $\| x - y \|$ and thus the cost is $\gamma(x, y) \cdot \| x - y \|$. The expected cost averaged across all the $(x, y)$ pairs can be easily computed as:

$$\sum_{x, y} \gamma(x, y) \| x - y \| = \mathbb{E}_{x, y \sim \gamma} \| x - y \|$$

Finally, we take the minimum one among the costs of all dirt moving solutions as the EM distance. In the definition of Wasserstein distance, the $\inf$ (infimum, also known as *greatest lower bound*) indicates that we are only interested in the smallest cost.

### 6.2 Why is Wasserstein Better than JS or KL Divergence?

Even when two distributions are located in lower dimensional manifolds without overlaps, Wasserstein distance can still provide a meaningful and smooth representation of the distance in-between.

The WGAN paper exemplified the idea with a simple example.

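Suppose we have two probability distributions, $P$ and $Q$:

$$\forall (x, y) \in P, \; x = 0 \text{ and } y \sim U(0, 1)$$

$$\forall (x, y) \in Q, \; x = \theta, \; 0 \leq \theta \leq 1 \text{ and } y \sim U(0, 1)$$

Figure 8: There is no overlap between $P$ and $Q$ when $\theta \neq 0$.

When $\theta \neq 0$:

$$\begin{aligned}
D_{KL}(P \| Q) &= \sum_{x = 0, \, y \sim U(0, 1)} 1 \cdot \log \frac{1}{0} = +\infty \\
D_{KL}(Q \| P) &= \sum_{x = \theta, \, y \sim U(0, 1)} 1 \cdot \log \frac{1}{0} = +\infty \\
D_{JS}(P, Q) &= \frac{1}{2} \left( \sum_{x = 0, \, y \sim U(0, 1)} 1 \cdot \log \frac{1}{1/2} + \sum_{x = \theta, \, y \sim U(0, 1)} 1 \cdot \log \frac{1}{1/2} \right) = \log 2 \\
W(P, Q) &= |\theta|
\end{aligned}$$

But when $\theta = 0$, two distributions are fully overlapped:

$$\begin{aligned}
D_{KL}(P \| Q) &= D_{KL}(Q \| P) = D_{JS}(P, Q) = 0 \\
W(P, Q) &= 0 = |\theta|
\end{aligned}$$

$D_{KL}$ gives us infinity when two distributions are disjoint. The value of $D_{JS}$ has a sudden jump and is not differentiable at $\theta = 0$. Only the Wasserstein metric provides a smooth measure, which is very helpful for a stable learning process using gradient descent.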
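The comparison can be reproduced numerically with a small simplification, stated here as an assumption: since both distributions share the same uniform $y$ coordinate, only the $x$ coordinate matters, so each distribution is treated as a single point mass (at 0 and at $\theta$). The natural logarithm is used.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions; infinite if q misses mass of p."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """D_JS(p || q) through the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def compare(theta):
    """Return (KL, JS, W) for P located at x = 0 and Q located at x = theta."""
    if theta == 0.0:
        p, q = [1.0], [1.0]              # both masses sit on the same point
    else:
        p, q = [1.0, 0.0], [0.0, 1.0]    # masses on the two points 0 and theta
    w = abs(theta)                       # all mass travels a distance |theta|
    return kl(p, q), js(p, q), w

for theta in [1.0, 0.5, 0.1, 0.01, 0.0]:
    print(theta, compare(theta))
```

As $\theta$ shrinks toward zero, KL stays infinite and JS stays at $\log 2$ until both drop to zero at $\theta = 0$, while the Wasserstein distance decreases smoothly.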

### 6.3 Use Wasserstein Distance as GAN Loss Function

It is intractable to exhaust all the possible joint distributions in $\Pi(p_r, p_g)$ to compute $\inf_{\gamma \in \Pi(p_r, p_g)}$. Thus the authors proposed a smart transformation of the formula based on the Kantorovich-Rubinstein duality to:

$$W(p_r, p_g) = \frac{1}{K} \sup_{\| f \|_L \leq K} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

where $\sup$ (supremum) is the opposite of $\inf$ (infimum); we want to measure the least upper bound or, in even simpler words, the maximum value.

#### 6.3.1 Lipschitz Continuity

The function $f$ in the new form of the Wasserstein metric is required to satisfy $\| f \|_L \leq K$, meaning it should be K-Lipschitz continuous.

A real-valued function $f: \mathbb{R} \to \mathbb{R}$ is called $K$-Lipschitz continuous if there exists a real constant $K \geq 0$ such that, for all $x_1, x_2 \in \mathbb{R}$,

$$|f(x_1) - f(x_2)| \leq K |x_1 - x_2|$$

Here $K$ is known as a Lipschitz constant for function $f(\cdot)$. Functions that are everywhere continuously differentiable with a bounded derivative are Lipschitz continuous, because the slope $\frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|}$ is then bounded. However, a Lipschitz continuous function may not be everywhere differentiable, such as $f(x) = |x|$.
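To get a feel for the definition, the sketch below measures the largest slope between pairs of sample points. This only gives a lower bound on the Lipschitz constant, since a finite grid cannot check all pairs; the grid on $[-5, 5]$ is an arbitrary choice for illustration.

```python
def lipschitz_lower_bound(f, points):
    """Largest slope |f(a) - f(b)| / |a - b| over all pairs of sample points."""
    best = 0.0
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            best = max(best, abs(f(a) - f(b)) / abs(a - b))
    return best

# Sample points in [-5, 5] with spacing 0.1 (an arbitrary illustrative grid).
grid = [i / 10.0 for i in range(-50, 51)]

print(lipschitz_lower_bound(abs, grid))               # 1.0, so |x| is 1-Lipschitz here
print(lipschitz_lower_bound(lambda x: x * x, grid))   # about 9.9, grows with the interval
```

The second result depends on the interval: $x^2$ is continuously differentiable everywhere but its derivative is unbounded on the whole real line, so no single constant $K$ works globally.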

Explaining how the transformation happens on the Wasserstein distance formula is worthy of a long post by itself, so I skip the details here. If you are interested in how to compute the Wasserstein metric using linear programming, or how to transfer the Wasserstein metric into its dual form according to the Kantorovich-Rubinstein Duality, read this awesome post.

#### 6.3.2 Wasserstein Loss Function

Suppose this function $f$ comes from a family of K-Lipschitz continuous functions, $\{f_w\}_{w \in W}$, parameterized by $w$. In the modified Wasserstein-GAN, the "discriminator" model is used to learn $w$ to find a good $f_w$ and the loss function is configured as measuring the Wasserstein distance between $p_r$ and $p_g$.

$$L(p_r, p_g) = W(p_r, p_g) = \max_{w \in W} \mathbb{E}_{x \sim p_r}[f_w(x)] - \mathbb{E}_{z \sim p_z(z)}[f_w(g_\theta(z))]$$

Thus the "discriminator" is not a direct critic of telling the fake samples apart from the real ones anymore. Instead, it is trained to learn a $K$-Lipschitz continuous function to help compute the Wasserstein distance. As the loss function decreases in the training, the Wasserstein distance gets smaller and the generator model’s output grows closer to the real data distribution.

One big problem is to maintain the $K$-Lipschitz continuity of $f_w$ during the training in order to make everything work out. The paper presents a simple but very practical trick: after every gradient update, clamp the weights $w$ to a small window, such as $[-0.01, 0.01]$, resulting in a compact parameter space $W$, so that $f_w$ obtains lower and upper bounds and preserves the Lipschitz continuity.

Compared to the original GAN algorithm, the WGAN undertakes the following changes:

• After every gradient update on the critic function, clamp the weights to a small fixed range, $[-c, c]$.

• Use a new loss function derived from the Wasserstein distance, with no logarithm anymore. The "discriminator" model does not play as a direct critic but as a helper for estimating the Wasserstein metric between the real and generated data distributions.

• Empirically the authors recommended the RMSProp optimizer on the critic, rather than a momentum based optimizer such as Adam, which could cause instability in the model training. I haven’t seen a clear theoretical explanation on this point though.
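The clipping step and the logarithm-free objective can be illustrated with a deliberately tiny toy, which is not the algorithm from the paper: the critic is assumed to be a one-parameter linear function $f_w(x) = w \cdot x$ on scalar data, plain gradient ascent replaces RMSProp, and the two Gaussian samples are hypothetical stand-ins for $p_r$ and $p_g$.

```python
import random

def train_linear_critic(real, fake, clip=0.01, lr=0.005, steps=50):
    """Gradient ascent on E[f_w(real)] - E[f_w(fake)] with f_w(x) = w * x."""
    w = 0.0
    mean_real = sum(real) / len(real)
    mean_fake = sum(fake) / len(fake)
    for _ in range(steps):
        grad = mean_real - mean_fake      # d/dw of the critic objective
        w += lr * grad                    # plain ascent step (assumption: no RMSProp)
        w = max(-clip, min(clip, w))      # weight clipping to [-c, c]
    estimate = w * (mean_real - mean_fake)  # critic objective at the final w
    return w, estimate

random.seed(0)
real = [random.gauss(2.0, 1.0) for _ in range(1000)]  # stand-in for samples of p_r
fake = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for samples of p_g

w, estimate = train_linear_critic(real, fake)
print(w, estimate)
```

The weight ends up pinned at the clipping boundary, so the critic objective is the gap between the sample means scaled by $c$. This hints at why the choice of clipping window matters so much for the gradients passed to the generator.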

Sadly, Wasserstein GAN is not perfect. Even the authors of the original WGAN paper mentioned that "Weight clipping is a clearly terrible way to enforce a Lipschitz constraint". WGAN still suffers from unstable training, slow convergence after weight clipping (when clipping window is too large), and vanishing gradients (when clipping window is too small).

Some improvement, precisely replacing weight clipping with gradient penalty, has been discussed in wgan2017improve .