
# On the regularization of Wasserstein GANs

Since their invention, generative adversarial networks (GANs) have become a popular approach for learning to model a distribution of real (unlabeled) data. Convergence problems during training are overcome by Wasserstein GANs which minimize the distance between the model and the empirical distribution in terms of a different metric, but thereby introduce a Lipschitz constraint into the optimization problem. A simple way to enforce the Lipschitz constraint on the class of functions, which can be modeled by the neural network, is weight clipping. It was proposed that training can be improved by instead augmenting the loss by a regularization term that penalizes the deviation of the gradient of the critic (as a function of the network's input) from one. We present theoretical arguments why using a weaker regularization term enforcing the Lipschitz constraint is preferable. These arguments are supported by experimental results on toy data sets.


## 1 Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a class of generative models that have recently gained a lot of attention. They are based on the idea of defining a game between two competing neural networks (NNs): a generator and a classifier (or discriminator). While the classifier aims at distinguishing generated from real data, the generator tries to generate samples which the classifier cannot distinguish from samples from the empirical distribution.

The original training procedure for GANs was unstable and suffered from mode collapse even when learning to model simple mixtures of Gaussians (see e.g. Tolstikhin et al. (2017)). Recognizing the potential of this new approach to generative modeling, more recent work has therefore focused on stabilizing training, including ensemble methods (Tolstikhin et al., 2017), improved network structure (Radford et al., 2015; Salimans et al., 2016a) and theoretical improvements (Nowozin et al., 2016; Salimans et al., 2016b; Arjovsky & Bottou, 2017; Chen et al., 2016), which helped to successfully model complex distributions using GANs.

It was proposed by Arjovsky et al. (2017) to train generator and discriminator networks by minimizing the Wasserstein-1 distance, a distance with properties superior to the Jensen-Shannon distance (used in the original GAN) in terms of convergence. Accordingly, this version of GAN was called Wasserstein GAN (WGAN). The change of metric introduces a new minimization problem, which requires the discriminator function to lie in the space of 1-Lipschitz functions.

In the same paper, the Lipschitz constraint was guaranteed by performing weight clipping, i.e., by constraining the parameters of the discriminator NN to be smaller than a given value in absolute value. An improved training strategy was proposed by Gulrajani et al. (2017) based on results from optimal transport theory (see (Villani, 2008)). Here, instead of weight clipping, the loss gets augmented by a regularization term that penalizes any deviation of the gradient of the critic function (with respect to its input) from one.

We review these results, and we present both theoretical considerations and empirical results on toy datasets, leading to the proposal of a less restrictive regularization term for WGANs. (Footnote 1: In the blog post https://lernapparat.de/improved-wasserstein-gan/, which was written simultaneously to our work, the author presents some ideas that follow a similar intuition as the one underlying our arguments.) More precisely, our contributions are as follows:

• We present theoretical considerations illustrating why weight clipping is harmful for WGAN training.

• We review the arguments that the regularization technique proposed by Gulrajani et al. (2017) is based on and present a simplified proof.

• The regularization term used by Gulrajani et al. (2017) requires training samples and generated samples to be drawn from a certain joint distribution. However, in practice samples are drawn independently from their marginals. We explain why this can be harmful for training.

• The arguments of Gulrajani et al. (2017) further assume the discriminator to be differentiable. We present examples explaining why the conclusion that follows from this assumption can be problematic.

• We present empirical results on toy datasets strongly supporting our theoretical considerations.

We begin by reviewing the required basic concepts from optimal transport theory in Section 2, and we introduce GANs and WGANs in Sections 3 and 4. We discuss the restrictiveness of weight clipping in Section 5 and discuss problems with the gradient penalty introduced by Gulrajani et al. (2017) in Section 6, before proposing our modification in Section 7. The experimental results supporting our theoretical considerations can be found in Section 8. We conclude in Section 9.

## 2 Optimal Transport

We will require the notion of a coupling of two probability distributions. Although a coupling can be defined more generally, we state the definition in the setting of our interest, i.e., we consider all spaces involved to equal $\mathbb{R}^n$.

###### Definition 1.

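Let $\mu$ and $\nu$ be two probability distributions on $\mathbb{R}^n$. A coupling $\pi$ of $\mu$ and $\nu$ is a probability distribution on $\mathbb{R}^n \times \mathbb{R}^n$ such that $\pi(A \times \mathbb{R}^n) = \mu(A)$ and $\pi(\mathbb{R}^n \times A) = \nu(A)$ for all measurable sets $A \subseteq \mathbb{R}^n$. The set of all couplings of $\mu$ and $\nu$ is denoted by $\Pi(\mu, \nu)$.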

Let us now recall the Kantorovich duality. Note that the presented theorem is a less general version of Theorem 5.10 from Villani (2008), adapted to our needs. (Footnote 2: In the work of Arjovsky et al. (2017), the theorem is called Kantorovich-Rubinstein, although this theorem only applies to compact metric spaces. There are several generalizations of the Kantorovich duality. For a detailed account we refer the reader to the work of Edwards (2011).) A proof of how to derive our version from the referenced one can be found in Appendix A.1. We will denote by $\mathrm{Lip}_1$ the set of all 1-Lipschitz functions, i.e., the set of all functions $f \colon \mathbb{R}^n \to \mathbb{R}$ such that $|f(x) - f(y)| \le \|x - y\|_2$ for all $x, y \in \mathbb{R}^n$.

###### Theorem 1 (Kantorovich).

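Let $\mu$ and $\nu$ be two probability distributions on $\mathbb{R}^n$ such that

$$\int_{\mathbb{R}^n} \|x\|_2 \, d\mu(x) < \infty \quad \text{and} \quad \int_{\mathbb{R}^n} \|x\|_2 \, d\nu(x) < \infty.$$

Then

(i)

$$\min_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|x - y\|_2 \, d\pi(x,y) = \max_{f \in \mathrm{Lip}_1} \left( \int_{\mathbb{R}^n} f(x) \, d\mu(x) - \int_{\mathbb{R}^n} f(x) \, d\nu(x) \right). \tag{1}$$

In particular, both minimum and maximum exist.

(ii) The following two statements are equivalent:

• $\pi^*$ is an optimal coupling (minimizing the value on the left-hand side of Eq. (1)).

• Any optimal function $f^*$ (at which the maximum is attained for the right-hand side of Eq. (1)) satisfies

$$f^*(x) - f^*(y) = \|x - y\|_2$$

for all $(x, y)$ in the support of $\pi^*$.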

One can think of the minimizing coupling $\pi^*$ as an assignment of starting points to end points that leads to minimal overall transport costs when moving $\mu$ onto $\nu$, measured in the distance $d$ (here the Euclidean distance). When the weights to be transported are thought of as goods, the maximizing function $f^*$ can be thought of as an optimal price function from the viewpoint of a transport company that buys units of goods at the starting locations $x$ and sells them at the end locations $y$. The price $f^*(x)$ at each point $x$ indicates the amount to charge or pay per unit of goods at this position. Maximizing with respect to the constraint $f \in \mathrm{Lip}_1$ means maximizing the profit of the company (which can potentially offer the transport for lower costs than given by the distance function) under the constraint of never charging more than what the client would pay when transporting the goods himself at Euclidean distance costs.
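The duality can be checked numerically on a tiny one-dimensional example. The following is a minimal sketch, assuming two empirical distributions with the same number of equally weighted points, so that an optimal coupling can be found among the one-to-one matchings; the function names and the sample values are illustrative and not taken from the paper.

```python
import itertools
import numpy as np

def w1_bruteforce(x, y):
    """Primal side: minimal mean transport cost over all one-to-one matchings."""
    n = len(x)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        cost = np.mean(np.abs(x - y[list(perm)]))
        best = min(best, cost)
    return best

def w1_sorted(x, y):
    """In one dimension, matching the sorted samples is optimal."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

x = np.array([0.0, 1.0, 3.0])   # stand-in for real samples (mu)
y = np.array([2.0, 5.0, 4.0])   # stand-in for generated samples (nu)

primal = w1_bruteforce(x, y)
# Dual side: f(t) = -t is 1-Lipschitz and attains the maximum here,
# because in the sorted matching every y lies to the right of its coupled x.
dual = np.mean(-x) - np.mean(-y)
```

In this example the primal value, the sorted-matching shortcut, and the dual value of the candidate critic all coincide, as Theorem 1 (i) predicts.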

## 3 Generative adversarial networks

Formally, given an empirical distribution $\mu$, a class of generative distributions $\nu$ over some space $\mathcal{X}$, and a class of discriminators $d \colon \mathcal{X} \to [0, 1]$, GAN training (Goodfellow et al., 2014) aims at solving the following optimization problem (footnote 3: usually, both the generative distribution and the discriminator are modeled by NNs, whose structures determine the two classes we are optimizing over):

$$\min_{\nu} \max_{d} \; \mathbb{E}_{x \sim \mu}[\log(d(x))] + \mathbb{E}_{y \sim \nu}[\log(1 - d(y))]. \tag{2}$$

In practice, the parameters of the generator and the discriminator networks are updated in an alternating fashion based on (several steps of) stochastic gradient descent. The discriminator thereby tries to assign a value close to zero to generated data points and values close to one to real data points. As an opposing agent, the generator aims to produce data where the discriminator expects to see real data.
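As an illustration of the objective in Eq. (2), the value of the game can be estimated from discriminator outputs on a batch of real and generated samples. This is a minimal sketch; the function name `gan_value` and the stabilizing constant `eps` are assumptions made for the example and are not part of the original formulation.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log d(x)] + E[log(1 - d(y))].

    d_real, d_fake: discriminator outputs in (0, 1) on real / generated samples.
    eps is a hypothetical numerical-stability constant, not part of Eq. (2).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# An undecided discriminator (always 0.5) yields 2 * log(0.5) = -log(4).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes the value towards 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator step increases this value, while the generator step decreases it by moving generated samples to where the discriminator outputs values close to one.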

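Theorem 1 by Goodfellow et al. (2014) shows that, if the optimal discriminator is found in each iteration, minimization of the resulting loss function of the generator leads to minimization of the Jensen-Shannon (JS) divergence

$$\mathrm{JS}(\mu, \nu) = \tfrac{1}{2} \mathrm{KL}\left(\mu \,\Big\|\, \tfrac{\mu + \nu}{2}\right) + \tfrac{1}{2} \mathrm{KL}\left(\nu \,\Big\|\, \tfrac{\mu + \nu}{2}\right),$$

where $\mathrm{KL}$ denotes the Kullback-Leibler divergence.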
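For discrete distributions the JS divergence is easy to evaluate. The sketch below assumes the standard definition via the mixture $m = (\mu + \nu)/2$ and natural logarithms; the helper names are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two point masses at different locations: the JS divergence saturates
# at log(2), no matter how far apart the locations are.
near = js([1, 0, 0], [0, 1, 0])
far = js([1, 0, 0], [0, 0, 1])
same = js([0.5, 0.5], [0.5, 0.5])
```

The two disjoint cases give the same value, which illustrates why the JS divergence provides no signal about how far generated data lies from the real data once the supports are disjoint.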

## 4 Wasserstein GANs

Instead of minimizing the JS divergence, Arjovsky et al. (2017) proposed to minimize the Wasserstein-1 distance, also known as Earth-Mover (EM) distance, which is defined for any Polish space $(M, d)$ and probability distributions $\mu$ and $\nu$ on $M$ by

$$W(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{M \times M} d(x, y) \, d\pi(x, y). \tag{3}$$

From the Kantorovich duality (see Theorem 1, (i), in Section 2) it follows that, in the special case we are considering, the infimum is attained and, instead of computing the minimum in Equation (3), the Wasserstein-1 distance can also be computed as

$$W(\mu, \nu) = \max_{f \in \mathrm{Lip}_1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)], \tag{4}$$

where the maximum is taken over the set $\mathrm{Lip}_1$ of all 1-Lipschitz functions $f$.

Thus, the WGAN objective is to solve

$$\min_{\nu} \max_{f \in \mathrm{Lip}_1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)], \tag{5}$$

which can be achieved by alternating gradient descent updates for the generating network and the 1-Lipschitz function $f$ (also modeled by a NN), just as in the case of the original GAN. The objective of the generator is still to generate real-looking data points and is led by the function values of $f$, which plays the role of an appraiser (or critic). The appraiser's goal is to assign a value of confidence to each data point, which is as low as possible on generated data points and as high as possible on real data. The confidence value it can assign is bounded by a constraint of similarity, where similarity is measured by the distance of data points. This can be motivated by the idea that similar points should have similar values of confidence for being real. It is clear, then, that the critic function depends on the metric space in which the data points lie.

An issue of the original GAN discriminator was that its decision equals zero whenever it is certain to see generated data, independent of how far away a generated data point lies from the real distribution. As a consequence, locally, there is no incentive for the generator to generate a value closer to (but still off) the real data; the discriminator's optimal value is zero in either case. The WGAN's optimal critic function measures this distance, which helps the generated distribution to converge, but the interpretation of the absolute value as real (close to 1) or fake (close to 0) is lost. Worse, even the relative value of the optimal critic function does not reliably help to decide what is real and what is fake.

###### Observation 1.

The generator could learn wrong things, if it bases its decision on the relative values of the optimal critic function (i.e., if it generates at locations of high values of the critic function).

Consider the setting in Figure 1, where the X's represent generated and the O's represent real data points.

An optimal coupling in this example is quite obvious: we connect the left-most O with the X on the left, and then extend by an arbitrary matching of the other O's with the other X's. It is then not hard to verify that the indicated critic function leads to an equality in the Kantorovich duality and hence is optimal. The value of the critic function at the left-most X is higher than the value at the right-most O, suggesting to generate images at the wrong position.

This issue might be fixed by the alternating updates at a later stage of training, when fewer X's are generated that far to the right of the O's. The critic function will then flatten the peak, eventually assigning a lower value to an X on the left than to any of the O's.

###### Remark 1.

One can show that the same holds (with only a slight change of the critic function $f$) if the X's and O's denote the centers of Gaussians. This can be shown with similar arguments as those in the proofs in Appendix A.3.

## 5 Weight clipping to enforce the Lipschitz constraint

The critic function is generated by a neural network, which raises the question of how to enforce the 1-Lipschitz constraint in the maximization problem of the objective in Equation (5).

In the following, we will denote by $\mathrm{Lip}_\alpha$ the set of all Lipschitz-continuous functions with Lipschitz constant $\alpha$, i.e., the set of all functions $f$ such that $|f(x) - f(y)| \le \alpha \|x - y\|_2$ for all $x, y$. Moreover, we denote by $\|f\|_L$ the Lipschitz constant of $f$, i.e., the smallest $\alpha$ such that $|f(x) - f(y)| \le \alpha \|x - y\|_2$ for all $x, y$. As Arjovsky et al. (2017) point out, it does not matter whether we maximize over $\mathrm{Lip}_1$ or $\mathrm{Lip}_\alpha$, since we can equivalently optimize $\alpha \cdot W(\mu, \nu)$ instead of $W(\mu, \nu)$.

As proposed by Arjovsky et al. (2017), a simple way to restrict the class of functions $f$ that can be modeled by the neural network to $\mathrm{Lip}_\alpha$ (for some $\alpha$) is to perform weight clipping, i.e., to enforce that the parameters of the network do not exceed a certain value in absolute value. As the authors note, this is a simple but not a good choice. This is further demonstrated by the line of thought outlined below, which leads to the following observation.
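Weight clipping itself amounts to a projection of every parameter onto a box after each update. A minimal sketch, assuming the critic parameters are stored as a list of NumPy arrays; the parameter shapes and the clipping value `c=0.01` are chosen purely for illustration.

```python
import numpy as np

def clip_weights(params, c):
    """Project every parameter array onto the box [-c, c] (weight clipping)."""
    return [np.clip(p, -c, c) for p in params]

rng = np.random.default_rng(0)
# Hypothetical critic parameters: two weight matrices and a bias vector.
params = [rng.normal(size=(2, 4)), rng.normal(size=(4, 1)), rng.normal(size=(4,))]
clipped = clip_weights(params, c=0.01)
max_abs = max(np.abs(p).max() for p in clipped)
```

The clipping value bounds each weight individually; the Lipschitz constant that results for the whole network is only implied by it and depends on the architecture.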

###### Observation 2.

Weight clipping is not a good strategy to enforce the Lipschitz constraint for the critic function.

Let us first note, that an easy consideration leads to the following lemma.

###### Lemma 1.

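The optimal function $f^*$ (leading to the maximum in Eq. (4)) exhausts the Lipschitz constraint for given $\alpha$ in the sense that there is a pair of points $(x, y)$ such that $|f^*(x) - f^*(y)| = \alpha \|x - y\|_2$.

###### Proof.

If

$$\sup_{x \neq y} \left\{ \frac{f^*(y) - f^*(x)}{\|x - y\|_2} \right\} = c < \alpha,$$

then $\tfrac{\alpha}{c} f^*$ generates a contradiction to the optimality of $f^*$: it still lies in $\mathrm{Lip}_\alpha$ but attains a larger objective value.

(Alternatively, in the case $\alpha = 1$, it follows directly from Theorem 1 (ii) that the transport is optimal if and only if the Lipschitz constraint of one is exhausted for any two points of the coupling.) ∎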
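Spelled out, the scaling argument reads as follows. This is a sketch under the assumption that the optimal objective value is strictly positive (i.e., $\mu \neq \nu$, which also forces $c > 0$); the name $g$ for the rescaled function is introduced here only for illustration.

```latex
\begin{align*}
g &:= \tfrac{\alpha}{c}\, f^*, \qquad
|g(x) - g(y)| = \tfrac{\alpha}{c}\,|f^*(x) - f^*(y)|
  \le \tfrac{\alpha}{c}\, c\, \|x - y\|_2 = \alpha \|x - y\|_2, \\
\mathbb{E}_{x \sim \mu}[g(x)] - \mathbb{E}_{y \sim \nu}[g(y)]
  &= \tfrac{\alpha}{c} \left( \mathbb{E}_{x \sim \mu}[f^*(x)] - \mathbb{E}_{y \sim \nu}[f^*(y)] \right)
  > \mathbb{E}_{x \sim \mu}[f^*(x)] - \mathbb{E}_{y \sim \nu}[f^*(y)].
\end{align*}
```

The first line shows that $g$ is admissible, the second that it improves on $f^*$ because $\alpha / c > 1$.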

Thus, we know that an optimal $f^*$ exhausts the Lipschitz constraint. We will now show, using deep NNs with rectified linear unit (ReLU) activation functions as an example, that only an extremely limited number of the functions generated by the NN under weight clipping exhaust the Lipschitz constraint. It follows that, in almost all cases, the optimal $f^*$ is not in the class of functions that can be generated by the network under the weight clipping constraint.

First note that by clipping the weights we enforce a common Lipschitz constraint, where the common Lipschitz constant $\alpha$ is defined as the minimal $\alpha$ such that $|f(x) - f(y)| \le \alpha \|x - y\|_2$ for all $x, y$ and all functions $f$ that can be generated by the network under weight clipping. The actual value of $\alpha$ does not follow directly from the weight clipping constant but can be computed from the structure of the network. The following result determines the subset of all functions that can be generated by the network under weight clipping which exhaust the implicit Lipschitz constant $\alpha$.

###### Proposition 1.

Consider a (deep) NN with ReLU activation functions and linear output layer. A function generated by the NN under constraining each weight in absolute value by $w_{\max}$ exhausts the common Lipschitz constraint if and only if

• The weight matrix of the first layer consists of constant columns with value $w_{\max}$ or $-w_{\max}$.

• The weights of all other layers are given by a matrix with every entry equal to $w_{\max}$.

The proof is given in Appendix A.2.

It is therefore quite clear that we need to find a different way to enforce the Lipschitz constraint appearing in the Wasserstein objective. In the following we discuss approaches using a regularization term.

## 6 Improved training of WGANs

Gulrajani et al. (2017) propose an alternative to weight clipping leading to what is called improved training of WGANs. The basic idea is to augment the loss by a regularization term that penalizes the deviation of the gradient norm of the critic with respect to its input from one (referred to as WGAN-GP, where GP stands for gradient penalty). More precisely, the loss of the critic to be minimized is given by

$$\mathbb{E}_{y \sim \nu}[f(y)] - \mathbb{E}_{x \sim \mu}[f(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim \tau}\left[ \left( \|\nabla f(\hat{x})\|_2 - 1 \right)^2 \right], \tag{6}$$

where $\tau$ is the distribution of $\hat{x} = t x + (1 - t) y$ for $t \sim U[0, 1]$, $x \sim \mu$, and $y \sim \nu$.
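The penalty term of Eq. (6) can be sketched in a few lines. The example assumes that $\hat{x}$ is a uniformly random point on the segment between a real and a generated sample, as in Gulrajani et al. (2017), and uses a toy linear critic whose gradient is known in closed form; all names are illustrative.

```python
import numpy as np

def sample_interpolates(x, y, rng):
    """Draw x_hat = t * x + (1 - t) * y with one t ~ U[0, 1] per pair."""
    t = rng.uniform(size=(x.shape[0], 1))
    return t * x + (1.0 - t) * y

def gradient_penalty(grad_f, x_hat):
    """Two-sided penalty: mean of (||grad f(x_hat)||_2 - 1)^2."""
    norms = np.linalg.norm(grad_f(x_hat), axis=1)
    return np.mean((norms - 1.0) ** 2)

# Toy critic f(x) = w . x with known gradient w (illustration only).
w = np.array([0.6, 0.8])             # ||w||_2 = 1
grad_f = lambda pts: np.tile(w, (pts.shape[0], 1))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))          # stand-in for real samples
y = rng.normal(size=(5, 2)) + 3.0    # stand-in for generated samples
x_hat = sample_interpolates(x, y, rng)
penalty_unit = gradient_penalty(grad_f, x_hat)                      # gradient norm 1
penalty_half = gradient_penalty(lambda p: 0.5 * grad_f(p), x_hat)   # gradient norm 0.5
```

Note that the two-sided penalty also punishes the critic with gradient norm 0.5, although that critic satisfies the 1-Lipschitz constraint.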

The regularization term is derived based on the following result.

###### Proposition 2.

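Let $\mu$ and $\nu$ be two probability distributions on $\mathbb{R}^n$. Let $f^*$ be an optimal critic, leading to the maximum

$$\max_{f \in \mathrm{Lip}_1} \int_{\mathbb{R}^n} f(x) \, d\mu(x) - \int_{\mathbb{R}^n} f(x) \, d\nu(x),$$

and let $\pi^*$ be an optimal coupling with respect to

$$\min_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|x - y\|_2 \, d\pi(x, y).$$

If $f^*$ is differentiable and $x_t = t x + (1 - t) y$ for $0 \le t \le 1$, it holds that

$$\mathbb{P}_{(x,y) \sim \pi^*}\left[ \nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|} \right] = 1.$$

This in particular implies that the norms of the gradients are one $\pi^*$-almost surely.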

For the convenience of the reader, we provide a simple argument for obtaining this result.

###### Proof.

It follows from Theorem 1 (ii) that for all $(x, y)$ in the support of $\pi^*$ we have $f^*(x) - f^*(y) = \|x - y\|_2$. Considering the line between $x$ and $y$, the 1-Lipschitz constraint implies that the values of $f^*$ have to follow a linear function (since assuming that the slope was smaller than one at some point would imply that the function must have a slope larger than one somewhere else between $x$ and $y$, which contradicts the 1-Lipschitz constraint). It follows that at each point on the line, the directional derivative in the direction pointing from the real data point to the generated one (which are coupled by the corresponding optimal coupling) has absolute value one. Since, by the 1-Lipschitz constraint, the maximal absolute value of a directional derivative at any point in any direction is one, the given direction is the direction of steepest descent, i.e., it is parallel to the gradient, which therefore has norm one. ∎

Note that Proposition 2 holds only when $f^*$ is differentiable and $x$ and $y$ are sampled from the optimal coupling $\pi^*$. However, sampling independently from the marginal distributions $\mu$ and $\nu$ very likely results in points $(x, y)$ that lie outside the support of $\pi^*$. Furthermore, the optimal critic function $f^*$ need not be differentiable everywhere. These two points will be discussed in more detail in the following subsections.

### 6.1 Sampling from the marginals instead of the coupling

###### Observation 3.

Suppose $f^*$ is an optimal critic function and $\pi^*$ the optimal coupling determined by the Kantorovich duality in Theorem 1. Then $\|\nabla f^*(x_t)\|_2 = 1$ on the line $x_t = t x + (1 - t) y$, $0 \le t \le 1$, for $(x, y)$ sampled from $\pi^*$, but not necessarily on the lines connecting an arbitrary pair of a real and a generated data point.

For a first example consider the left-most X in Figure 1 together with any but the left-most O's. A more striking, simple one-dimensional example is the one in Figure 2, where again every X denotes a sample from the generator and every O a sample of real data.

An optimal coupling is indicated in red. Then

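$$\int_{\mathbb{R} \times \mathbb{R}} |x - y| \, d\pi^*(x, y) = \tfrac{1}{7}(1+1+1+1+1+1+1) = \int_{\mathbb{R}} f^*(x) \, d\nu(x) - \int_{\mathbb{R}} f^*(x) \, d\mu(x).$$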

Therefore and are indeed optimal. Now note that, for example, for the left-most and the right-most we have .

We also provide a 2-dimensional example in Figure 3, where a different phenomenon leads to the same conclusion.

Again, samples from the discrete empirical and generated distributions are indicated by O's and X's, respectively. The blue numbers, with a variable $a$, denote the values of an optimal critic function at these points (the values at these points are all that matters). We fixed the value at one position to be zero, taking into account that an optimal critic function remains optimal under addition of an arbitrary constant. The indicated coupling in red and the critic function are optimal, since with this choice we have

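$$\int_{\mathbb{R}^n \times \mathbb{R}^n} \|x - y\|_2 \, d\pi^*(x, y) = \tfrac{1}{2}(1 + 1) = 1$$

and

$$\int_{\mathbb{R}^n} f^*(y) \, d\nu(y) - \int_{\mathbb{R}^n} f^*(x) \, d\mu(x) = \tfrac{1}{2}(1 + a + 1) - \tfrac{1}{2}(0 + a) = 1.$$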

Since the Lipschitz constraint of must be satisfied we get and . Therefore and one of the inequalities of the Lipschitz constraint must be strict.

### 6.2 Differentiability of the critic

###### Observation 4.

The assumption of differentiability is not valid at points of interest.

Consider the example of two discrete probability distributions and their optimal critic function $f^*$ shown in Figure 4, which is an excerpt of the example in Figure 2.

We can see that the indicated function $f^*$ is optimal as it leads to an equality in the Kantorovich duality. (Also, it is the only continuous function, up to a constant, that realizes $|f^*(x) - f^*(y)| = |x - y|$ for coupled points $(x, y)$.) However, it is not differentiable at 0.

Of course, it might seem to be quite a strong assumption that two points of a discrete distribution lie at exactly one location. If these points lie slightly apart instead, there is a differentiable optimal critic function $f^*$ with a derivative of norm one at both points. One may also argue that we are usually not interested in learning discrete but continuous distributions and that discrete examples might not be representative of the continuous case. However, the counterexample can be made continuous by considering the points as the center points of Gaussians, as illustrated in Figure 5.

This is formalized by the following theorem, which shows that the critic indicated in red is indeed optimal for the depicted orange Gaussian of real data and the blue mixture of two Gaussians of generated data.

###### Theorem 2.

Let $\mu$ be a normal distribution centered around zero and $\nu$ be a mixture of two normal distributions over the real line, as depicted in Figure 5. If $\mu$ describes the distribution of real data and $\nu$ describes the distribution of the generative model, then the optimal critic function is given by

$$\varphi^*(x) = -|x|.$$

The proof can be found in Appendix A.3.

Furthermore, one-dimensional examples might not be considered representative. However, the issue with non-differentiability can be generalized to higher-dimensional spaces based on the observation that an optimal coupling is in general not deterministic, as defined in the following.

###### Definition 2.

Let $(\mathcal{X}, \mu)$ and $(\mathcal{Y}, \nu)$ be two probability spaces. A coupling $\pi \in \Pi(\mu, \nu)$ is called deterministic if there exists a measurable function $\rho \colon \mathcal{X} \to \mathcal{Y}$ such that

$$\mathrm{supp}(\pi) \subseteq \{ (x, \rho(x)) \mid x \in \mathcal{X} \}.$$

We can now formulate the following observation.

###### Observation 5.

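Suppose $\pi^*$ is a non-deterministic optimal coupling between two probability distributions over $\mathbb{R}^n$ so that there exist points $(x, y)$ and $(x, y')$ in $\mathrm{supp}(\pi^*)$. Suppose further that there is no $\gamma \in \mathbb{R}$ with $y' - x = \gamma (y - x)$ (in particular this implies $y \neq y'$). Then any optimal critic function is not differentiable at $x$.

For the coupled points $(x, y)$ and $(x, y')$ we have that the directional derivatives of $f^*$ at $x$ in the directions of $y - x$ and $y' - x$, respectively, have an absolute value of one. If there are two such directions and $f^*$ is differentiable, then the norm of its gradient must be larger than one, contradicting the 1-Lipschitz constraint. Indeed, recall that, considering $f$ as a function on the line $x + t v$ with $v$ of unit length, the slope of $f$ at $x$ is given by $\nabla f(x) \cdot v$. Now

$$\nabla f(x) \cdot v = \|\nabla f(x)\|_2 \cdot \cos(\theta_v) \tag{7}$$

with $\theta_v$ being the angle between the vector $\nabla f(x)$ and the unit vector $v$. Equation (7) with $\|\nabla f(x)\|_2 = 1$ has, up to sign, a unique solution for $v$ with $|\nabla f(x) \cdot v| = 1$. It follows that, if for two non-collinear directions $v \neq \pm v'$ we have $|\nabla f(x) \cdot v| = |\nabla f(x) \cdot v'| = 1$, then $|\cos(\theta_v)| < 1$ for at least one of them and hence $\|\nabla f(x)\|_2 > 1$.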
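The same conclusion can be reached without angles via the Cauchy-Schwarz inequality. The derivation below is a sketch assuming that both directional derivatives equal $-1$, which is the case for an optimal critic that decreases linearly from $x$ towards each of its coupled points (Theorem 1 (ii)); $v$ and $v'$ denote the two unit directions.

```latex
\begin{align*}
\nabla f(x) \cdot v = \nabla f(x) \cdot v' = -1
\;&\Longrightarrow\;
\nabla f(x) \cdot \tfrac{v + v'}{2} = -1, \\
v \neq v',\ \|v\|_2 = \|v'\|_2 = 1
\;&\Longrightarrow\;
\left\| \tfrac{v + v'}{2} \right\|_2 < 1, \\
1 = \left| \nabla f(x) \cdot \tfrac{v + v'}{2} \right|
\;&\le\; \|\nabla f(x)\|_2 \left\| \tfrac{v + v'}{2} \right\|_2
< \|\nabla f(x)\|_2 .
\end{align*}
```

The second line uses the strict convexity of the Euclidean norm, and the last strict inequality uses that $\nabla f(x) \neq 0$. A gradient norm larger than one contradicts the 1-Lipschitz constraint, so $f$ cannot be differentiable at $x$.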

Of course, the critic function generated by the NN will be (almost) everywhere differentiable (depending on the activation function). By the Stone-Weierstrass theorem, on compact sets, we can approximate any (Lipschitz-)continuous function by differentiable functions uniformly. It might therefore seem possible that a good approximating function may have a gradient norm of one for points sampled between coupled points. However, close to a point of non-differentiability, this seems to be a strong constraint and a sensible approximating function may at times rather have a gradient norm strictly smaller than 1 in the neighborhood of a non-differentiability (cf. Figure 4).

Therefore, we argue, in contrast to the argumentation of Gulrajani et al. (2017), that the gradient norm should not be assumed to equal one for arbitrary points on the line between $x$ and $y$ sampled from $\pi^*$.

### 6.3 A discussion of the regularization of WGAN-GP

Taken together, from a theoretical perspective, the regularization strategy by Gulrajani et al. (2017) may be problematic for two reasons:

• Firstly, as illustrated in Section 6.1, we have $|f^*(x) - f^*(y)| = \|x - y\|_2$ only for coupled pairs of points $(x, y)$, not for arbitrary points sampled from the marginals.

• Secondly, as discussed in Section 6.2, the assumption of differentiability of the critic function is used to derive a penalty enforcing $\|\nabla f(x_t)\|_2 = 1$ for all points $x_t$ on the line between points $(x, y)$, which will in general be difficult to satisfy by a differentiable approximating function.

## 7 How to regularize WGANs

In the following, we will discuss how the regularization of WGANs can be improved.

### 7.1 Penalizing the violation of the Lipschitz constraint

For the critic function, we have nothing more at hand than the inequality of the Lipschitz constraint. Moreover, as shown in Lemma 1, the Lipschitz constant is exhausted automatically when maximizing the objective function. Therefore, a natural choice of regularization is to penalize violations of the given constraint directly, i.e., to sample two points $x \sim \mu$ and $y \sim \nu$ and add the regularization term

$$\left( \max\left\{ 0, \frac{|f(x) - f(y)|}{\|x - y\|_2} - 1 \right\} \right)^2 \tag{8}$$

to the cost function. (We square to penalize larger deviations more than smaller ones.)
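A minimal sketch of the penalty in Eq. (8), averaged over a batch of pairs, could look as follows. The function name, the toy critics, and the sample points are assumptions made for illustration.

```python
import numpy as np

def pairwise_lipschitz_penalty(f, x, y):
    """Mean of (max{0, |f(x) - f(y)| / ||x - y||_2 - 1})^2 over sample pairs."""
    diff = np.abs(f(x) - f(y))
    dist = np.linalg.norm(x - y, axis=1)
    ratio = diff / dist          # assumes x_i != y_i for every pair
    return np.mean(np.maximum(0.0, ratio - 1.0) ** 2)

x = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([[3.0, 4.0], [1.0, 2.0]])

f_ok = lambda pts: pts[:, 0]             # 1-Lipschitz: no penalty
f_steep = lambda pts: 3.0 * pts[:, 1]    # slope 3 along the second axis

penalty_ok = pairwise_lipschitz_penalty(f_ok, x, y)
penalty_steep = pairwise_lipschitz_penalty(f_steep, x, y)
```

Only the critic that changes faster than the distance between the two points is penalized; a critic that stays below the bound incurs no cost at all.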

Alternatively, since the NN generates (almost everywhere) differentiable functions, we can penalize only when gradient norms are strictly larger than one, an option referred to as "one-sided penalty" and briefly discussed by Gulrajani et al. (2017) as an alternative to penalizing any deviation from one. (Footnote 4: While the authors note that "In practice, we found this [using the GP or two-sided penalty] to converge slightly faster and to better optima. Empirically this seems not to constrain the critic too much…", our experiments point towards another conclusion.) Note that enforcing the gradient to be smaller than one in norm has the advantage that we penalize when the partial derivative has norm larger than one in the direction of steepest descent. Hence, all partial derivatives are implicitly enforced to be bounded in norm by one, too. At the same time, enforcing $\|\nabla f\| \le 1$ for smooth approximating functions is not an unreasonable constraint even at points of non-differentiability. For these reasons we suggest to add the regularization term

$$\left( \max\{ 0, \|\nabla f(\hat{x})\| - 1 \} \right)^2 \tag{9}$$

to the cost function. Different ways of sampling the point $\hat{x}$ will be analyzed in Section 8.

Moreover, we remark that, in contrast to many other settings in machine learning, we do not introduce a bias into the problem that leads to oversimplified solutions when the regularization weight is chosen too high. Instead, we enforce a condition that should be satisfied by the optimal solution. We might therefore call the additional term a constraining term instead, for which we can set the corresponding parameter $\lambda$ very high. The performance in learning is less sensitive to the value of $\lambda$ than it is to standard regularization parameters in other learning methods.

### 7.2 Our proposal for WGANs

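Our proposed method (WGAN-LP, where LP stands for Lipschitz penalty) alternates between updating the discriminator to minimize

$$\mathbb{E}_{y \sim \nu}[f(y)] - \mathbb{E}_{x \sim \mu}[f(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim \tau}\left[ \left( \max\{ 0, \|\nabla f(\hat{x})\| - 1 \} \right)^2 \right], \tag{10}$$

and updating the generator network modeling $\nu$ to minimize $-\mathbb{E}_{y \sim \nu}[f(y)]$ using gradient descent.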
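The two losses of the alternating scheme can be sketched as follows. The example assumes that the critic values and the gradient norms at the penalty points have already been computed, and it uses the standard WGAN generator loss $-\mathbb{E}_{y \sim \nu}[f(y)]$; the function names and the numbers are illustrative.

```python
import numpy as np

def lp_penalty(grad_norms):
    """One-sided penalty: only gradient norms above 1 are penalized."""
    return np.mean(np.maximum(0.0, grad_norms - 1.0) ** 2)

def critic_loss_lp(f_real, f_fake, grad_norms, lam):
    """E_nu[f(y)] - E_mu[f(x)] + lam * E[(max{0, ||grad f|| - 1})^2]."""
    return np.mean(f_fake) - np.mean(f_real) + lam * lp_penalty(grad_norms)

def generator_loss(f_fake):
    """The generator minimizes -E_nu[f(y)], i.e. it raises the critic's values."""
    return -np.mean(f_fake)

f_real = np.array([1.0, 2.0])            # critic values on real samples
f_fake = np.array([-1.0, 0.0])           # critic values on generated samples
grad_norms = np.array([0.5, 1.0, 2.0])   # ||grad f(x_hat)|| at three penalty points

loss_c = critic_loss_lp(f_real, f_fake, grad_norms, lam=10.0)
loss_g = generator_loss(f_fake)
```

Of the three penalty points, only the one with gradient norm 2 contributes; the point with norm 0.5 is left unpenalized, which is the difference to the two-sided penalty.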

### 7.3 A more general view

The Kantorovich duality theorem holds in a quite general setting. For example, a different metric $d$ can be substituted for the Euclidean distance $\|\cdot\|_2$. Taking $d = \|\cdot\|_2^p$ for a different natural number $p$, for example, leads to the minimization of the Wasserstein distance of order $p$ (i.e., the Wasserstein-$p$ distance). Choosing $p = 2$ has the theoretical advantage that, if the distributions $\mu$ and $\nu$ have finite moments of order 2 and are absolutely continuous with respect to the Lebesgue measure, then there is a unique deterministic optimal coupling. Based on the dual problem to the computation of the Wasserstein distance of order $p$ (as given by the Kantorovich duality theorem) we still need to maximize Equation (5), with the only difference that 1-Lipschitz continuity is now measured with respect to $\|\cdot\|_2^p$. For our training method this entails that the only modification to make is to change the metric in the regularization term. Specifically, optimizing with respect to the Wasserstein-$p$ distance corresponds to adding the following constraining term to the critic loss:

$$\left( \max\left\{ 0, \frac{|f(x) - f(y)|}{\|x - y\|_2^p} - 1 \right\} \right)^2. \tag{11}$$
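The effect of the exponent can be seen on a small example. The sketch below evaluates the constraining term of Eq. (11) for $p = 1$ and $p = 2$ on one far and one near pair of points; the function name, the toy critic, and the points are assumptions made for illustration.

```python
import numpy as np

def lipschitz_penalty_p(f, x, y, p):
    """Mean of (max{0, |f(x) - f(y)| / ||x - y||_2^p - 1})^2 over sample pairs."""
    dist_p = np.linalg.norm(x - y, axis=1) ** p
    ratio = np.abs(f(x) - f(y)) / dist_p
    return np.mean(np.maximum(0.0, ratio - 1.0) ** 2)

x = np.array([[0.0], [0.0]])
y = np.array([[2.0], [0.5]])         # one far pair, one near pair
f = lambda pts: 1.5 * pts[:, 0]      # toy critic with slope 1.5

penalty_p1 = lipschitz_penalty_p(f, x, y, p=1)
penalty_p2 = lipschitz_penalty_p(f, x, y, p=2)
```

For $p = 1$ both pairs violate the constraint equally. For $p = 2$ the far pair is no longer penalized, while the near pair is penalized more strongly.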
###### Remark 2.

Recently, in Bellemare et al. (2017), the Wasserstein distance was replaced by the energy distance (Székely & Rizzo, 2013). The energy distance is a generalization of the Cramer distance to two distributions over $\mathbb{R}^n$, and the authors therefore call their variant Cramer GAN. The Cramer distance has the same convergence properties as the Wasserstein distance, but additionally satisfies that the expected gradient of the sample loss equals the gradient of the true loss (i.e., the gradient approximation does not have a bias). For the training of Cramer GANs, the authors apply the GP-penalty term proposed by Gulrajani et al. (2017). We would expect that using the LP-penalty term instead is also beneficial for Cramer GANs.

## 8 Experiments

We perform several experiments on three toy data sets, 8Gaussians, 25Gaussians, and Swiss Roll (which were also used in the analysis of Gulrajani et al. (2017)), to compare the effect of different regularization terms. More specifically, we compare the performance of WGAN-GP and WGAN-LP as described in Equations (6) and (10), respectively, where the penalty was applied to points randomly sampled on the line between the training sample $x$ and the generated sample $y$.

Furthermore, we analyze the effect of the GP- and the LP-penalty using different sampling procedures. In particular, we compare variants, which generate the samples used for the regularization term by adding random noise either onto training points or onto both training and generated samples. Note, that applying the GP-penalty on samples generated by adding noise to the training examples was also suggested by Kodali et al. (2017), but using a different (asymmetric) noise distribution and in combination with the vanilla GAN objective, given in (2). We will refer to this as “local perturbation” in the following.
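The two local-perturbation variants can be sketched as follows. The text does not fix the noise distribution here, so isotropic Gaussian noise with a scale `sigma` is assumed purely for illustration, and the sample arrays are random stand-ins.

```python
import numpy as np

def local_perturbation(points, sigma, rng):
    """Penalty points obtained by adding isotropic Gaussian noise to samples.

    The Gaussian noise and the scale sigma are assumptions of this sketch.
    """
    return points + sigma * rng.normal(size=points.shape)

rng = np.random.default_rng(0)
train = rng.normal(size=(4, 2))            # stand-in for training samples
fake = rng.normal(size=(4, 2)) + 2.0       # stand-in for generated samples

# Variant 1: perturb training points only.
x_hat_train = local_perturbation(train, sigma=0.1, rng=rng)
# Variant 2: perturb both training and generated samples.
x_hat_both = local_perturbation(np.concatenate([train, fake]), sigma=0.1, rng=rng)
```

In contrast to sampling on lines between arbitrary pairs, both variants keep the penalty points close to the samples they were derived from.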

Both the generator network and the critic network are simple feed-forward NNs with three hidden Leaky ReLU layers, each containing 512 neurons, and one linear output layer. The dimensionality of the latent variables of the generator network was set to two. During training, 10 critic updates are performed for every generator update, except for the first 25 generator updates, where the critic is updated 100 times for each generator update in order to get closer to the optimal critic in the beginning of training. Both networks were trained using RMSprop (Tieleman & Hinton, 2012) with learning rate and a batch size of 256.
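The update schedule described above can be written as a small helper. This is a sketch with hypothetical names; it only encodes the numbers stated in the text (100 critic updates for each of the first 25 generator updates, 10 afterwards).

```python
def critic_updates(generator_step, warmup_steps=25, warmup_updates=100, regular_updates=10):
    """Number of critic updates before the given (0-based) generator update."""
    return warmup_updates if generator_step < warmup_steps else regular_updates

schedule = [critic_updates(step) for step in range(30)]
total_critic_updates = sum(schedule)
```

A training loop would call `critic_updates(step)` at the start of every generator iteration and run that many critic steps before the generator step.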

### 8.1 Level sets of the critic

A qualitative way to evaluate the learned critic function for a two-dimensional data set is by displaying its level sets, as it was done by Gulrajani et al. (2017) and Kodali et al. (2017). The level sets after 10, 50, 100 and 1000 training iterations of a WGAN trained with the GP and LP penalty on the Swiss Roll data set are shown in Figure 6. Similar experimental results for the 8Gaussians and 25Gaussian data sets can be found in Appendix B.1.

It becomes clear that with the penalty weight set to the hyperparameter value suggested by Gulrajani et al. (2017), the WGAN-GP learns neither a good critic function nor a good model of the data generating distribution. With a smaller regularization parameter, learning is stabilized. With the LP-penalty, however, a good critic is learned within only a few iterations even with a high penalty weight, and the level sets show a higher regularity. Training a WGAN-LP with a lower penalty weight led to equivalent observations (results not shown). We also experimented with much higher penalty weights, which led to almost the same results. This emphasizes that LP-penalty based training is less sensitive to the choice of the penalty weight.

Given the location of the red points, which correspond to the samples drawn for the regularization term, it also becomes clear that, when sampling from the line between arbitrary training points and generated points, many samples lie far from the data manifold (which would rarely be the case if the pairs of points were drawn from the optimal coupling).

### 8.2 Evolution of the critic loss

To obtain a fair comparison of methods that apply different regularization terms, we display the values of the critic’s loss function without the regularization term throughout training. Results for WGAN-GPs and WGAN-LPs are shown in Figure 7.

The optimization of the critic with the GP-penalty and a high penalty weight is very unstable: the loss oscillates heavily around 0. When we use the LP-penalty instead, the critic’s loss smoothly approaches zero, which is what we expect when the generative distribution steadily converges to the empirical distribution. Also note that we would expect the negative of the critic’s loss to be slightly positive, as a good critic function assigns higher values to real data points and lower values to generated points. This is exactly what we observe when using the LP-penalty, but not when using the GP-penalty in combination with a high regularization parameter. Interestingly, when using the LP-penalty in combination with a very high penalty weight, we obtain the same results, indicating that the constraint is already fulfilled at the lower penalty weight. Using a lower penalty weight in combination with the GP-penalty, on the other hand, stabilizes training but still results in fluctuations at the beginning of training (results shown in Appendix B.2).
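The quantity plotted here can be sketched as follows. The sign convention is an assumption inferred from the text above (the loss is the mean critic value on generated points minus the mean critic value on real points, so that its negative is positive for a good critic); the function name and the example scores are made up for illustration.

```python
def critic_loss(real_scores, generated_scores):
    """Critic loss without the regularization term, under the assumed sign
    convention: mean score on generated points minus mean score on real points."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_generated = sum(generated_scores) / len(generated_scores)
    return mean_generated - mean_real


# Made-up critic outputs: real points receive slightly higher scores.
loss = critic_loss(real_scores=[1.0, 2.0], generated_scores=[0.5, 1.5])
negative_loss = -loss
```

For these example scores the loss is -0.5, so its negative is positive, which is the behavior described for a critic that separates real from generated points.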

We get qualitatively similar results when using a local perturbation to generate the samples for the regularization term, as can be seen in Figure 8. Interestingly, WGAN-GP training is stabilized later on if one only adds noise to training examples and not to generated examples. This indicates that enforcing the GP-penalty close to the data manifold is less harmful. However, the critic’s loss still fluctuates much more than when training a WGAN-LP.

### 8.3 Estimating the Earth-Mover distance

In order to estimate how the actual Earth-Mover distance (EMD) between the real and generated distribution evolves during training, we compute the cost of the minimum assignment based on Euclidean distance between sets of samples from the real and generated distributions, using the Kuhn-Munkres algorithm (Kuhn, 1955). We use a sample set size of 500 to maintain reasonable computation time and estimate the EMD every 10th iteration over the course of 500 iterations. All experiments were repeated 10 times for different settings. From the results for WGAN-GP and WGAN-LP shown in Figure 9, we conclude that the proposed LP-penalty leads to a smaller estimated EMD and fewer fluctuations during training. Results for both penalties when using local sampling for the regularization term can be found in Appendix B.3. With a different value of the regularization parameter, WGAN-GP training is stabilized as well (see Appendix B.3), indicating that the effect of using a GP-penalty is highly dependent on the right choice of the penalty weight.
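The assignment-based estimate can be illustrated on a tiny example. The sketch below does not implement the Kuhn-Munkres algorithm: it finds the same minimum by brute-force search over all permutations, which is only feasible for a handful of points (for sets of size 500 a polynomial-time solver is needed). Dividing the assignment cost by the number of samples is an assumption; it yields the EMD between the two empirical distributions, whereas the text above only speaks of the cost of the minimum assignment. The function name and sample points are hypothetical.

```python
import itertools
import math


def assignment_emd(real, generated):
    """Minimum-cost one-to-one assignment between two equally sized point sets
    under Euclidean distance, found by exhaustive search, divided by the set size."""
    n = len(real)
    if len(generated) != n:
        raise ValueError("both sample sets must have the same size")
    best_cost = min(
        sum(math.dist(real[i], generated[perm[i]]) for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best_cost / n


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# The generated set is the real set translated by (3, 4), so every point has to
# travel a distance of at least 5 on average.
generated = [(3.0, 4.0), (4.0, 4.0), (3.0, 5.0)]
emd = assignment_emd(real, generated)
```

In this example the identity assignment moves each point by exactly 5, which matches the lower bound given by the distance between the two sample means, so the estimate is 5.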

### 8.4 Optimizing the Wasserstein-2 distance

We also trained a WGAN with the objective of minimizing the Wasserstein-2 distance, that is, with the penalty given in (11). Results for the evolution of the critic’s loss and the approximated EM distance during training on the Swiss Roll data set are shown in Figure 10. Both the critic loss and the estimated EM distance decrease smoothly, which makes the Wasserstein-2 distance (in combination with its theoretical properties) an interesting candidate for further investigation.

## 9 Conclusion

For stable training of Wasserstein GANs, we propose to use the LP-penalty term given in Equation (10) to enforce the Lipschitz constraint that appears in the objective function. We presented theoretical and empirical evidence on toy data that this gradient penalty performs better than the previously considered approaches of clipping weights and of applying the stronger GP-penalty given in Equation (6). In addition to more stable learning behavior, the proposed regularization term leads to lower sensitivity to the value of the penalty weight (demonstrating smooth convergence and well-behaved critic scores throughout the whole training process for different values of the penalty weight). We expect that the proposed penalty will also lead to improvements when modeling real-world data, which we will investigate in future work.

## Appendix A Proofs

### A.1 Proof of Theorem 1

###### Proof.

We explain how our version of the theorem follows from Theorem 5.10 of Villani (2008).

With , our assumptions imply (with ) that all conclusions of Theorem 5.10 hold. Moreover, 5.4 of Villani (2008) shows that in this case (in the notation of Villani (2008)) and -convexity is the same as 1-Lipschitz continuity. This leads to our formulation in (i) and the existence of an optimal coupling and an optimal critic function by part (iii).

If we let

$$\Gamma_f = \{(x,y) \in \mathbb{R}^n \times \mathbb{R}^n \mid f(x) - f(y) = \|x - y\|_2\}$$

then it follows from the proof of Theorem 5.10 that the set in part 5.10 (iii) is given by , where being optimal means that it leads to a maximum on the RHS of equation (1).

To prove our part from 5.10, let be optimal then, by 5.10 (iii), . Hence, in particular, for all optimal . This shows that (a) implies (b). For the other direction, we use that if for all optimal , then , which by Theorem 5.10 (iii) is equivalent to being optimal. ∎

### A.2 Proof of Proposition 1

###### Proof.

We need to determine every function generated by the neural net, such that we can find points with . Recall that is defined as the minimal satisfying for all functions generated by the neural net and all points .

Every function generated by the neural net is a composition of functions

$$f = f_n \circ \mathrm{relu} \circ f_{n-1} \circ \dots \circ \mathrm{relu} \circ f_1$$

with linear functions and relu denoting a layer of activation functions with rectifier linear units. Since each linear function is Lipschitz continuous with Lipschitz constant and relu is Lipschitz continuous with , it follows that is Lipschitz continuous with . Moreover, equality holds if there is a pair of points such that the consecutive images witness the maximal Lipschitz constant and for each of the individual functions making up the composition of . More formally, equality holds if and only if there is a tuple of pairs of points , , such that for all ,

• and

• All entries of and are larger or equal to zero. This is equivalent to the condition that

$$\left|\mathrm{relu}(f_i(x^{(i)})) - \mathrm{relu}(f_i(y^{(i)}))\right| = \alpha(\mathrm{relu})\,\|f_i(x^{(i)}) - f_i(y^{(i)})\|_2.$$

It follows that to determine we need to maximize for the linear layers with weight constraint and find a sequence of points that satisfy (i)-(iv). The existence of the sequence of points shows that and maximizing each then shows that

$$\alpha(f^*) = \prod_{i=1}^{n} \alpha(f_i) = \bar{\alpha}.$$

Since, as we will show, the conditions in (a) and (b) maximize the Lipschitz constraint of each layer individually, the existence of suitable proves the if-direction of the proposition.

For the only-if direction, we will see that the ability to find the sequence of points gives restrictions on how to maximize of an individual layer, leading to the more restrictive condition of (b) for all but the first layer (cf. (a)).

So let us first maximize the Lipschitz constraint of each linear layer and then make sure that we can find the corresponding points. We write the linear layer as a matrix multiplication . Using linearity,

$$\alpha(f_i) = \max_{\|z\|_2 = 1} \|A^{(i)} z\|_2,$$

and our goal can be reformulated to finding the matrix maximizing .

For any fixed , is maximized exactly when each vector entry is maximized in absolute value. Now, with and denoting the sign function,

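$$\begin{aligned}
|(A^{(i)} z)_j| = \Bigl|\sum_k a^{(i)}_{j,k} z_k\Bigr| &\le \sum_k |a^{(i)}_{j,k}|\,|z_k| \le \sum_k c_{\max} |z_k| \quad \text{(by the weight constraint)} \\
&= \sum_k (c_{\max} \cdot \mathrm{sgn}(z_k)) \cdot z_k
\end{aligned}$$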

and equality holds if and only if or consists of columns of constant entry with the value in column . It follows that

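$$\begin{aligned}
\alpha(f_i) = \max_{\|z\|_2 = 1} \|A^{(i)} z\|_2 &= \max_{\|z\|_2 = 1} c_{\max} \cdot \|z\|_1 \\
&\le \max_{\|z\|_2 = 1} c_{\max} \sqrt{\dim(z)} \cdot \|z\|_2 \quad \text{(by the Cauchy-Schwarz inequality)} \\
&= c_{\max} \sqrt{\dim(z)}
\end{aligned}$$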

with equality if and only if for all .
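The Cauchy-Schwarz step can be checked numerically. The sketch below compares the 1-norm of a vector with the square root of its dimension times its 2-norm; the helper names are hypothetical. It relies on the standard equality case of the Cauchy-Schwarz inequality, namely that the bound is attained exactly when all entries have the same absolute value.

```python
import math


def l1_norm(z):
    return sum(abs(v) for v in z)


def l2_norm(z):
    return math.sqrt(sum(v * v for v in z))


def cauchy_schwarz_gap(z):
    """sqrt(dim(z)) * ||z||_2 - ||z||_1, which is non-negative for every z."""
    return math.sqrt(len(z)) * l2_norm(z) - l1_norm(z)


# All entries share the same absolute value: the bound is attained.
gap_equal_magnitude = cauchy_schwarz_gap([0.5, -0.5, 0.5, -0.5])
# A unit vector concentrated on a single coordinate: the bound is strict.
gap_single_coordinate = cauchy_schwarz_gap([1.0, 0.0, 0.0, 0.0])
```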

Hence, for the first linear layer we need to choose a matrix satisfying (a) of the statement of the proposition.

Now, we find a pair with

$$x^{(1)} - y^{(1)} = a \cdot (\pm 1, \pm 1, \dots, \pm 1) \quad \text{for some } a \neq 0$$

such that

$$\mathrm{sgn}(x^{(1)}_k) = \mathrm{sgn}(y^{(1)}_k) = \mathrm{sgn}(x^{(1)}_k - y^{(1)}_k) = \text{the sign of column } k \text{ of } A^{(1)}.$$

This is the only possibility to ensure (iii) and (iv) of the conditions above. Note that also (i) holds for , and (ii) (together with (iv)) determines uniquely from as

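$$x^{(2)} = A^{(1)} x^{(1)} = c_{\max} \cdot \|x^{(1)}\|_1 \cdot (1, 1, \dots, 1) \quad \text{and} \quad y^{(2)} = A^{(1)} y^{(1)} = c_{\max} \cdot \|y^{(1)}\|_1 \cdot (1, 1, \dots, 1).$$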

We may assume that . (Otherwise, switch the roles of and . In the case of equality, we need to choose a different pair for not to violate (i) for .) Then we have that ,

$$+1 = \mathrm{sgn}(x^{(2)}_k) = \mathrm{sgn}(y^{(2)}_k) = \mathrm{sgn}(x^{(2)}_k - y^{(2)}_k) \quad \text{for all } k.$$

Using the same arguments as above, it follows that for such , to maximize the Lipschitz constant of (and to guarantee that the maximum is reached at ), we need to have equal to a matrix with at each position.

Now (i)-(iv) also hold for the second layer and one may now proceed by induction to show that for , contains only for each of its entries. This is the only way to maximize the Lipschitz constraint for functions generated by the neural net, and it does indeed hold with and

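$$(x^{(i)}, y^{(i)}) = \bigl(f_i \circ \mathrm{relu} \circ f_{i-1} \circ \dots \circ \mathrm{relu} \circ f_1(x^*),\; f_i \circ \mathrm{relu} \circ f_{i-1} \circ \dots \circ \mathrm{relu} \circ f_1(y^*)\bigr).$$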

### A.3 Proof of Theorem 2

To prove Theorem 2, we first prove that is the optimal critic function for certain distributions with non-overlapping support, and then reduce the case of Gaussians to the simplified setting.

###### Proposition 3.

Let and be two continuous functions on the real line that satisfy the following conditions.

• and are symmetric with respect to the y-axis.

• and for all .

• If denotes the open support of a continuous function , then .

• has connected support (this implies that is centered around because of the symmetry).

• .

Then the maximum of over is maximized for .

###### Proof.

We multiply both and by a constant number such that

$$\int_{\mathbb{R}} c \cdot f(x)\,dx = \int_{\mathbb{R}} c \cdot g(x)\,dx = 1.$$

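Then $c \cdot f$ and $c \cdot g$ define probability density functions. A function $\phi$ maximizes $\int_{\mathbb{R}} \phi(x)\,(c \cdot f(x) - c \cdot g(x))\,dx$ if and only if it maximizes $\int_{\mathbb{R}} \phi(x)\,(f(x) - g(x))\,dx$. We therefore may assume from now on that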

$$\int_{\mathbb{R}} f(x)\,dx = \int_{\mathbb{R}} g(x)\,dx = 1.$$

Now it suffices to find a coupling of the probability distributions defined by and , which itself is defined by a probability density function such that for we get

$$\int_{\mathbb{R} \times \mathbb{R}} |x - y| \cdot \pi(x,y)\,dx\,dy = \int_{\mathbb{R}} \phi(x)\,(f(x) - g(x))\,dx.$$

This holds by the Kantorovich duality theorem 1, because the right hand side is always smaller or equal to the left hand side for arbitrary coupling and function and is consequently maximized when equality holds. By the assumption of symmetry, we may write where the support and for all . The area under equals half the area under , or put differently,

$$\int_{\mathbb{R}} g_1(x)\,dx = \int_{\mathbb{R}} f(x)\,\delta_{(-\infty,0]}(x)\,dx = \frac{1}{2}.$$

We now consider the probability density function given by

$$\pi_1(x,y) = 2 g_1(x) \cdot 2 f(y) \cdot \delta_{(-\infty,0]}(y),$$

which defines a coupling between the two distributions given by the probability density functions and . For later use we note that

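$$\int_{x \in \mathbb{R}} \pi_1(x,y)\,dx = 2 \cdot f(y) \cdot \delta_{(-\infty,0]}(y) \quad \text{and} \quad \int_{y \in \mathbb{R}} \pi_1(x,y)\,dy = 2 \cdot g_1(x).$$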

We define for and . Further, we let . Then defines a coupling between and as can be seen by computing

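$$\begin{aligned}
\int_{x \in \mathbb{R}} \pi(x,y)\,dx &= \frac{1}{2} \int_{x \in \mathbb{R}} \pi_1(x,y)\,dx + \frac{1}{2} \int_{x \in \mathbb{R}} \pi_2(x,y)\,dx \\
&= \frac{1}{2} \int_{x \in \mathbb{R}} \pi_1(x,y)\,dx + \frac{1}{2} \int_{x \in \mathbb{R}} \pi_1(-x,-y)\,\delta_{\{y \neq 0\}}(y)\,dx \\
&= f(y)\,\delta_{(-\infty,0]}(y) + f(y)\,\delta_{(0,\infty)}(y) = f(y)
\end{aligned}$$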

and

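$$\begin{aligned}
\int_{y \in \mathbb{R}} \pi(x,y)\,dy &= \frac{1}{2} \int_{y \in \mathbb{R}} \pi_1(x,y)\,dy + \frac{1}{2} \int_{y \in \mathbb{R}} \pi_2(x,y)\,dy \\
&= \frac{1}{2} \int_{y \in \mathbb{R}} \pi_1(x,y)\,dy + \frac{1}{2} \int_{y \in \mathbb{R}} \pi_1(-x,-y)\,\delta_{\{y \neq 0\}}(y)\,dy \\
&= g_1(x) + g_1(-x) = g_1(x) + g_2(x) = g(x)
\end{aligned}$$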

We have established the existence of some coupling between and and we will now compute its transport costs. We will subsequently show that this equals , hence both and are optimal by realizing the Kantorovich duality.

We wish to compute .

 ∫R×R|x−y|π(x,y)dxdysymmetry=∫R×R|x−y|π1(x,y)dxdy
 =∫R×R(y−x)π1