# Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models

We propose a new technique that boosts the convergence of training generative adversarial networks. Generally, the convergence rate of training deep models degrades severely after many iterations. A key reason for this phenomenon is that a deep network is expressed by a highly non-convex finite-dimensional model, and thus the parameter gets stuck in a local optimum. Because of this, methods often suffer not only from degeneration of the convergence speed but also from limitations in the representational power of the trained network. To overcome this issue, we propose an additional layer, called the gradient layer, that seeks a descent direction in an infinite-dimensional space. Because the layer is constructed in an infinite-dimensional space, we are not restricted by the structure of any specific finite-dimensional model. As a result, we can escape the local optima of finite-dimensional models and move toward the globally optimal function more directly. We explain this phenomenon from the functional gradient method perspective of the gradient layer. Interestingly, the optimization procedure using the gradient layer naturally constructs the deep structure of the network. Moreover, we demonstrate that this procedure can be regarded as a discretization method of the gradient flow that naturally reduces the objective function. Finally, the method is tested in several numerical experiments, which show its fast convergence.


### 1 Introduction

Generative adversarial networks (GANs) [7] are a promising scheme for learning generative models. GANs are trained with a discriminator and a generator in an adversarial way: discriminators are trained to classify between real samples and fake samples drawn from generators, whereas generators are trained to mimic real samples. Although training GANs is quite difficult, adversarial learning has succeeded in generating very impressive samples

[18], and there are many subsequent studies [13, 19, 16, 6, 25]. Wasserstein GANs (WGANs) [3] are a variant to remedy the mode collapse that appears in the standard GANs by using the Wasserstein distance [23], although they also sometimes generate low-quality samples or fail to converge. Moreover, an improved variant of WGANs was also proposed [8] and it succeeded in generating high-quality samples and stabilizing WGANs. Although these attempts have provided better results, there is still scope to improve the performance of GANs further.

One reason for this difficulty stems from the limitation of the representational power of the generator. If the discriminator is optimized for the generator, the behavior is solely determined by the samples produced from that generator. In other words, for a generator with a poor representational power, the discriminator terminates its learning in the early stage and consequently results in having low discriminative power. However, for a finite-dimensional parameterized generator, the ability to generate novel samples to cheat the discriminators is limited. In addition, the highly non-convex structure of the deep neural network for the generator prevents us from finding a direction for improvement. As a result, the trained parameter gets stuck in a local optimum and the training procedure does not proceed any more.

In this study, we propose a new learning procedure to overcome the issues of limited representational power and local optimum by introducing a new type of layer called a gradient layer. The gradient layer finds a direction for improvement in an infinite-dimensional space by computing the functional gradient [14] instead of the ordinary gradient induced by a finite-dimensional model. Because the functional gradient used for the gradient layer is not limited in the tangent space of a finite-dimensional model, it has much more freedom than the ordinary finite-dimensional one. Thanks to this property, our method can break the limit of the local optimum induced by the strong non-convexity of a finite-dimensional model, which gives much more representational power to the generator. We theoretically justify this phenomenon from the functional gradient method perspective and rigorously present a convergence analysis. Interestingly, one iteration of the method can be recognized as inserting one layer into the generator and the total number of iterations is the number of inserted layers. Therefore, our learning procedure naturally constructs the deep neural network architecture by inserting gradient layers. Although, gradient layers can be inserted into an arbitrary layer, they are typically stacked on top of the generator in the final training phase to improve the generated sample quality.

Moreover, we provide another interesting perspective on the gradient layer, i.e., discretization of the gradient flow in the space of probability measures. In Euclidean space, the steepest descent method, which is the typical optimization method, can be derived by discretizing the gradient flow that naturally produces a curve reducing the objective function. Because the goal of GANs is to generate a sequence of probability measures moving toward the empirical distribution of the training samples, it is natural to consider a gradient flow in the space of probability measures, defined through a distance between the generated distribution and the empirical distribution, and to discretize it in order to construct practical algorithms. We show that the functional gradient method for optimizing the generator in a function space is such a discretization method; in other words, the gradient flow can be tracked by stacking gradient layers successively.
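The analogy above (steepest descent as an Euler discretization of the gradient flow dx/dt = −∇F(x)) can be checked numerically. The following is a minimal sketch in which the objective F, the step size, and the horizon are our own illustrative choices:

```python
import numpy as np

def grad_F(x):
    # gradient of F(x) = 0.5 * ||x||^2; its gradient flow is dx/dt = -x
    return x

def euler_discretize(x0, eta, steps):
    # explicit Euler step x_{k+1} = x_k - eta * grad_F(x_k),
    # i.e. exactly the steepest-descent update
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * grad_F(x)
    return x

x0 = np.array([1.0, -2.0])
t = 1.0
exact = x0 * np.exp(-t)                      # exact flow solution x(t) = x0 * e^{-t}
approx = euler_discretize(x0, eta=t / 1000, steps=1000)
print(np.max(np.abs(approx - exact)))        # small discretization error
```

Shrinking the step size drives the Euler iterate toward the exact flow, which is the sense in which gradient descent "tracks" the flow.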

The recently proposed SteinGAN [24] is closely related to our work and has a similar flavor, but it is based on another strategy for tracking the gradient flow: in SteinGAN, the discretization is mimicked by a fixed-size deep neural network, so it may have the same limitation as typical GANs. By contrast, our method directly tracks the gradient flow in the final phase of training GANs to break the limit of the finite-dimensional generator.

### 2 Brief Review of Wasserstein GANs

In this section, we introduce WGANs and their variants. Although our proposed gradient layer is applicable to various models, we demonstrate how well it performs for the training of generative models; in particular, we treat Wasserstein GANs as the main application in this paper. Let us start by briefly reviewing WGANs.

WGAN is a powerful generative model based on the 1-Wasserstein distance, defined as the minimum cost of transporting one probability distribution to the other. Let 𝒳 and 𝒵 be a compact convex data space and a hidden space, respectively; a typical example of 𝒳 is an image space. For a noise distribution μ_n on 𝒵, WGAN learns a data generator g: 𝒵 → 𝒳 that minimizes an approximation to the 1-Wasserstein distance between the data distribution μ_D and the push-forward distribution g♯μ_n, i.e., the distribution that the random variable g(z) follows when z ∼ μ_n (in other words, the distribution obtained by applying the coordinate transform g to μ_n). That is, WGAN can be described as the following problem by using the Kantorovich-Rubinstein duality form of the 1-Wasserstein distance:

 min_{g∈𝒢} max_{f∈ℱ} L(f,g) := E_{x∼μ_D}[f(x)] − E_{z∼μ_n}[f∘g(z)],

where 𝒢 is the set of generators and ℱ is an approximating set to the set of 1-Lipschitz continuous functions, called critics. In WGANs, both g and f are parameterized neural networks, and the problem is solved by alternate optimization: maximizing with respect to f and minimizing with respect to g, alternately.

In practice, to impose the Lipschitz continuity on critics f, penalization techniques have been explored. For instance, the original WGANs [3] use weight clipping, which implies an upper bound on the norm of the weights and makes f Lipschitz continuous. However, it was pointed out in a subsequent study [8] that such a restriction seems unnatural and sometimes leads to a low-quality generator or a failure to converge. In the same study, an improved variant of WGANs called WGAN-GP was proposed, which succeeded in stabilizing the optimization process and generating high-quality samples. WGAN-GP [8] adds the gradient penalty E_{x̂}[(‖∇_{x̂}f(x̂)‖₂ − 1)²] to the objective function in the training phase of critics, where x̂ is a random interpolation between a training example x and a generated sample g(z), i.e., x̂ = εx + (1−ε)g(z) with ε ∼ U[0,1] (the uniform distribution). DRAGAN [12] is a similar method to WGAN-GP, although it is based on a different motivation. DRAGAN also uses the gradient penalty, but the penalty is imposed on a neighborhood of the data manifold by a random perturbation of a training example.

WGAN and its variants are learned by alternately optimizing f and g, as stated above. We can regard this learning procedure as a problem of minimizing L̄(g) := max_{f∈ℱ} {L(f,g) − λR(f)}, where R(f) is a penalty term. Let L(·,g) − λR(·) attain its maximum at f*_g for a given g. Then, the gradient of L̄ at g is the same as the gradient of L(f*_g, ·) at g by the envelope theorem [15] when both terms are well-defined. The differentiability of L̄ with respect to g almost everywhere is proved in [3] under a reasonable assumption. Hence, we can apply the gradient method to this problem by approximating this gradient with finite particles generated from μ_n. However, because it is difficult to obtain f*_g, we run the gradient method for several iterations on training a critic instead of exactly computing f*_g at each g. We can notice that this learning procedure is quite similar to that of the standard GAN [7].

### 3 Gradient Layer

In the usual training procedure of WGANs, generators are parameterized by a finite-dimensional model as described in the previous section (though more general maps are admissible for the original purpose), and the parameter may get stuck in a local optimum induced by this restriction, or the speed of convergence may be reduced. In this work, we propose a gradient layer that accelerates the convergence and breaks the limit of finite-dimensional models. This layer is derived theoretically from an infinite-dimensional optimization method. We first explain the high-level idea of the gradient layer, which strictly improves the ability of the generator, and why our method enhances the convergence of training WGANs.

#### 3.1 High-level idea of gradient layer

Here, we explain the gradient layer with an intuitive motivation; it is inserted into the generator in WGANs. We now focus on minimizing L(f,g) with respect to g under a fixed critic f. Let us split g = g₁∘g₂ into two neural networks at an arbitrary layer where a new layer is to be inserted. Our purpose is to specify the form of a layer that reduces the objective value by perturbing the inputs z′ = g₂(z) to the upper network g₁. Since the objective −E_{z∼μ_n}[f(g₁(g₂(z)))] is regarded as an integral with respect to the push-forward distribution g₂♯μ_n, this purpose is achieved by transporting the input distribution of g₁ along the gradient field ∇f∘g₁. Therefore, we propose a gradient layer G_η with one hyperparameter η (the layer's step size) as a map that transforms an input z′ to

 G_η(z′) = z′ + η∇_{z′}f(g₁(z′)). (1)

Because the gradient layer depends on the parameters of the upper layers g₁, we write G_η[g₁] when this dependence needs to be made explicit.

Applying the gradient layer recursively makes further progress and achieves a better objective value. The computation of the gradient layer is quite simple: taking the derivative of f∘g₁ is sufficient, which can be executed efficiently. Because too many gradient layers would lead to overfitting to the critic f, we stop stacking gradient layers after an appropriate number of steps. Indeed, if ∇f∘g₁ is Lipschitz continuous with constant L, then G_η is an injection for sufficiently small η: G_η(z′₁) = G_η(z′₂) implies ‖z′₁ − z′₂‖ = η‖∇f(g₁(z′₁)) − ∇f(g₁(z′₂))‖ ≤ ηL‖z′₁ − z′₂‖, which forces z′₁ = z′₂ whenever ηL < 1. Thus, the topology of the input distribution is preserved and early stopping is justified. This layer then efficiently generates high-quality samples for the critic, and the overall adversarial training procedure can also be boosted.
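As a toy illustration of update (1) (with our own choices of a quadratic critic and g₁ = id, not the paper's neural networks), stacking the gradient layer recursively drives the samples toward regions the critic scores highly:

```python
import numpy as np

c = np.array([1.0, 1.0])     # the toy critic's unique maximizer (our choice)

def g1(z):
    # upper part of the generator above the insertion point; identity here
    return z

def critic(x):
    # toy critic f(x) = -0.5 * ||x - c||^2, maximized at x = c with value 0
    return -0.5 * np.sum((x - c) ** 2)

def grad_critic_of_g1(z):
    # analytic gradient of z' -> f(g1(z')); with g1 = id it equals c - z'
    return c - z

def gradient_layer(z, eta):
    # update (1): G_eta(z') = z' + eta * grad_{z'} f(g1(z'))
    return z + eta * grad_critic_of_g1(z)

z = np.zeros(2)
for _ in range(5):           # stacking the layer recursively
    z = gradient_layer(z, eta=0.5)
print(critic(g1(z)))         # climbs toward the critic's maximum value 0
```

Each stacked layer is a contraction toward the critic's maximizer here; with a trained neural critic the same update is computed by automatic differentiation.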

#### 3.2 Powerful optimization ability

Because the gradient layer directly transports inputs as stated above, it strictly improves the objective value whenever there is room for optimization, unlike finite-dimensional models, which may be trapped in local optima induced by the restriction of the generator class. Indeed, when the gradient layer cannot move the inputs, i.e., ∇_{z′}f(g₁(z′)) = 0 on the support of the input distribution, the gradient with respect to the lower network g₂ vanishes by the chain rule, and there is no chance to improve the objective value by optimizing g₂. We now explain this phenomenon more precisely. Let us first consider the training of g₂ in the usual way, and denote by w₂ the parameter of g₂. As stated in the previous section, w₂ is updated by using the gradient

 E_{μ_n}[J_{w₂}g₂(z)^⊤ ∇_{z′}f(g₁(g₂(z)))], (2)

where z′ = g₂(z) is the input to g₁ and J_{w₂}g₂(z) is the Jacobian matrix of g₂ with respect to w₂. We immediately notice that this gradient (2) is the inner product of J_{w₂}g₂ and ∇f∘g₁∘g₂ in L²(μ_n)-space, and the latter term is exactly the perturbation computed by the gradient layer. Therefore, when the gradient layer cannot move the inputs, the gradient with respect to w₂ also vanishes. However, even if the gradient with respect to w₂ vanishes, the gradient layer can in general still move the inputs. Thus, whereas the optimization of g₂ may get stuck in a local optimum or be slowed down in this case, the gradient layer strictly improves the quality of the generated samples for the upper layers as long as ∇f∘g₁ does not vanish. This is the reason the gradient layer has a greater optimization ability than finite-dimensional models.
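The argument above can be made concrete with a deliberately degenerate toy construction of our own: a one-parameter generator confined to the x-axis, for which the parameter gradient (2) is exactly zero while the gradient layer still moves samples:

```python
import numpy as np

# Degenerate toy generator g(z; w) = [w, 0]: all outputs lie on the x-axis,
# so the Jacobian with respect to w is J = [1, 0]^T for every z.
def grad_f(x):
    # toy critic f(x) = x_2, whose gradient field pushes samples upward
    return np.array([0.0, 1.0])

def param_gradient(w, z_samples):
    # gradient (2): E_z[ J_w g(z)^T grad f(g(z)) ]
    J = np.array([1.0, 0.0])
    return np.mean([J @ grad_f(np.array([w, 0.0])) for _ in z_samples])

def gradient_layer_move(x, eta):
    # the gradient layer still perturbs each sample along grad f
    return x + eta * grad_f(x)

print(param_gradient(0.5, np.zeros(10)))                 # 0.0: parameter update is stuck
print(gradient_layer_move(np.array([0.5, 0.0]), 0.1))    # [0.5 0.1]: samples still move
```

The Jacobian and the critic gradient are orthogonal here, so no parameter step helps, yet the gradient layer transports every sample off the axis.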

#### 3.3 Algorithm description

The overall algorithm is described in this subsection. We adopt WGAN-GP as the base model to which the gradient layer is applied. Let us denote by R(f) a gradient penalty term. In the paper on improved WGANs [8], the use of the two-sided penalty E_{x̂}[(‖∇_{x̂}f(x̂)‖₂ − 1)²] is recommended; however, we also allow the use of the one-sided variant E_{x̂}[max(0, ‖∇_{x̂}f(x̂)‖₂ − 1)²]. As for the place at which the gradient layer is inserted, several possibilities exist, e.g., inserting the gradient layer at (i) the top or (ii) the bottom of the layers of the generator. The latter usage is described in the appendix.

In the inner loop, critics can be trained by standard stochastic gradient methods such as ADAM [11] and RMSPROP [22]. From the optimization perspective, we show that Algorithm 1 can be regarded as an approximation to the functional gradient method. From this perspective, we show fast convergence of the method under appropriate assumptions where the objective function is smooth and the critics are optimized in each loop; this theoretical justification is described later. Although Algorithm 1 has a great optimization ability, applying it to large models is difficult because it requires memory to store the history of critic parameters; thus, we propose its usage for fine-tuning in the final phase of training a WGAN-GP. After the execution of Algorithm 1, we can generate samples by using the history of critics, the learning rate, and the base distribution, as described in Algorithm 2.
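Algorithms 1 and 2 are not reproduced in this excerpt, so the following is only a rough sketch of the alternating outer structure under strong simplifying assumptions of ours (a one-parameter linear "critic" family and a 1-D particle cloud in place of neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_critic(real, fake):
    # inner loop stand-in: over the family f(x) = a*x with a in [-1, 1],
    # the maximizer of E[f(real)] - E[f(fake)] is the sign of the mean gap
    return float(np.sign(real.mean() - fake.mean()))

def stack_gradient_layer(fake, a, eta):
    # outer loop: G_eta(z) = z + eta * f'(z); for f(x) = a*x, f'(z) = a
    return fake + eta * a

def algorithm1(real, fake, eta, outer_steps):
    critics = []                                   # history kept for sampling later
    for _ in range(outer_steps):
        a = train_critic(real, fake)               # (approximately) fit the critic
        critics.append(a)
        fake = stack_gradient_layer(fake, a, eta)  # stack one gradient layer
    return fake, critics

real = rng.normal(loc=3.0, size=1000)              # "data" particles
fake = rng.normal(loc=0.0, size=1000)              # generated particles
fake, critics = algorithm1(real, fake, eta=0.1, outer_steps=40)
print(abs(real.mean() - fake.mean()))              # mean gap driven below eta
```

The stored `critics` list plays the role of the critic history that Algorithm 2 would replay to transform fresh noise into samples.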

### 4 Functional Gradient Method

In this section, we provide a mathematically rigorous derivation from the functional gradient method [14] perspective under a Fréchet differentiability (functional differentiability) assumption on the objective. That is, we consider an optimization problem with respect to a generator in an infinite-dimensional space. For simplicity, we focus on the case where the gradient layer is stacked on top of a generator g, and we treat the push-forward distribution μ_g = g♯μ_n as the base measure; thus, in the following we omit the fixed lower part g of the generator. Let L²(μ_g) be the space of square-integrable maps from the data space into itself with respect to μ_g, equipped with the L²-inner product: for ϕ₁, ϕ₂ ∈ L²(μ_g),

 ⟨ϕ₁,ϕ₂⟩_{L²(μ_g)} = E_{μ_g}[ϕ₁(z)^⊤ϕ₂(z)].

To learn WGAN-GP, we consider the infinite-dimensional problem:

 min_{ϕ∈L²(μ_g)} max_{f∈ℱ} L(f,ϕ) − λR(f),

where R(f) is a gradient penalty term. To analyze this problem, we take a Gâteaux derivative along a given map v ∈ L²(μ_g), i.e., a directional derivative along v. Let us denote by L(ϕ) the inner maximum value and by f*_ϕ a maximizer attaining it. If every f ∈ ℱ is Lipschitz continuous and differentiable, we find by the envelope theorem and Lebesgue's convergence theorem that this derivative takes the form:

 (d/dt) L(ϕ+tv)|_{t=0} = −E_{μ_g}[∇_x f*_ϕ(x)|_{x=ϕ(z)}^⊤ v(z)].

Therefore, −∇f*_ϕ∘ϕ can be regarded as the Fréchet derivative (functional gradient) in L²(μ_g), and we denote it by ∇_ϕL(ϕ); it performs like the usual gradient in Euclidean space. Using this notation, the optimization of ϕ can be accomplished by Algorithm 3, which is a gradient descent method in a function space. Because the functional gradient has the form ∇_ϕL(ϕ) = −∇f*_ϕ∘ϕ, each iteration of the functional gradient method with respect to ϕ is ϕ ← (id + η∇f*_ϕ)∘ϕ, where η is the learning rate. We notice here that this iteration is the composition of a perturbation map id + η∇f*_ϕ with the current map ϕ, and is nothing but stacking a gradient layer on ϕ. In other words, the functional gradient method with respect to ϕ, i.e., Algorithm 3, is a procedure for building a deep neural network by inserting gradient layers, where the total number of iterations is the number of layers. Moreover, if we view η∇f*_ϕ as a perturbation term, this layer resembles that of residual networks [9], one of the state-of-the-art architectures in supervised learning tasks.
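The stacking interpretation can be sketched compactly (a toy instantiation of ours with a fixed, analytically known critic gradient in place of the trained critic f*_ϕ): each functional-gradient step composes a residual-style map onto the current generator:

```python
import numpy as np

def make_layer(grad_f, eta):
    # one gradient layer: z -> z + eta * grad_f(z), a residual-style update
    return lambda z: z + eta * grad_f(z)

def functional_gradient_method(grad_f, eta, steps):
    phi = lambda z: z                                # phi_0 = id
    for _ in range(steps):
        layer = make_layer(grad_f, eta)
        prev = phi
        # stack the new layer on top: phi_{k+1} = (id + eta * grad_f) o phi_k
        phi = lambda z, l=layer, p=prev: l(p(z))
    return phi

# toy fixed critic f(x) = -0.5 * ||x - 1||^2, so grad f(x) = 1 - x
grad_f = lambda x: 1.0 - x
phi = functional_gradient_method(grad_f, eta=0.3, steps=10)
print(phi(np.array([0.0, 2.0])))   # pushed toward the critic's maximizer [1, 1]
```

The default-argument captures (`l=layer`, `p=prev`) pin each closure to its own layer, mirroring how each stacked gradient layer freezes the critic it was built from; in the full method the critic would be retrained before each new layer.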

However, executing Algorithm 3 is difficult in practice because the exact optimization with respect to a critic, required to compute f*_ϕ, is a hard problem. Thus, we need an approximation, and we argue that Algorithm 1 is such a method. This point can be understood as follows. Roughly speaking, Algorithm 1 maximizes L(f,ϕ) − λR(f) with respect to f in the inner loop under a fixed ϕ to obtain an approximate solution to f*_ϕ, and minimizes the objective with respect to ϕ in the outer loop by stacking a gradient layer, which approximates the map id + η∇f*_ϕ. Thus, Algorithm 1 is an approximate method, but we expect it to achieve fast convergence owing to the powerful optimization ability of the functional gradient method, as shown later. In particular, it is most effective to apply the algorithm in the final phase of training WGAN-GP to fine-tune it, because the optimization ability of parametric models is limited.

### 5 Convergence Analysis

Let us provide a convergence analysis of Algorithm 3 for a problem of the general form: min_{ϕ∈L²(μ)} L(ϕ). The convergence can be shown in a way analogous to the finite-dimensional case. To prove this, we make a smoothness assumption on the loss function. We now give a definition of smoothness on a Hilbert space whose counterpart in finite-dimensional spaces is often assumed for smooth non-convex optimization methods.

###### Definition 1.

Let h be a function on a Hilbert space 𝒵 and let Z ⊂ 𝒵. We say that h is L-smooth at z in Z if h is differentiable at z and, for all z′ ∈ Z,

 |h(z′) − h(z) − ⟨∇_z h(z), z′−z⟩_𝒵| ≤ (L/2)‖z′−z‖²_𝒵.

The following definition and proposition provide one condition leading to Lipschitz smoothness of L. Let us denote by ‖·‖_∞ the sup-norm and by B_r(ϕ) the ball of center ϕ and radius r in L²(μ_g). In the following, we assume the optimal critic f*_ϕ is uniquely defined for each ϕ and consider smoothness with respect to the input.

###### Definition 2.

For positive values L and r, we say that L is (L, r)-regular at ϕ when the following condition is satisfied: for every ϕ′ ∈ B_r(ϕ), the optimal critic f*_{ϕ′} is L-smooth with respect to its input on the data space.

###### Proposition 1.

If L is (L, r)-regular at ϕ, then L is L-smooth at ϕ in B_r(ϕ).

We now show the convergence of Algorithm 3. The following theorem gives the rate to converge to the stationary point.

###### Theorem 1.

Let us assume the norm of the functional gradient ‖∇_ϕL(ϕ)‖_{L²(μ_g)} is uniformly bounded by B, and assume L is L-smooth at each iterate ϕ_k in B_r(ϕ_k). Suppose we run Algorithm 3 with a sufficiently small constant learning rate η. Then, for T iterations we have

 min_{k∈{0,…,T−1}} ‖∇_ϕL(ϕ_k)‖²_{L²(μ_g)} ≤ (2/(ηT))(L(ϕ₀) − L*),

where L* = inf_ϕ L(ϕ).

Note that the convergence rate is the same as that of the gradient descent method for smooth objectives in the finite-dimensional setting. This means that even though the optimization is executed in an infinite-dimensional space, we do not suffer from the infinite dimensionality in terms of convergence.
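For reference, a rate of this form follows from the standard descent-lemma argument for L-smooth objectives, carried out verbatim in the Hilbert space L²(μ_g); a sketch (assuming the learning rate satisfies η ≤ 1/L, all norms in L²(μ_g)):

```latex
% One step of Algorithm 3: \phi_{k+1} = \phi_k - \eta \nabla_\phi \mathcal{L}(\phi_k).
% L-smoothness at \phi_k gives the descent lemma:
\mathcal{L}(\phi_{k+1})
  \;\le\; \mathcal{L}(\phi_k) - \eta\,\|\nabla_\phi\mathcal{L}(\phi_k)\|^2
          + \tfrac{L\eta^2}{2}\,\|\nabla_\phi\mathcal{L}(\phi_k)\|^2
  \;\le\; \mathcal{L}(\phi_k) - \tfrac{\eta}{2}\,\|\nabla_\phi\mathcal{L}(\phi_k)\|^2
  \quad (\eta \le 1/L).
% Summing over k = 0, \dots, T-1 and telescoping:
\tfrac{\eta}{2}\sum_{k=0}^{T-1}\|\nabla_\phi\mathcal{L}(\phi_k)\|^2
  \;\le\; \mathcal{L}(\phi_0) - \mathcal{L}^*,
% so the minimum is bounded by the average:
\min_{0\le k<T}\|\nabla_\phi\mathcal{L}(\phi_k)\|^2
  \;\le\; \frac{2}{\eta T}\bigl(\mathcal{L}(\phi_0) - \mathcal{L}^*\bigr).
```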

The following rough argument indicates that Algorithm 3 matches the setting of learning WGANs. Let W₁ denote the 1-Wasserstein distance with respect to the Euclidean distance on a compact base space. The following proposition is immediately shown by combining existing results [1, 20].

###### Proposition 2.

Let μ, ν be Borel probability measures on a compact base space and assume μ is absolutely continuous with respect to the Lebesgue measure. Then, there exists an optimal transport map T with T♯μ = ν, and it follows that W₁(μ,ν) = E_{x∼μ}[‖T(x) − x‖₂].

The notion of the optimal transport map is briefly introduced in the Appendix. By this proposition, there exists a curve strictly reducing the W₁ distance to the data distribution whenever the generated distribution differs from it. Because max_f L(f,·) approximates W₁, it is expected that, when the generated distribution differs from the data distribution, the functional gradient does not vanish and the objective may be strictly reduced by Algorithm 3.

### 6 Gradient Flow

In Euclidean space, a step of the steepest descent method for minimizing F can be derived by discretizing the gradient flow dx/dt = −∇F(x_t), where F is an objective function on Euclidean space. Because our goal is to move the generated distribution closer to the data distribution, we should consider a gradient flow in the space of probability measures. To make this argument rigorous, we need the continuity equation that characterizes a curve of probability measures, and the tangent space in which the velocities of curves are contained (cf. [2]). Once these notions are provided, the gradient flow is defined immediately, and it is quite natural to discretize this flow in order to track it well. In this section, we show that Algorithm 3 is such a natural discretization; in other words, building a deep neural network by stacking gradient layers is a discretization procedure of the gradient flow. We refer to [2] for detailed descriptions of this subject, and also to [17] for the original method developed by Otto.

#### 6.1 Continuity Equation and Discretization

We denote by 𝒫 the set of probability measures on ℝ^v. For t in an interval I, let ϕ_t be a curve of maps that solves the following ordinary differential equation for an integrable time-dependent vector field v_t on ℝ^v:

 ϕ₀ = id,  (d/dt)ϕ_t(x) = v_t(ϕ_t(x)) for ∀x ∈ ℝ^v.

Then, this equation induces the curve ν_t = (ϕ_t)♯ν₀ in 𝒫, which can be characterized by the continuity equation:

 ddtνt+∇⋅(vtνt)=0. (3)

In other words, the following equation is satisfied:

 ∫_I ∫_{ℝ^v} (∂_t f(x,t) + ∇_x f(x,t)^⊤ v_t(x)) dν_t(x) dt = 0,

for every f ∈ C_c^∞(ℝ^v × I), where C_c^∞ is the set of C^∞-functions with compact support in ℝ^v × I. Conversely, a narrowly continuous family of probability measures (ν_t) solving equation (3) can be obtained by a transport map ϕ_t satisfying ν_t = (ϕ_t)♯ν₀ [2]. Thus, equation (3) indicates that v_t drifts the probability measures ν_t. Indeed, v_t can be recognized as the tangent vector of the curve (ν_t), as discussed below.

Here, we focus on curves in the subset 𝒫₂ ⊂ 𝒫 composed of probability measures with finite second moments. Noting that there is freedom in the choice of v_t modulo divergence-free vector fields (i.e., fields w with ∇·(wν_t) = 0), it is natural to consider the equivalence class of v_t modulo divergence-free vector fields. Moreover, there exists a unique element attaining the minimum L²(ν_t)-norm in this class; we denote by Π the map sending v to this element. Thus, we introduce the definition of the tangent space at μ ∈ 𝒫₂ as follows:

 T_μ𝒫₂ := {Π(v) | v ∈ L²(μ)}. (4)

The following proposition shows that T_μ𝒫₂ has the property of a tangent space on the space of probability measures; that is, a perturbation using v ∈ T_μ𝒫₂ can discretize an absolutely continuous curve and locally approximates optimal transport maps. We denote the 2-Wasserstein distance by W₂.

###### Proposition 3 ([2]).

Let (ν_t)_{t∈I} be an absolutely continuous curve satisfying the continuity equation with a Borel vector field v_t that is contained in T_{ν_t}𝒫₂ for almost every t ∈ I. Then, for almost every t ∈ I, the following property holds:

 lim_{δ→0} W₂(ν_{t+δ}, (id+δv_t)♯ν_t)/|δ| = 0.

In particular, for almost every t ∈ I such that ν_t is absolutely continuous with respect to the Lebesgue measure, we have

 lim_{δ→0} (1/δ)(t_{ν_t}^{ν_{t+δ}} − id) = v_t  in L²(ν_t),

where t_{ν_t}^{ν_{t+δ}} is the unique optimal transport map between ν_t and ν_{t+δ}.

This proposition suggests the update (id + δv_t)♯ν_t for discretizing an absolutely continuous curve in 𝒫₂. Note that when ν_t = ϕ♯ν₀, the corresponding map to (id + δv_t)♯ν_t is obtained by composing the perturbation with ϕ; writing v for the scaled perturbation, the update is:

 ϕ₊ ← (id + v)∘ϕ = ϕ + v∘ϕ. (5)

So far, we have introduced the properties of continuous curves in 𝒫₂ and a method of discretizing them. We notice that the above update resembles the update of Algorithm 3. Indeed, we show that the functional gradient method is nothing but a discretization method of the gradient flow derived by the functional gradient ∇_ϕL.

#### 6.2 Discretization of Gradient Flow

We here introduce the gradient flow, which is one of the most straightforward ways to understand Algorithm 3. We have explained that an absolutely continuous curve in 𝒫₂ is well characterized by the continuity equation (3), and that v_t in (3) corresponds to the notion of the velocity field induced by the curve. Such a velocity points in the direction of the particle flow. Moreover, the negative functional gradient −∇_ϕL points in the direction reducing the objective at each particle. Thus, these two vector fields live in the same space, and it is natural to consider the following equation:

 vt=−∇ϕL(ϕt). (6)

This equation for an absolutely continuous curve is called the gradient flow [2], and a curve satisfying it naturally reduces the objective L. Indeed, by the chain rule, we can find that such a curve also satisfies the following:

 (d/dt)L(ϕ_t) = −‖∇_ϕL(ϕ_t)‖²_{L²(ν_t)}.

Recalling that the curve can be discretized well by the update (5), we notice that Algorithm 3 is a discretization method of the gradient flow (6). In other words, building deep neural networks by stacking gradient layers is such a discretization procedure.

### 7 Experiments

In this section, we empirically demonstrate the powerful optimization ability of the gradient layer on training WGANs. Our implementation uses Theano [21]. We first used three toy datasets (the Swiss roll and two Gaussian-mixture datasets; see Figure 2) to confirm the convergence behavior of the gradient layer. We next used CIFAR-10, containing 50,000 images of size 32×32, and STL-10, containing 100,000 images; for the STL-10 dataset, we downsample each dimension by 2, so the resulting image size is 48×48. We report inception scores [19] for the image datasets, a conventional metric commonly used to measure the quality of generated samples.

##### Toy datasets

We ran Algorithm 1 without pre-training the generators on the toy datasets, starting from Gaussian noise distributions. We used four-layer neural networks for the critics, with the dimension of the hidden layers chosen per dataset, and adopted the one-sided gradient penalty. The critics were trained with ADAM [11]. When running Algorithm 1, gradient layers are stacked directly on the noise distribution. Figure 2 shows the results on the toy datasets after running the generator updates. Although we ran the algorithm without pre-training the generators, we obtained good results after only a few iterations. This is notable because these toy datasets are difficult to learn: standard GANs and WGANs often fail to converge on them, and although improved variants of these models overcome this difficulty, they usually require more than 1,000 iterations to converge.

##### CIFAR-10 and STL-10

We first trained WGAN-GP with the two-sided penalty on the CIFAR-10 and STL-10 datasets. We used the DCGAN architecture for both the critic and the generator; batch normalization [10] was used only for the generator. The critic and the generator were trained using ADAM [11]. The left side of Figure 3 shows the inception scores obtained by WGAN-GP; the learning procedure slows down in the final training phase, especially for CIFAR-10. We next ran Algorithm 1 starting from the result of WGAN-GP, training the critics by ADAM with the same parameters except for the learning rates, which were set per dataset. The right side of Figure 3 shows the inception scores obtained by Algorithm 1. Note that, since we focus on the optimization ability of generators, we plot the results with the number of outer iterations on the horizontal axis. We observed a rapid increase in the inception scores on both CIFAR-10 and STL-10.

### 8 Conclusion

We have proposed a gradient layer that enhances the convergence speed of adversarial training. Because this layer is based on the perspective of infinite-dimensional optimization, it can avoid local optima induced by non-convexity and parameterization. We have also provided two perspectives of the gradient layer: (i) the functional gradient method and (ii) the discretization procedure of the gradient flow. We have proven the fast convergence of the gradient layer by utilizing this perspective, and experimental results have empirically shown its reliable performance.

### Acknowledgements

This work was partially supported by MEXT KAKENHI (25730013, 25120012, 26280009, 15H01678 and 15H05707), JST-PRESTO (JPMJPR14E4), and JST-CREST (JPMJCR14D7, JPMJCR1304).

### References

• [1] Luigi Ambrosio. Lecture notes on optimal transport problems. In Mathematical aspects of evolving interfaces, pages 1–52. Springer, 2003.
• [2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2008.
• [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
• [4] Yann Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad. Sci. Paris Sér. I Math, 305(19):805–808, 1987.
• [5] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
• [6] Xi Chen, Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29, pages 2172–2180. 2016.
• [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27. 2014.
• [8] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems 30, pages 5769–5779, 2017.
• [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
• [10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning 32, pages 448–456, 2015.
• [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
• [12] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.
• [13] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Proceedings of International Conference on Machine Learning 33, pages 1558–1566, 2016.
• [14] David G Luenberger. Optimization by vector space methods. John Wiley & Sons, 1969.
• [15] Paul Milgrom and Ilya Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
• [16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, pages 271–279. 2016.
• [17] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
• [18] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of International Conference on Learning Representations 4, 2016.
• [19] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pages 2234–2242, 2016.
• [20] Vladimir N Sudakov. Geometric problems in the theory of infinite-dimensional probability distributions. Number 141. American Mathematical Soc., 1979.
• [21] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
• [22] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
• [23] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
• [24] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
• [25] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5907–5915, 2017.

### A The Other Usage

We introduce a usage that inserts a fixed number of gradient layers at the bottom of the generator to assist the overall training procedure, as described in Algorithm 4. Note that Algorithm 4 always uses the latest parameters of the critic for the gradient layers. When the gradient layers are instead inserted in the middle of the generator, Algorithm 4 still applies under the corresponding decomposition of the generator. After training, we can generate samples using the parameters of the critic and the generator, the learning rate, and the number of gradient layers, as described in Algorithm 5.

In summary, Algorithm 4 inserts a fixed number of gradient layers, built from the latest parameters of the critic, at the bottom of the generator of WGAN-GP. That is, the gradient layers modify the noise distribution by the functional gradient method to improve the quality of the generator.
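The sampling procedure with stacked gradient layers can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: `gradient_layer` and `push_forward` are illustrative names, and a toy quadratic critic stands in for the trained critic, whose gradient the real algorithm would use.

```python
import numpy as np

# Sketch of the gradient-layer update: each layer maps z -> z + eta * grad_f(z),
# where grad_f is the input gradient of the latest critic f.  Stacking T such
# layers discretizes the gradient flow that pushes the noise distribution
# towards the data distribution.

def gradient_layer(z, grad_f, eta):
    """Apply one gradient layer: z -> z + eta * grad_f(z)."""
    return z + eta * grad_f(z)

def push_forward(z, grad_f, eta, num_layers):
    """Stack num_layers gradient layers before the rest of the generator."""
    for _ in range(num_layers):
        z = gradient_layer(z, grad_f, eta)
    return z

# Toy stand-in critic: f(z) = -0.5 * ||z - c||^2, so grad_f(z) = c - z,
# whose ascent direction moves samples toward the point c.
c = np.array([1.0, -2.0])
grad_f = lambda z: c - z

rng = np.random.default_rng(0)
z0 = rng.standard_normal((1000, 2))      # noise samples
zT = push_forward(z0, grad_f, eta=0.5, num_layers=10)
print(np.abs(zT.mean(axis=0) - c).max())  # the sample mean moves close to c
```

In a real WGAN-GP setting, `grad_f` would be computed by backpropagating the critic with respect to its input, using the critic's latest parameters as in Algorithm 4.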

### B Brief Review of Wasserstein Distance

We introduce some facts concerning the Wasserstein distance, which are used in the proof of Proposition 2. We first describe a primal form of the Wasserstein distance. For $p \geq 1$, let $\mathcal{P}_p(X)$ be the set of Borel probability measures with finite $p$-th moment on $X$. A probability measure $\gamma$ on $X \times X$ satisfying $\pi^1_\sharp \gamma = \mu$ and $\pi^2_\sharp \gamma = \nu$ is called a plan (coupling), where $\pi^i$ denotes the projection from $X \times X$ to the $i$-th space $X$. We denote by $\Gamma(\mu, \nu)$ the set of all plans between $\mu$ and $\nu$. We now introduce Kantorovich's formulation of the $p$-Wasserstein distance for $\mu, \nu \in \mathcal{P}_p(X)$:

$$ W_p^p(\mu, \nu) = \min_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times X} \|x - y\|_2^p \, d\gamma(x, y). \tag{7} $$

When $\mu$ and $\nu$ have bounded supports, there is the Kantorovich–Rubinstein dual formulation of the $1$-Wasserstein distance, which coincides with the definition introduced in the paper. The existence of optimal plans is guaranteed for more general integrands (cf. [23, 2]), and we denote by $\Gamma_o(\mu, \nu)$ the set of optimal plans. Prior to this formulation, the optimal transport problem in Monge's formulation was proposed:

$$ \inf_{\phi_\sharp \mu = \nu} \int_X \|x - \phi(x)\|_2^p \, d\mu(x), \tag{8} $$

where the infimum is taken over all transport maps $\phi$ from $\mu$ to $\nu$, i.e., maps satisfying $\phi_\sharp \mu = \nu$. Because a transport map $\phi$ induces the plan $(\mathrm{id} \times \phi)_\sharp \mu$, we easily find that (7) $\leq$ (8). In general, an optimal transport map that solves problem (8) does not always exist, unlike the Kantorovich problem (7). However, in the case where $X = \mathbb{R}^d$, $p = 2$, and $\mu$ is absolutely continuous with respect to the Lebesgue measure, the existence of optimal transport maps is guaranteed [4, 5], and this has been extended to more general integrands (see [2]). Moreover, this optimal transport map also solves the Kantorovich problem (7), i.e., the two problems coincide. On the other hand, in the case $p = 1$, the existence of optimal transport maps is much more difficult, but it has been shown in limited settings, as follows.

###### Proposition A ([1]).

Let $X$ be a compact convex subset of $\mathbb{R}^d$ and assume that $\mu$ is absolutely continuous with respect to the Lebesgue measure. Then, there exists an optimal transport map $\phi$ from $\mu$ to $\nu$ for the problem (8) with $p = 1$. Moreover, if $\nu$ is also absolutely continuous with respect to the Lebesgue measure, we can choose $\phi$ so that $\phi^{-1}$ is well defined $\nu$-a.e. and $\phi^{-1}_\sharp \nu = \mu$.

Under the same assumptions as in Proposition A, it is known that the two problems (8) and (7) coincide [1]; that is, the Kantorovich problem (7) is solved by an optimal transport map.
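The relation between the Kantorovich problem (7) and the Monge problem (8) can be checked on a toy discrete example. The sketch below is ours, not from the paper: for two empirical measures with $n$ equal-weight atoms, the Kantorovich optimum is attained at a permutation coupling (an extreme point of the coupling polytope), and in one dimension the monotone (sorted) rearrangement defines a Monge map whose cost matches it.

```python
import numpy as np
from itertools import permutations

def kantorovich_discrete(x, y, p=2):
    """Exact Kantorovich cost (7) for equal-weight atoms, by brute force.

    Extreme points of the coupling polytope are permutation matrices,
    so searching all permutations yields the exact optimum (small n only).
    """
    n = len(x)
    cost = np.abs(x[:, None] - y[None, :]) ** p
    return min(sum(cost[i, pi[i]] for i in range(n))
               for pi in permutations(range(n))) / n

def monge_sorted(x, y, p=2):
    """Cost of the 1d monotone rearrangement (a Monge map for problem (8))."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)
y = rng.standard_normal(6) + 2.0
print(abs(kantorovich_discrete(x, y) - monge_sorted(x, y)) < 1e-12)  # True
```

This mirrors the statement above: the transport cost of the monotone Monge map equals the optimal Kantorovich cost, so the two formulations coincide in this setting.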

### C Proofs

We here give the proof of Proposition 1.

###### Proof of Proposition 1.

Note that $L(\psi) = \hat{L}(f^*_\psi, \psi)$. For $\phi$ and $\psi$, we divide $L(\psi)$ into two terms as follows:

$$ L(\psi) = \bigl( \hat{L}(f^*_\psi, \psi) - \hat{L}(f^*_\phi, \psi) \bigr) + \hat{L}(f^*_\phi, \psi). \tag{9} $$

We first bound the first term in (9) by the $L$-smoothness of $\hat{L}(f^*_{\psi'}, \psi)$ with respect to $\psi'$ at $\psi$ in $L^2(\mu_g)$:

$$ \Bigl| \hat{L}(f^*_\phi, \psi) - \Bigl( \hat{L}(f^*_\psi, \psi) + \bigl\langle \nabla_{\psi'} \hat{L}(f^*_{\psi'}, \psi) \big|_{\psi'=\psi},\, \phi - \psi \bigr\rangle_{L^2(\mu_g)} \Bigr) \Bigr| \leq \frac{L}{2} \|\phi - \psi\|_{L^2(\mu_g)}^2. $$

Since $f^*_\psi$ attains the maximum, we have $\nabla_{\psi'} \hat{L}(f^*_{\psi'}, \psi) \big|_{\psi'=\psi} = 0$, and hence

$$ \bigl| \hat{L}(f^*_\phi, \psi) - \hat{L}(f^*_\psi, \psi) \bigr| \leq \frac{L}{2} \|\phi - \psi\|_{L^2(\mu_g)}^2. \tag{10} $$

We next bound the second term $\hat{L}(f^*_\phi, \psi)$ in (9). Recall that

$$ \hat{L}(f^*_\phi, \psi) = \mathbb{E}_{x \sim \mu_D}[f^*_\phi(x)] - \mathbb{E}_{x \sim \mu_g}[f^*_\phi \circ \psi(x)] - \lambda R_{f^*_\phi}. \tag{11} $$

By the $L$-smoothness of $f^*_\phi$, it follows that

$$ \Bigl| f^*_\phi(\psi(x)) - \Bigl( f^*_\phi(\phi(x)) + \bigl\langle \nabla_z f^*_\phi(z) \big|_{z=\phi(x)},\, \psi(x) - \phi(x) \bigr\rangle_2 \Bigr) \Bigr| \leq \frac{L}{2} \|\psi(x) - \phi(x)\|_2^2. $$

Taking the expectation with respect to $x \sim \mu_g$, we get

$$ \Bigl| -\mathbb{E}_{x \sim \mu_g}[f^*_\phi \circ \psi(x)] + \mathbb{E}_{\mu_g}[f^*_\phi(\phi(x))] + \bigl\langle \nabla_z f^*_\phi \circ \phi,\, \psi - \phi \bigr\rangle_{L^2(\mu_g)} \Bigr| \leq \frac{L}{2} \|\psi - \phi\|_{L^2(\mu_g)}^2. $$

Substituting this inequality into (11), we have

$$ \begin{aligned} \hat{L}(f^*_\phi, \psi) &\leq \mathbb{E}_{x \sim \mu_D}[f^*_\phi(x)] + \frac{L}{2} \|\psi - \phi\|_{L^2(\mu_g)}^2 - \Bigl( \mathbb{E}_{\mu_g}[f^*_\phi(\phi(x))] + \bigl\langle \nabla_z f^*_\phi \circ \phi,\, \psi - \phi \bigr\rangle_{L^2(\mu_g)} \Bigr) - \lambda R_{f^*_\phi} \\ &= \hat{L}(f^*_\phi, \phi) - \bigl\langle \nabla_z f^*_\phi \circ \phi,\, \psi - \phi \bigr\rangle_{L^2(\mu_g)} + \frac{L}{2} \|\psi - \phi\|_{L^2(\mu_g)}^2 \\ &= L(\phi) + \bigl\langle \nabla_\phi L(\phi),\, \psi - \phi \bigr\rangle_{L^2(\mu_g)} + \frac{L}{2} \|\psi - \phi\|_{L^2(\mu_g)}^2, \end{aligned} \tag{12} $$

and the opposite inequality

$$ \hat{L}(f^*_\phi, \psi) \geq L(\phi) + \bigl\langle \nabla_\phi L(\phi),\, \psi - \phi \bigr\rangle_{L^2(\mu_g)} $$