1 Introduction
In recent years, Generative Adversarial Networks (GANs) Goodfellow et al. (2014) have become the state of the art in several generative modeling tasks, ranging from image generation Karras et al. (2018) to imitation learning Ho and Ermon (2016). They are based on the idea of a two-player game, in which a discriminator tries to distinguish between real and generated data samples, while a generator tries to fool the discriminator, learning to produce realistic samples in the long run. Wasserstein GAN (WGAN) was proposed as a solution to the issues present in the original GAN formulation. WGAN trains a critic to approximate the Wasserstein distance between the real and generated distributions. This introduced a new challenge, as WGAN requires the function space of the critic to consist only of 1-Lipschitz functions.
To enforce the Lipschitz constraint on the WGAN critic, Arjovsky et al. (2017) originally used weight clipping, which was soon replaced by the much more effective method of Gradient Penalty (GP) Gulrajani et al. (2017), which consists of penalizing the deviation of the critic's gradient norm from 1 at certain input points. Since then, several variants of gradient norm penalization have been introduced Petzka et al. (2018); Wei et al. (2018); Adler and Lunz (2018); Zhou et al. (2019b). As an alternative, a weight normalization technique called Spectral Normalization (SN) Miyato et al. (2018) is a very efficient and simple method for enforcing a Lipschitz constraint on a per-layer basis, which is applicable to neural networks consisting of affine layers and K-Lipschitz activation functions.
Virtual Adversarial Training (VAT) Miyato et al. (2017) is a well-known semi-supervised learning method for regularizing neural networks. It is applied to improve the network's robustness against local perturbations of the input. Using an iterative method based on power iteration, it approximates the adversarial direction corresponding to certain input points. Perturbing an input towards its adversarial direction changes the network's output the most.
We propose a method called Virtual Adversarial Lipschitz Regularization (VALR) as a generalization of VAT, which enables the training of neural networks with regularization terms that penalize the violation of the Lipschitz constraint explicitly, instead of through the norm of the gradient. VALR can be used with all kinds of activation functions and neural network layers. It provides a means to generate, for each input point, a paired point at which the Lipschitz constraint is likely to be violated. In general, enforcing Lipschitz continuity of complex models can be useful for many applications. In this work, we focus on applying VALR to Wasserstein GANs, as regularizing or constraining Lipschitz continuity has proven to have a high impact on training stability and on reducing mode collapse.
Our contributions are as follows:

We derive VALR as a generalization of VAT.

We apply VALR to penalize the violation of the Lipschitz constraint directly, resulting in Virtual Adversarial Lipschitz Penalty (VALP).

Applying VALP on the critic in WGAN (WGAN-VALP), we show state-of-the-art performance in terms of Inception Score and Fréchet Inception Distance when trained on CIFAR-10.
2 Background
2.1 Wasserstein Generative Adversarial Networks
Generative adversarial networks (GANs) provide generative modeling by a generator network $g$ that transforms samples of a low-dimensional latent space $Z$ into samples from the data space $X$, transporting mass from a fixed noise distribution $P_Z$ to the generated distribution $P_g$. The generator is trained simultaneously with another network $f$ called the discriminator, which is trained to distinguish between fake samples drawn from $P_g$ and real samples drawn from the real distribution $P_r$, which is often represented by a fixed dataset. This network provides the learning signal to the generator, which is trained to generate samples that the discriminator considers real. This iterative process implements the minimax game
$$\min_g \max_f \; \mathbb{E}_{x \sim P_r} \log f(x) + \mathbb{E}_{z \sim P_Z} \log\left(1 - f(g(z))\right) \tag{1}$$
played by the networks $f$ and $g$. This training procedure minimizes the approximate Jensen-Shannon divergence (JSD) between $P_r$ and $P_g$ Goodfellow et al. (2014). However, during training these two distributions might differ strongly or even have non-overlapping supports, which might result in gradients received by the generator that are unstable or zero Arjovsky and Bottou (2017).
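As a minimal illustration of the objective in (1), the following sketch evaluates its empirical value from discriminator outputs on a batch of real and a batch of generated samples. The function name `gan_value` and the toy inputs are assumptions made for this example only.

```python
import math


def gan_value(d_real, d_fake):
    """Empirical value of the GAN minimax objective.

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell the two distributions apart outputs 0.5
# everywhere, which gives the value -2 * log(2).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to increase this value, while the generator tries to decrease the second term by making its samples look real.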
Wasserstein GAN (WGAN) Arjovsky et al. (2017) was proposed as a solution to this instability. Originating from Optimal Transport theory Villani (2008), the Wasserstein metric provides a distance between probability distributions with much better theoretical and practical properties than the JSD. It provides a smooth optimizable distance even if the two distributions have non-overlapping supports, which is not the case for JSD. It lifts a metric $d_X$ from the space $X$ of the supports of the probability distributions $P_1$ and $P_2$ to the space of the probability distributions itself. For these purposes, the Wasserstein-$p$ distance requires the probability distributions to be defined on a metric space and is defined as
$$W_p(P_1, P_2) = \left( \inf_{\pi \in \Pi(P_1, P_2)} \mathbb{E}_{(x, y) \sim \pi} \, d_X(x, y)^p \right)^{1/p}, \tag{2}$$
where $\Pi(P_1, P_2)$ is the set of distributions on the product space $X \times X$ whose marginals are $P_1$ and $P_2$, respectively. The optimal $\pi$ achieving the infimum in (2) is called the optimal coupling of $P_1$ and $P_2$, and is denoted by $\pi^*$. The case of $p = 1$ has an equivalent formulation
$$W_1(P_1, P_2) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_1} f(x) - \mathbb{E}_{x \sim P_2} f(x), \tag{3}$$
called the Kantorovich-Rubinstein formula Villani (2008), where $f: X \to \mathbb{R}$ is called the potential function, $\|f\|_L \le 1$ denotes the set of all functions that are 1-Lipschitz with respect to the ground metric $d_X$, and the Wasserstein-1 distance corresponds to the supremum over all 1-Lipschitz potential functions. The smallest Lipschitz constant for a real-valued function $f$ with the metric space $(X, d_X)$ as its domain is given by
$$\|f\|_L = \sup_{x, y \in X, \, x \ne y} \frac{|f(x) - f(y)|}{d_X(x, y)}. \tag{4}$$
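To make the equivalence between the coupling-based definition and the Kantorovich-Rubinstein formula concrete, here is a small sketch for the one-dimensional case. It assumes two equal-size empirical samples on the real line, where the optimal coupling simply pairs the sorted samples. The helper names `wasserstein1_1d` and `dual_value` are hypothetical.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    On the real line the optimal coupling pairs the sorted samples, so the
    distance is the mean absolute difference of the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def dual_value(potential, xs, ys):
    """Value of the Kantorovich-Rubinstein objective for one potential."""
    mean_x = sum(potential(x) for x in xs) / len(xs)
    mean_y = sum(potential(y) for y in ys) / len(ys)
    return mean_x - mean_y


xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]  # xs shifted by one unit
primal = wasserstein1_1d(xs, ys)
# The potential t -> -t is 1-Lipschitz and attains the supremum here.
dual = dual_value(lambda t: -t, xs, ys)
```

For this shifted pair both values equal 1, so the chosen 1-Lipschitz potential is optimal. Any other 1-Lipschitz potential can only give a smaller or equal dual value.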
Based on (3), the critic $f$ in WGAN Arjovsky et al. (2017) implements an approximation of the Wasserstein-1 distance between $P_g$ and $P_r$. The minimax game played by the critic $f$ and the generator $g$ becomes
$$\min_g \max_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{z \sim P_Z} f(g(z)), \tag{5}$$
a formulation that proved to be superior to the standard GAN in practice, with substantially more stable training behaviour and improved sample quality. The challenge became effectively restricting the smallest Lipschitz constant of the critic $f$, prompting the development of many Lipschitz regularization techniques for neural networks.
2.2 Lipschitz Function Approximation
A general definition of the smallest Lipschitz constant of a function $f: X \to Y$ is
$$\|f\|_L = \sup_{x, y \in X, \, x \ne y} \frac{d_Y(f(x), f(y))}{d_X(x, y)}, \tag{6}$$
where the metric spaces $(X, d_X)$ and $(Y, d_Y)$ are the domain and codomain of the function $f$, respectively. The function $f$ is called Lipschitz continuous if there exists a real constant $K \ge 0$ for which $d_Y(f(x), f(y)) \le K \, d_X(x, y)$ for any $x, y \in X$. Then, the function $f$ is also called K-Lipschitz. Theoretical properties of K-Lipschitz neural networks with low values of $K$ were explored in Oberman and Calder (2018), showing that training neural networks with Lipschitz constraints is good for generalization and convergence.
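The definition of the smallest Lipschitz constant suggests a simple empirical check: the largest quotient observed over a finite set of sample pairs is a lower bound on it. The sketch below is an illustration under that reading, with hypothetical names (`lipschitz_lower_bound`, `scale_by_three`) and the Euclidean distance assumed for both metrics by default.

```python
import math
from itertools import combinations


def l2(a, b):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def lipschitz_lower_bound(f, points, d_x=l2, d_y=l2):
    """Largest quotient d_y(f(x), f(y)) / d_x(x, y) over all sample pairs.

    Because only finitely many pairs are inspected, the result is a lower
    bound on the smallest Lipschitz constant, not the constant itself.
    """
    best = 0.0
    for x, y in combinations(points, 2):
        dist = d_x(x, y)
        if dist > 0.0:
            best = max(best, d_y(f(x), f(y)) / dist)
    return best


def scale_by_three(x):
    """A map that is exactly 3-Lipschitz with respect to the L2 distance."""
    return [3.0 * v for v in x]


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]
bound = lipschitz_lower_bound(scale_by_three, points)
```

For the scaling map every pair attains the quotient 3, so the bound is tight. For a general network, random pairs may miss the directions in which the quotient is largest.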
Learning mappings with Lipschitz constraints became prevalent in the field of deep learning with the introduction of WGAN Arjovsky et al. (2017). Enforcing the Lipschitz property on the critic was first done by clipping the weights of the network Arjovsky et al. (2017). This approach achieved superior results compared to the standard GAN formulation, but still sometimes yielded poor quality samples or even failed to converge. While clipping the weights enforces a global Lipschitz constant, it also reduces the function space, which might not include the optimal critic any more. This method was soon replaced by a softened one called Gradient Penalty (GP) Gulrajani et al. (2017). Motivated by the fact that the optimal critic should have unit gradient norm on lines connecting the coupled points $(x_1, x_2) \sim \pi^*$ according to (2), they proposed a regularizer that enforces unit gradient norm along these lines, which not only enforces the Lipschitz constraint, but other properties of the optimal solution as well. However, $\pi^*$ is not known in practice, which is why Gulrajani et al. (2017) proposed to apply GP on samples of the induced distribution $P_i$, by interpolating samples from the marginals $P_r$ and $P_g$. The critic in the WGAN-GP formulation is regularized with the loss
$$L_{GP} = \lambda \, \mathbb{E}_{x \sim P_i} \left( \|\nabla_x f(x)\|_2 - 1 \right)^2, \tag{7}$$
where $P_i$ denotes the distribution of samples obtained by interpolating pairs of samples drawn from $P_r$ and $P_g$, and $\lambda$ is a hyperparameter acting as a Lagrange multiplier.
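The gradient penalty can be sketched as follows. This is an illustration only: it assumes the critic's input gradient is available as a callable `grad_f` (in a real training loop it would come from automatic differentiation), and the default weight `lam=10.0` and the toy linear critic are assumptions of this example.

```python
import math
import random


def gradient_penalty(grad_f, real_batch, fake_batch, lam=10.0, rng=None):
    """Two-sided gradient penalty evaluated on random interpolates.

    grad_f: callable returning the critic's gradient at an input point.
    """
    rng = rng or random.Random()
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        # Sample a point on the line between a real and a generated sample.
        t = rng.random()
        x_hat = [t * a + (1.0 - t) * b for a, b in zip(x_real, x_fake)]
        grad_norm = math.sqrt(sum(g * g for g in grad_f(x_hat)))
        # Penalize any deviation of the gradient norm from 1.
        total += (grad_norm - 1.0) ** 2
    return lam * total / len(real_batch)


def linear_critic_grad(x):
    """Gradient of the toy critic f(x) = 3 * x[0] + 4 * x[1] (norm 5)."""
    return [3.0, 4.0]


real = [[0.0, 0.0], [1.0, 1.0]]
fake = [[2.0, 0.0], [0.0, 3.0]]
penalty = gradient_penalty(linear_critic_grad, real, fake)
```

For the linear toy critic the gradient norm is 5 everywhere, so the penalty is `lam * (5 - 1) ** 2` regardless of where the interpolates fall.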
Theoretical arguments against GP were pointed out in Petzka et al. (2018), arguing that unit gradient norm on samples of the distribution $P_i$ is not valid, as the pairs of samples being interpolated are generally not from the optimal coupling $\pi^*$, and thus do not necessarily need to match gradient norm 1. Furthermore, they point out that differentiability assumptions of the optimal critic are not met. Therefore, the regularizing effect of GP might be too strong. As a solution, they suggested using a loss penalizing the violation of the Lipschitz constraint either explicitly with
$$L_{LP}^{\mathrm{explicit}} = \lambda \, \mathbb{E}_{x, y \sim P_\tau} \left( \frac{|f(x) - f(y)|}{\|x - y\|_2} - 1 \right)_+^2 \tag{8}$$
or implicitly with
$$L_{LP} = \lambda \, \mathbb{E}_{x \sim P_\tau} \left( \|\nabla_x f(x)\|_2 - 1 \right)_+^2, \tag{9}$$
where in both cases $(a)_+$ denotes $\max(0, a)$. The first method has only proved viable when used on toy datasets, and led to considerably worse results on relatively more complex datasets like CIFAR-10, which is why Petzka et al. (2018) used the second one, which they termed Lipschitz Penalty (LP). Compared to GP, this term only penalizes the gradient norm when it exceeds $1$. As the sampling distribution $P_\tau$, they evaluated both the interpolation method described above and random local perturbations of real and generated samples, but found no significant improvement of the latter over $P_i$. Wei et al. (2018) proposed dropout in the critic as a way of creating perturbed input pairs to evaluate the explicit Lipschitz penalty, which led to improvements, but still relied on using GP simultaneously. One of the strengths of the Wasserstein distance is that it can be defined with any metric $d_X$, a fact that Adler and Lunz (2018) built on by proposing Banach WGAN (BWGAN), which generalizes WGAN to separable Banach spaces. They resort to these spaces because in order to use GP, they need a tractable dual metric on the topological dual of $X$. This approach brought considerable improvements, and Adler and Lunz (2018) emphasized the fact that through explicit Lipschitz penalties one could extend WGANs to general metric spaces as well. We hypothesize that using the explicit Lipschitz penalty in itself is insufficient because if one takes pairs of samples randomly from $P_r$, $P_g$ or $P_i$ (or just one sample and generates a pair for it with random perturbation), the violation of the Lipschitz penalty evaluated at these sample pairs will be of high variance, hence a more sophisticated strategy for sampling pairs is required.
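The difference between the explicit and the implicit one-sided penalty, and the reason why the choice of sample pairs matters for the explicit one, can be illustrated with a toy critic. This is a sketch under assumptions: the names `explicit_penalty`, `implicit_penalty` and `steep` are hypothetical, the Euclidean distance is used in the input space, and the weighting factor is omitted.

```python
import math


def positive_part(a):
    """One-sided clipping: only constraint violations contribute."""
    return max(0.0, a)


def explicit_penalty(f, x, y):
    """Squared violation of the 1-Lipschitz constraint on one sample pair."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    quotient = abs(f(x) - f(y)) / dist
    return positive_part(quotient - 1.0) ** 2


def implicit_penalty(grad_f, x):
    """Squared excess of the gradient norm over 1 at one point."""
    grad_norm = math.sqrt(sum(g * g for g in grad_f(x)))
    return positive_part(grad_norm - 1.0) ** 2


def steep(x):
    """Toy critic that is 2-Lipschitz, varying only along the first axis."""
    return 2.0 * x[0]


origin = [0.0, 0.0]
along = explicit_penalty(steep, origin, [1.0, 0.0])    # pair along the steep axis
across = explicit_penalty(steep, origin, [0.0, 1.0])   # pair along the flat axis
implicit = implicit_penalty(lambda x: [2.0, 0.0], origin)
```

The explicit penalty is nonzero only for the pair aligned with the direction in which the critic violates the constraint, while the implicit penalty detects the violation from the gradient at a single point. A randomly chosen pair can therefore easily miss a violation, which is consistent with the high-variance argument above.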
Recently Zhou et al. (2019b) argued that both GP and LP introduce superfluous constraints, altering the optimal critic and hence damaging the gradient that the generator receives. Their contribution is twofold. They introduced maximum gradient penalty, penalizing only the maximum of the Lipschitz constraint violations instead of their mean, not to overconstrain the critic, and employed the augmented Lagrangian method, widely used for strict constraint satisfaction in constrained optimization problems, to enforce the Lipschitz constraint more strictly. They formulated the regularization term as
(10) 
where is a hyperparameter, and is the Lagrange multiplier, which is updated during training iteratively by the rule
(11) 
They found that this formulation makes tuning the hyperparameter easier and restricts the Lipschitz constant of the network more strictly, but found no significant increase in sample quality compared to GP and LP.
A second family of Lipschitz regularization methods is based on weight normalization, restricting the Lipschitz constant of a network globally instead of only at points of the input space. One such technique is called spectral normalization (SN) proposed in Miyato et al. (2018), which is a very efficient and simple method for enforcing a Lipschitz constraint with respect to the $L_2$ norm on a per-layer basis, applicable to neural networks consisting of affine layers and K-Lipschitz activation functions. Gouk et al. (2018) proposed a similar approach, which can be used to enforce a Lipschitz constraint with respect to the $L_1$ norm and $L_\infty$ norm in addition to the $L_2$ norm, while also being compatible with batch normalization and dropout.
Anil et al. (2018) argued that any Lipschitz-constrained neural network must preserve the norm of the gradient during backpropagation, and to this end proposed another weight normalization technique, showing that it compares favorably to SN, and an activation function based on sorting.
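The core of spectral normalization is an estimate of the largest singular value of a weight matrix, obtained by power iteration, followed by a division of the weights by that estimate. The sketch below shows this idea on plain Python lists. It is a simplified illustration and not the implementation of Miyato et al. (2018): the function names are hypothetical and the iteration is simply run for a fixed number of steps from a fixed starting vector.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize(vector):
    """Scale to unit L2 norm (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def spectral_norm(weight, iterations=50):
    """Estimate the largest singular value of `weight` by power iteration."""
    weight_t = transpose(weight)
    v = normalize([1.0] * len(weight[0]))
    for _ in range(iterations):
        u = normalize(matvec(weight, v))
        v = normalize(matvec(weight_t, u))
    return math.sqrt(sum(a * a for a in matvec(weight, v)))


def spectrally_normalize(weight, iterations=50):
    """Divide the weights by the estimated spectral norm."""
    sigma = spectral_norm(weight, iterations)
    return [[a / sigma for a in row] for row in weight]


weight = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(weight)
normalized = spectrally_normalize(weight)
```

After normalization the affine layer defined by the matrix is approximately 1-Lipschitz with respect to the L2 norm. The quality of this guarantee depends on how accurate the power iteration estimate is.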
2.3 Virtual Adversarial Training
VAT Miyato et al. (2017) is a semi-supervised learning method that is able to regularize networks to be robust to local adversarial perturbation. Virtual adversarial perturbation means perturbing input sample points in such a way that the change in the output of the network induced by the perturbation is maximal in terms of a distance between distributions. This defines a direction for each sample point called the virtual adversarial direction, in which the perturbation is performed. It is called virtual to make the distinction with the adversarial direction introduced in Goodfellow et al. (2015) clear, as VAT uses unlabeled data with virtual labels, assigned to the sample points by the network being trained. The regularization term of VAT is called Local Distributional Smoothness (LDS). It is defined as
$$L_{VAT} = \mathbb{E}_{x \sim P_D} \, D\left( p(y|x) \,\|\, p(y|x + r_{vadv}) \right), \tag{12}$$
where $p(y|x)$ is a conditional distribution implemented by a neural network, $D(p \,\|\, p')$ is a divergence between two distributions $p$ and $p'$ (for which KL divergence is a natural choice), and
$$r_{vadv} = \arg\max_{\|r\|_2 \le \epsilon} D\left( p(y|x) \,\|\, p(y|x + r) \right) \tag{13}$$
is the virtual adversarial direction, where $\epsilon$ is a hyperparameter. It is approximated by
$$r_{vadv} \approx \epsilon \frac{r_1}{\|r_1\|_2}, \tag{14}$$
where
$$r_1 = \nabla_r D\left( p(y|x) \,\|\, p(y|x + r) \right) \Big|_{r = \xi r_0} \tag{15}$$
represents one iteration of a more general iterative approximation scheme, $r_0$ is a randomly sampled unit vector and $\xi$ is another hyperparameter. In this work, we generalize the formulation of VAT to show that it is actually a special case of a more general Lipschitz regularization scheme, and show that in practice this renders viable the use of the explicit Lipschitz penalty previously explored in Petzka et al. (2018).
3 Virtual Adversarial Lipschitz Regularization
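VALR is the method of adding a regularization term to the training objective that penalizes the violation of the Lipschitz constraint evaluated at sample pairs obtained by virtual adversarial perturbation. We call this term Virtual Adversarial Lipschitz Penalty (VALP) and define it as
$$L_{VALP} = \mathbb{E}_{x \sim P_D} \left( \max\left( 0, \frac{d_Y\left(f(x), f(x + r_{adv})\right)}{d_X(x, x + r_{adv})} - K \right) \right)^2, \tag{16}$$
where
$$r_{adv} = \arg\max_{d_X(x, x + r) \le \epsilon} d_Y\left(f(x), f(x + r)\right) \tag{17}$$
is the virtual adversarial direction, $f$ is a neural network and $K$ is the Lipschitz constant that we would like to enforce. $d_X$ and $d_Y$ are metrics on the domain and codomain of $f$, respectively, and $\epsilon$ is drawn uniformly from the interval $[\epsilon_{min}, \epsilon_{max}]$, where $\epsilon_{min}$ and $\epsilon_{max}$ are hyperparameters.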
VALP can be seen as a generalization of VAT if we disregard the fact that is a metric and its analogy in VAT is a divergence. To recover VAT from the above formula, let , , , and . Substituting these into (16) and (17) results in (12) and (13), respectively.
To put it in words, VALP measures the deviation of $f$ from being K-Lipschitz evaluated at pairs of sample points where one is the virtual adversarial perturbation of the other. If added to the training objective, it makes the learned mapping approximately K-Lipschitz in an $\epsilon$ radius around the sample points it is applied at.
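The verbal description above can be turned into a short sketch of the penalty for a single sample and a given perturbation. It assumes that the penalty is the squared one-sided violation of the K-Lipschitz constraint, in the spirit of the explicit penalty (8) with general metrics; the function name `valp_term`, its parameters and the toy network are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def valp_term(f, x, r, lipschitz_k=1.0, d_x=l2, d_y=l2):
    """Squared violation of the K-Lipschitz constraint on the pair (x, x + r).

    f: network mapping a list to a list; r: nonzero perturbation of x,
    for example an approximated virtual adversarial perturbation.
    """
    x_perturbed = [a + b for a, b in zip(x, r)]
    quotient = d_y(f(x), f(x_perturbed)) / d_x(x, x_perturbed)
    return max(0.0, quotient - lipschitz_k) ** 2


def toy_network(x):
    """Toy map that is 3-Lipschitz, varying only along the first axis."""
    return [3.0 * x[0]]


x = [0.0, 0.0]
violated = valp_term(toy_network, x, [0.5, 0.0])           # steep direction
satisfied = valp_term(toy_network, x, [0.0, 0.5])          # flat direction
relaxed = valp_term(toy_network, x, [0.5, 0.0], lipschitz_k=3.0)
```

The penalty is nonzero only when the perturbation points in a direction where the constraint is actually violated, which is why the choice of the perturbation is central to the method.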
3.1 Approximation of $r_{adv}$
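Similarly to VAT, the virtual adversarial perturbation is approximated by
$$r_{adv} \approx \epsilon \frac{r_1}{\|r_1\|_X}, \tag{18}$$
where
$$r_1 = \nabla_r d_Y\left(f(x), f(x + r)\right) \Big|_{r = \xi r_0} \tag{19}$$
is the approximated virtual adversarial direction and $\|\cdot\|_X$ is the norm corresponding to the metric $d_X$. In the case where the chosen $d_X$ does not define a norm this way, one has to construct a different method for ensuring $d_X(x, x + r_{adv}) = \epsilon$.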
The derivation of this formula is essentially the same as the one described in Miyato et al. (2017), but is included here for completeness. We assume that and are both twice differentiable with respect to their arguments almost everywhere, the latter specifically at . Note that one can easily find a for which the last assumption does not hold, for example the distance. If is translation invariant, meaning that for each , then its subderivatives at will be independent of , hence the method described below will still work. Otherwise, one can resort to using a proxy metric in place of for the approximation, for example the distance.
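We denote $d_Y\left(f(x), f(x + r)\right)$ by $d(r)$ for simplicity. Because $d(r) \ge 0$ and $d(0) = 0$, it is easy to see that
$$\nabla_r d(r) \big|_{r = 0} = 0, \tag{20}$$
so that the second-order Taylor approximation of $d(r)$ is $d(r) \approx \frac{1}{2} r^T H r$, where $H = \nabla \nabla_r d(r) \big|_{r = 0}$ is the Hessian matrix. The eigenvector $u$ of $H$ corresponding to its eigenvalue with the greatest absolute value is the direction of greatest curvature, which is approximately the adversarial direction that we are looking for. The power iteration Householder (1964) defined by
$$r_{k+1} = \frac{H r_k}{\|H r_k\|_2}, \tag{21}$$
where $r_0$ is a randomly sampled unit vector, converges to $u$ if $u$ and $r_0$ are not perpendicular. We use the Euclidean norm $\|\cdot\|_2$, as this numerical algorithm is oblivious of the metric spaces $(X, d_X)$ and $(Y, d_Y)$. Calculating $H$ is computationally heavy, which is why $H r_k$ is approximated using the finite differences method as
$$H r_k \approx \frac{\nabla_r d(r) \big|_{r = \xi r_k} - \nabla_r d(r) \big|_{r = 0}}{\xi} = \frac{\nabla_r d(r) \big|_{r = \xi r_k}}{\xi}, \tag{22}$$
where the equality follows from (20). The hyperparameter $\xi \ne 0$ is introduced here. In summary, the virtual adversarial direction is approximated by the iterative scheme
$$r_{k+1} = \frac{\nabla_r d(r) \big|_{r = \xi r_k}}{\left\| \nabla_r d(r) \big|_{r = \xi r_k} \right\|_2}, \tag{23}$$
of which one iteration is found to be sufficient and necessary in practice, which is why we presented the formulas corresponding to the one iteration case above in (19).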
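The iterative scheme described in this section can be sketched for a scalar-valued critic as follows. Several assumptions are made for the sake of a self-contained example: the output metric is the absolute difference, the input metric is the Euclidean distance, the gradient is taken by central differences as a stand-in for backpropagation, and all names and default values (`adversarial_direction`, `xi=0.1`, `epsilon = 0.5`) are hypothetical.

```python
import math
import random


def numerical_gradient(func, point, step=1e-6):
    """Central-difference gradient, standing in for automatic differentiation."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += step
        backward[i] -= step
        grad.append((func(forward) - func(backward)) / (2.0 * step))
    return grad


def unit(vector):
    """Scale to unit L2 norm (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def adversarial_direction(f, x, xi=0.1, iterations=1, rng=None):
    """Approximate the direction in which f changes the most around x."""
    rng = rng or random.Random()
    base = f(x)

    def output_change(r):
        # Distance between the outputs at x and at the perturbed point x + r.
        return abs(f([a + b for a, b in zip(x, r)]) - base)

    # Start from a randomly sampled unit vector.
    direction = unit([rng.gauss(0.0, 1.0) for _ in x])
    for _ in range(iterations):
        scaled = [xi * a for a in direction]
        # One finite-difference power iteration step: gradient at xi * r_k.
        direction = unit(numerical_gradient(output_change, scaled))
    return direction


def critic(x):
    """Toy critic that changes fastest along [3.0, 0.1]."""
    return 3.0 * x[0] + 0.1 * x[1]


direction = adversarial_direction(critic, [0.5, -0.2], rng=random.Random(0))
epsilon = 0.5
perturbation = [epsilon * a for a in direction]
```

For the linear toy critic a single iteration already recovers the steepest direction up to sign. The resulting perturbation can then be used as the second point of the pair on which the Lipschitz penalty is evaluated.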
3.2 Comparison with other Lipschitz regularization techniques
In terms of applicability and usage of metrics, VALR is the most flexible Lipschitz regularization method to the best of our knowledge. Theoretically, it can be used with all kinds of metrics $d_X$ and $d_Y$, and any kind of model $f$ that is twice differentiable, but the approximation of $r_{adv}$ described above imposes a practical restriction. It searches for the adversarial perturbation around $x$ in the $L_2$ ball of radius $\xi$, which is why the topology induced by $d_X$ has to be similar enough to the one induced by the $L_2$ distance for VALR to be efficient. Additionally, the normalization in (18) might be difficult with certain metrics.
We experimented with WGAN variants where we regularize the critic to be 1Lipschitz with respect to different metrics, like the Sobolev norms introduced in (Adler and Lunz, 2018), and also learned metrics such as the distance on the activation values of the upper hidden layers of a pretrained Inception network Szegedy et al. (2015). The trainings did converge and the results were competitive, but still underperformed slightly compared to the setting. Also one can easily think of metrics for which the approximation scheme would most probably fail, such as certain string metrics like the Levenshtein distance Deza and Deza (2009). More work is needed to explore the family of metrics for which VALR is efficient in its current form, and also to improve the approximation of the virtual adversarial direction described above to be able to handle a wider range of metrics, or replace it entirely with a more flexible one.
In terms of efficiency, VALR compares favorably to the implicit methods penalizing the gradient norms, and to weight normalization techniques as well, as demonstrated in the experiments section. Adler and Lunz (2018) argued that penalizing the norm of the gradient as in (9) is more effective than penalizing the Lipschitz quotient directly as in (8), as the former regularizes in all spatial directions around $x$, unlike the latter which does so only in the direction $x - y$. We argue that this is the exact reason why the explicit method works better when the samples to evaluate (8) are chosen appropriately, as the regularization effect in all spatial directions can result in $f$ being over-regularized. Weight normalization techniques such as SN (Miyato et al., 2018) can be prone to over-regularization as well. Compared to the actual spectral norm of a weight matrix, its approximation can be either lower or higher, and dividing the weights by a value that is too high results in over-regularization, which can even exclude the optimal critic from the hypothesis space during training. An approximation that is too low could theoretically result in under-regularization, but there is no evidence that this is actually the case in practice. We argue that VALR with the explicit method (8) outperforms the implicit methods GP and LP, as well as weight normalization methods like SN, because it results in the softest form of regularization, producing a nonzero penalty only when, where and in which direction the Lipschitz constraint is actually violated.
Regarding performance, the approximation of $r_{adv}$ is as computationally demanding as the evaluation of the gradient norms, resulting in a similar running speed for VALR as for GP and LP, which also means that VALR cannot compete with SN on these grounds. However, one can trade some sample quality for speed by applying VALP only to a fraction of the minibatch the network is trained with, resulting in slightly worse Inception Score and FID in the case of WGAN with half batch regularization.
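Applying the penalty to only a fraction of the minibatch could be arranged as in the following sketch. It only shows the selection step, under the assumption that the subset is drawn uniformly at random without replacement; the function name `penalty_subset` and the toy batch are hypothetical.

```python
import random


def penalty_subset(batch, fraction=0.5, rng=None):
    """Randomly select the part of a minibatch that receives the penalty."""
    rng = rng or random.Random()
    count = max(1, int(len(batch) * fraction))
    return rng.sample(batch, count)


batch = [[float(i), float(-i)] for i in range(8)]
subset = penalty_subset(batch, fraction=0.5, rng=random.Random(0))
```

The critic loss would still be computed on the full batch, while the perturbation and the penalty would be evaluated only on `subset`, which roughly halves the cost of the regularization term.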
See Table 1 for a summary of the comparison detailed above.
Method |  | Usable metrics |  | Performance
GP | 2x differentiable | Sobolev norms | No | Slower
LP | 2x differentiable | Sobolev norms | Yes | Slower
SN |  | $L_2$ distance | No | Fast
VALR | 2x differentiable |  | Yes | 
3.3 WGAN-VALP
We specialize the VALP formula (16) with being the critic, , and , to arrive at a version of the explicit penalty described in Petzka et al. (2018), which uses virtual adversarial perturbations as a sampling strategy. It is formulated as
(24) 
where we denote the virtual adversarial perturbation as to emphasize that it’s a function of . We found it beneficial to use the augmented Lagrangian method described in Zhou et al. (2019b), albeit with lower values of , and penalizing the mean of the Lipschitz constraint violations instead of their maximum. To sum up, we define the training objective of the critic in WGANVALP as
(25) 
where is a combination of the real and generated distributions, meaning that a sample can come from both, and and , respectively, are the hyperparameter and the Lagrange multiplier of the augmented Lagrangian method. is updated iteratively according to the rule
(26) 
This formulation of WGAN results in a stable explicit Lipschitz penalty, overcoming the difficulties experienced when one tries to apply it to random sample pairs, and yields state-of-the-art sample quality when trained on CIFAR-10, as demonstrated below.
4 Computational Results
To evaluate the performance of WGAN-VALP, we used the residual architecture from Gulrajani et al. (2017), with the number of channels in the generator layers doubled from to for a slight increase in performance. The implementation was done in TensorFlow, and training was run on a single NVIDIA GTX 1080Ti GPU. Following Gulrajani et al. (2017), we used the Adam optimizer Kingma and Ba (2015) with parameters , and an initial learning rate of decaying linearly to 0 over iterations, training the critic for steps and the generator for per iteration with minibatches of size (doubled for the generator). We used an exponential moving average (EMA) Yazıcı et al. (2019) of the weights to evaluate performance. We used (25) as a loss function to optimize the critic. The hyperparameters of the approximation of $r_{adv}$ were set to , , with power iteration. The only difference compared to the values used in Miyato et al. (2017) is that they used a fixed $\epsilon$, but for our purposes it is important to apply the penalty at different scales randomly. was an obvious choice, and we found to be optimal. Both batches from $P_r$ and $P_g$ were used for regularization, but using half of them selected randomly decreased performance only marginally while considerably reducing the run time.
We monitored the Inception Score and FID during training using samples every iteration, and evaluated them at the end of training using samples. To measure the performance of WGAN-VALP, we ran the training setting described above 10 times, and calculated the mean, standard deviation and maximum of the final Inception Scores and FIDs, which we report for WGAN-VALP and other relevant GANs Gulrajani et al. (2017); Petzka et al. (2018); Zhou et al. (2019a); Wei et al. (2018); Miyato et al. (2018); Adler and Lunz (2018); Karras et al. (2018); Yazıcı et al. (2019) in Table 2. Competing variants reported either or both of the average and the best Inception Scores in the corresponding papers, which is why we chose to report both. When trained on CIFAR-10, our model is state of the art both in terms of Inception Score and FID. We note that this is achieved without using any progressive growing techniques, which could possibly be combined with VALP to reach an even higher performance. We show some generated samples in Figure 1.
Method  Inception Score (Average)  Inception Score (Best)  FID 
WGAN-GP  
WGAN-LP  
LGAN  
CT-GAN  
SN-GAN  
BWGAN  
Progressive GAN  
EMA WGAN-GP  
EMA Progressive GAN  
WGAN-VALP (ours) 
5 Conclusions
We have shown that VALR, derived as a generalization of VAT, is an efficient and powerful method for learning Lipschitz constrained mappings implemented by neural networks. It already results in state-of-the-art performance when applied to the training of WGANs, and it is a generally applicable regularization method for a potentially wide range of applications, providing more flexibility than other Lipschitz regularization methods. The growing interest in Lipschitz constrained deep learning suggests an increasing demand for such methods in the future.
References
 Adler and Lunz (2018) J. Adler and S. Lunz. Banach wasserstein gan. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6754–6763. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7909banachwassersteingan.pdf.
 Anil et al. (2018) C. Anil, J. Lucas, and R. Grosse. Sorting out lipschitz function approximation. CoRR, abs/1811.05381, 2018. URL http://arxiv.org/abs/1811.05381.
 Arjovsky and Bottou (2017) M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=Hk4_qw5xe.

 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arjovsky17a.html.
 Deza and Deza (2009) M. Deza and E. Deza. Encyclopedia of Distances. Springer Berlin Heidelberg, 2009. ISBN 9783642002342.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423generativeadversarialnets.pdf.
 Goodfellow et al. (2015) I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
 Gouk et al. (2018) H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity. CoRR, abs/1804.04368, 2018. URL http://arxiv.org/abs/1804.04368.
 Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 5769–5779, USA, 2017. Curran Associates Inc. ISBN 9781510860964. URL http://dl.acm.org/citation.cfm?id=3295222.3295327.
 Ho and Ermon (2016) J. Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, pages 4565–4573, 2016.
 Householder (1964) A. Householder. The Theory of Matrices in Numerical Analysis. A Blaisdell book in pure and applied sciences : introduction to higher mathematics. Blaisdell Publishing Company, 1964.
 Karras et al. (2018) T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
 Miyato et al. (2017) T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. CoRR, abs/1704.03976, 2017. URL http://arxiv.org/abs/1704.03976.
 Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=B1QRgziT.
 Oberman and Calder (2018) A. M. Oberman and J. Calder. Lipschitz regularized deep neural networks converge and generalize. CoRR, abs/1808.09540, 2018. URL http://arxiv.org/abs/1808.09540.
 Petzka et al. (2018) H. Petzka, A. Fischer, and D. Lukovnikov. On the regularization of wasserstein gans. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=B1hYRMbCW.
 Szegedy et al. (2015) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9. IEEE Computer Society, 2015.
 Villani (2008) C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. ISBN 9783540710509.
 Wei et al. (2018) X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang. Improving the improved training of wasserstein gans: A consistency term and its dual effect. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=SJx9GQb0.
 Yazıcı et al. (2019) Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJgw_sRqFQ.
 Zhou et al. (2019a) Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu, and Z. Zhang. Lipschitz generative adversarial nets. CoRR, abs/1902.05687, 2019a. URL http://arxiv.org/abs/1902.05687.
 Zhou et al. (2019b) Z. Zhou, J. Shen, Y. Song, W. Zhang, and Y. Yu. Towards efficient and unbiased implementation of lipschitz continuity in gans. CoRR, abs/1904.01184, 2019b. URL http://arxiv.org/abs/1904.01184.