1 Introduction
Generative adversarial networks (GANs) [Goodfellow et al.2014] are among the most promising generative models and have achieved great success in various challenging tasks. The basic framework of GANs consists of a generator and a discriminator, both parameterized by neural networks. The generator learns to generate samples that fit the target distribution, while the discriminator measures the distance between the generated distribution and the target distribution. Through a minimax game, the adversarially trained discriminator guides the optimization of the generator.
In the vanilla GAN, the discriminator is formulated to estimate the Jensen-Shannon (JS) divergence between the two distributions. However, the resulting model suffers from numerous training problems, e.g., gradient vanishing and mode collapse. The common understanding [Arjovsky and Bottou2017] is that these problems stem from an undesirable property of JS divergence: when two distributions are disjoint, the JS divergence remains constant and thus cannot provide meaningful guidance to the optimization of the generator. According to [Arjovsky and Bottou2017], such a case is very common in practice. Wasserstein distance was thus proposed [Arjovsky et al.2017] as a new objective of GANs, which can provide a continuous measure between two distributions. Empirical experiments demonstrated that Wasserstein distance can significantly improve the training stability.
Wasserstein distance in its primal form is hard to deal with and is thus usually solved in its dual form [Seguy et al.2017, Arjovsky et al.2017]. Wasserstein distance in its dual form requires the discriminative function to be Lipschitz. This raises the problem of how to effectively impose Lipschitz continuity on the discriminative function.
Initially, Wasserstein GAN enforced Lipschitz continuity via weight clipping, which was later shown to lead to suboptimal solutions [Gulrajani et al.2017, Petzka et al.2017]. The most common practice for imposing Lipschitz continuity is gradient penalty, introduced in [Gulrajani et al.2017], which is based on the fact that the Lipschitz constant of a function is equivalent to its maximum gradient scale [Adler and Lunz2018] and imposes Lipschitz continuity by penalizing the gradients at sampled points. As an alternative method, [Miyato et al.2018] introduced spectral normalization, which restricts the maximum singular value of each layer of the neural network and thereby achieves a global restriction on the Lipschitz constant of the neural network. In this paper, we further investigate the implementation of Lipschitz continuity in GANs.
First, we will theoretically show that restricting the Lipschitz constant in the blending region of real and fake distributions is sufficient to leverage the theoretical benefit of Wasserstein distance (the gradient from the optimal discriminative function corresponds to the optimal transport; see Proposition 1). This indicates that spectral normalization, which restricts the global Lipschitz constant of the discriminative function, might be overly restrictive. We provide empirical evidence that spectral normalization makes it difficult to solve for the optimal discriminative function.
On the other hand, we demonstrate that the current implementation of gradient penalty actually introduces extra constraints into the optimization problem, which bias the optimal discriminative function such that the theoretical benefit of Wasserstein distance is impaired. Gradient penalty imposes Lipschitz continuity via the penalty method. It is worth noting that the penalty method is a soft restriction and the resulting Lipschitz constant is usually larger than the target value. Given that the Lipschitz constant is larger than the target value, many sample points that do not attain the maximum gradient might also have gradients larger than the target value. This gives rise to the superfluous constraints imposed by gradient penalty: to regularize the Lipschitz constant, one should penalize only the maximum gradient, and the penalties introduced on other sample points are superfluous. We study the impact of these superfluous constraints with synthetic experiments and find that they indeed harm the optimization and alter the properties of the optimal discriminative function for the worse.
Based on the analysis, we propose to impose Lipschitz continuity in the blending region of real and fake samples by regularizing the maximum gradient. Unlike gradient penalty, which indiscriminately penalizes all sample gradients, we estimate the maximum gradient and penalize only it, which avoids introducing superfluous constraints. In addition, to provide a method for strictly imposing Lipschitz continuity, we also present a method based on the augmented Lagrangian [Nocedal and Wright2006]. The augmented Lagrangian is a classic replacement for the penalty method in which an additional Lagrange multiplier term is introduced. Owing to the Lagrange multiplier, it is able to strictly impose the constraint and thus benefits situations where strict Lipschitz continuity is required.
The remainder of this paper is organized as follows. In Section 2, we review the background and current implementations of Lipschitz continuity in GANs. In Section 3, we analyze the properties of Lipschitz continuity in GANs and the behaviors of existing implementations. In Section 4, we accordingly propose our methods that aim at eliminating the potential issues in the existing methods. We empirically study these methods in Section 5 and finally conclude this paper in Section 6.
2 Preliminaries
2.1 Wasserstein Distance and Lipschitz Continuity
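Given two metric spaces $(X, d_X)$ and $(Y, d_Y)$, a function $f: X \to Y$ is said to be Lipschitz continuous if there exists some constant $k \ge 0$ such that
$$d_Y(f(x_1), f(x_2)) \le k \cdot d_X(x_1, x_2), \quad \forall x_1, x_2 \in X. \tag{1}$$
In this paper and in most existing GANs, the metrics $d_X$ and $d_Y$ are by default the Euclidean distance, which we denote by $\|\cdot\|$. The smallest constant $k$ is called the (best) Lipschitz constant of $f$, which we denote by $\|f\|_{Lip}$.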
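As a minimal sketch of the definition above, the best Lipschitz constant of a scalar function can be bounded from below by the largest slope between any two sample points. The test function $\sin(2x)$ and the sampling grid are hypothetical choices for illustration; the estimate never exceeds the true constant, which is 2 here.

```python
import math

def lipschitz_lower_bound(f, xs):
    """Largest slope |f(a) - f(b)| / |a - b| over all pairs of distinct sample points."""
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            slope = abs(f(xs[i]) - f(xs[j])) / abs(xs[i] - xs[j])
            best = max(best, slope)
    return best

# Hypothetical test function: sin(2x) has Lipschitz constant exactly 2.
grid = [i / 100.0 for i in range(-300, 301)]
estimate = lipschitz_lower_bound(lambda x: math.sin(2.0 * x), grid)
print(estimate)
```

A finite sample can only underestimate the constant, which is one reason why estimating the maximum gradient from sampled points is itself an approximation.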
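The first-order Wasserstein distance $W_1$ between two probability distributions $P_r$ and $P_g$ is defined as
$$W_1(P_r, P_g) = \inf_{\pi \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \pi} \left[ \|x - y\| \right], \tag{2}$$
where $\Pi(P_r, P_g)$ denotes the set of all probability measures with marginals $P_r$ and $P_g$. It can be interpreted as the minimum cost of transporting the distribution $P_g$ to the distribution $P_r$. The Kantorovich-Rubinstein (KR) duality [Villani2008] provides a more efficient way to compute the Wasserstein distance. The duality states that
$$W_1(P_r, P_g) = \sup_{f} \; \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)], \quad \text{s.t. } f(x) - f(y) \le \|x - y\|, \; \forall x, \forall y. \tag{3}$$
The constraint in Eq. (3) requires $f$ to be Lipschitz continuous with $\|f\|_{Lip} \le 1$.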
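To make the primal form concrete, the sketch below computes the first-order Wasserstein distance in the special case of two one-dimensional empirical distributions with equally many, equally weighted samples. In that case the optimal transport plan simply matches the sorted samples. The function name and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1D empirical distributions with uniform weights."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples on both sides")
    # In one dimension, matching sorted samples is an optimal transport plan.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

real = [0.0, 1.0, 3.0]
fake = [5.0, 6.0, 8.0]
w = wasserstein_1d(real, fake)
print(w)
```

Shifting every sample by the same amount changes the result by exactly that amount, which is the continuity property that motivates the use of this distance.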
Interestingly, we have the following connection between the optimal solutions in the primal form and dual form [Gulrajani et al.2017].
Proposition 1.
The proposition indicates that the gradient of the optimal $f^*$, which we will later refer to as the discriminative function in the context of GANs, follows the optimal transport plan $\pi^*$.
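For reference, the result cited from [Gulrajani et al.2017] is commonly stated in the following form. This is a sketch in assumed notation: $x$ is drawn from the real distribution, $y$ from the generated distribution, $\pi^*$ is an optimal transport plan, and $f^*$ is an optimal dual function that scores real samples higher. With the opposite sign convention the direction flips.

```latex
% Assumed notation: (x, y) ~ \pi^*, and x_t = t x + (1 - t) y with 0 \le t \le 1.
% If f^* is differentiable and \pi^*(x = y) = 0, then
P_{(x, y) \sim \pi^*} \left[ \nabla f^*(x_t) = \frac{x - y}{\| x - y \|} \right] = 1 .
```

In words, along the straight line between a generated sample and the real sample it is coupled with, the gradient of $f^*$ has unit norm and points at that real sample.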
2.2 Generative Adversarial Networks
Generative adversarial networks (GANs) perform generative modeling via a game between two competing neural networks. The generative network learns to map samples from a prior distribution to a target distribution, while the discriminative network is trained to measure the distance between the target distribution and the distribution of generated samples. The generator and discriminator are connected via a minimax game, so that the generator can minimize the distance between the two distributions as estimated by the adversarially trained discriminator [Milgrom and Segal2002].
In the vanilla GAN [Goodfellow et al.2014], the discriminator is formulated to estimate the Jensen-Shannon (JS) divergence between the two distributions. However, the resulting model suffers from numerous training problems, e.g., gradient vanishing and mode collapse. According to [Arjovsky and Bottou2017], with high-dimensional real data, the supports of the target distribution and the generated distribution are very likely to have an intersection of measure zero. In such cases, the JS divergence remains constant and is not able to provide meaningful guidance to the optimization of the generator.
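The constancy of JS divergence on disjoint supports can be checked on a small discrete example. The sketch below places two point masses on a hypothetical grid of ten bins: the JS divergence equals $\log 2$ no matter how far apart the masses are, whereas the first-order Wasserstein distance between two point masses is simply their separation.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_point_masses(a, b):
    """W1 between two point masses at positions a and b is just |a - b|."""
    return abs(a - b)

def point_mass(position, size=10):
    return [1.0 if i == position else 0.0 for i in range(size)]

for gap in (1, 4, 9):
    p, q = point_mass(0), point_mass(gap)
    print(gap, js(p, q), w1_point_masses(0, gap))
```

The JS column stays at the same value for every gap, so it carries no information about how to move one distribution towards the other.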
Wasserstein distance was thus proposed [Arjovsky et al.2017] as a new objective of GANs, which can provide a continuous measure between two distributions. Given that the discriminator is well-trained, the generator will receive sustained supervision from the discriminator towards minimizing the Wasserstein distance. Formally, their proposed Wasserstein GAN is defined as follows:
$$\min_{g} \max_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{z \sim P_z}[f(g(z))], \tag{5}$$
where $P_r$ is the target data distribution and $P_z$ is the prior distribution (of noise), $g$ and $f$ represent the generative and discriminative function respectively, and $\mathcal{F}$ represents the function class of $f$.
2.3 Implementations of Lipschitz Continuity
Wasserstein GAN requires the discriminative function to be Lipschitz. Subsequently, researchers [Kodali et al.2017, Fedus et al.2017, Miyato et al.2018] empirically found that Lipschitz continuity is also useful when combined with other GAN objectives, e.g., the vanilla GAN objective. Recently, this phenomenon has also been explained theoretically [Farnia and Tse2018, Zhou et al.2019]: combining Lipschitz continuity with a common GAN objective yields a variant distance metric that, like Wasserstein distance, provides a continuous measure between the real and fake distributions. As it stands, Lipschitz continuity is a promising technique for improving the training of GANs with theoretical guarantees. However, the implementation of Lipschitz continuity remains challenging.
Quite a few recent works are devoted to investigating the implementation of Lipschitz continuity. The initial attempt in [Arjovsky et al.2017] regularizes the Lipschitz continuity via weight clipping, i.e., restricting the maximum value of each weight. It was later shown to lead to suboptimal solutions [Gulrajani et al.2017, Petzka et al.2017].
Alternative methods were thus proposed for imposing Lipschitz continuity, named gradient penalty ($J_{gp}$) and Lipschitz penalty ($J_{lp}$) respectively. The two methods share the same spirit and achieve Lipschitz continuity by penalizing the gradient at sampled points towards a given target value (which is usually $1$, though not necessarily [Karras et al.2017, Adler and Lunz2018]). They are based on the fact that the Lipschitz constant of a function is equivalent to its maximum gradient scale [Adler and Lunz2018].
Formally, the two methods introduce the following regularization terms, respectively:
$$J_{gp} = \lambda \, \mathbb{E}_{x \sim P_{\hat{x}}} \left[ \left( \|\nabla_x f(x)\| - 1 \right)^2 \right], \tag{6}$$
$$J_{lp} = \lambda \, \mathbb{E}_{x \sim P_{\hat{x}}} \left[ \left( \max(0, \|\nabla_x f(x)\| - 1) \right)^2 \right], \tag{7}$$
where $P_{\hat{x}}$ denotes the sampling distribution defined by the sampling strategy, which is typically random linear interpolation between real and fake samples.
[Petzka et al.2017] argued that gradient penalty is less reasonable because 1-Lipschitz continuity does not necessarily imply that the gradient scale at every sample point is $1$. This is also the main reason why they proposed to penalize only gradients whose scale is larger than $1$.
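The difference between the two-sided and one-sided penalties can be sketched directly on a list of gradient norms. The code assumes a target gradient norm of 1 and takes the norms as given, since computing them requires an automatic differentiation framework. The function names, the default weight of 10, and the sample norms are hypothetical.

```python
def gradient_penalty(grad_norms, weight=10.0, target=1.0):
    """Two-sided penalty: mean of (||grad|| - target)^2 over the sampled points."""
    return weight * sum((g - target) ** 2 for g in grad_norms) / len(grad_norms)

def lipschitz_penalty(grad_norms, weight=10.0, target=1.0):
    """One-sided penalty: only norms above the target contribute."""
    return weight * sum(max(0.0, g - target) ** 2 for g in grad_norms) / len(grad_norms)

# Hypothetical gradient norms at three sampled points.
norms = [0.5, 1.0, 1.5]
gp = gradient_penalty(norms)
lp = lipschitz_penalty(norms)
print(gp, lp)
```

The point with norm 0.5 is penalized by the two-sided term but ignored by the one-sided term, which is the distinction argued for in [Petzka et al.2017].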
Apart from those already mentioned, [Miyato et al.2018] provide a new direction for enforcing Lipschitz continuity, named spectral normalization [Yoshida and Miyato2017], which is based on another fact: the Lipschitz constant of a linear function, $f(x) = Wx$, is equivalent to the maximum singular value of the weight matrix. Given that the maximum singular value of a weight matrix is easily attainable, they proposed to divide the weight of each linear layer of a neural network by its maximum singular value, i.e.,
$$\bar{W}_{SN} = \frac{W}{\sigma(W)}, \tag{8}$$
where $\sigma(W)$ denotes the maximum singular value of $W$. As a result, the Lipschitz constant of every linear layer is fixed at $1$. Then, if the nonlinear parts (i.e., activation functions) are also Lipschitz continuous (which is true for common activation functions), the resulting model will have an upper bound on the Lipschitz constant.
It is worth noting that spectral normalization results in a hard global restriction on the Lipschitz constant, while gradient penalty and Lipschitz penalty are soft local regularizations.
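As an illustration of the normalization step, the sketch below estimates the maximum singular value of a small weight matrix by power iteration and divides the matrix by it. It is a plain-Python sketch with hypothetical helper names and a fixed starting vector. It is not the implementation of [Miyato et al.2018], which operates on network layers during training.

```python
import math

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def max_singular_value(w, iterations=100):
    """Power iteration on W^T W; returns an estimate of the largest singular value."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))
    for _ in range(iterations):
        v = normalize(matvec(wt, matvec(w, v)))
    return math.sqrt(sum(x * x for x in matvec(w, v)))

def spectral_normalize(w, iterations=100):
    """Divide every entry by the estimated largest singular value."""
    sigma = max_singular_value(w, iterations)
    return [[x / sigma for x in row] for row in w]

w = [[3.0, 0.0], [0.0, 1.0]]
sigma = max_singular_value(w)
w_sn = spectral_normalize(w)
print(sigma, w_sn)
```

After normalization the largest singular value is 1, so the linear map cannot stretch any direction. The smaller singular value shrinks as well, which is one way to see that the restriction is global.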
3 Analysis and Motivations
3.1 The Local Lipschitz Continuity
The most common choice of $P_{\hat{x}}$ in gradient penalty and Lipschitz penalty is the distribution formed by random linear interpolations between the real and fake samples. Currently, why such a choice is valid is still not clear, and people tend to believe that it is only a deleterious practical trick [Miyato et al.2018].
Here, we provide a theoretical justification as follows. Let $S_{\hat{x}}$ denote the support of the linear interpolations between real and fake distributions. We have
Lemma 1.
Imposing Lipschitz continuity over $S_{\hat{x}}$ is sufficient to maintain the property of Proposition 1.
To reach this conclusion, we need to delve deeper into the well-known KR duality (Eq. (3)). The fact is that the constraint in the dual form of Wasserstein distance can be looser than the one in KR duality. Specifically, one sufficient constraint for Wasserstein distance in the dual form is as follows:
$$W_1(P_r, P_g) = \sup_{f} \; \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)], \quad \text{s.t. } f(x) - f(y) \le \|x - y\|, \; \forall x \in S_r, \forall y \in S_g, \tag{9}$$
where $S_r$ and $S_g$ denote the supports of $P_r$ and $P_g$. Note that the difference from KR duality is that $x$ and $y$ are now from $S_r$ and $S_g$, instead of being arbitrary, which means the constraints in Eq. (9) are a subset of those in Eq. (3).
It is worth noticing that, given the constraints in Eq. (9), any other constraint in Eq. (3) does not affect the final value of $W_1(P_r, P_g)$, and more importantly, any optimal $f^*$ in Eq. (9) corresponds to one in Eq. (3) with the value of $f^*$ on $S_r$ and $S_g$ unchanged. Thus, any $f^*$ in Eq. (9) also holds the following key property of Wasserstein distance [Villani2008, Gulrajani et al.2017]:
Lemma 2.
Note that Proposition 1 is based on Eq. (10) and the Lipschitz continuity of $f^*$. We can further notice that $f^*$ being Lipschitz continuous locally over the interpolation support $S_{\hat{x}}$ is sufficient for the proof of Proposition 1. Finally, being Lipschitz continuous locally over $S_{\hat{x}}$ is also a sufficient condition for the constraint in Eq. (9). Thus, as long as $f^*$ is Lipschitz continuous locally over $S_{\hat{x}}$, Proposition 1 holds.
Note that Lemma 1 also indicates that, for training GANs, restricting the global Lipschitz constant might be unnecessary. We next show that although imposing local Lipschitz continuity is sufficient, the current implementations of local Lipschitz continuity are biased.
3.2 The Superfluous Constraints
Gradient penalty and Lipschitz penalty impose Lipschitz continuity via the penalty method. The penalty method is a soft regularization, under which the constraint usually drifts slightly.
To be more concrete, we consider the following objective, assuming we can directly optimize the Lipschitz constant $k$:
$$\min_{k} \; J(k) = -S(k) + \lambda (k - 1)^2, \tag{11}$$
where $S(k)$ denotes the supremum of the Wasserstein distance objective, i.e., $\mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$, under the restriction that $\|f\|_{Lip} \le k$.
It is clear that $S(k) = k \cdot S(1)$. Given that $P_r$ and $P_g$ are fixed, $S(1)$ is a constant. Therefore, $J(k)$ is a quadratic function of $k$ and the optimal $k$ is $1 + S(1) / (2\lambda)$. Note that replacing $(k - 1)^2$ with $\max(0, k - 1)^2$ will result in the same optimal $k$.
From the above, we can see that when $\lambda$ is small or the distance between $P_r$ and $P_g$ is large, the resulting Lipschitz constant can be much larger than $1$. Under these circumstances, both gradient penalty and Lipschitz penalty introduce superfluous constraints. Say the Lipschitz constant is $k$; then sampled points with gradient larger than $1$ but smaller than $k$ are penalized inadvertently.
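The drift of the Lipschitz constant under the penalty method can be checked numerically. The sketch below assumes that the objective around Eq. (11) takes the form $-k \cdot d + \lambda (k - 1)^2$, where $d$ stands for the distance attained at Lipschitz constant 1. This form, the function names, and the example values are assumptions made for illustration. A grid search is compared against the closed-form minimizer $1 + d / (2\lambda)$.

```python
def penalized_objective(k, distance, weight, target=1.0):
    """Assumed loss: S(k) = k * distance, plus penalty weight * (k - target)^2."""
    return -k * distance + weight * (k - target) ** 2

def best_k_on_grid(distance, weight, target=1.0, step=0.001, k_max=20.0):
    """Brute-force minimizer of the assumed loss over a uniform grid of k values."""
    ks = [i * step for i in range(int(k_max / step) + 1)]
    return min(ks, key=lambda k: penalized_objective(k, distance, weight, target))

def best_k_closed_form(distance, weight, target=1.0):
    """Stationary point of the quadratic in k."""
    return target + distance / (2.0 * weight)

for distance, weight in [(1.0, 10.0), (5.0, 10.0), (5.0, 0.5)]:
    print(distance, weight, best_k_on_grid(distance, weight),
          best_k_closed_form(distance, weight))
```

With a small penalty weight or a large distance, the minimizing $k$ sits far above the target, so every sampled gradient between the target and $k$ would be pushed down by a per-sample penalty.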
We will see in the experiments that these superfluous constraints alter the optimal discriminative function and damage the properties of the gradient received by the generator. [Petzka et al.2017] noted that Lipschitz penalty has a connection to regularized Wasserstein distance. Unfortunately, regularized Wasserstein distance usually also alters the properties of the optimal discriminative function and blurs the transport plan $\pi^*$ [Seguy et al.2017], which is consistent with our analysis here.
4 The Proposed Methods
Now we present our investigation towards a more efficient and unbiased implementation of Lipschitz continuity. Given that local Lipschitz continuity over the support of the linear interpolations between real and fake distributions ($S_{\hat{x}}$) is sufficient, we consider restricting the Lipschitz constant only in this region.
4.1 Max Gradient Penalty
Similar to gradient penalty, we can regularize the Lipschitz constant via the penalty method. However, to avoid the superfluous constraints, we need to penalize only the maximum gradient in $S_{\hat{x}}$. The resulting regularization is as follows:
$$J_{maxgp} = \lambda \left( \max_{x \in S_{\hat{x}}} \|\nabla_x f(x)\| - 1 \right)^2. \tag{12}$$
Analogous to Lipschitz penalty, we can also extend the penalty term with $\max(0, \cdot)$. However, when only the maximum gradient is regularized, this is less necessary, because it only takes effect when the discriminator is underfitting.
In practice, we follow [Gulrajani et al.2017] and sample $x$ as random linear interpolations of real and fake samples in parallel minibatches. We can either directly use the maximum of the gradients sampled in a minibatch, or keep a historical buffer of the points with maximum gradients, which is updated in every iteration, and then take the maximum gradient over the buffer. The latter tries to avoid an inaccurate estimate of the maximum gradient. We have studied both options in experiments. According to our experiments, the buffer is usually unnecessary: using the maximum gradient in a minibatch is good enough.
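The contrast between penalizing every sampled gradient and penalizing only the largest one can be sketched on a list of gradient norms. The code assumes a target of 1 and a squared penalty on the maximum norm, which is one plausible reading of the regularizer rather than a confirmed formula. The function names and values are hypothetical, and the optional buffer is simplified to a list of norms re-evaluated at buffered points.

```python
def gradient_penalty(grad_norms, weight=10.0, target=1.0):
    """Penalize every sampled gradient norm towards the target (GP-style term)."""
    return weight * sum((g - target) ** 2 for g in grad_norms) / len(grad_norms)

def max_gradient_penalty(grad_norms, weight=10.0, target=1.0, buffer_norms=()):
    """Penalize only the largest gradient norm in the batch and optional buffer."""
    largest = max(list(grad_norms) + list(buffer_norms))
    return weight * (largest - target) ** 2

# Hypothetical gradient norms at four interpolated points.
batch = [0.3, 0.8, 1.2, 1.6]
gp = gradient_penalty(batch)
maxgp = max_gradient_penalty(batch)
maxgp_buffered = max_gradient_penalty(batch, buffer_norms=[1.9])
print(gp, maxgp, maxgp_buffered)
```

In the per-sample term all four points contribute, including the three that do not attain the maximum. In the maximum-based term only the point with norm 1.6 (or the buffered point with norm 1.9) receives a gradient signal.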
4.2 Augmented Lagrangian
With the penalty method, the constraint is usually not strictly satisfied. The resulting Lipschitz constant, as discussed around Eq. (11), floats. When the constraint needs to be strictly imposed, the augmented Lagrangian is a classic alternative to the penalty method. In the context of GANs, strictly imposing the Lipschitz constraint might help control variables in comparative experiments, e.g., when comparing different networks and objectives. Also, if one would like to strictly evaluate the Wasserstein distance, a strict restriction on the Lipschitz constant would be favorable.
The augmented Lagrangian method extends the penalty method by including an extra Lagrange multiplier term. Given that the augmented Lagrangian is a simple extension with potential benefits, we also investigated its practical performance in imposing Lipschitz continuity. The regularization term derived from the augmented Lagrangian can be written as follows:
(13) 
where is the Lagrange multiplier.
The so-called augmented Lagrangian method can also be viewed as an extension of the Lagrange multiplier method, in which only the first term is introduced, and the quadratic penalty term is regarded as the augmentation. Based on the first-order optimality conditions of the two methods, the optimal multiplier in the Lagrange multiplier method equals the optimal multiplier in the augmented Lagrangian shifted by a term proportional to the constraint violation. This gives a commonly used, intuitive update rule for the multiplier in the augmented Lagrangian, i.e.,
(14) 
Thus, to optimize the augmented Lagrangian, one need only introduce the augmented Lagrangian regularization and add an extra update step for the multiplier according to Eq. (14) after each iteration.
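The mechanics can be illustrated on a one-variable toy problem. The sketch below assumes a generic augmented Lagrangian of the form $-k \cdot d + \alpha (k - 1) + \lambda (k - 1)^2$ with the multiplier update $\alpha \leftarrow \alpha + 2\lambda (k - 1)$. The sign and scaling conventions, the symbol names, and the closed-form inner minimization are assumptions for illustration and may differ from Eq. (13) and Eq. (14).

```python
def inner_minimizer(distance, alpha, weight, target=1.0):
    """Argmin over k of -k*distance + alpha*(k - target) + weight*(k - target)^2."""
    return target + (distance - alpha) / (2.0 * weight)

def augmented_lagrangian(distance, weight=1.0, target=1.0, steps=5):
    """Alternate exact minimization over k with the multiplier update."""
    alpha = 0.0
    history = []
    for _ in range(steps):
        k = inner_minimizer(distance, alpha, weight, target)
        history.append((k, alpha))
        # Multiplier update: absorb the penalty gradient at the current violation.
        alpha = alpha + 2.0 * weight * (k - target)
    return history, alpha

history, alpha = augmented_lagrangian(distance=5.0, weight=1.0)
print(history, alpha)
```

With the multiplier at zero the first iterate is the pure penalty solution, which overshoots the target. Once the multiplier has absorbed the violation, the constraint is met exactly. In GAN training the inner minimization is only approximate, so the convergence would be gradual rather than immediate.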
5 Experiments
In this section, we study the practical behaviors of various implementations of Lipschitz continuity, including spectral normalization (SN), gradient penalty (GP), maximum gradient penalty (MAXGP), and the augmented Lagrangian with maximum gradient constraint (MAXAL). In our experiments, Lipschitz penalty performs very similarly to gradient penalty.
We use a multilayer perceptron for all toy experiments and a ResNet architecture [He et al.2016] similar to the one used in [Gulrajani et al.2017] for all other real data experiments. We use Adam optimizer [Kingma and Ba2014] with , . Frechet Inception Distance (FID) [Heusel et al.2017] was used to quantitatively evaluate the resulting models. The anonymous code is provided at https://bit.ly/2H4i3Cy.
5.1 Two Dimensional Toy Data
To intuitively study the properties of different methods, we first test their performance on simple two-dimensional data. In this experiment, we randomly sample two data points in two-dimensional space as $P_r$ and another two points as $P_g$. We fix these two distributions and train a discriminator with different implementations of Lipschitz continuity.
We want to check whether these methods are able to achieve the optimal discriminative function by verifying the gradients at generated samples, which should follow Proposition 1 and point towards the target real samples that minimize the transport cost.
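For such a small problem the reference answer can be computed by brute force. The sketch below enumerates all matchings between two fake points and two real points, picks the one with minimum transport cost, and returns the unit directions from each fake point to its matched real point. These are the directions that the gradients of an optimal discriminative function would be expected to follow. The coordinates and function names are hypothetical.

```python
import itertools
import math

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def optimal_matching(fake, real):
    """Brute-force the permutation of real points with minimum total transport cost."""
    best_cost, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(real))):
        cost = sum(distance(f, real[j]) for f, j in zip(fake, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost / len(fake)

def target_directions(fake, real):
    """Unit vectors from each fake point to its matched real point."""
    perm, _ = optimal_matching(fake, real)
    dirs = []
    for f, j in zip(fake, perm):
        d = distance(f, real[j])
        dirs.append(((real[j][0] - f[0]) / d, (real[j][1] - f[1]) / d))
    return dirs

real = [(0.0, 1.0), (4.0, 1.0)]
fake = [(0.0, 0.0), (4.0, 0.0)]
perm, w1 = optimal_matching(fake, real)
print(perm, w1, target_directions(fake, real))
```

Comparing a trained discriminator's gradients at the fake points against these directions gives a direct check of whether it has reached the optimum.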
Our first interesting observation is that SN in some cases failed to achieve the optimal discriminator. As shown in Figure 1, SN quickly converged to a suboptimal solution and got stuck there. We do not yet fully understand why this phenomenon appears. It might be because the global Lipschitz constraint leaves the capacity of the discriminator so underused that the optimal discriminative function is not attainable; however, we tried fairly large networks, and this did not eliminate the problem. It might also stem from the imperfect singular value estimation of power iteration; however, increasing the number of power iterations used to estimate the singular value did not solve the problem either. We also tried both in-place updates and updates with collection, and the problem consistently exists. Training the discriminator for a very long time with a decreasing learning rate also does not solve this problem, and the final result remains unchanged. We leave further investigation as future work.
In Figure 1, we also noticed that GP leads to an oscillatory discriminator, which indicates that the superfluous constraints affect the optimal discriminator. By contrast, MAXGP stably converged to the optimal discriminator, where the gradients of the fake samples point towards the real samples in an optimal transport manner.
5.2 Toy Real World Data
We further compare these methods on real world data. We still want to check whether these methods converge to the optimal discriminative function. However, a real world dataset is too large, and we found that in practice the optimal discriminator is almost unachievable. Hence, in this experiment, we use a small subset of the real world dataset instead. Specifically, we select ten representative CIFAR-10 images as $P_r$ and use ten random noise samples as $P_g$. Then, as above, we train the discriminator until it is optimal and check the gradient of the resulting discriminative function for the different methods.
For the high-dimensional case, visualizing the gradient direction is nontrivial. Hence, we plot the gradient and the corresponding increments. In Figure 2, the leftmost image in each row is a sample $x$ from $P_g$ and the second is its gradient $\nabla_x f(x)$. The interior images are $x + \epsilon \cdot \nabla_x f(x)$ with increasing $\epsilon$, and the rightmost is the nearest real sample from $P_r$ (i.e., the one closest to any point on the incremental path).
From the results, MAXGP is also able to achieve the optimal discriminative function in the high-dimensional case. We see that the gradients of the ten noise samples in $P_g$ point towards the ten real images in $P_r$, respectively. However, the resulting gradients of GP do not clearly point towards real samples. The gradient tends to be a blend of several images in the target domain, and it also exhibits a sort of mode collapse (multiple cats and birds). This experiment once again verifies that the superfluous constraints introduced by GP are harmful.
5.3 Sample Quality on CIFAR-10
We now test the practical difference when training a complete GAN model using these methods to impose the Lipschitz constraint. In these experiments, we train the model not only with the WGAN objective but also with the hinge loss [Miyato et al.2018] and the vanilla GAN objective [Goodfellow et al.2014, Fedus et al.2017], which have also been found to work well under the Lipschitz continuity constraint. The results in terms of FID training curves are plotted in Figure 3.
In Figure 2(a), we compare GP, MAXGP and MAXAL with different regularization weights under the WGAN objective. We see that the training progress and final results are quite similar to each other. As we found in the experiments on toy real world data, even when $P_r$ and $P_g$ both consist of only ten images, the optimal discriminator is already very hard to achieve. We train the discriminator for iterations with a decreasing learning rate to achieve the result in Figure 2. We believe that the reason why these methods do not show an obvious difference in these real world applications lies at the optimization level. That is, in the current hyperparameter settings, e.g., DCGAN or shallow ResNet, the optimal discriminative function of WGAN is almost impossible to achieve. It might also be related to issues of the optimizer. Adam [Kingma and Ba2014], the commonly used and somewhat powerful optimizer for GANs, was recently shown not to guarantee convergence [Reddi et al.2018, Zhou et al.2018, Zou et al.2018].
In these experiments, we initially used the WGAN objective for all methods. However, we found that with the ResNet architecture [Gulrajani et al.2017], SN failed to converge. The same holds under various small modifications of the hyperparameters. We notice that in [Miyato et al.2018], when using the ResNet architecture, the model with SN is trained using a hinge loss. We therefore also tested SN with the hinge loss and, in addition, with the vanilla GAN objective, which was also found to work well given the Lipschitz constraint. The results are plotted in Figure 2(b). We also include the results of MAXGP with these objectives for comparison. As we can see, the result of MAXGP is generally better than that of SN.
Lastly, we inspect the properties of MAXAL. As shown in Figure 3(a), MAXAL is able to quickly restrict the Lipschitz constant to the given target and keep the Lipschitz constant fairly stable during training. By contrast, the Lipschitz constants under GP and MAXGP keep changing. Another interesting fact about MAXAL is that when trained with the WGAN objective, the optimal is equivalent to . We verify this fact by plotting these two terms together during training. As shown in Figure 3(b), the two lines almost overlap.
6 Conclusion
In this paper, we demonstrated that restricting the Lipschitz constant over the support of the interpolations of real and fake samples is sufficient to gain the advantageous gradient properties induced by Lipschitz continuity. This provides a theoretical guarantee for the validity of the empirical gradient-penalty-based methods. Meanwhile, it suggests that a global restriction on the Lipschitz constant is unnecessary. Combined with our finding that spectral normalization, the method that provides a global restriction on the Lipschitz constant, fails in many practical scenarios, we suggest using methods that regularize the local Lipschitz constant. On the other hand, we also observed that the current implementations of local Lipschitz continuity, i.e., gradient penalty and Lipschitz penalty, introduce superfluous constraints into the optimization problem, which evidently alter the optimal discriminative function and impair the favorable gradient properties.
We have accordingly proposed a revision to gradient penalty. Our experiments demonstrated that the proposed method is able to achieve the optimal discriminative function in an unbiased manner. In addition, we suggested the augmented Lagrangian as a simple yet good alternative to the penalty method, which is able to strictly restrict the Lipschitz constant to a given target.
References
 [Adler and Lunz2018] Jonas Adler and Sebastian Lunz. Banach wasserstein gan. arXiv preprint arXiv:1806.06621, 2018.
 [Arjovsky and Bottou2017] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
 [Arjovsky et al.2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [Farnia and Tse2018] Farzan Farnia and David Tse. A convex duality framework for gans. In Advances in Neural Information Processing Systems 31, 2018.
 [Fedus et al.2017] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: Gans do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446, 2017.
 [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [Gulrajani et al.2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.

 [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [Heusel et al.2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
 [Karras et al.2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Kodali et al.2017] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.
 [Milgrom and Segal2002] Paul Milgrom and Ilya Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
 [Miyato et al.2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 [Nocedal and Wright2006] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
 [Petzka et al.2017] Henning Petzka, Asja Fischer, and Denis Lukovnicov. On the regularization of wasserstein gans. arXiv preprint arXiv:1709.08894, 2017.
 [Reddi et al.2018] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint, 2018.
 [Seguy et al.2017] Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.
 [Villani2008] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 [Yoshida and Miyato2017] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.
 [Zhou et al.2018] Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, and Yong Yu. Adashift: Decorrelation and convergence of adaptive learning rate methods. arXiv preprint arXiv:1810.00143, 2018.
 [Zhou et al.2019] Zhiming Zhou, Jiadong Liang, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Yong Yu, and Zhihua Zhang. Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687, 2019.
 [Zou et al.2018] Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of adam and rmsprop. arXiv preprint arXiv:1811.09358, 2018.