The Generative Adversarial Network (GAN) Goodfellow et al. (2014) has become one of the most powerful paradigms for learning real-world distributions, especially for image-related data. It has been successfully applied to a host of applications such as image translation Isola et al. (2017); Kim et al. (2017); Zhu et al. (2017), super-resolution imaging Wang et al. (2015), pose editing Pumarola et al. (2018b), and facial animation Pumarola et al. (2018a).
Despite these many accomplishments, the major hurdle blocking the full impact of GANs is their notoriously difficult training phase. In the language of game theory, a GAN seeks a pure strategy equilibrium, which is well known to be ill-posed in many scenarios Dasgupta & Maskin (1986). Indeed, it is known that a pure strategy equilibrium might not exist Arora et al. (2017), might be degenerate Sønderby et al. (2017), or cannot be reliably reached by existing algorithms Mescheder et al. (2017).
Empirically, it has also been observed that common algorithms, such as SGD or Adam Kingma & Ba (2015), lead to unstable training. While much effort has been devoted to understanding the training dynamics of GANs Balduzzi et al. (2018); Gemp & Mahadevan (2018); Gidel et al. (2018a, b); Liang & Stokes (2018), a provably convergent algorithm for general GANs, even under reasonably strong assumptions, is still lacking.
In this paper, we address the above problems with the following contributions:
- We propose to study the mixed Nash Equilibrium (NE) of GANs: Instead of searching for an optimal pure strategy, which might not even exist, we optimize over the set of probability distributions over pure strategies of the networks. The existence of a solution to such problems was established in some of the earliest work in game theory Glicksberg (1952), leading to well-posed optimization problems.
- We demonstrate that the prox methods of Nemirovsky & Yudin (1983); Beck & Teboulle (2003); Nemirovski (2004), which are fundamental building blocks for solving two-player games with finitely many strategies, can be extended to continuously many strategies, and are hence applicable to training GANs. We provide an elementary proof of their convergence rates for learning the mixed NE.
- We further establish heuristic guidelines to greatly scale down the memory and computational costs, resulting in simple algorithms whose per-iteration complexity is almost as cheap as SGD.
- We experimentally show that our algorithms consistently achieve performance better than or comparable to popular baselines such as SGD, Adam, and RMSProp Tieleman & Hinton (2012).
Related Work: While the literature on training GANs is vast, to our knowledge, only a few papers take the mixed NE perspective. The notion of mixed NE is already present in Goodfellow et al. (2014), but is stated only as an existential result. The authors of Arora et al. (2017) advocate mixed strategies, but do not provide a provably convergent algorithm. Oliehoek et al. (2018) also consider mixed NE, but only with finitely many parameters. The work of Grnarova et al. (2018) proposes a provably convergent algorithm for finding the mixed NE of GANs under the unrealistic assumption that the discriminator is a single-layered neural network. In contrast, our results are applicable to arbitrary architectures, including popular ones Arjovsky et al. (2017); Gulrajani et al. (2017).
Due to their fundamental role in game theory, many prox methods have been applied to study the training of GANs Daskalakis et al. (2018); Gidel et al. (2018a); Mertikopoulos et al. (2018). However, these works focus on the classical pure strategy equilibria and are hence distinct from our problem formulation. In particular, they give rise to drastically different algorithms from ours and do not provide convergence rates for GANs.
In terms of analysis techniques, our framework is closely related to Balandat et al. (2016), but with several important distinctions. First, the analysis of Balandat et al. (2016) is based on dual averaging Nesterov (2009), while we consider Mirror Descent and also the more sophisticated Mirror-Prox (see Section 3). Second, unlike our work, Balandat et al. (2016) do not provide any convergence rate for learning mixed NE of two-player games. Finally, Balandat et al. (2016) is only of theoretical interest with no practical algorithm.
Notation: Throughout the paper, we use $z$ to denote a generic variable and $\mathcal{Z}$ its domain. We denote the set of all Borel probability measures on $\mathcal{Z}$ by $\mathcal{M}(\mathcal{Z})$, and the set of all functions on $\mathcal{Z}$ by $\mathcal{F}(\mathcal{Z})$. (Footnote 1: Strictly speaking, our derivation requires mild regularity assumptions on the probability measure and function classes (see Appendix A.1), which are met by most practical applications.) We write $d\mu = \rho\, dz$ to mean that the density function of $\mu$ with respect to the Lebesgue measure is $\rho$. All integrals that do not specify a measure are understood to be with respect to Lebesgue. For any objective of the form $\min_x \max_y F(x, y)$ with $(x_{NE}, y_{NE})$ achieving the saddle-point value, we say that $(x_T, y_T)$ is an $O(T^{-1/2})$-NE if $|F(x_T, y_T) - F(x_{NE}, y_{NE})| = O(T^{-1/2})$. Similarly we can define $O(T^{-1})$-NE. The symbol $\|\cdot\|_{L^\infty}$ denotes the $L^\infty$-norm of functions, and $\|\cdot\|_{TV}$ denotes the total variation norm of probability measures.
2 Problem Formulation
We review standard results in game theory in Section 2.1, whose proofs can be found in Bubeck (2013a, b, c). Section 2.2 relates the training of GANs to the two-player game in Section 2.1, which suggests generalizing the prox methods to infinite dimension.
2.1 Preliminary: Prox Methods for Finite Games
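Consider the classical formulation of a two-player game with finitely many strategies:
$$\min_{p \in \Delta_m} \max_{q \in \Delta_n} \; \langle q, a \rangle - \langle q, A p \rangle, \qquad (1)$$
where $A$ is a payoff matrix, $a$ is a vector, and $\Delta_d := \{ z \in \mathbb{R}^d_{\ge 0} : \sum_{i=1}^d z_i = 1 \}$ is the probability simplex, representing the mixed strategies (i.e., probability distributions) over $d$ pure strategies. A pair $(p_{NE}, q_{NE})$ achieving the min-max value in (1) is called a mixed NE.
Assume that the matrix $A$ is too expensive to evaluate whereas the (stochastic) gradients of (1) are easy to obtain. Under such settings, a celebrated algorithm, the so-called entropic Mirror Descent (entropic MD), learns an $O(T^{-1/2})$-NE: Let $\phi(z) := \sum_{i} z_i \log z_i$ be the entropy function and $\phi^\star(y) := \log \sum_{i} e^{y_i} = \sup_{z \in \Delta_d} \{ \langle z, y \rangle - \phi(z) \}$ be its Fenchel dual. For a learning rate $\eta$ and an arbitrary vector $b \in \mathbb{R}^d$, define the MD iterates as
$$z' = \mathrm{MD}_\eta(z, b) \;\equiv\; z' = \nabla\phi^\star\big(\nabla\phi(z) - \eta b\big) \;\equiv\; z'_i = \frac{z_i e^{-\eta b_i}}{\sum_{j=1}^d z_j e^{-\eta b_j}}, \quad 1 \le i \le d. \qquad (2)$$
The equivalence of the last two formulas in (2) can be readily checked.
Denote by $\bar p_T := \frac{1}{T}\sum_{t=1}^T p_t$ and $\bar q_T := \frac{1}{T}\sum_{t=1}^T q_t$ the ergodic averages of two sequences $\{p_t\}$ and $\{q_t\}$. Then, with a properly chosen step-size $\eta$, we have
$$p_{t+1} = \mathrm{MD}_\eta(p_t, -A^\top q_t), \quad q_{t+1} = \mathrm{MD}_\eta(q_t, -a + A p_t) \;\Rightarrow\; (\bar p_T, \bar q_T) \text{ is an } O(T^{-1/2})\text{-NE}.$$
Moreover, a slightly more complicated algorithm, called the entropic Mirror-Prox (entropic MP) Nemirovski (2004), achieves a faster rate than the entropic MD:
$$p_t = \mathrm{MD}_\eta(\tilde p_t, -A^\top \tilde q_t), \quad q_t = \mathrm{MD}_\eta(\tilde q_t, -a + A \tilde p_t), \quad \tilde p_{t+1} = \mathrm{MD}_\eta(\tilde p_t, -A^\top q_t), \quad \tilde q_{t+1} = \mathrm{MD}_\eta(\tilde q_t, -a + A p_t) \;\Rightarrow\; (\bar p_T, \bar q_T) \text{ is an } O(T^{-1})\text{-NE}.$$
If, instead of deterministic gradients, one uses unbiased stochastic gradients for entropic MD and MP, then both algorithms achieve $O(T^{-1/2})$-NE in expectation.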
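To make the entropic MD recursion concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the bilinear payoff $\langle q, a\rangle - \langle q, Ap\rangle$ with exact gradients and simultaneous updates; the function names `md_step`, `entropic_md`, and `duality_gap` are illustrative and do not come from the paper.

```python
import numpy as np

def md_step(z, b, eta):
    """One entropic MD step: z'_i is proportional to z_i * exp(-eta * b_i)."""
    logits = np.log(z) - eta * b
    logits -= logits.max()            # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

def entropic_md(A, a, eta, T):
    """Simultaneous entropic MD on min_p max_q <q, a> - <q, A p>; returns ergodic averages."""
    n, m = A.shape
    p = np.full(m, 1.0 / m)           # uniform initial mixed strategies
    q = np.full(n, 1.0 / n)
    p_avg = np.zeros(m)
    q_avg = np.zeros(n)
    for _ in range(T):
        p_avg += p / T
        q_avg += q / T
        grad_p = -A.T @ q             # gradient of the payoff in p (p minimises)
        grad_q = a - A @ p            # gradient of the payoff in q (q maximises)
        p, q = md_step(p, grad_p, eta), md_step(q, -grad_q, eta)
    return p_avg, q_avg

def duality_gap(A, a, p, q):
    """max_q' F(p, q') - min_p' F(p', q); it is zero exactly at a mixed NE."""
    best_response_q = np.max(a - A @ p)
    best_response_p = q @ a - np.max(A.T @ q)
    return best_response_q - best_response_p
```

For a small random game, the duality gap of the averaged iterates can be inspected with `duality_gap(A, a, *entropic_md(A, a, eta=0.02, T=5000))`; the step size and horizon here are arbitrary illustrative values, not tuned recommendations.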
2.2 Mixed Strategy Formulation for Generative Adversarial Networks
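For illustration, let us focus on the Wasserstein GAN Arjovsky et al. (2017). The training objective of Wasserstein GAN is
$$\min_{\theta \in \Theta} \max_{w \in \mathcal{W}} \; \mathbb{E}_{X \sim P_{\text{real}}}[f_w(X)] - \mathbb{E}_{X \sim P_\theta}[f_w(X)], \qquad (3)$$
where $\Theta$ is the set of parameters for the generator and $\mathcal{W}$ the set of parameters for the discriminator $f$ (Footnote 2: Also known as the “critic” in the Wasserstein GAN literature.), typically both taken to be neural nets. As mentioned in the introduction, such an optimization problem can be ill-posed, which is also supported by empirical evidence.
The high-level idea of our approach is that, instead of solving (3) directly, we focus on the mixed strategy formulation of (3). In other words, we consider the set of all probability distributions over $\Theta$ and $\mathcal{W}$, and we search for the optimal distribution that solves the following program:
$$\min_{\nu \in \mathcal{M}(\Theta)} \max_{\mu \in \mathcal{M}(\mathcal{W})} \; \mathbb{E}_{w \sim \mu} \mathbb{E}_{X \sim P_{\text{real}}}[f_w(X)] - \mathbb{E}_{w \sim \mu} \mathbb{E}_{\theta \sim \nu} \mathbb{E}_{X \sim P_\theta}[f_w(X)]. \qquad (4)$$
Define the function $g: \mathcal{W} \to \mathbb{R}$ by $g(w) := \mathbb{E}_{X \sim P_{\text{real}}}[f_w(X)]$ and the operator $G: \mathcal{M}(\Theta) \to \mathcal{F}(\mathcal{W})$ as $(G\nu)(w) := \mathbb{E}_{\theta \sim \nu,\, X \sim P_\theta}[f_w(X)]$. Denoting $\langle \mu, h \rangle := \mathbb{E}_{\mu} h$ for any probability measure $\mu$ and function $h$, we may rewrite (4) as
$$\min_{\nu \in \mathcal{M}(\Theta)} \max_{\mu \in \mathcal{M}(\mathcal{W})} \; \langle \mu, g \rangle - \langle \mu, G\nu \rangle. \qquad (5)$$
Furthermore, the Fréchet derivative (the analogue of gradient in infinite dimension) of (5) with respect to $\mu$ is simply $g - G\nu$, and the derivative of (5) with respect to $\nu$ is $-G^\dagger \mu$, where $G^\dagger: \mathcal{M}(\mathcal{W}) \to \mathcal{F}(\Theta)$ is the adjoint operator of $G$ defined via the relation
$$\langle \mu, G\nu \rangle = \langle \nu, G^\dagger \mu \rangle \quad \text{for all } \mu \in \mathcal{M}(\mathcal{W}),\ \nu \in \mathcal{M}(\Theta). \qquad (6)$$
One can easily check that $(G^\dagger \mu)(\theta) := \mathbb{E}_{X \sim P_\theta,\, w \sim \mu}[f_w(X)]$ achieves the equality in (6).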
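For completeness, the check of the adjoint relation can be written out in one line. The sketch below assumes the notation $\mu$ for a distribution over discriminator parameters $w$, $\nu$ for a distribution over generator parameters $\theta$, $P_\theta$ for the fake data distribution, and a bounded discriminator $f_w$ so that the order of the expectations may be exchanged (Fubini).

```latex
\begin{align*}
\langle \mu, G\nu \rangle
  &= \mathbb{E}_{w \sim \mu}\big[ (G\nu)(w) \big]
   = \mathbb{E}_{w \sim \mu}\, \mathbb{E}_{\theta \sim \nu}\, \mathbb{E}_{X \sim P_\theta}\big[ f_w(X) \big] \\
  &= \mathbb{E}_{\theta \sim \nu}\Big[ \mathbb{E}_{w \sim \mu}\, \mathbb{E}_{X \sim P_\theta}\big[ f_w(X) \big] \Big]
   = \mathbb{E}_{\theta \sim \nu}\big[ (G^\dagger \mu)(\theta) \big]
   = \langle \nu, G^\dagger \mu \rangle .
\end{align*}
```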
To summarize, the mixed strategy formulation of Wasserstein GAN is (5), whose derivatives can be expressed in terms of $g$ and $G$. We now make the crucial observation that (5) is exactly the infinite-dimensional analogue of (1): The distributions over finite strategies are replaced with probability measures over a continuous parameter set, the vector $a$ is replaced with a function $g$, the matrix $A$ is replaced with a linear operator $G$ (Footnote 3: The linearity of $G$ trivially follows from the linearity of expectation.), and the gradients are replaced with Fréchet derivatives. Based on Section 2.1, it is then natural to ask:
Can the entropic Mirror Descent and Mirror-Prox be extended to infinite dimension to solve (5)? Can we retain the convergence rates?
We provide an affirmative answer to both questions in the next section.
Remark. The derivation in Section 2.2 can be applied to any GAN objective.
3 Infinite-Dimensional Prox Methods
This section builds a rigorous infinite-dimensional formalism in parallel to the finite-dimensional prox methods and proves their convergence rates. While simple in retrospect, to our knowledge, these results are new.
3.1 Preparation: The Mirror Descent Iterates
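We first recall the notion of (Fréchet) derivative in infinite-dimensional spaces. A (nonlinear) functional $\Phi: \mathcal{M}(\mathcal{Z}) \to \mathbb{R}$ is said to possess a derivative at $\mu \in \mathcal{M}(\mathcal{Z})$ if there exists a function $d\Phi(\mu) \in \mathcal{F}(\mathcal{Z})$ such that, for all $\mu' \in \mathcal{M}(\mathcal{Z})$, we have
$$\Phi(\mu + \epsilon \mu') = \Phi(\mu) + \epsilon \langle \mu', d\Phi(\mu) \rangle + o(\epsilon).$$
Similarly, a (nonlinear) functional $\Phi^\star: \mathcal{F}(\mathcal{Z}) \to \mathbb{R}$ is said to possess a derivative at $h \in \mathcal{F}(\mathcal{Z})$ if there exists a measure $d\Phi^\star(h) \in \mathcal{M}(\mathcal{Z})$ such that, for all $h' \in \mathcal{F}(\mathcal{Z})$, we have
$$\Phi^\star(h + \epsilon h') = \Phi^\star(h) + \epsilon \langle d\Phi^\star(h), h' \rangle + o(\epsilon).$$
The most important functionals in this paper are the (negative) Shannon entropy
$$\Phi(\mu) := \int d\mu \, \log \frac{d\mu}{dz}$$
and its Fenchel dual
$$\Phi^\star(h) := \log \int e^{h} \, dz.$$
The first result of our paper is to show that, in direct analogy to (2), the infinite-dimensional MD iterates can be expressed as:
Theorem 1 (Infinite-Dimensional Mirror Descent, informal).
For a learning rate $\eta$ and an arbitrary function $h$, we can equivalently define
$$\mu_+ = \mathrm{MD}_\eta(\mu, h) \;\equiv\; \mu_+ = d\Phi^\star\big(d\Phi(\mu) - \eta h\big) \;\equiv\; d\mu_+ = \frac{e^{-\eta h} \, d\mu}{\int e^{-\eta h} \, d\mu}. \qquad (7)$$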
Moreover, most of the essential ingredients in the analysis of finite-dimensional prox methods can be generalized to infinite dimension.
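The equivalence stated in Theorem 1 can be sanity-checked numerically on a grid. The sketch below is only an illustration under stated assumptions: it discretises a one-dimensional domain, takes the derivative of the negative Shannon entropy at a density $\rho$ to be $1 + \log \rho$ and the derivative of its Fenchel dual at $h$ to be the density $e^{h} / \int e^{h} dz$, and compares the mirror-map form of the MD update with the multiplicative reweighting form. All function names, the grid, the test function `h`, and the value of `eta` are hypothetical choices.

```python
import numpy as np

def d_phi(rho):
    """Derivative of the negative Shannon entropy at density rho: 1 + log(rho)."""
    return 1.0 + np.log(rho)

def d_phi_star(h, dz):
    """Derivative of the Fenchel dual at h: the density e^h / integral(e^h dz)."""
    w = np.exp(h - h.max())            # shifting h leaves the normalised density unchanged
    return w / (w.sum() * dz)

def md_update(rho, h, eta, dz):
    """Multiplicative form: the new density is proportional to rho * exp(-eta * h)."""
    w = rho * np.exp(-eta * (h - h.min()))
    return w / (w.sum() * dz)

# Grid on [-3, 3], a truncated standard Gaussian as the current measure, and a test function h.
z = np.linspace(-3.0, 3.0, 601)
dz = z[1] - z[0]
rho = np.exp(-0.5 * z**2)
rho /= rho.sum() * dz
h = np.sin(2.0 * z) + 0.3 * z**2
eta = 0.7

via_mirror_map = d_phi_star(d_phi(rho) - eta * h, dz)
via_reweighting = md_update(rho, h, eta, dz)
```

Both arrays represent the same updated density up to floating-point error, which mirrors the additive constant in the mirror map being absorbed by the normalisation.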
3.2 Infinite-Dimensional Prox Methods and Convergence Rates
Armed with results in Section 3.1, we now introduce two “conceptual” algorithms for solving the mixed NE of Wasserstein GANs: The infinite-dimensional entropic MD in Algorithm 1 and MP in Algorithm 2. These algorithms iterate over probability measures and cannot be directly used in practice, but they possess rigorous convergence rates, and hence motivate the reduction procedure in Section 4 to come.
Theorem 2 (Convergence Rates).
Let . Let be a constant such that , and be such that and . Let be the relative entropy, and denote by the initial distance to the mixed NE. Then
4 From Theory to Practice
Section 4.1 reduces the conceptual algorithms of Section 3 to a sampling routine that has been widely used in machine learning. Section 4.2 proposes to further simplify the algorithms by summarizing a batch of samples by their mean.
For simplicity, we will only derive the algorithm for entropic MD; the case for entropic MP is similar but requires more computation. To ease the notation, we assume throughout this section as does not play an important role in the derivation below.
4.1 Implementable Entropic MD: From Probability Measure to Samples
Consider Algorithm 1. The reduction consists of three steps.
Step 1: Reformulating Entropic Mirror Descent Iterates
The definition of the MD iterate (7) relates the updated probability measure to the current probability measure , but it tells us nothing about the density function of , from which we want to sample. Our first step is to express (7) in a more tractable form. By recursively applying (7) and using Theorem 4.10 in Appendix A, we have, for some constants ,
For simplicity, assume that is uniform so that is a constant function. Then, by (13) and that , we see that the density function of is simply Similarly, we have
Step 2: Empirical Approximation for Stochastic Derivatives
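The derivatives of (5) involve the function $g$ and the operator $G$. Recall that $g$ requires taking expectation over the real data distribution, which we do not have access to. A common approach is to replace the true expectation with its empirical average:
$$g(w) = \mathbb{E}_{X \sim P_{\text{real}}}[f_w(X)] \simeq \frac{1}{n} \sum_{i=1}^{n} f_w(X_i^{\text{real}}) =: \hat g(w),$$
where the $X_i^{\text{real}}$'s are real data and $n$ is the batch size. Clearly, $\hat g$ is an unbiased estimator of $g$.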
On the other hand, and involve expectation over and , respectively, and also over the fake data distribution . Therefore, if we are able to draw samples from and , then we can again approximate the expectation via the empirical average:
Now, assuming that we have obtained unbiased stochastic derivatives and , how do we actually draw samples from and ? Provided we can answer this question, then we can start with two easy-to-sample distributions , and then we will be able to draw samples from . These samples in turn will allow us to draw samples from , and so on. Therefore, it only remains to answer the above question. This leads us to:
Step 3: Sampling by Stochastic Gradient Langevin Dynamics
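For any probability distribution with density function proportional to $e^{-h}$, the Stochastic Gradient Langevin Dynamics (SGLD) Welling & Teh (2011) iterates as
$$z_{k+1} = z_k - \gamma \hat\nabla h(z_k) + \sqrt{2\gamma}\, \epsilon\, \xi_k, \qquad (8)$$
where $\gamma$ is the step-size, $\hat\nabla h$ is an unbiased estimator of $\nabla h$, $\epsilon$ is the thermal noise, and $\xi_k \sim \mathcal{N}(0, I)$ is a standard normal vector, independently drawn across different iterations.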
Suppose we start at . Plugging and into (8), we obtain, for and , the following update rules:
The theory of Welling & Teh (2011) states that, for large enough , the iterates of SGLD above (approximately) generate samples according to the probability measures . We can then apply this process recursively to obtain samples from . Finally, since the entropic MD and MP output the averaged measure , it suffices to pick a random index and then output samples from .
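As an illustration of the sampling step, the following minimal NumPy sketch runs SGLD on a toy target rather than on a GAN. It assumes the update $z \leftarrow z - \gamma \hat\nabla h(z) + \sqrt{2\gamma}\,\epsilon\,\xi$ with thermal noise $\epsilon = 1$, a target density proportional to $e^{-h}$ with $h(z) = z^2/2$ (a standard Gaussian), and a hypothetical noisy gradient oracle `noisy_grad`; none of these choices come from the paper's experiments.

```python
import numpy as np

def sgld(grad_h_hat, z0, step, n_steps, rng, thermal_noise=1.0):
    """Iterate z <- z - step * grad_h_hat(z) + sqrt(2 * step) * thermal_noise * xi."""
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(z.shape)
        z = z - step * grad_h_hat(z, rng) + np.sqrt(2.0 * step) * thermal_noise * xi
    return z

def noisy_grad(z, rng):
    """Unbiased estimate of grad h for h(z) = z^2 / 2: exact gradient plus zero-mean noise."""
    return z + 0.5 * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
# 5000 independent one-dimensional chains, all started far from the mode.
samples = sgld(noisy_grad, z0=np.full(5000, 4.0), step=0.05, n_steps=500, rng=rng)
```

With a fixed step size the chains carry a small discretisation bias, so the empirical variance of `samples` is close to, but slightly above, the target value of one.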
Remark. In principle, any first-order sampling method is valid above. In the experimental section, we also use a RMSProp-preconditioned version of the SGLD Li et al. (2016).
4.2 Summarizing Samples by Averaging: A Simple yet Effective Heuristic
Although Algorithm 4 and 5 are implementable, they are quite complicated and resource-intensive, as the total computational complexity is . This high complexity comes from the fact that, when computing the stochastic derivatives, we need to store all the historical samples and evaluate new gradients at these samples.
An intuitive approach to alleviate the above issue is to summarize each distribution by a single parameter. To this end, the mean of the distribution is the most natural candidate, as it not only stabilizes the algorithm but is also often easier to acquire than the actual samples. For instance, computing the mean of distributions with density proportional to $e^{-h}$, where $h$ is a loss function defined by deep neural networks, has been empirically proven successful in Chaudhari et al. (2017, 2018); Dziugaite & Roy (2018) via SGLD. In this paper, we adopt the same approach as in Chaudhari et al. (2017), where we use exponential damping (the damping term in Algorithm 3) to increase stability. Algorithm 3, dubbed the Mirror-GAN, shows how to incorporate this idea into entropic MD; the pseudocode for the similar Mirror-Prox-GAN can be found in Algorithm 6 of Appendix D.
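The averaging heuristic can be sketched in a few lines. The snippet below is a hedged illustration of an exponentially damped running mean over a stream of samples (for instance, successive SGLD iterates); the name `damped_mean`, the weight `beta`, and the choice to initialise with the first sample are assumptions made for this example and are not taken from Algorithm 3.

```python
def damped_mean(samples, beta):
    """Exponentially damped running mean: m <- (1 - beta) * m + beta * sample."""
    it = iter(samples)
    mean = next(it)                    # initialise with the first sample
    for s in it:
        mean = (1.0 - beta) * mean + beta * s
    return mean
```

Only the current mean has to be stored, so the memory cost is independent of the number of samples; this is the property that makes a single summary parameter attractive compared with keeping all historical samples.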
5 Experimental Evidence
The purpose of our experiments is twofold. First, we use established baselines to demonstrate that Mirror- and Mirror-Prox-GAN consistently achieve performance better than or comparable to common algorithms. Second, we report that our algorithms are stable and always improve as training goes on. This is in contrast to unstable training algorithms, such as Adam, which often collapse to noise as the iteration count grows Cha (2017).
We use visual quality of the generated images to evaluate different algorithms. We avoid reporting numerical metrics, as recent studies Barratt & Sharma (2018); Borji (2018); Lucic et al. (2018) suggest that these metrics might be flawed. Setting of the hyperparameters and more auxiliary results can be found in Appendix E.
5.1 Synthetic Data
We repeat the synthetic setup as in Gulrajani et al. (2017). The tasks include learning the distribution of 8 Gaussian mixtures, 25 Gaussian mixtures, and the Swiss Roll. For both the generator and discriminator, we use two MLPs with three hidden layers of 512 neurons. We choose SGD and Adam as baselines, and we compare them to Mirror- and Mirror-Prox-GAN. All algorithms are run up to iterations (Footnote 4: One iteration here means using one mini-batch of data. It does not correspond to the in our algorithms, as there might be multiple SGLD iterations within each time step.). The results of 25 Gaussian mixtures are shown in Figure 1; an enlarged figure of 25 Gaussian mixtures and other cases can be found in Appendix E.1.
As Figure 1 shows, SGD performs poorly in this task, while the other algorithms yield reasonable results. However, compared to Adam, Mirror- and Mirror-Prox-GAN fit the true distribution better in two aspects. First, the modes found by Mirror- and Mirror-Prox-GAN are more accurate than the ones by Adam, which are perceivably biased. Second, Mirror- and Mirror-Prox-GAN perform much better in capturing the variance (how spread the blue dots are), while Adam tends to collapse to modes. These observations are consistent throughout the synthetic experiments; see Appendix E.1.
5.2 Real Data
We use the same architecture (DCGAN) as in Radford et al. (2015) with batch normalization. As the networks become deeper in this case, the gradient magnitudes differ significantly across layers. As a result, non-adaptive methods such as SGD or SGLD do not perform well in this scenario. To alleviate such issues, we replace SGLD by the RMSProp-preconditioned SGLD Li et al. (2016) for our sampling routines. For baselines, we consider two adaptive gradient methods: RMSProp and Adam.
Figure 2 shows the results at the th iteration. The RMSProp and Mirror-GAN produce images with reasonable quality, while Adam outputs noise. The visual quality of Mirror-GAN is better than RMSProp, as RMSProp sometimes generates blurry images (the - and -th entry of Figure 8.(b)).
It is worth mentioning that Adam can learn the true distribution at intermediate iterations, but later on suffers from mode collapse and finally degenerates to noise; see Appendix E.2.2.
Systematically understanding and expanding on the game-theoretic perspective of mixed NE, along with stochastic Langevin dynamics for training GANs, is a promising research vein. While simple in retrospect, we provide guidelines for developing approximate infinite-dimensional prox methods that closely mimic the provable optimization framework for learning the mixed NE of GANs. Our proposed Mirror- and Mirror-Prox-GAN algorithms feature cheap per-iteration complexity while rapidly converging to solutions of good quality.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 725594 - time-data), and from Microsoft Research through its PhD Scholarship Programme.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- Arora et al. (2017) Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In International Conference on Machine Learning, pp. 224–232, 2017.
- Balandat et al. (2016) Maximilian Balandat, Walid Krichene, Claire Tomlin, and Alexandre Bayen. Minimizing regret on reflexive banach spaces and nash equilibria in continuous zero-sum games. In Advances in Neural Information Processing Systems, pp. 154–162, 2016.
- Balduzzi et al. (2018) David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 354–363, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- Barratt & Sharma (2018) Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
- Borji (2018) Ali Borji. Pros and cons of gan evaluation measures. arXiv preprint arXiv:1802.03446, 2018.
- Bubeck (2013a) Sebastien Bubeck. Orf523: Mirror descent, part i/ii, 2013a. URL https://blogs.princeton.edu/imabandit/2013/04/16/orf523-mirror-descent-part-iii/, 2013.
- Bubeck (2013b) Sebastien Bubeck. Orf523: Mirror descent, part ii/ii, 2013b. URL https://blogs.princeton.edu/imabandit/2013/04/18/orf523-mirror-descent-part-iiii/, 2013.
- Bubeck (2013c) Sebastien Bubeck. Orf523: Mirror prox, 2013c. URL https://blogs.princeton.edu/imabandit/2013/04/23/orf523-mirror-prox/, 2013.
- Cha (2017) Junbum Cha. Implementations of (theoretical) generative adversarial networks and comparison without cherry-picking. https://github.com/khanrc/tf.gans-comparison, 2017.
- Chaudhari et al. (2017) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
- Chaudhari et al. (2018) Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto, and Guillaume Carlier. Deep relaxation: partial differential equations for optimizing deep neural networks. Research in the Mathematical Sciences, 5(3):30, Jun 2018.
- Dasgupta & Maskin (1986) Partha Dasgupta and Eric Maskin. The existence of equilibrium in discontinuous economic games, i: Theory. The Review of economic studies, 53(1):1–26, 1986.
- Daskalakis et al. (2018) Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.
- Dziugaite & Roy (2018) Gintare Karolina Dziugaite and Daniel Roy. Entropy-sgd optimizes the prior of a pac-bayes bound: Generalization properties of entropy-sgd and data-dependent priors. In International Conference on Machine Learning, pp. 1376–1385, 2018.
- Gemp & Mahadevan (2018) Ian Gemp and Sridhar Mahadevan. Global convergence to the equilibrium of gans using variational inequalities. arXiv preprint arXiv:1808.01531, 2018.
- Gibbs (1902) J Willard Gibbs. Elementary principles in statistical mechanics. Yale University Press, 1902.
- Gidel et al. (2018a) Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018a.
- Gidel et al. (2018b) Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Gabriel Huang, Remi Lepriol, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018b.
- Glicksberg (1952) Irving L Glicksberg. A further generalization of the kakutani fixed point theorem, with application to nash equilibrium points. Proceedings of the American Mathematical Society, 3(1):170–174, 1952.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Gray (2011) Robert M Gray. Entropy and information theory. Springer Science & Business Media, 2011.
- Grnarova et al. (2018) Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. In International Conference on Learning Representations, 2018.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
- Halmos (2013) Paul R Halmos. Measure theory, volume 18. Springer, 2013.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
- Juditsky & Nemirovski (2011) Anatoli Juditsky and Arkadi Nemirovski. First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure. Optimization for Machine Learning, pp. 149–183, 2011.
- Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, pp. 1857–1865, 2017.
- Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Li et al. (2016) Chunyuan Li, Changyou Chen, David E Carlson, and Lawrence Carin. Preconditioned stochastic gradient langevin dynamics for deep neural networks. In AAAI, 2016.
- Liang & Stokes (2018) Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
- Lucic et al. (2018) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. In Advances in neural information processing systems, 2018.
- Mertikopoulos et al. (2018) Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
- Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835, 2017.
- Nemirovski (2004) Arkadi Nemirovski. Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- Nemirovsky & Yudin (1983) AS Nemirovsky and DB Yudin. Problem complexity and method efficiency in optimization. 1983.
- Nesterov (2009) Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
- Oliehoek et al. (2018) Frans A Oliehoek, Rahul Savani, Jose Gallego, Elise van der Pol, and Roderich Groß. Beyond local nash equilibria for adversarial networks. arXiv preprint arXiv:1806.07268, 2018.
- Pumarola et al. (2018a) Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 818–833, 2018a.
- Pumarola et al. (2018b) Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8620–8628, 2018b.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Sønderby et al. (2017) Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. 2017.
- Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- Wang et al. (2015) Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In Proceedings of the IEEE International Conference on Computer Vision, pp. 370–378, 2015.
- Welling & Teh (2011) Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
- Yu et al. (2015) Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.
Appendix A A Framework for Infinite-Dimensional Mirror Descent
A.1 A note on the regularity
It is known that the (negative) Shannon entropy is not Fréchet differentiable in general. However, below we show that the Fréchet derivative can be well-defined if we restrict the probability measures to within the set
We will also restrict the set of functions to be bounded and integrable:
It is important to notice that and implies ; this readily follows from the formula (7).
A.2 Properties of Entropic Mirror Map
The total variation of a (possibly non-probability) measure is defined as 
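We depart from the fundamental Gibbs Variational Principle, which dates back to the earliest work on statistical mechanics Gibbs (1902). For two probability measures $\mu, \mu'$, denote their relative entropy by $D_\Phi(\mu, \mu')$ (the reason for this notation will become clear in (14)):
$$D_\Phi(\mu, \mu') := \int d\mu \, \log \frac{d\mu}{d\mu'}.$$
Theorem 3 (Gibbs Variational Principle).
Let $h \in \mathcal{F}(\mathcal{Z})$ and $\mu' \in \mathcal{M}(\mathcal{Z})$ be a reference measure. Then
$$\log \int e^{h} \, d\mu' = \sup_{\mu \in \mathcal{M}(\mathcal{Z})} \; \langle \mu, h \rangle - D_\Phi(\mu, \mu'),$$
and equality is achieved by $d\mu^\star = \frac{e^{h} \, d\mu'}{\int e^{h} \, d\mu'}$.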
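The Gibbs Variational Principle is easy to check numerically in the finite setting, where measures are probability vectors and the relative entropy is the usual Kullback-Leibler divergence. The sketch below assumes this finite analogue: for a strictly positive reference vector `ref` and a function `h` on $d$ points, the value $\log \sum_i \mathrm{ref}_i e^{h_i}$ should upper-bound $\langle \mu, h\rangle - \mathrm{KL}(\mu \,\|\, \mathrm{ref})$ for every $\mu$, with equality at the Gibbs measure proportional to $\mathrm{ref}_i e^{h_i}$. The variable names and the random test instances are illustrative only.

```python
import numpy as np

def kl(mu, ref):
    """Relative entropy sum_i mu_i * log(mu_i / ref_i) for strictly positive vectors."""
    return float(np.sum(mu * np.log(mu / ref)))

rng = np.random.default_rng(1)
d = 6
ref = rng.dirichlet(np.ones(d))        # reference measure on d points
h = rng.normal(size=d)                 # arbitrary function on the d points

log_partition = float(np.log(np.sum(ref * np.exp(h))))
gibbs = ref * np.exp(h) / np.sum(ref * np.exp(h))       # claimed maximiser

value_at_gibbs = float(gibbs @ h) - kl(gibbs, ref)
values_elsewhere = [float(mu @ h) - kl(mu, ref)
                    for mu in rng.dirichlet(np.ones(d), size=200)]
```

Here `value_at_gibbs` coincides with `log_partition` up to rounding, while every entry of `values_elsewhere` stays below it.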
Part of the following theorem is folklore in the mathematics and learning community. However, to the best of our knowledge, the relation to the entropic MD has not been systematically studied before, as we now do.
For a probability measure , let be the negative Shannon entropy, and let . Then
is the Fenchel conjugate of :
The derivatives admit the expression
The Bregman divergence of is the relative entropy:
is 4-strongly convex with respect to the total variation norm: For all ,
The following duality relation holds: For any constant , we have
is -smooth with respect to :
Alternative to (17), we have the equivalent characterization of :
Similar to (16), we have
The following three-point identity holds for all :
Let the Mirror Descent iterate be defined as in (7). Then the following statements are equivalent:
There exists a constant such that .
In particular, for any we have
We prove a more general result on the Bregman divergence in (23) below.
Let , and . Let be small enough such that is absolutely continuous with respect to ; note that this is possible because , and . We compute
where (i) uses as . In short, for all ,