Learning generative models is becoming an increasingly important problem in machine learning and statistics, with a wide range of applications in self-driving cars (Santana & Hotz, 2016), robotics (Hirose et al., 2017), domain transfer (Sankaranarayanan et al., 2018), computational biology (Ghahramani et al., 2018), etc. Two modern approaches to this problem are Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational AutoEncoders (VAEs) (Makhzani et al., 2015; Rosca et al., 2017; Tolstikhin et al., 2017; Mescheder et al., 2017b).
VAEs compute a generative model by maximizing a variational lower bound on average sample likelihoods using an explicit probability distribution for the data. GANs, however, learn a generative model by minimizing a distance between the observed and generated distributions, without considering an explicit probability model for the data. Empirically, GANs have been shown to produce higher-quality generative samples than VAEs (Karras et al., 2017). However, since GANs do not consider an explicit probability model for the data, we are unable to compute sample likelihoods using their generative models. Computation of sample likelihoods and posterior distributions of latent variables is critical in several statistical inference problems. The inability to obtain such statistics within the GAN framework severely limits its use in these applications.
In this paper, we resolve these issues for a general formulation of GANs by providing a theoretically justified approach to compute sample likelihoods using GAN's generative model. Our results can open new directions for using GANs in massive-data applications such as model selection, sample selection, hypothesis testing, etc. (see more details in Section 5). Below, we state our main results informally, without going into technical conditions; precise statements of our results are presented in Section 2.
Let X and Y := G(Z) represent the observed (i.e. real) and generative (i.e. fake or synthetic) variables, respectively. Z (i.e. the latent variable) is the randomness used as the input to the generator G. Consider the following explicit probability model of the data given a latent sample z:

f(x | z) ∝ exp(−ℓ(x, G(z))/λ),   (1.1)

where ℓ is a loss function and λ > 0. Under this explicit probability model, we show that minimizing the objective of an optimal transport GAN (e.g. Wasserstein GAN, Arjovsky et al. (2017)) with the cost function ℓ and an entropy regularization (Cuturi, 2013; Seguy et al., 2017) maximizes a variational lower bound on average sample likelihoods. That is,

average sample log-likelihoods ≥ −(1/λ) × (entropic GAN objective) + constant.
If ℓ(x, y) = ‖x − y‖₂, the optimal transport (OT) GAN simplifies to WGAN (Arjovsky et al., 2017), while if ℓ(x, y) = ‖x − y‖₂²/2, the OT GAN simplifies to the quadratic GAN (or, W2GAN) (Feizi et al., 2017). The precise statement of this result can be found in Theorem 1. This result provides a statistical justification for GAN's optimization and puts it on par with VAEs, whose goal is to maximize a lower bound on sample likelihoods. We note that the entropy regularization has been proposed primarily to improve computational aspects of GANs (Cuturi, 2013). Our results provide an additional statistical justification for this regularization term. Moreover, using GAN's training, we obtain a coupling between the observed variable X and the latent variable Z. This coupling provides the conditional distribution of the latent variable Z given an observed sample x. The explicit model of equation 1.1 acts similarly to the decoder in the VAE framework, while the coupling computed using GANs acts as an encoder.
Connections between GANs and VAEs have been investigated in several recent works as well (Hu et al., 2018; Mescheder et al., 2017a). In Hu et al. (2018), GANs are interpreted as methods performing variational inference on a generative model in the label space. In their framework, observed data samples are treated as latent variables, while the generative variable is the indicator of whether data is real or fake. The method in Mescheder et al. (2017a), on the other hand, uses an auxiliary discriminator network to rephrase the maximum-likelihood objective of a VAE as a two-player game similar to the objective of a GAN. Our method is different from both these approaches, as we consider an explicit probability model for the data and show that the entropic GAN objective maximizes a variational lower bound under this probability model, thus allowing sample likelihood computation in GANs similar to VAEs.
Another key question that we address here is how to estimate the likelihood of a new sample given the generative model trained using GANs. For instance, if we train a GAN on stop-sign images, upon receiving a new image, one may wish to compute the likelihood of the new sample according to the trained generative model. In standard GAN formulations, the support of the generative distribution lies on the range of the optimal generator function. Thus, if the observed sample does not lie on that range (which is very likely in practice), there is no way to assign a sensible likelihood score to that sample. Below, we show that using the explicit probability model of equation 1.1, we can lower-bound the likelihood of this sample x. This is similar to the variational lower bound on sample likelihoods used in VAEs. Our numerical results show that this lower bound reflects the expected trends of the true sample likelihoods well.
Let G* be the optimal generator function and P*_{Z,X} be the optimal coupling between real and latent variables. The optimal coupling can be computed efficiently for entropic GANs, as we explain in Section 3. For other GAN architectures, one may approximate such optimal couplings as we explain in Section 4. Let x be a new test sample. We can lower-bound the log-likelihood of this sample as

log f(x) ≳ E_{z∼P*_{Z|X=x}}[−ℓ(x, G*(z))/λ + log f_Z(z)] + H(P*_{Z|X=x}),

up to additive constants.
We present the precise statement of this result in Corollary 2. This result combines three components in order to approximate the likelihood of a sample given a trained generative model:
The distance between x and the generative model. If this distance is large, the likelihood of observing x from the generative model is small.
The entropy of the coupled latent variable. If the entropy term is large, it means that the coupled latent variable has a large randomness. This contributes positively to the sample likelihood.
The likelihood of the coupled latent variable. If latent samples have large likelihoods, the likelihood of the observed test sample will be large as well.
Figure 1(a) provides a pictorial illustration of these components.
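As a concrete sketch, the three components above can be combined into a surrogate log-likelihood estimate given samples from the coupled latent distribution. The function below is illustrative, not the authors' Algorithm 1: the generator, loss, and latent log-density are user-supplied, and the latent entropy is approximated crudely with a Gaussian fit to the samples.

```python
import numpy as np

def surrogate_log_likelihood(x, z_samples, generator, loss, latent_logpdf, lam):
    """Sketch of the variational lower bound on log f(x), combining:
      (1) -loss(x, G(z)) / lam : distance of x to the generative model,
      (2) entropy of the coupled latent samples (Gaussian approximation),
      (3) average log-likelihood of the coupled latent samples.
    All argument names are illustrative assumptions."""
    gens = [generator(z) for z in z_samples]
    distance_term = -np.mean([loss(x, g) for g in gens]) / lam
    latent_term = np.mean([latent_logpdf(z) for z in z_samples])
    # Crude entropy estimate: differential entropy of a Gaussian fit.
    cov = np.cov(np.array(z_samples).T) + 1e-6 * np.eye(len(z_samples[0]))
    m = cov.shape[0]
    entropy_term = 0.5 * (m * (1.0 + np.log(2 * np.pi)) + np.linalg.slogdet(cov)[1])
    return distance_term + entropy_term + latent_term
```

A sample far from the range of the generator receives a large distance penalty and hence a low surrogate score, matching the intuition of component (1) above.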
2 Main Results
Let X ∈ ℝᵈ denote the observed (real) variable. GAN's goal is to find a generator function G : ℝʳ → ℝᵈ such that Y := G(Z) has a similar distribution to X. Let Z be an r-dimensional random variable with a fixed probability density function f_Z, which we assume here to be the density of a normal distribution. In practice, we observe n samples {x₁, …, xₙ} from X and generate m samples from Y, i.e., {y₁, …, yₘ} where yⱼ = G(zⱼ) for 1 ≤ j ≤ m. We represent these empirical distributions by P̂_X and P̂_Y, respectively. Note that the number of generative samples m can be arbitrarily large. Finally, we assume that the generator function G is injective, which is often the case in practice since G maps from an r-dimensional space to a d-dimensional one where r < d.
GAN computes the optimal generator G* by minimizing a distance between the observed distribution P̂_X and the generative one P̂_Y. Common distance measures include optimal transport measures (e.g. Wasserstein GAN (Arjovsky et al., 2017), WGAN+Gradient Penalty (Gulrajani et al., 2017), GAN+Spectral Normalization (Miyato et al., 2018), WGAN+Truncated Gradient Penalty (Petzka et al., 2017), relaxed WGAN (Guo et al., 2017)) and divergence measures (e.g. the original GAN formulation (Goodfellow et al., 2014), f-GAN (Nowozin et al., 2016)).
The optimal transport distance between P̂_X and P̂_Y with loss ℓ is defined as

W(P̂_X, P̂_Y) := min_{P_{X,Y}} E[ℓ(X, Y)],   (2.1)

where P_{X,Y} is a joint distribution whose marginal distributions are equal to P̂_X and P̂_Y, respectively. If ℓ(x, y) = ‖x − y‖₂, this distance is called the first-order Wasserstein distance and is referred to by W₁, while if ℓ(x, y) = ‖x − y‖₂²/2, this measure is referred to by W₂², where W₂ is the second-order Wasserstein distance (Villani, 2008).
The OT GAN optimization is then

min_{G∈𝒢} W(P̂_X, P̂_Y),   (2.2)

where 𝒢 is the set of generator functions. Examples of the OT GAN are WGAN (Arjovsky et al., 2017), corresponding to the first-order Wasserstein distance, and the quadratic GAN (or, the W2GAN) (Feizi et al., 2017), corresponding to the second-order Wasserstein distance.
Note that optimization 2.2 is a min-min optimization. The objective of this optimization is not smooth in G, and it is often computationally expensive to obtain a solution (Sanjabi et al., 2018). One approach to improve the computational aspects of this optimization is to add a regularization term that makes its objective strictly convex (Cuturi, 2013; Seguy et al., 2017). A common strictly-convex regularization term is the negative Shannon entropy −H(P_{X,Y}), where H(P) := −E_P[log P]. This leads to the following optimal transport GAN formulation with entropy regularization, or for simplicity, the entropic GAN formulation:

min_{G∈𝒢} min_{P_{X,Y}} E[ℓ(X, Y)] − λ H(P_{X,Y}),   (2.3)

where the inner minimum is over joint distributions P_{X,Y} with marginals P̂_X and P̂_Y, and λ > 0 is the regularization parameter.
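The inner, entropy-regularized transport problem of equation 2.3 can be solved with Sinkhorn iterations (Cuturi, 2013). The sketch below assumes uniform empirical marginals and works directly in the primal scaling form; production code would typically work in log-space for small λ to avoid numerical underflow.

```python
import numpy as np

def entropic_ot(cost, lam, n_iters=200):
    """Sinkhorn iterations for min_P <P, cost> - lam * H(P)
    over couplings P with uniform marginals (Cuturi, 2013).
    Returns the entropic coupling and its transport cost."""
    n, m = cost.shape
    K = np.exp(-cost / lam)              # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):             # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]      # optimal entropic coupling
    return P, float(np.sum(P * cost))
```

Each iteration rescales the kernel so that one marginal constraint holds exactly; at convergence both marginals match, and the coupling concentrates on low-cost pairs as λ shrinks.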
There are two approaches to solve the optimization problem 2.3. The first approach uses an iterative method to solve the min-min formulation (Genevay et al., 2017). Another approach is to solve an equivalent min-max formulation by writing the dual of the inner minimization (Seguy et al., 2017; Sanjabi et al., 2018). The latter is often referred to as a GAN formulation, since the min-max optimization is over a set of generator functions and a set of discriminator functions. The details of this approach are further explained in Section 3.
In the following, we present an explicit probability model for entropic GANs under which their objective can be viewed as maximizing a lower bound on average sample likelihoods.
Theorem 1. Let the loss function ℓ be shift invariant, i.e., ℓ(x, y) = g(x − y) for some function g. Let

f(x | z) = (1/C) exp(−ℓ(x, G(z))/λ)   (2.4)

be an explicit probability model for X given z, for a well-defined normalization constant C. Then, we have

average sample log-likelihoods ≥ −(1/λ) × (entropic GAN objective) + constant.

In words, the entropic GAN maximizes a lower bound on sample likelihoods according to the explicit probability model of equation 2.4.
The proof of this theorem is presented in Section A. This result has a similar flavor to that of VAEs (Makhzani et al., 2015; Rosca et al., 2017; Tolstikhin et al., 2017; Mescheder et al., 2017b) where a generative model is computed by maximizing a lower bound on sample likelihoods.
Having a shift-invariant loss function is critical for Theorem 1, as this makes the normalization term C independent of z and G (to see this, one can use the change of variables x̃ = x − G(z) in equation 2.6). The most standard OT GAN loss functions, such as the ℓ₂-norm loss ℓ(x, y) = ‖x − y‖₂ for WGAN (Arjovsky et al., 2017) and the quadratic loss ℓ(x, y) = ‖x − y‖₂²/2 for W2GAN (Feizi et al., 2017), satisfy this property.
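For the quadratic loss, the explicit model of equation 2.4 becomes an isotropic Gaussian centered at G(z) with variance λ, so its log-density has a closed form. A minimal sketch (the function name and signature are illustrative):

```python
import numpy as np

def log_model_density(x, g_z, lam):
    """log f(x | z) for the model (1/C) * exp(-l(x, G(z)) / lam) with
    the quadratic loss l(x, y) = ||x - y||^2 / 2.  This is exactly the
    Gaussian N(G(z), lam * I), whose normalization is C = (2*pi*lam)^(d/2)."""
    d = x.shape[0]
    return -np.sum((x - g_z) ** 2) / (2.0 * lam) - 0.5 * d * np.log(2 * np.pi * lam)
```

This makes concrete why the normalization C is independent of z for shift-invariant losses: it depends only on the dimension d and the regularization parameter λ.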
One can further simplify this result by considering specific loss functions. For example, we have the following result for the entropic GAN with the quadratic loss function.
Let G* and P*_{X,Z} be optimal solutions of the entropic GAN optimization 2.3 (note that the optimal coupling can be computed efficiently for the entropic GAN using equation 3.7). Let x be a newly observed sample. An important question is what the likelihood of this sample is given the trained generative model. Using the explicit probability model of equation 2.4 and the result of Theorem 1, we can (approximately) compute sample likelihoods using the trained generative model. We explain this result in the following corollary.
3 GAN’s Dual Formulation
In this section, we discuss dual formulations for OT GAN (equation 2.2) and entropic GAN (equation 2.3) optimizations. These dual formulations are min-max optimizations over two function classes, namely the generator and the discriminator. Often local search methods such as alternating gradient descent (GD) are used to compute a solution for these min-max optimizations.
First, we discuss the dual formulation of the OT GAN optimization 2.2. Using the duality of the inner minimization, which is a linear program, we can re-write optimization 2.2 as follows (Villani, 2008):

min_{G∈𝒢} max_{D₁,D₂} E[D₁(X)] − E[D₂(Y)],   (3.1)

where D₁(x) − D₂(y) ≤ ℓ(x, y) for all (x, y). The maximization is over two sets of functions, D₁ and D₂, which are coupled using the loss function. Using the Kantorovich duality (Villani, 2008), we can further simplify this optimization as follows:

min_{G∈𝒢} max_{D} E[D(X)] − E[Dˡ(Y)],   (3.2)
where Dˡ is the ℓ-conjugate function of D and D is restricted to ℓ-convex functions (Villani, 2008). The above optimization provides a general formulation for OT GANs. If the loss function is ℓ(x, y) = ‖x − y‖₂, then the optimal transport distance is referred to as the first-order Wasserstein distance. In this case, the min-max optimization 3.2 simplifies to the following optimization (Arjovsky et al., 2017):

min_{G∈𝒢} max_{D: 1-Lipschitz} E[D(X)] − E[D(Y)].   (3.3)
This is often referred to as Wasserstein GAN, or simply WGAN (Arjovsky et al., 2017). If the loss function is quadratic, then the OT optimization is referred to as the quadratic GAN (or, W2GAN) (Feizi et al., 2017).
Similarly, the dual formulation of the entropic GAN equation 2.3 can be written as the following optimization (Cuturi, 2013; Seguy et al., 2017)¹:

min_{G∈𝒢} max_{D₁,D₂} E[D₁(X)] + E[D₂(Y)] − λ E_{P̂_X ⊗ P̂_Y}[exp((D₁(X) + D₂(Y) − ℓ(X, Y))/λ)].   (3.4)

¹Note that optimization 3.4 is the dual of optimization 2.3 after certain entropy terms have been added to its objective. Since we have assumed that G is injective, these terms are constants and thus can be ignored from the optimization objective without loss of generality.
Note that the hard constraint of optimization 3.1 is replaced by a soft constraint in optimization 3.4. In this case, optimal primal variables can be computed according to the following lemma (Seguy et al., 2017):
Lemma 1. Let D₁* and D₂* be the optimal discriminator functions for a given generator function G according to optimization 3.4. Let

P*(x, y) := P̂_X(x) P̂_Y(y) exp((D₁*(x) + D₂*(y) − ℓ(x, y))/λ).   (3.7)

Then P* is the optimal coupling of the inner minimization of optimization 2.3.
This lemma is important for our results, since it provides an efficient way to compute the optimal coupling between real and generative variables (i.e. P*) using the optimal generator (G*) and discriminators (D₁* and D₂*) of optimization 3.4. It is worth noting that without the entropy regularization term, computing the optimal coupling in OT GAN using the optimal generator and discriminator functions is not straightforward in general (except in some special cases, such as that of the W2GAN (Villani, 2008; Feizi et al., 2017)). This is an additional computational benefit of using the entropic GAN. We use the algorithm presented in Sanjabi et al. (2018) to solve optimization 3.4.
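On discrete samples, the recovery of the primal coupling from dual potentials takes a one-line form. The sketch below mirrors the structure of equation 3.7 for finite supports with weight vectors a and b; it is exact only at the dual optimum, and for approximately optimal potentials it yields only an approximate coupling.

```python
import numpy as np

def coupling_from_duals(cost, u, v, lam, a, b):
    """Recover the entropic-OT primal coupling from dual potentials
    u (over real samples) and v (over generated samples):
        P_ij = a_i * b_j * exp((u_i + v_j - cost_ij) / lam).
    Variable names are illustrative; a, b are marginal weights."""
    return a[:, None] * b[None, :] * np.exp((u[:, None] + v[None, :] - cost) / lam)
```

As a sanity check, for a constant cost matrix the optimal potentials are constant and the recovered coupling is the independent coupling a ⊗ b.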
4 Experimental Results
In this section, we supplement our theoretical results with experimental validations. One of the main objectives of our work is to provide a framework to compute sample likelihoods in GANs. Such likelihood statistics can then be used in several statistical inference applications that we discuss in Section 5. With a trained entropic WGAN, the likelihood of a test sample can be lower-bounded using Corollary 2. As shown in Lemma 1, WGAN with entropy regularization provides a closed-form solution for the conditional density of the latent variable. From equation 3.7, we have

P*(y | x) ∝ P̂_Y(y) exp((D₂*(y) − ℓ(x, y))/λ).

By a change of variables (and under the assumption that the generator is injective), we have

P*(z | x) ∝ f_Z(z) exp((D₂*(G(z)) − ℓ(x, G(z)))/λ).

In order to compute our proposed surrogate likelihood of Corollary 2, we need to draw samples from the distribution P*(z | x). One approach is to use a Markov chain Monte Carlo (MCMC) method to sample from this distribution. In our experiments, however, we found that MCMC demonstrates poor performance owing to the high-dimensional nature of z. A similar issue with MCMC has been reported for VAEs in Kingma & Welling (2013). Thus, we use a different estimator to compute the likelihood surrogate, one which provides a better exploration of the latent space. We present our sampling procedure in Alg. 1.
4.1 Likelihood Evolution in GAN’s Training
In the experiments of this section, we study how sample likelihoods vary during GAN's training. An entropic WGAN is first trained on the MNIST dataset. Then, we randomly choose samples from the MNIST test set to compute surrogate likelihoods using Algorithm 1 at different training iterations. We expect sample likelihoods to increase over training iterations as the quality of the generative model improves. A proper surrogate likelihood function should capture this trend.
Fig. 1(a) demonstrates the evolution of sample likelihood distributions at different training iterations of the entropic WGAN. At the start of training, surrogate likelihood values are very low, as GAN's generated images are merely random noise. The likelihood distribution shifts towards higher values during training and saturates beyond a point. Details of this experiment are presented in Appendix D.
4.2 Likelihood Comparison Across Different Datasets
In this section, we perform experiments across different datasets. An entropic WGAN is first trained on a subset of samples from the MNIST dataset containing digit 1 (which we call the MNIST-1 dataset). With this trained model, likelihood estimates are computed for (1) samples from the entire MNIST dataset, and (2) samples from the Street View House Numbers (SVHN) dataset (Netzer et al., 2011) (Fig. 1(b)). In each experiment, the likelihood estimates are computed for test samples. We note that the highest likelihood estimates are obtained for samples from the MNIST-1 dataset, the same dataset on which the GAN was trained. The likelihood distribution for the MNIST dataset is bimodal, with one mode peaking in line with the MNIST-1 mode. Samples from this mode correspond to digit 1 in the MNIST dataset. The other mode, which is the dominant one, contains the rest of the digits and has relatively low likelihood estimates. The SVHN dataset, on the other hand, has much smaller likelihoods, as its distribution is significantly different from that of MNIST. Furthermore, we observe that the likelihood distribution of SVHN samples has a large spread (variance). This is because samples of the SVHN dataset are more diverse, with varying backgrounds and styles, than samples from MNIST. We note that SVHN samples with high likelihood estimates correspond to images that are similar to MNIST digits, while samples with low scores are different from MNIST samples. Details of this experiment are presented in Appendix D.
4.3 Approximate Likelihood Computation in Un-regularized GANs
Most standard GAN architectures do not have the entropy regularization. The likelihood lower bounds of Theorem 1 and Corollary 2 hold even for those GANs, as long as we obtain the optimal coupling in addition to the optimal generator from GAN's training. Computation of the optimal coupling from the dual formulation of the OT GAN can be done when the loss function is quadratic (Feizi et al., 2017). In this case, the gradient of the optimal discriminator provides the optimal coupling between X and Y (Villani, 2008) (see Lemma 2 in Appendix B).
For a general GAN architecture, however, the exact computation of the optimal coupling may be difficult. One sensible approximation is to couple x with a single latent sample z (i.e., we approximate the conditional distribution P(z | x) by an impulse function). To compute the z corresponding to a given x, we draw k latent samples {z₁, …, z_k} from f_Z and select the zᵢ whose G(zᵢ) is closest to x. This heuristic takes into account both the likelihood of the latent variable and the distance between x and the model (similarly to equation 3.7). We can then use Corollary 2 to approximate sample likelihoods for various GAN architectures.
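This nearest-latent heuristic can be sketched in a few lines. The function below is a minimal illustration with assumed names and defaults: it draws latent samples from a standard normal prior and keeps the one whose generated output is closest to x under the supplied loss, i.e., it approximates P(z | x) by a point mass.

```python
import numpy as np

def approximate_latent(x, generator, loss, n_samples=1000, latent_dim=2, seed=0):
    """Heuristic coupling for GANs without entropy regularization:
    sample latent candidates from the prior and keep the best one.
    Sampling from the prior accounts for latent likelihood; the argmin
    accounts for the distance between x and the model."""
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((n_samples, latent_dim))
    losses = np.array([loss(x, generator(z)) for z in zs])
    return zs[np.argmin(losses)]
```

With a trained generator, the returned z can be plugged into the surrogate likelihood bound in place of samples from the exact conditional.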
We use this approach to compute likelihood estimates for the CIFAR-10 (Krizhevsky, 2009) and LSUN-Bedrooms (Yu et al., 2015) datasets. For CIFAR-10, we train a DCGAN, while for LSUN, we train a WGAN (details of these experiments can be found in Appendix D).
Fig. 2(a) demonstrates sample likelihood estimates of different datasets using a GAN trained on CIFAR-10. Likelihoods assigned to samples from the MNIST and Office datasets are lower than those of the CIFAR dataset. Samples from the Office dataset, however, are assigned higher likelihood values than MNIST samples. We note that the Office dataset is indeed more similar to the CIFAR dataset than MNIST is. A similar experiment has been repeated for the LSUN-Bedrooms (Yu et al., 2015) dataset. We observe similar performance trends in this experiment (Fig. 2(b)).
In this paper, we have provided a statistical framework for a family of GANs. Our main result shows that the entropic GAN optimization can be viewed as the maximization of a variational lower bound on average log-likelihoods, an approach that VAEs are based upon. This result makes a connection between two of the most popular generative models, namely GANs and VAEs. More importantly, our result constructs an explicit probability model for GANs that can be used to compute a lower bound on sample likelihoods. Our experimental results on various datasets demonstrate that this likelihood surrogate can be a good approximation of the true likelihood function. Although in this paper we mainly focus on understanding the behavior of the sample likelihood surrogate on different datasets, the proposed statistical framework of GANs can be used in various statistical inference applications. For example, our proposed likelihood surrogate can be used as a quantitative measure to evaluate the performance of different GAN architectures, to quantify domain shifts, to select a proper generator class by balancing bias against variance, to detect outlier samples, and in statistical tests such as hypothesis testing. We leave exploring these directions for future work.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300, 2013.
- Feizi et al. (2017) Soheil Feizi, Changho Suh, Fei Xia, and David Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.
- Genevay et al. (2017) Aude Genevay, Gabriel Peyré, and Marco Cuturi. Sinkhorn-autodiff: Tractable wasserstein learning of generative models. arXiv preprint arXiv:1706.00292, 2017.
- Ghahramani et al. (2018) Arsham Ghahramani, Fiona M Watt, and Nicholas M Luscombe. Generative adversarial networks uncover epidermal regulators and predict single cell perturbations. bioRxiv, pp. 262501, 2018.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
- Guo et al. (2017) Xin Guo, Johnny Hong, Tianyi Lin, and Nan Yang. Relaxed wasserstein with applications to GANs. arXiv preprint arXiv:1705.07164, 2017.
- Hirose et al. (2017) Noriaki Hirose, Amir Sadeghian, Patrick Goebel, and Silvio Savarese. To go or not to go? a near unsupervised learning approach for robot navigation. arXiv preprint arXiv:1709.05439, 2017.
- Hu et al. (2018) Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. On unifying deep generative models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rylSzl-R-.
- Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
- Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
- Lee & Tsao (2018) Hung-yi Lee and Yu Tsao. Generative adversarial network and its applications to speech signal and natural language processing. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
- Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
- Mescheder et al. (2017a) L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research. PMLR, August 2017a.
- Mescheder et al. (2017b) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017b.
- Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, pp. 5, 2011.
- Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
- Petzka et al. (2017) Henning Petzka, Asja Fischer, and Denis Lukovnicov. On the regularization of Wasserstein GANs. arXiv preprint arXiv:1709.08894, 2017.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Rosca et al. (2017) Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
- Sanjabi et al. (2018) Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. Solving approximate Wasserstein GANs to stationarity. arXiv preprint arXiv:1802.08249, 2018.
- Sankaranarayanan et al. (2018) Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Santana & Hotz (2016) Eder Santana and George Hotz. Learning a driving simulator. arXiv preprint arXiv:1608.01230, 2016.
- Seguy et al. (2017) Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.
- Tolstikhin et al. (2017) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
- Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Yu et al. (2015) Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
Appendix A Proof of Theorem 1
Using Bayes' rule, one can compute the log-likelihood of an observed sample x as follows:

log f(x) = log f(x | z) + log f_Z(z) − log f(z | x)
         = −ℓ(x, G(z))/λ − log C + log f_Z(z) − log f(z | x),   (A.1)

where the second step follows from equation 2.4.
Consider a joint density function P_{X,Z} such that its marginal distributions match P̂_X and f_Z. Note that equation A.1 is true for every z. Thus, we can take the expectation of both sides with respect to the conditional distribution P_{Z|X=x}. This leads to the following equation:

log f(x) = E_{P_{Z|X=x}}[−ℓ(x, G(z))/λ + log f_Z(z)] + H(P_{Z|X=x}) + KL(P_{Z|X=x} ‖ f_{Z|X=x}) − log C,

where H is the Shannon-entropy function and KL is the Kullback–Leibler divergence.
Next, we take the expectation of both sides with respect to P̂_X:

(1/n) Σᵢ log f(xᵢ) = −(1/λ) E_{P_{X,Z}}[ℓ(X, G(Z))] + E_{P_{X,Z}}[log f_Z(Z)] + E_{P̂_X}[H(P_{Z|X})] + E_{P̂_X}[KL(P_{Z|X} ‖ f_{Z|X})] − log C.

Here, we replaced the expectation over f_Z with the expectation over the empirical latent distribution, since one can generate an arbitrarily large number of samples from the generator. Since the KL divergence is always non-negative, we have

(1/n) Σᵢ log f(xᵢ) ≥ −(1/λ) (E_{P_{X,Z}}[ℓ(X, G(Z))] − λ H(P_{X,Z})) + constant.
This inequality is true for every P_{X,Z} satisfying the marginal conditions. Thus, similar to VAEs, we can pick P_{X,Z} to maximize the lower bound on average sample log-likelihoods. This leads to the entropic GAN optimization 2.3.
Appendix B Optimal Coupling for W2GAN
Lemma 2. Let P̂_X be absolutely continuous with support contained in a convex set in ℝᵈ. Let D* be the optimal discriminator for a given generator G in the W2GAN. This solution is unique. Moreover, we have

X − ∇D*(X) ≅ G(Z),

where ≅ means matching distributions.
Appendix C Sinkhorn Loss
In practice, it has been observed that a slightly modified version of the entropic GAN demonstrates improved computational properties (Genevay et al., 2017; Sanjabi et al., 2018). We explain this modification in this section. Let

W̄_λ(P̂_X, P̂_Y) := min_{P_{X,Y}} E[ℓ(X, Y)] + λ KL(P_{X,Y} ‖ P̂_X ⊗ P̂_Y),

where the minimum is over joint distributions P_{X,Y} with marginals P̂_X and P̂_Y, and KL is the Kullback–Leibler divergence. Note that the objective of this optimization differs from that of the entropic GAN optimization 2.3 by a constant term. The Sinkhorn distance function is then defined as (Genevay et al., 2017):

S_λ(P̂_X, P̂_Y) := 2 W̄_λ(P̂_X, P̂_Y) − W̄_λ(P̂_X, P̂_X) − W̄_λ(P̂_Y, P̂_Y).
S_λ is called the Sinkhorn loss function. Genevay et al. (2017) have shown that as λ → 0, S_λ approaches the unregularized optimal transport distance W. For a general λ, the entropic OT objective W̄_λ can be bounded above and below in terms of S_λ and the autocorrelation terms W̄_λ(P̂_X, P̂_X) and W̄_λ(P̂_Y, P̂_Y). Since W̄_λ(P̂_X, P̂_X) is constant in our setup, optimizing the GAN with the Sinkhorn loss is equivalent to optimizing the entropic GAN. So, our likelihood estimation framework can be used with models trained using the Sinkhorn loss as well. This is particularly important from a practical standpoint, as training models with the Sinkhorn loss tends to be more stable in practice.
Appendix D Training Entropic GANs
In this section, we discuss how WGANs with entropic regularization are trained. As discussed in Section 3, the dual of the entropic GAN formulation can be written as the min-max optimization 3.4. We can optimize this min-max problem using alternating optimization. A better approach is to take into account the smoothness introduced in the problem by the entropic regularizer, and solve the generator problem to stationarity using first-order methods. Please refer to Sanjabi et al. (2018) for more details. In all our experiments, we use Algorithm 1 of Sanjabi et al. (2018) to train our GAN models.
D.1 GAN's Training on MNIST
The MNIST dataset contains grayscale images of handwritten digits. As a pre-processing step, all images were rescaled to a fixed range. The discriminator and generator architectures used in our experiments are given in Tables 1 and 2. Note that the dual formulation of GANs employs two discriminators, D₁ and D₂, and we use the same architecture for both. The hyperparameter details are given in Table 3. Some sample generations are shown in Fig. 4.
Table 1 (generator): three Deconv2d layers (kernel sizes and strides not recovered), with an intermediate step that removes a border row and column. Table 2 (discriminator): three Conv2d layers (kernel sizes and strides not recovered). Table 3 (hyperparameters): generator and discriminator learning rates (values not recovered); 5 critic iterations per generator iteration; 10000 training iterations.
D.2 GAN's Training on CIFAR
D.3 GAN's Training on LSUN-Bedrooms Dataset
We trained a WGAN model on the LSUN-Bedrooms dataset with DCGAN architectures for the generator and discriminator networks (Arjovsky et al., 2017). The hyperparameter details are given in Table 5, and some sample generations are provided in Fig. 7.
Table 4 (CIFAR hyperparameters): generator and discriminator learning rates and the number of training epochs (values not recovered). Table 5 (LSUN-Bedrooms hyperparameters): generator and discriminator learning rates (values not recovered); 5 critic iterations per generator iteration; 70000 training iterations.