1 Introduction
The field of representation learning was initially driven by supervised approaches, with impressive results using large labelled datasets. Unsupervised generative modeling, in contrast, used to be a domain governed by probabilistic approaches focusing on low-dimensional data. Recent years have seen a convergence of those two approaches. In the new field that formed at the intersection, variational autoencoders (VAEs) [1] constitute one well-established approach, theoretically elegant yet with the drawback that they tend to generate blurry samples when applied to natural images. In contrast, generative adversarial networks (GANs) [3] turned out to be more impressive in terms of the visual quality of images sampled from the model, but come without an encoder, have been reported to be harder to train, and suffer from the “mode collapse” problem where the resulting model is unable to capture all the variability in the true data distribution. There has been a flurry of activity in assaying numerous configurations of GANs as well as combinations of VAEs and GANs. A unifying framework combining the best of GANs and VAEs in a principled way is yet to be discovered.
This work builds on the theoretical analysis presented in [4]. Following [5, 4], we approach generative modeling from the optimal transport (OT) point of view. The OT cost [6] is a way to measure a distance between probability distributions and provides a much weaker topology than many others, including $f$-divergences associated with the original GAN algorithms [7]. This is particularly important in applications, where data is usually supported on low-dimensional manifolds in the input space $\mathcal{X}$. As a result, stronger notions of distances (such as $f$-divergences, which capture the density ratio between distributions) often max out, providing no useful gradients for training. In contrast, OT was claimed to have a nicer behaviour [5, 8], although it requires, in its GAN-like implementation, the addition of a constraint or a regularization term into the objective.

In this work we aim at minimizing OT $W_c(P_X, P_G)$ between the true (but unknown) data distribution $P_X$ and a latent variable model $P_G$ specified by the prior distribution $P_Z$ of latent codes $Z \in \mathcal{Z}$ and the generative model $P_G(X|Z)$ of the data points $X \in \mathcal{X}$ given $Z$. Our main contributions are listed below (cf. also Figure 1):

A new family of regularized autoencoders (Algorithms 1, 2 and Eq. 4), which we call Wasserstein Auto-Encoders (WAE), that minimize the optimal transport $W_c(P_X, P_G)$ for any cost function $c$. Similarly to VAE, the objective of WAE is composed of two terms: the $c$-reconstruction cost and a regularizer $\mathcal{D}_Z(P_Z, Q_Z)$ penalizing a discrepancy between two distributions in $\mathcal{Z}$: $P_Z$ and a distribution of encoded data points, i.e. $Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]$. When $c$ is the squared cost and $\mathcal{D}_Z$ is the GAN objective, WAE coincides with adversarial autoencoders of [2].

Empirical evaluation of WAE on MNIST and CelebA datasets with squared cost $c(x, y) = \|x - y\|_2^2$. Our experiments show that WAE keeps the good properties of VAEs (stable training, encoder-decoder architecture, and a nice latent manifold structure) while generating samples of better quality, approaching those of GANs.

We propose and examine two different regularizers $\mathcal{D}_Z(P_Z, Q_Z)$. One is based on GANs and adversarial training in the latent space $\mathcal{Z}$. The other uses the maximum mean discrepancy, which is known to perform well when matching high-dimensional standard normal distributions $P_Z$ [9]. Importantly, the second option leads to a fully adversary-free min-min optimization problem.
Finally, the theoretical considerations presented in [4] and used here to derive the WAE objective might be interesting in their own right. In particular, Theorem 1 shows that in the case of generative models, the primal form of $W_c(P_X, P_G)$ is equivalent to a problem involving the optimization of a probabilistic encoder $Q(Z|X)$.
The paper is structured as follows. In Section 2 we review a novel auto-encoder formulation for OT between $P_X$ and the latent variable model $P_G$ derived in [4]. Relaxing the resulting constrained optimization problem we arrive at an objective of Wasserstein auto-encoders. We propose two different regularizers, leading to WAE-GAN and WAE-MMD algorithms. Section 3 discusses the related work. We present the experimental results in Section 4 and conclude by pointing out some promising directions for future work.
2 Proposed method
Our new method minimizes the optimal transport cost $W_c(P_X, P_G)$ based on the novel auto-encoder formulation (see Theorem 1 below). In the resulting optimization problem the decoder tries to accurately reconstruct the encoded training examples as measured by the cost function $c$. The encoder tries to simultaneously achieve two conflicting goals: it tries to match the encoded distribution of training examples $Q_Z := \mathbb{E}_{P_X}[Q(Z|X)]$ to the prior $P_Z$ as measured by any specified divergence $\mathcal{D}_Z(Q_Z, P_Z)$, while making sure that the latent codes provided to the decoder are informative enough to reconstruct the encoded training examples. This is schematically depicted in Fig. 1.
2.1 Preliminaries and notations
We use calligraphic letters (e.g. $\mathcal{X}$) for sets, capital letters (e.g. $X$) for random variables, and lower case letters (e.g. $x$) for their values. We denote probability distributions with capital letters (e.g. $P(X)$) and corresponding densities with lower case letters (e.g. $p(x)$). In this work we will consider several measures of discrepancy between probability distributions $P_X$ and $P_G$. The class of $f$-divergences [10] is defined by $D_f(P_X \| P_G) := \int f\big(\frac{p_X(x)}{p_G(x)}\big) p_G(x)\, dx$, where $f: (0, \infty) \to \mathbb{R}$ is any convex function satisfying $f(1) = 0$. Classical examples include the Kullback-Leibler $D_{KL}$ and Jensen-Shannon $D_{JS}$ divergences.
2.2 Optimal transport and its dual formulations
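A rich class of divergences between probability distributions is induced by the optimal transport (OT) problem [6]. Kantorovich’s formulation of the problem is given by

$$W_c(P_X, P_G) := \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_G)} \mathbb{E}_{(X, Y) \sim \Gamma}[c(X, Y)], \tag{1}$$

where $c(x, y): \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ is any measurable cost function and $\mathcal{P}(X \sim P_X, Y \sim P_G)$ is the set of all joint distributions of $(X, Y)$ with marginals $P_X$ and $P_G$ respectively. A particularly interesting case is when $(\mathcal{X}, d)$ is a metric space and $c(x, y) = d^p(x, y)$ for $p \geq 1$. In this case $W_p$, the $p$-th root of $W_c$, is called the $p$-Wasserstein distance.

When $c(x, y) = d(x, y)$ the following Kantorovich-Rubinstein duality holds:

$$W_1(P_X, P_G) = \sup_{f \in \mathcal{F}_L} \mathbb{E}_{X \sim P_X}[f(X)] - \mathbb{E}_{Y \sim P_G}[f(Y)], \tag{2}$$

where $\mathcal{F}_L$ is the class of all bounded 1-Lipschitz functions on $(\mathcal{X}, d)$. (Note that the same symbol is used for $W_p$ and $W_c$, but only $p$ is a number and thus $W_1$ above refers to the 1-Wasserstein distance.)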
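As a small illustration of the primal problem (1), the following sketch computes the OT cost between two empirical distributions on the real line. It relies on a standard fact that is not stated in the paper: in one dimension, for costs of the form $|x - y|^p$ with $p \geq 1$, an optimal coupling pairs sorted samples. The function names and the toy data are illustrative assumptions.

```python
def empirical_ot_1d(xs, ys, p=1):
    """OT cost between two equal-size empirical measures on the real line
    for the cost c(x, y) = |x - y|**p with p >= 1.

    In one dimension with a convex cost of this form, an optimal coupling
    matches the i-th smallest x with the i-th smallest y, so no explicit
    search over couplings is needed.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) ** p for x, y in pairs) / len(xs)


def wasserstein_p_1d(xs, ys, p=1):
    """p-Wasserstein distance: the p-th root of the OT cost above."""
    return empirical_ot_1d(xs, ys, p) ** (1.0 / p)


xs = [0.0, 1.0, 2.0]
ys = [2.5, 0.5, 1.5]  # xs shifted by 0.5 and listed in a different order
print(empirical_ot_1d(xs, ys, p=1))   # 0.5
print(wasserstein_p_1d(xs, ys, p=2))  # 0.5
```

In higher dimensions no such sorting shortcut exists, which is one reason the paper works with the encoder-based reformulation rather than solving (1) directly.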
2.3 Application to generative models: Wasserstein autoencoders
One way to look at modern generative models like VAEs and GANs is to postulate that they are trying to minimize certain discrepancy measures between the data distribution $P_X$ and the model $P_G$. Unfortunately, most of the standard divergences known in the literature, including those listed above, are hard or even impossible to compute, especially when $P_X$ is unknown and $P_G$ is parametrized by deep neural networks. Previous research provides several tricks to address this issue.
In case of minimizing the KL-divergence $D_{KL}(P_X, P_G)$, or equivalently maximizing the marginal log-likelihood $\mathbb{E}_{P_X}[\log p_G(X)]$, the famous variational lower bound provides a theoretically grounded framework successfully employed by VAEs [1, 11]. More generally, if the goal is to minimize the $f$-divergence $D_f(P_X, P_G)$ (with one example being $D_{KL}$), one can resort to its dual formulation and make use of $f$-GANs and the adversarial training [7]. Finally, OT cost $W_c(P_X, P_G)$ is yet another option, which can be, thanks to the celebrated Kantorovich-Rubinstein duality (2), expressed as an adversarial objective as implemented by the Wasserstein-GAN [5]. We include an extended review of all these methods in Supplementary A.
In this work we will focus on latent variable models $P_G$ defined by a two-step procedure, where first a code $Z$ is sampled from a fixed distribution $P_Z$ on a latent space $\mathcal{Z}$ and then $Z$ is mapped to the image $X \in \mathcal{X}$ with a (possibly random) transformation. This results in a density of the form

$$p_G(x) := \int_{\mathcal{Z}} p_G(x|z)\, p_Z(z)\, dz, \quad \forall x \in \mathcal{X}, \tag{3}$$

assuming all involved densities are properly defined. For simplicity we will focus on non-random decoders, i.e. generative models $P_G(X|Z)$ deterministically mapping $Z$ to $X = G(Z)$ for a given map $G: \mathcal{Z} \to \mathcal{X}$. Similar results for random decoders can be found in Supplementary B.1.
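The two-step sampling procedure with a deterministic decoder can be sketched as follows. The standard normal prior, the two-dimensional latent space, and the affine map `toy_decoder` are hypothetical stand-ins chosen only for illustration; in the paper the decoder is a deep network.

```python
import random


def sample_prior(dim, rng):
    """Step 1: draw a latent code z from the prior (assumed standard normal)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_decoder(z):
    """Step 2: hypothetical deterministic decoder G: R^2 -> R^3 (affine map)."""
    return [z[0] + 1.0, z[1] - 1.0, z[0] + z[1]]


def sample_model(n, rng):
    """Draw n samples from the model: z ~ P_Z followed by x = G(z)."""
    return [toy_decoder(sample_prior(2, rng)) for _ in range(n)]


rng = random.Random(0)
samples = sample_model(5, rng)
print(len(samples), len(samples[0]))
```

Because the decoder is deterministic, all randomness in the model comes from the prior, and every sample lies on the image of the latent space under the decoder.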
It turns out that under this model, the OT cost takes a simpler form as the transportation plan factors through the map $G$: instead of finding a coupling $\Gamma$ in (1) between two random variables living in the $\mathcal{X}$ space, one distributed according to $P_X$ and the other one according to $P_G$, it is sufficient to find a conditional distribution $Q(Z|X)$ such that its $Z$ marginal $Q_Z(Z) := \mathbb{E}_{X \sim P_X}[Q(Z|X)]$ is identical to the prior distribution $P_Z$. This is the content of the theorem below, proved in [4]. To make this paper self-contained we repeat the proof in Supplementary B.
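Theorem 1.
For $P_G$ as defined above with deterministic $P_G(X|Z)$ and any function $G: \mathcal{Z} \to \mathcal{X}$

$$\inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, Y \sim P_G)} \mathbb{E}_{(X, Y) \sim \Gamma}[c(X, Y)] = \inf_{Q:\, Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))],$$

where $Q_Z$ is the marginal distribution of $Z$ when $X \sim P_X$ and $Z \sim Q(Z|X)$.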
This result allows us to optimize over random encoders $Q(Z|X)$ instead of optimizing over all couplings between $X$ and $Y$. Of course, both problems are still constrained. In order to implement a numerical solution we relax the constraints on $Q_Z$ by adding a penalty to the objective. This finally leads us to the WAE objective:

$$D_{WAE}(P_X, P_G) := \inf_{Q(Z|X) \in \mathcal{Q}} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[c(X, G(Z))] + \lambda \cdot \mathcal{D}_Z(Q_Z, P_Z), \tag{4}$$

where $\mathcal{Q}$ is any nonparametric set of probabilistic encoders, $\mathcal{D}_Z$ is an arbitrary divergence between $Q_Z$ and $P_Z$, and $\lambda > 0$ is a hyperparameter. Similarly to VAE, we propose to use deep neural networks to parametrize both encoders $Q$ and decoders $G$. Note that as opposed to VAEs, the WAE formulation allows for non-random encoders deterministically mapping inputs to their latent codes.

We propose two different penalties $\mathcal{D}_Z(Q_Z, P_Z)$:
GAN-based $\mathcal{D}_Z$. The first option is to choose $\mathcal{D}_Z(Q_Z, P_Z) = D_{JS}(Q_Z, P_Z)$ and use the adversarial training to estimate it. Specifically, we introduce an adversary (discriminator) in the latent space $\mathcal{Z}$ trying to separate “true” points sampled from $P_Z$ and “fake” ones sampled from $Q_Z$ [3]. (We noticed that the famous “log trick”, also called “non-saturating loss”, proposed by [3] leads to better results.) This results in the WAE-GAN described in Algorithm 1. Even though WAE-GAN falls back to the min-max problem, we move the adversary from the input (pixel) space $\mathcal{X}$ to the latent space $\mathcal{Z}$. On top of that, $P_Z$ may have a nice shape with a single mode (for a Gaussian prior), in which case the task should be easier than matching an unknown, complex, and possibly multi-modal distribution as usually done in GANs. This is also a reason for our second penalty:

MMD-based $\mathcal{D}_Z$. For a positive-definite reproducing kernel $k: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ the following expression is called the maximum mean discrepancy (MMD):

$$\mathrm{MMD}_k(P_Z, Q_Z) = \Big\| \int_{\mathcal{Z}} k(z, \cdot)\, dP_Z(z) - \int_{\mathcal{Z}} k(z, \cdot)\, dQ_Z(z) \Big\|_{\mathcal{H}_k},$$

where $\mathcal{H}_k$ is the RKHS of real-valued functions mapping $\mathcal{Z}$ to $\mathbb{R}$. If $k$ is characteristic then $\mathrm{MMD}_k$ defines a metric and can be used as a divergence measure. We propose to use $\mathcal{D}_Z(P_Z, Q_Z) = \mathrm{MMD}_k(P_Z, Q_Z)$. Fortunately, MMD has an unbiased U-statistic estimator, which can be used in conjunction with stochastic gradient descent (SGD) methods. This results in the WAE-MMD described in Algorithm 2. It is well known that the maximum mean discrepancy performs well when matching high-dimensional standard normal distributions [9], so we expect this penalty to work especially well with the Gaussian prior $P_Z$.

3 Related work
Literature on auto-encoders Classical unregularized autoencoders minimize only the reconstruction cost. This results in different training points being encoded into non-overlapping zones chaotically scattered all across the $\mathcal{Z}$ space with “holes” in between where the decoder mapping $P_G(X|Z)$ has never been trained. Overall, the encoder $Q(Z|X)$ trained in this way does not provide a useful representation and sampling from the latent space $\mathcal{Z}$ becomes hard [12].
Variational autoencoders [1] minimize a variational bound on the KL-divergence $D_{KL}(P_X, P_G)$ which is composed of the reconstruction cost plus the regularizer $\mathbb{E}_{P_X}[D_{KL}(Q(Z|X), P_Z)]$. The regularizer captures how distinct the image by the encoder of each training example is from the prior $P_Z$, which does not guarantee that the overall encoded distribution $\mathbb{E}_{P_X}[Q(Z|X)]$ matches $P_Z$ like WAE does. Also, VAEs require non-degenerate (i.e. non-deterministic) Gaussian encoders and random decoders for which the term $\log p_G(x|z)$ can be computed and differentiated with respect to the parameters. Later [11] proposed a way to use VAE with non-Gaussian encoders. WAE minimizes the optimal transport $W_c(P_X, P_G)$ and allows both probabilistic and deterministic encoder-decoder pairs of any kind.
The VAE regularizer can also be equivalently written [13] as a sum of $D_{KL}(Q_Z, P_Z)$ and a mutual information $I_Q(X, Z)$ between the images $X$ and latent codes $Z$ jointly distributed according to $P_X \times Q(Z|X)$. This observation provides another intuitive way to explain a difference between our algorithm and VAEs: WAEs simply drop the mutual information term $I_Q(X, Z)$ in the VAE regularizer.
When used with $c(x, y) = \|x - y\|_2^2$, WAE-GAN is equivalent to adversarial auto-encoders (AAE) proposed by [2]. Theory of [4] (and in particular Theorem 1) thus suggests that AAEs minimize the 2-Wasserstein distance between $P_X$ and $P_G$. This provides the first theoretical justification for AAEs known to the authors. WAE generalizes AAE in two ways: first, it can use any cost function $c$ in the input space $\mathcal{X}$; second, it can use any discrepancy measure $\mathcal{D}_Z$ in the latent space $\mathcal{Z}$ (for instance MMD), not necessarily the adversarial one of WAE-GAN.
Finally, [14] independently proposed a regularized auto-encoder objective similar to [4] and our (4) based on very different motivations and arguments. Following VAEs, their objective (called InfoVAE) defines the reconstruction cost in the image space implicitly through the negative log-likelihood term $-\log p_G(x|z)$, which should be properly normalized for all $z \in \mathcal{Z}$. In theory VAE and InfoVAE can both induce arbitrary cost functions; however, in practice this may require an estimation of the normalizing constant (partition function), which can be different for different values of $z$. (Two popular choices are Gaussian and Bernoulli decoders $P_G(X|Z)$, leading to pixel-wise squared and cross-entropy losses respectively. In both cases the normalizing constants can be computed in closed form and don’t depend on $Z$.) WAEs specify the cost $c(x, y)$ explicitly and don’t constrain it in any way.
Literature on OT [15] address computing the OT cost at large scale using SGD and sampling. They approach this task either through the dual formulation, or via a regularized version of the primal. They do not discuss any implications for generative modeling. Our approach is based on the primal form of OT, arrives at very different regularizers, and focuses mainly on generative modeling.
The WGAN [5] minimizes the 1-Wasserstein distance $W_1(P_X, P_G)$ for generative modeling. The authors approach this task from the dual form. Their algorithm comes without an encoder and cannot be readily applied to any other cost $W_c$, because the neat form of the Kantorovich-Rubinstein duality (2) holds only for $W_1$. WAE approaches the same problem from the primal form, can be applied for any cost function $c$, and comes naturally with an encoder.
In order to compute the values (1) or (2) of OT we need to handle non-trivial constraints, either on the coupling distribution $\Gamma$ or on the function $f$ being considered. Various approaches have been proposed in the literature to circumvent this difficulty. For $W_1$, [5] tried to implement the constraint in the dual formulation (2) by clipping the weights of the neural network $f$. Later [8] proposed to relax the same constraint by penalizing the objective of (2) with a term $\lambda \cdot \mathbb{E}\big(\|\nabla f(X)\| - 1\big)^2$, where the gradient norm $\|\nabla f(X)\|$ should not be greater than 1 if $f \in \mathcal{F}_L$. In a more general OT setting of $W_c$, [16] proposed to penalize the objective of (1) with the KL-divergence between the coupling distribution and the product of marginals. [15] showed that this entropic regularization drops the constraints on functions in the dual formulation as opposed to (2). Finally, in the context of unbalanced optimal transport it has been proposed to relax the constraint in (1) by regularizing the objective with $\lambda \cdot \big(D_f(\Gamma_X, P_X) + D_f(\Gamma_Y, P_G)\big)$ [17, 18], where $\Gamma_X$ and $\Gamma_Y$ are marginals of $\Gamma$. In this paper we propose to relax OT in a way similar to the unbalanced optimal transport, i.e. by adding additional divergences to the objective. However, we show that in the particular context of generative modeling, only one extra divergence is necessary.
Literature on GANs Many of the GAN variations (including $f$-GAN and WGAN) come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold, in which cases these models are not applicable.
There have been many other approaches trying to blend the adversarial training of GANs with auto-encoder architectures [19, 20, 21, 22]. The approach proposed by [21] is perhaps the most relevant to our work. The authors use the discrepancy between $Q_Z$ and the distribution $\mathbb{E}_{Z' \sim P_Z}[Q(Z | G(Z'))]$ of auto-encoded noise vectors as the objective for the max-min game between the encoder and decoder respectively. While the authors showed that the saddle points correspond to $P_X = P_G$, they admit that encoders and decoders trained in this way have no incentive to be reciprocal. As a workaround they propose to include an additional reconstruction term to the objective. WAE does not necessarily lead to a min-max game, uses a different penalty, and has a clear theoretical foundation.

Several works used reproducing kernels in context of GANs. [23, 24] use MMD with a fixed kernel $k$ to match $P_X$ and $P_G$ directly in the input space $\mathcal{X}$. These methods have been criticised for requiring larger mini-batches during training: estimating $\mathrm{MMD}_k(P_X, P_G)$ requires a number of samples roughly proportional to the dimensionality of the input space $\mathcal{X}$ [25] which is typically larger than . [26] take a similar approach but further train $k$ adversarially so as to arrive at a meaningful loss function. WAE-MMD uses MMD to match $Q_Z$ to the prior $P_Z$ in the latent space $\mathcal{Z}$. Typically $\mathcal{Z}$ has no more than dimensions and $P_Z$ is Gaussian, which allows us to use regular mini-batch sizes to accurately estimate MMD.

4 Experiments
In this section we empirically evaluate the proposed WAE model. (The code is available at github.com/tolstikhin/wae.) We would like to test if WAE can simultaneously achieve (i) accurate reconstructions of data points, (ii) reasonable geometry of the latent manifold, and (iii) random samples of good (visual) quality. Importantly, the model should generalize well: requirements (i) and (ii) should be met on both training and test data. We trained WAE-GAN and WAE-MMD (Algorithms 1 and 2) on two real-world datasets: MNIST [27] consisting of 70k images and CelebA [28] containing roughly 203k images.
Experimental setup In all reported experiments we used Euclidean latent spaces $\mathcal{Z} = \mathbb{R}^{d_z}$ for various $d_z$ depending on the complexity of the dataset, isotropic Gaussian prior distributions $P_Z$ over $\mathcal{Z}$, and a squared cost function $c(x, y) = \|x - y\|_2^2$ for data points $x, y \in \mathcal{X}$. We used deterministic encoder-decoder pairs, Adam [29] with , and convolutional deep neural network architectures for encoder mapping $\mu_\phi: \mathcal{X} \to \mathcal{Z}$ and decoder mapping $G_\theta: \mathcal{Z} \to \mathcal{X}$ similar to the DCGAN ones reported by [30] with batch normalization [31]. We tried various values of $\lambda$ and noticed that seems to work well across all datasets we considered.

Since we are using deterministic encoders, choosing $d_z$ larger than the intrinsic dimensionality of the dataset would force the encoded distribution $Q_Z$ to live on a manifold in $\mathcal{Z}$. This would make matching $Q_Z$ to $P_Z$ impossible if $P_Z$ is Gaussian and may lead to numerical instabilities. We use for MNIST and for CelebA which seems to work reasonably well.
We also report results of VAEs. VAEs used the same latent spaces as discussed above and standard Gaussian priors $P_Z$. We used Gaussian encoders with mean $\mu_\phi$ and diagonal covariance $\Sigma$. For MNIST we used Bernoulli decoders parametrized by $G_\theta$ and for CelebA the Gaussian decoders with mean $G_\theta$. Functions $\mu_\phi$, $\Sigma$, and $G_\theta$ were parametrized by deep nets of the same architectures as in WAE.
WAE-GAN and WAE-MMD specifics In WAE-GAN we used a discriminator composed of several fully connected layers with ReLU. We tried WAE-MMD with the RBF kernel but observed that it fails to penalize the outliers of $Q_Z$ because of the quick tail decay. If the codes for some of the training points end up far away from the support of $P_Z$ (which may happen in the early stages of training) the corresponding terms in the U-statistic will quickly approach zero and provide no gradient for those outliers. This could be avoided by choosing the kernel bandwidth in a data-dependent manner; however, in this case the per-minibatch U-statistic would not provide an unbiased estimate for the gradient. Instead, we used the inverse multiquadratics kernel $k(x, y) = C / (C + \|x - y\|_2^2)$, which is also characteristic and has much heavier tails. In all experiments we used $C = 2 d_z \sigma_z^2$ (with $d_z$ the latent dimension and $\sigma_z^2$ the per-coordinate variance of the prior), which is the expected squared distance between two multivariate Gaussian vectors drawn from $P_Z$. This significantly improved the performance compared to the RBF kernel (even the one with ). Trained models are presented in Figures 2 and 3. Further details are presented in Supplementary C.

Random samples are generated by sampling $P_Z$ and decoding the resulting noise vectors $z$ into $G_\theta(z)$. As expected, in our experiments we observed that for both WAE-GAN and WAE-MMD the quality of samples strongly depends on how accurately $Q_Z$ matches $P_Z$. To see this, notice that during training the decoder function $G_\theta$ is presented only with encoded versions of the data points $X \sim P_X$. Indeed, the decoder is trained on samples from $Q_Z$ and thus there is no reason to expect good results when feeding it with samples from $P_Z$. In our experiments we noticed that even slight differences between $Q_Z$ and $P_Z$ may affect the quality of samples.
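The MMD penalty with the inverse multiquadratics kernel, combined with the squared reconstruction cost of objective (4), can be sketched in plain Python as below. This is a minimal illustration and not the released implementation: it estimates the squared MMD with a U-statistic for the within-sample terms, and the names `mmd2_ustat`, `wae_mmd_objective`, the identity encoder and decoder, the batch size, and the value `lam=1.0` are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def imq_kernel(a, b, c):
    """Inverse multiquadratics kernel k(a, b) = c / (c + ||a - b||^2)."""
    return c / (c + sq_dist(a, b))


def mmd2_ustat(zs_prior, zs_enc, c):
    """Unbiased estimate of the squared MMD between two samples of size n.

    Within-sample sums skip the diagonal (U-statistic); the cross term
    averages the kernel over all n * n pairs.
    """
    n = len(zs_prior)
    if n != len(zs_enc) or n < 2:
        raise ValueError("need two samples of equal size n >= 2")
    within = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                within += imq_kernel(zs_prior[i], zs_prior[j], c)
                within += imq_kernel(zs_enc[i], zs_enc[j], c)
    cross = sum(imq_kernel(zp, ze, c) for zp in zs_prior for ze in zs_enc)
    return within / (n * (n - 1)) - 2.0 * cross / (n * n)


def wae_mmd_objective(xs, encode, decode, zs_prior, lam, c):
    """Minibatch estimate: mean squared reconstruction cost + lam * penalty."""
    zs_enc = [encode(x) for x in xs]
    recon = sum(sq_dist(x, decode(z)) for x, z in zip(xs, zs_enc)) / len(xs)
    return recon + lam * mmd2_ustat(zs_prior, zs_enc, c)


rng = random.Random(0)
d_z, sigma_z = 2, 1.0
c = 2 * d_z * sigma_z ** 2  # expected squared distance between two prior draws
zs_prior = [[rng.gauss(0.0, sigma_z) for _ in range(d_z)] for _ in range(8)]
xs = [[rng.gauss(3.0, 1.0) for _ in range(d_z)] for _ in range(8)]


def identity(v):
    """Hypothetical stand-in used as both encoder and decoder."""
    return list(v)


print(wae_mmd_objective(xs, identity, identity, zs_prior, lam=1.0, c=c))
```

With the identity maps the reconstruction cost is zero, so the printed value is the penalty alone; it is positive here because the toy data are centred away from the prior.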
| Algorithm | FID | Sharpness |
|---|---|---|
| VAE | 63 | |
| WAE-MMD | 55 | |
| WAE-GAN | 42 | |
| True data | 2 | |
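The sharpness heuristic reported in the table (greyscale conversion, Laplace filtering, variance of the activations, averaged over images) can be sketched as follows. The specific 3x3 Laplace kernel, the plain channel average used for greyscale conversion, and the absence of padding are assumptions; the paper does not specify them.

```python
# Common 4-neighbour discrete Laplace kernel (an assumed choice).
LAPLACE = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def to_grey(img_rgb):
    """Average the three channels of an H x W x 3 nested-list image."""
    return [[sum(px) / 3.0 for px in row] for row in img_rgb]


def laplace_response(grey):
    """Apply the 3x3 Laplace kernel at every interior pixel (no padding)."""
    h, w = len(grey), len(grey[0])
    out = []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out.append(sum(LAPLACE[di + 1][dj + 1] * grey[i + di][j + dj]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)))
    return out


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def sharpness(images_rgb):
    """Mean over images of the variance of the Laplace activations."""
    scores = [variance(laplace_response(to_grey(img))) for img in images_rgb]
    return sum(scores) / len(scores)


flat = [[(0.5, 0.5, 0.5)] * 4 for _ in range(4)]
checker = [[(float((i + j) % 2),) * 3 for j in range(4)] for i in range(4)]
print(sharpness([flat]), sharpness([checker]))
```

A constant image has no edges and scores zero, while the checkerboard, which changes at every pixel, scores much higher, matching the intuition that blurrier images yield smaller variances.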
In some cases WAE-GAN seems to lead to a better matching and generates better samples than WAE-MMD. However, due to adversarial training WAE-GAN is less stable than WAE-MMD, which has a very stable training much like VAE.
In order to quantitatively assess the quality of the generated images, we use the Fréchet Inception Distance introduced by [32] and report the results on CelebA based on
samples. We also heuristically evaluate the sharpness of generated samples using the Laplace filter. (Every image is converted to greyscale and convolved with the Laplace filter, which acts as an edge detector. We compute the variance of the resulting activations and average these values across 1000 images sampled from a given model. The blurrier the image, the fewer edges it has, and the more activations will be close to zero, leading to smaller variances.) The numbers, summarized in Table 1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall.

Test reconstructions and interpolations. We take random points $x$ from the held-out test set and report their auto-encoded versions $G_\theta(\mu_\phi(x))$, where $\mu_\phi$ is the encoder mapping and $G_\theta$ the decoder mapping. Next, pairs $(x, y)$ of different data points are sampled randomly from the held-out test set and encoded: $z_x = \mu_\phi(x)$, $z_y = \mu_\phi(y)$. We linearly interpolate between $z_x$ and $z_y$ with equally-sized steps in the latent space and show decoded images.
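The latent interpolation step can be sketched as below. The function name and the toy codes are illustrative assumptions; in practice the two endpoints would come from the trained encoder and each interpolated code would be passed through the trained decoder.

```python
def interpolate_latents(z_x, z_y, steps):
    """Return `steps` equally spaced codes on the segment from z_x to z_y,
    with both endpoints included."""
    if steps < 2:
        raise ValueError("need at least the two endpoints")
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_x, z_y)])
    return path


z_x = [0.0, 2.0]   # hypothetical code of the first test image
z_y = [4.0, -2.0]  # hypothetical code of the second test image
for z in interpolate_latents(z_x, z_y, steps=5):
    print(z)  # each z would be decoded into an image by the decoder
```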
5 Conclusion
Using the optimal transport cost, we have derived Wasserstein auto-encoders, a new family of algorithms for building generative models. We discussed their relations to other probabilistic modeling techniques. We conducted experiments using two particular implementations of the proposed method, showing that in comparison to VAEs, the images sampled from the trained WAE models are of better quality, without compromising the stability of training and the quality of reconstruction. Future work will include further exploration of the criteria for matching the encoded distribution $Q_Z$ to the prior distribution $P_Z$, assaying the possibility of adversarially training the cost function $c$ in the input space $\mathcal{X}$, and a theoretical analysis of the dual formulations for WAE-GAN and WAE-MMD.
Acknowledgments
The authors are thankful to Carl Johann Simon-Gabriel, Mateo Rojas-Carulla, Arthur Gretton, Paul Rubenstein, and Fei Sha for stimulating discussions.
References
[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[2] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[4] O. Bousquet, S. Gelly, I. Tolstikhin, C. J. Simon-Gabriel, and B. Schölkopf. From optimal transport to generative modeling: the VEGAN cookbook, 2017.
[5] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.
[6] C. Villani. Topics in Optimal Transportation. AMS Graduate Studies in Mathematics, 2003.
[7] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs, 2017.
[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
[10] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.
[11] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks, 2017.
[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35, 2013.
[13] M. D. Hoffman and M. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.
[14] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders, 2017.
[15] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3432–3440, 2016.
[16] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[17] L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard. Unbalanced optimal transport: geometry and Kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.
[18] M. Liero, A. Mielke, and G. Savaré. Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. arXiv preprint arXiv:1508.07941, 2015.
[19] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
[20] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.
[21] D. Ulyanov, A. Vedaldi, and V. Lempitsky. It takes (only) two: Adversarial generator-encoder networks, 2017.
[22] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks, 2017.
[23] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.
[24] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
[25] R. Reddi, A. Ramdas, A. Singh, B. Poczos, and L. Wasserman. On the high-dimensional power of a linear-time two sample test under mean-shift alternatives. In AISTATS, 2015.
[26] C. L. Li, W. C. Chang, Y. Cheng, Y. Yang, and B. Poczos. MMD GAN: Towards deeper understanding of moment matching network, 2017.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86(11), pages 2278–2324, 1998.
[28] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014.
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[31] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.
[32] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
[33] B. Poole, A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for GANs, 2016.
Appendix A Implicit generative models: a short tour of GANs and VAEs
Even though GANs and VAEs are quite different (both in terms of the conceptual frameworks and empirical performance), they share important features: (a) both can be trained by sampling from the model $P_G$ without knowing an analytical form of its density and (b) both can be scaled up with SGD. As a result, it becomes possible to use highly flexible implicit models $P_G$ defined by a two-step procedure, where first a code $Z$ is sampled from a fixed distribution $P_Z$ on a latent space $\mathcal{Z}$ and then $Z$ is mapped to the image $G(Z) \in \mathcal{X}$ with a (possibly random) transformation $G: \mathcal{Z} \to \mathcal{X}$. This results in latent variable models $P_G$ of the form (3).
These models are indeed easy to sample and, provided $G$ can be differentiated analytically with respect to its parameters, $P_G$ can be trained with SGD. The field is growing rapidly and numerous variations of VAEs and GANs are available in the literature. Next we introduce and compare several of them.
The original generative adversarial network (GAN) [3] approach minimizes

$$D_{GAN}(P_X, P_G) = \sup_{T \in \mathcal{T}} \mathbb{E}_{X \sim P_X}[\log T(X)] + \mathbb{E}_{Z \sim P_Z}[\log(1 - T(G(Z)))] \tag{5}$$

with respect to a deterministic decoder $G: \mathcal{Z} \to \mathcal{X}$, where $\mathcal{T}$ is any nonparametric class of choice. It is known that $D_{GAN}(P_X, P_G) \leq 2 \cdot D_{JS}(P_X, P_G) - \log(4)$ and the inequality turns into identity in the nonparametric limit, that is when the class $\mathcal{T}$ becomes rich enough to represent all functions mapping $\mathcal{X}$ to $(0, 1)$. Hence, GANs are minimizing a lower bound on the JS-divergence. However, GANs are not only linked to the JS-divergence: the $f$-GAN approach [7] showed that a slight modification $D_{f,GAN}$ of the objective (5) allows one to lower bound any desired $f$-divergence in a similar way. In practice, both decoder $G$ and discriminator $T$ are trained in alternating SGD steps. Stopping criteria as well as adequate evaluation of the trained GAN models remain open questions.
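The relation between the GAN objective and the Jensen-Shannon divergence can be checked numerically on a toy discrete example. The sketch below uses the standard result that the optimal discriminator is $T^*(x) = p(x) / (p(x) + q(x))$, at which the objective equals $2 \cdot D_{JS} - \log 4$; the two three-point distributions are arbitrary illustrative choices, with `q` playing the role of the model distribution.

```python
import math


def kl(a, b):
    """KL divergence between discrete distributions with positive entries."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gan_value(p, q, t):
    """E_p[log T] + E_q[log(1 - T)] for a discriminator given pointwise by t."""
    real = sum(pi * math.log(ti) for pi, ti in zip(p, t))
    fake = sum(qi * math.log(1.0 - ti) for qi, ti in zip(q, t))
    return real + fake


def optimal_discriminator(p, q):
    """Pointwise maximizer of the GAN value: T*(x) = p(x) / (p(x) + q(x))."""
    return [pi / (pi + qi) for pi, qi in zip(p, q)]


p = [0.5, 0.3, 0.2]  # toy data distribution
q = [0.2, 0.3, 0.5]  # toy model distribution
t_star = optimal_discriminator(p, q)
print(gan_value(p, q, t_star))
print(2.0 * js_divergence(p, q) - math.log(4.0))
```

Any other discriminator, for example the constant 0.5, gives a value no larger than the optimum, which is why a restricted class of discriminators yields a lower bound.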
Recently, the authors of [5] argued that the 1-Wasserstein distance $W_1$, which is known to induce a much weaker topology than $D_{JS}$, may be better suited for generative modeling. When $P_X$ and $P_G$ are supported on largely disjoint low-dimensional manifolds (which may be the case in applications), $D_{KL}$, $D_{JS}$, and other strong distances between $P_X$ and $P_G$ max out and no longer provide useful gradients for $P_G$. This “vanishing gradient” problem necessitates complicated scheduling between the $G$/$T$ updates. In contrast, $W_1$ is still sensible in these cases and provides stable gradients. The Wasserstein GAN (WGAN) minimizes

$$D_{WGAN}(P_X, P_G) = \sup_{T \in \mathcal{W}} \mathbb{E}_{X \sim P_X}[T(X)] - \mathbb{E}_{Z \sim P_Z}[T(G(Z))],$$

where $\mathcal{W}$ is any subset of 1-Lipschitz functions on $\mathcal{X}$. It follows from (2) that $D_{WGAN}(P_X, P_G) \leq W_1(P_X, P_G)$ and thus WGAN is minimizing a lower bound on the 1-Wasserstein distance.
Variational autoencoders (VAE) [1] utilize models of the form (3) and minimize
(6) 
with respect to a random decoder mapping . The conditional distribution is often parametrized by a deep net and can have any form as long as its density can be computed and differentiated with respect to the parameters of . A typical choice is to use Gaussians . If is the set of allconditional probability distributions , the objective of VAE coincides with the negative marginal loglikelihood . However, in order to make the term of (6) tractable in closed form, the original implementation of VAE uses a standard normal and restricts
to a class of Gaussian distributions
with mean and diagonal covariance parametrized by deep nets. As a consequence, VAE is minimizing an upper bound on the negative loglikelihood or, equivalently, on the KLdivergence .One possible way to reduce the gap between the true negative loglikelihood and the upper bound provided by is to enlarge the class . Adversarial variational Bayes (AVB) [11] follows this argument by employing the idea of GANs. Given any point , a noise , and any fixed transformation , a random variable implicitly defines one particular conditional distribution . AVB allows to contain all such distributions for different choices of , replaces the intractable term in (6) by the adversarial approximation corresponding to the KLdivergence, and proposes to minimize^{6}^{6}6 The authors of AVB [11] note that using GAN as described above actually results in “unstable training”. Instead, following the approach of [33], they use a trained discriminator resulting from the objective (5) to approximate the ratio of densities and then directly estimate the KL divergence .
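(7) $D_{\mathrm{AVB}}(P_X, P_G) = \inf_{Q_e(Z|X) \in \mathcal{Q}} \mathbb{E}_{P_X}\big[ D_{f,\mathrm{GAN}}(Q_e(Z|X), P_Z) - \mathbb{E}_{Q_e(Z|X)}[\log p_G(X|Z)] \big]$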
The $D_{\mathrm{KL}}$ term in (6) may be viewed as a regularizer. Indeed, VAE reduces to the classical unregularized autoencoder if this term is dropped, minimizing the reconstruction cost of the encoder-decoder pair $Q(Z|X)$, $P_G(X|Z)$. This often results in different training points being encoded into non-overlapping zones chaotically scattered all across the $\mathcal{Z}$ space, with “holes” in between where the decoder mapping $P_G(X|Z)$ has never been trained. Overall, the encoder $Q(Z|X)$ trained in this way does not provide a useful representation, and sampling from the latent space $\mathcal{Z}$ becomes hard [12].
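For the setting of the original VAE (a diagonal Gaussian encoder and a standard normal prior), the KL regularizer has the well-known closed form $\tfrac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2)$. The sketch below evaluates it with NumPy. The function name and the log-variance parametrization are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# The penalty vanishes exactly when the encoder output equals the prior.
print(kl_diag_gaussian_to_std_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting one mean coordinate by 1 costs 0.5 nats.
print(kl_diag_gaussian_to_std_normal([1.0, 0.0], [0.0, 0.0]))
```

Because the penalty is applied to every encoded point separately, it pulls each $Q(Z|X=x)$ towards the prior, which is the behaviour the regularizer view above describes.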
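Adversarial autoencoders (AAE) [2] replace the $D_{\mathrm{KL}}$ term in (6) with another regularizer:
(8) $D_{\mathrm{AAE}}(P_X, P_G) = \inf_{Q(Z|X) \in \mathcal{Q}} D_{\mathrm{GAN}}(Q_Z, P_Z) - \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)}[\log p_G(X|Z)]$
where $Q_Z$ is the marginal distribution of $Z$ when first $X$ is sampled from $P_X$ and then $Z$ is sampled from $Q(Z|X)$, also known as the aggregated posterior [2]. Similarly to AVB, there is no clear link to log-likelihood, as $D_{\mathrm{AAE}} \le D_{\mathrm{AVB}}$. The authors of [2] argue that matching $Q_Z$ to $P_Z$ in this way ensures that there are no “holes” left in the latent space $\mathcal{Z}$ and that $P_G(X|Z)$ generates reasonable samples whenever $Z \sim P_Z$. They also report equally good performance for different types of conditional distributions $Q(Z|X)$, including Gaussians as used in VAEs, implicit models $Q_e$ as used in AVB, and deterministic encoder mappings, i.e. $Q(Z|X) = \delta_{\mu(X)}$ with $\mu: \mathcal{X} \to \mathcal{Z}$.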
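The two-step sampling that defines the aggregated posterior can be sketched as follows. This is a toy illustration under stated assumptions: the data are random vectors, and both encoders (`gaussian_encoder`, `deterministic_encoder`) are hypothetical stand-ins that simply use the first two input coordinates as the code mean.

```python
import numpy as np

def gaussian_encoder(x, rng, noise_std=0.1):
    """Toy random encoder: Gaussian around a hypothetical mean mu(x) = x[:2]."""
    return x[:2] + noise_std * rng.standard_normal(2)

def deterministic_encoder(x, rng):
    """Toy deterministic encoder, i.e. a Dirac at the hypothetical mu(x) = x[:2]."""
    return x[:2].copy()

def sample_aggregated_posterior(data, encoder, rng, n):
    """Draw x from the data first, then z from the encoder given x."""
    idx = rng.integers(0, len(data), size=n)
    return np.stack([encoder(data[i], rng) for i in idx])

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 5))   # placeholder for samples from the data distribution
z_random = sample_aggregated_posterior(data, gaussian_encoder, rng, 1000)
z_deterministic = sample_aggregated_posterior(data, deterministic_encoder, rng, 1000)
print(z_random.shape, z_deterministic.shape)
```

In both cases the regularizer in (8) only sees the pooled codes, not which input produced them, which is why deterministic encoders remain admissible.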
Appendix B Proof of Theorem 1 and further details
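We will consider certain sets of joint probability distributions of three random variables $(X, Y, Z) \in \mathcal{X} \times \mathcal{X} \times \mathcal{Z}$. The reader may wish to think of $X$ as true images, $Y$ as images sampled from the model, and $Z$ as latent codes. We denote by $P_{G,Z}(Y, Z)$ a joint distribution of a variable pair $(Y, Z)$, where $Z$ is first sampled from $P_Z$ and next $Y$ from $P_G(Y|Z)$. Note that $P_G$ defined in (3) and used throughout this work is the marginal distribution of $Y$ when $(Y, Z) \sim P_{G,Z}$.
In the optimal transport problem (1), we consider joint distributions $\Gamma(X, Y)$ which are called couplings between values of $X$ and $Y$. Because of the marginal constraint, we can write $\Gamma(X, Y) = \Gamma(Y|X) P_X(X)$ and we can consider $\Gamma(Y|X)$ as a non-deterministic mapping from $X$ to $Y$. Theorem 1 shows how to factor this mapping through $Z$, i.e., decompose it into an encoding distribution $Q(Z|X)$ and the generating distribution $P_G(Y|Z)$.
As in Section 2.2, $\mathcal{P}(X \sim P_X, Y \sim P_G)$ denotes the set of all joint distributions of $(X, Y)$ with marginals $P_X, P_G$, and likewise for $\mathcal{P}(X \sim P_X, Z \sim P_Z)$. The set of all joint distributions of $(X, Y, Z)$ such that $X \sim P_X$, $(Y, Z) \sim P_{G,Z}$, and $(Y \perp X) | Z$ will be denoted by $\mathcal{P}_{X,Y,Z}$. Finally, we denote by $\mathcal{P}_{X,Y}$ and $\mathcal{P}_{X,Z}$ the sets of marginals on $(X, Y)$ and $(X, Z)$ (respectively) induced by distributions in $\mathcal{P}_{X,Y,Z}$. Note that $\mathcal{P}(X \sim P_X, Y \sim P_G)$, $\mathcal{P}_{X,Y,Z}$, and $\mathcal{P}_{X,Y}$ depend on the choice of conditional distributions $P_G(Y|Z)$, while $\mathcal{P}_{X,Z}$ does not. In fact, it is easy to check that $\mathcal{P}_{X,Z} = \mathcal{P}(X \sim P_X, Z \sim P_Z)$. From the definitions it is clear that $\mathcal{P}_{X,Y} \subseteq \mathcal{P}(X \sim P_X, Y \sim P_G)$ and we immediately get the following upper bound: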
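(9) $W_c(P_X, P_G) \le W_c^{\dagger}(P_X, P_G) := \inf_{P \in \mathcal{P}_{X,Y}} \mathbb{E}_{(X,Y) \sim P}[c(X, Y)]$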
If $P_G(Y|Z)$ are Dirac measures (i.e., $Y = G(Z)$), it turns out that $\mathcal{P}_{X,Y} = \mathcal{P}(X \sim P_X, Y \sim P_G)$:
Lemma 2.
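$\mathcal{P}_{X,Y} \subseteq \mathcal{P}(X \sim P_X, Y \sim P_G)$, with identity if $P_G(Y|Z = z)$ are Dirac for all $z \in \mathcal{Z}$.
Footnote 7: We conjecture that this is also a necessary condition. The necessity is not used in the paper.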
Proof.
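The first assertion is obvious. To prove the identity, note that when $Y$ is a deterministic function of $Z$, for any $A$ in the sigma-algebra induced by $Y$ we have $\mathbb{E}[\mathbb{1}_{[Y \in A]} | X, Z] = \mathbb{E}[\mathbb{1}_{[Y \in A]} | Z]$. This implies $(Y \perp X) | Z$ and concludes the proof. ∎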
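We are now in a position to prove Theorem 1. Lemma 2 obviously leads to
$W_c(P_X, P_G) = W_c^{\dagger}(P_X, P_G)$,
where $W_c^{\dagger}$ denotes the upper bound in (9). The tower rule of expectation and the conditional independence property of $\mathcal{P}_{X,Y,Z}$ imply
$W_c^{\dagger}(P_X, P_G) = \inf_{P \in \mathcal{P}_{X,Y,Z}} \mathbb{E}_{(X,Y,Z) \sim P}[c(X, Y)]$
$= \inf_{P \in \mathcal{P}_{X,Y,Z}} \mathbb{E}_{P_Z} \mathbb{E}_{X \sim P(X|Z)} \mathbb{E}_{Y \sim P(Y|Z)}[c(X, Y)]$
$= \inf_{P \in \mathcal{P}_{X,Y,Z}} \mathbb{E}_{P_Z} \mathbb{E}_{X \sim P(X|Z)}[c(X, G(Z))]$
$= \inf_{P \in \mathcal{P}_{X,Z}} \mathbb{E}_{(X,Z) \sim P}[c(X, G(Z))]$.
It remains to notice that $\mathcal{P}_{X,Z} = \mathcal{P}(X \sim P_X, Z \sim P_Z)$ as stated earlier.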
B.1 Random decoders
If the decoders are non-deterministic, Lemma 2 provides only the inclusion of sets $\mathcal{P}_{X,Y} \subseteq \mathcal{P}(X \sim P_X, Y \sim P_G)$ and we get the following upper bound on the OT:
Corollary 3.
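Let $\mathcal{X} = \mathbb{R}^d$ and assume the conditional distributions $P_G(Y|Z = z)$ have mean values $G(z) \in \mathbb{R}^d$ and marginal variances $\sigma_1^2, \dots, \sigma_d^2 \ge 0$ for all $z \in \mathcal{Z}$, where $G: \mathcal{Z} \to \mathcal{X}$. Take $c(x, y) = \|x - y\|_2^2$. Then
(10) $W_c(P_X, P_G) \le W_c^{\dagger}(P_X, P_G) = \sum_{i=1}^{d} \sigma_i^2 + \inf_{P \in \mathcal{P}(X \sim P_X, Z \sim P_Z)} \mathbb{E}_{(X,Z) \sim P}\big[\|X - G(Z)\|^2\big]$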
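The variance term in (10) comes from a bias-variance style decomposition: when $Y$ has conditional mean $G(Z)$ and is independent of $X$ given $Z$, the cross term in $\|X - Y\|^2 = \|(X - G(Z)) + (G(Z) - Y)\|^2$ vanishes in expectation. The Monte Carlo sketch below checks this identity for one fixed coupling of $X$ and $Z$. It does not compute the infimum, and the linear decoder mean `A`, the Gaussian noise, and the particular coupling are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_z, d_x = 200_000, 2, 3

A = rng.normal(size=(d_z, d_x))       # hypothetical decoder mean G(z) = z @ A
sigma = np.array([0.5, 1.0, 2.0])     # per-coordinate std of the decoder noise

z = rng.normal(size=(n, d_z))                          # Z ~ P_Z
x = z @ A + 0.3 * rng.normal(size=(n, d_x)) + 1.0      # X coupled with Z in some way
g = z @ A                                              # conditional mean G(Z)
y = g + sigma * rng.normal(size=(n, d_x))              # Y | Z, independent of X given Z

lhs = np.mean(np.sum((x - y) ** 2, axis=1))                        # E ||X - Y||^2
rhs = np.sum(sigma ** 2) + np.mean(np.sum((x - g) ** 2, axis=1))   # sum sigma_i^2 + E ||X - G(Z)||^2
print(f"lhs = {lhs:.3f}, rhs = {rhs:.3f}")
```

Since the identity holds for every admissible coupling, taking the infimum over couplings on both sides leaves the constant $\sum_i \sigma_i^2$ outside, as in (10).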
Appendix C Further details on experiments
C.1 MNIST
We used minibatches of size 100 and trained the models for 100 epochs. We used
and . For the encoder-decoder pair we set for Adam in the beginning and for the adversary in WAE-GAN to . After 30 epochs we decreased both by a factor of 2, and after the first 50 epochs further by a factor of 5.
Both encoder and decoder used fully convolutional architectures with 4x4 convolutional filters.
Encoder architecture:
Decoder architecture:
Adversary architecture for WAEGAN:
Here $\mathrm{Conv}_k$ stands for a convolution with $k$ filters, $\mathrm{FSConv}_k$ for the fractional strided convolution with $k$ filters (the first two of them doubled the resolution, the third one kept it constant), $\mathrm{BN}$ for batch normalization, $\mathrm{ReLU}$ for rectified linear units, and $\mathrm{FC}_k$ for the fully connected layer mapping to $\mathbb{R}^k$. All the convolutions in the encoder used vertical and horizontal strides 2 and SAME padding.
Finally, we used two heuristics. First, we always pretrained the encoder separately for several minibatch steps before the main training stage so that the sample mean and covariance of $Q_Z$ would try to match those of $P_Z$. Second, while training we added pixelwise Gaussian noise truncated at to all the images before feeding them to the encoder, which was meant to make the encoders random. We tried all combinations of these two heuristics and noticed that together they give slightly (almost negligibly) better results compared to using only one or none of them.
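As a small arithmetic aid for the stride-2, SAME-padded encoder convolutions mentioned above: under the TensorFlow-style SAME convention the spatial output size is `ceil(input / stride)`, independent of the filter size. The helper below traces the resolution through a stack of such layers. The number of layers passed in is a hypothetical parameter, since the layer listing is not reproduced here.

```python
import math

def same_padding_output_size(size, stride):
    """Spatial output size of a SAME-padded convolution (TensorFlow convention)."""
    return math.ceil(size / stride)

def resolutions(input_size, num_layers, stride=2):
    """Resolution after each of num_layers stride-`stride` SAME convolutions."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(same_padding_output_size(sizes[-1], stride))
    return sizes

# A 28x28 input passed through three stride-2 SAME convolutions (layer count assumed).
print(resolutions(28, 3))
```

Odd sizes round up, so a 7x7 map becomes 4x4 rather than 3x3 after one more stride-2 layer.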
Our VAE model used cross-entropy loss (Bernoulli decoder) and otherwise the same architectures and hyperparameters as listed above.
C.2 CelebA
We preprocessed CelebA images by first taking 140x140 center crops and then resizing them to the 64x64 resolution. We used minibatches of size 100 and trained the models for various numbers of epochs (up to 250). All reported WAE models were trained for 55 epochs and VAE for 68 epochs. For WAE-MMD we used and for WAE-GAN . Both used .
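The crop-then-resize preprocessing can be sketched with NumPy alone. The interpolation method is not stated in the text, so the nearest-neighbour resize below is only a stand-in, and the function names and the dummy image size are assumptions for illustration.

```python
import numpy as np

def center_crop(image, crop_size):
    """Cut a crop_size x crop_size window from the centre of an H x W x C image."""
    h, w = image.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    return image[top:top + crop_size, left:left + crop_size]

def resize_nearest(image, out_size):
    """Nearest-neighbour resize to out_size x out_size (stand-in for the real resize)."""
    h, w = image.shape[:2]
    rows = (np.arange(out_size) * h) // out_size
    cols = (np.arange(out_size) * w) // out_size
    return image[rows][:, cols]

def preprocess(image, crop_size=140, out_size=64):
    """140x140 centre crop followed by a resize to 64x64."""
    return resize_nearest(center_crop(image, crop_size), out_size)

dummy = np.zeros((218, 178, 3), dtype=np.uint8)   # placeholder image, size chosen for illustration
print(preprocess(dummy).shape)
```

A production pipeline would more likely use an image library with bilinear or area interpolation; only the crop and target sizes here come from the text.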
For WAE-MMD the learning rate of Adam was initially set to . For WAE-GAN the learning rate of Adam for the encoder-decoder pair was initially set to and for the adversary to . All learning rates were decreased by a factor of 2 after 30 epochs, by a further factor of 5 after the first 50 epochs, and finally by an additional factor of 10 after the first 100 epochs.
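The stepwise decay described above can be written as a small schedule function. This is a sketch under assumptions: epochs are counted from zero, a drop takes effect once the stated number of epochs has been completed, and `base_lr` is a free parameter because the initial values are not reproduced here.

```python
def learning_rate(base_lr, epoch, milestones=((30, 2.0), (50, 5.0), (100, 10.0))):
    """Learning rate at a zero-based epoch index.

    Each (start_epoch, factor) pair divides the rate by `factor`
    once `start_epoch` epochs have been completed; the divisions accumulate.
    """
    lr = base_lr
    for start_epoch, factor in milestones:
        if epoch >= start_epoch:
            lr /= factor
    return lr

# Relative rate (base_lr = 1.0) at a few epochs.
for e in (0, 29, 30, 50, 100):
    print(e, learning_rate(1.0, e))
```

Passing `milestones=((30, 2.0), (50, 5.0))` gives the two-step variant described for the MNIST experiments.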
Both encoder and decoder used fully convolutional architectures with 5x5 convolutional filters.
Encoder architecture: