Implicit probabilistic models (Mohamed and Lakshminarayanan, 2016) are defined to be the pushforward of a simple distribution $P_Z$ over a latent space $\mathcal{Z}$ through a map $G : \mathcal{Z} \to \mathcal{X}$, where $\mathcal{X}$ is the space of the input data. Such models allow easy sampling, but the computation of the corresponding probability density function is intractable. The goal of these methods is to match $G \# P_Z$ to a target distribution $P_X$ by minimizing $D(P_X, G \# P_Z)$, for some discrepancy $D$ between distributions. A large number of methods have emerged after the introduction of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Nowozin et al., 2016) and Variational Autoencoders (VAEs) (Kingma and Welling, 2013), which have established two distinct paradigms: Adversarial (network) training and Autoencoders respectively. Adversarial training involves a set of functions $\mathcal{D}$, referred to as discriminators, with an objective of the form
$$\min_{G} \max_{d \in \mathcal{D}} \left\{ \mathbb{E}_{x \sim P_X}[a(d(x))] + \mathbb{E}_{x \sim G \# P_Z}[b(d(x))] \right\},$$
for some functions $a$ and $b$. Autoencoder methods involve finding a function $E$, referred to as an encoder, whose goal is to reverse $G$, and learn a feature space with the objective
$$\min_{G, E} \left\{ \mathcal{R}(G, E) + \Omega(E) \right\},$$
where $\mathcal{R}(G, E)$ is the reconstruction loss and acts to ensure $G$ and $E$ reverse each other and $\Omega(E)$ is a regularization term. Much work on Autoencoder methods has focused upon the choice of $\Omega$.
Both methods have their own strengths and limitations, along with differing directions of progress. Indeed, there is a lack of theoretical understanding of how these frameworks are parametrized and it is not clear whether the methods are fundamentally different. For example, Adversarial training based methods have empirically demonstrated high performance when it comes to producing realistic looking samples from the target distribution $P_X$. However, GANs often have problems with convergence and stability of training (Goodfellow, 2016). Autoencoders, on the other hand, deal with a better behaved objective and learn an encoder in the process, making them useful for feature representation. However, Autoencoder based methods have reported shortfalls in practice, such as producing blurry samples for image based datasets (Tolstikhin et al., 2017). This has motivated researchers to borrow elements from Adversarial training in the hopes of achieving GAN performance. Examples include replacing the regularization term $\Omega$ with Adversarial objectives (Mescheder et al., 2017; Makhzani et al., 2015) or replacing the reconstruction loss with an adversarial objective (Dumoulin et al., 2016; Alanov et al., 2018). Recently, the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2017) has been shown to subsume these two methods, with an Adversarial based regularizer $\Omega$, and has demonstrated performance similar to that of Adversarial methods.
When it comes to directions of progress, Adversarial training methods now have theoretical guarantees on generalization performance (Zhang et al., 2017); however, no such theoretical results have been obtained to date for autoencoders. Generalization performance is a pressing concern, since both techniques implicitly assume the samples represent the target distribution (Li and Malik, 2018). A formal connection would benefit both methods, allowing them to inherit strengths from one another.
In this work, we study the two paradigms and in particular focus on the $f$-GANs (Nowozin et al., 2016) for Adversarial training and Wasserstein Autoencoders (WAE) for Autoencoders, which generalize the original GAN and VAE models respectively. We prove that the $f$-GAN objective with Lipschitz (with respect to a metric $c$) discriminators is equivalent to the WAE objective with cost $c$. In particular, we show that the WAE objective is an upper bound so as to have
$$f\text{-GAN} \leq \text{WAE},$$
and discuss the tightness of this bound. Our result is a generalization of the Kantorovich-Rubinstein duality and thus suggests a primal-dual relationship between Adversarial and Autoencoder methods. Consequently we show, to the best of our knowledge, the first generalization bounds for autoencoders. Furthermore, using this equivalence, we show that the WAE objective is related to key statistical quantities such as the $f$-divergence and Wasserstein distance, which allows us to tap into efficient (regularized) OT solvers.
We also present another contribution regarding the parametrization of WAE in the Appendix, relating the optimization of Brenier potentials in transport theory to the WAE objective (Section A.6). The main contributions can be summarized as follows:
- (Theorem 4) Establish an equivalence between Adversarial training and Wasserstein Autoencoders, showing conditions under which the $f$-GAN and WAE coincide. This further justifies the similar performance of WAE to GAN based methods. When the conditions are not met, we have an inequality, which allows us to comment on the behavior of the methods.
- (Theorem 4, 4 and 5) Show that the WAE objective is related to other statistical quantities such as the $f$-divergence, the Wasserstein distance and the entropy regularized Wasserstein distance.
- (Theorem 5) Provide generalization bounds for WAE. In particular, this focuses on the empirical variant of the WAE objective, which allows the use of OT solvers as they are concerned with discrete distributions. This allows one to employ efficient (regularized) OT solvers for the estimation of WAE, $f$-GANs and the generalization bounds.
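We will use $\mathcal{X}$ to denote the input space (a Polish space), typically taken to be a Euclidean space. We use $\mathcal{Z}$ to denote the latent space, also taken to be Euclidean. We use $\mathbb{N}_*$ to denote the natural numbers without $0$: $\mathbb{N} \setminus \{0\}$. The set $\mathscr{P}(\mathcal{X})$ contains the probability measures over $\mathcal{X}$, and elements of this set will be referred to as distributions. If $P \in \mathscr{P}(\mathcal{X})$ happens to be absolutely continuous with respect to the Lebesgue measure then we will use $dP/d\lambda$ to refer to the density function (Radon-Nikodym derivative with respect to the Lebesgue measure). For any $T \in \mathscr{F}(\mathcal{X}, \mathcal{Z})$, for any measure $\mu \in \mathscr{P}(\mathcal{X})$, the pushforward measure of $\mu$ through $T$ denoted $T \# \mu \in \mathscr{P}(\mathcal{Z})$ is such that $T \# \mu(A) = \mu(T^{-1}(A))$ for any measurable set $A \subseteq \mathcal{Z}$. The set $\mathscr{F}(\mathcal{X}, \mathbb{R})$ refers to all measurable functions from $\mathcal{X}$ into the set $\mathbb{R}$. We will use functions to represent conditional distributions over a space $\mathcal{Z}$ conditioned on elements $x \in \mathcal{X}$, for example $P \in \mathscr{F}(\mathcal{X}, \mathscr{P}(\mathcal{Z}))$ so that for any $x \in \mathcal{X}$, $P(x) = P(\cdot \mid x) \in \mathscr{P}(\mathcal{Z})$. For any $P \in \mathscr{P}(\mathcal{X})$, the support of $P$ is $\mathrm{supp}(P) = \{x \in \mathcal{X} : P(N_x) > 0 \text{ for every open set } N_x \ni x\}$. In any metric space $(\mathcal{X}, c)$, for any set $S \subseteq \mathcal{X}$, we define the diameter of $S$ to be $\mathrm{diam}_c(S) = \sup_{x, x' \in S} c(x, x')$. For a metric $c$ over $\mathcal{X}$, then for any $h \in \mathscr{F}(\mathcal{X}, \mathbb{R})$, $\mathrm{Lip}_c(h)$ denotes the Lipschitz constant of $h$ with respect to $c$ and $\mathcal{H}_c = \{h \in \mathscr{F}(\mathcal{X}, \mathbb{R}) : \mathrm{Lip}_c(h) \leq 1\}$. For some set $S \subseteq \mathbb{R}$, $\iota_S$ corresponds to the convex indicator function, ie. $\iota_S(x) = 0$ if $x \in S$ and $\iota_S(x) = \infty$ otherwise. For any $x \in \mathcal{X}$, $\delta_x : \mathcal{X} \to \{0, 1\}$ corresponds to the characteristic function, with $\delta_x(a) = 1$ if $a = x$ and $\delta_x(a) = 0$ if $a \neq x$.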
2.2.1 Probability Discrepancies
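Probability discrepancies are central to the objective of finding the best fitting model. We introduce some key discrepancies and their notation, which will appear later. [$f$-Divergence] For a convex function $f : \mathbb{R} \to (-\infty, \infty]$ with $f(1) = 0$, for any $P, Q \in \mathscr{P}(\mathcal{X})$ with $P$ absolutely continuous with respect to $Q$, the $f$-Divergence between $P$ and $Q$ is
$$D_f(P, Q) = \int_{\mathcal{X}} f\left(\frac{dP}{dQ}\right) dQ.$$
In order to compute the $f$-divergence, one can first compute $dP/dQ$ and estimate the integral empirically using samples from $Q$. [Integral Probability Metric] For a fixed function class $\mathcal{F} \subseteq \mathscr{F}(\mathcal{X}, \mathbb{R})$, the Integral Probability Metric (IPM) based on $\mathcal{F}$ between $P, Q \in \mathscr{P}(\mathcal{X})$ is defined as
$$\mathrm{IPM}_{\mathcal{F}}(P, Q) = \sup_{h \in \mathcal{F}} \left\{ \int_{\mathcal{X}} h \, dP - \int_{\mathcal{X}} h \, dQ \right\}.$$
If we have that $-\mathcal{F} = \mathcal{F}$ then $\mathrm{IPM}_{\mathcal{F}}$ forms a metric over $\mathscr{P}(\mathcal{X})$ (Müller, 1997). A particular IPM we will make use of is Total Variation (TV): $\mathrm{TV}(P, Q) = \mathrm{IPM}_{\mathcal{V}}(P, Q)$ where $\mathcal{V} = \{h \in \mathscr{F}(\mathcal{X}, \mathbb{R}) : \|h\|_{\infty} \leq 1\}$. We also note that when $f(x) = |x - 1|$ then $\mathrm{TV} = D_f$ and thus TV is both an IPM and an $f$-divergence. For any $P, Q \in \mathscr{P}(\mathcal{X})$, define the set of couplings between $P$ and $Q$ to be
$$\Pi(P, Q) = \left\{ \pi \in \mathscr{P}(\mathcal{X} \times \mathcal{X}) : \pi(A \times \mathcal{X}) = P(A), \ \pi(\mathcal{X} \times B) = Q(B) \text{ for all measurable } A, B \subseteq \mathcal{X} \right\}.$$
For a cost $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{+}$, the Wasserstein distance between $P$ and $Q$ is
$$W_c(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} c(x, y) \, d\pi(x, y).$$
The Wasserstein distance can be regarded as an infinite linear program and thus admits a dual form, and in the case of $c$ being a metric, belongs to the class of IPMs, which we summarize in the following lemma (Villani, 2008). [Wasserstein Duality] Let $(\mathcal{X}, c)$ be a metric space, and suppose $\mathcal{H}_c$ is the set of all $1$-Lipschitz functions with respect to $c$. Then for any $P, Q \in \mathscr{P}(\mathcal{X})$, we have
$$W_c(P, Q) = \sup_{h \in \mathcal{H}_c} \left\{ \int_{\mathcal{X}} h \, dP - \int_{\mathcal{X}} h \, dQ \right\}.$$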
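As an illustration of the duality lemma, the following minimal sketch compares the primal and dual sides for two empirical distributions on the real line with cost $|x - y|$. It assumes equal sample sizes (so that sorting yields an optimal coupling) and uses three hand-picked $1$-Lipschitz test functions; the function names and the toy Gaussian samples are hypothetical choices made for this example, not part of the paper.

```python
import random

def wasserstein_1d(xs, ys):
    """Primal value for c(x, y) = |x - y| between two equal-size empirical
    distributions on the real line: matching sorted samples is optimal."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_value(h, xs, ys):
    """Dual objective E_P[h] - E_Q[h] for a candidate 1-Lipschitz function h."""
    return sum(map(h, xs)) / len(xs) - sum(map(h, ys)) / len(ys)

rng = random.Random(0)
xs = [rng.gauss(2.0, 1.0) for _ in range(500)]  # samples playing the role of P
ys = [rng.gauss(0.0, 1.0) for _ in range(500)]  # samples playing the role of Q

primal = wasserstein_1d(xs, ys)

# Hypothetical 1-Lipschitz candidates; each one yields a lower bound on the primal.
candidates = {
    "identity": lambda x: x,
    "abs": lambda x: abs(x),
    "clipped": lambda x: max(-1.0, min(1.0, x)),
}
for name, h in candidates.items():
    print(name, dual_value(h, xs, ys), "<=", primal)
```

Every candidate produces a value no larger than the primal one, and the supremum over all $1$-Lipschitz functions closes the gap; for a pure shift such as this one, the identity function is expected to come very close.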
2.3 Generative Models
In both GAN and VAE models, we have a latent space $\mathcal{Z}$ (typically taken to be $\mathbb{R}^d$, with $d$ being small) and a prior distribution $P_Z \in \mathscr{P}(\mathcal{Z})$ (eg. unit variance Gaussian). We have a function $G : \mathcal{Z} \to \mathcal{X}$ referred to as the generator, which induces the generated distribution, denoted by $P_G \in \mathscr{P}(\mathcal{X})$, as the pushforward of $P_Z$ through $G$: $P_G = G \# P_Z$. The true data distribution will be referred to as $P_X \in \mathscr{P}(\mathcal{X})$. The common goal between the two methods is to find a generator $G$ such that the samples generated by pushing forward $P_Z$ through $G$ ($G \# P_Z$) are close to the true data distribution ($P_X$). More formally, one can cast this as an optimization problem by finding the best $G$ such that $D(P_X, P_G)$ is minimized where $D(\cdot, \cdot)$ is some discrepancy between distributions. Both methods (as we outline below) utilize their own discrepancies between $P_X$ and $P_G$, which offer their own benefits and weaknesses.
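To make the pushforward construction concrete, here is a small sketch of sampling from a generated distribution: draw a latent sample from the prior and apply the generator. The standard Gaussian prior on the real line and the affine generator are hypothetical toy choices, picked only because the resulting distribution (a Gaussian with mean 1 and variance 4) is known in closed form.

```python
import random

def sample_pushforward(G, sample_prior, n, rng):
    """Draw n samples from the pushforward of the prior through G:
    sample z from the prior, then return G(z)."""
    return [G(sample_prior(rng)) for _ in range(n)]

# Hypothetical toy prior: standard Gaussian on the real line.
def prior(rng):
    return rng.gauss(0.0, 1.0)

# Hypothetical toy generator: an affine (hence invertible) map.
def G(z):
    return 2.0 * z + 1.0

rng = random.Random(0)
xs = sample_pushforward(G, prior, 20000, rng)

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# Pushforward identity: the mass of A under the generated distribution equals
# the prior mass of the preimage of A. For A = (-inf, 1], the preimage is (-inf, 0].
mass_A = sum(x <= 1.0 for x in xs) / len(xs)
print(mean, var, mass_A)
```

Sampling is cheap, while evaluating the density of the generated distribution would require inverting the generator, which is what makes such models implicit.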
2.3.1 Wasserstein Autoencoder
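Let $E : \mathcal{X} \to \mathscr{P}(\mathcal{Z})$ denote a probabilistic encoder, which maps each point $x$ to a conditional distribution $E(x) \in \mathscr{P}(\mathcal{Z})$, denoted as the posterior distribution. The pushforward of $P_X$ through $E$: $E \# P_X$, will be referred to as the aggregated posterior. [Wasserstein Autoencoder (Tolstikhin et al., 2017)] Let $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $\lambda > 0$ and $\Omega : \mathscr{P}(\mathcal{Z}) \times \mathscr{P}(\mathcal{Z}) \to \mathbb{R}_{\geq 0}$ with $\Omega(P, P) = 0$ for all $P \in \mathscr{P}(\mathcal{Z})$. The Wasserstein Autoencoder objective is
$$\mathrm{WAE}_{c, \lambda \cdot \Omega}(P_X, G) = \inf_{E \in \mathscr{F}(\mathcal{X}, \mathscr{P}(\mathcal{Z}))} \left\{ \int_{\mathcal{X}} \mathbb{E}_{z \sim E(x)}[c(x, G(z))] \, dP_X(x) + \lambda \cdot \Omega(E \# P_X, P_Z) \right\}.$$
We remark that there are various choices of $c$ and $\lambda \cdot \Omega$. Tolstikhin et al. (2017) select these by tuning $\lambda$ and selecting different probability distortions for $\Omega$.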
2.3.2 $f$-Generative Adversarial Network
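Let $d : \mathcal{X} \to \mathbb{R}$ denote a discriminator function. [$f$-GAN (Nowozin et al., 2016)] Let $f : \mathbb{R} \to (-\infty, \infty]$ denote a convex function with property $f(1) = 0$ and $\mathcal{D} \subseteq \mathscr{F}(\mathcal{X}, \mathbb{R})$ a set of discriminators. The $f$-GAN model minimizes the following objective for a generator $G$
$$\mathrm{GAN}_f(P_X, G; \mathcal{D}) = \sup_{d \in \mathcal{D}} \left\{ \mathbb{E}_{x \sim P_X}[d(x)] - \mathbb{E}_{z \sim P_Z}[f^{\star}(d(G(z)))] \right\}, \tag{3}$$
where $f^{\star}(x) = \sup_{y} \{x \cdot y - f(y)\}$ is the convex conjugate of $f$. There are two knobs in this method, namely $\mathcal{D}$, the set of discriminators, and the convex function $f$. The objective in (3) is a variational approximation to $D_f$ (Nowozin et al., 2016); if $\mathcal{D} = \mathscr{F}(\mathcal{X}, \mathbb{R})$, then $\mathrm{GAN}_f(P_X, G; \mathcal{D}) = D_f(P_X, P_G)$ (Nguyen et al., 2010). In the case of $f(x) = x \log x - (x + 1) \log(x + 1) + 2 \log 2$, we recover the original GAN (Goodfellow et al., 2014).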
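The variational characterization can be checked numerically on a toy case. The sketch below assumes finite discrete distributions and the Kullback-Leibler choice $f(t) = t \log t$, whose convex conjugate is $f^{\star}(s) = e^{s - 1}$; the two probability vectors are hypothetical stand-ins for the data and generated distributions. It shows that the discriminator $f'(dP/dQ)$ attains the $f$-divergence, while an arbitrary discriminator only gives a lower bound.

```python
import math

def f_divergence(f, p, q):
    """D_f(P, Q) = sum_i q_i * f(p_i / q_i) for discrete P, Q with all q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def variational_value(f_star, d, p, q):
    """E_P[d] - E_Q[f*(d)] for a discriminator given by its values d_i."""
    return sum(pi * di for pi, di in zip(p, d)) - sum(
        qi * f_star(di) for qi, di in zip(q, d)
    )

# Kullback-Leibler case: f(t) = t log t, f'(t) = log t + 1, f*(s) = exp(s - 1).
def f(t):
    return t * math.log(t) if t > 0 else 0.0

def f_prime(t):
    return math.log(t) + 1.0

def f_star(s):
    return math.exp(s - 1.0)

p = [0.5, 0.3, 0.2]  # hypothetical stand-in for the data distribution
q = [0.2, 0.3, 0.5]  # hypothetical stand-in for the generated distribution

exact = f_divergence(f, p, q)
d_opt = [f_prime(pi / qi) for pi, qi in zip(p, q)]  # maximizing discriminator
d_zero = [0.0, 0.0, 0.0]                            # an arbitrary discriminator

value_opt = variational_value(f_star, d_opt, p, q)
value_zero = variational_value(f_star, d_zero, p, q)
print(exact, value_opt, value_zero)
```

Restricting the discriminator set, for example to Lipschitz functions, can only lower the supremum, which is why the restricted objective is a variational approximation rather than the divergence itself.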
3 Related Work
Current attempts at building a taxonomy for generative models have largely been within each paradigm or the proposal of hybrid methods that borrow elements from the two. We first review major and relevant advances in each paradigm, and then move on to discuss results that are close to the technical contributions of our work.
The line of Autoencoders begins with $\Omega = 0$, which is the original autoencoder concerned only with reconstruction loss. VAE then introduced a non-zero regularizer $\Omega$, along with implementing Gaussian encoders (Kingma and Welling, 2013). This was then replaced by an adversarial objective (Mescheder et al., 2017), which is sample based and consequently allows arbitrary encoders. In the spirit of unification, Adversarial Autoencoders (AAE) (Makhzani et al., 2015) proposed $\Omega$ to be a discrepancy between the pushforward of the target distribution through the encoder ($E \# P_X$) and the prior distribution ($P_Z$) in the latent space, which was later shown to be equivalent to the VAE regularizer minus a mutual information term (Hoffman and Johnson, 2016). Independently, InfoVAE (Zhao et al., 2017) proposed a similar objective, which was later shown to be equivalent to adding mutual information. Tolstikhin et al. (2017) then reparametrized the Wasserstein distance into an Autoencoder objective (WAE) where the $\Omega$ term generalizes AAE, and reported performance comparable to that of Adversarial methods. Other attempts also include adjusting the reconstruction loss to be adversarial as well (Dumoulin et al., 2016; Alanov et al., 2018). Another work that focuses on WAE is the Sinkhorn Autoencoder (SAE) (Patrini et al., 2018), which selects $\Omega$ to be the Wasserstein distance and shows that the overall objective is an upper bound to the Wasserstein distance between $P_X$ and $P_G$.
Hu et al. (2017) discussed the two paradigms and their unification by interpreting GANs from the perspective of variational inference, which allowed a connection to VAE, resulting in a GAN implemented with importance weighting techniques. While this approach is the closest to our work in forming a link, their results apply to standard VAE (and not other AE methods such as WAE) and cannot be extended to all $f$-GANs. Liu et al. (2017) introduced the notion of an Adversarial divergence, which subsumed mainstream adversarial based methods. This also led to a formal understanding of how the selected discriminator set affects the final learned distribution. However, this approach is silent with regard to Autoencoder based methods. Zhang et al. (2017) established the tradeoff between the Rademacher complexity of the discriminator class and generalization performance of the learned distribution, with no results present for Autoencoders. These theoretical advances in Adversarial training methods are inherited by Autoencoders as a consequence of the equivalence presented in our work.
One key point in the proof of our equivalence is the use of a result that decomposes the GAN objective into an $f$-divergence and an IPM for a restricted class of discriminators (which we used for Lipschitz functions). This decomposition is used in (Liu and Chaudhuri, 2018) and applied to linear $f$-GANs, showing that the adversarial training objective decomposes into a mixture of maximum likelihood and moment matching. Farnia and Tse (2018) used this decomposition with Lipschitz discriminators like our work, but do not extend it to establish the link to WAE. Indeed, GANs with Lipschitz discriminators have been independently studied in (Zhou et al., 2018), which suggests that one should enforce Lipschitz constraints to provide useful gradients.
4 $f$-Wasserstein Autoencoders
In the sequel, for any considered, we will be assuming that . We introduce an objective, which we refer to as the $f$-Wasserstein Autoencoder, that will help us in the proof of the main theorems of this paper. [$f$-Wasserstein Autoencoder] Let , , $f$ be a convex function (with $f(1) = 0$) and defined in Section 2.3. We define the $f$-Wasserstein Autoencoder ($f$-WAE) objective to be
In the proof of the main result, we will show that the $f$-WAE objective is indeed the same as the WAE objective when using the same cost and selecting the regularizer to be . The only difference between this and the standard WAE is the use of as reconstruction instead of the standard cost which is an upper bound (Lemma A.1), and the regularizer is chosen to be . We now present the main theorem that captures the relationship between $f$-GANs, $f$-WAE and WAE. [$f$-GAN and WAE equivalence] Suppose $c$ is a metric and let $\mathcal{H}_c$ denote the set of all functions from $\mathcal{X}$ to $\mathbb{R}$ that are $1$-Lipschitz (with respect to $c$). Let $f$ be a convex function with $f(1) = 0$, then we have for all ,
with equality if $G$ is invertible. (This is a sketch, see Section A.1 for full proof). The proof begins by proving certain properties of $\mathcal{H}_c$ (Lemma A.1), allowing us to use the dual form of restricted GANs (Theorem A.1),
We then have
with equality in if is invertible (Lemma A.1). A weaker condition is required if is differentiable, namely if is invertible with respect to in the sense that
noting that an invertible trivially satisfies this requirement. Letting , we have , and so from Equation 7, we have
where the final inequality follows from the fact that (Lemma A.1). Using the fact that (Lemma A.1) completes the proof. When the generator $G$ is invertible, we remark that the generated distribution $P_G$ can still be expressive and capable of modelling complex distributions in WAE and GAN models. For example, if $G$ is implemented with feedforward neural networks, and $G$ is invertible then $P_G$ can model deformed exponential families (Nock et al., 2017), which encompass a large class appearing in statistical physics and information geometry (Amari, 2016; Borland, 1998). There exist many invertible activation functions under which $G$ will be invertible. Furthermore, in the proof of the Theorem it is clear that the $f$-WAE and WAE objectives are the same (from Lemma A.1 and Lemma A.1). When using ( if and otherwise), and noting that , Theorem 4 (with ) reduces to
which is the standard primal-dual relation between Wasserstein distances as in Lemma 2.2.1. Hence, Theorem 4 can be viewed as a generalization of this primal-dual relationship, where Autoencoder and Adversarial objectives represent primal and dual forms respectively.
We note that the left-hand side of Equation (5) does not explicitly engage the prior space as much as the right-hand side, in the sense that one can set , (which is invertible) and , which indeed results in the exact same $f$-GAN objective since , yet the equivalent $f$-WAE objective (from Theorem 4) will be different. This makes the Theorem versatile in reparametrizations, which we exploit in the proof for Theorem 4. We now consider weighting the reconstruction along with the regularization term in (which is equivalent to weighting WAE), which simply amounts to re-weighting the cost since for any ,
The idea of weighting the regularization term by was introduced by Higgins et al. (2016) and further studied empirically, showing that the choice of influences learning disentanglement in the latent space (Alemi et al., 2018). We show that if and is larger than some then will become an $f$-divergence (Theorem 4). On the other hand, if we fix and take larger than some , then becomes the Wasserstein distance and in particular, all equalities hold in (5) (Theorem 4). We show explicitly how high and need to be for such equalities to occur. Since the $f$-divergence and the Wasserstein distance are quite different distortions in terms of their properties, this gives an interpretation of weighting each term.
We now outline the $f$-divergence case. We will be focusing on $f$ convex, differentiable and . In the case we assume that is absolutely continuous with respect to , so that . We then have the following. Set and let $f$ be a convex function (with $f(1) = 0$) and differentiable. Let and suppose is absolutely continuous with respect to and that $G$ is invertible, then we have for all
(Proof in Appendix, Section A.3). One may actually pick the following value
for for Theorem 4 to hold, noting that it is smaller than since $f'$ is increasing ($f$ is convex) and . It is important to note that when and so Theorem 4 tells us that the objective with a weighted total variation reconstruction loss with an $f$-divergence prior regularization amounts to the $f$-divergence. It was shown in (Nock et al., 2017) that when $G$ is an invertible feedforward neural network then the $f$-divergence between the distributions is a Bregman divergence (a well regarded quantity in information geometry) between the parametrizations of the network for a particular choice of activation function for $G$, which depends on $f$. Hence, a practitioner should design $G$ with such an activation function when using $f$-WAE under the above setting ( and ) with $G$ being invertible, so that the information theoretic divergence ($D_f$) between the distributions becomes an information geometric divergence involving the network parameters.
We now show that if is selected high enough then becomes and furthermore we have equality between $f$-GAN, $f$-WAE and WAE. Let $c$ be a metric. For any convex function $f$ (with $f(1) = 0$), letting , we have for all
(Proof in Appendix, Section A.4). Note that Theorem 4 holds for any (satisfying properties of the Theorem) and so one can estimate the Wasserstein distance using any as long as is scaled to . In order to understand the quantity
there are two extremes in which the supremum may be unbounded. The first case is when is taken far from so that increases; however, one should note that in the case when then and so will be finite whereas can possibly diverge to , making . The other case is when is made close to , in which case however so the quantity can still be small in this case, depending on the rate of decrease between and . Now suppose that and , in which case and thus . In this case, Theorem 4 reduces to the standard result regarding the equivalence between Wasserstein distance and $f$-divergence intersecting at the variational divergence under these conditions.
5 Generalization bounds
We prove generalization bounds using machinery developed in (Weed and Bach, 2017) and thus introduce their definitions and notations. For a set $S$, we denote $N_{\varepsilon}(S)$ to be the $\varepsilon$-covering number of $S$, which is the smallest $m \in \mathbb{N}_*$ such that there exist closed balls $B_1, \dots, B_m$ of radius $\varepsilon$ with $S \subseteq \bigcup_{i=1}^{m} B_i$. For any , the -covering number is
and the -dimension is
The -Upper Wasserstein dimension of is
We make an assumption of and having bounded support to achieve the following bounds. For any in a metric space , we define . We are now ready to present the generalization bounds. Let be a metric space and suppose . For any , let and denote the empirical distributions with samples drawn i.i.d. from and respectively. Let and . For all convex functions, and , we have
with probability at least for any and if is chosen then we have for all
with probability at least for any . (Proof in Appendix, Section A.2). First note that there is no requirement on $G$ to be invertible and no restriction on . Second, there are the quantities , and that are influenced by the distributions and . If $G$ is invertible in the above then the left-hand side of both bounds becomes by Theorem 4. One could suspect that may be unbounded by drawing parallels to $f$-divergences, in which case may be unbounded. However, this is not the case since
and since we search and there exists a such that shares the support of , in which case will result in a bounded value. Using Theorem 4, one can set large enough so that the expressions in the bound can become Wasserstein distances.
We show now that the can be upper bounded by the Wasserstein distance. Consider the entropy regularized Wasserstein distance:
we have the following. For any , and convex function (with ) we have
(Proof in Appendix, Section A.5). Since our goal is to minimize , we can minimize the upper bounds. Since the above objectives are Wasserstein distances, we can make use of the existing efficient solvers for these quantities. Indeed, the majority of these solvers are concerned with discrete problems, which is the setting presented in Theorem 5: . Note also that, from Theorem 4, setting to higher values closes the gap in (11).
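As a rough illustration of the kind of (regularized) OT solver referred to here, the sketch below runs Sinkhorn scaling on a small discrete problem. It assumes the common formulation in which the transport cost is penalized by the regularization strength times the KL divergence of the coupling from the product of the marginals; the exact form of the regularizer used in the paper is not recoverable from the text, so this is an assumption. The three-point problem, the function names and the chosen regularization strengths are all hypothetical.

```python
import math

def sinkhorn(p, q, cost, eps, n_iters=1000):
    """Sinkhorn scaling for entropy-regularized OT between discrete p and q.
    Returns the coupling u_i * K_ij * v_j with Gibbs kernel K_ij = exp(-cost_ij / eps)."""
    n, m = len(p), len(q)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(pi, cost):
    """Linear transport cost of a coupling, without the entropic penalty."""
    return sum(pi[i][j] * cost[i][j] for i in range(len(pi)) for j in range(len(pi[0])))

def exact_w1_on_line(p, q, points):
    """Unregularized Wasserstein distance for cost |x - y| on sorted points,
    computed from the difference of the cumulative distribution functions."""
    total, cp, cq = 0.0, 0.0, 0.0
    for k in range(len(points) - 1):
        cp += p[k]
        cq += q[k]
        total += abs(cp - cq) * (points[k + 1] - points[k])
    return total

# Hypothetical toy problem: three points on a line with cost |x - y|.
points = [0.0, 1.0, 2.0]
cost = [[abs(x - y) for y in points] for x in points]
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

exact = exact_w1_on_line(p, q, points)
results = {}
for eps in (2.0, 1.0, 0.5):
    pi = sinkhorn(p, q, cost, eps)
    results[eps] = (pi, transport_cost(pi, cost))
    print(eps, results[eps][1], ">=", exact)
```

The transport cost of each regularized coupling sits above the unregularized distance, which is the direction needed when the regularized quantity is used as a computable upper bound.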
6 Discussion and Conclusion
This work is the first to prove a generalized primal-dual relationship between GANs and Autoencoders. Consequently, this elucidates the close performance between WAE and $f$-GANs. Furthermore, we explored the effect of weighting the reconstruction and regularization on the WAE objective, showing relationships to both $f$-divergences and Wasserstein metrics along with the impact on the duality relationship. This equivalence allows us to prove generalization results, which to the best of our knowledge, are the first bounds given for Autoencoder models. Furthermore, using connections to the Wasserstein metrics, we can employ efficient (regularized) OT solvers to approximate upper bounds on the generalization bounds, which involve discrete distributions and thus are natural for such solvers.
The consequences of unifying two paradigms are plentiful, generalization bounds being an example. One direction for extending this line of work is to explore the case of a general cost (as opposed to a metric), invoking the generalized Wasserstein dual with the goal of forming a generalized GAN. Our paper provides a basis to unify Adversarial Networks and Autoencoders through a primal-dual relationship, and opens doors for the further unification of related models.
We would like to acknowledge anonymous reviewers and the Australian Research Council of Data61.
- Alanov et al. (2018) Aibek Alanov, Max Kochurov, Daniil Yashkov, and Dmitry Vetrov. Pairwise augmented gans with adversarial reconstruction loss. arXiv preprint arXiv:1810.04920, 2018.
- Alemi et al. (2018) Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken elbo. In International Conference on Machine Learning, pages 159–168, 2018.
- Amari (2016) Shun-ichi Amari. Information geometry and its applications. Springer, 2016.
- Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Borland (1998) Lisa Borland. Ito-langevin equations within generalized thermostatistics. Physics Letters A, 245(1-2):67–72, 1998.
- Brenier (1991) Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
- Dumoulin et al. (2016) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- Farnia and Tse (2018) Farzan Farnia and David Tse. A convex duality framework for gans. In Advances in Neural Information Processing Systems, pages 5254–5263, 2018.
- Goodfellow (2016) Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.
- Hoffman and Johnson (2016) Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P Xing. On unifying deep generative models. arXiv preprint arXiv:1706.00550, 2017.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Lei et al. (2018) Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng David Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 2018.
- Lei et al. (2019) Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng David Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
- Li and Malik (2018) Ke Li and Jitendra Malik. On the implicit assumptions of gans. arXiv preprint arXiv:1811.12402, 2018.
- Liu and Chaudhuri (2018) Shuang Liu and Kamalika Chaudhuri. The inductive bias of restricted f-gans. arXiv preprint arXiv:1809.04542, 2018.
- Liu et al. (2017) Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
- Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
- Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
- Mohamed and Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
- Müller (1997) Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
- Nguyen et al. (2010) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
- Nock et al. (2017) Richard Nock, Zac Cranko, Aditya K Menon, Lizhen Qu, and Robert C Williamson. f-gans in an information geometric nutshell. In Advances in Neural Information Processing Systems, pages 456–464, 2017.
- Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
- Patrini et al. (2018) Giorgio Patrini, Marcello Carioni, Patrick Forre, Samarth Bhargav, Max Welling, Rianne van den Berg, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. arXiv preprint arXiv:1810.01118, 2018.
- Tolstikhin et al. (2017) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
- Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Weed and Bach (2017) Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.
- Zhang et al. (2017) Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in gans. arXiv preprint arXiv:1711.02771, 2017.
- Zhao et al. (2017) Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
- Zhou et al. (2018) Zhiming Zhou, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Zhihua Zhang, and Yong Yu. Understanding the effectiveness of lipschitz-continuity in generative adversarial nets. 2018.
Appendix A
A.1 Proof of Theorem 4
In order to prove the theorem, we make use of the dual form of the restricted variational form of an $f$-divergence: [(Liu and Chaudhuri, 2018), Theorem 3] Let $f$ denote a convex function with property $f(1) = 0$ and suppose is a convex subset of with the property that for any and , we have . Then for any we have
The goal is now to set ; however, there are some conditions of the above that we require. If $c$ is a metric then $\mathcal{H}_c$ is convex and closed under addition. Let and define for some ; we then have
Consider some and set for some . We then have
for all . We require a lemma regarding the decomposability of for $f$-divergences. Let and let be two distributions over . We have that
with equality if is invertible. Furthermore, if is differentiable then we have equality for a weaker condition: for any . By writing the variational form from (Nguyen et al., 2010) (Lemma 1), we have
where we used the fact that . If is invertible then applying the above with , and , we have
which is just the reverse direction , and so equality holds. Suppose now that is differentiable then note that inequality holds when (See proof of Lemma 1 in (Nguyen et al., 2010)), which is equivalent to asking if there exists a function such that
For any , we can construct to map to and due to the condition in the lemma, we can guarantee