Adversarial Networks and Autoencoders: The Primal-Dual Relationship and Generalization Bounds

02/03/2019 ∙ by Hisham Husain, et al. ∙ CSIRO, The Australian National University

Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive, empirical performance of GANs has led to the rise of many GAN-VAE hybrids, in the hope of GAN-level performance together with the additional benefits of VAEs, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the f-GAN and WAE models and make two main discoveries. First, we find that the f-GAN objective is equivalent to an autoencoder-like objective, which has close links to, and is in some cases equivalent to, the WAE objective; we refer to this as the f-WAE. This equivalence allows us to explicate the success of WAE. Second, the equivalence result allows us, for the first time, to prove generalization bounds for Autoencoder models (WAE and f-WAE), which is a pertinent problem for theoretical analyses of generative models. Furthermore, we show that the f-WAE objective is related to other statistical quantities such as the f-divergence and, in particular, is upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) OT solvers to minimize the f-WAE. Our findings thus recommend the f-WAE as a tighter alternative to WAE, comment on generalization abilities and take a step towards unifying these models.


1 Introduction

Implicit probabilistic models (Mohamed and Lakshminarayanan, 2016) are defined to be the pushforward of a simple distribution $P_Z$ over a latent space $\mathcal{Z}$ through a map $G : \mathcal{Z} \to \mathcal{X}$, where $\mathcal{X}$ is the space of the input data. Such models allow easy sampling, but the computation of the corresponding probability density function is intractable. The goal of these methods is to match $P_G := G_\# P_Z$ to a target distribution $P_X$ by minimizing $D(P_X, P_G)$, for some discrepancy $D$ between distributions. An overwhelming number of methods have emerged after the introduction of Generative Adversarial Networks (Goodfellow et al., 2014; Nowozin et al., 2016) and Variational Autoencoders (Kingma and Welling, 2013) (GANs and VAEs), which have established two distinct paradigms: Adversarial (network) training and Autoencoders, respectively. Adversarial training involves a set of functions $\mathcal{T}$, referred to as discriminators, with an objective of the form

$$\sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim P_X}\!\left[ g_1(T(x)) \right] - \mathbb{E}_{x \sim P_G}\!\left[ g_2(T(x)) \right] \qquad (1)$$

for some functions $g_1$ and $g_2$. Autoencoder methods involve finding a (possibly stochastic) function $Q : \mathcal{X} \to \mathcal{Z}$, referred to as an encoder, whose goal is to reverse $G$ and learn a feature space, with the objective

$$\mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim Q(\cdot \mid x)}\!\left[ c(x, G(z)) \right] + \Omega(Q) \qquad (2)$$

where $c$ is the reconstruction loss, which acts to ensure that $G$ and $Q$ reverse each other, and $\Omega$ is a regularization term. Much work on Autoencoder methods has focused upon the choice of $\Omega$.
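To fix ideas, here is a minimal numerical sketch (ours, not from the paper) of the two kinds of objectives, with illustrative choices: identity $g_1$ and $g_2$ with a single fixed discriminator for (1), and a squared-error reconstruction with a crude moment-matching regularizer for (2); $G$ and $Q$ are toy linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generator G: Z -> X and an encoder Q: X -> Z that exactly reverses it.
A = rng.normal(size=(2, 2))                          # invertible with probability one
G = lambda z: z @ A.T
Q = lambda x: x @ np.linalg.inv(A).T

def adversarial_objective(x_data, x_gen, T, g1=lambda t: t, g2=lambda t: t):
    """Monte Carlo estimate of an objective of the form (1) for one fixed discriminator T."""
    return g1(T(x_data)).mean() - g2(T(x_gen)).mean()

def autoencoder_objective(x_data, c, Omega, lam=1.0):
    """Monte Carlo estimate of an objective of the form (2): reconstruction plus a regularizer."""
    z = Q(x_data)
    return c(x_data, G(z)).mean() + lam * Omega(z)

x_data = rng.normal(size=(1000, 2)) @ A.T            # stand-in for samples from P_X
x_gen = G(rng.normal(size=(1000, 2)))                # samples from P_G = G#P_Z
T = lambda x: x[:, 0]                                # one (fixed) discriminator
c = lambda x, y: ((x - y) ** 2).sum(axis=1)          # squared-error reconstruction loss
Omega = lambda z: np.abs(z.mean(axis=0)).sum()       # crude moment-matching penalty towards P_Z = N(0, I)

print(adversarial_objective(x_data, x_gen, T))
print(autoencoder_objective(x_data, c, Omega))
```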

Both methods have their own strengths and limitations, along with differing directions of progress. Indeed, there is a lack of theoretical understanding of how these frameworks are parametrized, and it is not clear whether the methods are fundamentally different. For example, Adversarial training based methods have empirically demonstrated high performance when it comes to producing realistic looking samples from $P_G$. However, GANs often have problems with convergence and stability of training (Goodfellow, 2016). Autoencoders, on the other hand, deal with a better behaved objective and learn an encoder in the process, making them useful for feature representation. In practice, however, Autoencoder based methods have reported shortfalls such as producing blurry samples on image datasets (Tolstikhin et al., 2017). This has motivated researchers to borrow elements from Adversarial training in the hope of achieving GAN-level performance. Examples include replacing the regularizer $\Omega$ with an adversarial objective (Mescheder et al., 2017; Makhzani et al., 2015) or replacing the reconstruction loss with an adversarial objective (Dumoulin et al., 2016; Alanov et al., 2018). Recently, the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2017) has been shown to subsume these two methods, with an Adversarial-based regularizer, and has demonstrated performance similar to that of Adversarial methods.

When it comes to directions of progress, Adversarial training methods now have theoretical guarantees on generalization performance (Zhang et al., 2017); however, no such theoretical results have been obtained to date for Autoencoders. Generalization performance is a pressing concern, since both techniques implicitly assume that the samples represent the target distribution (Li and Malik, 2018). A formal connection will benefit both methods, allowing them to inherit strengths from one another.

In this work, we study the two paradigms and in particular focus on $f$-GANs (Nowozin et al., 2016) for Adversarial training and Wasserstein Autoencoders (WAE) for Autoencoders, which generalize the original GAN and VAE models respectively. We prove that the $f$-GAN objective with Lipschitz (with respect to a metric $d$) discriminators is equivalent to an autoencoder-like objective closely related to the WAE objective with cost $d$. In particular, we show that the WAE objective is an upper bound on the $f$-GAN objective, and we discuss the tightness of this bound. Our result is a generalization of the Kantorovich-Rubinstein duality and thus suggests a primal-dual relationship between Adversarial and Autoencoder methods. Consequently we show, to the best of our knowledge, the first generalization bounds for Autoencoders. Furthermore, using this equivalence, we show that the WAE objective is related to key statistical quantities such as the $f$-divergence and the Wasserstein distance, which allows us to tap into efficient (regularized) OT solvers.

We also present another contribution regarding the parametrization of WAE in the Appendix, relating the optimization of Brenier potentials in transport theory to the WAE objective (Section A.6). The main contributions can be summarized as follows:
- (Theorem 4) Establish an equivalence between Adversarial training and Wasserstein Autoencoders, showing conditions under which the $f$-GAN and WAE objectives coincide. This further justifies the similar performance of WAE to GAN based methods. When the conditions are not met, we have an inequality, which allows us to comment on the behavior of the methods.
- (Theorems 4 and 5) Show that the WAE objective is related to other statistical quantities such as the $f$-divergence, the Wasserstein distance and the entropy regularized Wasserstein distance.
- (Theorem 5) Provide generalization bounds for WAE. In particular, this focuses on the empirical variant of the WAE objective, which involves discrete distributions and therefore allows the use of OT solvers. This allows one to employ efficient (regularized) OT solvers for the estimation of WAE, $f$-GANs and the generalization bounds.

2 Preliminaries

2.1 Notation

We will use $\mathcal{X}$ to denote the input space (a Polish space), typically taken to be a Euclidean space. We use $\mathcal{Z}$ to denote the latent space, also taken to be Euclidean. We use $\mathbb{N}$ to denote the natural numbers without $0$: $\{1, 2, 3, \ldots\}$. The set $\mathcal{P}(\mathcal{X})$ contains the probability measures over $\mathcal{X}$, and elements of this set will be referred to as distributions. If $P \in \mathcal{P}(\mathcal{X})$ happens to be absolutely continuous with respect to the Lebesgue measure then we will use $p$ to refer to the density function (Radon-Nikodym derivative with respect to the Lebesgue measure). For any measurable $f : \mathcal{X} \to \mathcal{Z}$ and any measure $P \in \mathcal{P}(\mathcal{X})$, the pushforward measure of $P$ through $f$, denoted $f_\# P$, is such that $f_\# P(B) = P(f^{-1}(B))$ for any measurable set $B$. The set $\mathcal{F}(A, B)$ refers to all measurable functions from $A$ into the set $B$. We will use functions $Q \in \mathcal{F}(\mathcal{X}, \mathcal{P}(\mathcal{Z}))$ to represent conditional distributions over a space $\mathcal{Z}$ conditioned on elements $x \in \mathcal{X}$, writing $Q(\cdot \mid x)$, so that for any $x \in \mathcal{X}$, $Q(\cdot \mid x) \in \mathcal{P}(\mathcal{Z})$. For any $P \in \mathcal{P}(\mathcal{X})$, the support of $P$ is $\mathrm{supp}(P)$, the smallest closed set of full $P$-measure. In any metric space $(\mathcal{X}, d)$, for any set $A \subseteq \mathcal{X}$, we define the diameter of $A$ to be $\mathrm{diam}(A) = \sup_{x, x' \in A} d(x, x')$. For a metric $d$ over $\mathcal{X}$ and any $f : \mathcal{X} \to \mathbb{R}$, $\|f\|_{\mathrm{Lip}(d)}$ denotes the Lipschitz constant of $f$ with respect to $d$. For some set $A$, $\iota_A$ corresponds to the convex indicator function, i.e. $\iota_A(x) = 0$ if $x \in A$ and $\iota_A(x) = \infty$ otherwise. For any set $A$, $\chi_A$ corresponds to the characteristic function, with $\chi_A(x) = 1$ if $x \in A$ and $\chi_A(x) = 0$ if $x \notin A$.

2.2 Background

2.2.1 Probability Discrepancies

Probability discrepancies are central to the objective of finding the best fitting model. We introduce some key discrepancies and their notation, which will appear later. [$f$-Divergence] For a convex function $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, and for any $P, Q \in \mathcal{P}(\mathcal{X})$ with $P$ absolutely continuous with respect to $Q$, the $f$-divergence between $P$ and $Q$ is

$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} f\!\left( \frac{dP}{dQ} \right) dQ.$$

In order to compute the $f$-divergence, one can first compute the density ratio $\frac{dP}{dQ}$ and estimate the integral empirically using samples from $Q$.
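As a concrete instance of this estimation recipe (our sketch, not code from the paper): for two one-dimensional Gaussians the density ratio $dP/dQ$ is available in closed form, so $D_f(P \| Q)$ with $f(u) = u \log u$ (the KL divergence) can be estimated by averaging $f(dP/dQ)$ over samples from $Q$ and compared with the analytic value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = lambda u: u * np.log(u)                 # generator of the KL divergence, f(1) = 0

# P = N(0, 1), Q = N(1, 2^2); estimate D_f(P || Q) = E_Q[ f(dP/dQ) ].
P, Q = norm(0, 1), norm(1, 2)
z = Q.rvs(size=200_000, random_state=rng)
estimate = np.mean(f(P.pdf(z) / Q.pdf(z)))

# Closed-form KL(N(m1, s1^2) || N(m2, s2^2)) for comparison.
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(estimate, exact)
```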

[Integral Probability Metric] For a fixed function class $\mathcal{T} \subseteq \mathcal{F}(\mathcal{X}, \mathbb{R})$, the Integral Probability Metric (IPM) based on $\mathcal{T}$ between $P, Q \in \mathcal{P}(\mathcal{X})$ is defined as

$$d_{\mathcal{T}}(P, Q) = \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[T(x)].$$

If we have that $\mathcal{T} = -\mathcal{T}$ then $d_{\mathcal{T}}$ forms a metric over $\mathcal{P}(\mathcal{X})$ (Müller, 1997). A particular IPM we will make use of is Total Variation (TV): $\mathrm{TV}(P, Q) = d_{\mathcal{T}_{\mathrm{TV}}}(P, Q)$ where $\mathcal{T}_{\mathrm{TV}} = \{ T : \|T\|_\infty \leq 1 \}$. We also note that when $f(u) = |u - 1|$ then $D_f = \mathrm{TV}$, and thus TV is both an IPM and an $f$-divergence. For any $P, Q \in \mathcal{P}(\mathcal{X})$, define the set of couplings between $P$ and $Q$ to be

$$\Pi(P, Q) = \left\{ \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) : \pi(\cdot, \mathcal{X}) = P, \; \pi(\mathcal{X}, \cdot) = Q \right\}.$$

For a cost $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, the Wasserstein distance between $P$ and $Q$ is

$$W_c(P, Q) = \inf_{\pi \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \pi}\!\left[ c(x, y) \right].$$

The Wasserstein distance can be regarded as an infinite-dimensional linear program and thus admits a dual form; in the case of the cost being a metric, it belongs to the class of IPMs, which we summarize in the following lemma (Villani, 2008). [Wasserstein Duality] Let $(\mathcal{X}, d)$ be a metric space, and suppose $\mathcal{T}_{\mathrm{Lip}}$ is the set of all $1$-Lipschitz functions with respect to $d$. Then for any $P, Q \in \mathcal{P}(\mathcal{X})$, we have

$$W_d(P, Q) = \sup_{T \in \mathcal{T}_{\mathrm{Lip}}} \; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}[T(x)].$$
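To make the primal form concrete, the following sketch (ours) solves the discrete optimal transport linear program for two small empirical measures on the real line, with cost $d(x, y) = |x - y|$, and checks the value against SciPy's closed-form one-dimensional Wasserstein-1 distance.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, 6), rng.normal(1, 2, 5)                   # supports of two empirical measures
p, q = np.full(len(x), 1 / len(x)), np.full(len(y), 1 / len(y))   # uniform weights

# Primal: minimize <C, pi> over couplings pi with marginals p and q.
C = np.abs(x[:, None] - y[None, :])                               # cost matrix d(x_i, y_j) = |x_i - y_j|
n, m = C.shape
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1                                # row marginals sum to p_i
for j in range(m):
    A_eq[n + j, j::m] = 1                                         # column marginals sum to q_j
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]), bounds=(0, None))

print(res.fun)                                                    # primal LP value
print(wasserstein_distance(x, y))                                 # closed-form W_1 in one dimension
```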

2.3 Generative Models

In both GAN and VAE models, we have a latent space $\mathcal{Z}$ (typically taken to be $\mathbb{R}^{d_z}$, with $d_z$ small) and a prior distribution $P_Z \in \mathcal{P}(\mathcal{Z})$ (e.g. a unit variance Gaussian). We have a function $G : \mathcal{Z} \to \mathcal{X}$, referred to as the generator, which induces the generated distribution, denoted by $P_G$, as the pushforward of $P_Z$ through $G$: $P_G = G_\# P_Z$. The true data distribution will be referred to as $P_X$. The common goal of the two methods is to find a generator $G$ such that the samples generated by pushing $P_Z$ forward through $G$ (i.e. samples from $P_G$) are close to the true data distribution ($P_X$). More formally, one can cast this as an optimization problem: find the best $G$ such that $D(P_X, P_G)$ is minimized, where $D$ is some discrepancy between distributions. Both methods (as we outline below) utilize their own discrepancies between $P_X$ and $P_G$, which offer their own benefits and weaknesses.
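A minimal sketch of the pushforward construction (ours, with an arbitrary toy generator): sampling from $P_G = G_\# P_Z$ only requires sampling $z \sim P_Z$ and applying $G$, even though the density of $P_G$ is intractable.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n = 2, 10_000

A = rng.normal(size=(d_z, 3))
G = lambda z: np.tanh(z @ A) + 0.1 * (z @ np.ones((d_z, 3)))   # a toy nonlinear generator Z -> X

z = rng.normal(size=(n, d_z))   # z ~ P_Z, a unit-variance Gaussian prior on the latent space
x_gen = G(z)                    # samples from P_G = G#P_Z: sampling is easy, densities are not

print(x_gen.mean(axis=0), x_gen.std(axis=0))
```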

2.3.1 Wasserstein Autoencoder

Let $Q \in \mathcal{F}(\mathcal{X}, \mathcal{P}(\mathcal{Z}))$ denote a probabilistic encoder, which maps each point $x \in \mathcal{X}$ to a conditional distribution $Q(\cdot \mid x)$, referred to as the posterior distribution. The pushforward of $P_X$ through $Q$, denoted $Q_Z$, will be referred to as the aggregated posterior. [Wasserstein Autoencoder (Tolstikhin et al., 2017)] Let $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ be a cost, $\lambda > 0$, and let $D_Z$ be a discrepancy between distributions over $\mathcal{Z}$. The Wasserstein Autoencoder objective is

$$\inf_{Q \in \mathcal{F}(\mathcal{X}, \mathcal{P}(\mathcal{Z}))} \; \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim Q(\cdot \mid x)}\!\left[ c(x, G(z)) \right] + \lambda \cdot D_Z(Q_Z, P_Z).$$

We remark that there are various choices of $\lambda$ and $D_Z$. Tolstikhin et al. (2017) select these by tuning $\lambda$ and selecting different probability distortions for $D_Z$.
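A minimal numerical sketch of the WAE objective (ours, with a toy deterministic encoder and generator, and an RBF-kernel MMD standing in for the discrepancy $D_Z$, one of the choices considered by Tolstikhin et al. (2017)):

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd_rbf(a, b, sigma=1.0):
    """Biased RBF-kernel MMD^2 estimate between two samples (rows are points)."""
    def k(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

# Toy deterministic generator/encoder pair on R^2 (illustrative, not the paper's networks).
A = np.array([[2.0, 0.3], [0.1, 1.5]])
G = lambda z: z @ A.T
Q = lambda x: x @ np.linalg.inv(A).T

def wae_objective(x_data, lam=10.0):
    z = Q(x_data)                                        # samples from the aggregated posterior Q_Z
    recon = ((x_data - G(z)) ** 2).sum(axis=1).mean()    # c(x, G(z)) = ||x - G(z)||^2
    penalty = mmd_rbf(z, rng.normal(size=z.shape))       # D_Z(Q_Z, P_Z) with prior P_Z = N(0, I)
    return recon + lam * penalty

x_data = rng.normal(size=(500, 2)) @ A.T
print(wae_objective(x_data))
```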

2.3.2 -Generative Adversarial Network

Let $T : \mathcal{X} \to \mathbb{R}$ denote a discriminator function. [$f$-GAN (Nowozin et al., 2016)] Let $f$ denote a convex function with the property $f(1) = 0$, and let $\mathcal{T}$ be a set of discriminators. The $f$-GAN model minimizes the following objective over the generator $G$:

$$D^{\mathrm{GAN}}_{f, \mathcal{T}}(P_X \,\|\, P_G) = \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim P_X}\!\left[ T(x) \right] - \mathbb{E}_{x \sim P_G}\!\left[ f^\star(T(x)) \right] \qquad (3)$$

where $f^\star(t) = \sup_{u} \{ ut - f(u) \}$ is the convex conjugate of $f$. There are two knobs in this method, namely $\mathcal{T}$, the set of discriminators, and the convex function $f$. The objective in (3) is a variational approximation to $D_f(P_X \,\|\, P_G)$ (Nowozin et al., 2016); if $\mathcal{T} = \mathcal{F}(\mathcal{X}, \mathbb{R})$, then (3) coincides with $D_f(P_X \,\|\, P_G)$ (Nguyen et al., 2010). In the case of $f$ corresponding to the Jensen-Shannon divergence, we recover the original GAN (Goodfellow et al., 2014).
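A quick numerical check of the variational form behind (3) (our sketch, not code from the paper): for $f(u) = u \log u$, the conjugate is $f^\star(t) = e^{t-1}$ and, over all measurable discriminators, the supremum in (3) is attained at $T^\star(x) = f'(dP_X/dP_G(x)) = \log(dP_X/dP_G(x)) + 1$ (Nguyen et al., 2010); plugging this $T^\star$ in recovers the KL divergence, verified here on two Gaussians.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

f_star = lambda t: np.exp(t - 1)            # convex conjugate of f(u) = u log u
P, Q = norm(0, 1), norm(1, 2)               # stand-ins for P_X and P_G
xP = P.rvs(size=200_000, random_state=rng)
xQ = Q.rvs(size=200_000, random_state=rng)

T_opt = lambda x: np.log(P.pdf(x) / Q.pdf(x)) + 1          # T* = f'(dP/dQ)
bound = T_opt(xP).mean() - f_star(T_opt(xQ)).mean()        # objective (3) evaluated at T*

exact_kl = np.log(2 / 1) + (1 + 1) / (2 * 4) - 0.5         # closed-form KL(P || Q)
print(bound, exact_kl)
```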

3 Related Work

Current attempts at building a taxonomy for generative models have largely been within each paradigm or the proposal of hybrid methods that borrow elements from the two. We first review major and relevant advances in each paradigm, and then move on to discuss results that are close to the technical contributions of our work.

The line of Autoencoders begins with the objective (2) with $\Omega = 0$, which is the original autoencoder, concerned only with reconstruction loss. VAE then introduced a non-zero $\Omega$, along with implementing Gaussian encoders (Kingma and Welling, 2013). This regularizer was then replaced by an adversarial objective (Mescheder et al., 2017), which is sample based and consequently allows arbitrary encoders. In the spirit of unification, Adversarial Autoencoders (AAE) (Makhzani et al., 2015) proposed $\Omega$ to be a discrepancy between the pushforward of the target distribution through the encoder ($Q_Z$) and the prior distribution ($P_Z$) in the latent space, which was later found to be equivalent to the VAE objective minus a mutual information term (Hoffman and Johnson, 2016). Independently, InfoVAE (Zhao et al., 2017) proposed a similar objective, which was then found to be equivalent to adding mutual information. Tolstikhin et al. (2017) then reparametrized the Wasserstein distance into an Autoencoder objective (WAE) whose regularization term generalizes AAE, and reported performance comparable to that of Adversarial methods. Other attempts also include making the reconstruction loss adversarial as well (Dumoulin et al., 2016; Alanov et al., 2018). Another work that focuses on WAE is the Sinkhorn Autoencoder (SAE) (Patrini et al., 2018), which selects $D_Z$ to be a Wasserstein distance and shows that the overall objective is an upper bound on the Wasserstein distance between $P_X$ and $P_G$.

Hu et al. (2017) discussed the two paradigms and their unification by interpreting GANs from the perspective of variational inference, which allowed a connection to VAE, resulting in a GAN implemented with importance weighting techniques. While this approach is the closest to our work in forming a link, their results apply to the standard VAE (and not other AE methods such as WAE) and cannot be extended to all $f$-GANs. Liu et al. (2017) introduced the notion of an Adversarial divergence, which subsumed mainstream adversarial based methods. This also led to a formal understanding of how the selected discriminator set affects the final learned distribution. However, this approach is silent with regard to Autoencoder based methods. Zhang et al. (2017) established the tradeoff between the Rademacher complexity of the discriminator class and the generalization performance of Adversarial training, with no results present for Autoencoders. These theoretical advances in Adversarial training methods are inherited by Autoencoders as a consequence of the equivalence presented in our work.

One key point in the proof of our equivalence is the use of a result that decomposes the GAN objective into an $f$-divergence and an IPM for a restricted class of discriminators (which we use for Lipschitz functions). This decomposition is used in (Liu and Chaudhuri, 2018) and applied to linear $f$-GANs, showing that the adversarial training objective decomposes into a mixture of maximum likelihood and moment matching.

Farnia and Tse (2018) used this decomposition with Lipschitz discriminators, as in our work, however they do not make any extension or further progress toward establishing the link to WAE. Indeed, GANs with Lipschitz discriminators have been independently studied in (Zhou et al., 2018), which suggests that one should enforce Lipschitz constraints to provide useful gradients.

4 -Wasserstein Autoencoders

In the sequel, for any $P_G$ considered, we will assume that $P_X$ is absolutely continuous with respect to $P_G$. We introduce an objective, which we refer to as the $f$-Wasserstein Autoencoder, that will help us in the proof of the main theorems of this paper. [$f$-Wasserstein Autoencoder] Let $d$ be a metric on $\mathcal{X}$, let $f$ be a convex function (with $f(1) = 0$), and let $P_Z$ and $G$ be as defined in Section 2.3. We define the $f$-Wasserstein Autoencoder ($f$-WAE) objective to be

(4)

In the proof of the main result, we will show that the $f$-WAE objective is indeed the same as the WAE objective when using the same cost and selecting the regularizer $D_Z$ to be an $f$-divergence. The only difference between the two is the reconstruction term: $f$-WAE replaces the standard cost $d$ with a quantity that $d$ upper bounds (Lemma A.1), and the regularizer is chosen to be an $f$-divergence. We now present the main theorem that captures the relationship between $f$-GANs, $f$-WAE and WAE. [$f$-GAN and WAE equivalence] Suppose $d$ is a metric on $\mathcal{X}$ and let $\mathcal{T}_{\mathrm{Lip}}$ denote the set of all functions from $\mathcal{X}$ to $\mathbb{R}$ that are $1$-Lipschitz (with respect to $d$). Let $f$ be a convex function with $f(1) = 0$; then we have, for all $P_X, P_G$,

(5)

with equality if $G$ is invertible. (This is a sketch; see Section A.1 for the full proof.) The proof begins by establishing certain properties of $\mathcal{T}_{\mathrm{Lip}}$ (Lemma A.1), allowing us to use the dual form of restricted $f$-GANs (Theorem A.1),

(6)

The key is to reparametrize (6) as an optimization over couplings, rewriting the distribution appearing in (6) as a pushforward through $G$; this is justified by Lemma A.1. We obtain

(7)

We then have

with equality if $G$ is invertible (Lemma A.1). A weaker condition suffices if $f$ is differentiable, namely that $G$ be invertible in the sense that

(8)

noting that an invertible $G$ trivially satisfies this requirement. With this choice, from Equation (7) we have

where the final inequality follows from Lemma A.1. Using Lemma A.1 once more completes the proof. When $G$ is invertible, we remark that $P_G$ can still be expressive and capable of modelling complex distributions in WAE and GAN models. For example, if $G$ is implemented with feedforward neural networks and its activation functions are invertible, then $P_G$ can model deformed exponential families (Nock et al., 2017), which encompass a large class of distributions appearing in statistical physics and information geometry (Amari, 2016; Borland, 1998). There exist many invertible activation functions under which $G$ will be invertible. Furthermore, in the proof of the Theorem it is clear that the $f$-WAE and WAE objectives are the same (from Lemma A.1 and Lemma A.1). When using $f = \iota_{\{1\}}$ (i.e. $f(u) = 0$ if $u = 1$ and $f(u) = \infty$ otherwise), and noting that then $f^\star(t) = t$, Theorem 4 (with this choice of $f$) reduces to

$$W_d(P_X, P_G) \;=\; \sup_{T \in \mathcal{T}_{\mathrm{Lip}}} \; \mathbb{E}_{x \sim P_X}[T(x)] - \mathbb{E}_{x \sim P_G}[T(x)],$$

which is the standard primal-dual relation for the Wasserstein distance, as in Lemma 2.2.1. Hence, Theorem 4 can be viewed as a generalization of this primal-dual relationship, where the Autoencoder and Adversarial objectives represent the primal and dual forms respectively.

We note that the left hand side of Equation (5) does not explicitly engage the prior space as much as the right hand side, in the sense that one can reparametrize the latent space (for example with an invertible map) and obtain the exact same $f$-GAN objective, since the pushforward $P_G$ is unchanged, yet the equivalent $f$-WAE objective (from Theorem 4) will be different. This makes the Theorem versatile with respect to reparametrizations, which we exploit in the proof of Theorem 4. We now consider weighting the reconstruction term along with the regularization term in the $f$-WAE objective (which is equivalent to weighting WAE); weighting the reconstruction simply amounts to re-weighting the cost.

The idea of weighting the regularization term by a coefficient was introduced by Higgins et al. (2016) and further studied empirically, showing that the choice of this weight influences the learning of disentangled representations in the latent space (Alemi et al., 2018). We show that if the reconstruction weight is larger than some threshold then the objective becomes an $f$-divergence (Theorem 4). On the other hand, if we fix the reconstruction weight and take the regularization weight larger than some threshold, then the objective becomes the Wasserstein distance and, in particular, all equalities hold in (5) (Theorem 4). We show explicitly how large these weights need to be for such equalities to occur. Since the $f$-divergence and the Wasserstein distance are quite different distortions in terms of their properties, this gives an interpretation of the effect of weighting each term.

We now outline the $f$-divergence case. We focus on $f$ convex and differentiable, and in this case we assume that $P_X$ is absolutely continuous with respect to $P_G$, so that the density ratio $\frac{dP_X}{dP_G}$ exists. We then have the following. Let $f$ be a convex function (with $f(1) = 0$) that is differentiable, let the reconstruction weight be sufficiently large, and suppose $P_X$ is absolutely continuous with respect to $P_G$ and that $G$ is invertible; then we have, for all $P_X, P_G$,

(Proof in Appendix, Section A.3.) One may actually pick an explicit value of the reconstruction weight for Theorem 4 to hold, noting that it is smaller than the general threshold since $f'$ is increasing ($f$ being convex). It is important to note that Theorem 4 therefore tells us that the objective with a weighted total variation reconstruction loss and an $f$-divergence prior regularization amounts to the $f$-divergence $D_f(P_X \,\|\, P_G)$. It was shown in (Nock et al., 2017) that when $G$ is an invertible feedforward neural network, $D_f(P_X \,\|\, P_G)$ is a Bregman divergence (a well regarded quantity in information geometry) between the parametrizations of the network, for a particular choice of activation function that depends on $f$. Hence, a practitioner should design $G$ with such an activation function when using $f$-WAE in the above setting with $G$ invertible, so that the information theoretic divergence ($D_f$) between the distributions becomes an information geometric divergence involving the network parameters.
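As a small illustration of the invertibility requirement discussed above (our construction, not the paper's architecture): a feedforward map with square weight matrices and a strictly monotone activation such as leaky-ReLU is invertible, and its inverse can be applied layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)

leaky = lambda x, a=0.2: np.where(x > 0, x, a * x)          # strictly monotone, hence invertible
leaky_inv = lambda y, a=0.2: np.where(y > 0, y, y / a)

W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))   # square weights, invertible almost surely

def G(z):
    """Two-layer feedforward generator with an invertible activation."""
    return W2 @ leaky(W1 @ z)

def G_inv(x):
    """Layer-by-layer inverse of G."""
    return np.linalg.solve(W1, leaky_inv(np.linalg.solve(W2, x)))

z = rng.normal(size=2)
print(z, G_inv(G(z)))                                        # recovers z up to numerical error
```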

We now show that if the regularization weight is selected high enough then the objective becomes the Wasserstein distance $W_d$ and, furthermore, we have equality between the $f$-GAN, $f$-WAE and WAE objectives. Let $d$ be a metric. For any convex function $f$ (with $f(1) = 0$), taking the regularization weight sufficiently large, we have, for all $P_X, P_G$,

(Proof in Appendix, Section A.4.) Note that Theorem 4 holds for any $f$ satisfying the properties of the Theorem, and so one can estimate the Wasserstein distance using any such $f$, as long as the regularization weight is scaled accordingly. In order to understand this threshold quantity, note that there are two extremes in which the supremum defining it may be unbounded. The first case is when $P_G$ is taken far from $P_X$, so that one of the terms in the supremum increases; note however that in this regime the other term can also diverge, so the overall quantity may remain finite. The other case is when $P_G$ is made close to $P_X$, in which case both terms shrink, so the quantity can still be small, depending on their relative rates of decrease. Now suppose that $d$ is the discrete metric and $f$ is chosen so that $D_f$ is the total variation distance, in which case both sides coincide. In this case, Theorem 4 reduces to the standard result that the Wasserstein distance and the $f$-divergence intersect at the variational divergence under these conditions.
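The special case just mentioned can be checked numerically: under the 0-1 (discrete) metric, the Wasserstein distance coincides with the total variation distance. The sketch below (ours) verifies this on two random distributions over five points by solving the optimal transport linear program.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))          # two distributions on {0, ..., 4}
q = rng.dirichlet(np.ones(5))

tv = 0.5 * np.abs(p - q).sum()         # total variation distance

# Wasserstein distance under the 0-1 (discrete) metric, via the OT linear program.
n = len(p)
C = 1.0 - np.eye(n)                    # d(i, j) = 0 if i == j else 1
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1     # row marginals
    A_eq[n + i, i::n] = 1              # column marginals
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]), bounds=(0, None))

print(tv, res.fun)                     # the two values agree up to solver tolerance
```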

5 Generalization bounds

We prove generalization bounds using machinery developed in (Weed and Bach, 2017) and thus introduce their definitions and notation. For a set $S \subseteq \mathcal{X}$, we denote by $\mathcal{N}_\varepsilon(S)$ the $\varepsilon$-covering number of $S$, which is the smallest $m$ such that there exist closed balls $B_1, \ldots, B_m$ of radius $\varepsilon$ with $S \subseteq \bigcup_{i=1}^{m} B_i$. For any $P \in \mathcal{P}(\mathcal{X})$ and $\tau > 0$, the $(\varepsilon, \tau)$-covering number is

$$\mathcal{N}_\varepsilon(P, \tau) := \inf\left\{ \mathcal{N}_\varepsilon(S) : P(S) \geq 1 - \tau \right\},$$

and the $(\varepsilon, \tau)$-dimension is

$$d_\varepsilon(P, \tau) := \frac{\log \mathcal{N}_\varepsilon(P, \tau)}{-\log \varepsilon}.$$

The $p$-Upper Wasserstein dimension of $P$ is

$$d_p^*(P) := \inf\left\{ s \in (2p, \infty) : \limsup_{\varepsilon \to 0} \; d_\varepsilon\!\left(P, \varepsilon^{\frac{sp}{s - 2p}}\right) \leq s \right\}.$$
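To make the covering-number machinery concrete, the following sketch (ours, not from the paper) computes a greedy cover of a finite sample; a greedy cover gives an upper bound on the $\varepsilon$-covering number rather than its exact value, and the printed ratio $\log N / (-\log \varepsilon)$ is a rough empirical analogue of the $(\varepsilon, \tau)$-dimension above.

```python
import numpy as np

def greedy_cover_size(points, eps):
    """Upper bound on the eps-covering number of a finite point set via greedy ball placement."""
    remaining = points.copy()
    n_balls = 0
    while len(remaining):
        center = remaining[0]                              # place a ball at the first uncovered point
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > eps]                  # discard everything this ball covers
        n_balls += 1
    return n_balls

rng = np.random.default_rng(0)
sample = rng.normal(size=(2000, 2))                        # empirical support of a 2-D Gaussian
for eps in (0.5, 0.25, 0.125):
    N = greedy_cover_size(sample, eps)
    print(eps, N, np.log(N) / -np.log(eps))                # dimension-like ratio, close to 2 here
```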

We make the assumption that $P_X$ and $P_G$ have bounded support to achieve the following bounds. For any $P \in \mathcal{P}(\mathcal{X})$ in a metric space $(\mathcal{X}, d)$, we define $\mathrm{diam}(P) := \mathrm{diam}(\mathrm{supp}(P))$. We are now ready to present the generalization bounds. Let $(\mathcal{X}, d)$ be a metric space and suppose $P_X, P_G \in \mathcal{P}(\mathcal{X})$ have bounded support. For any $n \in \mathbb{N}$, let $\hat{P}_X^n$ and $\hat{P}_G^n$ denote the empirical distributions with $n$ samples drawn i.i.d. from $P_X$ and $P_G$ respectively. Let $s > d_1^*(P_X)$ and $t > d_1^*(P_G)$. For all convex functions $f$ (with $f(1) = 0$) and $n \in \mathbb{N}$, we have

(9)

with probability at least $1 - \delta$ for any $\delta \in (0, 1)$; and if the weights are chosen as in Theorem 4, then we have, for all $n \in \mathbb{N}$,

(10)

with probability at least $1 - \delta$ for any $\delta \in (0, 1)$. (Proof in Appendix, Section A.2.) First, note that there is no requirement for $G$ to be invertible and no restriction on $f$. Second, the quantities appearing in the bounds (the diameters and upper Wasserstein dimensions) are influenced by the distributions $P_X$ and $P_G$. If $G$ is invertible in the above, then the left hand side of both bounds becomes the $f$-GAN objective by Theorem 4. One could suspect that the right hand side may be unbounded, by drawing parallels to $f$-divergences, which can be infinite. However, this is not the case:

since we optimize over encoders $Q$, and there exists a $Q$ whose aggregated posterior shares the support of the prior $P_Z$, the $f$-divergence term results in a bounded value. Using Theorem 4, one can set the regularization weight large enough so that the expressions in the bound become Wasserstein distances.

We now show that the $f$-WAE objective can be upper bounded by the Wasserstein distance. Consider the entropy regularized Wasserstein distance

$$W_c^\epsilon(P, Q) = \inf_{\pi \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \pi}\!\left[ c(x, y) \right] + \epsilon \cdot \mathrm{KL}(\pi \,\|\, P \otimes Q);$$

we have the following. For any $P_X, P_G \in \mathcal{P}(\mathcal{X})$, $\epsilon > 0$ and convex function $f$ (with $f(1) = 0$), we have

(11)

(Proof in Appendix, Section A.5.) Since our goal is to minimize the $f$-WAE objective, we can instead minimize these upper bounds. Since the upper bounds are Wasserstein distances, we can make use of existing efficient solvers for these quantities. Indeed, the majority of these solvers are concerned with discrete problems, which is exactly the setting of the empirical bounds presented in Theorem 5. Note also that, from Theorem 4, setting the regularization weight to higher values closes the gap in (11).
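As an illustration of the kind of regularized OT solver referred to here (a minimal sketch of Sinkhorn iterations, ours rather than the authors' implementation): the returned cost is that of an (approximately) feasible coupling and so approximates the Wasserstein distance between the two empirical measures from above.

```python
import numpy as np

def sinkhorn(p, q, C, eps=0.1, iters=2000):
    """Entropy regularized OT: returns <pi, C> for the Sinkhorn plan with regularization eps."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    pi = u[:, None] * K * v[None, :]     # approximate optimal coupling with the required marginals
    return (pi * C).sum()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(40, 2))
y = rng.normal(1, 1, size=(50, 2))
p = np.full(len(x), 1 / len(x))
q = np.full(len(y), 1 / len(y))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)   # metric cost d(x_i, y_j)

print(sinkhorn(p, q, C))   # entropic OT cost, an efficient proxy for W_d on empirical measures
```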

6 Discussion and Conclusion

This work is the first to prove a generalized primal-dual relationship between GANs and Autoencoders. Consequently, this elucidates the close performance between WAE and $f$-GANs. Furthermore, we explored the effect of weighting the reconstruction and regularization terms in the WAE objective, showing relationships to both $f$-divergences and Wasserstein metrics, along with the impact on the duality relationship. This equivalence allows us to prove generalization results which, to the best of our knowledge, are the first such bounds for Autoencoder models. Furthermore, using connections to Wasserstein metrics, we can employ efficient (regularized) OT solvers to approximate the upper bounds appearing in the generalization bounds, which involve discrete distributions and are thus natural for such solvers.

The consequences of unifying the two paradigms are plentiful, generalization bounds being one example. One way of continuing this line of work is to explore the case of a general cost (as opposed to a metric), invoking the generalized Wasserstein dual with the goal of forming a generalized GAN. Our paper provides a basis for unifying Adversarial Networks and Autoencoders through a primal-dual relationship, and opens doors for the further unification of related models.

We would like to acknowledge the anonymous reviewers and the support of the Australian Research Council and Data61.

References

Appendix A

A.1 Proof of Theorem 4

In order to prove the theorem, we make use of the dual form of the restricted variational form of an $f$-divergence: [(Liu and Chaudhuri, 2018), Theorem 3] Let $f$ denote a convex function with the property $f(1) = 0$, and suppose $\mathcal{T}$ is a convex subset of $\mathcal{F}(\mathcal{X}, \mathbb{R})$ with the property that for any $T \in \mathcal{T}$ and $a \in \mathbb{R}$, we have $T + a \in \mathcal{T}$. Then for any $P, Q \in \mathcal{P}(\mathcal{X})$ we have

The goal is now to set $\mathcal{T} = \mathcal{T}_{\mathrm{Lip}}$; however, there are some conditions of the above result that we must verify: if $d$ is a metric then $\mathcal{T}_{\mathrm{Lip}}$ is convex and closed under addition of constants. Let $T_1, T_2 \in \mathcal{T}_{\mathrm{Lip}}$ and define $T = \alpha T_1 + (1 - \alpha) T_2$ for some $\alpha \in [0, 1]$; we then have

Consider some $T \in \mathcal{T}_{\mathrm{Lip}}$ and set $T' = T + a$ for some $a \in \mathbb{R}$. We then have

for all $x \in \mathcal{X}$. We also require a lemma regarding the behaviour of $f$-divergences under pushforwards. Let $G \in \mathcal{F}(\mathcal{Z}, \mathcal{X})$ and let $P, Q$ be two distributions over $\mathcal{Z}$. We have that

$$D_f\!\left( G_\# P \,\|\, G_\# Q \right) \;\leq\; D_f(P \,\|\, Q),$$

with equality if $G$ is invertible. Furthermore, if $f$ is differentiable then we have equality under the weaker condition stated in Equation (8). By writing the variational form from (Nguyen et al., 2010) (Lemma 1), we have

where we used the preceding fact. If $G$ is invertible, then applying the above argument with $G^{-1}$, $G_\# P$ and $G_\# Q$ in place of $G$, $P$ and $Q$, we have

which is just the reverse direction of the inequality, and so equality holds. Suppose now that $f$ is differentiable; then note that equality holds when the supremum in the variational form is attained (see the proof of Lemma 1 in (Nguyen et al., 2010)), which is equivalent to asking whether there exists a function $T$ such that

For any $z \in \mathcal{Z}$, we can construct $T$ to map $G(z)$ to the required value and, due to the condition in the lemma, we can guarantee