dpVAE
Unsupervised representation learning via generative modeling is a staple of many computer vision applications in the absence of labeled data. Variational Autoencoders (VAEs) are powerful generative models that learn representations useful for data generation. However, due to inherent challenges in the training objective, VAEs fail to learn useful representations amenable to downstream tasks. Regularization-based methods that attempt to improve the representation learning aspect of VAEs come at a price: poor sample generation. In this paper, we explore this representation-generation trade-off for regularized VAEs and introduce a new family of priors, namely decoupled priors, or dpVAEs, that decouple the representation space from the generation space. This decoupling enables the use of VAE regularizers on the representation space without impacting the distribution used for sample generation, thereby reaping the representation learning benefits of the regularizers without sacrificing sample generation. dpVAE leverages invertible networks to learn a bijective mapping from an arbitrarily complex representation distribution to a simple, tractable, generative distribution. Decoupled priors can be adapted to state-of-the-art VAE regularizers without additional hyperparameter tuning. We showcase the use of dpVAEs with different regularizers. Experiments on MNIST, SVHN, and CelebA demonstrate, quantitatively and qualitatively, that dpVAE fixes sample generation for regularized VAEs.
Is it possible to learn a powerful generative model that matches the true data distribution with useful data representations amenable to downstream tasks in an unsupervised way? This question is the driving force behind most unsupervised representation learning via state-of-the-art generative modeling methods (e.g. [58, 20, 23, 9]), with applications in artificial creativity [41, 37, 21], few-shot learning [44], and semi-supervised learning [26]. A common theme behind such works is learning the data generation process using latent variable models [5, 1] that seek to learn representations useful for data generation; an approach known as analysis-by-synthesis [57, 39].

The Variational Autoencoder (VAE) [24, 46] marries latent variable models and deep learning by having independent, network-parameterized generative and inference models that are trained jointly to maximize the marginal log-likelihood of the training data. VAE introduces a variational posterior distribution that approximates the true posterior to derive a tractable lower bound on the marginal log-likelihood, a.k.a. the evidence lower bound (ELBO). The ELBO is then maximized using stochastic gradient descent by virtue of the reparameterization trick [24, 46]. Among the many successes of VAEs in representation learning tasks, VAE-based methods have demonstrated state-of-the-art performance on semi-supervised image and text classification tasks [33, 26, 49, 44, 43, 56].

Representation learning via VAEs is ill-posed due to the disconnect between the ELBO and the downstream task [54]. Specifically, optimizing the marginal log-likelihood is not always sufficient for good representation learning, due to inherent challenges rooted in the ELBO that result in a tendency to ignore latent variables and not encode information about the data in the latent space [54, 1, 58, 10, 22]. To improve the representations learned by VAEs, a slew of regularizations have been proposed. Many of these regularizers act on the VAE latent space to promote specific characteristics in the learned representations, such as disentanglement [58, 20, 23, 37, 8, 28] and informative latent codes [35, 2]. However, better representation learning usually sacrifices sample generation, which is manifested by a distribution mismatch between the marginal (a.k.a. aggregate) latent posterior and the latent prior. This mismatch results in latent pockets and leaks: a submanifold structure in the latent space (a phenomenon demonstrated in Figure 1 and explored in more detail in section 4.1). Latent pockets contain samples that are highly supported under the prior but not covered by the aggregate posterior (i.e. low-posterior samples [47]), while latent leaks contain samples supported under the aggregate posterior but less likely to be generated under the prior (i.e. high-posterior samples). This behavior has been reported for the vanilla VAE [47], but it is accentuated when the ELBO is augmented with regularizers (see Figure 1).
To address this representation-generation trade-off for regularized VAEs, we introduce the idea of decoupling the latent space for representation (representation space) from the space that drives sample generation (generation space); presenting a general framework for VAE regularization. To this end, we propose a new family of latent priors for VAEs — decoupled priors or dpVAEs — that leverages the merits of invertible deep networks. In particular, dpVAE transforms a tractable, simple base prior distribution in the generation space to a more expressive prior in the representation space that reflects the submanifold structure dictated by the regularizer. This is done using an invertible mapping that is jointly trained with the VAE’s inference and generative models. State-of-the-art VAE regularizers can thus be directly plugged in to promote specific characteristics in the representation space without impacting the distribution used for sample generation. We showcase, quantitatively and qualitatively, that dpVAE with different state-of-the-art regularizers improve sample generation, without sacrificing their representation learning benefits.
It is worth emphasizing that, being likelihood-based models, VAEs are trained to put probability mass on all training samples, forcing the model to over-generalize [48] and generate blurry (i.e. off-the-data-manifold) samples. This is in contrast to generative adversarial networks (GANs) [16], which generate images of outstanding quality but lack full data support [3]. dpVAE is not expected to resolve the over-generalization problem in VAEs, but to mitigate the poor sample quality resulting from regularization.
The contribution of this paper is fourfold:
Analyze the latent submanifold structure induced by VAE regularizers.
Introduce a decoupled prior family for VAEs as a general regularization framework that improves sample generation without sacrificing representation learning.
Derive the dpVAE ELBO of state-of-the-art regularized VAEs: β-dpVAE, β-TC-dpVAE, Factor-dpVAE, and Info-dpVAE.
Demonstrate empirically on three benchmark datasets the improved generation performance and the preservation of representation characteristics promoted via regularizers without additional hyperparameter tuning.
To improve sample quality, a family of approaches exists that combines the inference capability of VAEs with the outstanding sample quality of GANs [16]. Leveraging the density ratio trick [16, 51], which only requires samples, VAE-GAN hybrids in the latent (e.g. [38, 36]), data (e.g. [47, 48]), and joint (both latent and data, e.g. [50]) spaces avoid restrictions to explicit posterior and/or likelihood distribution families, paving the way for matching the marginals [47]. However, such hybrids scale poorly with latent dimensions, lack accurate likelihood bound estimates, and do not provide better quality samples than GAN variants [47]. Expressive posterior distributions can lead to better sample quality [38, 27] and are essential to prevent latent variables from being ignored in the case of powerful generative models [10]. But results in [47] suggest that the posterior distribution is not the main learning roadblock for VAEs.

More recently, the key role of the prior distribution family in VAE training has been investigated [22, 47]; poor latent representations are often attributed to restricting the latent prior to an overly simplistic distribution (e.g. standard normal). This motivates several works to enrich VAEs with more expressive priors. Bauer and Mnih addressed the distribution mismatch between the aggregate posterior and the latent prior by learning a sampling function, parameterized by a neural network, in the latent space [4]. However, this resampled prior requires the estimation of a normalization constant and dictates an inefficient iterative sampling scheme; truncated sampling can be used at the price of a less expressive prior due to smoothing. Tomczak and Welling proposed the variational mixture of posteriors prior (VampPrior), a parameterized mixture distribution in the latent space given by a fixed number of learnable pseudo (i.e. virtual) data points [53]. VampPrior sampling is non-iterative and therefore fast. However, density evaluation is expensive due to the large number of pseudo points, typically on the order of hundreds, required to match the aggregate posterior [4]. A cheaper version is the mixture of Gaussians prior proposed in [11], which gives inferior performance compared to VampPrior and is more challenging to optimize [4]. Autoregressive priors (e.g. [17, 19]) come with fast density evaluation but a slow, sequential sampling process. With different trade-offs between expressiveness and efficiency, none of these methods address the fundamental challenge of VAE training in concert with existing representation-driven regularization frameworks.

The proposed decoupled prior is inspired by flow-based generative models [27, 12, 13, 45], which have shown their efficacy in generating images (e.g. GLOW [25]). Such methods hinge on architectural designs that make the model invertible. However, these models are not used for representation learning, since the data and latent spaces are required to have the same dimension.
In this section, we briefly lay down the foundations and motivations essential for the proposed VAE formulation.
VAE seeks to match the learned model distribution p_θ(x) to the true data distribution p(x), where x is the observed variable in the data space. The generative and inference models in VAEs are thus jointly trained to maximize a tractable lower bound on the marginal log-likelihood of the training data, where z is an unobserved latent variable in the latent space with a prior distribution p(z), such as N(0, I_d).

$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) \qquad (1)$$

where θ denotes the generative model parameters, φ denotes the inference model parameters, and q_φ(z|x) is the variational posterior distribution that approximates the true posterior p_θ(z|x), with x ∈ R^D, z ∈ R^d, and d < D.
Since the ELBO seeks to match the marginal data distribution without penalizing poor-quality latent representations, a VAE can easily ignore latent variables if a sufficiently expressive generative model is used (e.g. PixelCNN [55]) and still maximize the ELBO [1, 6, 10], a property known as information preference [10, 58]. Furthermore, VAEs have a tendency not to encode information about the observed data in the latent codes, since maximizing the ELBO inherently minimizes the mutual information between x and z [22]. Without further assumptions or inductive biases, these failure modes hinder learning useful representations for downstream tasks.
The proposed decoupled prior family for VAEs leverages flow-based generative models that are formed by a sequence of invertible blocks (i.e. transformations), parameterized by deep networks. Consider two random variables z ∈ R^d and z_0 ∈ R^d. There exists a bijective mapping between z and z_0 defined by a function f_η, where z_0 = f_η(z), and its inverse f_η^{-1} such that z = f_η^{-1}(z_0). Given the above condition, we can define the change of variable formula for mapping the probability distribution on z_0 to z as follows:

$$p_\eta(z) = p\big(z_0 = f_\eta(z)\big)\left|\det\!\left(\frac{\partial f_\eta(z)}{\partial z}\right)\right| \qquad (2)$$
By maximizing the log-likelihood and parameterizing the invertible blocks with deep networks, flow-based methods learn to transform a simple, tractable base distribution (e.g. standard normal) into a more expressive one. To model distributions with arbitrary dimensions, the bijection needs to be defined such that the Jacobian determinant can be computed in closed form. Dinh et al. [13] proposed affine coupling layers to build a flexible bijection by stacking a sequence of simple bijection blocks of the form

$$y_k = b_k \odot h_{k-1} \qquad (3)$$
$$h_k = y_k + (1-b_k) \odot \big(h_{k-1} \odot \exp\!\big(s_{\beta_k}(y_k)\big) + t_{\gamma_k}(y_k)\big) \qquad (4)$$

where h_0 = z and h_K = z_0, ⊙ is the Hadamard (i.e. element-wise) product, b_k is a binary mask used for partitioning the k-th block input, and β_k and γ_k are the deep network parameters of the scaling and translation functions of the k-th block.
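The coupling block above can be sketched in a few lines of numpy. The fixed random matrices below are hypothetical stand-ins for the trained scaling and translation networks; a real model would learn them. The point is that the inverse is exact and closed-form because the masked coordinates pass through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical stand-ins for the learned scaling s(.) and translation t(.)
# networks of a single coupling block (a real model trains these).
W_s = 0.1 * rng.standard_normal((d, d))
W_t = 0.1 * rng.standard_normal((d, d))
s = lambda h: np.tanh(h @ W_s)
t = lambda h: h @ W_t
b = np.array([1.0, 1.0, 0.0, 0.0])  # binary mask: first half passes through

def coupling_forward(z):
    """Affine coupling block: masked part unchanged, rest scaled and shifted."""
    h = b * z
    return h + (1 - b) * (z * np.exp(s(h)) + t(h))

def coupling_inverse(y):
    """Exact inverse: the masked part of y equals the masked part of z."""
    h = b * y
    return h + (1 - b) * (y - t(h)) * np.exp(-s(h))

z = rng.standard_normal(d)
z_rec = coupling_inverse(coupling_forward(z))
print(np.allclose(z, z_rec))  # True: the block is bijective by construction
```

Because the scale and translation only ever see the masked half of the input, inverting the block never requires inverting the networks themselves.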
In this section, we formally define and analyze how VAE regularizations affect the generative property of VAEs. We also present the decoupled prior family for VAEs (see Figure 2) and analyze its utility in solving the submanifold problem of state-of-the-art regularization-based VAEs.
ELBO regularization is a conventional mechanism for enforcing inductive biases (e.g. disentanglement [58, 20, 23, 37, 8, 28] and informative latent codes [35, 2]) to improve the representation learning aspect of VAEs [54]. These methods have shown their efficacy in learning good representations but neglect the generative property. Empirically, these regularizations improve the learned latent representation but inherently cause a mismatch between the aggregate posterior q_φ(z) and the prior p(z). This mismatch leads to latent pockets and leaks, i.e. a submanifold structure in the aggregate posterior that results in poor generative capabilities. Specifically, if a sample z ∼ p(z) (i.e. likely to be generated under the prior) lies in a pocket (i.e. q_φ(z) is low), then its corresponding decoded sample will not lie on the data manifold. We call this problem, caused by VAE regularizations, the submanifold problem.

To better understand this phenomenon, we define two types of samples in the VAE latent space that correspond to two VAE failure modes.
Low-Posterior (LP) samples: samples that are highly likely to be generated under the prior (i.e. p(z) is high) but are not covered by the aggregate posterior (i.e. q_φ(z) is low). Low-posterior samples are typically generated from the latent pockets dictated by the regularizer(s) used, and are of poor quality since they lie off the data manifold. To generate low-posterior samples, we follow the logic of [47]: we sample z ∼ p(z), rank the samples according to their aggregate posterior support, i.e. the values of q_φ(z), and choose the samples with the lowest aggregate posterior values. In the case of dpVAEs, samples are generated from p(z_0), which is a standard normal, and then transformed by f_η^{-1} before being plugged into the aggregate posterior.

High-Posterior (HP) samples: samples supported under the aggregate posterior (i.e. q_φ(z) is high) but less likely to be generated under the prior (i.e. p(z) is low). Specifically, these are samples in the latent space that can produce good generated samples but are unlikely to be sampled due to the low support of the prior; they are thus samples that lie in the latent leaks. To generate high-posterior samples, we sample from q_φ(z), rank the samples according to their prior support, i.e. the values of p(z), and choose the samples with the lowest prior support values. In the case of dpVAEs, samples are first mapped to the generation space by f_η before computing prior probabilities.

In summary, a VAE performs well in the generative sense if the latent space is free of pockets and leaks. A pocket-free latent space is manifested by low-posterior samples that lie on the data manifold when mapped to the data space via the decoder. In a latent space with leaks, high-posterior samples are supported by the aggregate posterior yet have only tiny probability under the prior, and thereby regions of the data manifold are rarely generated. This submanifold problem is demonstrated using four state-of-the-art VAE regularizers (see Figure 1 and Figure 3). With β-VAE [20], FactorVAE [23], and β-TCVAE [8], we can clearly see that the low-posterior samples lie in the latent pockets formed in the aggregate posterior (see Figure 3b) and outside the data manifold (see Figure 3c), causing sample generation to be very noisy (see Figure 3a). In the case of InfoVAE, the low-posterior samples lie in regions with little aggregate posterior support, causing slightly noisy sample generation (see Figure 3a). More importantly, there are high-posterior samples that come from q_φ(z) but can very rarely be produced by a standard normal prior. With InfoVAE, for instance, the model fails to generate samples that lie on the tail end of the top moon.
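The LP/HP sampling procedure above can be illustrated with toy densities. The 2-D standard-normal prior and the two-mode mixture standing in for the aggregate posterior are assumptions for illustration, not the paper's learned models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: a 2-D standard normal prior and a two-component Gaussian
# mixture "aggregate posterior" whose modes leave a pocket between them.
def log_prior(z):
    return -0.5 * np.sum(z**2, axis=-1) - np.log(2 * np.pi)

means = np.array([[-2.0, 0.0], [2.0, 0.0]])
def log_agg_posterior(z):
    # log density of an equal-weight, unit-variance 2-component mixture
    d2 = ((z[:, None, :] - means[None]) ** 2).sum(-1)
    return np.logaddexp.reduce(-0.5 * d2, axis=1) - np.log(2) - np.log(2*np.pi)

# Low-posterior (LP) samples: draw from the prior, keep those with the
# least aggregate-posterior support (they sit in latent pockets).
z = rng.standard_normal((5000, 2))
lp = z[np.argsort(log_agg_posterior(z))[:100]]

# High-posterior (HP) samples: draw from the aggregate posterior, keep those
# with the least prior support (they sit in latent leaks).
comp = rng.integers(0, 2, size=5000)
zq = means[comp] + rng.standard_normal((5000, 2))
hp = zq[np.argsort(log_prior(zq))[:100]]

print(lp.shape, hp.shape)  # (100, 2) (100, 2)
```

Decoding the LP set through a trained decoder (omitted here) is exactly the diagnostic the paper uses: if those decodings fall off the data manifold, the latent space has pockets.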
Although VAE regularizers improve latent representations, they sacrifice sample generation through the introduction of latent pockets and leaks. To fix sample generation, we propose decoupling the representation and generation spaces (see Figure 2a for an illustration). This is demonstrated for β-VAE with and without the decoupled prior in Figure 1, where the decoupled generation space is used for generation and all the low-posterior samples lie on the data manifold. We formulate this prior in detail in the following section.
The decoupled prior family, as the name suggests, decouples the latent space that supports the representation from the space that drives sample generation. For this decoupling to be meaningful, the representation and generation spaces should be related by a functional mapping. The decoupled prior effectively learns the latent space distribution by simultaneously learning this functional mapping together with the generative and inference models during optimization.

Specifically, the latent variables z and z_0 are the random variables of the representation and generation spaces, respectively, where z, z_0 ∈ R^d. The bijective mapping between the representation space and the generation space is defined by an invertible function f_η, parameterized by the network parameters η. VAE regularizers still act on the posteriors in the representation space, i.e. q_φ(z|x) and/or q_φ(z), without affecting the latent distribution p(z_0) of the generation space. Sample generation starts by sampling z_0 ∼ p(z_0), passing it through the inverse mapping f_η^{-1} to obtain z, which is then decoded by the generative model p_θ(x|z) (see Figure 2a). These decoupled spaces allow any modification in the representation space dictated by the regularizer to infuse its submanifold structure into that space (see Figure 3b) without significantly impacting the generation space (see Figure 3c), thereby improving sample generation for regularized VAEs (see Figure 3a). Moreover, the decoupled prior is an expressive prior that is learned jointly with the VAE, and can thereby match an arbitrarily complex aggregate posterior q_φ(z), thanks to the flexibility of deep networks to model complex mappings. Additionally, due to the bijective mapping f_η, we have a one-to-one correspondence between samples in the representation space and those in the generation space.
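A minimal sketch of the dpVAE sampling pipeline, assuming a fixed (untrained) two-block coupling flow standing in for f_η and a hypothetical stand-in decoder — a trained model would supply both:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# Assumption: fixed random coupling parameters stand in for the trained
# bijection f_eta between representation and generation spaces.
masks = [np.array([1., 1., 0., 0.]), np.array([0., 0., 1., 1.])]
Ws = [0.1 * rng.standard_normal((d, d)) for _ in masks]
Wt = [0.1 * rng.standard_normal((d, d)) for _ in masks]

def f(z):  # representation space -> generation space
    for b, ws, wt in zip(masks, Ws, Wt):
        h = b * z
        z = h + (1 - b) * (z * np.exp(np.tanh(h @ ws)) + h @ wt)
    return z

def f_inv(z0):  # generation space -> representation space
    for b, ws, wt in zip(reversed(masks), reversed(Ws), reversed(Wt)):
        h = b * z0
        z0 = h + (1 - b) * (z0 - h @ wt) * np.exp(-np.tanh(h @ ws))
    return z0

def decoder(z):  # hypothetical stand-in for the generative model p(x|z)
    return np.tanh(z)

# dpVAE sampling: draw in the simple generation space, invert, decode.
z0 = rng.standard_normal(d)
x = decoder(f_inv(z0))
print(np.allclose(f(f_inv(z0)), z0))  # True: one-to-one correspondence
```

Note that the regularizer never sees z_0; it only touches z, which is the whole point of the decoupling.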
To derive the ELBO for dpVAE, we replace the standard normal prior in (3.1) with the decoupled prior defined in (2). Using the change of variable formula, the KL divergence term in (3.1) can be simplified into (complete derivation in Appendix A):

$$\mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\eta(z)\big) = -\sum_{j=1}^{d}\log\sigma_j + \frac{1}{2}\,\mathbb{E}_{q_\phi(z|x)}\big[\|f_\eta(z)\|^2\big] - \mathbb{E}_{q_\phi(z|x)}\Big[\sum_{k=1}^{K}\sum_{j=1}^{d}(1-b_k)_j\,\big[s_{\beta_k}(b_k \odot h_{k-1})\big]_j\Big] + \text{const} \qquad (5)$$

where d is the latent dimension, K is the number of invertible blocks used to define the decoupled prior (see (9)), s_{β_k} is the scaling network of the k-th block with input h_{k-1} (h_0 = z), Σ_φ = diag(σ_1², …, σ_d²) is the covariance matrix of the variational posterior (which is typically assumed to be diagonal), and σ_j and (1−b_k)_j are the j-th elements of the σ and (1−b_k) vectors, respectively.
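The last term of the KL rests on the coupling layer's triangular Jacobian, whose log-determinant is just the sum of scaling outputs on the transformed coordinates. A quick numerical sanity check of that identity, with hypothetical fixed block parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
b = np.array([1., 1., 0., 0.])          # binary mask of the block
W_s = 0.1 * rng.standard_normal((d, d)) # stand-in scaling network weights
W_t = 0.1 * rng.standard_normal((d, d)) # stand-in translation network weights

def block(z):
    h = b * z
    return h + (1 - b) * (z * np.exp(np.tanh(h @ W_s)) + h @ W_t)

z = rng.standard_normal(d)

# Closed form: sum of scaling outputs on the unmasked coordinates.
closed = np.sum((1 - b) * np.tanh((b * z) @ W_s))

# Numerical Jacobian (central differences), column j = d(block)/dz_j.
eps = 1e-6
J = np.stack([(block(z + eps*e) - block(z - eps*e)) / (2*eps)
              for e in np.eye(d)], axis=1)
numeric = np.log(abs(np.linalg.det(J)))
print(np.isclose(closed, numeric, atol=1e-6))  # True
```

This is what makes the decoupled-prior KL cheap: no determinant is ever computed at train time, only forward passes of the scaling networks.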
The KL divergence in (5) can be directly used in any regularized ELBO. However, some regularized models, such as β-TCVAE [8] and InfoVAE [58], introduce additional terms involving the prior beyond KL(q_φ(z|x) ‖ p(z)). These regularizers need to be modified when used with decoupled priors (the ELBO definitions for these regularizers can be found in Appendix B).

β-dpVAE: For β-VAE (both the β-VAE-H [20] and β-VAE-B [7] versions), the only difference in the ELBO (3.1) is the reweighting of the KL term and the addition of certain constraints, without introducing any additional terms. Hence, β-dpVAE retains the same reweighting and constraints, and only modifies the KL divergence term according to (5).

Factor-dpVAE: FactorVAE [23] adds a total correlation term KL(q_φ(z) ‖ Π_{j=1}^d q_φ(z_j)) to the ELBO in (3.1), where q_φ(z) is the aggregate posterior and z_j is the j-th element of z. This term promotes disentanglement of the latent dimensions of z, impacting the representation learning aspect of the VAE. Hence, in the case of the decoupled prior, the total correlation term should be applied in the representation space. In this sense, the decoupled prior only affects the KL divergence term, as described in (5), for the Factor-dpVAE model.

β-TC-dpVAE: The regularization provided by β-TCVAE [8] factorizes the ELBO into the individual latent dimensions based on the decomposition given in [22]. The only term that includes the prior is the KL divergence between the marginals, i.e. Σ_j KL(q_φ(z_j) ‖ p(z_j)). This term in β-TCVAE is assumed to be factorized and is evaluated via sampling, facilitating the direct incorporation of the decoupled prior: we can simply sample z_0 from the base distribution p(z_0) and compute the corresponding sample z using f_η^{-1}.
Info-dpVAE: In InfoVAE [58], the additional term in the ELBO is again a divergence between the aggregate posterior and the prior, i.e. D(q_φ(z) ‖ p(z)). This divergence term can be realized by different divergence families: adversarial training [36], the Stein variational gradient [31], and maximum-mean discrepancy (MMD) [18, 30, 15]. However, adversarial divergences can suffer from unstable training, and the Stein variational gradient scales poorly with high dimensions [58]. Motivated by the MMD-based results in [58], we focus here on the MMD divergence to evaluate this marginal divergence. For Info-dpVAE, we start with the ELBO of InfoVAE and modify the standard KL divergence term using (5). In addition, we compute the marginal divergence using MMD, which quantifies the divergence between two distributions by comparing their moments through sampling. Similar to β-TC-dpVAE, we can sample from p(z_0) and use the inverse mapping f_η^{-1} to compute samples in the representation space.

Table 1. Generation metrics; columns are grouped by dataset (MNIST, SVHN, CelebA), each reporting FID on prior samples, FID on low-posterior (LP) samples, sKL, and NLL.

| Methods | FID | FID (LP) | sKL | NLL | FID | FID (LP) | sKL | NLL | FID | FID (LP) | sKL | NLL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VAE [24, 46] | 137.4 | 165.0 | 1.26 | 3.56 | 78.9 | 83.8 | 53.67 | 0.386 | 81.4 | 79.0 | 59.3 | 9.26 |
| dpVAE | 129.0 | 153.1 | 0.88 | 3.53 | 50.8 | 55.2 | 13.02 | 0.318 | 71.5 | 74.3 | 10.4 | 4.91 |
| β-VAE-H [20] | 144.2 | 163.1 | 4.49 | 4.12 | 96.7 | 97.6 | 10.35 | 0.611 | 80.3 | 79.9 | 39.7 | 6.93 |
| β-dpVAE-H | 127.1 | 127.4 | 1.07 | 2.98 | 65.2 | 67.7 | 4.05 | 0.592 | 67.2 | 72.5 | 33.5 | 10.6 |
| β-VAE-B [7] | 130.8 | 163.6 | 2.74 | 3.11 | 61.7 | 68.5 | 2.62 | 0.606 | 75.7 | 79.6 | 25.8 | 12.5 |
| β-dpVAE-B | 113.3 | 114.1 | 1.32 | 2.80 | 51.1 | 50.3 | 2.47 | 0.550 | 67.9 | 72.0 | 19.1 | 10.4 |
| β-TCVAE [8] | 149.8 | 200.3 | 4.48 | 2.91 | 69.2 | 70.5 | 7.76 | 9.86 | 83.8 | 83.0 | 93.6 | 9.33 |
| β-TC-dpVAE | 133.3 | 133.1 | 2.07 | 2.70 | 50.3 | 53.8 | 2.94 | 4.52 | 80.3 | 81.4 | 90.3 | 10.0 |
| FactorVAE [23] | 130.5 | 191.2 | 1.04 | 3.50 | 97.2 | 108.5 | 1.91 | 2.13 | 82.6 | 86.8 | 71.3 | 9.89 |
| Factor-dpVAE | 120.8 | 121.3 | 0.85 | 3.60 | 86.3 | 86.9 | 1.57 | 2.36 | 65.0 | 73.4 | 51.3 | 12.2 |
| InfoVAE [58] | 128.7 | 133.2 | 2.89 | 2.88 | 81.3 | 83.2 | 4.91 | 1.55 | 76.5 | 79.1 | 30.6 | 11.1 |
| Info-dpVAE | 110.1 | 110.5 | 1.70 | 2.81 | 62.9 | 67.7 | 2.67 | 1.56 | 68.9 | 72.9 | 20.3 | 12.1 |
We experiment with three benchmark image datasets: MNIST [29], SVHN [40], and CelebA (cropped version) [32]. We train on these datasets with VAE [24, 46] and five regularized VAEs: β-VAE-H [20], β-VAE-B [7], β-TCVAE [8], FactorVAE [23], and InfoVAE [58]. We showcase, qualitatively and quantitatively, that dpVAEs improve sample generation while retaining the representation learning benefits provided by the regularizers (architecture, hyperparameter, and train/test split details for each dataset are described in Appendix C).
We use the following quantitative metrics to assess the generative performance of the regularized VAEs with the decoupled prior in contrast to the standard normal prior.
Fréchet Inception Distance (FID): The FID score is based on statistics, assuming a Gaussian distribution, computed in the feature space defined by inception network features [52]. The FID score quantifies both sample diversity and quality; lower FID means better sample generation.

Symmetric KL Divergence (sKL): To quantify the overlap between q_φ(z) and p(z) in the representation space (p(z) being the decoupled prior p_η(z) for dpVAEs, or the standard normal otherwise), we compute sKL = KL(q_φ(z) ‖ p(z)) + KL(p(z) ‖ q_φ(z)) through sampling (using 5,000 samples). sKL also captures the existence of pockets and leaks in the latent space. Lower sKL implies better overlap between q_φ(z) and p(z), indicating better generative capabilities.
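A sketch of the sampling-based sKL estimate. Two unit-variance Gaussians are assumed here purely so the estimate can be checked against the analytic value ‖μ‖²; in the paper, q is the aggregate posterior and p the (decoupled) prior:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy densities (assumption): both Gaussians, so sKL has a known value.
def log_p(z):   # standard normal
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * z.shape[-1] * np.log(2*np.pi)

mu = np.array([1.0, 0.0])
def log_q(z):   # shifted unit-variance normal
    return log_p(z - mu)

n = 5000
zq = mu + rng.standard_normal((n, 2))  # samples from q
zp = rng.standard_normal((n, 2))       # samples from p

# sKL = KL(q||p) + KL(p||q), each estimated by Monte Carlo.
skl = np.mean(log_q(zq) - log_p(zq)) + np.mean(log_p(zp) - log_q(zp))
print(skl)  # ≈ ||mu||^2 = 1 for these two Gaussians
```

For a dpVAE, log p would be evaluated via the change-of-variables density p_η(z) rather than the base normal directly.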
Negative Log-Likelihood (NLL): We estimate the likelihood of held-out samples under a trained model using importance sampling (with 21,000 samples), as proposed in [46], where p_θ(x) ≈ (1/S) Σ_{s=1}^{S} p_θ(x|z_s) p(z_s) / q_φ(z_s|x) with z_s ∼ q_φ(z|x). Lower NLL means better sample generation, since the learned model supports unseen samples drawn from the data distribution.
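The importance-sampling estimator can be sketched on a toy linear-Gaussian model, an assumption chosen so the marginal likelihood is analytic; in a VAE, q(z|x) would be the learned encoder and p(x|z) the decoder likelihood:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy model (assumption): z ~ N(0,1), x|z ~ N(z,1)  =>  x ~ N(0,2) exactly.
def log_normal(v, mean, var):
    return -0.5 * ((v - mean)**2 / var + np.log(2*np.pi*var))

x = 1.3
S = 20000
# Proposal q(z|x): here the exact posterior N(x/2, 1/2) of the toy model.
z = x/2 + np.sqrt(0.5) * rng.standard_normal(S)

log_w = (log_normal(x, z, 1.0)        # log p(x|z)
         + log_normal(z, 0.0, 1.0)    # log p(z)
         - log_normal(z, x/2, 0.5))   # log q(z|x)
# log p(x) ~= log( (1/S) * sum_s exp(log_w_s) ), computed stably.
log_px = np.logaddexp.reduce(log_w) - np.log(S)
print(abs(log_px - log_normal(x, 0.0, 2.0)) < 1e-2)  # matches analytic marginal
```

The closer the proposal is to the true posterior, the lower the variance of the estimate; with the exact posterior the weights are constant.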
Leakage Score: To assess the effect of decoupled priors on latent leaks (as manifested by high-posterior samples), we devise a new metric based on log-probability differences. We sample z from the aggregate posterior q_φ(z). If log q_φ(z) − log p(h(z)) > δ, where δ is a chosen threshold value, then we consider the sample to lie in a "leakage region" at the δ level, since the sample is better supported under the aggregate posterior than under the prior (see illustration in Figure 4). By definition, these leakage regions are unlikely to be sampled from the prior. In order not to lose significant regions of the data manifold, we want the aggregate posterior at these samples to attain low values as well. To quantify latent leakage for a trained model, we propose a leakage score s_δ = E_{z∼q_φ(z)}[ℓ_δ(z)], where, for a given z at a particular threshold δ, ℓ_δ(z) is defined as:

$$\ell_\delta(z) = q_\phi(z)\,\mathbb{1}\big[\log q_\phi(z) - \log p(h(z)) > \delta\big] \qquad (6)$$

Here, h is the identity function in VAEs with a standard normal prior, and h = f_η for dpVAEs.
In Table 1, we observe that dpVAEs perform better than their corresponding regularized VAEs without the decoupled prior on each dataset. When comparing VAEs with and without decoupled priors (e.g. InfoVAE and Info-dpVAE), we use the same hyperparameters and perform no additional tuning. This showcases the robustness of the decoupled prior w.r.t. hyperparameters, facilitating its direct use with any regularized VAE. We report FID scores on both randomly generated samples from the prior and low-posterior samples. As analyzed in section 4.1, if the low-posterior samples lie on the data manifold, then the learned latent space is pocket-free. The results in Table 1 show that for all dpVAEs the FID scores for randomly generated and low-posterior samples are comparable, suggesting that the pockets in the latent space are filled. Qualitative sample generation results for CelebA and MNIST are shown in Figure 5. We show both random prior and low-posterior sample generation, with and without the decoupled prior, for three different regularizers. The quality of samples generated using dpVAEs is better than, or on par with, those without the decoupled prior. More importantly, one can observe a significant quality improvement in the low-posterior samples, which aligns with the quantitative results in Table 1. In Figure 4, we report the leakage score as a function of the log-probability threshold for different regularizers with and without decoupled priors. We observe that dpVAEs result in models with lower latent leakage. This is especially true at lower thresholds, which suggests that even where the prior support is small, the aggregate posterior is small as well, preventing the loss of significant regions of the data manifold.
We perform latent traversals between samples in the latent space. In VAEs, we expect instances of the traversal path to cross latent pockets, resulting in poor intermediate samples. In contrast, dpVAEs map a linear traversal in the generation space to a non-linear traversal in the representation space that avoids low-probability regions. This is observed for MNIST traversals and is depicted in Figure 6. We also qualitatively observe similar behavior in CelebA traversals (see Figure 7). Finally, we want to attest that the addition of the decoupled prior to a regularizer does not affect its ability to improve the latent representation. We demonstrate this by observing latent factor traversals for CelebA trained with Factor-dpVAE, where we vary one dimension of the latent space while fixing the others. Factor-dpVAE is able to isolate different attributes of variation in the data, as shown in Figure 8.
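The traversal scheme can be sketched with a toy flow — linear interpolation in the generation space, mapped back through the inverse coupling blocks (fixed random parameters stand in for the trained bijection):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4

# Assumption: a fixed two-block coupling flow replaces the trained f_eta.
masks = [np.array([1., 1., 0., 0.]), np.array([0., 0., 1., 1.])]
Ws = [0.3 * rng.standard_normal((d, d)) for _ in masks]
Wt = [0.3 * rng.standard_normal((d, d)) for _ in masks]

def f_inv(z0):  # generation space -> representation space
    for b, ws, wt in zip(reversed(masks), reversed(Ws), reversed(Wt)):
        h = b * z0
        z0 = h + (1 - b) * (z0 - h @ wt) * np.exp(-np.tanh(h @ ws))
    return z0

# Linear path between two generation-space samples ...
za0, zb0 = rng.standard_normal(d), rng.standard_normal(d)
ts = np.linspace(0.0, 1.0, 9)
# ... becomes a curved path in the representation space.
path = np.stack([f_inv((1 - t) * za0 + t * zb0) for t in ts])
print(path.shape)  # (9, 4)
```

Each point on the curved path would then be decoded; because the straight line stays in high-density regions of the standard normal, its image avoids the pockets of the representation space.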
In this paper, we define and analyze the submanifold problem for regularized VAEs: the tendency of a regularizer to accentuate the creation of pockets and leaks in the latent space. This submanifold structure manifests the mismatch between the aggregate posterior and the latent prior, which in turn degrades generation quality. To overcome this trade-off between sample generation and latent representation, we propose the decoupled prior family as a general regularization framework for VAEs and demonstrate its efficacy on state-of-the-art VAE regularizers. We show, via qualitative and quantitative results, that dpVAEs generate better quality samples than their standard-normal-prior counterparts. Additionally, we qualitatively observe that the representation learning improvements promoted by the regularizers are not adversely affected by dpVAEs. Decoupled priors can act as a pathway to realizing the true potential of VAEs as both a representation learning and a generative modeling framework. Further work in this direction includes exploring more expressive inference and generative models (e.g. PixelCNN [55]) in conjunction with decoupled priors. We also believe more sophisticated invertible architectures (e.g. RAD [14]) and base distributions will provide further improvements.
This appendix includes the derivation of the dpVAE training objective, the ELBO definitions of state-of-the-art VAE regularizers with the decoupled prior, and experimental details (architecture, hyperparameters, and train/test splits) for the MNIST, SVHN, and CelebA experiments.
Consider a bijective mapping between z and z_0 defined by a function f_η, with z_0 = f_η(z). The change of variable formula for mapping the probability distribution on z_0 to z is given as follows:

$$p_\eta(z) = p\big(z_0 = f_\eta(z)\big)\left|\det\!\left(\frac{\partial f_\eta(z)}{\partial z}\right)\right| \qquad (7)$$
The bijection f_η is parameterized by K affine coupling layers; each layer is a bijection block of the form

$$h_k = b_k \odot h_{k-1} + (1-b_k) \odot \big(h_{k-1} \odot \exp\!\big(s_{\beta_k}(b_k \odot h_{k-1})\big) + t_{\gamma_k}(b_k \odot h_{k-1})\big) \qquad (8)$$

where h_0 = z and h_K = z_0, ⊙ is the Hadamard (i.e. element-wise) product, b_k is a binary mask used for partitioning the k-th block input, and β_k and γ_k are the deep network parameters of the scaling and translation functions of the k-th block. Stacking these affine coupling layers constitutes the functional mapping between the representation and generation spaces. The bijection is thus defined as

$$f_\eta = f_{\eta_K} \circ f_{\eta_{K-1}} \circ \cdots \circ f_{\eta_1} \qquad (9)$$
The KL divergence is given as

$$\mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\eta(z)\big) = \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p_\eta(z)\big] \qquad (10)$$

Using the change of variable formula in (7), we have the following:

$$= \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p\big(z_0 = f_\eta(z)\big)\big] - \mathbb{E}_{q_\phi(z|x)}\left[\log\left|\det\!\left(\frac{\partial f_\eta(z)}{\partial z}\right)\right|\right] \qquad (12)$$
The first term in (12) can be derived as follows:

$$\mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x) - \log p\big(f_\eta(z)\big)\big] \qquad (13)$$
$$= \mathbb{E}_{q_\phi(z|x)}\big[\log q_\phi(z|x)\big] - \mathbb{E}_{q_\phi(z|x)}\big[\log p\big(f_\eta(z)\big)\big] \qquad (14)$$
$$= -\,\mathbb{H}\big[q_\phi(z|x)\big] - \mathbb{E}_{q_\phi(z|x)}\big[\log p\big(f_\eta(z)\big)\big] \qquad (15)$$

where H[q_φ(z|x)] is the differential entropy of the variational posterior distribution. This approximate posterior is a multivariate Gaussian with a diagonal covariance matrix, i.e. q_φ(z|x) = N(z; μ_φ(x), Σ_φ(x)), where Σ_φ = diag(σ_1², …, σ_d²). This results in a closed-form expression for the entropy of the approximate posterior [42], given as:

$$\mathbb{H}\big[q_\phi(z|x)\big] = \frac{1}{2}\sum_{j=1}^{d}\log\big(2\pi e\,\sigma_j^2\big) = \sum_{j=1}^{d}\log\sigma_j + \frac{d}{2}\log(2\pi e) \qquad (16)$$
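The closed-form entropy in (16) is easy to verify numerically against the Monte-Carlo estimate −E[log q(z)] for an example diagonal Gaussian (the chosen σ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

# Closed form: H = sum_j 0.5 * log(2*pi*e*sigma_j^2), here for a 3-D Gaussian.
sigma = np.array([0.5, 1.0, 2.0])
closed = np.sum(0.5 * np.log(2 * np.pi * np.e * sigma**2))

# Monte-Carlo estimate: H = -E[log q(z)] with z ~ q = N(0, diag(sigma^2)).
z = sigma * rng.standard_normal((200000, 3))
log_q = np.sum(-0.5 * (z / sigma)**2 - 0.5 * np.log(2*np.pi*sigma**2), axis=1)
mc = -np.mean(log_q)
print(abs(closed - mc) < 0.02)
```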
The probability distribution of the base latent space (i.e. the generation space) in the decoupled prior is assumed to be a standard normal, i.e. p(z_0) = N(z_0; 0, I_d). Together with (16), and ignoring constant terms, the first term in (12) can be simplified into

$$-\sum_{j=1}^{d}\log\sigma_j + \frac{1}{2}\,\mathbb{E}_{q_\phi(z|x)}\big[\|f_\eta(z)\|^2\big] \qquad (18)$$
The second term in (12) can be derived as follows. Let $\mathbf{h}_k = f_k(\mathbf{h}_{k-1})$ denote the output of the $k$-th coupling block, with $\mathbf{h}_0 = \mathbf{z}$ and $\mathbf{h}_K = \mathbf{z}_0$. Then,

$$\text{second term} = -\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log\left|\det\left(\frac{\partial f(\mathbf{z};\lambda)}{\partial \mathbf{z}}\right)\right|\right] \tag{19}$$

Applying the chain rule to $\partial f(\mathbf{z};\lambda)/\partial \mathbf{z}$, which is a composition of $K$ functions as defined in (9), and combining this with the multiplicative property of determinants, we have the following,

$$\left|\det\left(\frac{\partial f(\mathbf{z};\lambda)}{\partial \mathbf{z}}\right)\right| = \prod_{k=1}^{K}\left|\det\left(\frac{\partial f_k(\mathbf{h}_{k-1})}{\partial \mathbf{h}_{k-1}}\right)\right| \tag{20}$$

Hence, the second term in (12) can be expressed as follows:

$$\text{second term} = -\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\sum_{k=1}^{K}\log\left|\det\left(\frac{\partial f_k(\mathbf{h}_{k-1})}{\partial \mathbf{h}_{k-1}}\right)\right|\right] \tag{21}$$
Similar to [13], deriving the Jacobian of an individual affine coupling layer yields a triangular matrix. Hence, the determinant is simply the product of its diagonal entries, resulting in,

$$\log\left|\det\left(\frac{\partial f_k(\mathbf{h}_{k-1})}{\partial \mathbf{h}_{k-1}}\right)\right| = \sum_{l=1}^{L}\left[(1-\mathbf{b}_k) \odot s_k(\mathbf{b}_k \odot \mathbf{h}_{k-1})\right]_l \tag{22}$$

where $\mathbf{h}_{k-1}$ is the input to the $k$-th coupling block (with $\mathbf{h}_0 = \mathbf{z}$). Using this simplification, we finally have an expression for the second term,

$$\text{second term} = -\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\sum_{k=1}^{K}\sum_{l=1}^{L}\left[(1-\mathbf{b}_k) \odot s_k(\mathbf{b}_k \odot \mathbf{h}_{k-1})\right]_l\right] \tag{23}$$
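The identity in (22) can be sanity-checked numerically: for a single coupling block, the sum of the scale outputs over the unmasked coordinates should match the log-determinant of a finite-difference Jacobian. This sketch reuses hypothetical toy networks (not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
W_s, W_t = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
s_net = lambda h: np.tanh(h @ W_s)   # stand-in scaling network
t_net = lambda h: h @ W_t            # stand-in translation network
b = np.array([1.0, 1.0, 0.0, 0.0])   # binary mask

def f_k(z):
    """One affine coupling block, as in Eq. (8)."""
    s = s_net(b * z)
    return b * z + (1 - b) * (z * np.exp(s) + t_net(b * z))

def log_det_analytic(z):
    """Eq. (22): triangular Jacobian, so log|det| is the sum of the
    scale outputs over the transformed (unmasked) coordinates."""
    return np.sum((1 - b) * s_net(b * z))

z = rng.normal(size=4)

# Central finite-difference Jacobian as an independent check.
eps = 1e-6
J = np.zeros((4, 4))
for j in range(4):
    dz = np.zeros(4)
    dz[j] = eps
    J[:, j] = (f_k(z + dz) - f_k(z - dz)) / (2 * eps)

log_det_numeric = np.log(np.abs(np.linalg.det(J)))
assert np.isclose(log_det_analytic(z), log_det_numeric, atol=1e-5)
```

The cheap analytic log-determinant is the key practical property of coupling layers: no $O(L^3)$ determinant computation is ever needed during training.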
In this section, we define the ELBO (i.e. the training objective) for each of the regularizers considered (see Section 4.3 in the main manuscript) and describe how it changes under the decoupled prior. This section serves only to provide more mathematical clarity on these modifications.
β-dpVAE: The ELBO for the β-VAE [20] can be expressed as follows:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \beta\,\mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right] \tag{25}$$

By substituting the KL divergence under the decoupled prior, i.e. using (12), the ELBO for β-dpVAE can be defined as follows:

$$\mathcal{L}_{\beta\text{-dp}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \beta\,\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log q(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}_0) - \log\left|\det\left(\frac{\partial f(\mathbf{z};\lambda)}{\partial \mathbf{z}}\right)\right|\right] \tag{26}$$
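The KL term under the decoupled prior can be estimated by Monte Carlo with reparameterized samples. A minimal single-block sketch, with hypothetical encoder outputs and toy stand-in networks for $s_k$ and $t_k$:

```python
import numpy as np

rng = np.random.default_rng(3)
L = 4                                    # latent dimensionality (arbitrary)
W_s, W_t = rng.normal(size=(L, L)), rng.normal(size=(L, L))
s_net = lambda h: np.tanh(h @ W_s)
t_net = lambda h: h @ W_t
b = np.array([1.0, 1.0, 0.0, 0.0])

def f(z):
    """One affine coupling block, Eq. (8)."""
    s = s_net(b * z)
    return b * z + (1 - b) * (z * np.exp(s) + t_net(b * z))

def log_det(z):
    """Eq. (22): log|det df/dz| for the single block."""
    return np.sum((1 - b) * s_net(b * z))

# Hypothetical encoder outputs for one input x.
mu, log_sigma2 = rng.normal(size=L), rng.normal(size=L) * 0.1

def kl_monte_carlo(n=10000):
    """MC estimate of KL[q(z|x) || p_lambda(z)] following Eq. (12):
    E_q[log q(z|x)] - E_q[log p(z0)] - E_q[log|det df/dz|]."""
    sigma = np.exp(0.5 * log_sigma2)
    eps = rng.standard_normal((n, L))
    z = mu + sigma * eps                 # reparameterized posterior samples
    log_q = -0.5 * np.sum(eps**2 + np.log(2 * np.pi) + log_sigma2, axis=1)
    z0 = np.array([f(zi) for zi in z])   # map to the generation space
    log_p0 = -0.5 * np.sum(z0**2 + np.log(2 * np.pi), axis=1)
    log_dets = np.array([log_det(zi) for zi in z])
    return np.mean(log_q - log_p0 - log_dets)

kl = kl_monte_carlo()
assert np.isfinite(kl)
```

Scaling this estimate by β gives the regularization term of (26); in practice a single sample per data point suffices, as in standard VAE training.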
The alternate formulation, -VAE-B [7], has a similar ELBO function with some additional parameters applied to the KL divergence term.
Factor-dpVAE: The ELBO for FactorVAE [23] is given as follows:

$$\mathcal{L}_{\text{Factor}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right] - \gamma\,\mathrm{KL}\left[q(\mathbf{z}) \,\Big\|\, \prod_{l=1}^{L} q(z_l)\right] \tag{27}$$

To modify this ELBO with the decoupled prior, the KL divergence term between the posterior and the prior is the only term that needs to be reformulated, again using (12). This results in:

$$\mathcal{L}_{\text{Factor-dp}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}) \,\|\, p_\lambda(\mathbf{z})\right] - \gamma\,\mathrm{KL}\left[q(\mathbf{z}) \,\Big\|\, \prod_{l=1}^{L} q(z_l)\right] \tag{28}$$
$$= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log q(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}_0) - \log\left|\det\left(\frac{\partial f(\mathbf{z};\lambda)}{\partial \mathbf{z}}\right)\right|\right] - \gamma\,\mathrm{KL}\left[q(\mathbf{z}) \,\Big\|\, \prod_{l=1}^{L} q(z_l)\right] \tag{29}$$
β-TC-dpVAE: Following the notation in [8], the ELBO for β-TCVAE [8] is given as follows:

$$\mathcal{L}_{\beta\text{-TC}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \alpha\, I_q(\mathbf{x};\mathbf{z}) - \beta\,\mathrm{KL}\left[q(\mathbf{z}) \,\Big\|\, \prod_{l=1}^{L} q(z_l)\right] - \gamma \sum_{l=1}^{L}\mathrm{KL}\left[q(z_l) \,\|\, p(z_l)\right] \tag{30}$$

where $I_q(\mathbf{x};\mathbf{z})$ is the index-code mutual information. Due to the factorized representation of the prior, the last KL divergence term is computed via sampling. Hence, the modification for the decoupled prior becomes trivial: we simply sample from the $\mathbf{z}_0$ space (which is assumed to be factorized) and pass the sample through $f^{-1}$ to obtain a sample in the $\mathbf{z}$ space. The ELBO can thus be modified as follows:

$$\mathcal{L}_{\beta\text{-TC-dp}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \alpha\, I_q(\mathbf{x};\mathbf{z}) - \beta\,\mathrm{KL}\left[q(\mathbf{z}) \,\Big\|\, \prod_{l=1}^{L} q(z_l)\right] - \gamma \sum_{l=1}^{L}\mathrm{KL}\left[q(z_l) \,\|\, p_\lambda(z_l)\right] \tag{31}$$
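The sampling step just described, which is also how dpVAE generates new data, can be sketched as follows: draw $\mathbf{z}_0$ from the standard normal base distribution and invert the flow to land in the representation space. The `s_net`/`t_net` stand-ins and the commented-out decoder are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
W_s, W_t = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
s_net = lambda h: np.tanh(h @ W_s)   # stand-in scaling network
t_net = lambda h: h @ W_t            # stand-in translation network
b = np.array([1.0, 1.0, 0.0, 0.0])

def f_inverse(z0):
    """Inverse of one affine coupling block; a full model stacks several
    blocks and inverts them in reverse order."""
    s = s_net(b * z0)
    t = t_net(b * z0)
    return b * z0 + (1 - b) * ((z0 - t) * np.exp(-s))

# Generation: sample in the simple base (generation) space, then map back
# to the representation space, where the decoder operates.
z0 = rng.standard_normal(4)   # z0 ~ N(0, I)
z = f_inverse(z0)             # representation-space sample
# x = decoder(z)              # hypothetical decoder producing a data sample
assert z.shape == z0.shape
```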
Info-dpVAE: For InfoVAE [58], the ELBO (using the MMD divergence) is given as follows:

$$\mathcal{L}_{\text{Info}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - (1-\alpha)\,\mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right] - (\alpha + \lambda_{\text{MMD}} - 1)\,\mathrm{MMD}\left[q(\mathbf{z}) \,\|\, p(\mathbf{z})\right] \tag{32}$$

The modification for the decoupled prior takes place in two terms: the MMD divergence term, which is computed via sampling from the aggregate posterior, and the prior over the $\mathbf{z}$ space, which acts on the representation space in the decoupled prior. The modification of the MMD term is therefore very similar to the one applied in β-TC-dpVAE: prior samples are obtained by drawing $\mathbf{z}_0 \sim p(\mathbf{z}_0)$ and mapping them through $f^{-1}$. Additionally, the KL divergence term is modified as in (12). The final ELBO is thus as follows:

$$\mathcal{L}_{\text{Info-dp}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - (1-\alpha)\,\mathrm{KL}\left[q(\mathbf{z}|\mathbf{x}) \,\|\, p_\lambda(\mathbf{z})\right] - (\alpha + \lambda_{\text{MMD}} - 1)\,\mathrm{MMD}\left[q(\mathbf{z}) \,\|\, p_\lambda(\mathbf{z})\right] \tag{33}$$
In this section, we give more details on the architectures and hyperparameters used, and on the data handling, for the four datasets used in the paper, namely the two-moons toy data, MNIST, SVHN, and CelebA. These details are also described for the different regularizers used on each of these datasets.
Figure 10 illustrates the VAE architecture for all the datasets and regularizers reported in the experiments section of the paper. For MNIST and SVHN, we use ReLU as the non-linear activation function, and for CelebA we use leaky ReLU [34]. For the two-moons data, the VAE architecture consists of two fully connected layers of size 100 and 50 (from input to latent space) in the encoder; the decoder is a mirrored version of the encoder. We use a two-dimensional latent space for the two-moons data. The architecture added for the decoupled prior is the same for all the experiments presented in the paper; this architecture for the affine coupling layers connecting the $\mathbf{z}$ and $\mathbf{z}_0$ spaces is shown in Figure 9. FactorVAE [23] uses a discriminator architecture with five fully connected layers, each with 1000 hidden units. Each fully connected layer is followed by a leaky ReLU activation with a negative slope of 0.2. This discriminator architecture is the same for all experiments, except for the input size, which changes with the latent dimension $L$.
The learning rate was the same for all the experiments, and the batch size was 100 for MNIST and 50 for SVHN and CelebA. We ran all the experiments for 100,000 iterations. No other pre-processing was performed while conducting these experiments. The regularization-specific hyperparameters are listed in Table 2; these hyperparameters were kept the same for all datasets.