Regularized Autoencoders via Relaxed Injective Probability Flow

02/20/2020 · by Abhishek Kumar, et al.

Invertible flow-based generative models are an effective method for learning to generate samples, while allowing for tractable likelihood computation and inference. However, the invertibility requirement restricts models to have the same latent dimensionality as the inputs. This imposes significant architectural, memory, and computational costs, making them more challenging to scale than other classes of generative models such as Variational Autoencoders (VAEs). We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity. This also provides another perspective on regularized autoencoders (RAEs), with our final objectives resembling RAEs with specific regularizers that are derived by lower bounding the probability flow objective. We empirically demonstrate the promise of the proposed model, improving over VAEs and AEs in terms of sample quality.


1 Introduction

Invertible flow-based generative models (Dinh et al., 2016; Kingma & Dhariwal, 2018) have recently gained traction due to several desirable properties: (i) exact log-likelihood calculation (unlike VAEs that maximize a lower bound (Kingma & Welling, 2013; Rezende et al., 2014)), (ii) exact inference of latent variables, and (iii) good sample quality relative to VAEs.

However, a limitation of invertible flow models is that they require invertibility on the full ambient space, resulting in a latent space with the same dimensionality as the input data. This requirement leads to larger models with higher memory and computational costs that are more difficult to scale than VAE and GAN counterparts that have lower-dimensional latent spaces (Goodfellow et al., 2014; Kingma & Dhariwal, 2018). This lack of dimensionality reduction also makes it difficult to capture high-level generative factors directly in individual latent dimensions, a property that is often argued to be desirable for generative models (Higgins et al., 2017; Narayanaswamy et al., 2017; Kumar et al., 2018; Kim & Mnih, 2018; Chen et al., 2018).

In this work we propose a generative model of data based on probability flows that relaxes the bijectivity requirement. The model maps low-dimensional latents $z \in \mathbb{R}^k$ to samples $g(z)$ in the image of the generator $g$, residing in the much higher-dimensional ambient space $\mathbb{R}^d$ ($k \ll d$). A probability distribution in the latent space (e.g., standard normal) is pushed forward by the mapping $g$ to induce a distribution on the image $\mathcal{M} = g(\mathbb{R}^k)$. By taking the mapping $g$ to be one-to-one or injective and differentiable, we can use a change of variables theorem to obtain a closed form for the distribution over $\mathcal{M}$. While the resulting log-likelihood in ambient space is intractable, we can form tractable lower bounds using stochastic approximations to obtain objectives amenable to stochastic first-order optimization. Relaxing the bijectivity requirement loses the ability to provide exact likelihoods for data points that lie off the image $\mathcal{M}$. In this work, we limit ourselves to using the derived flow-based objective for learning a sampling mechanism that always generates samples from the image $\mathcal{M}$. This is in contrast to VAEs, where generated samples lie off the image due to the presence of an additional distribution at the output of the decoder.¹

¹One can augment the decoder with an ambient noise distribution (e.g., Gaussian), either as part of the training objective or post-hoc after training (Wu et al., 2017), for estimating log-likelihoods, but we do not consider that here.

Figure 1: Schematic of invertible (left) vs. injective (right) mappings. Invertible flow models require that the latent space $\mathcal{Z}$ and the data space $\mathcal{X}$ have the same dimensionality and the mapping $g$ to be invertible on the full domain. In contrast, injective mappings can have a lower-dimensional $\mathcal{Z}$ but are invertible only on the image of $g$ (shaded).

Our final objective, although derived from the probability flow perspective, resembles a regularized autoencoder with an additional prior log-probability term and an annealing of the weight on the reconstruction loss that increases over time. This flow perspective motivates several commonly used autoencoder regularization strategies, e.g., those in Ghosh et al. (2019). We evaluate the relaxed injective flow models on MNIST, CIFAR-10, and CelebA, where we observe better FID scores than VAEs and the regularized autoencoders of Ghosh et al. (2019). Our results demonstrate that these models provide an efficient and tractable mechanism for training neural samplers with compressed latent spaces.

2 Formulation

2.1 Invertible Flows

Let $g : \mathcal{Z} \to \mathcal{X}$ be a generator mapping from latents $z \in \mathcal{Z}$ to data $x \in \mathcal{X}$, assumed to be differentiable everywhere. If $g$ is a bijection with inverse $f = g^{-1}$, then $\mathcal{Z}$ and $\mathcal{X}$ must have the same dimensionality and we can write the distribution induced over $x$ in terms of the distribution over $z$ using the change of variables formula:

$\log p_X(x) = \log p_Z(f(x)) + \log\left|\det J_f(x)\right| = \log p_Z(z) - \log\left|\det J_g(z)\right|$,   (1)

where $J_g(z)$ and $J_f(x)$ are the Jacobians of $g$ and $f$ at $z$ and $x$, respectively. Invertible flow models (Dinh et al., 2016; Papamakarios et al., 2017; Kingma & Dhariwal, 2018) optimize (1) to learn a generative model of the data. They provide tractable objectives by structuring the generator so that the inverse and the log-det-Jacobian terms are tractable. Recent work on invertible residual nets (Behrmann et al., 2018) makes use of certain approximations to get a tractable objective for invertible flows with ResNets having Lipschitz-constrained residual blocks. More recently, Behrmann et al. (2020) studied the numerical stability of invertible flow models, finding that, in practice, numerical issues may prevent the models from being invertible in certain regions even slightly off the data manifold.
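As a quick, self-contained illustration (not from the paper), the change of variables in (1) can be checked numerically on a toy affine bijection; the sketch below assumes a standard-normal prior and uses JAX for the Jacobian:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal

# Toy invertible "flow": an affine bijection g(z) = A z + b on R^2.
A = jnp.array([[2.0, 0.3], [0.1, 1.5]])
b = jnp.array([0.5, -1.0])
g = lambda z: A @ z + b
f = lambda x: jnp.linalg.solve(A, x - b)     # exact inverse g^{-1}

def log_px(x):
    """Change of variables, Eq. (1): log p_X(x) = log p_Z(f(x)) + log|det J_f(x)|."""
    z = f(x)
    log_pz = multivariate_normal.logpdf(z, jnp.zeros(2), jnp.eye(2))
    J_f = jax.jacfwd(f)(x)                   # Jacobian of the inverse map at x
    _, logabsdet = jnp.linalg.slogdet(J_f)
    return log_pz + logabsdet

x = g(jnp.array([0.7, -0.2]))
print(log_px(x))
```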

2.2 Injective Flows

We are interested in developing probability flow-based models for the setting where the dimensionality of the latent space is much lower than the data dimensionality, i.e., $z \in \mathbb{R}^k$ and $x \in \mathbb{R}^d$ with $k \ll d$. We can obtain a change of variables formula for this setting by looking at how an infinitesimal volume element at $z$ is transformed by the mapping $g$. The theory of integration on manifolds tells us that the mapping $g$ transforms an infinitesimal volume element at $z$ to a corresponding volume, scaled by $\sqrt{\det\left(J_g(z)^\top J_g(z)\right)}$, on $\mathcal{M}$ (Boothby, 1986), where $\mathcal{M} = g(\mathbb{R}^k)$ is the image of $\mathbb{R}^k$ under $g$. If we assume $g$ is an injective function, and thus invertible when seen as a mapping $g : \mathbb{R}^k \to \mathcal{M}$, we can write the probability flow from $z$ to $x = g(z) \in \mathcal{M}$ as:

$p_X(x) = p_Z(z)\left[\det\left(J_g(z)^\top J_g(z)\right)\right]^{-1/2}$.   (2)

Note that $\det\left(J_g(z)^\top J_g(z)\right) > 0$ as $J_g(z)^\top J_g(z)$ is a positive definite matrix. Figure 1 presents a schematic of invertible and injective functions transforming an infinitesimal volume element. The more familiar change of variables formula in (1) can be derived as the special case when $J_g(z)$ is square and thus $\left[\det\left(J_g(z)^\top J_g(z)\right)\right]^{1/2} = \left|\det J_g(z)\right|$. In order to avoid solving an inverse problem for every $x$ in our data (i.e., finding the $z$ for every $x$ s.t. $g(z) = x$), we assume the existence of an encoder $e : \mathbb{R}^d \to \mathbb{R}^k$ such that $g(e(x)) = x$ for every $x \in \mathcal{M}$. This lets us write

$\log p_X(x) = \log p_Z(e(x)) - \frac{1}{2}\log\det\left(J_g(e(x))^\top J_g(e(x))\right)$.   (3)
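For intuition, the density in (2) can be evaluated exactly for a small toy generator by materializing the Jacobian; the sketch below is an illustration with an assumed two-layer map from R^2 into R^5 (not the paper's architecture):

```python
import jax
import jax.numpy as jnp

k, d = 2, 5
W1 = jax.random.normal(jax.random.PRNGKey(0), (16, k))
W2 = jax.random.normal(jax.random.PRNGKey(1), (d, 16))

def g(z):
    # A toy (generically injective) generator g: R^k -> R^d, for illustration only.
    return W2 @ jnp.tanh(W1 @ z)

def log_px_on_image(z):
    """Eq. (2): log p_X(g(z)) = log p_Z(z) - 1/2 * log det(J_g(z)^T J_g(z))."""
    log_pz = -0.5 * jnp.sum(z**2) - 0.5 * k * jnp.log(2 * jnp.pi)  # standard normal prior
    J = jax.jacfwd(g)(z)                                           # d x k Jacobian
    _, logdet = jnp.linalg.slogdet(J.T @ J)
    return log_pz - 0.5 * logdet

print(log_px_on_image(jnp.array([0.3, -1.2])))
```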

Optimizing the $\log\det$ term exactly may be computationally challenging for large models, as it requires computing a $d \times k$ Jacobian matrix for every data point (however, it could still be tractable for smaller models where the latent dimensionality $k$ is small). We propose two ways to lower bound the log-likelihood in (3) in order to obtain a tractable objective we can maximize. Let the singular values of $J_g(e(x))$ be given by $\sigma_1, \ldots, \sigma_k$. Using the inequality $\log a \le \frac{a}{c} + \log c - 1$, for all $a > 0$ and $c > 0$ (based on concavity of the log), we have

$\log p_X(x) = \log p_Z(e(x)) - \frac{1}{2}\sum_{i=1}^{k}\log\sigma_i^2 \;\ge\; \log p_Z(e(x)) - \frac{1}{2c}\left\|J_g(e(x))\right\|_F^2 - \frac{k}{2}\left(\log c - 1\right)$.   (4)

This lower bound is maximized for $c = \frac{1}{k}\left\|J_g(e(x))\right\|_F^2$. Substituting it into (4), we get

$\log p_X(x) \;\ge\; \log p_Z(e(x)) - \frac{k}{2}\log\left\|J_g(e(x))\right\|_F^2 + \frac{k}{2}\log k$.   (5)

The bounds in (4) and (5) are also computationally expensive, but we will show how to form efficient stochastic approximations in the next section.
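As a sanity check (not part of the paper), the ordering exact term ≥ bound (5) ≥ bound (4) can be verified numerically from the singular values of a random stand-in Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 8, 64
J = rng.standard_normal((d, k))            # stand-in for the generator Jacobian J_g(e(x))
sigma = np.linalg.svd(J, compute_uv=False)

exact = -0.5 * np.sum(np.log(sigma**2))    # exact -1/2 log det(J^T J) term in Eq. (3)

c = 2.0                                    # arbitrary positive constant for Eq. (4)
bound4 = -0.5 / c * np.sum(sigma**2) - 0.5 * k * (np.log(c) - 1.0)

c_star = np.sum(sigma**2) / k              # optimal c, which yields Eq. (5)
bound5 = -0.5 / c_star * np.sum(sigma**2) - 0.5 * k * (np.log(c_star) - 1.0)

assert bound4 <= bound5 + 1e-9             # (4) is never tighter than (5)
assert bound5 <= exact + 1e-9              # both lower-bound the exact term
print(exact, bound5, bound4)
```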

Tightness of the bounds.  The inequality in (4) is tight when all singular values are equal to $\sqrt{c}$. Note that it is also possible to use a separate $c_i$ corresponding to each $\sigma_i$ and tune these as hyperparameters to improve upon the tightness of the bound; however, we do not explore this for the sake of simplicity. We will still tune the hyperparameter $c$ in (4) to see how it performs against the objective in (5).

The objectives in (5) and (4) are constrained optimization problems (subject to the reconstruction constraint $g(e(x)) = x$ for all training points) that can be solved with a variety of approaches. Recent work on VAEs has used the augmented Lagrangian method to enforce reconstruction constraints (Rezende & Viola, 2018), but here we use the penalty method for its simplicity (Bertsekas, 2016). Applying the penalty method to (5), we get:

$\max_{g,\,e}\; \sum_{x}\left[\log p_Z(e(x)) - \frac{k}{2}\log\left\|J_g(e(x))\right\|_F^2 - \lambda\left\|g(e(x)) - x\right\|_2^2\right]$,   (6)

where the sum is over training examples and $\lambda$ is a positive real that is increased as the optimization progresses (Bertsekas, 2016). Optimizing (6) can still be computationally demanding as it involves computing the full Jacobian of the generator. We can use Hutchinson's trace estimator (Hutchinson, 1990) to avoid explicitly materializing the full Jacobian. Hutchinson's trace estimator is based on the fact that $\operatorname{tr}(A) = \mathbb{E}_{v}\left[v^\top A v\right]$ for any random vector $v$ s.t. $\mathbb{E}\left[v v^\top\right] = I$. We write the Frobenius norm of the Jacobian as $\left\|J_g(z)\right\|_F^2 = \operatorname{tr}\left(J_g(z)^\top J_g(z)\right) = \mathbb{E}_{v}\left[\left\|J_g(z)\,v\right\|_2^2\right]$ for $v \sim \mathcal{N}(0, I_k)$. We further employ the unbiased Monte-Carlo estimation $\left\|J_g(z)\right\|_F^2 \approx \frac{1}{m}\sum_{j=1}^{m}\left\|J_g(z)\,v_j\right\|_2^2$, and use one Monte-Carlo sample per example ($m = 1$) in a minibatch. This leads to an unbiased estimator of the bound when used with objective (4), and the expectation of the Monte-Carlo approximation remains a lower bound on the log-likelihood.
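A minimal sketch of this Hutchinson estimate using a Jacobian-vector product, written in JAX; the toy generator `g` below is a hypothetical stand-in:

```python
import jax
import jax.numpy as jnp

def frob_norm_sq_hutchinson(g, z, key, num_samples=1):
    """Unbiased estimate of ||J_g(z)||_F^2 = E_v[||J_g(z) v||^2], v ~ N(0, I_k)."""
    k = z.shape[-1]
    vs = jax.random.normal(key, (num_samples, k))
    # jax.jvp computes the Jacobian-vector product J_g(z) v without materializing J_g.
    jvp_sq = lambda v: jnp.sum(jax.jvp(g, (z,), (v,))[1] ** 2)
    return jnp.mean(jax.vmap(jvp_sq)(vs))

# Hypothetical toy generator, for illustration only.
W = jax.random.normal(jax.random.PRNGKey(0), (10, 3))
g = lambda z: jnp.tanh(W @ z)

z = jnp.ones(3)
est = frob_norm_sq_hutchinson(g, z, jax.random.PRNGKey(1), num_samples=64)
exact = jnp.sum(jax.jacfwd(g)(z) ** 2)  # exact ||J_g(z)||_F^2 for comparison
print(est, exact)
```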

When used with the objective in (5), Hutchinson's estimator leads to a biased estimator of $\log\left\|J_g(e(x))\right\|_F^2$, as the expectation of this estimator is smaller than $\log\left\|J_g(e(x))\right\|_F^2$ (by Jensen's inequality). This results in an estimate whose expectation may no longer be a bound on the log-likelihood. A similar issue arises in earlier works that do Monte-Carlo estimation of the log of an expectation (Li & Turner, 2016; Rhodes & Gutmann, 2018). In spite of no longer bounding the log-likelihood, we find that this approximation is still effective in practice for training neural samplers. Using this Monte-Carlo approximation yields

$\max_{g,\,e}\; \sum_{x}\left[\log p_Z(e(x)) - \frac{k}{2}\log\left\|J_g(e(x))\,v\right\|_2^2 - \lambda\left\|g(e(x)) - x\right\|_2^2\right]$,   (7)

with $v \sim \mathcal{N}(0, I_k)$ and ignoring the constant terms (the factor of $\frac{k}{2}$ can also be absorbed into the penalty coefficients). We use automatic differentiation² to optimize the term containing the Jacobian-vector product $J_g(e(x))\,v$. However, we observe numerical instabilities while training models for some configurations on CIFAR-10. In these cases, we use the finite difference approximation:

$J_g(z)\,v \approx \frac{g(z + \epsilon v) - g(z)}{\epsilon}$,   (8)

with small $\epsilon > 0$ and $v \sim \mathcal{N}(0, I_k)$.

²If the automatic differentiation framework only allows for reverse-mode AD, one can use $\left\|v^\top J_g(z)\right\|_2^2$ with $v \in \mathbb{R}^d$ for estimating $\left\|J_g(z)\right\|_F^2$, instead of $\left\|J_g(z)\,v\right\|_2^2$ with $v \in \mathbb{R}^k$.
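A hedged sketch of the finite-difference approximation in (8), compared against the exact Jacobian-vector product; the step size `eps` and the toy generator are illustrative choices, not the paper's settings:

```python
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (10, 3))
g = lambda z: jnp.tanh(W @ z)   # toy generator, for illustration only

def jvp_finite_diff(g, z, v, eps=1e-3):
    # Eq. (8): J_g(z) v ~= (g(z + eps * v) - g(z)) / eps
    return (g(z + eps * v) - g(z)) / eps

z = jnp.ones(3)
v = jax.random.normal(jax.random.PRNGKey(1), (3,))
approx = jvp_finite_diff(g, z, v)
exact = jax.jvp(g, (z,), (v,))[1]
print(jnp.max(jnp.abs(approx - exact)))   # should be small for small eps
```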

For $g$ to be injective, a necessary condition is that all singular values of $J_g(z)$ be bounded away from zero. Instead of directly enforcing this, which can be computationally challenging, we simply enforce that the stochastic estimate $\left\|J_g(e(x))\,v\right\|_2^2$ is greater than a threshold $\delta$ for all training points $x$, with $v \sim \mathcal{N}(0, I_k)$. A similar approach was used by Odena et al. (2018). There are scenarios where positivity of singular values does not ensure global injectivity, i.e., there may exist $z_1 \neq z_2$ s.t. $g(z_1) = g(z_2)$ (see self-intersections in Lagrange et al. (2007)). Suppose $x = g(z_1) = g(z_2)$ in this case; then the lower bounds in (5) and (4) are still valid, since the true density at $x$ is a sum of nonnegative contributions from all preimages of $x$ and is therefore at least as large as the single term we bound.

While training, we take the latent space distribution $p_Z$ to be an isotropic Gaussian distribution $\mathcal{N}(0, \sigma^2 I_k)$, which reduces the first term in bounds (4) and (5) to $-\frac{1}{2\sigma^2}\left\|e(x)\right\|_2^2$ (up to an additive constant). Our final minimization objective corresponding to the lower bound of (5) is given by

$\min_{g,\,e}\; \sum_{x}\left[\frac{\left\|e(x)\right\|_2^2}{2\sigma^2} + \frac{k}{2}\log\left\|J_g(e(x))\,v\right\|_2^2 + \lambda\left\|g(e(x)) - x\right\|_2^2 + \mu\,\max\!\left(0,\; \delta - \left\|J_g(e(x))\,v\right\|_2^2\right)\right]$,   (9)

where $v \sim \mathcal{N}(0, I_k)$, $\lambda$ is the penalty coefficient on the reconstruction constraint, and $\mu$ is a positive penalty on the constraint enforcing the local injectivity of the generator. Both $\lambda$ and $\mu$ are increased over the course of optimization (Bertsekas, 2016).
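Putting the pieces together, the sketch below shows a per-example loss in the spirit of objective (9), written in JAX. The functions `g` and `e` and the hyperparameter values (`sigma2`, `lam`, `mu`, `delta`) are illustrative stand-ins rather than the paper's actual settings, and the hinge form of the injectivity penalty follows the description above:

```python
import jax
import jax.numpy as jnp

def injflow_log_loss(g, e, x, key, sigma2=1.0, lam=1.0, mu=1.0, delta=0.1):
    """Per-example loss in the spirit of Eq. (9); hyperparameter values are illustrative."""
    z = e(x)                                   # encoded latent, z in R^k
    k = z.shape[-1]
    v = jax.random.normal(key, z.shape)        # Hutchinson direction v ~ N(0, I_k)
    x_rec, Jv = jax.jvp(g, (z,), (v,))         # g(z) and Jacobian-vector product J_g(z) v
    jv_sq = jnp.sum(Jv ** 2)

    prior = jnp.sum(z ** 2) / (2.0 * sigma2)           # -log p_Z(e(x)) up to a constant
    logdet = 0.5 * k * jnp.log(jv_sq)                  # stochastic surrogate for (k/2) log ||J||_F^2
    recon = lam * jnp.sum((x_rec - x) ** 2)            # penalty-method reconstruction term
    inject = mu * jnp.maximum(0.0, delta - jv_sq)      # local-injectivity hinge penalty
    return prior + logdet + recon + inject

# Toy usage with a hypothetical linear encoder/decoder pair.
k, d = 4, 16
A = jax.random.normal(jax.random.PRNGKey(0), (d, k))
g = lambda z: A @ z
e = lambda x: jnp.linalg.pinv(A) @ x
x = jax.random.normal(jax.random.PRNGKey(1), (d,))
print(injflow_log_loss(g, e, x, jax.random.PRNGKey(2)))
```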

Following similar steps of forming an unconstrained objective using the penalty method and using Monte-Carlo estimation for $\left\|J_g(e(x))\right\|_F^2$, we obtain the following minimization objective corresponding to the lower bound of (4):

$\min_{g,\,e}\; \sum_{x}\left[\frac{\left\|e(x)\right\|_2^2}{2\sigma^2} + \frac{1}{2c}\left\|J_g(e(x))\,v\right\|_2^2 + \lambda\left\|g(e(x)) - x\right\|_2^2 + \mu\,\max\!\left(0,\; \delta - \left\|J_g(e(x))\,v\right\|_2^2\right)\right]$,   (10)

where $c$ is a fixed hyperparameter (which is not optimized over but can be tuned as discussed earlier). We optimize the objectives (9) and (10) with respect to the parameters of both the generator $g$ and the encoder $e$.

2.2.1 Sampling from the model

Although the injective flow model transforms an isotropic Gaussian prior to the data distribution, in practice we observe that the distribution of encoded data points (the "aggregate posterior") deviates from the prior distribution, which is also reflected in the poor quality of generated samples. Note that this is not linked to invertibility and can happen even when the network is perfectly invertible. This issue is not specific to our model and is present even in VAEs and bijective flow-based models. For VAEs, recent work highlighted this issue in the case of modeling a data distribution that lies along a low-dimensional manifold (Sec. 4 in Dai & Wipf (2019)) and proposed fitting another distribution on the encoded latents after training the VAE. For invertible flow models, the Euclidean norm of the latent codes often falls outside the typical set of the prior, indicating a systematic aggregate posterior-prior mismatch (see Choi et al. (2018), and Fig. 8 in Kingma & Dhariwal (2018)).

Hence, for sampling from the model, we fit a distribution over the encoded training data in the latent space after the model has been trained, an approach taken by several recent works (van den Oord et al., 2017; Dai & Wipf, 2019; Ghosh et al., 2019). Dai & Wipf (2019) train another VAE on the encoded training data to get a complex post-fit prior. However, in this paper we experiment with fitting a Gaussian prior and a mixture of 10 Gaussians similar to Ghosh et al. (2019).
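A minimal sketch of this post-fit sampling step; `encode` and `decode` are assumed stand-ins for the trained encoder and generator, and scikit-learn's GaussianMixture is one possible way to fit the full-covariance mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_and_sample(encode, decode, x_train, n_components=10, n_samples=64, seed=0):
    """Fit a full-covariance GMM on encoded training data, then decode samples from it."""
    z_train = encode(x_train)                                  # (N, k) latent codes
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=seed).fit(z_train)
    z_samples, _ = gmm.sample(n_samples)                       # latents from the post-fit prior
    return decode(z_samples)                                   # decode to data space

# Toy usage with identity encode/decode on random "latents" (illustration only).
x_train = np.random.default_rng(0).normal(size=(1000, 8))
samples = fit_and_sample(lambda x: x, lambda z: z, x_train)
print(samples.shape)
```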

3 Related Work

Our work is similar in spirit to the recent work of Ghosh et al. (2019); van den Oord et al. (2017); Dai & Wipf (2019), which find that regularized autoencoders paired with a learned prior produce high-quality samples. Our work provides another perspective on the regularized autoencoder (RAE) objective in Ghosh et al. (2019), wherein the regularization terms arise naturally from approximating the log-likelihood objective of the injective probability flow. Ghosh et al. (2019) motivate the RAE objective by considering constant-posterior-variance VAEs and connecting stochasticity at the decoder's input (arising from sampling from the approximate posterior) to smoothness of the decoder. Recently, Kumar & Poole (2020) analyzed the implicit regularization in β-VAEs, deriving a regularizer that also depends on the Jacobian of the decoder but has a different form.

Regularized autoencoders have been widely studied in earlier works as well (Rifai et al., 2011b; Alain & Bengio, 2014; Poole et al., 2014). Contractive autoencoders (Rifai et al., 2011b) also penalize the Frobenius norm of the Jacobian; however, the penalty is on the encoder Jacobian, which is different from our penalty on the decoder Jacobian. Most of these prior works on RAEs focus on improving the quality of the encoder for downstream tasks, whereas we are primarily interested in the quality of the generator for producing samples. Recent work has turned to regularizing autoencoders for sample quality as well, for example improving interpolation quality using an adversarial training objective (Berthelot et al., 2018).

Krusinga et al. (2019) recently used Eq. (2) to get density estimates for trained GANs. However, as we noted earlier, these density estimates are by nature undefined for unseen real examples, which may lie off the manifold.

Several earlier works have also used spectral regularizers in training generative models. Miyato et al. (2018) encourage Lipschitz smoothness of the GAN discriminator by normalizing the spectral norm of each layer. Odena et al. (2018) study the spectral properties of the Jacobian of the generator and its correlation with the quality of generated samples. They empirically observe that regularizing the condition number of the Jacobian leads to more stable training and an improved generative model.

4 Experiments

Datasets.  Our experimental framework is based on Ghosh et al. (2019). We evaluate our proposed model and baselines on three publicly available datasets: CelebA (Liu et al., 2015), CIFAR-10 (Krizhevsky & Hinton, 2009), and MNIST (LeCun et al., 1998). We use cropped images for CelebA faces, as used in several prior works. Image sizes for CIFAR-10 and MNIST are 32×32 and 28×28, respectively.

              MNIST                     CIFAR-10                  CelebA
              Rec.   Samples            Rec.   Samples            Rec.   Samples
                     N      GMM                N      GMM                N      GMM
VAE           65.10  57.04  62.08       176.5  169.1  184.3       62.36  72.48  67.82
β-VAE          7.91  24.31   8.12       43.86  83.59  71.56       30.06  50.66  42.77
AE             8.69  43.40  12.14       41.45  81.13  70.97       30.16  51.48  43.49
CAE           10.51  45.18  12.90       41.13  81.53  70.11       31.12  48.13  40.67
AE+L2          7.76  34.27   9.69       43.02  81.28  70.13       29.97  50.02  42.09
AE+SN          8.07  37.19  11.84       41.34  81.35  70.94       31.21  51.13  43.33
InjFlow^log    7.40  35.96   9.93       40.11  78.78  68.26       27.93  47.70  40.23
InjFlow        6.0   42.65  11.43       40.86  79.67  68.37       28.51  49.01  40.57
Table 1: FID scores (lower is better). Rec.: score for reconstructed test data; N: score for decoded samples from a Gaussian prior with full covariance fit to encoded training samples; GMM: score for decoded samples from a GMM fit to encoded training samples. InjFlow and InjFlow^log are the models obtained from the objectives (10) and (9), respectively (with the superscript denoting the presence of the log with the Frobenius norm term in (5) and (9)).
Figure 2: CelebA reconstructions: Top to bottom: Randomly sampled test examples, InjFlow reconstructions, Autoencoder reconstructions, VAE reconstructions.
Figure 3: Random CelebA Samples: Top row: InjFlow, Middle row: Autoencoder, Bottom row: VAE. Left: samples from post-fit GMM prior with 10 components, Right: samples from post-fit Gaussian prior.

Baseline models.  Our final objective, although obtained by developing an injective flow and lower bounding its log-likelihood, resembles recently proposed regularized autoencoders (Ghosh et al., 2019), which arise as natural models for comparison.

We consider several smoothness regularizers in our evaluations, some of which have also been used by Ghosh et al. (2019):

(i) AE: Vanilla autoencoder trained with reconstruction loss. (ii) AE+L2: Autoencoder with an additional L2-norm penalty on the decoder parameters (weight decay). (iii) AE+SN: Autoencoder with additional spectral normalization on each individual layer of the decoder (i.e., normalizing the top singular value to be 1), motivated by Miyato et al. (2018). (iv) CAE: We also use a contractive autoencoder (Rifai et al., 2011a) as a baseline, which penalizes the Frobenius norm of the encoder's Jacobian. We use a similar Hutchinson trace stochastic approximation (as used for our objectives) to optimize the Frobenius norm term in the CAE objective. Ghosh et al. (2019) also consider a gradient-penalty regularized AE which penalizes the Frobenius norm of the decoder's Jacobian, a term which is also present in our objective (4). We also compare with (v) VAE (Kingma & Welling, 2013) and (vi) β-VAE (Higgins et al., 2017), both with a Gaussian observation model $\mathcal{N}(g(z), \sigma^2 I)$ at the decoder's output. For VAE, $\sigma^2$ is taken to be 1, while for β-VAE, varying $\sigma^2$ directly controls the effective $\beta$, with $\beta \propto \sigma^2$. We do not report a comparison with Wasserstein Autoencoders (WAE), as Ghosh et al. (2019) have shown that the tractable WAE-MMD version is outperformed by regularized autoencoders.

Architectures. 

We use a convolutional neural net based architecture for both the encoder and decoder, each having five layers of convolutions or transposed convolutions, respectively. Strides and kernel sizes in the convolutional filters differ across datasets, but stay the same for all models on a given dataset. We use a slightly larger network than that of Ghosh et al. (2019) (5 vs. 4 layers), and thus rerun all baseline methods so that the results are comparable. This also results in improved scores for the baselines over those reported in Ghosh et al. (2019). We use the ELU activation in both the encoder and decoder, and also use batch normalization. The latent dimensionality is the same for CIFAR-10 and CelebA, and differs for MNIST. More details on the architectures used are provided in the supplementary material.

Hyperparameters and training.  Our log-Frobenius norm objective (9) (referred to as InjFlow^log) has four hyperparameters: (i) the variance $\sigma^2$ of the isotropic Gaussian distribution on the latent space, which determines the weight on the term penalizing the norm of the encodings $\left\|e(x)\right\|_2^2$, (ii) the penalty coefficient $\lambda$ on the reconstruction loss, (iii) the penalty coefficient $\mu$ on the injectivity loss term, and (iv) the singular value threshold $\delta$ used in the injectivity term $\max\left(0, \delta - \left\|J_g(e(x))\,v\right\|_2^2\right)$. We use the same value of $\delta$ in all our experiments. Both penalty coefficients $\lambda$ and $\mu$ are initialized to small values at the beginning of optimization and are increased with each minibatch iteration according to a fixed schedule, whose rate is searched over a small grid. The weight on the prior term is also searched over a small grid. Our objective (10) (referred to as InjFlow) has an additional hyperparameter $c$ that determines the weight on the Frobenius norm regularization term, which we fix to 1 in all our experiments. As discussed earlier, this will result in a tight bound only in the case when all Jacobian singular values are one.
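For concreteness, a penalty schedule of this kind might look like the sketch below; the initial values and growth rate are hypothetical placeholders, since the exact schedule and search grid are not reproduced here:

```python
def annealed_penalties(step, lam0=0.01, mu0=0.01, rate=1e-4):
    """Hypothetical penalty-method schedule: lambda and mu grow with the minibatch step.
    The initial values and rate are illustrative placeholders, not the paper's settings."""
    lam = lam0 * (1.0 + rate * step)
    mu = mu0 * (1.0 + rate * step)
    return lam, mu

# Example: penalty coefficients after 10k minibatch iterations.
print(annealed_penalties(10_000))
```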

For AE+L2, the hyperparameter weighting the L2 regularization term is searched over a small set of values. For β-VAE, we search over the standard deviation $\sigma$ of the decoder's distribution (which is related to $\beta$ as $\beta \propto \sigma^2$ for the Gaussian observation model) over a small set of values (for VAE, $\sigma = 1$). For CAE (Rifai et al., 2011b), the hyperparameter penalizing the encoder's Jacobian norm is likewise searched over a small set of values. We train all models using the Adam optimizer (Kingma & Ba, 2014) with a fixed batch size and a fixed learning rate. All models are trained for the same fixed number of minibatch iterations.

Evaluation.  Evaluation of sample quality is a challenging task (Theis et al., 2015), and several metrics have been proposed in the literature (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018). We use the FID score (Heusel et al., 2017) as the quantitative metric for our evaluations; it is one of the most popular metrics and has been used in several recent works (Tolstikhin et al., 2017; Dai & Wipf, 2019; Ghosh et al., 2019) for evaluating generative models. As discussed earlier, we fit a Gaussian and a mixture of 10 Gaussians on the encoded training data, and use these as prior latent distributions to sample from the model. The covariance matrices for the Gaussian, as well as for all mixture components in the GMM, are taken to be full matrices. We also report FID scores on the test reconstructions, apart from qualitative visualization of reconstructions and samples.

For all models, we report the best FID score obtained using decoder sampling from a post-fit GMM in the latent space. We then report all the other scores (i.e., scores for samples from the post-fit Gaussian and for test reconstructions) for the same model that gives the best GMM-samples FID score. This enables us to assess models in terms of their best possible sample generation ability. The proposed injective flow models yield better FID scores than all the baseline models for CelebA and CIFAR-10, and are competitive on MNIST, where they are outperformed by β-VAE. In most cases, FID scores for samples with the post-fit GMM are better than for samples with the post-fit Gaussian, except for VAE, which we suspect could be due to a convergence issue with GMM fitting. In most cases, InjFlow^log yields better FID scores than InjFlow, which is expected as InjFlow^log uses the optimal value of $c$ (inequality (5)) as opposed to the fixed and likely suboptimal value of $c$ used for InjFlow in our experiments.

Randomly generated CelebA samples for the autoencoder, VAE, and the proposed model (InjFlow) are visualized in Fig. 3. While VAE samples are blurry and tend to lose fine details, they retain global coherence. On the other hand, samples from InjFlow and the autoencoder are sharper with more fine details, but also have undesired visual artifacts in some cases. Fig. 2 shows reconstructions of randomly sampled test examples using InjFlow, the autoencoder, and VAE. InjFlow reconstructions preserve more fine details than both the autoencoder and VAE (e.g., the hair strand in the image in the third column), as also reflected by improved FID scores. More generated samples are shown in the supplementary material. It should be noted that better sample quality can be achieved by fitting a more expressive prior, such as a GMM with more components, a VAE, or a flow prior (Dinh et al., 2016; Papamakarios et al., 2017); however, care must be taken to not overfit the latent encodings of the training points. In principle, a model that can produce good-quality test reconstructions has the ability to generate good-quality novel samples, and the challenge lies in fitting a prior distribution that generalizes well.

5 Discussion

We proposed a probability flow based generative model that leverages an injective generator mapping, relaxing the bijectivity requirement. We use a change of variables formula to derive an optimization objective for learning the generator and encoder, where a smoothness regularizer on the generator naturally arises from the probability flow, along with some additional penalty terms. This nicely motivates several autoencoder regularizers that have been used in the past, such as in Ghosh et al. (2019). The proposed model also improves over several regularizers studied in Ghosh et al. (2019) in terms of FID scores.

Relaxing the bijectivity constraint loses many nice properties of invertible flow-based generative models, such as tractable likelihood and inference for unseen data. A possible approach to recover these aspects could be to define a background probability model over the full ambient space and work with a mixture of the foreground distribution over $\mathcal{M}$ coming from the probability flow and the background distribution. Investigating this would be an interesting future direction.

To enable tractable and efficient training of injective flow models, we relied on lower bounds and stochastic approximation for the Jacobian term, and an amortized encoder trained with the penalty method. Future work should investigate the degree to which these approximations are accurate, and whether there are better and more efficient approaches for ensuring invertibility ($g(e(x)) = x$) on training points (e.g., augmented Lagrangian methods (Bertsekas, 2016), which have been used in Rezende & Viola (2018)). A benefit of injective flow models is the ability to scale to larger input dimensions. In future work, we plan to improve sample quality and cater to higher-resolution images by scaling the models and fitting more expressive priors.

References

Appendix A Architectures

We used a similar architecture for all datasets, with 5 convolution layers followed by a dense layer projecting to a mean embedding.

Our architecture resembles that of Ghosh et al. (2019), but with an additional layer, ELU instead of ReLU nonlinearities, and larger latent dimensions. We list Conv (convolutional) and ConvT (transposed convolution) layers with their number of filters, kernel size, and stride.

(Table: encoder and decoder layer configurations (Conv/ConvT filters, kernel size, and stride) for MNIST, CIFAR-10, and CelebA; the detailed layer-by-layer specifications are not reproduced here.)

Appendix B Additional Samples

We visualize additional reconstructed test examples and samples from a post-fit GMM model with 10 mixture components on the latents.

Figure 4: CelebA test reconstructions from InjFlow model: Top: original test image, Bottom: reconstructed image.
Figure 5: CelebA random samples from InjFlow model using the post-fit Gaussian mixture distribution on the latent space.
Figure 6: CelebA test reconstructions from Autoencoder: Top: original test image, Bottom: reconstructed image.
Figure 7: CelebA random samples from Autoencoder using the post-fit Gaussian mixture distribution on the latent space.
Figure 8: CelebA test reconstructions from VAE: Top: original test image, Bottom: reconstructed image.
Figure 9: CelebA random samples from VAE using the post-fit Gaussian mixture distribution on the latent space.
Figure 10: CIFAR-10 test reconstructions from InjFlow model: Top: original test image, Bottom: reconstructed image.
Figure 11: CIFAR10 random samples from InjFlow model using the post-fit Gaussian mixture distribution on the latent space.
Figure 12: CIFAR-10 test reconstructions from Autoencoder: Top: original test image, Bottom: reconstructed image.
Figure 13: CIFAR10 random samples from Autoencoder using the post-fit Gaussian mixture distribution on the latent space.
Figure 14: CIFAR-10 test reconstructions from VAE: Top: original test image, Bottom: reconstructed image.
Figure 15: CIFAR10 random samples from VAE using the post-fit Gaussian mixture distribution on the latent space.
Figure 16: MNIST test reconstructions from InjFlow model: Top: original test image, Bottom: reconstructed image.
Figure 17: MNIST random samples from InjFlow model using the post-fit Gaussian mixture distribution on the latent space.
Figure 18: MNIST test reconstructions from Autoencoder: Top: original test image, Bottom: reconstructed image.
Figure 19: MNIST random samples from Autoencoder using the post-fit Gaussian mixture distribution on the latent space.
Figure 20: MNIST test reconstructions from VAE: Top: original test image, Bottom: reconstructed image.
Figure 21: MNIST random samples from VAE using the post-fit Gaussian mixture distribution on the latent space.