Likelihood Almost Free Inference Networks

11/20/2017, by Guoqing Zheng et al.

Variational inference for latent variable models is prevalent in various machine learning problems, typically solved by maximizing the Evidence Lower Bound (ELBO) of the true data likelihood with respect to a variational distribution. However, freely enriching the family of variational distributions is challenging, since the ELBO requires variational likelihood evaluations of the latent variables. In this paper, we propose a novel framework to enrich the variational family based on an alternative lower bound, by introducing auxiliary random variables to the variational distribution only. While offering a much richer family of complex variational distributions, the resulting inference network is likelihood almost free, in the sense that only the latent variables require evaluations from simple likelihoods and samples from all the auxiliary variables are sufficient for maximum likelihood inference. We show that the proposed approach is essentially optimizing a probabilistic mixture of ELBOs, thus enriching modeling capacity and enhancing robustness. It outperforms state-of-the-art methods in our experiments on several density estimation tasks.


1 Introduction

Estimating posterior distributions is the primary focus of Bayesian inference, where we are interested in how our belief over the variables in our model changes after observing a set of data. Predictions can also benefit from Bayesian inference, as every prediction is equipped with a confidence interval representing how certain the prediction is. Compared to the maximum a posteriori (MAP) estimate of the model parameters, which is a point estimate, the posterior distribution provides richer information about the model parameters and hence better justified predictions.

Among various inference algorithms for posterior estimation, variational inference (VI) (Blei et al., 2017) and Markov Chain Monte Carlo (MCMC) (Geyer, 1992) are the most widely used ones. It is well known that MCMC suffers from slow mixing, though asymptotically the chained samples approach the true posterior. Furthermore, for latent variable models (LVMs) (Wainwright et al., 2008), where each sampled data point is associated with a latent variable, the number of simulated Markov Chains increases with the number of data points, making the computation too costly. VI, on the other hand, facilitates faster inference because it optimizes an explicit objective function and its convergence can be measured and controlled. Hence, VI has been widely used in many Bayesian models, such as the mean-field approach for Latent Dirichlet Allocation (Blei et al., 2003). To enrich the family of distributions over the latent variables, neural network based variational inference methods have also been proposed, such as the Variational Autoencoder (VAE) (Kingma and Welling, 2013), the Importance Weighted Autoencoder (IWAE) (Burda et al., 2015) and others (Rezende and Mohamed, 2015; Mnih and Gregor, 2014; Kingma et al., 2016). These methods outperform traditional mean-field based inference algorithms due to their flexible distribution families and easy-to-scale algorithms, and have therefore become the state of the art for variational inference.

The aforementioned VI methods are essentially maximizing the evidence lower bound (ELBO), i.e., the lower bound of the true marginal data likelihood, defined as

$\log p(x) \;\ge\; \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$  (1)

where $x$ and $z$ denote a data point and its latent code, and $p(x,z)$ and $q(z|x)$ denote the generative model and the variational model, respectively. The equality holds if and only if $q(z|x) = p(z|x)$; otherwise a gap always exists. The more flexible the variational family is, the more likely it will match the true posterior $p(z|x)$. However, arbitrarily enriching the variational model family is non-trivial, since optimizing Eq. 1 always requires evaluations of $q(z|x)$. Most existing methods either make over-simplified assumptions about the variational model, such as the simple Gaussian posterior in VAE (Kingma and Welling, 2013), or resort to implicit variational models without explicitly modeling $q(z|x)$ (Dumoulin et al., 2016).
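The gap in Eq. 1 can be made explicit with a standard one-line decomposition (included here for completeness):

$\log p(x) \;=\; \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] + \mathrm{KL}\left(q(z|x)\,\|\,p(z|x)\right),$

so the bound is tight exactly when $q(z|x) = p(z|x)$, and the slack is otherwise the KL divergence from the variational posterior to the true posterior.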

In this paper we propose to enrich the variational distribution family by incorporating auxiliary variables into the variational model. Most importantly, density evaluations are not required for the auxiliary variables, and thus complex implicit densities over the auxiliary variables can be easily constructed, which in turn results in a flexible variational posterior over the latent variables. We argue that the resulting inference network essentially models a complex probabilistic mixture of different variational posteriors indexed by the auxiliary variable, and thus a much richer and more flexible family of variational posterior distributions is achieved. We conduct empirical evaluations on several density estimation tasks, which validate the effectiveness of the proposed method.

The rest of the paper is organized as follows: we briefly review two existing approaches to inference network modeling in Section 2, and present our proposed framework in Section 3. We then point out the connections of the proposed framework to related methods in Section 4. Empirical evaluations and analysis are carried out in Section 5, and we conclude the paper in Section 6.

2 Preliminaries

In this section, we briefly review several existing methods that aim to address variational inference with stochastic neural networks.

2.1 Variational Autoencoder (VAE)

Given a generative model $p(x,z)$ defined over data $x$ and latent variable $z$, indexed by parameter $\theta$, variational inference aims to approximate the intractable posterior $p(z|x)$ with $q(z|x)$, indexed by parameter $\phi$, such that the ELBO is maximized:

$\mathcal{L}_{\text{VAE}}(x) = \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$  (2)

Parameters of both the generative distribution and the variational distribution are learned by maximizing the ELBO with stochastic gradient methods. (We drop the dependencies of $p$ and $q$ on the parameters $\theta$ and $\phi$ to prevent clutter.) Specifically, VAE (Kingma and Welling, 2013) assumes that both the conditional distribution of the data given the latent codes in the generative model and the variational posterior distribution are Gaussians, whose means and diagonal covariances are parameterized by two neural networks, termed the generative network and the inference network, respectively. Model learning is possible due to the re-parameterization trick (Kingma and Welling, 2013), which makes backpropagation through the stochastic variables possible.
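As a concrete illustration, a minimal sketch of a Gaussian inference network with the re-parameterization trick (layer sizes are illustrative, not the exact architecture used in the experiments):

import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Minimal q(z|x): a tanh MLP producing a mean and a log-variance."""
    def __init__(self, x_dim=784, h_dim=200, z_dim=50):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Re-parameterization: z = mu + sigma * eps keeps the sampling step differentiable.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return z, mu, logvar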

2.2 Importance Weighted Autoencoder (IWAE)

The above ELBO is a lower bound of the true data log-likelihood $\log p(x)$, hence (Burda et al., 2015) proposed IWAE to directly estimate the true data log-likelihood in the presence of the variational model (the variational model is also referred to as the inference model, so we use the two terms interchangeably), namely

$\mathcal{L}_{\text{IWAE}}^{k}(x) = \mathbb{E}_{z_1,\dots,z_k \sim q(z|x)}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x)}\right]$  (3)

where $k$ is the number of importance weighted samples. The above bound is tighter than the ELBO used in VAE. When trained on the same network structure as VAE, with the above estimate as the training objective, IWAE achieves considerable improvements over VAE on various density estimation tasks (Burda et al., 2015); a similar idea is also considered in (Mnih and Rezende, 2016).
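A minimal sketch of the $k$-sample bound in Eq. 3, computed with a numerically stable log-sum-exp over the per-sample importance weights (the log-densities are assumed to be computed elsewhere):

import math
import torch

def iwae_bound(log_p_xz, log_q_z):
    """log_p_xz, log_q_z: (k, batch) tensors holding log p(x, z_i) and log q(z_i|x)
    for k samples z_i ~ q(z|x).  Returns the k-sample bound per data point."""
    k = log_p_xz.shape[0]
    log_w = log_p_xz - log_q_z                      # log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(k)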

3 The Proposed Method

3.1 Variational Posterior with Auxiliary Variables

Consider the case of modeling binary data with the classic VAE and IWAE, which typically assume that a data point $x$ is generated from a multivariate Bernoulli conditioned on a latent code $z$ drawn from a Gaussian prior. It is easy to verify that the Gaussian variational posterior inferred by VAE and IWAE will not match the non-Gaussian true posterior.

To this end, we propose to introduce an auxiliary random variable $\tau$ into the inference model of VAE and IWAE. Conditioned on the input $x$, the inference model equipped with the auxiliary variable now defines a joint density over $(z, \tau)$ as

$q(z, \tau \,|\, x) = q(\tau|x)\, q(z|x, \tau)$  (4)

where we assume that $\tau$ has proper support and that both $q(\tau|x)$ and $q(z|x,\tau)$ can be parameterized. Accordingly, the marginal variational posterior of $z$ given $x$ becomes

$q(z|x) = \int q(\tau|x)\, q(z|x,\tau)\, d\tau$  (5)

which essentially models the posterior as a probabilistic mixture of densities $q(z|x,\tau)$ indexed by $\tau$, with $q(\tau|x)$ as the mixture weights. This allows complex and flexible posteriors to be constructed, even when both $q(\tau|x)$ and $q(z|x,\tau)$ come from simple density families. Due to the presence of the auxiliary variable $\tau$, the inference model captures more sources of stochasticity than the generative model, hence we term our approach the Asymmetric Variational Autoencoder (AVAE). Figures 1(a) and 1(b) present a comparison of the inference models of the classic VAE and the proposed AVAE.

In the context of VAE and IWAE, the proposed approach includes two instantiations, AVAE and IW-AVAE, with loss functions

$\mathcal{L}_{\text{AVAE}}(x) = \mathbb{E}_{q(\tau|x)}\left[\mathbb{E}_{q(z|x,\tau)}\left[\log \frac{p(x,z)}{q(z|x,\tau)}\right]\right]$  (6)

and

$\mathcal{L}_{\text{IW-AVAE}}^{k}(x) = \mathbb{E}_{q(\tau|x)}\left[\mathbb{E}_{z_1,\dots,z_k \sim q(z|x,\tau)}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x,\tau)}\right]\right]$  (7)

respectively.
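In other words, Eq. 6 averages, under $q(\tau|x)$, the standard ELBO obtained with the conditional posterior $q(z|x,\tau)$; since each such ELBO lower-bounds $\log p(x)$, so does their mixture. This is the "mixture of ELBOs" view; the one-line argument below uses our reconstructed notation:

$\mathcal{L}_{\text{AVAE}}(x) = \mathbb{E}_{q(\tau|x)}\Big[\underbrace{\mathbb{E}_{q(z|x,\tau)}\big[\log \tfrac{p(x,z)}{q(z|x,\tau)}\big]}_{\text{ELBO with } q(z|x,\tau)}\Big] \;\le\; \mathbb{E}_{q(\tau|x)}\big[\log p(x)\big] = \log p(x).$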

(a) Generative model (left) and inference model (right) for VAE

(b) Inference model for AVAE (Generative model is the same as in VAE)

(c) Inference model for AVAE with $T$ auxiliary variables
Figure 1: Inference models for VAE, AVAE, and AVAE with $T$ auxiliary random variables (the generative model is fixed as shown in Figure 1(a)). Note that multiple arrows pointing to a node indicate one stochastic layer, with the source nodes concatenated as the input to the stochastic layer and the target node as the stochastic output. One stochastic layer may consist of multiple deterministic layers. (For the detailed architecture used in the experiments, refer to Section 5.)

AVAE enjoys the following properties:

  • VAEs are special cases of AVAE. Conventional variational autoencoders can be seen as special cases of AVAE with no auxiliary variables assumed;

  • No density evaluations for $q(\tau|x)$ are required. One key advantage brought by the auxiliary variable is that both terms inside the inner expectations of $\mathcal{L}_{\text{AVAE}}$ and $\mathcal{L}_{\text{IW-AVAE}}^k$ do not involve $q(\tau|x)$, hence no density evaluations of $q(\tau|x)$ are required when Monte Carlo samples of $\tau$ are used to optimize the above bounds.

  • Flexible variational posterior. To fully enrich variational model flexibility, we use a neural network $f$ to implicitly model $q(\tau|x)$, by sampling $\tau$ given $x$ and a random Gaussian noise vector $\epsilon$ as

    $\tau = f(x, \epsilon), \quad \epsilon \sim \mathcal{N}(0, I)$  (8)

    Due to the flexible representational power of $f$, the implicit density $q(\tau|x)$ can be arbitrarily complex. Further, we assume $q(z|x,\tau)$ to be Gaussian, with its mean and variance parameterized by neural networks. Since the actual variational posterior is $q(z|x) = \int q(\tau|x)\,q(z|x,\tau)\,d\tau$, a complex posterior can be achieved even when a simple density family is assumed for $q(z|x,\tau)$, thanks to the possibly flexible family of implicit densities over $\tau$ defined by $f$. (An illustration can be found in Section 5.1; a code sketch follows this list.)
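Below is a minimal sketch, under our reconstructed notation, of such an inference network: an implicit sampler $\tau = f(x, \epsilon)$ followed by an explicit Gaussian stochastic layer for $z$ conditioned on $(x, \tau)$. Layer sizes and the exact wiring are illustrative assumptions, not the paper's exact architecture.

import math
import torch
import torch.nn as nn

class AVAEInferenceNet(nn.Module):
    """q(tau|x) is implicit (sampling only); q(z|x,tau) is an explicit diagonal Gaussian."""
    def __init__(self, x_dim=784, eps_dim=50, h_dim=200, z_dim=50):
        super().__init__()
        self.eps_dim = eps_dim
        # tau = f(x, eps): a deterministic network fed with Gaussian noise
        self.f = nn.Sequential(nn.Linear(x_dim + eps_dim, h_dim), nn.Tanh())
        # Gaussian stochastic layer for z, conditioned on (x, tau)
        self.mu = nn.Linear(x_dim + h_dim, z_dim)
        self.logvar = nn.Linear(x_dim + h_dim, z_dim)

    def forward(self, x):
        eps = torch.randn(x.shape[0], self.eps_dim, device=x.device)
        tau = self.f(torch.cat([x, eps], dim=1))       # implicit sample; its density is never needed
        h = torch.cat([x, tau], dim=1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # re-parameterized sample
        # log q(z|x,tau): the only variational density the bounds evaluate
        log_q = -0.5 * (((z - mu) ** 2) / logvar.exp() + logvar + math.log(2 * math.pi))
        return z, tau, log_q.sum(dim=1)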

For completeness, we briefly include the following.

Proposition 1

Both $\mathcal{L}_{\text{AVAE}}$ and $\mathcal{L}_{\text{IW-AVAE}}^k$ are lower bounds of the true data log-likelihood, satisfying $\log p(x) = \mathbb{E}_{q(\tau|x)}\left[\log \mathbb{E}_{q(z|x,\tau)}\left[\frac{p(x,z)}{q(z|x,\tau)}\right]\right] \ge \mathcal{L}_{\text{IW-AVAE}}^k(x) \ge \mathcal{L}_{\text{AVAE}}(x)$.

The proof follows directly from Jensen's inequality and is hence omitted.

Remark 1 Though the first equality holds for any choice of the distribution $q(\tau|x)$ (whether it depends on $x$ or not), for practical estimation with Monte Carlo methods it becomes an inequality, and the bound tightens as the number of importance samples $k$ is increased (Burda et al., 2015). The second inequality always holds when estimated with Monte Carlo samples.

Remark 2 The above bounds are only concerned with one auxiliary variable $\tau$; in fact, $\tau$ can also be a set of auxiliary variables. Moreover, with the same motivation, we can make the variational family of AVAE even more flexible by defining a series of auxiliary variables $\tau_1, \dots, \tau_T$, with the sample generation process for all $\tau_t$'s defined as

$\tau_t = f_t(\tau_{t-1}, \epsilon_t), \quad t = 1, \dots, T, \quad \text{with } \tau_0 = x$  (9)

where all $\epsilon_t$ are random noise vectors and all $f_t$ are neural networks to be learned. Accordingly, we have

Proposition 2

The AVAE with $T$ auxiliary random variables also yields lower bounds of the true log-likelihood, satisfying $\log p(x) \ge \mathcal{L}_{\text{IW-AVAE}_T}^k(x) \ge \mathcal{L}_{\text{AVAE}_T}(x)$, where

$\mathcal{L}_{\text{AVAE}_T}(x) = \mathbb{E}_{q(\tau_{1:T}|x)}\left[\mathbb{E}_{q(z|x,\tau_{1:T})}\left[\log \frac{p(x,z)}{q(z|x,\tau_{1:T})}\right]\right]$  (10)

and

$\mathcal{L}_{\text{IW-AVAE}_T}^{k}(x) = \mathbb{E}_{q(\tau_{1:T}|x)}\left[\mathbb{E}_{z_1,\dots,z_k \sim q(z|x,\tau_{1:T})}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x,\tau_{1:T})}\right]\right]$  (11)

Figure 1(c) illustrates the inference model of an AVAE with $T$ auxiliary variables.
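A sketch of the chained sampler in Eq. 9, with $\tau_0 = x$ and one fully connected block per auxiliary variable; the dimensions and the concatenation-based wiring are illustrative assumptions consistent with Figure 1(c):

import torch
import torch.nn as nn

class AuxiliaryChain(nn.Module):
    """Samples tau_1, ..., tau_T via tau_t = f_t([tau_{t-1}, eps_t]), with tau_0 = x."""
    def __init__(self, x_dim=784, eps_dim=50, tau_dim=200, T=2):
        super().__init__()
        dims = [x_dim] + [tau_dim] * T
        self.fs = nn.ModuleList([
            nn.Sequential(nn.Linear(dims[t] + eps_dim, tau_dim), nn.Tanh())
            for t in range(T)])
        self.eps_dim = eps_dim

    def forward(self, x):
        taus, tau = [], x
        for f in self.fs:
            eps = torch.randn(x.shape[0], self.eps_dim, device=x.device)
            tau = f(torch.cat([tau, eps], dim=1))   # sampling only; no density evaluation for tau
            taus.append(tau)
        return taus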

3.2 Learning with Importance Weighted Auxiliary Samples

For both AVAE and IW-AVAE, we can estimate the corresponding bounds $\mathcal{L}_{\text{AVAE}}$ and $\mathcal{L}_{\text{IW-AVAE}}^k$ and their gradients with ancestral sampling from the model. For example, with one auxiliary variable $\tau$, we estimate

$\hat{\mathcal{L}}_{\text{AVAE}}(x) = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{k}\sum_{i=1}^{k}\log\frac{p(x,z_{ij})}{q(z_{ij}|x,\tau_j)}, \quad \tau_j \sim q(\tau|x),\; z_{ij} \sim q(z|x,\tau_j)$  (12)

and

$\hat{\mathcal{L}}_{\text{IW-AVAE}}^{k}(x) = \frac{1}{m}\sum_{j=1}^{m}\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_{ij})}{q(z_{ij}|x,\tau_j)}$  (13)

where $m$ is the number of $\tau$'s sampled from the current $q(\tau|x)$ and $k$ is the number of $z$'s sampled from the implicit conditional $q(z|x)$, which is by definition achieved by first sampling $\tau$ from $q(\tau|x)$ and subsequently sampling $z$ from $q(z|x,\tau)$. The parameters of both the inference model and the generative model are jointly learned by maximizing the above bounds. Besides, backpropagation through the stochastic variable $z$ (typically assumed to be Gaussian for continuous latent variables) is possible through the re-parameterization trick, and the same naturally holds for all the auxiliary variables since they are constructed in a generative manner.

The inner term $\frac{1}{k}\sum_i p(x,z_{ij})/q(z_{ij}|x,\tau_j)$ is essentially a $k$-sample importance weighted estimate of $p(x)$, hence it is reasonable to believe that more samples will lead to a less noisy estimate and thus a more accurate inference model $q(z|x)$. It is worth pointing out that, for AVAE, additional samples of $\tau$ come almost at no cost when multiple samples of $z$ are generated to optimize $\mathcal{L}_{\text{AVAE}}$ and $\mathcal{L}_{\text{IW-AVAE}}^k$, since sampling a $z$ from the inference model also generates an intermediate sample of $\tau$; thus we can always reuse those samples of $\tau$ when estimating the bounds. For this purpose, in our experiments we always set $m = k$, so that no separate process of sampling $\tau$ is needed in estimating the lower bounds. This also ensures that the forward pass and backward pass time complexity of the inference model is the same as for the conventional VAE and IWAE. In fact, as we will show in our empirical evaluations, AVAE performs at least comparably to VAE, while IW-AVAE always outperforms IWAE, i.e., its counterpart with no auxiliary variables.
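A hedged sketch of one training step under the $m = k$ convention: each of the $k$ latent samples is drawn ancestrally ($\tau$ first, then $z$), so the $\tau$ samples needed for the outer expectation come for free. The helper names, shapes, and the exact weighting scheme are assumptions, not the authors' reference implementation.

import math
import torch

def training_objective(x, infer_net, gen_net, k=5, importance_weighted=True):
    """Estimate the AVAE / IW-AVAE bound with k ancestral samples per data point."""
    log_ws = []
    for _ in range(k):
        z, tau, log_q = infer_net(x)          # tau is sampled, never evaluated
        log_p_xz = gen_net.log_joint(x, z)    # log p(x|z) + log p(z); assumed helper
        log_ws.append(log_p_xz - log_q)
    log_w = torch.stack(log_ws, dim=0)        # shape (k, batch)
    if importance_weighted:                   # IW-AVAE style estimate
        bound = torch.logsumexp(log_w, dim=0) - math.log(k)
    else:                                     # AVAE style estimate
        bound = log_w.mean(dim=0)
    return -bound.mean()                      # negative bound as the loss to minimize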

4 Connection to Related Methods

Before we proceed to the experimental evaluations of the proposed methods, we highlight the relations of AVAE to other similar methods.

4.1 Other methods with auxiliary variables

Hierarchical Variational Models (HVM) (Ranganath et al., 2016) and Auxiliary Deep Generative Models (ADGM) (Maaløe et al., 2016) are two closely related variational methods with auxiliary variables. HVM also considers enriching the variational model family, by placing a prior over the latent variable of the variational distribution. ADGM takes another route to this goal, placing a prior over the auxiliary variable in the generative model, which in some cases keeps the marginal generative distribution of the data invariant. It has been shown by (Brümmer, 2016) that HVM and ADGM are mathematically equivalent.

However, our proposed method does not add any prior to the generative model and thus does not change the structure of the generative model. We emphasize that our proposed method makes the fewest assumptions about the generative model and that our proposal is orthogonal to these related methods; it can therefore be integrated with previous methods with auxiliary variables to further boost performance on accurate posterior approximation and generative modeling.

4.2 Adversarial learning based inference models

Adversarial learning based inference models, such as Adversarial Autoencoders (Makhzani et al., 2015), Adversarial Variational Bayes (Mescheder et al., 2017), and Adversarially Learned Inference (Dumoulin et al., 2016), aim to maximize the ELBO without any variational likelihood evaluations at all. It can be shown that, when the discriminator is trained to optimality, these models are equivalent to optimizing the ELBO. However, due to the minimax game involved in the adversarial setting, in practice there is no guarantee at any given moment that they are optimizing a lower bound of the true data likelihood, and thus no maximum likelihood learning interpretation can be provided. In our proposed framework, by contrast, we do not require variational density evaluations for the flexible auxiliary variables, while still maintaining the maximum likelihood interpretation.

5 Experiments

5.1 Flexible Variational Family of AVAE

To test the effect of adding auxiliary variables to the inference model, we consider two unnormalized 2D target densities $p^*(z)$ (the example densities originate from (Rezende and Mohamed, 2015)).

We construct an inference model $q(z|x)$ to approximate the target density by minimizing the KL divergence (since the inference model of a VAE defines a conditional variational posterior $q(z|x)$ while the target density is independent of $x$, we set $x$ to be fixed; in this synthetic example, $x$ is set to an all-one vector of dimension 10):

$\mathrm{KL}\left(q(z|x)\,\|\,p^*(z)\right) = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p^*(z)}\right]$  (14)
Figure 2: (a) True density; (b) Density learned by VAE; (c) Density learned by AVAE.

Figure 2 illustrates the target densities as well as the ones learned by VAE and AVAE, respectively. It is unsurprising that a standard VAE with a Gaussian stochastic layer as its inference model can only produce Gaussian density estimates (Figure 2(b)). With the help of the introduced auxiliary random variables, however, AVAE is able to match the non-Gaussian target densities (Figure 2(c)), even though the last stochastic layer of the inference model, i.e., $q(z|x,\tau)$, is also Gaussian.
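For completeness, the link between Eq. 14 and the quantity actually maximized during training can be made explicit. Writing the unnormalized target as $\tilde p(z)$ with normalizer $Z$, so that $p^*(z) = \tilde p(z)/Z$, a standard identity (stated here in our reconstructed notation) gives

$\mathbb{E}_{q(z|x)}\left[\log \frac{\tilde p(z)}{q(z|x)}\right] = \log Z - \mathrm{KL}\left(q(z|x)\,\|\,p^*(z)\right),$

so maximizing the left-hand side minimizes the KL divergence in Eq. 14; in AVAE, the intractable $\log q(z|x)$ is replaced by $\log q(z|x,\tau)$ under the joint $q(\tau|x)\,q(z|x,\tau)$, in the same way as in Eq. 6.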

5.2 Handwritten Digits and Characters

To test AVAE for variational inference we use the standard benchmark datasets MNIST (http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/) and OMNIGLOT (https://github.com/yburda/iwae/raw/master/datasets/OMNIGLOT/chardata.mat) (Lake et al., 2013). Our method is general and can be applied to any formulation of the generative model $p(x,z)$. For simplicity and fair comparison, in this paper we focus on generative models defined by stochastic neural networks, i.e., a family of generative models with their parameters defined by neural networks. Specifically, we consider the following two types of generative models:

  1. $p_1$: a generative model with a single Gaussian stochastic layer for $z$ with 50 units. Between the latent variable $z$ and the observation $x$ there are two deterministic layers, each with 200 units;

  2. $p_2$: a generative model with two Gaussian stochastic layers for $z_2$ and $z_1$ with 50 and 100 units, respectively. Two deterministic layers with 200 units connect the observation $x$ and the latent variable $z_1$, and two deterministic layers with 100 units connect $z_1$ and $z_2$.

A Gaussian stochastic layer consists of two fully connected linear layers, one outputting the mean and the other the logarithm of the diagonal covariance. All other deterministic layers are fully connected with tanh nonlinearity. The same network architectures for $p_1$ and $p_2$ are also used in (Burda et al., 2015).
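As a concrete, hedged sketch, the single-stochastic-layer generative model $p_1$ described above could be implemented as follows; the class name, the Bernoulli-logit output parameterization, and the log_joint helper are our own conventions rather than details taken from the paper:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeP1(nn.Module):
    """p(z) = N(0, I) over 50 units; p(x|z) = Bernoulli with logits from a 200-200 tanh MLP."""
    def __init__(self, z_dim=50, h_dim=200, x_dim=784):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.Tanh(),
            nn.Linear(h_dim, h_dim), nn.Tanh(),
            nn.Linear(h_dim, x_dim))              # Bernoulli logits

    def log_joint(self, x, z):
        """log p(x, z) = log p(x|z) + log p(z), evaluated per data point."""
        logits = self.decoder(z)
        log_px_z = -F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(dim=1)
        log_pz = (-0.5 * (z ** 2 + math.log(2 * math.pi))).sum(dim=1)
        return log_px_z + log_pz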

For $p_1$, an inference network with the following architecture is used by AVAE with $T$ auxiliary variables:

$x = \tau_0 \;\rightarrow\; \tau_1 \;\rightarrow\; \dots \;\rightarrow\; \tau_T \;\rightarrow\; z, \qquad \tau_t = f_t(\tau_{t-1} \oplus \epsilon_t)$

where $\tau_0$ is defined as the input $x$, all $f_t$ are implemented as fully connected layers with tanh nonlinearity, and $\oplus$ denotes the concatenation operator. All noise vectors $\epsilon_t$ are set to be of 50 dimensions, and all other variables have the corresponding dimensions in the generative model. The inference network used for $p_2$ is the same, except that the Gaussian stochastic layer is defined for $z_1$. An additional Gaussian stochastic layer for $z_2$ is defined with $z_1$ as input, with the dimensions of the variables aligned to those in the generative model. Further, Bernoulli observation models are assumed for both MNIST and OMNIGLOT. For MNIST, we employ the static binarization strategy as in (Larochelle and Murray, 2011), while dynamic binarization is employed for OMNIGLOT.

Our baseline models include VAE and IWAE. Since our proposed method adds more layers to the inference network, we also include an enhanced version of VAE with additional deterministic layers added to its inference network, which we term VAE+ (VAE+ is a restricted version of AVAE with all the noise vectors $\epsilon_t$ set to constant 0, but with the additional layers for the $\tau_t$'s retained), as well as its importance weighted variant IWAE+. To eliminate discrepancies in the implementation details of models reported in the literature, we implement all models and carry out the experiments under the same setting: all models are implemented in PyTorch (http://pytorch.org/) and their parameters are optimized with Adam (Kingma and Ba, 2014) for 2000 epochs, with an initial learning rate of 0.001, cosine annealing for learning rate decay (Loshchilov and Hutter, 2016), and exponential decay rates for the 1st and 2nd moments of 0.9 and 0.999, respectively. Batch normalization (Ioffe and Szegedy, 2015) is applied to all fully connected layers, except for the final output layer of the generative model, as it has been shown to improve learning for neural stochastic models (Sønderby et al., 2016). Linear annealing of the KL divergence term between the variational posterior and the prior, from 0 to 1 over the first 200 epochs, is adopted in all the loss functions, as it has been shown to help training stochastic neural networks with multiple layers of latent variables (Sønderby et al., 2016). Code to reproduce all reported results will be made publicly available.
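A sketch of the optimization setup just described, using standard PyTorch components (this mirrors the stated hyper-parameters; it is not the authors' released code):

import torch

def make_optimizer(model, epochs=2000):
    # Adam with lr=0.001, betas=(0.9, 0.999), plus cosine annealing of the learning rate
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def kl_weight(epoch, warmup_epochs=200):
    """Linear annealing of the KL term from 0 to 1 over the first 200 epochs."""
    return min(1.0, epoch / warmup_epochs)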

5.2.1 Generative Density Estimation

Table 1: MNIST and OMNIGLOT test set NLL with generative models $p_1$ and $p_2$ (lower is better), comparing VAE (Burda et al., 2015), VAE+ and AVAE, as well as their importance weighted counterparts IWAE (Burda et al., 2015), IWAE+ and IW-AVAE. For VAE+ the model suffix denotes the number of additional layers added, and for AVAE it denotes the number of auxiliary variables added. For each column, the best result for each of the two model types (VAE based and IWAE based) is printed in bold.

For both MNIST and OMNIGLOT, all models are trained and tuned on the training and validation sets, and the estimated log-likelihood on the test set with 128 importance weighted samples is reported. Table 1 presents the performance of all models on both $p_1$ and $p_2$.

Firstly, VAE+ achieves slightly higher log-likelihood estimates than the vanilla VAE due to the additional layers added in the inference network, implying that a better Gaussian posterior approximation is learned. Second, AVAE achieves lower NLL estimates than VAE+, increasingly so with more samples from the auxiliary variables, which confirms our expectations that: a) adding auxiliary variables to the inference network leads to a richer family of variational distributions; and b) more samples of the auxiliary variables yield a more accurate estimate of the variational posterior $q(z|x)$. Lastly, with more importance weighted samples of both $\tau$ and $z$, i.e., the IW-AVAE variants, the best data density estimates are achieved. Overall, on MNIST AVAE outperforms VAE by 1.5 nats on $p_1$ and 1.3 nats on $p_2$; IW-AVAE outperforms IWAE by about 1.0 nat on $p_1$ and 0.5 nats on $p_2$. Similar trends can be observed on OMNIGLOT, with AVAE and IW-AVAE outperforming the conventional VAE and IWAE in all cases, except that IWAE+ slightly outperforms IW-AVAE.

Compared with previous methods under similar settings, IW-AVAE achieves a best NLL of 83.77, significantly better than the 85.10 achieved by Normalizing Flows (Rezende and Mohamed, 2015). The best density modeling on statically binarized MNIST is achieved by PixelRNN (Oord et al., 2016; Salimans et al., 2017) among autoregressive models and by Inverse Autoregressive Flows (Kingma et al., 2016) among latent variable models; however, it is worth noting that much more sophisticated generative models are adopted in those methods, whereas AVAE enhances the standard VAE by focusing on enriching inference model flexibility, which pursues an orthogonal direction of improvement. Therefore, AVAE can be integrated with the above-mentioned methods to further improve performance on latent variable generative modeling.

5.2.2 Latent Code Visualization

We visualize the latent codes inferred by different models for digits in the MNIST test set, colored by their true class labels, in Figure 3, using t-SNE (Maaten and Hinton, 2008).

Figure 3: Left: VAE, Middle: VAE+, Right: AVAE. Visualization of inferred latent codes for 5000 MNIST digits in the test set (best viewed in color).

We observe that all three models are able to infer latent codes of the digits consistent with their true classes. However, VAE and VAE+ still show disconnected clusters of latent codes from the same class (classes 0 and 1) and overlapping latent codes from different classes (classes 3 and 5), while AVAE outputs clearly separable latent codes for the different classes (notably for classes 0, 1, 5, 6 and 7).
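The visualization step itself is straightforward; a minimal sketch using scikit-learn's t-SNE on the inferred latent codes (variable names are placeholders):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latents(z, labels):
    """z: (N, d) array of inferred latent codes; labels: (N,) digit classes."""
    z2d = TSNE(n_components=2).fit_transform(z)
    plt.scatter(z2d[:, 0], z2d[:, 1], c=labels, cmap='tab10', s=4)
    plt.colorbar()
    plt.show()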

5.2.3 Reconstruction and Generated Samples

Generative samples can be obtained from a trained model by feeding latent codes sampled from the prior to the learned generative model $p_1$ (or $p_2$). Since higher log-likelihood estimates are obtained with $p_2$, Figure 4 shows real samples from the dataset, their reconstructions, and random samples from AVAE trained on $p_2$, for both MNIST and OMNIGLOT. We observe that the reconstructions align well with the input data and that random samples generated by the models are visually consistent with the training data.
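A sketch of how the random samples and reconstructions in Figure 4 can be produced from a trained model, reusing the (assumed) interfaces of the earlier sketches:

import torch

@torch.no_grad()
def random_samples(gen_net, n=64, z_dim=50):
    z = torch.randn(n, z_dim)                    # draw latent codes from the Gaussian prior
    probs = torch.sigmoid(gen_net.decoder(z))    # Bernoulli means from the decoder
    return torch.bernoulli(probs)

@torch.no_grad()
def reconstruct(infer_net, gen_net, x):
    z, _, _ = infer_net(x)                       # ancestral sample from the inference model
    return torch.sigmoid(gen_net.decoder(z))     # reconstruction probabilities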

(a) Data
(b) Reconstruction
(c) Random samples
Figure 4: Training data, its reconstruction and random samples. (Upper: MNIST, Lower: OMNIGLOT)

6 Conclusions

This paper presents AVAE, a new framework that enriches the variational family for variational inference by incorporating auxiliary variables into the inference model. The resulting inference model essentially learns a richer probabilistic mixture of simple variational posteriors indexed by the auxiliary variables. We emphasize that no density evaluations are required for the auxiliary variables, hence neural networks can be used to construct complex implicit distributions over them. Empirical evaluations of two variants of AVAE demonstrate the effectiveness of incorporating auxiliary variables into variational inference for generative modeling.

References