1 Introduction
Training of likelihood-based deep generative models has advanced rapidly in recent years. These advances have been enabled by amortized inference (Hinton et al., 1995; Mnih & Gregor, 2014; Gregor et al., 2013), which scales up training of variational models; the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014), which provides low-variance gradient estimates; and increasingly expressive neural networks. Combinations of these techniques have resulted in many different forms of variational autoencoders (VAEs) (Kingma et al., 2016; Chen et al., 2016). It is also widely recognized that flexible approximate posterior distributions improve the generative quality of VAEs by more faithfully modeling true posterior distributions. More accurate posterior models have been realized by reducing the gap between true and approximate posteriors with tighter bounds (Burda et al., 2015), and by using autoregressive architectures (Gregor et al., 2013, 2015) or flow-based inference (Rezende & Mohamed, 2015).
A complementary (but unexplored) direction for further improving the richness of posterior distributions is available through the use of undirected graphical models (UGMs). This is compelling as UGMs can succinctly capture complex multimodal relationships between variables. However, there are three challenges when training UGMs as approximate posteriors: i) There is no known low-variance pathwise gradient estimator (like the reparameterization trick) for learning parameters of UGMs. ii) Sampling from general UGMs is intractable, and approximate sampling is computationally expensive. iii) Evaluating the probability of a sample under a UGM can require an intractable partition function. The latter two costs are particularly acute when UGMs are used as posterior approximators because the number of UGMs grows with the size of the dataset.
However, we note that posterior UGMs conditioned on observed data points tend to be simple, as there are usually only a small number of modes in the latent space that explain an observation. We expect, then, that sampling and partition function estimation (challenges ii and iii) can be solved efficiently using parallelizable Markov chain Monte Carlo (MCMC) methods. In fact, we observe that UGM posteriors trained in a VAE model tend to be unimodal but with a mode structure that is not necessarily well-captured by mean-field posteriors. Nevertheless, in this case, MCMC-based sampling methods quickly equilibrate.
To address challenge i), we estimate the gradient by reparameterized sampling in the MCMC updates. Although an infinite number of MCMC updates is theoretically required to obtain an unbiased gradient estimator, we observe that a single MCMC update provides a low-variance but biased gradient estimate that is usually sufficient for training posterior UGMs.
UGMs in the form of Boltzmann machines (Ackley et al., 1985) have recently been shown to be effective as priors for VAEs, allowing for discrete versions of VAEs (Rolfe, 2016; Vahdat et al., 2018b, a). However, previous work in this area has relied on directed graphical models (DGMs) for posterior approximation. In this paper, we replace the DGM posteriors of discrete VAEs (DVAEs) with Boltzmann machine UGMs and show that these posteriors provide a generative performance comparable to or better than previous DVAEs. We denote this model as DVAE##, where ## indicates the use of UGMs in both the prior and the posterior.
We begin by summarizing related work on developing expressive posterior models, including models trained non-variationally. Non-variational models employ MCMC to directly sample posteriors and differ from our variational approach, which uses MCMC to sample from an amortized undirected posterior. Sec. 2 provides the necessary background on variational learning and MCMC methods to allow for the development in Sec. 3 of a gradient estimator for undirected posterior models. We provide examples for both Gaussian (Sec. 3.1) and Boltzmann machine (Sec. 3.2) UGMs. Experimental results are provided in Sec. 4 on VAEs (Sec. 4.1), importance-weighted VAEs (Sec. 4.2), and structured prediction (Sec. 4.3), where we observe consistent improvement using UGMs. We conclude in Sec. 5 with a list of future work.
1.1 Related Work
Inference gap reduction in VAEs: Previous work on reducing the gap between true and approximate posteriors can be grouped into three categories: i) training objectives that replace the variational bound with tighter bounds on data log-likelihood (Burda et al., 2015; Li & Turner, 2016; Bornschein et al., 2016); ii) autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013) that use DGMs to form more flexible distributions (Gregor et al., 2013; Gulrajani et al., 2016; Van Den Oord et al., 2016; Salimans et al., 2017; Chen et al., 2018); and iii) flow-based models (Rezende & Mohamed, 2015) that use a class of invertible neural networks with simple Jacobians to map the input space to a latent space in which dependencies can be disentangled (Kingma et al., 2016; Kingma & Dhariwal, 2018; Dinh et al., 2014, 2016). To the best of our knowledge, UGMs have not been used as approximate posteriors.

Gradient estimation in latent variable models: REINFORCE (Williams, 1992) is the most generic approach for computing the gradient of the approximate posteriors. In practice, however, this estimator suffers from high variance and must be augmented by variance reduction techniques. For a large class of continuous latent variable models, the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014) provides lower-variance gradient estimates. Reparameterization does not apply to discrete latent variables, and recent methods for discrete variables have focused on REINFORCE with control variates (Mnih & Gregor, 2014; Gu et al., 2015; Mnih & Rezende, 2016; Tucker et al., 2017; Grathwohl et al., 2018) or continuous relaxations (Maddison et al., 2016; Jang et al., 2016; Rolfe, 2016; Vahdat et al., 2018b, a). See (Andriyash et al., 2018) for a review of recent techniques.
Variational Inference and MCMC: A common alternative to variational inference is to use MCMC to sample directly from posteriors in generative models (Welling & Teh, 2011; Salimans et al., 2015; Wolf et al., 2016; Hoffman, 2017; Li et al., 2017; Caterini et al., 2018). Our method differs from these techniques, as we use MCMC to sample from an amortized undirected posterior.
2 Background
In this section we provide background for the topics discussed in this paper.
Undirected graphical models: A UGM represents the joint probability distribution for a set of random variables $\mathbf{z}$ as $q(\mathbf{z}) = e^{-E(\mathbf{z})}/Z$, where $E(\mathbf{z})$ is a parameterized energy function defined by an undirected graphical model, and $Z = \sum_{\mathbf{z}} e^{-E(\mathbf{z})}$ (or the corresponding integral) is the partition function. For example, the multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and precision matrix $\boldsymbol{\Lambda}$ is a UGM with energy function $E(\mathbf{z}) = \tfrac{1}{2}(\mathbf{z}-\boldsymbol{\mu})^T \boldsymbol{\Lambda} (\mathbf{z}-\boldsymbol{\mu})$ and $Z = (2\pi)^{n/2}\,|\boldsymbol{\Lambda}|^{-1/2}$.

UGMs with quadratic energy functions defined over binary variables $\mathbf{z} \in \{0,1\}^n$ are called Boltzmann machines (BMs). We use restricted Boltzmann machines (RBMs) defined on a bipartite graph. An RBM has energy function $E(\mathbf{z}_1, \mathbf{z}_2) = -(\mathbf{a}^T \mathbf{z}_1 + \mathbf{b}^T \mathbf{z}_2 + \mathbf{z}_1^T \mathbf{W} \mathbf{z}_2)$, where $\mathbf{a}$ and $\mathbf{b}$ indicate linear biases, $\mathbf{W}$ represents pairwise interactions, and $\mathbf{z} = \{\mathbf{z}_1, \mathbf{z}_2\}$. RBMs allow for parallel Gibbs sampling updates as $q(\mathbf{z}_1|\mathbf{z}_2)$ and $q(\mathbf{z}_2|\mathbf{z}_1)$ are factorial in the components of $\mathbf{z}_1$ and $\mathbf{z}_2$.

Variational autoencoders: A VAE is a generative model factored as $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\,p(\mathbf{x}|\mathbf{z})$, where $p(\mathbf{z})$ is a prior distribution over latent variables and $p(\mathbf{x}|\mathbf{z})$ is a probabilistic decoder representing the conditional distribution over data variables $\mathbf{x}$ given $\mathbf{z}$. Since maximizing the marginal log-likelihood $\log p(\mathbf{x})$ is intractable in general, VAEs are trained by maximizing a variational lower bound (ELBO) on $\log p(\mathbf{x})$:
$$\log p(\mathbf{x}) \;\ge\; \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[\log p(\mathbf{x}|\mathbf{z})\big] - \mathrm{KL}\big(q(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\big) \qquad (1)$$
where $q(\mathbf{z}|\mathbf{x})$ is a probabilistic encoder that approximates the posterior over latent variables given a data point. For continuous latent variables, the ELBO is typically optimized using the reparameterization trick. With reparameterization, the expectation with respect to $q(\mathbf{z}|\mathbf{x})$ is replaced with an expectation over a simple continuous base distribution. Samples from the base distribution are transformed under a differentiable mapping to samples from $q(\mathbf{z}|\mathbf{x})$.
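As a concrete illustration, a Gaussian posterior can be reparameterized as $z = \mu + \sigma\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; a minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def reparam_sample(mu, log_sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) via z = mu + sigma * eps, eps ~ N(0, I).

    Because z is a differentiable function of (mu, log_sigma), gradients of
    E_q[f(z)] can flow through the sample itself (the pathwise derivative).
    """
    eps = rng.standard_normal(mu.shape)  # sample from the base distribution N(0, I)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
z = reparam_sample(np.zeros(4), np.zeros(4), rng)  # a 4-dimensional sample
```

In an autodiff framework the same expression yields low-variance gradient estimates with respect to `mu` and `log_sigma`.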
When $q(\mathbf{z}|\mathbf{x})$ is factorized using a directed graphical model (e.g., $q(\mathbf{z}|\mathbf{x}) = \prod_i q(z_i|\mathbf{z}_{<i}, \mathbf{x})$), the reparameterization trick is applied to each conditional and full joint samples are assembled using ancestral sampling.
This approach cannot be directly applied to binary latent variables because there is no differentiable mapping from samples of a continuous base distribution to a binary distribution. However, several continuous relaxations have recently been proposed to mitigate this problem. These approaches can be grouped into two categories. In the first category (Maddison et al., 2016; Vahdat et al., 2018b, a), the binary random variables are relaxed to continuous random variables whose distributions in the limit of low temperature approach the original binary distributions. Instead of optimizing the objective in Eq. (1), a new objective is defined on the generative model with relaxed variables. In the second category (Jang et al., 2016; Khoshaman & Amin, 2018; Andriyash et al., 2018), the objective function inside the expectation in Eq. (1) is relaxed directly by replacing the discrete variables with their continuous counterparts. Although this relaxation does not yield a consistent probabilistic objective, it produces results on par with the first group.

Importance weighted (IW) bounds: The variational lower bound in Eq. (1) on the data log-likelihood can be tightened using importance weighting. A $K$-sample estimate of the log-likelihood is
$$\log p(\mathbf{x}) \;\ge\; \mathbb{E}_{\mathbf{z}^{(k)} \sim q(\mathbf{z}|\mathbf{x})}\Big[\log \frac{1}{K}\sum_{k=1}^{K} w_k\Big], \qquad w_k = \frac{p(\mathbf{x}, \mathbf{z}^{(k)})}{q(\mathbf{z}^{(k)}|\mathbf{x})} \qquad (2)$$

Both continuous relaxation and reparameterized sampling can be used for optimizing this objective function.
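The $K$-sample bound in Eq. (2) is evaluated in practice with a log-sum-exp computation for numerical stability; a minimal sketch, assuming the per-sample log-probabilities have already been computed:

```python
import numpy as np

def iw_bound(log_p_xz, log_q_z):
    """K-sample importance-weighted estimate of log p(x).

    log_p_xz[k] = log p(x, z_k) and log_q_z[k] = log q(z_k | x) for
    samples z_k ~ q(z|x).  Uses the log-sum-exp trick so that large
    negative log-probabilities do not underflow.
    """
    log_w = np.asarray(log_p_xz) - np.asarray(log_q_z)  # log importance weights
    m = log_w.max()
    # log (1/K) sum_k exp(log_w[k]) computed stably
    return m + np.log(np.mean(np.exp(log_w - m)))
```

With equal weights the bound reduces to the ELBO term itself; e.g. `iw_bound([-1.0, -1.0], [0.0, 0.0])` evaluates to `-1.0`.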
MCMC methods: MCMC encompasses a variety of methods for drawing approximate samples from a target distribution $q(\mathbf{z})$. Each MCMC method is characterized by a transition kernel $K(\mathbf{z}|\mathbf{z}')$ whose equilibrium distribution is equal to $q(\mathbf{z})$. MCMC can be considered a stochastic approximate solution to the fixed-point equation:

$$q(\mathbf{z}) = \int K(\mathbf{z}|\mathbf{z}')\, q(\mathbf{z}')\, d\mathbf{z}' \qquad (3)$$

which is solved by sampling from an initial distribution $q^{(0)}(\mathbf{z})$ and iteratively updating the samples by drawing conditional samples using $K$. Denoting the distribution of the samples after $m$ iterations by $q^{(m)}(\mathbf{z}) = \mathbb{E}_{q^{(0)}(\mathbf{z}')}[K^m(\mathbf{z}|\mathbf{z}')]$, where $K^m$ represents $m$ applications of the transition kernel, the theory of MCMC shows that under some regularity conditions $q^{(m)} \to q$ as $m \to \infty$. In practice, $K$ is usually chosen to satisfy the detailed balance condition:

$$K(\mathbf{z}|\mathbf{z}')\, q(\mathbf{z}') = K(\mathbf{z}'|\mathbf{z})\, q(\mathbf{z}) \qquad (4)$$

which implies Eq. (3). Finally, $K^m$ is itself a valid transition kernel and satisfies Eq. (3) and Eq. (4).
Gibbs sampling: Gibbs sampling is a particular MCMC method for drawing samples from UGMs. In this method, a single variable $z_i$ is updated at each iteration by sampling from the conditional $q(z_i|\mathbf{z}_{\setminus i})$. The method can be inefficient as a complete sweep requires a sequential update of all variables. However, when the UGM is defined on a bipartite graph, $\mathbf{z}$ can be partitioned into disjoint sets $\mathbf{z}_1$ and $\mathbf{z}_2$, and the conditional independence among the components within each set can be leveraged to update each partition in parallel. (To be precise, a bipartite graph is not a sufficient condition; the energy function must also have a form in which the conditionals $q(\mathbf{z}_1|\mathbf{z}_2)$ and $q(\mathbf{z}_2|\mathbf{z}_1)$ can be formed.) In this case, all variables can be updated in two half-sweeps. The transition kernel is written as:

$$K(\mathbf{z}|\mathbf{z}') = q(\mathbf{z}_2|\mathbf{z}_1)\, q(\mathbf{z}_1|\mathbf{z}'_2) \qquad (5)$$
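The two half-sweeps of Eq. (5) can be sketched for an RBM, assuming the energy convention $E(\mathbf{z}_1,\mathbf{z}_2) = -(\mathbf{a}^T\mathbf{z}_1 + \mathbf{b}^T\mathbf{z}_2 + \mathbf{z}_1^T \mathbf{W} \mathbf{z}_2)$ so that both conditionals are factorial logistic distributions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(z1, a, b, W, rng):
    """One full Gibbs sweep (two half-sweeps) for an RBM with energy
    E(z1, z2) = -(a^T z1 + b^T z2 + z1^T W z2) (sign convention assumed).

    Because the graph is bipartite, all of z2 is resampled in parallel
    given z1, then all of z1 given the new z2.
    """
    p2 = sigmoid(b + z1 @ W)        # q(z2 = 1 | z1), factorial over components
    z2 = (rng.random(p2.shape) < p2).astype(float)
    p1 = sigmoid(a + z2 @ W.T)      # q(z1 = 1 | z2), factorial over components
    z1 = (rng.random(p1.shape) < p1).astype(float)
    return z1, z2
```

Each half-sweep is a single vectorized operation, which is what makes block Gibbs sampling cheap on modern hardware.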
3 Learning Undirected Posteriors
In this section, we propose a gradient estimator for generic UGM posteriors and discuss how the estimator can be applied to RBM-based posteriors.
The training objective for a probabilistic encoder network $q_{\boldsymbol{\phi}}(\mathbf{z})$, with $q_{\boldsymbol{\phi}}$ being a UGM, can be written as:

$$\mathcal{L}(\boldsymbol{\phi}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\big[f(\mathbf{z})\big] \qquad (6)$$

where $f(\mathbf{z})$ is a differentiable function of $\mathbf{z}$. For simplicity of exposition, we assume that $f$ does not depend on $\boldsymbol{\phi}$ (if it does, the additional term is approximated with Monte Carlo sampling). Using the fixed-point equation in Eq. (3) with the $m$-step kernel $K^m_{\boldsymbol{\phi}}$, we have:

$$\mathcal{L}(\boldsymbol{\phi}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\Big[\mathbb{E}_{K^m_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{z}')}\big[f(\mathbf{z})\big]\Big] \qquad (7)$$
To maximize Eq. (7), we require its gradient:

$$\nabla_{\boldsymbol{\phi}}\mathcal{L} = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\Big[\mathbb{E}_{K^m_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{z}')}\big[f(\mathbf{z})\big]\,\nabla_{\boldsymbol{\phi}}\log q_{\boldsymbol{\phi}}(\mathbf{z}')\Big]}_{A} \;+\; \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\Big[\nabla_{\boldsymbol{\phi}}\,\mathbb{E}_{K^m_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{z}')}\big[f(\mathbf{z})\big]\Big]}_{B} \qquad (8)$$
The term marked as $A$ is written as:

$$A = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\Big[\mathbb{E}_{K^m_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{z}')}\big[f(\mathbf{z})\big]\,\nabla_{\boldsymbol{\phi}}\log q_{\boldsymbol{\phi}}(\mathbf{z}')\Big] \qquad (9)$$

$$\phantom{A} = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\Big[f(\mathbf{z})\,\mathbb{E}_{K^m_{\boldsymbol{\phi}}(\mathbf{z}'|\mathbf{z})}\big[\nabla_{\boldsymbol{\phi}}\log q_{\boldsymbol{\phi}}(\mathbf{z}')\big]\Big] \qquad (10)$$
where we have used the detailed balance condition in the second line. The form of Eq. (10) makes it clear that $A$ goes to zero in expectation as $m \to \infty$, because $K^m_{\boldsymbol{\phi}}(\mathbf{z}'|\mathbf{z}) \to q_{\boldsymbol{\phi}}(\mathbf{z}')$ and $\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\big[\nabla_{\boldsymbol{\phi}}\log q_{\boldsymbol{\phi}}(\mathbf{z}')\big] = 0$. However, $A$ has high variance that does not decrease as $m \to \infty$.

The expectation marked as $B$ in Eq. (8) involves the gradient of an expectation with respect to the MCMC transition kernel. Fortunately, the reparameterization trick can provide a low-variance gradient estimator for this term. Moreover, since $A$ vanishes in expectation, the contribution of $B$ approaches the full gradient as $m \to \infty$.

Given the high variance of $A$, it is beneficial to drop this term from the gradient in Eq. (8) and use the biased but lower-variance estimate $B$. Furthermore, the choice of $m$ determines the computational complexity of the estimator, as well as its bias and variance. Our key observation is that increasing $m$ has little effect on the optimization performance because the decreased bias is accompanied by an increase in variance. This means that $m = 1$ is a good choice with the smallest computational complexity. More specifically, we use the approximation:

$$\nabla_{\boldsymbol{\phi}}\mathcal{L} \approx \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\big[\nabla_{\boldsymbol{\phi}} f(\hat{\mathbf{z}})\big] \qquad (11)$$

where $\hat{\mathbf{z}}$ is a reparameterized sample from $K_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{z}')$.
In principle, MCMC methods such as Metropolis-Hastings or Hamiltonian Monte Carlo (HMC) can be used as the transition kernel in Eq. (11). However, unbiased reparameterized sampling for these methods is challenging. For example, Metropolis-Hastings contains a non-differentiable binary operation in the accept/reject step, and HMC, in addition to the binary step, requires backpropagating through the gradients of the energy function and tuning hyperparameters such as the step size.
These complications can be avoided when Gibbs sampling of $q_{\boldsymbol{\phi}}$ is possible. When $q_{\boldsymbol{\phi}}$ is a UGM defined on a bipartite graph, the Gibbs transition kernel is simply $K(\mathbf{z}|\mathbf{z}') = q_{\boldsymbol{\phi}}(\mathbf{z}_2|\mathbf{z}_1)\, q_{\boldsymbol{\phi}}(\mathbf{z}_1|\mathbf{z}'_2)$ as in Eq. (5). This kernel does not contain an accept/reject step or any hyperparameters. Assuming that the conditionals can be reparameterized, the gradient of the kernel is approximated efficiently using low-variance reparameterized sampling from each conditional. In this case, the gradient estimator for a single Gibbs update is:

$$\nabla_{\boldsymbol{\phi}}\mathcal{L} \approx \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}')}\big[\nabla_{\boldsymbol{\phi}} f(\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2)\big] \qquad (12)$$

where $\hat{\mathbf{z}}_1 \sim q_{\boldsymbol{\phi}}(\mathbf{z}_1|\mathbf{z}'_2)$ and $\hat{\mathbf{z}}_2 \sim q_{\boldsymbol{\phi}}(\mathbf{z}_2|\hat{\mathbf{z}}_1)$ are reparameterized samples. The extension to $m$ Gibbs updates is straightforward.
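For intuition, the single reparameterized Gibbs update of Eq. (12) can be written out for a bivariate Gaussian UGM, where each conditional is itself Gaussian and therefore reparameterizable (a sketch under an assumed quadratic energy; an autodiff framework would supply the actual gradients through the update):

```python
import numpy as np

def reparam_gibbs_update(z2_prev, lam1, lam2, c, eps1, eps2):
    """One reparameterized Gibbs update for the bivariate Gaussian UGM
    with energy E(z1, z2) = 0.5*lam1*z1^2 + 0.5*lam2*z2^2 + c*z1*z2.

    The conditionals are Gaussian: z1 | z2 ~ N(-c*z2/lam1, 1/lam1), and
    symmetrically for z2 | z1.  Writing each conditional sample as
    mean + std * eps makes the update a differentiable function of the
    parameters (lam1, lam2, c), so gradients of f(z1, z2) can be taken
    through the update by automatic differentiation.
    """
    z1 = -c * z2_prev / lam1 + eps1 / np.sqrt(lam1)  # reparameterized z1 | z2'
    z2 = -c * z1 / lam2 + eps2 / np.sqrt(lam2)       # reparameterized z2 | z1
    return z1, z2
```

Averaging the gradient of `f(z1, z2)` over the noise `(eps1, eps2)` and over samples `z2_prev` from the current chain gives the single-update estimator of Eq. (12).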
Lastly, we note that the reparameterized sample $\hat{\mathbf{z}}$ has distribution $q_{\boldsymbol{\phi}}(\mathbf{z})$ if the chain is equilibrated. So, if $f$ has its own parameters $\boldsymbol{\theta}$ (e.g., the parameters of the generative model in VAEs), the same sample can be used to compute an unbiased estimate of $\nabla_{\boldsymbol{\theta}}\,\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}[f_{\boldsymbol{\theta}}(\mathbf{z})]$.

3.1 Toy Example
Here, we assess the efficacy of our approximations for a multivariate Gaussian distribution. We consider learning a two-variable Gaussian distribution depicted in Fig. 3(b). The target distribution $p$ is defined by a quadratic energy function over $(z_1, z_2)$. The approximating distribution $q_{\boldsymbol{\phi}}$ has the same form, and we learn its parameters $\boldsymbol{\phi}$ by minimizing the Kullback-Leibler (KL) divergence $\mathrm{KL}(q_{\boldsymbol{\phi}}\,\|\,p)$. We compare Eq. (12) for different numbers of Gibbs updates $m$ with reparameterized sampling and with REINFORCE. Fig. 3(a) shows the KL divergence during training. Our method significantly outperforms REINFORCE due to the lower variance of its gradients. There is little difference between different $m$ until the KL divergence becomes small. Finally, our method performs worse than direct reparameterized sampling, which is expected due to the bias introduced by neglecting $A$ in Eq. (8).
3.2 Learning Boltzmann Machine Posteriors
In this section, we consider training DVAE##, a DVAE (Rolfe, 2016; Vahdat et al., 2018b, a) with an RBM prior and an RBM posterior. The objective function of DVAE## is given by Eq. (1), where $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ is an RBM whose parameters $\boldsymbol{\phi}$ are generated for each data point using a neural network. To optimize the ELBO, we need the gradients with respect to $\boldsymbol{\phi}$ for each data point $\mathbf{x}$:

$$\nabla_{\boldsymbol{\phi}}\, \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\big[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\big] \qquad (13)$$
For each $\mathbf{x}$, this gradient has the form of Eq. (6) with $f(\mathbf{z}) = \log p(\mathbf{x},\mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$. Although the RBM has a bipartite structure, applying the gradient estimator in Eq. (12) is challenging due to the discrete nature of the latent variables. In order to apply the reparameterization trick, we further relax the binary variables to continuous variables using Gumbel-softmax or piecewise-linear relaxations. (Another option is to use unbiased gradient estimators such as REBAR (Tucker et al., 2017) or RELAX (Grathwohl et al., 2018); however, with these approaches the number of function evaluations increases.) Representing the relaxation of $\hat{\mathbf{z}}_1$ by $\hat{\boldsymbol{\zeta}}_1$ and the relaxation of $\hat{\mathbf{z}}_2$ by $\hat{\boldsymbol{\zeta}}_2$, we define the following gradient estimator:
$$\nabla_{\boldsymbol{\phi}}\mathcal{L} \approx \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}'|\mathbf{x})}\big[\nabla_{\boldsymbol{\phi}} f(\hat{\boldsymbol{\zeta}}_1, \hat{\boldsymbol{\zeta}}_2)\big] \qquad (14)$$
where $\nabla_{\boldsymbol{\phi}} f(\hat{\boldsymbol{\zeta}}_1, \hat{\boldsymbol{\zeta}}_2)$ is computed using automatic differentiation by backpropagating through reparameterized sampling. Thus far, we have only accounted for the dependency of $f$ on $\boldsymbol{\phi}$ through samples (this is known as the pathwise derivative). However, in the case of VAEs and IWAEs, $f$ also depends on $\boldsymbol{\phi}$ through the $\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ contribution. This dependency can be ignored in the case of VAEs as $\mathbb{E}_{q_{\boldsymbol{\phi}}}[\nabla_{\boldsymbol{\phi}} \log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})] = 0$ (Roeder et al., 2017), and it can be removed in IWAEs using doubly reparameterized gradient estimation (Tucker et al., 2018). This enables us to ignore the partition function of $q_{\boldsymbol{\phi}}$ and its gradient during training, as it is not involved in the pathwise derivative (see Appendix C).
For $m > 1$, each Gibbs sampling step can be relaxed. However, as $m$ increases, sampling from the relaxed chain diverges from the corresponding discrete chain, resulting in even more biased gradient estimates. Thus, we use $m = 1$ in our experiments (see Appendix A for larger $m$). Moreover, in Eq. (14), our estimator also requires samples from the RBM posterior in the outer expectation. These samples are obtained by running persistent chains for each training data point. Finally, $f$ may have its own parameters (i.e., the parameters of the generative model $p$). The gradient of $f$ with respect to these parameters can be obtained using either the relaxed samples $\hat{\boldsymbol{\zeta}}_1$ and $\hat{\boldsymbol{\zeta}}_2$, or the discrete samples $\hat{\mathbf{z}}_1$ and $\hat{\mathbf{z}}_2$. We use the former as it results in a single function evaluation per instance in each minibatch parameter update.
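For reference, one common relaxation with this flavor is the binary Concrete (Gumbel-softmax) distribution; a sketch of a relaxed conditional sample (an illustrative stand-in — the paper's experiments use a piecewise-linear relaxation instead):

```python
import numpy as np

def concrete_sample(logits, temperature, rng):
    """Binary Concrete (Gumbel-softmax) relaxation of Bernoulli(sigmoid(logits)).

    Returns zeta in (0, 1); as temperature -> 0 the samples approach
    binary values, while for temperature > 0 the sample is a smooth,
    differentiable function of the logits.
    """
    u = rng.random(np.shape(logits))
    logistic_noise = np.log(u) - np.log1p(-u)  # Logistic(0, 1) noise
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))
```

Substituting such relaxed samples for the binary Gibbs samples in Eq. (12) is what makes the estimator in Eq. (14) differentiable in the RBM parameters.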
Algorithm 1 summarizes our method for training DVAE## with an RBM prior and posterior. The number of Gibbs sweeps used for generating the relaxed samples in Eq. (14) and the number of Gibbs sweeps used for sampling $\mathbf{z}'$ are hyperparameters of the algorithm. The objective function is constructed such that automatic differentiation yields the gradient estimator in Eq. (14) for the posterior parameters, evaluated using relaxed samples for the decoder parameters. For the prior parameters, we use the method introduced in Sec. E of (Vahdat et al., 2018b) to obtain proper gradients.
4 Experiments
We examine our proposed gradient estimator for training undirected posteriors on three tasks: variational autoencoders, importance-weighted autoencoders, and structured prediction, using the binarized MNIST (Salakhutdinov & Murray, 2008) and OMNIGLOT (Lake et al., 2015) datasets. We follow the experimental setup in DVAE# (Vahdat et al., 2018a) with minor modifications outlined below.

4.1 Variational Autoencoders
We train a VAE with a generative model of the form $p(\mathbf{z})\,p(\mathbf{x}|\mathbf{z})$, where $p(\mathbf{z})$ is an RBM and $p(\mathbf{x}|\mathbf{z})$ is a neural network. The conditional $p(\mathbf{x}|\mathbf{z})$ is represented using a fully-connected neural network with two 200-unit hidden layers, tanh activations, and batch normalization, similar to DVAE++, DVAE# (Vahdat et al., 2018b, a) and GumBolt (Khoshaman & Amin, 2018). We evaluate the performance of generative models trained with either an undirected posterior (DVAE##) or a directed posterior.

For DVAE##, $q(\mathbf{z}|\mathbf{x})$ is modeled with a neural network having two tanh hidden layers that predicts the parameters of the RBM (Fig. 7(a)). Training DVAE## is done using Algorithm 1 with a piecewise-linear relaxation (Andriyash et al., 2018) for relaxed Gibbs samples. We follow (Vahdat et al., 2018a) for batch size, learning rate schedule, and KL warm-up parameters. During training, samples from the prior are required for computing the gradient of the prior term. This sampling is done using the QuPA library, which offers population-annealing-based sampling and partition function estimation (using AIS) from within TensorFlow. We also use QuPA for sampling the undirected posteriors and estimating their partition function during evaluation. We explore VAEs with equally-sized RBM prior and posteriors consisting of either 200 (100-100) or 400 (200-200) latent variables. Test set negative log-likelihoods are computed using 4000 importance-weighted samples.

Baselines: We compare the RBM posteriors with directed posteriors where the directed models are factored across groups of latent variables as $q(\mathbf{z}|\mathbf{x}) = \prod_{i=1}^{L} q(\mathbf{z}_i|\mathbf{z}_{<i}, \mathbf{x})$. Each conditional $q(\mathbf{z}_i|\mathbf{z}_{<i}, \mathbf{x})$ is factorial across the components of $\mathbf{z}_i$, and $L$ is the total number of hierarchical layers. A fair comparison between directed and undirected posteriors is challenging because they differ in structure and in the number of parameters. We design a baseline so that the number of parameters and the number of nonlinearities is identical for a directed posterior with a single group ($L = 1$) and an undirected posterior with no pairwise interactions. This is reasonable as both cases reduce to a mean-field posterior.
Following this principle, we initially experimented with the structure introduced in DVAE# (these models are also used in DVAE++ and GumBolt), in which each factor is represented using two tanh nonlinearities (see Fig. 7(b)). However, our experiments indicated that when the number of latent variables is held constant, increasing the number of layers does not necessarily improve the generative performance.
In undirected posteriors, the parameters of the posterior for each training example are predicted using a single shared context feature (Fig. 7(a)). However, in DVAE#, parallel neural networks generate the parameters of the posterior for each level of the hierarchy. Inspired by this observation, we define a new structure for the directed posteriors by introducing a shared (200-dimensional) context feature that is computed using two tanh nonlinearities. This shared feature is fed to each subsequent conditional, each represented with a linear layer (see Fig. 7(c)). In Appendix B, the shared-context structure is compared against the original structure.
We examine three recent methods for training VAEs with directed posteriors with shared context. These baselines are i) DVAE# (Vahdat et al., 2018a), which uses the power-function distribution to relax the objective function, ii) the Concrete relaxation used in GumBolt (Khoshaman & Amin, 2018), and iii) the piecewise-linear (PWL) relaxation (Andriyash et al., 2018). All models are trained using $L = 1$, 2, or 4 hierarchical levels.
Results: The performance of DVAE## is compared against the baselines in Table 1. We make two observations. i) The baselines with $L = 1$ are equivalent to a mean-field posterior, which is also the special case of an undirected posterior with no pairwise terms. Since PWL and DVAE## use the same continuous relaxation, we can compare PWL against DVAE## to see how introducing pairwise interactions improves the quality of the DVAE models. In our experiments, we consistently observe an improvement of roughly 0.5 nat arising from pairwise interactions. ii) DVAE## with undirected posteriors outperforms the directed baselines in most cases, indicating that UGMs form more appropriate posteriors.
[Figure 7: (a) The undirected posterior predicts the RBM parameters from a shared context feature. (b) The DVAE# directed posterior, in which parallel networks produce the logits for each conditional. (c) Our directed posterior differs from DVAE# and predicts the parameters of each conditional using a linear transformation given the shared context feature and the previous latent groups.]

Table 1: Test set negative log-likelihood (mean ± std) for VAEs with directed (DVAE#, Concrete, PWL) and undirected (DVAE##) posteriors.

                       Prior RBM: 100-100                            Prior RBM: 200-200
  L     DVAE#        Concrete     PWL          DVAE##       DVAE#        Concrete     PWL          DVAE##
MNIST
  1     84.97±0.03   84.66±0.04   84.62±0.05   84.09±0.06   83.21±0.03   83.19±0.05   83.22±0.05   82.75±0.05
  2     84.96±0.05   84.71±0.03   84.50±0.07   --           83.13±0.04   83.04±0.02   82.99±0.05   --
  4     84.58±0.02   84.39±0.04   84.07±0.04   --           82.93±0.04   83.14±0.04   82.90±0.03   --
OMNIGLOT
  1     101.64±0.05  101.41±0.06  101.24±0.02  100.69±0.04  99.51±0.03   99.39±0.04   99.32±0.04   98.61±0.08
  2     101.75±0.04  101.39±0.06  101.14±0.05  --           99.40±0.05   99.12±0.07   99.10±0.04   --
  4     101.74±0.07  102.04±0.10  101.14±0.05  --           99.47±0.04   99.97±0.02   99.30±0.03   --
4.1.1 Examining Undirected Posteriors
We examine the posteriors trained with DVAE## on MNIST in terms of the number of modes and the difficulty of sampling and partition function estimation.
Multimodality: To estimate the number of modes, we find a variety of mean-field solutions as follows. Given an RBM, we draw samples using QuPA and use each sample to initialize the construction of a mean-field approximation. Specifically, we initialize the mean parameter of a factorial Bernoulli distribution to the sample value and then iteratively minimize the KL divergence until we converge to a fixed point describing a mode. In this way, the initial population of samples converges to a perhaps smaller number of unique modes.
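The mode-finding iteration can be sketched with the standard RBM mean-field updates (a simplified version; convergence checks and the final KL evaluation are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_mode(m1, m2, a, b, W, n_iters=200):
    """Iterate the mean-field fixed-point updates for an RBM with energy
    E(z1, z2) = -(a^T z1 + b^T z2 + z1^T W z2), starting from mean
    parameters (m1, m2), e.g. a sample drawn from the RBM.

    Each converged fixed point identifies one mode; running this from many
    sampled initializations and counting the unique rounded fixed points
    estimates the number of modes (a sketch of the procedure).
    """
    for _ in range(n_iters):
        m1 = sigmoid(a + W @ m2)    # update means of the first group
        m2 = sigmoid(b + W.T @ m1)  # update means of the second group
    return np.round(m1), np.round(m2)  # rounded fixed point labels the mode
```

Deduplicating the rounded fixed points over a population of initializations yields the mode count reported below.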
Using this method, we observe that the majority of trained posteriors have a single mode. This is in contrast with the prior RBM, which typically has 10 to 20 modes. However, we also observe that the KL from the converged mean-field distributions to the RBM posteriors is typically nonzero. This indicates that, while most RBM posteriors are unimodal, the structure of the mode is not captured completely by a factorial distribution.
The difficulty of sampling and partition function estimation: Fig. 10 visualizes the deviation of log partition function estimates for an RBM posterior and prior for different numbers of AIS temperatures. As shown in the figure, the number of temperatures required to achieve an acceptable precision for the RBM posterior is 10 times smaller than the number required for the RBM prior. The main reason for this difference is that the RBM posteriors have strong linear biases. When annealing starts from a mean-field distribution containing only the biases, AIS requires fewer interpolating temperatures to accurately approximate the partition function. RBM priors, which do not contain strong linear biases, do not benefit from this.
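The idea of annealing only the pairwise term, starting from the factorial "biases-only" distribution, can be sketched as a small AIS estimator (a simplified illustration under an assumed energy convention, not the QuPA implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ais_log_z(a, b, W, n_temps=100, n_chains=50, seed=0):
    """AIS estimate of log Z for an RBM with energy
    E(z1, z2) = -(a^T z1 + b^T z2 + z1^T W z2), annealing only the pairwise
    term: p_beta(z1, z2) ~ exp(a^T z1 + b^T z2 + beta * z1^T W z2).

    The beta=0 base distribution keeps only the linear biases, is factorial,
    and is sampled exactly; with strong biases few intermediate temperatures
    are needed, which is the effect described above.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = len(a), len(b)
    log_z0 = np.sum(np.logaddexp(0.0, a)) + np.sum(np.logaddexp(0.0, b))
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    # exact samples from the factorial base distribution
    z1 = (rng.random((n_chains, n1)) < sigmoid(a)).astype(float)
    z2 = (rng.random((n_chains, n2)) < sigmoid(b)).astype(float)
    log_w = np.zeros(n_chains)
    for t in range(1, n_temps + 1):
        # incremental importance weight between successive temperatures
        log_w += (betas[t] - betas[t - 1]) * np.einsum('ci,ij,cj->c', z1, W, z2)
        # one block Gibbs sweep at inverse temperature betas[t]
        z2 = (rng.random((n_chains, n2)) < sigmoid(b + betas[t] * z1 @ W)).astype(float)
        z1 = (rng.random((n_chains, n1)) < sigmoid(a + betas[t] * z2 @ W.T)).astype(float)
    m = log_w.max()
    return log_z0 + m + np.log(np.mean(np.exp(log_w - m)))
```

With `W = 0` the estimate is exact (it reduces to the analytic factorial partition function), and stronger pairwise terms require more intermediate temperatures, matching the observation above.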
4.2 Importance Weighted Autoencoders
In this section, we examine the generative model introduced in the previous section, but trained using the IW bound. For comparison, we only use the PWL baseline as it achieves the best performance among the directed-posterior baselines in Table 1. For training, we use the same hyperparameters as in the previous section, with an additional hyperparameter $K$ representing the number of IW samples.

To sidestep computation of the gradient of the partition function for undirected posteriors in the IW bound, we use the pathwise gradient estimator introduced in (Tucker et al., 2018). However, we observe that an annealing scheme in IWAEs (similar to KL warm-up in VAEs) improves the performance of the generative model by preventing the latent variables from turning off. In Appendix C, we introduce this annealing mechanism for training IWAEs and show how the pathwise gradient can be computed while annealing the objective function.
The experimental results for training DVAE## are compared against directed posteriors in Table 2. As can be seen, DVAE## achieves a comparable performance on MNIST, but outperforms the directed posteriors on OMNIGLOT.
Table 2: Test set negative log-likelihood (mean ± std) with the IW bound for different numbers of IW samples $K$.

                       Prior RBM: 100-100                             Prior RBM: 200-200
  K     PWL           PWL           PWL           DVAE##        PWL          PWL          PWL          DVAE##
MNIST
  1     84.60±0.04    84.49±0.03    84.04±0.03    84.06±0.03    83.27±0.06   82.99±0.04   82.89±0.03   82.76±0.05
  5     84.15±0.06    83.88±0.02    83.53±0.06    83.56±0.02    83.24±0.06   82.90±0.06   82.65±0.03   82.76±0.06
  25    83.96±0.05    83.70±0.03    83.38±0.02    83.44±0.04    83.38±0.04   83.05±0.03   82.84±0.02   82.97±0.11
OMNIGLOT
  1     101.21±0.06   101.14±0.07   101.12±0.03   100.68±0.05   99.32±0.02   99.11±0.02   99.28±0.09   98.61±0.06
  5     100.72±0.05   100.58±0.05   100.51±0.04   100.16±0.04   99.19±0.05   98.69±0.09   98.78±0.02   98.34±0.03
  25    100.58±0.01   100.48±0.05   100.37±0.03   100.02±0.04   99.10±0.03   98.70±0.07   98.86±0.06   98.34±0.04
4.3 Structured Output Prediction
Structured prediction is a form of conditional likelihood estimation (Sohn et al., 2015) concerned with modeling the distribution of a high-dimensional output given an input. Here, we predict the distribution of the bottom half of an image, $\mathbf{x}_b$, given the top half, $\mathbf{x}_t$. For this, we follow the IW objective function proposed by Raiko et al. (2015):

$$\mathbb{E}_{\mathbf{z}^{(k)} \sim q(\mathbf{z}|\mathbf{x}_t)}\Big[\log \frac{1}{K}\sum_{k=1}^{K} p(\mathbf{x}_b|\mathbf{z}^{(k)})\Big] + \beta\, H\big(q(\mathbf{z}|\mathbf{x}_t)\big)$$

where we have added an entropy term $H$ to prevent the model from overfitting the training data. This expectation (without the entropy term) is identical to the IW bound in Eq. (2) with the prior and the approximate posterior both set to $q(\mathbf{z}|\mathbf{x}_t)$, and it can thus be considered a lower bound on $\log p(\mathbf{x}_b|\mathbf{x}_t)$. The scalar $\beta$ is annealed during training from 1 to a small value.
The experimental results for the structured prediction problem are reported in Table 3. Here, the latent space is limited to 200 binary variables, and $q(\mathbf{z}|\mathbf{x}_t)$ and $p(\mathbf{x}_b|\mathbf{z})$ are modeled by fully-connected networks with one 200-unit hidden layer to limit overfitting. Performance is estimated in terms of the average log-conditional $\log p(\mathbf{x}_b|\mathbf{x}_t)$ measured using 4000 IW samples. Undirected posteriors outperform the directed models by several nats on MNIST and OMNIGLOT.
Table 3: Structured prediction performance (average log-conditional, mean ± std) for different numbers of IW samples $K$.

        RBM Size: 100-100
  K     PWL          PWL          PWL          DVAE##
MNIST
  1     60.82±0.17   59.54±0.12   60.22±0.12   57.13±0.18
  5     52.38±0.03   52.27±0.16   52.67±0.05   49.25±0.07
  25    48.30±0.08   48.41±0.05   48.61±0.11   45.75±0.10
OMNIGLOT
  1     67.06±0.04   67.11±0.11   67.35±0.08   63.74±0.08
  5     59.00±0.03   59.28±0.08   59.34±0.10   57.53±0.08
  25    54.79±0.04   54.83±0.04   54.88±0.06   54.32±0.04
5 Conclusions and Future Directions
We have introduced a gradient estimator for stochastic optimization with undirected distributions and showed that VAEs and IWAEs with RBM posteriors can outperform similar models having directed posteriors. These encouraging results for UGMs over discrete variables suggest a number of promising research directions.
Exponential Family Harmoniums: The methods we have outlined also apply to UGMs defined over continuous variables (as in Fig. 3). Exponential family harmoniums (EFHs) (Welling et al., 2005) generalize RBMs and are composed of two disjoint groups of exponential-family random variables whose sufficient statistics are tied to each other with a quadratic interaction. EFHs retain the factorial structure of the conditionals $q(\mathbf{z}_1|\mathbf{z}_2)$ and $q(\mathbf{z}_2|\mathbf{z}_1)$, so we can easily backpropagate through Gibbs updates. However, the exponential-family generalization allows for the mixing of different latent variable types and distributions.
Auxiliary Random Variables: An appealing property of RBMs (and EFHs) is that either group of variables can be analytically marginalized out. This property can be exploited to form more expressive approximate posteriors. Consider a generative model $p(\mathbf{x}, \mathbf{z})$. To approximate the posterior over $\mathbf{z}$, we augment the latent space with auxiliary variables $\mathbf{a}$ and form a UGM $q(\mathbf{z}, \mathbf{a}|\mathbf{x})$ over the joint space. In this case, the marginal approximate posterior $q(\mathbf{z}|\mathbf{x})$ has more expressive power. Sampling from $q(\mathbf{z}|\mathbf{x})$ can be done by sampling from the joint, and our gradient estimator can be used for training the parameters of the UGM. The VAE objective can be optimized easily as the log-marginal has an analytic expression up to the normalization constant given a sample from the posterior.
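The analytic marginalization of one RBM group mentioned above amounts to a sum of softplus terms; a small sketch under the same energy convention assumed earlier, $E(\mathbf{z}_1,\mathbf{z}_2) = -(\mathbf{a}^T\mathbf{z}_1 + \mathbf{b}^T\mathbf{z}_2 + \mathbf{z}_1^T \mathbf{W} \mathbf{z}_2)$:

```python
import numpy as np

def log_marginal_unnorm(z1, a, b, W):
    """Unnormalized log-marginal of an RBM over its first group.

    Summing exp(a^T z1 + b^T z2 + z1^T W z2) over all binary z2 gives
    a^T z1 + sum_j softplus(b_j + (W^T z1)_j), up to the constant -log Z,
    since each z2_j contributes an independent factor (1 + exp(...)).
    """
    return z1 @ a + np.sum(np.logaddexp(0.0, b + z1 @ W))
```

This is the quantity needed to evaluate the log-marginal posterior (up to its normalization constant) given a joint sample.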
Combining DGMs and UGMs: In practice, VAEs trained with DGM posteriors have shown promising results. However, each factor in DGMs is typically assumed to be a product of independent distributions. We can build more powerful DGMs by modeling each conditional factor using a UGM.
Sampling and Computing the Partition Function: A challenge when using continuous-variable UGMs as posteriors is the requirement for sampling and partition function estimation during evaluation if the test data log-likelihood is estimated using the IW bound. However, we note that approximate posteriors are only required for training VAEs. At evaluation, we can sidestep the sampling and partition function estimation challenges using techniques such as AIS (Wu et al., 2017) starting from the prior distribution when the latent variables are continuous.
References
 Ackley et al. (1985) Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
 Andriyash et al. (2018) Andriyash, E., Vahdat, A., and Macready, W. G. Improved gradient-based optimization over discrete distributions. arXiv preprint arXiv:1810.00116, 2018.
 Bornschein et al. (2016) Bornschein, J., Shabanian, S., Fischer, A., and Bengio, Y. Bidirectional Helmholtz machines. In International Conference on Machine Learning, pp. 2511–2519, 2016.
 Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Caterini et al. (2018) Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational autoencoder. In Neural Information Processing Systems, 2018.
 Chen et al. (2016) Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
 Chen et al. (2018) Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning, 2018.
 Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Grathwohl et al. (2018) Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations (ICLR), 2018.
 Graves (2013) Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Gregor et al. (2013) Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.

 Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D., and Wierstra, D. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471, 2015.
 Gu et al. (2015) Gu, S., Levine, S., Sutskever, I., and Mnih, A. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.
 Gulrajani et al. (2016) Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 Hinton et al. (1995) Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Hoffman (2017) Hoffman, M. D. Learning deep latent Gaussian models with Markov chain Monte Carlo. In International Conference on Machine Learning, 2017.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
 Khoshaman & Amin (2018) Khoshaman, A. H. and Amin, M. H. GumBolt: Extending Gumbel trick to Boltzmann priors. In Neural Information Processing Systems, 2018.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10236–10245. 2018.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In The International Conference on Learning Representations (ICLR), 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Li & Turner (2016) Li, Y. and Turner, R. E. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pp. 1073–1081, 2016.
 Li et al. (2017) Li, Y., Turner, R. E., and Liu, Q. Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343, 2017.
 Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
 Mnih & Rezende (2016) Mnih, A. and Rezende, D. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pp. 2188–2196, 2016.
 QuPA. Quadrant population annealing library. https://try.quadrant.ai/qupa. Accessed: 2019-01-01.
 Raiko et al. (2015) Raiko, T., Berglund, M., Alain, G., and Dinh, L. Techniques for learning binary stochastic feedforward neural networks. In International Conference on Learning Representations (ICLR), 2015.
 Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
 Roeder et al. (2017) Roeder, G., Wu, Y., and Duvenaud, D. K. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pp. 6925–6934, 2017.
 Rolfe (2016) Rolfe, J. T. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.

 Salakhutdinov & Murray (2008) Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872–879. ACM, 2008.
 Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
 Salimans et al. (2017) Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
 Sohn et al. (2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, 2015.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746, 2016.
 Tucker et al. (2017) Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2624–2633, 2017.
 Tucker et al. (2018) Tucker, G., Lawson, D., Gu, S., and Maddison, C. J. Doubly reparameterized gradient estimators for Monte Carlo objectives. arXiv preprint arXiv:1810.04152, 2018.
 Vahdat et al. (2018a) Vahdat, A., Andriyash, E., and Macready, W. G. DVAE#: Discrete variational autoencoders with relaxed Boltzmann priors. In Neural Information Processing Systems, 2018a.
 Vahdat et al. (2018b) Vahdat, A., Macready, W. G., Bian, Z., Khoshaman, A., and Andriyash, E. DVAE++: Discrete variational autoencoders with overlapping transformations. In International Conference on Machine Learning (ICML), 2018b.
 Van Den Oord et al. (2016) Van Den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, pp. 1747–1756. JMLR.org, 2016.
 Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML11), pp. 681–688, 2011.
 Welling et al. (2005) Welling, M., Rosen-Zvi, M., and Hinton, G. E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, pp. 1481–1488, 2005.

 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.
 Wolf et al. (2016) Wolf, C., Karl, M., and van der Smagt, P. Variational inference with Hamiltonian Monte Carlo. arXiv preprint arXiv:1609.08203, 2016.
 Wu et al. (2017) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the quantitative analysis of decoder-based generative models. In The International Conference on Learning Representations (ICLR), 2017.
Appendix A Toy Example: RBM Posterior
In this section, we examine the dependence of the gradient estimator on the number of MCMC updates using a toy example. Here, a small RBM is learned to minimize the KL divergence to a mixture of Bernoulli distributions:
(15)  min_θ KL( q_θ(z) ‖ p(z) )
(16)  p(z) = Σ_{k=1}^{8} w_k Π_i Bern(z_i; μ_{k,i})
We choose 8 mixture components with weights chosen uniformly and Bernoulli components of small, fixed variance. The parameters are trained with the Adam optimizer (Kingma & Ba, 2014) for 10,000 iterations. We show the results for several values of the number of MCMC updates in Fig. 11. We note that the effectiveness of the optimization does not depend on this number, indicating that the bias reduction obtained by increasing it is offset by increased variance.
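A miniature version of this experiment can be reproduced with exact enumeration in place of MCMC. The sketch below uses our own sizes, target, and plain gradient descent instead of Adam; it fits a tiny RBM to a Bernoulli mixture by descending KL(q_θ ‖ p) computed exactly over all states.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_rbm_to_mixture(n_iter=5000, lr=0.05, seed=0):
    """Fit a tiny RBM (4 visible, 4 auxiliary units) to a 3-component
    mixture of Bernoulli products by exact enumeration over all 2^4 states
    and plain gradient descent on KL(q_theta || p_target).  All sizes, the
    target, and the optimizer are our stand-ins for the paper's setup."""
    rng = np.random.default_rng(seed)
    nv, nh = 4, 4
    # all 16 visible configurations
    V = np.array([[(s >> i) & 1 for i in range(nv)] for s in range(2 ** nv)], float)
    # target: mixture of Bernoulli products with means mu and weights w
    w = np.array([0.5, 0.3, 0.2])
    mu = np.array([[.9, .9, .1, .1], [.1, .9, .9, .1], [.1, .1, .9, .9]])
    comp = np.prod(mu[None] ** V[:, None] * (1 - mu[None]) ** (1 - V[:, None]), axis=2)
    p = comp @ w
    a = 0.01 * rng.normal(size=nv)
    b = 0.01 * rng.normal(size=nh)
    W = 0.01 * rng.normal(size=(nv, nh))
    kls = []
    for _ in range(n_iter):
        pre = b + V @ W                                 # (16, nh)
        log_qt = V @ a + np.logaddexp(0.0, pre).sum(1)  # analytic marginal (unnorm.)
        m = log_qt.max()
        log_q = log_qt - (m + np.log(np.exp(log_qt - m).sum()))
        q = np.exp(log_q)
        kls.append(float(np.sum(q * (log_q - np.log(p)))))
        # exact KL gradient: q-weighted, centered log-ratio times sufficient stats
        s = log_q - np.log(p)
        cs = q * (s - np.sum(q * s))
        sig = sigmoid(pre)
        a -= lr * (cs @ V)
        b -= lr * (cs @ sig)
        W -= lr * (V.T @ (cs[:, None] * sig))
    return kls
```

With exact gradients the KL decreases steadily; this gives a deterministic baseline against which the bias/variance trade-off of MCMC-based estimators can be judged.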
Appendix B Examining the Shared Context
For the directed posterior baselines, we initially experimented with the structure introduced in DVAE#, in which each factor is represented using two tanh nonlinearities, as shown in Fig. 7(b). However, initial experiments indicated that when the number of latent variables is held constant, increasing the number of subgroups does not improve generative performance.
For undirected posteriors, the parameters of the posterior for each training example are predicted using a single shared context feature (Fig. 7(a)). Inspired by this observation, we define a new structure for directed posteriors by introducing a shared context feature computed using two tanh nonlinearities. This shared feature is fed to each subsequent conditional, each represented with a linear layer (see Fig. 7(c)).
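The shared-context structure can be sketched as a forward pass. All layer sizes and parameter names below are our illustrative choices, not the paper's architecture:

```python
import numpy as np

def hierarchical_posterior_logits(x, params, n_groups, rng):
    """Sketch of the shared-context directed posterior: a single context
    feature is computed from the input with two tanh layers, and each group
    of binary latent variables is then conditioned on the context and on the
    previously sampled groups through a purely linear layer."""
    W1, W2, lin = params
    ctx = np.tanh(np.tanh(x @ W1) @ W2)       # shared context feature
    z_prev = np.zeros(0)
    samples, logits_all = [], []
    for g in range(n_groups):
        A, B, c = lin[g]                       # linear conditional for group g
        logits = ctx @ A + c
        if z_prev.size:                        # condition on earlier groups
            logits = logits + z_prev @ B
        probs = 1.0 / (1.0 + np.exp(-logits))
        z = (rng.uniform(size=logits.shape) < probs).astype(float)
        samples.append(z)
        logits_all.append(logits)
        z_prev = np.concatenate([z_prev, z])
    return np.concatenate(samples), logits_all
```

The key design point is that the (expensive) nonlinear computation is shared across all groups, while each conditional stays linear, mirroring how undirected posteriors reuse one context feature.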
In Table 4, we compare the original structure used in DVAE# to the structure with a shared context. In almost all cases, the shared context feature improves the performance of the original structure. Moreover, increasing the number of hierarchical layers often improves the performance of the new structure.
Note that in Table 4, we report the average negative log-likelihood measured on the test set over 5 runs. In the single-layer case, the two structures are identical and achieve statistically similar performance.
Table 4: Average test negative log-likelihood (5 runs) without (✗) and with (✓) the shared context feature.

Prior RBM size: 100×100
                   DVAE#           Concrete           PWL
       #layers    ✗       ✓       ✗       ✓       ✗       ✓
MNIST     1     84.95   84.97   84.65   84.66   84.57   84.62
          2     84.81   84.96   84.75   84.71   84.70   84.50
          4     84.61   84.58   84.90   84.39   85.00   84.07
OMNI.     1    101.69  101.64  101.41  101.41  101.21  101.24
          2    101.84  101.75  102.45  101.39  101.64  101.14
          4    101.93  101.74  102.76  102.04  101.97  101.14

Prior RBM size: 200×200
                   DVAE#           Concrete           PWL
       #layers    ✗       ✓       ✗       ✓       ✗       ✓
MNIST     1     83.26   83.21   83.21   83.19   83.23   83.22
          2     83.26   83.13   83.24   83.04   83.38   82.99
          4     83.13   82.93   83.69   83.14   83.85   82.90
OMNI.     1     99.53   99.51   99.39   99.39   99.37   99.32
          2     99.66   99.40  100.52   99.12  100.07   99.10
          4     99.63   99.47  100.99   99.97  100.45   99.30
Appendix C Annealing the Importance Weighted Bound
KL annealing (Sønderby et al., 2016) is used to prevent latent variables from turning off during VAE training. A scalar β is introduced that weights the KL contribution to the ELBO and is annealed from zero to one during training:
(17)  L_β(x) = E_{q(z|x)}[ log p(x|z) ] − β KL( q(z|x) ‖ p(z) )
Since the KL contribution is small early in training, the VAE initially optimizes the reconstruction term, which prevents the approximate posterior from collapsing to the prior.
Here, we apply the same approach to the importance weighted bound by rewriting the IW bound as:
(18)  L_K^β(x) = E_{z^{(1)},…,z^{(K)} ∼ q(z|x)} [ log (1/K) Σ_{k=1}^{K} w̃_k ]
where w̃_k = p(x|z^{(k)}) ( p(z^{(k)}) / q(z^{(k)}|x) )^β. Similar to the KL annealing idea, we anneal the scalar β during training from zero to one. When β is small, the IW bound emphasizes reconstruction of the data, which inhibits the latent variables from turning off. When K = 1, Eq. (18) reduces to Eq. (17).
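Under our reading of the annealed bound, a single Monte Carlo evaluation with K samples is log (1/K) Σ_k p(x|z_k) (p(z_k)/q(z_k|x))^β. The sketch below (toy Gaussian densities; all names are our own) computes it stably in log space and makes the K = 1 reduction easy to verify numerically:

```python
import numpy as np

def annealed_iw_bound_term(x, z, beta, log_px_given_z, log_pz, log_qz):
    """One Monte Carlo evaluation of the annealed IW objective for K
    samples z[0..K-1]:
        log (1/K) sum_k p(x|z_k) * (p(z_k)/q(z_k|x))**beta
    computed via log-sum-exp.  Densities are toy stand-ins, not the
    paper's models."""
    log_w = log_px_given_z(x, z) + beta * (log_pz(z) - log_qz(z, x))
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))
```

With K = 1 this collapses to log p(x|z) + β (log p(z) − log q(z|x)), i.e., the integrand of the KL-annealed ELBO.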
The gradient of Eq. (18) with respect to θ, the parameters of the approximate posterior q(z|x), and ψ, the parameters of the generative model, is:
(19)  ∇_{θ,ψ} L_K^β(x) = E [ Σ_{k=1}^{K} ŵ_k ∇_{θ,ψ} log w̃_k ]
where ŵ_k = w̃_k / Σ_{j=1}^{K} w̃_j normalizes the importance weights w̃_k appearing in the IW bound. With reparameterized samples z^{(k)}(θ, ε), the gradient with respect to θ is evaluated as:
(20)  ∇_θ log w̃_k = ∇_θ log w̃_k |_{z^{(k)} fixed} + ( ∂z^{(k)} / ∂θ )ᵀ ∇_{z^{(k)}} log w̃_k
The second term on the right-hand side of Eq. (20) is the pathwise gradient, which can be computed easily using our biased gradient estimator. This term has no dependence on the partition function, since ∇_z log q(z|x) = −∇_z E(z, x) and therefore depends only on the energy function.
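This partition-function independence is easy to verify numerically: for any normalized density q(z) ∝ exp(−E(z)), the z-gradient of log q equals that of −E, because the log-partition term is constant in z. A toy 1-D check (the energy function and grid are our own choices):

```python
import numpy as np

def grad_log_density_fd(log_rho, z, eps=1e-5):
    """Central finite-difference d/dz of a scalar log-density."""
    return (log_rho(z + eps) - log_rho(z - eps)) / (2 * eps)

def make_normalized(energy, grid):
    """Normalize exp(-energy) by a simple Riemann sum on a 1-D grid,
    returning the normalized log-density log q(z) = -E(z) - log Z."""
    dz = grid[1] - grid[0]
    log_z = np.log(np.sum(np.exp(-energy(grid))) * dz)
    return lambda z: -energy(z) - log_z
```

Because log Z cancels in the finite difference, the gradient of the normalized log-density matches the gradient of the unnormalized one exactly, which is what makes the pathwise term cheap for UGM posteriors.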