Learning Undirected Posteriors by Backpropagation through MCMC Updates

01/11/2019, by Arash Vahdat, et al.

The representation of the posterior is a critical aspect of effective variational autoencoders (VAEs). Poor choices for the posterior have a detrimental impact on the generative performance of VAEs due to the mismatch with the true posterior. We extend the class of posterior models that may be learned by using undirected graphical models. We develop an efficient method to train undirected posteriors by showing that the gradient of the training objective with respect to the parameters of the undirected posterior can be computed by backpropagation through Markov chain Monte Carlo updates. We apply these gradient estimators for training discrete VAEs with Boltzmann machine posteriors and demonstrate that undirected models outperform previous results obtained using directed graphical models as posteriors.


1 Introduction

Training of likelihood-based deep generative models has advanced rapidly in recent years. These advances have been enabled by amortized inference (Hinton et al., 1995; Mnih & Gregor, 2014; Gregor et al., 2013), which scales up the training of variational models, the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014), which provides low-variance gradient estimates, and increasingly expressive neural networks. Combinations of these techniques have resulted in many different forms of variational autoencoders (VAEs) (Kingma et al., 2016; Chen et al., 2016).

It is also widely recognized that flexible approximate posterior distributions improve the generative quality of variational autoencoders (VAEs) by more faithfully modeling true posterior distributions. More accurate posterior models have been realized by reducing the gap between true and approximate posteriors with tighter bounds (Burda et al., 2015), and using auto-regressive architectures (Gregor et al., 2013, 2015) or flow-based inference (Rezende & Mohamed, 2015).

A complementary (but unexplored) direction for further improving the richness of posterior distributions is available through the use of undirected graphical models (UGMs). This is compelling as UGMs can succinctly capture complex multimodal relationships between variables. However, there are three challenges when training UGMs as approximate posteriors: i) There is no known low-variance path-wise gradient estimator (like the reparameterization trick) for learning the parameters of UGMs. ii) Sampling from general UGMs is intractable and approximate sampling is computationally expensive. iii) Evaluating the probability of a sample under a UGM can require an intractable partition function. The latter two costs (ii and iii) are particularly acute when UGMs are used as posterior approximators because the number of distinct posterior UGMs grows with the size of the dataset (one per data point).

However, we note that posterior UGMs conditioned on observed data points tend to be simple, as there are usually only a small number of modes in the latent space that explain an observation. We expect, then, that sampling and partition function estimation (challenges ii and iii) can be solved efficiently using parallelizable Markov chain Monte Carlo (MCMC) methods. In fact, we observe that UGM posteriors trained in a VAE model tend to be unimodal, but with a mode structure that is not necessarily well captured by mean-field posteriors. Nevertheless, in this case, MCMC-based sampling methods quickly equilibrate.

To address challenge i), we estimate the gradient by reparameterized sampling within the MCMC updates. Although an infinite number of MCMC updates is theoretically required to obtain an unbiased gradient estimator, we observe that a single MCMC update provides a low-variance but biased gradient estimate that is usually sufficient for training posterior UGMs.

UGMs in the form of Boltzmann machines (Ackley et al., 1985) have recently been shown to be effective as priors for VAEs allowing for discrete versions of VAEs (Rolfe, 2016; Vahdat et al., 2018b, a). However, previous work in this area has relied on directed graphical models (DGMs) for posterior approximation. In this paper, we replace the DGM posteriors of discrete VAEs (DVAEs) with Boltzmann machine UGMs and show that these posteriors provide a generative performance comparable to or better than previous DVAEs. We denote this model as DVAE##, where ## indicates the use of UGMs in both the prior and the posterior.

We begin by summarizing related work on developing expressive posterior models including models trained nonvariationally. Nonvariational models employ MCMC to directly sample posteriors and differ from our variational approach which uses MCMC to sample from an amortized undirected posterior. Sec. 2 provides the necessary background on variational learning and MCMC methods to allow for the development in Sec. 3 of a gradient estimator for undirected posterior models. We provide examples for both Gaussian (Sec. 3.1) and Boltzmann machine (Sec. 3.2) UGMs. Experimental results are provided in Sec. 4 on VAEs (Sec. 4.1), importance-weighted VAEs (Sec. 4.2), and structured prediction (Sec. 4.3), where we observe consistent improvement using UGMs. We conclude in Sec. 5 with a list of future work.

1.1 Related Work

Inference gap reduction in VAEs: Previous work on reducing the gap between true and approximate posteriors can be grouped into three categories: i) Training objectives that replace the variational bound with tighter bounds on the data log-likelihood (Burda et al., 2015; Li & Turner, 2016; Bornschein et al., 2016). ii) Autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013) that use DGMs to form more flexible distributions (Gregor et al., 2013; Gulrajani et al., 2016; Van Den Oord et al., 2016; Salimans et al., 2017; Chen et al., 2018). iii) Flow-based models (Rezende & Mohamed, 2015) that use a class of invertible neural networks with simple Jacobians to map the input space to a latent space in which dependencies can be disentangled (Kingma et al., 2016; Kingma & Dhariwal, 2018; Dinh et al., 2014, 2016). To the best of our knowledge, UGMs have not been used as approximate posteriors.

Gradient estimation in latent variable models: REINFORCE (Williams, 1992) is the most generic approach for computing gradients through approximate posteriors. However, in practice, this estimator suffers from high variance and must be augmented with variance reduction techniques. For a large class of continuous latent variable models, the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014) provides lower-variance gradient estimates. Reparameterization does not apply to discrete latent variables, and recent methods for discrete variables have focused on REINFORCE with control variates (Mnih & Gregor, 2014; Gu et al., 2015; Mnih & Rezende, 2016; Tucker et al., 2017; Grathwohl et al., 2018) or continuous relaxations (Maddison et al., 2016; Jang et al., 2016; Rolfe, 2016; Vahdat et al., 2018b, a). See (Andriyash et al., 2018) for a review of recent techniques.

Variational Inference and MCMC: A common alternative to variational inference is to use MCMC to sample directly from posteriors in generative models (Welling & Teh, 2011; Salimans et al., 2015; Wolf et al., 2016; Hoffman, 2017; Li et al., 2017; Caterini et al., 2018). Our method differs from these techniques, as we use MCMC to sample from an amortized undirected posterior.

2 Background

In this section we provide background for the topics discussed in this paper.

Undirected graphical models: A UGM represents the joint probability distribution of a set of random variables $\mathbf{z}$ as $p_\theta(\mathbf{z}) = \exp(-E_\theta(\mathbf{z}))/Z_\theta$, where $E_\theta$ is a $\theta$-parameterized energy function defined by an undirected graphical model, and $Z_\theta = \sum_{\mathbf{z}} \exp(-E_\theta(\mathbf{z}))$ (with the sum replaced by an integral for continuous variables) is the partition function. For example, the multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and precision matrix $\boldsymbol{\Lambda}$ is a UGM with energy function $E(\mathbf{z}) = \frac{1}{2}(\mathbf{z}-\boldsymbol{\mu})^T \boldsymbol{\Lambda} (\mathbf{z}-\boldsymbol{\mu})$ and $Z = (2\pi)^{d/2} |\boldsymbol{\Lambda}|^{-1/2}$.

UGMs with quadratic energy functions defined over binary variables $\mathbf{z} \in \{0,1\}^n$ are called Boltzmann machines (BMs). We use restricted Boltzmann machines (RBMs) defined on a bipartite graph over $\mathbf{z} = (\mathbf{z}_1, \mathbf{z}_2)$. An RBM has energy function $E(\mathbf{z}_1, \mathbf{z}_2) = \mathbf{a}^T\mathbf{z}_1 + \mathbf{b}^T\mathbf{z}_2 + \mathbf{z}_1^T \mathbf{W} \mathbf{z}_2$, where $\mathbf{a}$ and $\mathbf{b}$ indicate linear biases, $\mathbf{W}$ represents pairwise interactions, and $\mathbf{z}_1 \in \{0,1\}^{n_1}$, $\mathbf{z}_2 \in \{0,1\}^{n_2}$. RBMs allow for parallel Gibbs sampling updates as $p(\mathbf{z}_1|\mathbf{z}_2)$ and $p(\mathbf{z}_2|\mathbf{z}_1)$ are factorial in the components of $\mathbf{z}_1$ and $\mathbf{z}_2$.
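As a concrete illustration of these definitions, below is a minimal sketch (not the authors' code) of the RBM energy and its factorial conditionals under the convention $p(\mathbf{z}_1,\mathbf{z}_2) \propto \exp(-E(\mathbf{z}_1,\mathbf{z}_2))$; the function names, shapes, and use of PyTorch are illustrative choices.

```python
# Minimal sketch of an RBM energy and its factorial conditionals.
# Convention: p(z1, z2) is proportional to exp(-E(z1, z2)) with
# E(z1, z2) = a^T z1 + b^T z2 + z1^T W z2.
import torch

def rbm_energy(z1, z2, a, b, W):
    """Energy of a binary configuration (z1, z2)."""
    return a @ z1 + b @ z2 + z1 @ W @ z2

def cond_probs_z1_given_z2(z2, a, W):
    """p(z1_i = 1 | z2): factorial, so all components are computed in parallel."""
    return torch.sigmoid(-(a + W @ z2))

def cond_probs_z2_given_z1(z1, b, W):
    """p(z2_j = 1 | z1): factorial in the components of z2."""
    return torch.sigmoid(-(b + W.t() @ z1))
```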

Variational autoencoders: A VAE is a generative model factored as $p_\theta(\mathbf{x},\mathbf{z}) = p_\theta(\mathbf{z})\, p_\theta(\mathbf{x}|\mathbf{z})$, where $p_\theta(\mathbf{z})$ is a prior distribution over latent variables and $p_\theta(\mathbf{x}|\mathbf{z})$ is a probabilistic decoder representing the conditional distribution over data variables $\mathbf{x}$ given $\mathbf{z}$. Since maximizing the marginal log-likelihood $\log p_\theta(\mathbf{x})$ is intractable in general, VAEs are trained by maximizing a variational lower bound (ELBO) on $\log p_\theta(\mathbf{x})$:

$\log p_\theta(\mathbf{x}) \;\geq\; \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z})\right), \qquad (1)$

where $q_\phi(\mathbf{z}|\mathbf{x})$ is a probabilistic encoder that approximates the posterior over latent variables given a data point. For continuous latent variables, the ELBO is typically optimized using the reparameterization trick: the expectation with respect to $q_\phi(\mathbf{z}|\mathbf{x})$ is replaced with an expectation over a simple continuous base distribution, and samples from the base distribution are transformed under a differentiable mapping to samples from $q_\phi(\mathbf{z}|\mathbf{x})$.

When $q_\phi(\mathbf{z}|\mathbf{x})$ is factorized using a directed graphical model (e.g., $q_\phi(\mathbf{z}|\mathbf{x}) = \prod_j q_\phi(\mathbf{z}_j|\mathbf{z}_{<j},\mathbf{x})$), the reparameterization trick is applied to each conditional and full joint samples are assembled using ancestral sampling.
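For concreteness, here is a minimal sketch of reparameterized sampling for a factorial Gaussian and of ancestral sampling for a two-group directed posterior; `cond1_net` and `cond2_net` are hypothetical encoder networks, not components of the paper.

```python
# Sketch of the reparameterization trick for a factorial Gaussian
# q(z|x) = N(mu(x), diag(sigma(x)^2)); the sample is a differentiable
# function of (mu, log_sigma), so gradients flow through it.
import torch

def reparameterized_gaussian_sample(mu, log_sigma):
    eps = torch.randn_like(mu)              # sample from the base distribution N(0, I)
    return mu + torch.exp(log_sigma) * eps  # differentiable transform to q(z|x)

def ancestral_sample(x, cond1_net, cond2_net):
    """Ancestral sampling for q(z1|x) q(z2|z1, x); each conditional is
    reparameterized in turn. cond1_net / cond2_net are hypothetical encoders
    returning (mu, log_sigma)."""
    z1 = reparameterized_gaussian_sample(*cond1_net(x))
    z2 = reparameterized_gaussian_sample(*cond2_net(x, z1))
    return z1, z2
```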

This approach cannot be directly applied to binary latent variables because there is no differentiable mapping from samples of a continuous base distribution to a binary distribution. However, several continuous relaxations have recently been proposed to mitigate this problem. These approaches can be grouped into two categories. In the first category (Maddison et al., 2016; Vahdat et al., 2018b, a), the binary random variables are relaxed to continuous random variables whose distributions, in the limit of low temperature, approach the original binary distributions. Instead of optimizing the objective in Eq. (1), a new objective is defined on the generative model with relaxed variables. In the second category (Jang et al., 2016; Khoshaman & Amin, 2018; Andriyash et al., 2018), the objective function inside the expectation in Eq. (1) is relaxed directly by replacing the discrete variables with their continuous version. Although this relaxation does not yield a consistent probabilistic objective, it produces results on par with the first group.
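As an example of such a relaxation, below is a hedged sketch of a binary Concrete (Gumbel-softmax) sample for a Bernoulli logit; the piece-wise linear relaxation of Andriyash et al. (2018) is not shown, and the helper name is illustrative.

```python
# Binary Concrete / Gumbel-softmax relaxation of a Bernoulli variable with
# logit `logit` and temperature `tau`. As tau -> 0 the relaxed sample in (0, 1)
# approaches a hard {0, 1} sample.
import torch

def relaxed_bernoulli_sample(logit, tau):
    u = torch.rand_like(logit).clamp(1e-6, 1.0 - 1e-6)
    logistic_noise = torch.log(u) - torch.log1p(-u)        # Logistic(0, 1) noise
    return torch.sigmoid((logit + logistic_noise) / tau)   # differentiable relaxed sample
```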

Importance weighted (IW) bounds: The variational lower bound in Eq. (1) on the data log-likelihood can be tightened using importance weighting. A $k$-sample estimation of the log-likelihood is

$\log p_\theta(\mathbf{x}) \;\geq\; \mathbb{E}_{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(k)} \sim q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} w^{(i)}\right], \qquad (2)$

where $w^{(i)} = p_\theta(\mathbf{x},\mathbf{z}^{(i)})/q_\phi(\mathbf{z}^{(i)}|\mathbf{x})$. Both continuous relaxation and reparameterized sampling can be used for optimizing this objective function.
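A minimal sketch of estimating this $k$-sample bound with a numerically stable logsumexp follows; `posterior.sample_and_log_prob` and `log_joint` are assumed interfaces, not part of the paper.

```python
# Sketch of the k-sample importance weighted bound in Eq. (2): draw k samples
# from q(z|x), form log weights log p(x, z_i) - log q(z_i|x), and average them
# inside the log with logsumexp.
import math
import torch

def iw_bound(x, posterior, log_joint, k):
    log_w = []
    for _ in range(k):
        z, log_q = posterior.sample_and_log_prob(x)   # assumed interface
        log_w.append(log_joint(x, z) - log_q)
    log_w = torch.stack(log_w)                        # shape: (k,)
    return torch.logsumexp(log_w, dim=0) - math.log(k)
```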

MCMC methods: MCMC encompasses a variety of methods for drawing approximate samples from a target distribution $q(\mathbf{z})$. Each MCMC method is characterized by a transition kernel $T(\mathbf{z}|\mathbf{z}')$ whose equilibrium distribution is equal to $q(\mathbf{z})$. MCMC can be considered a stochastic approximate solution to the fixed-point equation

$q(\mathbf{z}) = \int T(\mathbf{z}|\mathbf{z}')\, q(\mathbf{z}')\, d\mathbf{z}', \qquad (3)$

which is solved by sampling from an initial distribution and iteratively updating the samples by drawing conditional samples using $T$. Denoting the distribution of the samples after $m$ iterations by $T^{(m)}(\mathbf{z}|\mathbf{z}')$, where $T^{(m)}$ represents $m$ applications of the transition kernel, the theory of MCMC shows that under some regularity conditions $T^{(m)}(\mathbf{z}|\mathbf{z}') \to q(\mathbf{z})$ as $m \to \infty$. In practice, $T$ is usually chosen to satisfy the detailed balance condition

$T(\mathbf{z}|\mathbf{z}')\, q(\mathbf{z}') = T(\mathbf{z}'|\mathbf{z})\, q(\mathbf{z}), \qquad (4)$

which implies Eq. (3). Finally, $T^{(m)}$ is itself a valid transition kernel and satisfies Eq. (3) and Eq. (4).
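For completeness, integrating both sides of the detailed balance condition (4) over $\mathbf{z}'$ recovers the fixed-point equation (3):

$\int T(\mathbf{z}|\mathbf{z}')\, q(\mathbf{z}')\, d\mathbf{z}' \;=\; \int T(\mathbf{z}'|\mathbf{z})\, q(\mathbf{z})\, d\mathbf{z}' \;=\; q(\mathbf{z}) \int T(\mathbf{z}'|\mathbf{z})\, d\mathbf{z}' \;=\; q(\mathbf{z}).$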

Gibbs sampling: Gibbs sampling is a particular MCMC method for drawing samples from UGMs. In this method, a single variable $z_i$ is updated at each iteration by sampling from its conditional given all other variables. The method can be inefficient, as a complete sweep requires a sequential update of all variables. However, when the UGM is defined on a bipartite graph, $\mathbf{z}$ can be partitioned into disjoint sets $\mathbf{z}_1$ and $\mathbf{z}_2$, and the conditional independence among the components of $\mathbf{z}_1$ given $\mathbf{z}_2$ (and vice versa) can be leveraged to update each partition in parallel. (To be precise, having a bipartite graph is not a sufficient condition; the energy function should also have a form for which the conditionals $q(\mathbf{z}_1|\mathbf{z}_2)$ and $q(\mathbf{z}_2|\mathbf{z}_1)$ can be formed.) In this case, all variables can be updated in two half-sweeps. The transition kernel is written as

$T(\mathbf{z}|\mathbf{z}') = q(\mathbf{z}_2|\mathbf{z}_1)\, q(\mathbf{z}_1|\mathbf{z}'_2). \qquad (5)$
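The two half-sweeps of Eq. (5) amount to resampling each partition in parallel from its factorial conditional. A minimal sketch for an RBM, under the same sign convention as the earlier snippet:

```python
# Sketch of the two-half-sweep transition kernel in Eq. (5) for an RBM,
# inlining the factorial conditionals shown earlier.
import torch

def gibbs_transition(z1_prev, z2_prev, a, b, W):
    # First half-sweep: resample all of z1 in parallel given the previous z2.
    p1 = torch.sigmoid(-(a + W @ z2_prev))
    z1 = torch.bernoulli(p1)
    # Second half-sweep: resample all of z2 in parallel given the new z1.
    p2 = torch.sigmoid(-(b + W.t() @ z1))
    z2 = torch.bernoulli(p2)
    return z1, z2
```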

3 Learning Undirected Posteriors

In this section, we propose a gradient estimator for generic UGM posteriors and discuss how the estimator can be applied to RBM-based posteriors.

The training objective for a probabilistic encoder network $q_\phi(\mathbf{z}|\mathbf{x})$, with $q_\phi$ being a UGM, can be written as

$\mathcal{O}(\phi) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[f(\mathbf{z})\right], \qquad (6)$

where $f$ is a differentiable function of $\mathbf{z}$. For simplicity of exposition, we assume that $f$ does not depend on $\phi$. (If $f$ does depend on $\phi$, its contribution to the gradient is approximated with Monte Carlo sampling.) Using the fixed-point equation in Eq. (3), which holds for $T^{(m)}_\phi$ for any number of MCMC updates $m$, we have

$\mathcal{O}(\phi) = \mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\mathbb{E}_{T^{(m)}_\phi(\mathbf{z}|\mathbf{z}')}\!\left[f(\mathbf{z})\right]\right]. \qquad (7)$

To maximize Eq. (7), we require its gradient:

$\nabla_\phi\, \mathcal{O}(\phi) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\nabla_\phi \log q_\phi(\mathbf{z}'|\mathbf{x})\; \mathbb{E}_{T^{(m)}_\phi(\mathbf{z}|\mathbf{z}')}\!\left[f(\mathbf{z})\right]\right]}_{\mathcal{A}} + \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\nabla_\phi\, \mathbb{E}_{T^{(m)}_\phi(\mathbf{z}|\mathbf{z}')}\!\left[f(\mathbf{z})\right]\right]}_{\mathcal{B}}. \qquad (8)$

The term marked as $\mathcal{A}$ is written as

$\mathcal{A} = \int q_\phi(\mathbf{z}'|\mathbf{x})\, \nabla_\phi \log q_\phi(\mathbf{z}'|\mathbf{x}) \int T^{(m)}_\phi(\mathbf{z}|\mathbf{z}')\, f(\mathbf{z})\, d\mathbf{z}\, d\mathbf{z}' \qquad (9)$

$\phantom{\mathcal{A}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[f(\mathbf{z})\; \mathbb{E}_{T^{(m)}_\phi(\mathbf{z}'|\mathbf{z})}\!\left[\nabla_\phi \log q_\phi(\mathbf{z}'|\mathbf{x})\right]\right], \qquad (10)$

where we have used the detailed balance condition in the second line. The form of Eq. (10) makes it clear that $\mathcal{A}$ goes to zero in expectation as $m \to \infty$, because $T^{(m)}_\phi(\mathbf{z}'|\mathbf{z}) \to q_\phi(\mathbf{z}'|\mathbf{x})$ and $\mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\nabla_\phi \log q_\phi(\mathbf{z}'|\mathbf{x})\right] = 0$. However, $\mathcal{A}$ has high variance that does not decrease as $m \to \infty$.

The expectation in Eq. (8) marked as $\mathcal{B}$ involves the gradient of an expectation with respect to the MCMC transition kernel. Fortunately, the reparameterization trick can provide a low-variance gradient estimator for this term. Moreover, since $\mathcal{A}$ vanishes in expectation, the contribution $\mathcal{B}$ approaches the full gradient as $m \to \infty$.

Given the high variance of $\mathcal{A}$, it is beneficial to drop this term from the gradient in Eq. (8) and use the biased but lower-variance estimate $\mathcal{B}$. Furthermore, the choice of $m$ determines the computational complexity of the estimator, and its bias and variance. Our key observation is that increasing $m$ has little effect on the optimization performance because the decreased bias is accompanied by an increase in variance. This means that $m=1$ is a good choice with the smallest computational complexity. More specifically, we use the approximation

$\nabla_\phi\, \mathcal{O}(\phi) \approx \mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\nabla_\phi f(\hat{\mathbf{z}})\right], \qquad (11)$

where $\hat{\mathbf{z}}$ is a reparameterized sample from $T^{(m)}_\phi(\mathbf{z}|\mathbf{z}')$.

In principle, MCMC methods such as Metropolis-Hastings or Hamiltonian Monte Carlo (HMC) can be used as the transition kernel in Eq. (11). However, unbiased reparameterized sampling for these methods is challenging. For example, Metropolis-Hastings contains a nondifferentiable binary operation in the accept/reject step and HMC, in addition to the binary step, requires backpropagating through the gradients of the energy function and tuning hyper-parameters such as step size.

These complications can be avoided when Gibbs sampling of $q_\phi$ is possible. When $q_\phi$ is a UGM defined on a bipartite graph, the Gibbs transition kernel is simply $T_\phi(\mathbf{z}|\mathbf{z}') = q_\phi(\mathbf{z}_2|\mathbf{z}_1)\, q_\phi(\mathbf{z}_1|\mathbf{z}'_2)$, as in Eq. (5). This kernel does not contain an accept/reject step or any hyper-parameters. Assuming that the conditionals can be reparameterized, the gradient of the kernel is approximated efficiently using low-variance reparameterized sampling from each conditional. In this case, the gradient estimator for a single Gibbs update ($m=1$) is

$\nabla_\phi\, \mathcal{O}(\phi) \approx \mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\nabla_\phi f(\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2)\right], \qquad (12)$

where $\hat{\mathbf{z}}_1 \sim q_\phi(\mathbf{z}_1|\mathbf{z}'_2)$ and $\hat{\mathbf{z}}_2 \sim q_\phi(\mathbf{z}_2|\hat{\mathbf{z}}_1)$ are reparameterized samples. The extension to $m$ Gibbs updates is straightforward.
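To make Eq. (12) concrete, here is a hedged PyTorch sketch of a single reparameterized Gibbs update for a zero-mean bivariate Gaussian UGM; the precision parameterization, the objective `f`, and the initialization of `z2_prev` are illustrative assumptions, not the paper's setup.

```python
# Backpropagation through one reparameterized Gibbs update (m = 1, Eq. (12))
# for a zero-mean bivariate Gaussian with precision [[lam11, lam12], [lam12, lam22]].
# The Gaussian conditionals are exactly reparameterizable, so gradients with
# respect to the precision parameters flow through the sampled z.
import torch

lam11 = torch.tensor(1.0, requires_grad=True)
lam22 = torch.tensor(1.0, requires_grad=True)
lam12 = torch.tensor(0.0, requires_grad=True)

def reparam_gibbs_update(z2_prev):
    # z1 | z2 ~ N(-lam12 * z2 / lam11, 1 / lam11)
    z1 = -lam12 * z2_prev / lam11 + torch.randn(()) / torch.sqrt(lam11)
    # z2 | z1 ~ N(-lam12 * z1 / lam22, 1 / lam22)
    z2 = -lam12 * z1 / lam22 + torch.randn(()) / torch.sqrt(lam22)
    return z1, z2

def f(z1, z2):
    # Any differentiable objective; this quadratic is purely illustrative.
    return (z1 - 1.0) ** 2 + (z2 + 1.0) ** 2

# z' is a detached sample approximately drawn from q, e.g. from a persistent chain.
z2_prev = torch.tensor(0.3)
z1_new, z2_new = reparam_gibbs_update(z2_prev)
loss = f(z1_new, z2_new)
loss.backward()   # low-variance, biased gradients w.r.t. lam11, lam22, lam12
```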

Lastly, we note that the reparameterized sample $\hat{\mathbf{z}}$ has distribution $q_\phi(\mathbf{z}|\mathbf{x})$ if the chain generating $\mathbf{z}'$ is equilibrated. So, if $f$ has its own parameters $\theta$ (e.g., the parameters of the generative model in VAEs), the same sample can be used to compute an unbiased estimate of $\nabla_\theta\, \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[f_\theta(\mathbf{z})\right]$, where $\theta$ denotes the parameters of $f$.

3.1 Toy Example

Here, we assess the efficacy of our approximations on a multivariate Gaussian distribution. We consider learning the two-variable Gaussian target distribution depicted in Fig. 3(b), which is defined by the energy function $E(\mathbf{z}) = \frac{1}{2}(\mathbf{z}-\boldsymbol{\mu})^T \boldsymbol{\Lambda} (\mathbf{z}-\boldsymbol{\mu})$ for a fixed mean $\boldsymbol{\mu}$ and precision matrix $\boldsymbol{\Lambda}$. The learned distribution has the same form, and we learn its parameters by minimizing the Kullback-Leibler (KL) divergence to the target. We compare Eq. (12) for several values of $m$ with direct reparameterized sampling and with REINFORCE. Fig. 3(a) shows the KL divergence during training. Our method significantly outperforms REINFORCE due to the lower variance of its gradients, and there is little difference between different values of $m$ until the KL divergence becomes very small. Finally, our method performs worse than direct reparameterized sampling, which is expected due to the bias introduced by neglecting $\mathcal{A}$ in Eq. (8).

Figure 3: (a) Our gradient estimator (for various $m$) is compared with REINFORCE and reparameterized Gaussian samples when minimizing the KL divergence of a Gaussian to a target distribution (b).

3.2 Learning Boltzmann Machine Posteriors

In this section, we consider training DVAE##, a DVAE (Rolfe, 2016; Vahdat et al., 2018b, a) with an RBM prior and an RBM posterior. The objective function of DVAE## is given by Eq. (1), where the prior $p_\theta(\mathbf{z})$ is an RBM with parameters $\theta$ and the approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ is an RBM whose parameters are generated by a neural network. To optimize this bound, we need its gradient with respect to $\phi$ for each data point $\mathbf{x}$:

$\nabla_\phi\, \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x})\right]. \qquad (13)$

For each $\mathbf{x}$, this gradient has the form of Eq. (6) with $f(\mathbf{z}) = \log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x})$. Although the RBM has a bipartite structure, applying the gradient estimator in Eq. (12) is challenging due to the discrete nature of the latent variables. In order to apply the reparameterization trick, we further relax the binary variables to continuous variables using Gumbel-softmax or piece-wise linear relaxations. (Another option is to use unbiased gradient estimators such as REBAR (Tucker et al., 2017) or RELAX (Grathwohl et al., 2018); however, these approaches require additional function evaluations.) Representing the relaxation of $\hat{\mathbf{z}}_1 \sim q_\phi(\mathbf{z}_1|\mathbf{z}'_2)$ by $\boldsymbol{\zeta}_1$ and the relaxation of $\hat{\mathbf{z}}_2 \sim q_\phi(\mathbf{z}_2|\boldsymbol{\zeta}_1)$ by $\boldsymbol{\zeta}_2$, where $\mathbf{z}' \sim q_\phi(\mathbf{z}|\mathbf{x})$, we define the following gradient estimator:

$\nabla_\phi\, \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[f(\mathbf{z})\right] \approx \mathbb{E}_{q_\phi(\mathbf{z}'|\mathbf{x})}\!\left[\nabla_\phi f(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2)\right], \qquad (14)$

where $\nabla_\phi f(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2)$ is computed using automatic differentiation by backpropagating through reparameterized sampling. Thus far, we have only accounted for the dependency of $f$ on $\phi$ through the samples (this is known as the path-wise derivative). However, in the case of VAEs and IWAEs, $f$ also depends on $\phi$ through the $\log q_\phi(\mathbf{z}|\mathbf{x})$ contribution. Note that this dependency can be ignored in the case of VAEs as $\mathbb{E}_{q_\phi}\!\left[\nabla_\phi \log q_\phi(\mathbf{z}|\mathbf{x})\right] = 0$ (Roeder et al., 2017), and it can be removed in IWAEs using doubly reparameterized gradient estimation (Tucker et al., 2018). This enables us to ignore the partition function of $q_\phi(\mathbf{z}|\mathbf{x})$ and its gradient during training, as it does not enter the path-wise derivative (see Appendix C).

If $m > 1$, each Gibbs sampling step can be relaxed. However, as $m$ increases, sampling from the relaxed chain diverges from the corresponding discrete chain, resulting in even more biased gradient estimation. Thus, we use $m = 1$ in our experiments (see Appendix A for an examination of larger $m$). Moreover, in Eq. (14), our estimator also requires samples $\mathbf{z}'$ from the RBM posterior in the outer expectation. These samples are obtained by running persistent chains for each training datapoint. Finally, $f$ may have its own parameters (i.e., the parameters $\theta$ of the generative model). The gradient of $f$ with respect to these parameters can be obtained using either the relaxed samples $\boldsymbol{\zeta}_1$ and $\boldsymbol{\zeta}_2$, or the discrete samples $\hat{\mathbf{z}}_1$ and $\hat{\mathbf{z}}_2$. We use the former as it results in a single function evaluation per instance in each minibatch parameter update.
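For illustration, the relaxed single Gibbs update behind Eq. (14) might look like the following sketch; it uses the binary Concrete relaxation via the hypothetical `relaxed_bernoulli_sample` helper sketched in Sec. 2 (the paper also uses piece-wise linear relaxations), and `z2_prev` is a detached persistent-chain sample.

```python
# Sketch of one relaxed Gibbs update for an RBM posterior with parameters
# (a, b, W) produced by the encoder. The discrete half-sweeps of Eq. (12) are
# replaced with relaxed samples so that automatic differentiation can
# backpropagate into (a, b, W).
import torch

def relaxed_gibbs_update(z2_prev, a, b, W, tau=1.0):
    logits1 = -(a + W @ z2_prev)                    # logits of q(z1 | z2')
    zeta1 = relaxed_bernoulli_sample(logits1, tau)  # relaxed, differentiable z1
    logits2 = -(b + W.t() @ zeta1)                  # logits of q(z2 | zeta1)
    zeta2 = relaxed_bernoulli_sample(logits2, tau)  # relaxed, differentiable z2
    return zeta1, zeta2
```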

Algorithm 1 summarizes our method for training DVAE## with an RBM prior and posterior. We represent the Boltzmann prior by $p_\theta(\mathbf{z})$. The algorithm takes as input the number of (non-relaxed) Gibbs sweeps used to update the persistent chain that generates $\mathbf{z}'$ in Eq. (14), and the number of relaxed Gibbs sweeps $m$. The objective function is constructed such that automatic differentiation will yield the gradient estimator in Eq. (14) for $\phi$ and a gradient evaluated on the relaxed samples for the decoder parameters. For the gradient with respect to the RBM prior parameters $\theta$, we use the method introduced in Sec. E of (Vahdat et al., 2018b) to obtain proper gradients.

  Input: training minibatch $\{\mathbf{x}^{(i)}\}$, number of Gibbs sweeps for the persistent chains, and number of relaxed Gibbs sweeps $m$.
  Output: training objective function $\mathcal{O}$
  $\mathcal{O} \leftarrow 0$
  for each $\mathbf{x}^{(i)}$ do
      posterior RBM parameters $\leftarrow$ encoder($\mathbf{x}^{(i)}$)
      $\mathbf{z}' \leftarrow$ retrieve_persistent_states($i$)
      $\mathbf{z}' \leftarrow$ update_gibbs_samples($\mathbf{z}'$, posterior parameters)
      store_updated_states($i$, $\mathbf{z}'$)
      $(\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2) \leftarrow$ relax_gibbs_sample($\mathbf{z}'$, posterior parameters, $m$)
      $(\bar{\boldsymbol{\zeta}}_1, \bar{\boldsymbol{\zeta}}_2) \leftarrow$ stop_gradient($\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2$)
      add the per-example variational objective, evaluated on the relaxed samples (with the posterior partition function dropped), to $\mathcal{O}$
  end for
Algorithm 1 DVAE## with RBM prior and posterior
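As an illustration of the persistent-chain bookkeeping in Algorithm 1 (retrieve, update with non-relaxed Gibbs sweeps, store), here is a hedged sketch; the class name, state initialization, and sweep count are assumptions, and `gibbs_transition` is the hypothetical kernel sketched in Sec. 2.

```python
# Per-example persistent Gibbs chains: each training example keeps its own
# state, which is retrieved, updated under the current encoder parameters, and
# stored back; the returned state plays the role of z' in Eq. (14).
import torch

class PersistentChains:
    def __init__(self, num_examples, n1, n2):
        self.z1 = torch.bernoulli(0.5 * torch.ones(num_examples, n1))
        self.z2 = torch.bernoulli(0.5 * torch.ones(num_examples, n2))

    def update(self, idx, a, b, W, num_sweeps=1):
        z1, z2 = self.z1[idx], self.z2[idx]
        with torch.no_grad():                 # these sweeps are not differentiated
            for _ in range(num_sweeps):
                z1, z2 = gibbs_transition(z1, z2, a, b, W)
        self.z1[idx], self.z2[idx] = z1, z2   # store updated state
        return z1, z2                         # detached sample z'
```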

4 Experiments

We examine our proposed gradient estimator for training undirected posteriors on three tasks: variational autoencoders, importance weighted autoencoders, and structured prediction models, using the binarized MNIST (Salakhutdinov & Murray, 2008) and OMNIGLOT (Lake et al., 2015) datasets. We follow the experimental setup in DVAE# (Vahdat et al., 2018a) with minor modifications outlined below.

4.1 Variational Autoencoders

We train a VAE with a generative model of the form $p_\theta(\mathbf{z})\, p_\theta(\mathbf{x}|\mathbf{z})$, where $p_\theta(\mathbf{z})$ is an RBM and $p_\theta(\mathbf{x}|\mathbf{z})$ is a neural network. The conditional $p_\theta(\mathbf{x}|\mathbf{z})$ is represented using a fully-connected neural network with two 200-unit hidden layers, tanh activations, and batch normalization, similar to DVAE++, DVAE# (Vahdat et al., 2018b, a) and GumBolt (Khoshaman & Amin, 2018). We evaluate the performance of the generative models trained with either an undirected posterior (DVAE##) or a directed posterior.

For DVAE##, $q_\phi(\mathbf{z}|\mathbf{x})$ is modeled with a neural network having two tanh hidden layers that predicts the parameters of the posterior RBM (Fig. 7(a)). Training DVAE## is done using Algorithm 1 with a piece-wise linear relaxation (Andriyash et al., 2018) for the relaxed Gibbs samples. We follow (Vahdat et al., 2018a) for batch size, learning rate schedule, and KL warm-up parameters. During training, samples from the prior are required for computing the gradient with respect to the prior parameters. This sampling is done using the QuPA library, which offers population-annealing-based sampling and partition function estimation (using AIS) from within TensorFlow. We also use QuPA for sampling the undirected posteriors and estimating their partition function during evaluation. We explore VAEs with equally-sized RBM priors and posteriors consisting of either 200 (100×100) or 400 (200×200) latent variables. Test set negative log-likelihoods are computed using 4000 importance-weighted samples.

Baselines: We compare the RBM posteriors with directed posteriors where the directed models are factored across groups of latent variables as $q(\mathbf{z}|\mathbf{x}) = \prod_{j=1}^{L} q(\mathbf{z}_j|\mathbf{z}_{<j},\mathbf{x})$. Each conditional $q(\mathbf{z}_j|\mathbf{z}_{<j},\mathbf{x})$ is factorial across the components of $\mathbf{z}_j$, and $L$ is the total number of hierarchical layers. A fair comparison between directed and undirected posteriors is challenging because they differ in structure and in the number of parameters. We design a baseline so that the number of parameters and the number of nonlinearities is identical for a directed posterior with a single group ($L=1$) and an undirected posterior with no pairwise interactions. This is reasonable as both cases reduce to a mean-field posterior.

Following this principle, we initially experimented with the structure introduced in DVAE# (these models are also used in DVAE++ and GumBolt), in which each factor is represented using two tanh nonlinearities (see Fig. 7(b)). However, our experiments indicated that when the number of latent variables is held constant, increasing the number of layers does not necessarily improve the generative performance.

In undirected posteriors, the parameters of the posterior for each training example are predicted using a single shared context feature (Fig. 7(a)). However, in DVAE#, parallel neural networks generate the parameters of the posterior for each level of the hierarchy. Inspired by this observation, we define a new structure for the directed posteriors by introducing a shared (200-dimensional) context feature that is computed using two tanh nonlinearities. This shared feature is fed to each subsequent conditional, each represented with a linear layer (see Fig. 7(c)). In Appendix B, the shared-context structure is compared against the original structure.

We examine three recent methods for training VAEs with directed posteriors with shared context. These baselines are: i) DVAE# (Vahdat et al., 2018a), which uses the power-function distribution to relax the objective function; ii) the Concrete relaxation used in GumBolt (Khoshaman & Amin, 2018); and iii) the piece-wise linear (PWL) relaxation (Andriyash et al., 2018). All models are trained using 1, 2, or 4 hierarchical levels.

Results: The performance of DVAE## is compared against the baselines in Table 1. We make two observations. i) The baselines with a single hierarchical level are equivalent to a mean-field posterior, which is also the special case of an undirected posterior with no pairwise terms. Since PWL and DVAE## use the same continuous relaxation, we can compare PWL and DVAE## to see how introducing pairwise interactions improves the quality of the DVAE models. In our experiments, we consistently observe an improvement of roughly half a nat arising from the pairwise interactions. ii) DVAE## with undirected posteriors outperforms the directed baselines in most cases, indicating that UGMs form more appropriate posteriors.

Figure 7: Neural networks representing $q_\phi(\mathbf{z}|\mathbf{x})$: (a) A 2-layer network predicts the parameters of the posterior RBM. (b) The directed posterior used in DVAE# consists of parallel 2-layer networks that successively predict the logits for each conditional in the hierarchy. (c) Our directed posterior differs from DVAE# and predicts the parameters of each conditional using a linear transformation given the shared context feature and the previous $\mathbf{z}_{<j}$.
                          Prior RBM: 100×100                                  Prior RBM: 200×200
Dataset   Layers   DVAE#         Concrete      PWL           DVAE##        DVAE#         Concrete      PWL           DVAE##
MNIST     1        84.97±0.03    84.66±0.04    84.62±0.05    84.09±0.06    83.21±0.03    83.19±0.05    83.22±0.05    82.75±0.05
          2        84.96±0.05    84.71±0.03    84.50±0.07    --            83.13±0.04    83.04±0.02    82.99±0.05    --
          4        84.58±0.02    84.39±0.04    84.07±0.04    --            82.93±0.04    83.14±0.04    82.90±0.03    --
OMNIGLOT  1        101.64±0.05   101.41±0.06   101.24±0.02   100.69±0.04   99.51±0.03    99.39±0.04    99.32±0.04    98.61±0.08
          2        101.75±0.04   101.39±0.06   101.14±0.05   --            99.40±0.05    99.12±0.07    99.10±0.04    --
          4        101.74±0.07   102.04±0.10   101.14±0.05   --            99.47±0.04    99.97±0.02    99.30±0.03    --
Table 1: The performance of DVAE## is compared against hierarchical posteriors with different relaxations. Mean±standard deviation of the negative log-likelihood over five runs is reported. Models are trained with the variational bound. Boldface numbers in the original indicate the best performing models per latent variable size and dataset.

4.1.1 Examining Undirected Posteriors

We examine the posteriors trained with DVAE## on MNIST in terms of the number of modes and the difficulty of sampling and partition function estimation.

Multimodality: To estimate the number of modes, we find a variety of mean-field solutions as follows: Given an RBM, we draw samples using QuPA and use each sample to initialize the construction of a mean-field approximation. Specifically, we initialize the mean parameters of a factorial Bernoulli distribution to the sample values and then iteratively minimize the KL divergence until we converge to a fixed point describing a mode. In this way, the initial population of samples converges to a perhaps smaller number of unique modes.
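A hedged sketch of this mode-counting procedure is given below; the mean-field update equations follow the RBM convention used earlier, and the iteration count, rounding, and deduplication scheme are illustrative assumptions rather than the authors' exact recipe.

```python
# Count approximate modes of an RBM: start each mean-field approximation at a
# sampled state, iterate the coordinate fixed-point updates until the Bernoulli
# means settle, then round and deduplicate the resulting fixed points.
import torch

def mean_field_fixed_point(z1_init, z2_init, a, b, W, num_iters=200):
    mu1, mu2 = z1_init.clone().float(), z2_init.clone().float()
    for _ in range(num_iters):
        mu1 = torch.sigmoid(-(a + W @ mu2))       # coordinate update for group 1
        mu2 = torch.sigmoid(-(b + W.t() @ mu1))   # coordinate update for group 2
    return mu1, mu2

def count_modes(samples, a, b, W):
    modes = set()
    for z1, z2 in samples:
        mu1, mu2 = mean_field_fixed_point(z1, z2, a, b, W)
        key = tuple(torch.round(torch.cat([mu1, mu2])).int().tolist())
        modes.add(key)
    return len(modes)
```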

Using this method, we observe that the majority of trained posteriors have a single mode. This is in contrast with the prior RBM, which typically has 10 to 20 modes. However, we also observe that the KL divergence from the converged mean-field distributions to the RBM posteriors is typically non-negligible. This indicates that, while most RBM posteriors are unimodal, the structure of the mode is not captured completely by a factorial distribution.

The difficulty of sampling and partition function estimation: Fig. 10 visualizes the deviation of log partition function estimates for an RBM posterior and prior for different numbers of AIS temperatures. As shown in the figure, the number of temperatures required for achieving an acceptable precision for the RBM posterior is 10 times smaller than the number of temperatures required for the RBM prior. The main reason for this difference is that the RBM posteriors have strong linear biases. When annealing starts from a mean-field distribution containing only the biases, AIS requires fewer interpolating temperatures in order to accurately approximate the partition function. However, RBM priors, which do not contain strong linear biases, do not benefit from this.

Figure 10: The difficulty of sampling and partition function estimation is visualized using (a) the mean absolute difference between the log partition function estimates and the true value and (b) the variance of the estimates for different numbers of AIS temperatures. The true value is computed using a very large number of temperatures. The number of temperatures required to reach a given precision for the RBM posterior is 10x smaller than the number required for the RBM prior.

4.2 Importance Weighted Autoencoders

In this section, we examine the generative model introduced in the previous section but trained using the IW bound. For comparison, we only use the PWL baseline, as it achieves the best performance among the directed posterior baselines in Table 1. For training, we use the same hyperparameters as in the previous section, with an additional hyperparameter $k$ representing the number of IW samples.

To sidestep computation of the gradient of the partition function for undirected posteriors in the IW bound, we use the path-wise gradient estimator introduced in (Tucker et al., 2018). However, we observe that an annealing scheme in IWAEs (similar to KL warm-up in VAEs) improves the performance of the generative model by preventing the latent variables from turning off. In Appendix C, we introduce this annealing mechanism for training IWAEs and show how the path-wise gradient can be computed while annealing the objective function.

The experimental results for training DVAE## are compared against directed posteriors in Table 2. As can be seen, DVAE## achieves a comparable performance on MNIST, but outperforms the directed posteriors on OMNIGLOT.

                     Prior RBM: 100×100                                              Prior RBM: 200×200
Dataset   k    PWL (1 layer)  PWL (2 layers)  PWL (4 layers)  DVAE##         PWL (1 layer)  PWL (2 layers)  PWL (4 layers)  DVAE##
MNIST     1    84.60±0.04     84.49±0.03      84.04±0.03      84.06±0.03     83.27±0.06     82.99±0.04      82.89±0.03      82.76±0.05
          5    84.15±0.06     83.88±0.02      83.53±0.06      83.56±0.02     83.24±0.06     82.90±0.06      82.65±0.03      82.76±0.06
          25   83.96±0.05     83.70±0.03      83.38±0.02      83.44±0.04     83.38±0.04     83.05±0.03      82.84±0.02      82.97±0.11
OMNIGLOT  1    101.21±0.06    101.14±0.07     101.12±0.03     100.68±0.05    99.32±0.02     99.11±0.02      99.28±0.09      98.61±0.06
          5    100.72±0.05    100.58±0.05     100.51±0.04     100.16±0.04    99.19±0.05     98.69±0.09      98.78±0.02      98.34±0.03
          25   100.58±0.01    100.48±0.05     100.37±0.03     100.02±0.04    99.10±0.03     98.70±0.07      98.86±0.06      98.34±0.04
Table 2: The performance of DVAE## is compared against hierarchical posteriors with the PWL relaxation trained with the IW bound. Mean±standard deviation of the negative log-likelihood over five runs is reported. Boldface numbers in the original indicate the best performing models per latent variable size, dataset, and $k$.

4.3 Structured Output Prediction

Structured prediction is a form of conditional likelihood estimation (Sohn et al., 2015) concerned with modeling the distribution of a high-dimensional output given an input. Here, we predict the distribution of the bottom half of an image, $\mathbf{x}_b$, given the top half, $\mathbf{x}_t$. For this, we follow the IW objective function proposed by (Raiko et al., 2015):

$\mathbb{E}_{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(k)} \sim q(\mathbf{z}|\mathbf{x}_t)}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} p\big(\mathbf{x}_b|\mathbf{z}^{(i)}\big)\right] + \lambda\, \mathbb{H}\!\left[q(\mathbf{z}|\mathbf{x}_t)\right],$

where we have added an entropy term $\mathbb{H}$ to prevent the model from over-fitting the training data. The expectation (without the entropy term) is identical to the IW bound in Eq. (2) in which both the prior and the approximate posterior are set to $q(\mathbf{z}|\mathbf{x}_t)$, and it can thus be considered a lower bound on $\log p(\mathbf{x}_b|\mathbf{x}_t)$. The scalar $\lambda$ is annealed during training from 1 to a small value.

The experimental results for the structured prediction problem are reported in Table 3. Here, the latent space is limited to 200 binary variables, and $q(\mathbf{z}|\mathbf{x}_t)$ and $p(\mathbf{x}_b|\mathbf{z})$ are modeled by fully-connected networks with one 200-unit hidden layer to limit overfitting. Performance is measured in terms of the average log-conditional, estimated using 4000 IW samples. Undirected posteriors outperform the directed models by several nats on MNIST and OMNIGLOT.

                     RBM Size: 100×100
Dataset   k    PWL (1 layer)  PWL (2 layers)  PWL (4 layers)  DVAE##
MNIST     1    60.82±0.17     59.54±0.12      60.22±0.12      57.13±0.18
          5    52.38±0.03     52.27±0.16      52.67±0.05      49.25±0.07
          25   48.30±0.08     48.41±0.05      48.61±0.11      45.75±0.10
OMNIGLOT  1    67.06±0.04     67.11±0.11      67.35±0.08      63.74±0.08
          5    59.00±0.03     59.28±0.08      59.34±0.10      57.53±0.08
          25   54.79±0.04     54.83±0.04      54.88±0.06      54.32±0.04
Table 3: The performance of DVAE## is compared against hierarchical posteriors with the PWL relaxation on the structured prediction problem. Mean±standard deviation of the negative log-likelihood over five runs is reported.

5 Conclusions and Future Directions

We have introduced a gradient estimator for stochastic optimization with undirected distributions and showed that VAEs and IWAEs with RBM posteriors can outperform similar models having directed posteriors. These encouraging results for UGMs over discrete variables suggest a number of promising research directions.

Exponential Family Harmoniums: The methods we have outlined also apply to UGMs defined over continuous variables (as in Fig. 3). Exponential Family Harmoniums (EFHs) (Welling et al., 2005) generalize RBMs and are composed of two disjoint groups of exponential-family random variables where the sufficient statistics of the groups are tied to each other with a quadratic interaction. EFHs retain the factorial structure of the conditionals over each group given the other, so that we can easily backpropagate through Gibbs updates. However, the exponential-family generalization allows for the mixing of different latent variable types and distributions.

Auxiliary Random Variables: An appealing property of RBMs (and EFHs) is that either group of variables can be analytically marginalized out. This property can be exploited to form more expressive approximate posteriors. Consider a generative model $p_\theta(\mathbf{x},\mathbf{z})$. To approximate the posterior, we augment the latent space of $\mathbf{z}$ with a group of auxiliary variables and form a UGM over the joint space. In this case, the marginal approximate posterior over $\mathbf{z}$ has more expressive power. Sampling from the marginal can be done by sampling from the joint, and our gradient estimator can be used for training the parameters of the UGM. The VAE objective can be optimized easily as the log-marginal has an analytic expression, up to the normalization constant, given a sample from the posterior.

Combining DGMs and UGMs: In practice, VAEs trained with DGM posteriors have shown promising results. However, each factor in DGMs is typically assumed to be a product of independent distributions. We can build more powerful DGMs by modeling each conditional factor using a UGM.

Sampling and Computing Partition Function: A challenge when using continuous-variable UGMs as posteriors is the requirement for sampling and partition function estimation during evaluation if the test data log-likelihood is estimated using the IW bound. However, we note that approximate posteriors are only required for training VAEs. At evaluation, we can sidestep sampling and partition function estimation challenges using techniques such as AIS (Wu et al., 2017) starting from the prior distribution when the latent variables are continuous.


Appendix A Toy Example: RBM Posterior

In this section, we examine, on a toy example, the dependence of the gradient estimator on $m$, the number of MCMC updates. Here, an RBM is learned to minimize the KL divergence to a mixture of Bernoulli distributions:

$p(\mathbf{z}) = \sum_{j=1}^{8} \pi_j\, p_j(\mathbf{z}), \qquad (15)$

$p_j(\mathbf{z}) = \prod_{d} \mu_{j,d}^{z_d}\, (1-\mu_{j,d})^{1-z_d}. \qquad (16)$

We choose 8 mixture components with weights $\pi_j$ chosen uniformly and Bernoulli mean parameters $\mu_{j,d}$ set so that each component has a fixed variance. The parameters are trained with the Adam optimizer (Kingma & Ba, 2014), using a fixed learning rate, for 10,000 iterations. We show the results for several values of $m$ in Fig. 11. We note that the effectiveness of the optimization does not depend on $m$, indicating that the bias reduction obtained by increasing $m$ is offset by increased variance.

Figure 11: Optimization of KL divergence from an RBM to a mixture of Bernoulli distributions using the proposed gradient estimator. Increasing the number of MCMC updates does not improve the gradient estimation.

Appendix B Examining the Shared Context

For the directed posterior baselines, we initially experimented with the structure introduced in DVAE#, in which each factor is represented using two tanh nonlinearities, as shown in Fig. 7(b). However, initial experiments indicated that when the number of latent variables is held constant, increasing the number of hierarchical layers does not improve the generative performance.

For undirected posteriors, the parameters of the posterior for each training example are predicted using a single shared context feature (Fig. 7(a)). Inspired by this observation, we define a new structure for the directed posteriors by introducing a shared context feature that is computed using two tanh nonlinearities. This shared feature is fed to each subsequent conditional, each represented with a linear layer (see Fig. 7(c)).

In Table 4, we compare the original structure used in DVAE# to the structure with a shared context. In almost all cases the shared context feature improves the performance of the original structure. Moreover, increasing the number of hierarchical layers often improves the performance of the new structure.

Note that in Table 4, we report the average negative log-likelihood measured on the test set over 5 runs. In the single-layer case, both structures are identical and they achieve statistically similar performance.

                              Prior RBM Size: 100×100                                    Prior RBM Size: 200×200
                    DVAE#              Concrete           PWL                 DVAE#              Concrete           PWL
Dataset   Layers    orig     context   orig     context   orig     context   orig     context   orig     context   orig     context
MNIST     1         84.95    84.97     84.65    84.66     84.57    84.62     83.26    83.21     83.21    83.19     83.23    83.22
          2         84.81    84.96     84.75    84.71     84.70    84.50     83.26    83.13     83.24    83.04     83.38    82.99
          4         84.61    84.58     84.90    84.39     85.00    84.07     83.13    82.93     83.69    83.14     83.85    82.90
OMNIGLOT  1         101.69   101.64    101.41   101.41    101.21   101.24    99.53    99.51     99.39    99.39     99.37    99.32
          2         101.84   101.75    102.45   101.39    101.64   101.14    99.66    99.40     100.52   99.12     100.07   99.10
          4         101.93   101.74    102.76   102.04    101.97   101.14    99.63    99.47     100.99   99.97     100.45   99.30
Table 4: The performance of the structure with a shared context feature (Fig. 7(c), "context" columns) is compared against the original structure used in DVAE# (Fig. 7(b), "orig" columns). The shared context feature often improves the performance.

Appendix C Annealing the Importance Weighted Bound

KL annealing (Sønderby et al., 2016) is used to prevent latent variables from turning off in VAE training. A scalar $\beta$ is introduced which weights the KL contribution to the ELBO and is annealed from zero to one during training:

$\mathcal{L}^\beta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \beta\, \mathrm{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p_\theta(\mathbf{z})\right). \qquad (17)$

Since the KL contribution is small early in training, the VAE model initially optimizes the reconstruction term, which prevents the approximate posterior from collapsing onto the prior.

Here, we apply the same approach to the importance weighted bound by rewriting the IW bound as

$\mathcal{L}_k^\beta(\mathbf{x}) = \mathbb{E}_{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(k)} \sim q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} p_\theta\big(\mathbf{x}|\mathbf{z}^{(i)}\big)\, \big(r^{(i)}\big)^{\beta}\right], \qquad (18)$

where $r^{(i)} = p_\theta(\mathbf{z}^{(i)})/q_\phi(\mathbf{z}^{(i)}|\mathbf{x})$. Similar to the KL annealing idea, we anneal the scalar $\beta$ during training from zero to one. When $\beta$ is small, the IW bound emphasizes reconstruction of the data, which inhibits the latent variables from turning off. When $k=1$, Eq. (18) reduces to Eq. (17).
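Under the reconstruction of Eq. (18) above, the annealed bound can be estimated from $k$ posterior samples as in the following sketch; the tensor interface is an assumption.

```python
# Sketch of the beta-annealed importance weighted bound: the log-weights mix
# the reconstruction term with a beta-scaled prior/posterior ratio.
# log_p_x_z, log_p_z, and log_q_z are assumed to be tensors of shape (k,)
# holding log p(x|z_i), log p(z_i), and log q(z_i|x) for k posterior samples.
import math
import torch

def annealed_iw_bound(log_p_x_z, log_p_z, log_q_z, beta):
    log_v = log_p_x_z + beta * (log_p_z - log_q_z)   # log of p(x|z_i) * r_i^beta
    k = log_v.shape[0]
    return torch.logsumexp(log_v, dim=0) - math.log(k)
```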

The gradient of Eq. (18) with respect to $\phi$, the parameters of $q_\phi(\mathbf{z}|\mathbf{x})$, and $\theta$, the parameters of the generative model, is

$\nabla_{\phi,\theta}\, \mathcal{L}_k^\beta(\mathbf{x}) = \nabla_{\phi,\theta}\, \mathbb{E}_{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(k)} \sim q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log \frac{1}{k}\sum_{i=1}^{k} v^{(i)}\right], \qquad (19)$

where $v^{(i)} = p_\theta(\mathbf{x}|\mathbf{z}^{(i)})\, (r^{(i)})^{\beta}$. Writing each $\mathbf{z}^{(i)}$ as a reparameterized function of $\phi$, the gradient with respect to $\phi$ is evaluated as

$\nabla_\phi\, \mathcal{L}_k^\beta(\mathbf{x}) = \mathbb{E}\!\left[\sum_{i=1}^{k} \tilde{v}^{(i)}\, \frac{\partial \log v^{(i)}}{\partial \phi}\bigg|_{\mathbf{z}^{(i)}}\right] + \mathbb{E}\!\left[\sum_{i=1}^{k} \tilde{v}^{(i)}\, \frac{\partial \log v^{(i)}}{\partial \mathbf{z}^{(i)}}\, \frac{\partial \mathbf{z}^{(i)}}{\partial \phi}\right], \qquad (20)$

where $\tilde{v}^{(i)} = v^{(i)}/\sum_j v^{(j)}$ are the normalized weights. The second term on the right-hand side of Eq. (20) is the path-wise gradient, which can be computed easily using our biased gradient estimator. This term does not have any dependence on the partition function of $q_\phi(\mathbf{z}|\mathbf{x})$, since $\partial \log q_\phi(\mathbf{z}|\mathbf{x})/\partial \mathbf{z}$ depends only on the energy function.

However, the first term on the right-hand side of Eq. (20) does contain $\partial \log q_\phi(\mathbf{z}^{(i)}|\mathbf{x})/\partial \phi$, which can be high-variance. We apply the doubly reparameterized method proposed by (Tucker et al., 2018) to remove this term. Following a derivation similar to (Tucker et al., 2018), this score-function contribution can be rewritten as an additional path-wise term weighted by the normalized weights $\tilde{v}^{(i)}$, yielding a doubly reparameterized estimator for Eq. (18) that involves only path-wise derivatives.