Iterative Refinement of the Approximate Posterior for Directed Belief Networks (http://arxiv.org/abs/1511.06382)
Variational methods that rely on a recognition network to approximate the posterior of directed graphical models offer better inference and learning than previous methods. Recent advances that exploit the capacity and flexibility in this approach have expanded what kinds of models can be trained. However, as a proposal for the posterior, the capacity of the recognition network is limited, which can constrain the representational power of the generative model and increase the variance of Monte Carlo estimates. To address these issues, we introduce an iterative refinement procedure for improving the approximate posterior of the recognition network and show that training with the refined posterior is competitive with state-of-the-art methods. The advantages of refinement are further evident in an increased effective sample size, which implies a lower variance of gradient estimates.
Variational methods have surpassed traditional methods such as Markov chain Monte Carlo (MCMC, Neal, 1992) and mean-field coordinate ascent (Saul et al., 1996) as the de facto standard approach for training directed graphical models. Helmholtz machines (Dayan et al., 1995) are a type of directed graphical model that approximates the posterior distribution with a recognition network that provides fast inference as well as flexible learning which scales well to large datasets. Many recent significant advances in training Helmholtz machines come as estimators for the gradient of the objective w.r.t. the approximate posterior. The most successful of these methods, variational autoencoders (VAE, Kingma and Welling, 2013), relies on a re-parameterization of the latent variables to pass the learning signal to the recognition network. This type of parameterization, however, is not available with discrete units, and the naive Monte Carlo estimate of the gradient has variance too high to be practical (Dayan et al., 1995; Kingma and Welling, 2013).
However, good estimators are available through importance sampling (Bornschein and Bengio, 2014), input-dependent baselines (Mnih and Gregor, 2014), a combination of baselines and importance sampling (Mnih and Rezende, 2016), and parametric Taylor expansions (Gu et al., 2015b). Each of these methods strives to be a lower-variance, unbiased gradient estimator. However, the reliance on the recognition network means that the quality of learning is bounded by the capacity of the recognition network, which in turn raises the variance.
We demonstrate reducing the variance of Monte Carlo based estimators by iteratively refining the approximate posterior provided by the recognition network. The complete learning algorithm follows expectation-maximization (EM, Dempster et al., 1977; Neal and Hinton, 1998), where in the E-step the variational parameters of the approximate posterior are initialized using the recognition network, then iteratively refined. The refinement procedure provides an asymptotically-unbiased estimate of the variational lowerbound, which is tight w.r.t. the true posterior and can be used to easily train both the recognition network and generative model during the M-step. The variance-reducing refinement is available to any directed graphical model and can give a more accurate estimate of the log-likelihood of the model.
For the iterative refinement step, we use adaptive importance sampling (AIS, Oh and Berger, 1992). We demonstrate that the proposed refinement procedure is effective for training directed belief networks, providing better or competitive estimates of the log-likelihood. We also demonstrate that the improved posterior from refinement can improve inference and the accuracy of evaluation for models trained by other methods.
A directed belief network is a generative directed graphical model consisting of a conditional density $p(x|h)$ and a prior $p(h)$, such that the joint density can be expressed as $p(x, h) = p(x|h)\,p(h)$. In particular, the joint density factorizes into a hierarchy of conditional densities and a prior: $p(x, h) = p(x|h_1)\,p(h_L) \prod_{l=1}^{L-1} p(h_l|h_{l+1})$, where $p(h_l|h_{l+1})$ is the conditional density at the $l$-th layer and $p(h_L)$ is a prior distribution of the top layer. Sampling from the model can be done simply via ancestral sampling, first sampling from the prior, then subsequently sampling from each layer until reaching the observation, $x$. This latent variable structure can improve model capacity, but inference can still be intractable, as is the case in sigmoid belief networks (SBN, Neal, 1992), deep belief networks (DBN, Hinton et al., 2006), deep autoregressive networks (DARN, Gregor et al., 2013), and other models in which each of the conditional distributions involves complex nonlinear functions.
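The ancestral sampling procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layers are taken to be Bernoulli with sigmoid activations (as in a sigmoid belief network), and the helper names and the toy weights are hypothetical rather than taken from any published implementation.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_bernoulli(probs, rng):
    """Draw one binary value per probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def conditional_probs(weights, bias, parent):
    """p(child_i = 1 | parent) = sigmoid(weights[i] . parent + bias[i])."""
    return [
        sigmoid(sum(w * h for w, h in zip(row, parent)) + b)
        for row, b in zip(weights, bias)
    ]


def ancestral_sample(prior_bias, layers, rng):
    """Sample the top layer from the prior, then each layer below in turn.

    `layers` is ordered from the top of the hierarchy down to the observation,
    so the returned list is [h_L, ..., h_1, x].
    """
    state = sample_bernoulli([sigmoid(b) for b in prior_bias], rng)
    states = [state]
    for weights, bias in layers:
        state = sample_bernoulli(conditional_probs(weights, bias, state), rng)
        states.append(state)
    return states


# Hypothetical toy network: 2 top units -> 3 hidden units -> 4 observed units.
toy_prior_bias = [0.0, 0.0]
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]],
     [-1.0, -1.0, -1.0, -1.0]),
]
toy_states = ancestral_sample(toy_prior_bias, toy_layers, random.Random(0))
```

Sampling in this direction is cheap; the difficulty discussed next is the reverse direction, inferring the hidden states given an observation.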
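The objective we consider is the likelihood function, $p(x; \phi)$, where $\phi$ represents the parameters of the generative model (e.g., a directed belief network). Estimating the likelihood function given the joint distribution, $p(x, h; \phi)$, above is not generally possible as it requires intractable marginalization over $h$. Instead, we introduce an approximate posterior, $q(h|x)$, as a proposal distribution. In this case, the log-likelihood can be bounded from below (for clarity of presentation, we will often omit dependence on the parameters of the generative model, so that $p(x, h) = p(x, h; \phi)$):

$$\log p(x) = \log \sum_h p(x, h) \ge \sum_h q(h|x) \log \frac{p(x, h)}{q(h|x)} = \mathbb{E}_{q(h|x)}\left[\log \frac{p(x, h)}{q(h|x)}\right] := \mathcal{L}_1, \tag{1}$$

where we introduce the subscript in the lowerbound to make the connection to importance sampling later. The bound is tight (i.e., $\mathcal{L}_1 = \log p(x)$) when the KL divergence between the approximate and true posterior is zero (i.e., $D_{KL}(q(h|x) \,\|\, p(h|x)) = 0$). The gradients of the lowerbound w.r.t. the generative model can be approximated using the Monte Carlo approximation of the expectation:

$$\nabla_\phi \mathcal{L}_1 \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\phi \log p(x, h^{(n)}; \phi), \qquad h^{(n)} \sim q(h|x). \tag{2}$$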
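To make the bound concrete, the following sketch evaluates both the exact log-likelihood and the variational lowerbound by brute-force enumeration on a very small model. Everything here is an assumption made for illustration: a hypothetical single-layer sigmoid belief network with two hidden units, a factorial Bernoulli approximate posterior with means `mu`, and toy weights chosen so that the true posterior is sharply peaked. Enumeration is only feasible because the model is tiny.

```python
import itertools
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_bernoulli(values, probs):
    """Log-probability of a binary vector under independent Bernoulli units."""
    return sum(math.log(p if v else 1.0 - p) for v, p in zip(values, probs))


def log_joint(x, h, prior_bias, weights, bias):
    """log p(x, h) for a single-layer sigmoid belief network."""
    log_prior = log_bernoulli(h, [sigmoid(c) for c in prior_bias])
    probs = [
        sigmoid(sum(w * hj for w, hj in zip(row, h)) + b)
        for row, b in zip(weights, bias)
    ]
    return log_prior + log_bernoulli(x, probs)


def exact_log_likelihood(x, model):
    """log p(x), marginalizing over every hidden configuration."""
    n_hidden = len(model[0])
    terms = [log_joint(x, h, *model)
             for h in itertools.product((0, 1), repeat=n_hidden)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def lower_bound(x, mu, model):
    """Exact expectation of log p(x, h) - log q(h|x) under factorial q."""
    total = 0.0
    for h in itertools.product((0, 1), repeat=len(mu)):
        log_q = log_bernoulli(h, mu)
        total += math.exp(log_q) * (log_joint(x, h, *model) - log_q)
    return total


# Hypothetical toy model: (prior biases, weights, visible biases).
toy_model = ([0.0, 0.0],
             [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]],
             [-2.0, -2.0, -2.0, -2.0])
toy_x = [1, 1, 0, 0]

log_px = exact_log_likelihood(toy_x, toy_model)
bound_uniform = lower_bound(toy_x, [0.5, 0.5], toy_model)
bound_better = lower_bound(toy_x, [0.98, 0.02], toy_model)
```

In this toy case the uninformed posterior (`[0.5, 0.5]`) gives a visibly looser bound than one whose means sit close to the true posterior, which is the gap that refinement aims to close.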
The success of variational inference rests on the choice of approximate posterior, as a poor choice can result in a looser variational bound. A deep feed-forward recognition network parameterized by $\psi$ has become a popular choice, such that $q(h|x) = q(h|x; \psi)$, as it offers fast and flexible data-dependent inference (see, e.g., Salakhutdinov and Larochelle, 2010; Kingma and Welling, 2013; Mnih and Gregor, 2014; Rezende et al., 2014). Generally known as a “Helmholtz machine” (Dayan et al., 1995), these approaches often require additional tricks to train, as the naive Monte Carlo gradient of the lowerbound w.r.t. the variational parameters has high variance. In addition, the variational lowerbound in Eq. (1) is constrained by the assumptions implicit in the choice of approximate posterior, as the approximate posterior must be within the capacity of the recognition network and factorial.
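These assumptions can be relaxed by using an unbiased $K$-sampled importance weighted estimate of the likelihood function (see Burda et al., 2015, for details):

$$\mathcal{L}_1 \le \mathcal{L}_K = \mathbb{E}_{h^{(1)}, \dots, h^{(K)} \sim q(h|x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, h^{(k)})}{q(h^{(k)}|x)}\right] = \mathbb{E}\left[\log \frac{1}{K} \sum_{k=1}^{K} w^{(k)}\right] \le \log p(x), \tag{3}$$

where $h^{(k)} \sim q(h|x)$ and $w^{(k)} = p(x, h^{(k)}) / q(h^{(k)}|x)$ are the importance weights. This lowerbound is tighter than the single-sample version provided in Eq. (1) and is an asymptotically unbiased estimate of the likelihood as $K \to \infty$.

The gradient of the lowerbound w.r.t. the model parameters is simple and can be estimated as:

$$\nabla_\phi \mathcal{L}_K \approx \sum_{k=1}^{K} \tilde{w}^{(k)} \nabla_\phi \log p(x, h^{(k)}; \phi), \qquad \tilde{w}^{(k)} = \frac{w^{(k)}}{\sum_{k'=1}^{K} w^{(k')}}. \tag{4}$$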
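In practice the importance weights are usually handled in the log domain, because products of many Bernoulli probabilities underflow quickly. The sketch below shows one common way to compute the single-draw estimate of the importance-weighted bound and the normalized weights from log-weights; the function names are hypothetical, and the log-weights are assumed to have been computed elsewhere as $\log p(x, h^{(k)}) - \log q(h^{(k)}|x)$.

```python
import math


def log_mean_exp(log_w):
    """Stable log( (1/K) * sum_k exp(log_w[k]) ): one draw of the K-sample bound."""
    top = max(log_w)
    return top + math.log(sum(math.exp(v - top) for v in log_w) / len(log_w))


def normalized_weights(log_w):
    """Self-normalized importance weights that sum to one."""
    top = max(log_w)
    unnormalized = [math.exp(v - top) for v in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two samples whose unnormalized weights are 1 and 3.
example_log_w = [0.0, math.log(3.0)]
example_bound = log_mean_exp(example_log_w)
example_weights = normalized_weights(example_log_w)
```

Subtracting the maximum log-weight before exponentiating leaves both quantities unchanged mathematically while keeping the intermediate values in a safe numerical range.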
The estimator in Eq. (3) can reduce the variance of the gradients, $\nabla_\psi \mathcal{L}_K$, but in general additional variance reduction is needed (Mnih and Rezende, 2016). Alternatively, importance sampling yields an estimate of the inclusive KL divergence, $D_{KL}(p(h|x) \,\|\, q(h|x))$, which can be used for training parameters of the recognition network (Bornschein and Bengio, 2014). However, it is well known that importance sampling can yield heavily-skewed distributions over the importance weights (Doucet et al., 2001), so that only a small number of the samples will effectively have non-zero weight. This is consequential not only in training, but also for evaluating models when using Eq. (3) to estimate test log-probabilities, which requires drawing a very large number of samples, as in the literature for models trained on MNIST (Gregor et al., 2013).
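The effective sample size, $n_e$, of importance-weighted estimates with weights $w^{(k)} = p(x, h^{(k)}) / q(h^{(k)}|x)$ increases as the approximate posterior improves and is optimal when the approximate posterior matches the true posterior:

$$n_e = \frac{\left(\sum_{k=1}^{K} w^{(k)}\right)^2}{\sum_{k=1}^{K} \left(w^{(k)}\right)^2} \le K, \tag{5}$$

with equality when $q(h|x) = p(h|x)$, since every weight then equals $p(x, h^{(k)}) / p(h^{(k)}|x) = p(x)$ and $n_e = (K p(x))^2 / (K p(x)^2) = K$.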
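The effective sample size is straightforward to compute from log-weights. The following is a small sketch with a hypothetical function name; it rescales the weights by their maximum first, which does not change the ratio but avoids overflow and underflow.

```python
import math


def effective_sample_size(log_w):
    """(sum_k w_k)^2 / sum_k w_k^2, computed from log-weights."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    return sum(w) ** 2 / sum(v * v for v in w)


# Equal weights: every sample counts, so the result equals the sample count.
ess_equal = effective_sample_size([0.0, 0.0, 0.0, 0.0])
# One dominant weight: effectively a single sample.
ess_skewed = effective_sample_size([0.0, -50.0, -50.0])
```

The two toy inputs show the extremes: four equally weighted samples behave like four samples, while three samples with one dominant weight behave like roughly one.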
Conversely, importance sampling from a poorer approximate posterior will have a lower effective sample size, resulting in higher variance of the gradient estimates. In order to improve the effectiveness of importance sampling, we need a method for improving the approximate posterior beyond that provided by the recognition network.
To address the above issues, iterative refinement for variational inference (IRVI) uses the recognition network as a preliminary guess of the posterior, then refines the posterior through iterative updates of the variational parameters. For the refinement step, IRVI uses a stochastic transition operator, $g(\cdot)$, that maximizes the variational lowerbound.
An overview of IRVI is available in Figure 1. For the expectation (E)-step, we feed the observation $x$ through the recognition network to get the initial parameters, $\mu_0$, of the approximate posterior, $q_0(h|x; \psi)$. We then refine $\mu_0$ by applying $T$ updates to the variational parameters, $\mu_{t+1} = g(\mu_t, x)$, iterating through $T$ parameterizations, $\mu_1, \dots, \mu_T$, of the approximate posterior $q_t(h|x)$.
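With the final set of parameters, $\mu_T$, the gradient estimate of the recognition parameters, $\psi$, in the maximization (M)-step is taken w.r.t. the negative exclusive KL divergence:

$$-\nabla_\psi D_{KL}\big(q_T(h|x) \,\|\, q_0(h|x; \psi)\big) \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\psi \log q_0(h^{(n)}|x; \psi), \tag{6}$$

where $h^{(n)} \sim q_T(h|x)$. Similarly, the gradients w.r.t. the parameters of the generative model follow Eqs. (2) or (4) using samples from the refined posterior $q_T(h|x)$. As an alternative to Eq. (6), we can maximize the negative inclusive KL divergence using the refined approximate posterior:

$$-\nabla_\psi D_{KL}\big(p(h|x) \,\|\, q_0(h|x; \psi)\big) \approx \sum_{k=1}^{K} \tilde{w}^{(k)} \nabla_\psi \log q_0(h^{(k)}|x; \psi), \qquad h^{(k)} \sim q_T(h|x), \tag{7}$$

where $\tilde{w}^{(k)}$ are the normalized importance weights computed with $q_T(h|x)$ as the proposal.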
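For a factorial Bernoulli recognition network, both recognition gradients take a simple form at the output layer. The sketch below assumes (this is an assumption for illustration, not a statement about the authors' code) that the recognition network outputs logits whose sigmoid gives the initial means `mu0`; the derivative of $\log q_0(h|x)$ with respect to each logit is then $h_i - \mu_{0,i}$, and the remaining gradient w.r.t. the network weights would follow by ordinary back-propagation. The function name is hypothetical.

```python
def recognition_logit_gradient(samples, mu0, weights=None):
    """Gradient of the recognition objective w.r.t. the output logits.

    samples: binary vectors drawn from the refined posterior q_T
    mu0:     Bernoulli means produced by the recognition network
    weights: normalized importance weights for the inclusive-KL variant;
             if None, samples are weighted uniformly (exclusive-KL variant)
    """
    if weights is None:
        weights = [1.0 / len(samples)] * len(samples)
    return [
        sum(w * (h[i] - mu0[i]) for w, h in zip(weights, samples))
        for i in range(len(mu0))
    ]


refined_samples = [[1, 0], [1, 1]]
initial_means = [0.5, 0.5]
grad_uniform = recognition_logit_gradient(refined_samples, initial_means)
grad_weighted = recognition_logit_gradient(
    refined_samples, initial_means, weights=[0.25, 0.75])
```

Either way, the update pushes the recognition network's output toward the statistics of the refined samples, so that later initial guesses start closer to the refined posterior.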
The form of the IRVI transition operator, $g(\cdot)$, depends on the problem. In the case of continuous variables, we can make use of the VAE re-parameterization with the gradient of the lowerbound in Eq. (1) for our refinement step (see supplementary material). However, as this is not available with discrete units, we take a different approach that relies on adaptive importance sampling.
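Adaptive importance sampling (AIS, Oh and Berger, 1992) provides a general approach for iteratively refining the variational parameters. For Bernoulli distributions, we observe that the mean parameter of the true posterior, $\hat{\mu}$, can be written as the expected value of the latent variables:

$$\hat{\mu} = \mathbb{E}_{p(h|x)}[h] = \sum_h h\, p(h|x) = \frac{1}{p(x)} \sum_h q(h|x)\, h\, \frac{p(x, h)}{q(h|x)} \approx \sum_{k=1}^{K} \tilde{w}^{(k)} h^{(k)}, \tag{8}$$

where $h^{(k)} \sim q(h|x)$ and $\tilde{w}^{(k)}$ are the importance weights $p(x, h^{(k)}) / q(h^{(k)}|x)$ normalized to sum to one.

As the initial estimator typically has high variance, AIS iteratively moves $\mu_t$ toward $\hat{\mu}$ by applying Eq. 8 until a stopping criterion is met. While using the update $g(\mu_t, x, \gamma) = \sum_{k} \tilde{w}^{(k)} h^{(k)}$ works in principle, a convex combination of the importance sample estimate of the current step and the parameters from the previous step tends to be more stable:

$$h^{(k)} \sim \mathrm{Bernoulli}(\mu_t), \qquad \mu_{t+1} = g(\mu_t, x, \gamma) = (1 - \gamma)\, \mu_t + \gamma \sum_{k=1}^{K} \tilde{w}^{(k)} h^{(k)}. \tag{9}$$

Here, $\gamma$ is the inference rate and $(1 - \gamma)$ can be thought of as the adaptive “damping” rate. This approach, which we call adaptive importance refinement (AIR), should work with any discrete parametric distribution. Although AIR is applicable with continuous Gaussian variables, which model second-order statistics, we leave adapting AIR to continuous latent variables for future work.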
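The damped update can be sketched in plain Python as follows. This is a minimal illustration under several assumptions and is not the authors' implementation: the model is a hypothetical single-layer sigmoid belief network with toy weights, refinement runs for a fixed number of steps instead of using a stopping criterion, the inference rate of 0.5 and the sample count are arbitrary illustrative values, and the means are clipped away from 0 and 1 purely as a numerical safeguard added here.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_bernoulli(values, probs):
    return sum(math.log(p if v else 1.0 - p) for v, p in zip(values, probs))


def log_joint(x, h, prior_bias, weights, bias):
    """log p(x, h) for a single-layer sigmoid belief network."""
    log_prior = log_bernoulli(h, [sigmoid(c) for c in prior_bias])
    probs = [
        sigmoid(sum(w * hj for w, hj in zip(row, h)) + b)
        for row, b in zip(weights, bias)
    ]
    return log_prior + log_bernoulli(x, probs)


def air_step(mu, x, model, gamma, num_samples, rng, eps=1e-3):
    """One damped update: mix mu with the importance-weighted mean of samples."""
    samples = [[1 if rng.random() < m else 0 for m in mu]
               for _ in range(num_samples)]
    log_w = [log_joint(x, h, *model) - log_bernoulli(h, mu) for h in samples]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    estimate = [sum(wk * h[i] for wk, h in zip(w, samples)) / total
                for i in range(len(mu))]
    new_mu = [(1.0 - gamma) * m + gamma * e for m, e in zip(mu, estimate)]
    # Clipping is an added safeguard so that log q stays finite.
    return [min(max(m, eps), 1.0 - eps) for m in new_mu]


def refine(mu0, x, model, steps, gamma, num_samples, rng):
    """Apply the damped update a fixed number of times, starting from mu0."""
    mu = list(mu0)
    for _ in range(steps):
        mu = air_step(mu, x, model, gamma, num_samples, rng)
    return mu


# Hypothetical toy model: (prior biases, weights, visible biases).
toy_model = ([0.0, 0.0],
             [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]],
             [-2.0, -2.0, -2.0, -2.0])
toy_x = [1, 1, 0, 0]
refined_mu = refine([0.5, 0.5], toy_x, toy_model, steps=10, gamma=0.5,
                    num_samples=100, rng=random.Random(0))
```

For this toy observation the true posterior puts most of its mass on the first hidden unit being on and the second being off, so the refined means should move from the uninformed starting point toward that configuration.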
The general AIR algorithm follows Algorithm 1 with gradient variations following Eqs. (2), (4), (6), and (7). While iterative refinement may reduce the variance of stochastic gradient estimates and speed up learning, it comes at a computational cost, as each update with $T$ refinement steps is $T$ times more expensive than with fixed approximations. However, in addition to potential learning benefits, AIR can also improve the approximate posterior of an already trained directed belief network at test time, independent of how the model was trained. Our implementation following Algorithm 1 is available at https://github.com/rdevon/IRVI.
Adaptive importance refinement (AIR) trades computation for expressiveness and is similar in this regard to the refinement procedure of hybrid MCMC for variational inference (HVI, Salimans et al., 2015) and normalizing flows for VAE (NF, Rezende and Mohamed, 2015). HVI has a similar complexity as AIR, as it requires re-estimating the lowerbound at every step. While NF can be less expensive than AIR, both HVI and NF rely on the VAE re-parameterization to work, and thus cannot be applied to discrete variables. Sequential importance sampling (SIS, Doucet et al., 2001) can offer a better refinement step than AIS but typically requires resampling to control variance. While parametric versions exist that could be applicable to training directed graphical models with discrete units (Gu et al., 2015a; Paige and Wood, 2016), their applicability as a general refinement procedure is limited as the refinement parameters need to be learned.
Importance sampling is central to reweighted wake-sleep (RWS, Bornschein and Bengio, 2014), importance-weighted autoencoders (IWAE, Burda et al., 2015), variational inference for Monte Carlo objectives (VIMCO, Mnih and Rezende, 2016), and recent work on stochastic feed-forward networks (SFFN, Tang and Salakhutdinov, 2013; Raiko et al., 2014). While each of these methods is competitive, they rely on importance samples from the recognition network and do not offer the low-variance estimates available from AIR. Neural variational inference and learning (NVIL, Mnih and Gregor, 2014) is a single-sample and biased version of VIMCO, which is greatly outperformed by techniques that use importance sampling. Both NVIL and VIMCO reduce the variance of the Monte Carlo estimates of gradients by using an input-dependent baseline, but this approach does not necessarily provide a better posterior and cannot be used to give better estimates of the likelihood function or expectations.
Finally, IRVI is meant to be a general approach to refining the approximate posterior. IRVI is not limited to the refinement step provided by AIR, and many different types of refinement steps are available to improve the posterior for models above (see supplementary material for the continuous case). SIS and sequential importance resampling (SIR, Gordon et al., 1993) can be used as an alternative to AIR and may provide a better refinement step for IRVI.
We evaluate iterative refinement for variational inference (IRVI) using adaptive importance refinement (AIR) for both training and evaluating directed belief networks. We train and test on the following benchmarks: the binarized MNIST handwritten digit dataset (Salakhutdinov and Murray, 2008) and the Caltech-101 Silhouettes dataset. We centered the MNIST and Caltech datasets by subtracting the mean image over the training set when used as input to the recognition network. We also train additional models using the re-weighted wake-sleep algorithm (RWS, Bornschein and Bengio, 2014), the state of the art for many configurations of directed belief networks with discrete variables on these datasets, both for comparison and to demonstrate improving the approximate posteriors with refinement. With our experiments, we show that 1) IRVI can train a variety of directed models as well as or better than existing methods, 2) refinement improves the approximate posterior and can be applied to models trained by other algorithms, and 3) IRVI can be used to improve a model with a relatively simple approximate posterior.
Models were trained using the RMSprop algorithm (Hinton, 2012) with a batch size of and early stopping by recorded best variational lower bound on the validation dataset. For AIR, “inference steps” (), adaptive samples (), and an adaptive damping rate, , of were used during inference, chosen from validation in initial experiments. posterior samples () were used for model parameter updates for both AIR and RWS. All models were trained for epochs and were fine-tuned for an additional with a decaying learning rate and SGD.
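We use a generative model composed of a) a factorized Bernoulli prior as with sigmoid belief networks (SBNs) or b) an autoregressive prior, as in published MNIST results with deep autoregressive networks (DARN, Gregor et al., 2013):

$$\text{a)}\;\; p(h_L) = \prod_i p(h_{i,L}), \quad p(h_{i,L} = 1) = \sigma(b_{i,L}); \qquad \text{b)}\;\; p(h_L) = \prod_i p(h_{i,L} \mid h_{<i,L}), \quad p(h_{i,L} = 1 \mid h_{<i,L}) = \sigma\Big(\sum_{j<i} W_r^{i,j} h_{j,L} + b_{i,L}\Big), \tag{10}$$

where $\sigma$ is the sigmoid ($\sigma(a) = 1/(1 + \exp(-a))$) function, $W_r$ is a lower-triangular square matrix, and $b_L$ is the bias vector.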
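An autoregressive prior of this kind can be sampled one unit at a time, each unit conditioned on the units before it. The sketch below is a generic illustration and assumes only what the text states (sigmoid units, a lower-triangular weight matrix, a bias vector); the function names and toy parameters are hypothetical, and only the entries below the diagonal of the weight matrix are read.

```python
import itertools
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_autoregressive_prior(weights, bias, rng):
    """Sample h_1, ..., h_n in order; unit i depends only on units j < i."""
    h = []
    for i, b in enumerate(bias):
        activation = b + sum(weights[i][j] * h[j] for j in range(i))
        h.append(1 if rng.random() < sigmoid(activation) else 0)
    return h


def log_prob_autoregressive_prior(h, weights, bias):
    """Sum of the per-unit conditional log-probabilities."""
    total = 0.0
    for i, b in enumerate(bias):
        p = sigmoid(b + sum(weights[i][j] * h[j] for j in range(i)))
        total += math.log(p if h[i] else 1.0 - p)
    return total


# Hypothetical 3-unit prior with a lower-triangular weight matrix.
toy_weights = [[0.0, 0.0, 0.0],
               [1.5, 0.0, 0.0],
               [-1.0, 2.0, 0.0]]
toy_bias = [0.2, -0.3, 0.1]
toy_sample = sample_autoregressive_prior(toy_weights, toy_bias, random.Random(0))
toy_total_mass = sum(
    math.exp(log_prob_autoregressive_prior(list(h), toy_weights, toy_bias))
    for h in itertools.product((0, 1), repeat=3)
)
```

Because each conditional is a proper Bernoulli distribution, the probabilities of all configurations sum to one without any explicit normalization, which the final line checks for the toy parameters.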
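For our experiments, we use conditional and approximate posterior densities that follow Bernoulli distributions:

$$p(h_{i,l} = 1 \mid h_{l+1}) = \sigma\big(W_l^{i,:} \cdot h_{l+1} + b_{i,l}\big), \qquad q(h_{i,l} = 1 \mid x) = \mu_{i,l}, \tag{11}$$

where $W_l$ is a weight matrix between the $l$-th and $(l+1)$-th layers. As in Gregor et al. (2013) with MNIST, we do not use autoregression on the observations, $x$, and use a fully factorized approximate posterior.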
The effective sample size (ESS) in Eq. (5) is a good indicator of the variance of the gradient estimate. In Fig. 5 (right), we observe that the ESS improves as we take more AIR steps when training a deep belief network (AIR(5) vs AIR(20)). When the approximate posterior is not refined (RWS), the ESS stays low throughout training, eventually resulting in a worse model. This improved ESS reveals itself as faster convergence in terms of the exact log-likelihood in the left panel of Fig. 5 (see the progress of each curve until 100 epochs; see also the supplementary materials for wall-clock time).
This faster convergence does not guarantee a good final log-likelihood, as the latter depends on the tightness of the lowerbound rather than the variance of its estimate. This is most apparent when comparing AIR(5), AIR+RW(5) and AIR+RW+IKL(5). AIR(5) has a low variance (high ESS) but computes the gradient of a looser lowerbound from Eq. (2), while the other two compute the gradient of a tighter lowerbound from Eq. (4). This results in AIR(5) converging faster than the other two, while the final log-likelihood estimates are better for the other two.
However, we observe that the final log-likelihood estimates are comparable across all three variants (AIR, AIR+RW and AIR+RW+IKL) when a sufficient number of AIR steps are taken so that $\mathcal{L}_1$ is sufficiently tight. When 20 steps were taken, we observe that AIR(20) converges faster and achieves a better log-likelihood than AIR+RW(20) and AIR+RW+IKL(20). Based on these observations, we use vanilla AIR (subsequently just “AIR”) in our following experiments.
We evaluate AIR for training SBNs with one, two, and three layers of hidden units and DARN with and hidden units, comparing against our implementation of RWS. All models were tested using posterior samples to estimate the lowerbounds and average test log-probabilities.
When training SBNs with AIR and RWS, we used a completely deterministic network for the approximate posterior. For example, for a 2-layer SBN, the approximate posterior factors into the approximate posteriors for the top and the bottom hidden layers, and the initial variational parameters of the top layer, $\mu_0^{(2)}$, are a function of the initial variational parameters of the first layer, $\mu_0^{(1)}$:

$$q_0(h_1, h_2 \mid x) = q_0\big(h_1 \mid x; \mu_0^{(1)}\big)\, q_0\big(h_2 \mid x; \mu_0^{(2)}(\mu_0^{(1)})\big). \tag{12}$$
For DARN, we trained two different configurations on MNIST: one with stochastic units and an additional hyperbolic tangent deterministic layer with units in both the generative and recognition networks, and another with stochastic units with a hyperbolic tangent deterministic layer in the generative network only. We used DARN with units with the Caltech-101 silhouettes dataset.
The results of our experiments with the MNIST and Caltech-101 silhouettes datasets trained with AIR, RWS, and RWS refined at test with AIR (RWS+) are in Table 1. Refinement at test (RWS+) always improves the results for RWS. As our unrefined results are comparable to those found in Bornschein and Bengio (2014), the improved results indicate that many evaluations of Helmholtz machines in the literature could benefit from refinement with AIR to improve evaluation accuracy. For most model configurations, AIR and RWS perform comparably, though RWS appears to do better in the average test log-probability estimates for some configurations of MNIST. RWS+ performs comparably with variational inference for Monte Carlo objectives (VIMCO, Mnih and Rezende, 2016), despite the reported VIMCO results relying on more posterior samples in training. Finally, AIR results with 3-layer SBNs approach the state of the art on Caltech-101 silhouettes when compared against the neural autoregressive distribution estimator (NADE, Bornschein and Bengio, 2014).
We also tested our log-probability estimates against the exact log-probability (by marginalizing over the joint) of smaller single-layer SBNs with stochastic units. The exact log-probability was and our estimate with the unrefined approximate was and with refinement steps. Overall, this result is consistent with those of Table 1, that iterative refinement improves the accuracy of log-probability estimates.
In order to visualize the improvements due to refinement and to demonstrate AIR as a general means of improvement for directed models at test, we generate samples from the approximate posterior without () and with refinement (), from a single-layer SBN with stochastic units originally trained with RWS. We then use the samples from the approximate posterior to compute the expected conditional probability or average reconstruction: . We used a restricted model with a lower number of stochastic units to demonstrate that refinement also works well with simple models, where the recognition network is more likely to “average” over latent configurations, giving a misleading evaluation of the model’s generative capability.
We also refine the approximate posterior of a simplified version of the recognition network of a single-layer SBN with units trained with RWS. We simplified the approximate posterior by first computing , then randomly setting of the variational parameters to .
Fig. 3 shows improvement from refinement for digits from the MNIST test dataset, where the samples chosen were those of which the expected reconstruction error of the original test sample was the most improved. The digits generated from the refined posterior are of higher quality, and in many cases the correct digit class is revealed. This shows that, in many cases where the recognition network indicates that the generative model cannot model a test sample correctly, refinement can more accurately reveal the model’s capacity. With the simplified approximate posterior, refinement is able to retrieve most of the shape of images from the Caltech-101 silhouettes, despite only starting with of the original parameters from the recognition network. This indicates that the work of inference need not all be done via a complex recognition network: iterative refinement can be used to aid in inference with a relatively simple approximate posterior.
We have introduced iterative refinement for variational inference (IRVI), a simple yet effective and flexible approach for training and evaluating directed belief networks that works by improving the approximate posterior from a recognition network. We demonstrated IRVI using adaptive importance refinement (AIR), which uses importance sampling at each iterative step, and showed that AIR can be used to provide low-variance gradients to efficiently train deep directed graphical models. AIR can also be used to more accurately reveal the generative model’s capacity, which is evident when the approximate posterior is of poor quality. The improved approximate posterior provided by AIR shows an increased effective sample size, which is a consequence of a better approximation of the true posterior and improves the accuracy of the test log-probability estimates.
This work was supported by Microsoft Research to RDH (internship under NJ); NIH P20GM103472, R01 grant REB020407, and NSF grant 1539067 to VDC; and ONR grant N000141512791 and ADeLAIDE grant FA8750-16C-0130-001 to RS. KC was supported in part by Facebook, Google (Google Faculty Award 2016) and NVidia (GPU Center of Excellence 2015-2016), and RDH was supported in part by PIBBS.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278–1286, 2014.
Salakhutdinov, R. and Larochelle, H. Efficient learning of deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.
With variational autoencoders (VAE), the back-propagated gradient of the lowerbound with respect to the approximate posterior is composed of individual gradients for each factor, which can be applied simultaneously. Applying the gradient directly to the variational parameters, $\mu$, without back-propagating to the recognition network parameters, $\psi$, yields a simple iterative refinement operator:

$$\mu_{t+1} = g(\mu_t, x, \gamma) = \mu_t + \gamma\, \nabla_{\mu} \mathcal{L}_1(\mu_t, x, \epsilon), \tag{13}$$

where $\gamma$ is the inference rate hyperparameter and $\epsilon$ is auxiliary noise used in the re-parameterization.
This gradient-descent iterative refinement (GDIR) is very straightforward with continuous latent variables as with VAE. However, GDIR with discrete units suffers the same shortcomings as when passing the gradients directly, so a better transition operator is needed (AIR).
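The gradient-based refinement step can be illustrated on a model small enough that the gradient is available in closed form. The sketch below is an assumption-laden toy, not the experimental setup: a hypothetical scalar linear-Gaussian model with $h \sim \mathcal{N}(0, 1)$ and $x \mid h \sim \mathcal{N}(w h, 1)$, a Gaussian approximate posterior $\mathcal{N}(\mu, s^2)$ with fixed $s$, and an analytic derivative standing in for back-propagation. Only the mean is refined, and the inference rate and sample counts are arbitrary illustrative values.

```python
import random


def lower_bound_gradient(mu, x, w, s, noise):
    """Re-parameterized gradient of the lowerbound w.r.t. mu, averaged over noise.

    With h = mu + s * eps, d/dh log p(x, h) = -h + w * (x - w * h).
    The entropy of q does not depend on mu, so it contributes no gradient.
    """
    grads = []
    for eps in noise:
        h = mu + s * eps
        grads.append(-h + w * (x - w * h))
    return sum(grads) / len(grads)


def gdir(mu0, x, w, s, gamma, steps, num_samples, rng):
    """Refine mu by repeated gradient steps on the lowerbound."""
    mu = mu0
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
        mu = mu + gamma * lower_bound_gradient(mu, x, w, s, noise)
    return mu


# For this toy model the exact posterior mean is w * x / (1 + w**2) = 0.4.
refined_mean = gdir(mu0=0.0, x=1.0, w=2.0, s=0.45, gamma=0.1,
                    steps=50, num_samples=50, rng=random.Random(0))
```

In this toy setting the refined mean settles near the exact posterior mean, up to the noise introduced by the sampled $\epsilon$ values.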
In the limit of $T = 0$, we do not arrive at VAE, as the gradients are never passed through the approximate posterior during learning. However, as the complete computational graph involves a series of differentiable variables, $\mu_t$, in addition to auxiliary noise, it is possible to pass gradients through GDIR to the recognition network parameters, $\psi$, during learning, though we do not do so here.
For continuous latent variables, we used the same network structure as in [Kingma and Welling, 2013, Salimans et al., 2015]. Results for GDIR are presented in Table 2 for the MNIST dataset, and included for comparison are methods for learning non-factorial latent distributions for Gaussian variables and the corresponding result for VAE, the baseline.
| Model | -log p(x) | -log p(x) |
| VAE (w/ refinement) | 90.57 | 88.53 |
Though GDIR can improve the posterior in VAE, our results show that VAE is at an upper-bound for learning with a factorized posterior on the MNIST dataset. Further improvements on this dataset must be made by using a non-factorized posterior (re-weighting or sequential Monte Carlo with importance weighting). GDIR may still also provide improvement for training models with other datasets, and we leave this for future work.
Iterative refinement via adaptive importance refinement (AIR) improves the variational lowerbound and effective sample size (ESS) of the approximate posterior. To show this, we trained models with one, two, and three hidden layers with binary units trained using AIR with inference steps on the MNIST dataset for epochs. Taking the initial approximate posterior from each model, we refined the posterior up to steps (Figure 4), evaluating the lowerbound and ESS using posterior samples. Refinement improves the posterior from models trained with AIR well beyond the number of steps used in training.
Adaptive importance refinement (AIR) and reweighted wake-sleep [RWS, Bornschein and Bengio, 2014] have comparable wall-clock convergence times, while AIR outperforms RWS when measured in updates (Figures 5 and 6). AIR converges to a higher lowerbound with far fewer updates than RWS, though in wall-clock time RWS reaches a similar value sooner than AIR does. AIR outperforms RWS in ESS in both wall-clock time and updates. For a more accurate comparison, RWS may need to be trained for a wall-clock time equal to that afforded to AIR. However, these results support the conclusion that AIR converges to values similar to those of RWS in fewer updates but similar wall-clock time.
As an alternative to the variational lowerbound, a lowerbound can be formulated from the geometric mean of the joint generative and approximate posterior models:
In this procedure, known as bidirectional Helmholtz machines [Bornschein et al., 2015], the lowerbound, which minimizes the Bhattacharyya distance (), yields estimates of the likelihood, , with importance weights,
As with the variational lowerbound, we can refine the approximate posterior to maximize this lowerbound by simply replacing the weights in Equation 15.
We performed experiments similar to the wall-clock experiments above, using only a three-layer SBN trained for epochs with the equivalent AIR and BiHM procedures using the bidirectional lowerbound importance weights. We evaluated these models using posterior samples on the test dataset and evaluated BiHM with (BiHM+) and without refinement.
Our results show similar negative log likelihoods for AIR ( nats), BiHM ( nats), and BiHM+ ( nats), though AIR slightly outperforms BiHM+, and BiHM+ slightly outperforms BiHM. Further optimization is necessary for a better comparison to our experiments with the variational lowerbound. However these observations are consistent with those from our original experiments: AIR can be used to improve the posterior both in training and when evaluating models, regardless of how they were trained. Furthermore, AIR is compatible with optimizations based on alternative lowerbounds, broadening the scope in which AIR is applicable.