DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors

05/18/2018
by   Arash Vahdat, et al.

Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors.

1 Introduction

Advances in amortized variational inference kingma2014vae ; rezende2014stochastic ; mnih2014neural ; gregorICML14 have enabled novel learning methods gregorICML14 ; pu2017VAE ; salimans2015markov and extended generative learning into complex domains such as molecule design gomez2016automatic ; kusner2017grammar , music roberts2018hierarchical , and program generation Murali2018 . These advances have been made using continuous latent variable models, in spite of the computational efficiency and greater interpretability offered by discrete latent variables. Further, tasks such as clustering, semi-supervised learning, and variational memory addressing bornschein2017Variational all require discrete variables, which makes the training of discrete models an important challenge.

Prior to the deep learning era, Boltzmann machines were widely used for learning with discrete latent variables. These powerful multivariate binary distributions can represent any distribution defined on a set of binary random variables le2008representational , and have seen application in unsupervised learning Salakhutdinov09 , supervised learning Larochelle2008 ; Nguyen2017 , reinforcement learning Sallans2004 , dimensionality reduction hinton2006 , and collaborative filtering Salakhutdinov2007 . Recently, Boltzmann machines have been used as priors for variational autoencoders (VAEs) in the discrete variational autoencoder (DVAE) rolfe2016discrete and its successor DVAE++ vahdat2018dvae++ . It has been demonstrated that these VAE models can capture discrete aspects of data. However, both models are trained with a particular variational bound, and tighter bounds such as the importance-weighted (IW) bound burda2015importance cannot be used for training.

We remove this constraint by introducing two continuous relaxations that convert a Boltzmann machine to a distribution over continuous random variables. These relaxations are based on overlapping transformations introduced in vahdat2018dvae++ and on the Gaussian integral trick hertz1991introduction (known as the Hubbard-Stratonovich transform hubbard1959calculation in physics). Our relaxations are made tunably sharp by an inverse temperature parameter.

VAEs with relaxed Boltzmann priors can be trained using standard techniques developed for continuous latent variable models. In this work, we train discrete VAEs using the same IW bound on the log-likelihood that has been shown to improve importance weighted autoencoders (IWAEs) burda2015importance .

This paper makes two contributions: i) We introduce two continuous relaxations of Boltzmann machines and use these relaxations to train a discrete VAE with a Boltzmann prior using the IW bound. ii) We generalize the overlapping transformations of vahdat2018dvae++ to any pair of distributions with computable probability density function (PDF) and cumulative distribution function (CDF). Using these more general overlapping transformations, we propose new smoothing transformations based on mixtures of Gaussian and power-function govindarajulu1966characterization distributions. Power-function overlapping transformations provide lower-variance gradient estimates and improved test-set log-likelihoods when the inverse temperature is large. We name our framework DVAE# because the best results are obtained when the power-function transformations are sharp (and not because our model is proposed after DVAE and DVAE++).

1.1 Related Work

Previous work on training discrete latent variable models can be grouped into five main categories:

  1. Exhaustive approaches marginalize all discrete variables kingma2014semi ; maaloe2017semi but are not scalable to more than a few discrete variables.

  2. Local expectation gradients aueb2015local and reparameterization and marginalization tokui2017evaluating estimators compute low-variance estimates at the cost of multiple function evaluations per gradient. These approaches can be applied to problems with a moderate number of latent variables.

  3. Relaxed computation of discrete densities jang2016categorical replaces discrete variables with continuous relaxations for gradient computation. A variation of this approach, known as the straight-through technique, sets the gradient of binary variables to the gradient of their mean bengio2013estimating ; raiko2014techniques .

  4. Continuous relaxations of discrete distributions maddison2016concrete replace discrete distributions with continuous ones and optimize a consistent objective. This method cannot be applied directly to Boltzmann distributions. The DVAE rolfe2016discrete solves this problem by pairing each binary variable with an auxiliary continuous variable. This approach is described in Sec. 2.

  5. The REINFORCE estimator williams1992simple (also known as the likelihood ratio glynn1990likelihood or score-function estimator) replaces the gradient of an expectation with the expectation of the gradient of the score function. This estimator has high variance, but many increasingly sophisticated methods provide lower variance estimators. NVIL mnih2014neural uses an input-dependent baseline, and MuProp gu2015muprop uses a first-order Taylor approximation along with an input-dependent baseline to reduce noise. VIMCO mnih2016variational trains an IWAE with binary latent variables and uses a leave-one-out scheme to define the baseline for each sample. REBAR tucker2017rebar and its generalization RELAX grathwohl2017backpropagation use the reparameterization of continuous distributions to define baselines.

The method proposed here falls into the fourth category and differs from rolfe2016discrete ; vahdat2018dvae++ in the way that binary latent variables are marginalized. The resultant relaxed distribution allows for DVAE training with a tighter bound. Moreover, our proposal encompasses a wider variety of smoothing methods, one of which empirically provides lower-variance gradient estimates.

2 Background

Let x represent observed random variables and ζ continuous latent variables. We seek a generative model p(x, ζ) = p(ζ) p(x|ζ), where p(ζ) denotes the prior distribution and p(x|ζ) is a probabilistic decoder. In the VAE kingma2014vae , training maximizes a variational lower bound on the marginal log-likelihood:

log p(x) ≥ E_{q(ζ|x)}[ log p(x|ζ) ] − KL( q(ζ|x) ‖ p(ζ) ).

A probabilistic encoder q(ζ|x) approximates the posterior over latent variables. For continuous ζ, the bound is maximized using the reparameterization trick. With reparameterization, expectations with respect to q(ζ|x) are replaced by expectations against a base distribution and a differentiable function that maps samples from the base distribution to ζ. This can always be accomplished when q(ζ|x) has an analytic inverse cumulative distribution function (CDF) by mapping uniform samples through the inverse CDF. However, reparameterization cannot be applied to binary latent variables because their CDF is not differentiable.
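To make this concrete, here is a minimal Python sketch (ours, not the paper's implementation; the helper name is illustrative) that reparameterizes an exponential distribution by pushing uniform base noise through its analytic inverse CDF, so that samples become a deterministic function of the rate parameter:

```python
# Minimal sketch (illustrative, not from the paper): reparameterize an
# exponential distribution through its analytic inverse CDF.
import numpy as np

def sample_exponential_reparam(rate, n_samples, seed=0):
    """Draw Exp(rate) samples as a deterministic function of uniform noise.

    rho ~ U(0, 1) is the base distribution; the inverse CDF is
    F^{-1}(rho) = -log(1 - rho) / rate, so in an autodiff framework gradients
    with respect to `rate` flow through this deterministic mapping.
    """
    rho = np.random.default_rng(seed).uniform(size=n_samples)  # base noise
    return -np.log1p(-rho) / rate                              # inverse-CDF map

print(sample_exponential_reparam(rate=2.0, n_samples=5))
```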

The DVAE rolfe2016discrete resolves this issue by pairing each binary latent variable with a continuous counterpart. Denoting a binary vector of length n by z, the Boltzmann prior is p(z) = exp(−E_θ(z)) / Z_θ, where E_θ(z) is an energy function with parameters θ and Z_θ is the partition function. The joint model over discrete and continuous variables is p(x, ζ, z) = p(x|ζ) r(ζ|z) p(z), where r(ζ|z) is a smoothing transformation that maps each discrete z_i to its continuous analogue ζ_i.

DVAE rolfe2016discrete and DVAE++ vahdat2018dvae++ differ in the type of smoothing transformation r(ζ|z): rolfe2016discrete uses a spike-and-exponential transformation (Eq. (1), left), while vahdat2018dvae++ uses a mixture of two overlapping exponential distributions (Eq. (1), right). Here, δ(ζ) is the (one-sided) Dirac delta distribution, β is an inverse temperature that controls the sharpness of the smoothing, and Z_β is the normalization constant:

(1)   spike-and-exp:  r(ζ|z=0) = δ(ζ),  r(ζ|z=1) = β e^{βζ} / (e^{β} − 1);   exponential:  r(ζ|z=0) = e^{−βζ} / Z_β,  r(ζ|z=1) = e^{−β(1−ζ)} / Z_β,  with ζ ∈ [0, 1] and Z_β = (1 − e^{−β}) / β.
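To make these smoothing transformations concrete, the following sketch (our own helper names; it assumes the overlapping-exponential densities r(ζ|z=0) ∝ e^{−βζ} and r(ζ|z=1) ∝ e^{−β(1−ζ)} on [0, 1]) draws ancestral samples z ~ Bernoulli(q) followed by ζ ~ r(ζ|z):

```python
# Hedged sketch of the overlapping exponential smoothing (assumed form:
# r(zeta|z=0) ~ exp(-beta*zeta), r(zeta|z=1) ~ exp(-beta*(1-zeta)) on [0, 1]).
import numpy as np

beta = 8.0
rng = np.random.default_rng(0)

def r_pdf(zeta, z):
    """Density of the assumed overlapping exponential smoothing on [0, 1]."""
    z_beta = (1.0 - np.exp(-beta)) / beta
    return np.exp(-beta * np.where(z == 0, zeta, 1.0 - zeta)) / z_beta

def sample_r(z):
    """Sample zeta | z by inverting the truncated-exponential CDF."""
    rho = rng.uniform(size=np.shape(z))
    zeta0 = -np.log1p(-rho * (1.0 - np.exp(-beta))) / beta   # mass near 0
    return np.where(z == 0, zeta0, 1.0 - zeta0)              # mirror for z = 1

q = 0.7                                      # probability that z = 1
z = (rng.uniform(size=10_000) < q).astype(int)
zeta = sample_r(z)
print(zeta.mean())                           # approaches q as beta grows
```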

The variational bound for a factorial approximation q(ζ, z|x) = q(z|x) r(ζ|z) to the posterior, where q(z|x) = ∏_i q(z_i|x) and r(ζ|z) = ∏_i r(ζ_i|z_i), is derived in vahdat2018dvae++ as

(2)

Here q(ζ|x) = Σ_z q(z|x) r(ζ|z) is a mixture distribution combining r(ζ|z=0) and r(ζ|z=1) with weights given by q(z|x). The probability of binary units conditioned on ζ, q(z|ζ, x), can be computed analytically, and H[q(z|x)] denotes the entropy of q(z|x). The second and third terms in Eq. (2) have analytic solutions (up to the log normalization constant) that can be differentiated easily with an automatic differentiation (AD) library. The expectation over q(ζ|x) is approximated with reparameterized sampling.

We extend rolfe2016discrete ; vahdat2018dvae++ to tighten the bound of Eq. (2) by importance weighting burda2015importance ; li2016renyi . These tighter bounds have been shown to improve VAEs. For continuous latent variables, the K-sample IW bound is

(3)   log p(x) ≥ L_K(x) = E_{ζ^(1), …, ζ^(K) ~ q(ζ|x)} [ log (1/K) Σ_{k=1}^{K} p(x, ζ^(k)) / q(ζ^(k)|x) ].

The tightness of the IW bound improves as K increases burda2015importance .
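For reference, the bound of Eq. (3) can be estimated with a few lines of Python (our own helper, not the paper's code; the inputs are log-densities evaluated at K posterior samples):

```python
# Illustrative sketch: the K-sample importance-weighted bound computed from K
# reparameterized posterior samples, using log-sum-exp for numerical stability.
import numpy as np
from scipy.special import logsumexp

def iw_bound(log_p_joint, log_q):
    """Estimate log (1/K) sum_k p(x, zeta_k) / q(zeta_k | x).

    log_p_joint[k] = log p(x, zeta_k) and log_q[k] = log q(zeta_k | x) are
    evaluated at K samples zeta_k ~ q(zeta | x).
    """
    log_w = np.asarray(log_p_joint) - np.asarray(log_q)   # log importance weights
    return logsumexp(log_w) - np.log(len(log_w))

# K = 1 recovers a single-sample ELBO estimate; larger K tightens the bound.
print(iw_bound(log_p_joint=[-90.2, -89.7, -91.0], log_q=[-3.1, -2.8, -3.5]))
```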

3 Model

We introduce two relaxations of Boltzmann machines to define the continuous prior distribution p(ζ) in the IW bound of Eq. (3). These relaxations rely on either overlapping transformations (Sec. 3.1) or the Gaussian integral trick (Sec. 3.2). Sec. 3.3 then generalizes the class of overlapping transformations that can be used in the approximate posterior q(ζ|x).

3.1 Overlapping Relaxations

We obtain a continuous relaxation of p(z) through the marginal p(ζ) = Σ_z p(z) r(ζ|z), where r(ζ|z) is an overlapping smoothing transformation vahdat2018dvae++ that operates on each component of z and ζ independently; i.e., r(ζ|z) = ∏_i r(ζ_i|z_i). Overlapping transformations such as the mixture of exponentials in Eq. (1) may be used for r. These transformations are equipped with an inverse temperature hyperparameter β to control the sharpness of the smoothing transformation. As β → ∞, r(ζ_i|z_i) approaches δ(ζ_i − z_i) and p(ζ) becomes a mixture of delta-function distributions centered on the vertices of the hypercube [0, 1]^n. At finite β, p(ζ) provides a continuous relaxation of the Boltzmann machine.

To train an IWAE using Eq. (3) with p(ζ) as a prior, we must compute log p(ζ) and its gradient with respect to the parameters of the Boltzmann distribution and the approximate posterior. This computation involves marginalization over z, which is generally intractable. However, we show that this marginalization can be approximated accurately using a mean-field model.
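For intuition, this relaxed prior can be written out exactly when z is tiny. The toy sketch below (our own construction; the couplings, biases, and the exponential smoothing form are illustrative assumptions) evaluates p(ζ) = Σ_z p(z) ∏_i r(ζ_i|z_i) by brute force over the four states of a two-variable Boltzmann machine; the mean-field machinery of the next subsection is only needed when this sum is intractable.

```python
# Toy brute-force evaluation of the overlapping relaxation (illustrative only;
# W, b, and the exponential smoothing form are our assumptions).
import itertools
import numpy as np

beta = 8.0
Z_beta = (1.0 - np.exp(-beta)) / beta

def r(zeta, z):
    """Overlapping exponential smoothing on [0, 1] (assumed form)."""
    return np.exp(-beta * np.where(z == 0, zeta, 1.0 - zeta)) / Z_beta

W = np.array([[0.0, 1.5], [1.5, 0.0]])     # symmetric pairwise couplings
b = np.array([0.5, -0.3])                  # biases
states = np.array(list(itertools.product([0, 1], repeat=2)))
log_unnorm = states @ b + 0.5 * np.einsum('ki,ij,kj->k', states, W, states)
p_z = np.exp(log_unnorm); p_z /= p_z.sum()                  # exact p(z)

def p_zeta(zeta):
    """p(zeta) = sum_z p(z) * prod_i r(zeta_i | z_i)."""
    return sum(pz * np.prod(r(zeta, z)) for pz, z in zip(p_z, states))

print(p_zeta(np.array([0.95, 0.05])), p_zeta(np.array([0.5, 0.5])))
```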

3.1.1 Computing log p(ζ) and its Gradient for Overlapping Relaxations

Since overlapping transformations are factorial, the log marginal distribution of ζ is

(4)   log p(ζ) = log Σ_z exp( −E_θ(z) + Σ_i g_i(ζ_i) z_i ) + Σ_i h_i(ζ_i) − log Z_θ,

where g_i(ζ_i) = log r(ζ_i|z_i=1) − log r(ζ_i|z_i=0) and h_i(ζ_i) = log r(ζ_i|z_i=0). For the mixture-of-exponentials smoothing, g_i(ζ_i) = β(2ζ_i − 1) and h_i(ζ_i) = −βζ_i − log Z_β.

The first term in Eq. (4) is the log partition function log Z_{θ,ζ} of the Boltzmann machine with augmented energy function E_{θ,ζ}(z) = E_θ(z) − Σ_i g_i(ζ_i) z_i. Estimating this log partition function accurately can be expensive, particularly because it has to be done for each ζ. However, we note that each ζ_i comes from a bimodal distribution centered at zero and one, and that the bias g_i(ζ_i) is usually large for most components (particularly for large β). In this case, mean field is likely to provide a good approximation of log Z_{θ,ζ}, a fact we demonstrate empirically in Sec. 4.

To compute log p(ζ) and its gradient, we first fit a mean-field distribution m(z) = ∏_i m_i^{z_i} (1 − m_i)^{1−z_i} by minimizing KL( m(z) ‖ p_{θ,ζ}(z) ), where p_{θ,ζ}(z) ∝ exp(−E_{θ,ζ}(z)) welling2002new . The gradient of log p(ζ), with respect to the Boltzmann parameters θ or with respect to ζ, is then

(5)   ∇ log p(ζ) ≈ −E_{m(z)}[ ∇ E_{θ,ζ}(z) ] + Σ_i ∇ h_i(ζ_i) + E_{p(z)}[ ∇ E_θ(z) ],

where m(z) is the mean-field solution and the gradient does not act on m. The first expectation in Eq. (5) is the result of computing the average augmented energy under a factorial distribution (the augmented energy is a multi-linear function of z, and under the mean-field assumption each z_i is replaced by its average value m_i). The second expectation corresponds to the negative phase in training Boltzmann machines and is approximated by Monte Carlo sampling from p(z).

To compute the importance weights for the IW bound of Eq. (3), we must also compute the value of log p(ζ) up to the normalization constant log Z_θ; the only intractable part is the first term in Eq. (4). Assuming that KL( m(z) ‖ p_{θ,ζ}(z) ) ≈ 0 and using

(6)   KL( m(z) ‖ p_{θ,ζ}(z) ) = log Z_{θ,ζ} + E_{m(z)}[ E_{θ,ζ}(z) ] − H(m),

the first term of Eq. (4) is approximated as log Z_{θ,ζ} ≈ −E_{m(z)}[ E_{θ,ζ}(z) ] + H(m), where H(m) is the entropy of the mean-field distribution.
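A compact sketch of this mean-field step follows (our own notation and helper names; the log-probability convention b·z + zᵀWz with a strictly upper-triangular W is an assumption chosen for clarity, not the paper's exact parameterization). It fits a factorial Bernoulli distribution by parallel fixed-point updates and returns the mean-field estimate of the log partition function used above:

```python
# Mean-field sketch (our notation): fit q(z) = prod_i Bern(m_i) to a Boltzmann
# distribution with log-probability b.z + z^T W z (W strictly upper-triangular,
# an assumption for clarity), then return the mean-field bound on log Z.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_log_z(b, W, n_iters=5):
    """Parallel fixed-point mean-field updates; returns (means m, log Z estimate)."""
    m = sigmoid(b)                                  # initialize from the biases
    for _ in range(n_iters):
        m = sigmoid(b + (W + W.T) @ m)              # fixed-point updates
    eps = 1e-12                                     # avoid log(0) in the entropy
    entropy = -np.sum(m * np.log(m + eps) + (1.0 - m) * np.log(1.0 - m + eps))
    # log Z >= E_q[b.z + z^T W z] + H(q) under the factorial assumption
    return m, b @ m + m @ W @ m + entropy

rng = np.random.default_rng(0)
b = rng.normal(size=6)
W = np.triu(0.2 * rng.normal(size=(6, 6)), k=1)     # strictly upper-triangular
m, log_z_mf = mean_field_log_z(b, W)
print(m.round(3), log_z_mf)
```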

3.2 The Gaussian Integral Trick

The computational complexity of log p(ζ) arises from the pairwise interactions present in E_θ(z); here we write the Boltzmann log-probability as log p(z) = aᵀz + (1/2) zᵀWz − log Z_θ, with a symmetric, zero-diagonal coupling matrix W. Instead of applying mean field, we remove these interactions using the Gaussian integral trick zhangNIPS12 . This is achieved by defining a Gaussian smoothing transformation whose precision matrix is W + D for a diagonal matrix D; D must be large enough that W + D is positive definite. The standard choices of D zhangNIPS12 do not place the modes of p(ζ) on the vertices of the hypercube in [0, 1]^n. Instead, we center the Gaussian smoothing on z and take D = βI, giving the smoothing transformation r(ζ|z) = N(ζ | z, (W + βI)^{−1}). The joint density is then

p(z, ζ) ∝ exp( (a − (β/2) 1)ᵀ z + zᵀ (W + βI) ζ − (1/2) ζᵀ (W + βI) ζ ),

where 1 is the n-vector of all ones. Since this exponent no longer contains pairwise interactions among the z_i, z can be marginalized out, giving

(7)   p(ζ) ∝ exp( −(1/2) ζᵀ (W + βI) ζ ) ∏_i ( 1 + exp( a_i − β/2 + [(W + βI) ζ]_i ) ),

where [(W + βI) ζ]_i is the i-th element of (W + βI) ζ.

The marginal in Eq. (7) is a mixture of Gaussian distributions centered on the vertices of the hypercube in [0, 1]^n with mixing weights given by p(z). Each mixture component has covariance (W + βI)^{−1} and, as β gets large, the precision matrix becomes diagonally dominant. As β → ∞, each mixture component becomes a delta function and p(ζ) approaches p(z). This Gaussian smoothing allows for simple evaluation of log p(ζ) (up to log Z_θ), but we note that each mixture component has a nondiagonal covariance matrix, which should be accommodated when designing the approximate posterior q(ζ|x).

The hyperparameter β must be larger than the absolute value of the most negative eigenvalue of W to ensure that W + βI is positive definite. Setting β to even larger values has the benefit of making the Gaussian mixture components more isotropic, but this comes at the cost of requiring a sharper approximate posterior with potentially noisier gradient estimates.
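As a numerical illustration of this relaxation (a toy of our own, not the paper's code; W, a, and β below are arbitrary), the following sketch evaluates the relaxed density p(ζ) = Σ_z p(z) N(ζ | z, (W + βI)^{−1}) for a two-variable Boltzmann machine by brute force, which makes the mixture-of-Gaussians structure described above explicit:

```python
# Toy Gaussian-trick relaxation (illustrative): brute-force mixture of Gaussians
# centered on the binary states, with shared covariance (W + beta*I)^{-1}.
import itertools
import numpy as np
from scipy.stats import multivariate_normal

W = np.array([[0.0, 1.5], [1.5, 0.0]])     # symmetric couplings
a = np.array([0.5, -0.3])                  # biases
beta = 8.0                                 # must make W + beta*I positive definite

states = np.array(list(itertools.product([0, 1], repeat=2)))
log_unnorm = states @ a + 0.5 * np.einsum('ki,ij,kj->k', states, W, states)
p_z = np.exp(log_unnorm); p_z /= p_z.sum()

cov = np.linalg.inv(W + beta * np.eye(2))  # shared component covariance
def p_zeta(zeta):
    return sum(pz * multivariate_normal.pdf(zeta, mean=z, cov=cov)
               for pz, z in zip(p_z, states))

# The density is high near the vertices favoured by p(z) and low in between.
print(p_zeta(np.array([1.0, 0.0])), p_zeta(np.array([0.5, 0.5])))
```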

3.3 Generalizing Overlapping Transformations

The previous sections developed two relaxations for Boltzmann priors. Depending on this choice, compatible parameterizations must be used. For example, if Gaussian smoothing is used, then a mixture of Gaussian smoothers should be used in the approximate posterior. Unfortunately, the overlapping transformations introduced in DVAE++ vahdat2018dvae++ are limited to mixtures of exponential or logistic distributions where the inverse CDF can be computed analytically. Here, we provide a general approach for reparameterizing overlapping transformations that does not require analytic inverse CDFs. Our approach is a special case of the reparameterization method for multivariate mixture distributions proposed in graves2016stochastic .

Assume q(ζ) = (1 − q) r(ζ|z=0) + q r(ζ|z=1) is the mixture distribution resulting from an overlapping transformation defined for one-dimensional z and ζ, where q = q(z=1). Ancestral sampling from this mixture is accomplished by first sampling z from the binary distribution and then sampling ζ from r(ζ|z). This process generates samples but is not differentiable with respect to q.

To compute the gradient (with respect to q) of samples from the mixture, we apply the implicit function theorem. The inverse CDF of the mixture at a uniform sample ρ is obtained by solving

(8)   (1 − q) R_0(ζ) + q R_1(ζ) = ρ,

where ρ ~ U(0, 1) and R_z is the CDF of r(ζ|z). Assuming that ζ is a function of q but ρ is not, we take the gradient of both sides of Eq. (8) with respect to q, giving

(9)   dζ/dq = ( R_0(ζ) − R_1(ζ) ) / ( (1 − q) r(ζ|z=0) + q r(ζ|z=1) ),

which can be easily computed for a sampled ζ if the PDF and CDF of r(ζ|z) are known. This generalization allows us to compute gradients of samples generated from a wide range of overlapping transformations. Further, the gradient of ζ with respect to the parameters of r(ζ|z) (e.g., β) is computed similarly by differentiating both sides of Eq. (8) with respect to those parameters.
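The sketch below (our own helper names; the Beta components stand in for a generic overlapping pair with known PDF and CDF and are not the paper's exact power-function transformation of Eq. (10)) inverts the mixture CDF numerically with a root finder and applies the implicit-gradient formula of Eq. (9):

```python
# Generalized reparameterization sketch: numerically invert the mixture CDF and
# compute d(zeta)/d(q) via the implicit function theorem (Eq. (9)). The Beta
# components are an illustrative choice of overlapping pair (an assumption).
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.optimize import brentq

beta_temp = 8.0
comp0 = beta_dist(1.0, beta_temp + 1.0)   # mass concentrated near 0
comp1 = beta_dist(beta_temp + 1.0, 1.0)   # mass concentrated near 1

def sample_and_grad(q, rho):
    """Invert the mixture CDF at noise rho and return (zeta, d zeta / d q)."""
    mix_cdf = lambda z: (1.0 - q) * comp0.cdf(z) + q * comp1.cdf(z) - rho
    zeta = brentq(mix_cdf, 0.0, 1.0)                          # numerical inverse CDF
    mix_pdf = (1.0 - q) * comp0.pdf(zeta) + q * comp1.pdf(zeta)
    dzeta_dq = (comp0.cdf(zeta) - comp1.cdf(zeta)) / mix_pdf  # implicit gradient
    return zeta, dzeta_dq

print(sample_and_grad(q=0.7, rho=0.5))
```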

Figure 1: (a) Exponential, (b) uniform+exp, and (c) power-function transformations. The first row visualizes the inverse CDF of the mixture for a fixed q as a function of the random noise ρ; the second row visualizes the gradient of the inverse CDF with respect to q. Each column corresponds to a different smoothing transformation. As the transition region sharpens with increasing β, a sampling-based estimate of the gradient becomes noisier; i.e., the variance of dζ/dq increases. The uniform+exp smoothing has an inverse CDF very similar to the exponential (first row) but potentially lower gradient variance (second row). In comparison, power-function smoothing provides a good relaxation of the discrete variables while its gradient noise remains moderate. See the supplementary material for a comparison of the gradient noise.

With this method, we can apply overlapping transformations beyond the mixture of exponentials considered in vahdat2018dvae++ . The inverse CDF of the exponential mixture is shown in Fig. 1(a) for several values of β. As β increases, the relaxation approaches the original binary variables, but this added fidelity comes at the cost of noisier gradients. Other overlapping transformations offer alternative tradeoffs:

Uniform+Exp Transformation: We ensure that the gradient remains finite as β → ∞ by mixing the exponential with a uniform distribution. This is achieved by defining r'(ζ|z) = (1 − ϵ) r(ζ|z) + ϵ U(ζ; 0, 1), where r(ζ|z) is the exponential smoothing and ϵ ∈ [0, 1] is a mixing weight. The inverse CDF resulting from this smoothing is shown in Fig. 1(b).

Power-Function Transformation: Instead of adding a uniform distribution, we replace the exponential distribution with one that has heavier tails. One choice is the power-function distribution govindarajulu1966characterization :

(10)

The conditionals in Eq. (10) correspond to Beta distributions. The inverse CDF resulting from this smoothing is visualized in Fig. 1(c).

Gaussian Transformations: The transformations introduced above have support [0, 1]. We also explore Gaussian smoothing, which has support on the whole real line.

None of these transformations have an analytic inverse CDF for their mixtures, so we use Eq. (9) to calculate gradients.
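To probe the gradient-noise tradeoff of Fig. 1 numerically, the exploratory sketch below (entirely our own construction: the truncated-exponential and Beta components are stand-ins for the exponential and power-function smoothings, and matching both at a single β is not the cross-validated setting of Sec. 4) estimates Var_ρ[dζ/dq] for the two families using the implicit gradient of Eq. (9). The resulting numbers depend on β and on the stand-in components, so this is exploratory rather than a reproduction of Fig. 2(c).

```python
# Rough Monte Carlo estimate of Var[d zeta / d q] under two component families
# (assumed stand-ins for the exponential and power-function smoothings).
import numpy as np
from scipy.stats import beta as beta_dist, truncexpon
from scipy.optimize import brentq

def mirror(dist):
    """Reflect a distribution on [0, 1] about 1/2 (mass near 0 -> mass near 1)."""
    return (lambda z: dist.pdf(1.0 - z), lambda z: 1.0 - dist.cdf(1.0 - z))

def grad_variance(pdf0, cdf0, pdf1, cdf1, q=0.5, n=2000, seed=0):
    rng, grads = np.random.default_rng(seed), []
    for rho in rng.uniform(1e-4, 1.0 - 1e-4, size=n):
        f = lambda z: (1.0 - q) * cdf0(z) + q * cdf1(z) - rho
        zeta = brentq(f, 0.0, 1.0)                      # invert the mixture CDF
        mix_pdf = (1.0 - q) * pdf0(zeta) + q * pdf1(zeta)
        grads.append((cdf0(zeta) - cdf1(zeta)) / mix_pdf)
    return np.var(grads)

beta_temp = 8.0
exp0 = truncexpon(b=beta_temp, scale=1.0 / beta_temp)   # exponential decay on [0, 1]
exp1_pdf, exp1_cdf = mirror(exp0)
pow0 = beta_dist(1.0, beta_temp + 1.0)                  # power-function stand-in
pow1_pdf, pow1_cdf = mirror(pow0)

print("exp  :", grad_variance(exp0.pdf, exp0.cdf, exp1_pdf, exp1_cdf))
print("power:", grad_variance(pow0.pdf, pow0.cdf, pow1_pdf, pow1_cdf))
```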

4 Experiments

In this section we compare the various relaxations for training DVAEs with Boltzmann priors on the statically binarized MNIST salakhutdinov2008dbn and OMNIGLOT lake2015human datasets. For all experiments we use a generative model of the form p(x, ζ) = p(ζ) p(x|ζ), where p(ζ) is a continuous relaxation obtained from either the overlapping relaxation of Eq. (4) or the Gaussian integral trick of Eq. (7). The underlying Boltzmann distribution is a restricted Boltzmann machine (RBM) with bipartite connectivity, which allows for parallel Gibbs updates. We use a hierarchical, autoregressively-structured q(ζ|x) to approximate the posterior distribution over ζ. This structure divides the components of ζ into equally-sized groups and defines each group's conditional using a factorial distribution conditioned on x and on the ζ from previous groups.

The smoothing transformation used in q(ζ|x) depends on the type of relaxation used in p(ζ). For overlapping relaxations, we compare exponential, uniform+exp, Gaussian, and power-function smoothing. With the Gaussian integral trick, we use shifted Gaussian smoothing as described below. The decoder p(x|ζ) and the conditionals of q(ζ|x) are modeled with neural networks. Following vahdat2018dvae++ , we consider both linear ("lin") and nonlinear ("nonlin") versions of these networks. The linear models use a single linear layer to predict the parameters of these distributions given their input. The nonlinear models use two deterministic hidden layers with 200 units, tanh activation, and batch normalization. We use the same initialization scheme, batch size, optimizer, number of training iterations, learning-rate schedule, weight decay, and KL warm-up for training that was used in vahdat2018dvae++ (see Sec. 7.2 of vahdat2018dvae++ ). For the mean-field optimization, we use 5 iterations. To evaluate the trained models, we estimate the log-likelihood on the discrete graphical model using the importance-weighted bound with 4000 samples burda2015importance . At evaluation, p(ζ) is replaced with the Boltzmann distribution p(z), and q(ζ|x) with q(z|x) (corresponding to β → ∞).

For DVAE, we use the original spike-and-exp smoothing. For DVAE++, in addition to exponential smoothing, we use a mixture of power-functions. The DVAE# models are trained using the IW bound in Eq. (3) with K samples (K ∈ {1, 5, 25}; see Table 1). To fairly compare DVAE# with DVAE and DVAE++ (which can only be trained with the variational bound), we use the same number of samples when estimating the variational bound during DVAE and DVAE++ training.

The smoothing parameter β is fixed throughout training (i.e., it is not annealed). However, since β acts differently for each smoothing function, its value is selected by cross-validation per smoothing and per structure (separately for spike-and-exp, exponential, uniform+exp together with its mixing weight ϵ, power-function, and Gaussian smoothing). For models other than the Gaussian integral trick, β is set to the same value in p(ζ) and q(ζ|x). For the Gaussian integral case, the smoothing in the encoder is trained as discussed next, but β in the prior is selected by cross-validation.

With the Gaussian integral trick, each mixture component in the prior contains off-diagonal correlations, and the approximate posterior over ζ should capture this. We recall that a multivariate Gaussian can always be represented as a product of Gaussian conditionals in which the mean of each conditional is linear in the previously generated components. Motivated by this observation, we provide flexibility in the approximate posterior by using shifted Gaussian smoothing, in which an additional trainable parameter shifts the mean of each component. As the approximate posterior in our model is hierarchical, the shift for each element in a group is generated by the same neural network that outputs the parameters of that group's conditional. The corresponding shift for each component of the prior is a trainable parameter shared across all x.
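The fact used above, that a correlated Gaussian factors into conditionals with linear means, is easy to verify numerically; the short sketch below (our own example, not part of the model) samples a 2-D Gaussian both jointly and sequentially and compares the empirical covariances:

```python
# Numeric check: a correlated 2-D Gaussian sampled as a product of conditionals
# whose means are linear in the previously sampled component.
import numpy as np

Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
rng = np.random.default_rng(0)
n = 100_000

# Joint sampling for reference.
joint = rng.multivariate_normal(np.zeros(2), Sigma, size=n)

# Sequential sampling: zeta1 ~ N(0, S11), zeta2 | zeta1 ~ N((S21/S11) zeta1, S22 - S21^2/S11).
z1 = rng.normal(0.0, np.sqrt(Sigma[0, 0]), size=n)
cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
z2 = rng.normal(Sigma[0, 1] / Sigma[0, 0] * z1, np.sqrt(cond_var))

print(np.cov(joint.T))               # approximately Sigma
print(np.cov(np.vstack([z1, z2])))   # approximately Sigma as well
```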

Training also requires sampling from the discrete RBM to compute the gradient of log Z_θ with respect to θ. We have used both population annealing hukushima2003population (with 40 sweeps across variables per parameter update) and persistent contrastive divergence tieleman2008training for sampling. Population annealing usually results in a better generative model (see the supplementary material for a comparison). We use QuPA, a GPU implementation of population annealing (the library is publicly available at https://try.quadrant.ai/qupa). To obtain test-set log-likelihoods we require log Z_θ, which we estimate with annealed importance sampling neal2001annealed ; salakhutdinov2008quantitative . We use 10,000 temperatures and 1,000 samples to ensure that the standard deviation of the log Z_θ estimate is small.

We compare the performance of DVAE# against DVAE and DVAE++ in Table 1. We consider four neural-net structures when examining the various smoothing models. Each structure is denoted "G lin" or "G nonlin", where G is the number of groups in the approximate posterior and lin/nonlin indicates linear or nonlinear conditionals. The RBM prior used for the "1 lin/nonlin" structures differs from the one used for the "2/4 nonlin" structures.

We make several observations based on Table 1: i) Most baselines improve as K increases. The improvements are generally larger for DVAE#, as it optimizes the IW bound. ii) Power-function smoothing improves the performance of DVAE++ over the original exponential smoothing. iii) With power-function smoothing and K = 1, DVAE# and DVAE++ optimize a similar variational bound with the same smoothing transformation; the main difference is that DVAE# uses the marginal p(ζ) in the prior whereas DVAE++ uses the joint over ζ and z. In this case, DVAE# usually outperforms DVAE++. iv) Among the DVAE# variants, the Gaussian integral trick and the Gaussian overlapping relaxation result in similar performance, and both are usually inferior to the other DVAE# relaxations. v) In DVAE#, uniform+exp smoothing performs better than exponential smoothing alone. vi) DVAE# with power-function smoothing results in the best generative models and in most cases outperforms both DVAE and DVAE++.

Figure 2: (a) The KL divergence between the mean-field model and the augmented Boltzmann machine as a function of the number of mean-field optimization iterations; the mean-field model converges within three iterations. (b) The KL value computed for randomly selected ζ's at different training iterations, for exponential and power-function smoothings with different β. (c) The variance of the gradient of the objective function with respect to the logits of q(z|x), visualized for exponential and power-function smoothing transformations. Power-function smoothing tends to have lower variance than exponential smoothing. The artifact seen early in training is due to the KL warm-up. Models in (b) and (c) are trained for 100K iterations with a batch size of 1,000.
                              DVAE         DVAE++                    DVAE#
Dataset    Struct.   K    Spike-Exp    Exp          Power        Gauss. Int   Gaussian     Exp          Un+Exp       Power
MNIST      1 lin     1    89.00±0.09   90.43±0.06   89.12±0.05   92.14±0.12   91.33±0.13   90.55±0.11   89.57±0.08   89.35±0.06
                     5    89.15±0.12   90.13±0.03   89.09±0.05   91.32±0.09   90.15±0.04   89.62±0.08   88.56±0.04   88.25±0.03
                     25   89.20±0.13   89.92±0.07   89.04±0.07   91.18±0.21   89.55±0.10   89.27±0.09   88.02±0.04   87.67±0.07
           1 nonlin  1    85.48±0.06   85.13±0.06   85.05±0.02   86.23±0.05   86.24±0.05   85.37±0.05   85.19±0.05   84.93±0.02
                     5    85.29±0.03   85.13±0.09   85.29±0.10   84.99±0.03   84.91±0.07   84.83±0.03   84.47±0.02   84.21±0.02
                     25   85.92±0.10   86.14±0.18   85.59±0.10   84.36±0.04   84.30±0.04   84.69±0.08   84.22±0.01   83.93±0.06
           2 nonlin  1    83.97±0.04   84.15±0.07   83.62±0.04   84.30±0.05   84.35±0.04   83.96±0.06   83.54±0.06   83.37±0.02
                     5    83.74±0.03   84.85±0.13   83.57±0.07   83.68±0.02   83.61±0.04   83.70±0.04   83.33±0.04   82.99±0.04
                     25   84.19±0.21   85.49±0.12   83.58±0.15   83.39±0.04   83.26±0.04   83.76±0.04   83.30±0.04   82.85±0.03
           4 nonlin  1    84.38±0.03   84.63±0.11   83.44±0.05   84.59±0.06   84.81±0.19   84.06±0.06   83.52±0.06   83.18±0.05
                     5    83.93±0.07   85.41±0.09   83.17±0.09   83.89±0.09   84.20±0.15   84.15±0.05   83.41±0.04   82.95±0.07
                     25   84.12±0.07   85.42±0.07   83.20±0.08   83.52±0.06   83.80±0.04   84.22±0.13   83.39±0.04   82.82±0.02
OMNIGLOT   1 lin     1    105.11±0.11  106.71±0.08  105.45±0.08  110.81±0.32  106.81±0.07  107.21±0.14  105.89±0.06  105.47±0.09
                     5    104.68±0.21  106.83±0.09  105.34±0.05  112.26±0.70  106.16±0.11  106.86±0.10  104.94±0.05  104.42±0.09
                     25   104.38±0.15  106.85±0.07  105.38±0.14  111.92±0.30  105.75±0.10  106.88±0.09  104.49±0.07  103.98±0.05
           1 nonlin  1    102.95±0.07  101.84±0.08  101.88±0.06  103.50±0.06  102.74±0.08  102.23±0.08  101.86±0.06  101.70±0.01
                     5    102.45±0.08  102.13±0.11  101.67±0.07  102.15±0.04  102.00±0.09  101.59±0.06  101.22±0.05  101.00±0.02
                     25   102.74±0.05  102.66±0.09  101.80±0.15  101.42±0.04  101.60±0.09  101.48±0.04  100.93±0.07  100.60±0.05
           2 nonlin  1    103.10±0.31  101.34±0.04  100.42±0.03  102.07±0.16  102.84±0.23  100.38±0.09  99.84±0.06   99.75±0.05
                     5    100.88±0.13  100.55±0.09  99.51±0.05   100.85±0.02  101.43±0.11  99.93±0.07   99.57±0.06   99.24±0.05
                     25   100.55±0.08  100.31±0.15  99.49±0.07   100.20±0.02  100.45±0.08  100.10±0.28  99.59±0.16   98.93±0.05
           4 nonlin  1    104.63±0.47  101.58±0.22  100.42±0.08  102.91±0.25  103.43±0.10  100.85±0.12  99.92±0.11   99.65±0.09
                     5    101.77±0.20  101.01±0.09  99.52±0.09   101.79±0.25  101.82±0.13  100.32±0.19  99.61±0.07   99.13±0.10
                     25   100.89±0.13  100.37±0.09  99.43±0.14   100.73±0.08  100.97±0.21  99.92±0.30   99.36±0.09   98.88±0.09
Table 1: The performance of DVAE# is compared against DVAE and DVAE++ on MNIST and OMNIGLOT. Mean ± standard deviation of the negative log-likelihood over five runs is reported. Spike-Exp is the DVAE baseline; Exp and Power are DVAE++ baselines; Gauss. Int, Gaussian, Exp, Un+Exp, and Power are DVAE# variants.

Given the superior performance of the models obtained using the mean-field approximation of Sec. 3.1.1 to log p(ζ), we investigate the accuracy of this approximation. In Fig. 2(a), we show that the mean-field model converges quickly by plotting the KL divergence of Eq. (6) against the number of mean-field iterations for a single ζ. To assess the quality of the mean-field approximation, in Fig. 2(b) we compute the KL divergence for randomly selected ζ's at different training iterations for exponential and power-function smoothings with different β's. As can be seen, throughout training the KL value is typically small. For larger β's, the KL value is smaller due to the stronger bias that ζ imposes on z.

Lastly, we demonstrate that the lower variance of power-function smoothing may contribute to its success. As noted in Fig. 1, power-function smoothing potentially has moderate gradient noise while still providing a good approximation of binary variables at large β. We validate this hypothesis in Fig. 2(c) by measuring the variance of the derivative of the variational bound with respect to the logits of q(z|x) during training of a 2-layer nonlinear model on MNIST. When comparing exponential to power-function smoothing at the β that performs best for each smoothing method, we find that power-function smoothing has significantly lower variance.

5 Conclusions

We have introduced two approaches for relaxing Boltzmann machines to continuous distributions, and shown that the resulting distributions can be trained as priors in DVAEs using an importance-weighted bound. We have proposed a generalization of overlapping transformations that removes the need for an analytic inverse CDF. Using this generalization, the mixture-of-power-function smoothing provides a good approximation of binary variables while keeping the gradient noise moderate. With sharp power-function smoothing, our model outperforms previous discrete VAEs.

References

  • [1] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In The International Conference on Learning Representations (ICLR), 2014.
  • [2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • [3] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
  • [4] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2014.
  • [5] Yuchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems. 2017.
  • [6] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
  • [7] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.
  • [8] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, 2017.
  • [9] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In International Conference on Machine Learning, 2018.
  • [10] Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for conditional program generation. In The International Conference on Learning Representations, 2018.
  • [11] Jörg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo Jimenez Rezende. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pages 3923–3932, 2017.
  • [12] Nicolas Le Roux and Yoshua Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 2008.
  • [13] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2009.
  • [14] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2008.
  • [15] Tu Dinh Nguyen, Dinh Phung, Viet Huynh, and Trung Le. Supervised restricted Boltzmann machines. In UAI, 2017.
  • [16] Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. J. Mach. Learn. Res., 5:1063–1088, December 2004.
  • [17] Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [18] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 791–798, New York, NY, USA, 2007. ACM.
  • [19] Jason Tyler Rolfe. Discrete variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.
  • [20] Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. DVAE++: Discrete variational autoencoders with overlapping transformations. In International Conference on Machine Learning (ICML), 2018.
  • [21] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In The International Conference on Learning Representations (ICLR), 2016.
  • [22] John Hertz, Richard Palmer, and Anders Krogh. Introduction to the theory of neural computation. 1991.
  • [23] J Hubbard. Calculation of partition functions. Physical Review Letters, 3(2):77, 1959.
  • [24] Zakkula Govindarajulu. Characterization of the exponential and power distributions. Scandinavian Actuarial Journal, 1966(3-4):132–136, 1966.
  • [25] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
  • [26] Lars Maaløe, Marco Fraccaro, and Ole Winther. Semi-supervised generation with cluster-aware generative models. arXiv preprint arXiv:1704.00637, 2017.
  • [27] Michalis Titsias RC AUEB and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. In Advances in neural information processing systems, pages 2638–2646, 2015.
  • [28] Seiya Tokui and Issei Sato. Evaluating the variance of likelihood-ratio gradient estimators. In International Conference on Machine Learning, pages 3414–3423, 2017.
  • [29] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
  • [30] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [31] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
  • [32] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
  • [33] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
  • [34] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  • [35] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In The International Conference on Learning Representations (ICLR), 2016.
  • [36] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
  • [37] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633, 2017.
  • [38] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations (ICLR), 2018.
  • [39] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
  • [40] Max Welling and Geoffrey E Hinton. A new learning algorithm for mean field Boltzmann machines. In International Conference on Artificial Neural Networks, pages 351–357. Springer, 2002.
  • [41] Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles A Sutton. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 3194–3202, 2012.
  • [42] Alex Graves. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
  • [43] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.
  • [44] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [45] K Hukushima and Y Iba. Population annealing and its application to a spin glass. In AIP Conference Proceedings, volume 690, pages 200–206. AIP, 2003.
  • [46] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
  • [47] Radford M. Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
  • [48] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.

Appendix A Population Annealing vs. Persistent Contrastive Divergence

In this section, we compare population annealing (PA) to persistent contrastive divergence (PCD) for sampling in the negative phase. In Table 2, we train DVAE# with power-function smoothing on the binarized MNIST dataset using PA and PCD. As shown, PA results in a comparable generative model when there is one group of latent variables and better models in the other cases.

Struct.    K    PCD          PA
1 lin      1    89.25±0.04   89.35±0.06
           5    88.18±0.08   88.25±0.03
           25   87.66±0.09   87.67±0.07
1 nonlin   1    84.95±0.05   84.93±0.02
           5    84.25±0.04   84.21±0.02
           25   83.91±0.05   83.93±0.06
2 nonlin   1    83.48±0.04   83.37±0.02
           5    83.12±0.04   82.99±0.04
           25   83.06±0.03   82.85±0.03
4 nonlin   1    83.62±0.06   83.18±0.05
           5    83.34±0.06   82.95±0.07
           25   83.18±0.05   82.82±0.02
Table 2: The performance (negative log-likelihood, mean ± standard deviation) of DVAE# with power-function smoothing on binarized MNIST when PCD or PA is used in the negative phase.

Appendix B On the Gradient Variance of the Power-function Smoothing

Our experiments show that power-function smoothing performs best because it provides a better approximation of the binary random variables. We demonstrate this qualitatively in Fig. 1 and quantitatively in Fig. 2(c) of the paper. This is also visualized in Fig. 3. Here, we generate samples from q(ζ|x) for a fixed q using both the exponential and power-function smoothings with different values of β. The value of β increases from left to right along each curve. The average distance between ζ and its binarized value versus the variance of dζ/dq is visualized in this figure. For a given gradient variance, power-function smoothing provides a closer approximation to the binary variables.

Figure 3: Average distance between ζ and its binarized value vs. the variance of dζ/dq, measured on samples from q(ζ|x). For a given gradient variance, power-function smoothing provides a closer approximation to the binary variables.