1 Introduction
Advances in amortized variational inference kingma2014vae ; rezende2014stochastic ; mnih2014neural ; gregorICML14 have enabled novel learning methods gregorICML14 ; pu2017VAE ; salimans2015markov and extended generative learning into complex domains such as molecule design gomez2016automatic ; kusner2017grammar , music roberts2018hierarchical and program Murali2018
generation. These advances have been made using continuous latent variable models in spite of the computational efficiency and greater interpretability offered by discrete latent variables. Further, models such as clustering, semisupervised learning, and variational memory addressing
bornschein2017Variational all require discrete variables, which makes the training of discrete models an important challenge.Prior to the deep learning era, Boltzmann machines were widely used for learning with discrete latent variables. These powerful multivariate binary distributions can represent any distribution defined on a set of binary random variables
le2008representational, and have seen application in unsupervised learning
Salakhutdinov09 , supervised learning Larochelle2008 ; Nguyen2017 Sallans2004 , dimensionality reduction hinton2006 , and collaborative filtering Salakhutdinov2007 . Recently, Boltzmann machines have been used as priors for variational autoencoders (VAEs) in the discrete variational autoencoder (DVAE) rolfe2016discrete and its successor DVAE++ vahdat2018dvae++ . It has been demonstrated that these VAE models can capture discrete aspects of data. However, both these models assume a particular variational bound and tighter bounds such as the importance weighted (IW) bound burda2015importance cannot be used for training.We remove this constraint by introducing two continuous relaxations that convert a Boltzmann machine to a distribution over continuous random variables. These relaxations are based on overlapping transformations introduced in
vahdat2018dvae++ and the Gaussian integral trick hertz1991introduction (known as the HubbardStratonovich transform hubbard1959calculation in physics). Our relaxations are made tunably sharp by using an inverse temperature parameter.VAEs with relaxed Boltzmann priors can be trained using standard techniques developed for continuous latent variable models. In this work, we train discrete VAEs using the same IW bound on the loglikelihood that has been shown to improve importance weighted autoencoders (IWAEs) burda2015importance .
This paper makes two contributions: i) We introduce two continuous relaxations of Boltzmann machines and use these relaxations to train a discrete VAE with a Boltzmann prior using the IW bound. ii) We generalize the overlapping transformations of vahdat2018dvae++
to any pair of distributions with computable probability density function (PDF) and cumulative density function (CDF). Using these more general overlapping transformations, we propose new smoothing transformations using mixtures of Gaussian and powerfunction
govindarajulu1966characterizationdistributions. Powerfunction overlapping transformations provide lower variance gradient estimates and improved test set loglikelihoods when the inverse temperature is large. We name our framework DVAE# because the best results are obtained when the powerfunction transformations are sharp.
^{1}^{1}1And not because our model is proposed after DVAE and DVAE++.1.1 Related Work
Previous work on training discrete latent variable models can be grouped into five main categories:

Exhaustive approaches marginalize all discrete variables kingma2014semi ; maaloe2017semi and which are not scalable to more than a few discrete variables.

Local expectation gradients aueb2015local and reparameterization and marginalization tokui2017evaluating estimators compute lowvariance estimates at the cost of multiple function evaluations per gradient. These approaches can be applied to problems with a moderate number of latent variables.

Relaxed computation of discrete densities jang2016categorical
replaces discrete variables with continuous relaxations for gradient computation. A variation of this approach, known as the straightthrough technique, sets the gradient of binary variables to the gradient of their mean
bengio2013estimating ; raiko2014techniques . 
Continuous relaxations of discrete distributions maddison2016concrete replace discrete distributions with continuous ones and optimize a consistent objective. This method cannot be applied directly to Boltzmann distributions. The DVAE rolfe2016discrete solves this problem by pairing each binary variable with an auxiliary continuous variable. This approach is described in Sec. 2.

The REINFORCE estimator williams1992simple (also known as the likelihood ratio glynn1990likelihood or scorefunction estimator) replaces the gradient of an expectation with the expectation of the gradient of the score function. This estimator has high variance, but many increasingly sophisticated methods provide lower variance estimators. NVIL mnih2014neural uses an inputdependent baseline, and MuProp gu2015muprop uses a firstorder Taylor approximation along with an inputdependent baseline to reduce noise. VIMCO mnih2016variational trains an IWAE with binary latent variables and uses a leaveoneout scheme to define the baseline for each sample. REBAR tucker2017rebar and its generalization RELAX grathwohl2017backpropagation use the reparameterization of continuous distributions to define baselines.
The method proposed here is of type iv) and differs from rolfe2016discrete ; vahdat2018dvae++ in the way that binary latent variables are marginalized. The resultant relaxed distribution allows for DVAE training with a tighter bound. Moreover, our proposal encompasses a wider variety of smoothing methods and one of these empirically provides lowervariance gradient estimates.
2 Background
Let represent observed random variables and continuous latent variables. We seek a generative model where denotes the prior distribution and is a probabilistic decoder. In the VAE kingma2014vae , training maximizes a variational lower bound on the marginal loglikelihood:
A probabilistic encoder approximates the posterior over latent variables. For continuous , the bound is maximized using the reparameterization trick. With reparameterization, expectations with respect to are replaced by expectations against a base distribution and a differentiable function that maps samples from the base distribution to . This can always be accomplished when
has an analytic inverse cumulative distribution function (CDF) by mapping uniform samples through the inverse CDF. However, reparameterization cannot be applied to binary latent variables because the CDF is not differentiable.
The DVAE rolfe2016discrete
resolves this issue by pairing each binary latent variable with a continuous counterpart. Denoting a binary vector of length
by , the Boltzmann prior is where is an energy function with parameters and partition function . The joint model over discrete and continuous variables is where is a smoothing transformation that maps each discrete to its continuous analogue .DVAE rolfe2016discrete and DVAE++ vahdat2018dvae++ differ in the type of smoothing transformations : rolfe2016discrete uses spikeandexponential transformation (Eq. (1) left), while vahdat2018dvae++
uses two overlapping exponential distributions (Eq. (
1) right). Here, is the (onesided) Dirac delta distribution, , and is the normalization constant:(1) 
The variational bound for a factorial approximation to the posterior where and is derived in vahdat2018dvae++ as
(2) 
Here is a mixture distribution combining and with weights . The probability of binary units conditioned on , , can be computed analytically. is the entropy of . The second and third terms in Eq. (2) have analytic solutions (up to the log normalization constant) that can be differentiated easily with an automatic differentiation (AD) library. The expectation over is approximated with reparameterized sampling.
We extend rolfe2016discrete ; vahdat2018dvae++ to tighten the bound of Eq. (2) by importance weighting burda2015importance ; li2016renyi . These tighter bounds are shown to improve VAEs. For continuous latent variables, the sample IW bound is
(3) 
The tightness of the IW bound improves as increases burda2015importance .
3 Model
We introduce two relaxations of Boltzmann machines to define the continuous prior distribution in the IW bound of Eq. (3). These relaxations rely on either overlapping transformations (Sec. 3.1) or the Gaussian integral trick (Sec. 3.2). Sec. 3.3 then generalizes the class of overlapping transformations that can be used in the approximate posterior .
3.1 Overlapping Relaxations
We obtain a continuous relaxation of through the marginal where is an overlapping smoothing transformation vahdat2018dvae++ that operates on each component of and independently; i.e., . Overlapping transformations such as mixture of exponential in Eq. (1) may be used for
. These transformations are equipped with an inverse temperature hyperparameter
to control the sharpness of the smoothing transformation. As , approaches and becomes a mixture of delta function distributions centered on the vertices of the hypercube in . At finite , provides a continuous relaxation of the Boltzmann machine.To train an IWAE using Eq. (3) with as a prior, we must compute and its gradient with respect to the parameters of the Boltzmann distribution and the approximate posterior. This computation involves marginalization over , which is generally intractable. However, we show that this marginalization can be approximated accurately using a meanfield model.
3.1.1 Computing and its Gradient for Overlapping Relaxations
Since overlapping transformations are factorial, the log marginal distribution of is
(4) 
where and . For the mixture of exponential smoothing and .
The first term in Eq. (4) is the log partition function of the Boltzmann machine with augmented energy function . Estimating the log partition function accurately can be expensive, particularly because it has to be done for each . However, we note that each comes from a bimodal distribution centered at zero and one, and that the bias is usually large for most components (particularly for large ). In this case, mean field is likely to provide a good approximation of , a fact we demonstrate empirically in Sec. 4.
To compute and its gradient, we first fit a meanfield distribution by minimizing welling2002new . The gradient of with respect to , or is:
(5)  
where is the meanfield solution and where the gradient does not act on . The first term in Eq. (5) is the result of computing the average energy under a factorial distribution.^{2}^{2}2The augmented energy is a multilinear function of and under the meanfield assumption each is replaced by its average value . The second expectation corresponds to the negative phase in training Boltzmann machines and is approximated by Monte Carlo sampling from .
3.2 The Gaussian Integral Trick
The computational complexity of arises from the pairwise interactions present in . Instead of applying mean field, we remove these interactions using the Gaussian integral trick zhangNIPS12 . This is achieved by defining Gaussian smoothing:
for an invertible matrix
and a diagonal matrix with . Here, must be large enough so that is positive definite. Common choices for include or where is the eigendecomposition of zhangNIPS12 . However, neither of these choices places the modes of on the vertices of the hypercube in . Instead, we take giving the smoothing transformation . The joint density is thenwhere is the vector of all ones. Since no longer contains pairwise interactions can be marginalized out giving
(7) 
where is the element of .
The marginal in Eq. (7) is a mixture of Gaussian distributions centered on the vertices of the hypercube in with mixing weights given by . Each mixture component has covariance and, as gets large, the precision matrix becomes diagonally dominant. As , each mixture component becomes a delta function and approaches . This Gaussian smoothing allows for simple evaluation of (up to ), but we note that each mixture component has a nondiagonal covariance matrix, which should be accommodated when designing the approximate posterior .
The hyperparameter
must be larger than the absolute value of the most negative eigenvalue of
to ensure that is positive definite. Setting to even larger values has the benefit of making the Gaussian mixture components more isotropic, but this comes at the cost of requiring a sharper approximate posterior with potentially noisier gradient estimates.3.3 Generalizing Overlapping Transformations
The previous sections developed two relaxations for Boltzmann priors. Depending on this choice, compatible parameterizations must be used. For example, if Gaussian smoothing is used, then a mixture of Gaussian smoothers should be used in the approximate posterior. Unfortunately, the overlapping transformations introduced in DVAE++ vahdat2018dvae++ are limited to mixtures of exponential or logistic distributions where the inverse CDF can be computed analytically. Here, we provide a general approach for reparameterizing overlapping transformations that does not require analytic inverse CDFs. Our approach is a special case of the reparameterization method for multivariate mixture distributions proposed in graves2016stochastic .
Assume is the mixture distribution resulting from an overlapping transformation defined for onedimensional and where . Ancestral sampling from is accomplished by first sampling from the binary distribution and then sampling from . This process generates samples but is not differentiable with respect to .
To compute the gradient (with respect to ) of samples from , we apply the implicit function theorem. The inverse CDF of at is obtained by solving:
(8) 
where and is the CDF for . Assuming that is a function of but is not, we take the gradient from both sides of Eq. (8) with respect to giving
(9) 
which can be easily computed for a sampled if the PDF and CDF of are known. This generalization allows us to compute gradients of samples generated from a wide range of overlapping transformations. Further, the gradient of with respect to the parameters of (e.g. ) is computed similarly as
With this method, we can apply overlapping transformations beyond the mixture of exponentials considered in vahdat2018dvae++ . The inverse CDF of exponential mixtures is shown in Fig. 1(a) for several . As increases, the relaxation approaches the original binary variables, but this added fidelity comes at the cost of noisy gradients. Other overlapping transformations offer alternative tradeoffs:
Uniform+Exp Transformation: We ensure that the gradient remains finite as
by mixing the exponential with a uniform distribution. This is achieved by defining
where is the exponential smoothing and . The inverse CDF resulting from this smoothing is shown in Fig. 1(b).PowerFunction Transformation: Instead of adding a uniform distribution we substitute the exponential distribution for one with heavier tails. One choice is the powerfunction distribution govindarajulu1966characterization :
(10) 
The conditionals in Eq. (10
) correspond to the Beta distributions
and respectively. The inverse CDF resulted from this smoothing is visualized in Fig. 1(c).Gaussian Transformations: The transformations introduced above have support . We also explore Gaussian smoothing with support .
None of these transformations have an analytic inverse CDF for so we use Eq. (9) to calculate gradients.
4 Experiments
In this section we compare the various relaxations for training DVAEs with Boltzmann priors on statically binarized MNIST
salakhutdinov2008dbn and OMNIGLOT lake2015human datasets. For all experiments we use a generative model of the form where is a continuous relaxation obtained from either the overlapping relaxation of Eq. (4) or the Gaussian integral trick of Eq. (7). The underlying Boltzmann distribution is a restricted Boltzmann machine (RBM) with bipartite connectivity which allows for parallel Gibbs updates. We use a hierarchical autoregressivelystructured
to approximate the posterior distribution over . This structure divides the components of into equallysized groups and defines each conditional using a factorial distribution conditioned on and all from previous groups.The smoothing transformation used in depends on the type of relaxation used in . For overlapping relaxations, we compare exponential, uniform+exp, Gaussian, and powerfunction. With the Gaussian integral trick, we use shifted Gaussian smoothing as described below. The decoder and conditionals
are modeled with neural networks. Following
vahdat2018dvae++ , we consider both linear (—) and nonlinear () versions of these networks. The linear models use a single linear layer to predict the parameters of the distributions andgiven their input. The nonlinear models use two deterministic hidden layers with 200 units, tanh activation and batchnormalization. We use the same initialization scheme, batchsize, optimizer, number of training iterations, schedule of learning rate, weight decay and KL warmup for training that was used in
vahdat2018dvae++ (See Sec. 7.2 in vahdat2018dvae++ ). For the meanfield optimization, we use 5 iterations. To evaluate the trained models, we estimate the loglikelihood on the discrete graphical model using the importanceweighted bound with 4000 samples burda2015importance . At evaluation is replaced with the Boltzmann distribution , and with (corresponding to ).For DVAE, we use the original spikeandexp smoothing. For DVAE++, in addition to exponential smoothing, we use a mixture of powerfunctions. The DVAE# models are trained using the IW bound in Eq. (3) with samples. To fairly compare DVAE# with DVAE and DVAE++ (which can only be trained with the variational bound), we use the same number of samples when estimating the variational bound during DVAE and DVAE++ training.
The smoothing parameter is fixed throughout training (i.e. is not annealed). However, since acts differently for each smoothing function , its value is selected by cross validation per smoothing and structure. We select from for spikeandexp, for exponential, with for uniform+exp, for powerfunction, and for Gaussian smoothing. For models other than the Gaussian integral trick, is set to the same value in and . For the Gaussian integral case, in the encoder is trained as discussed next, but is selected in the prior from .
With the Gaussian integral trick, each mixture component in the prior contains offdiagonal correlations and the approximation of the posterior over should capture this. We recall that a multivariate Gaussian can always be represented as a product of Gaussian conditionals where is linear in . Motivated by this observation, we provide flexibility in the approximate posterior by using shifted Gaussian smoothing where , and is an additional parameter that shifts the distribution. As the approximate posterior in our model is hierarchical, we generate for the element in group as the output of the same neural network that generates the parameters of . The parameter for each component of is a trainable parameter shared for all .
Training also requires sampling from the discrete RBM to compute the gradient of . We have used both population annealing hukushima2003population
with 40 sweeps across variables per parameter update and persistent contrastive divergence
tieleman2008training for sampling. Population annealing usually results in a better generative model (see the supplementary material for a comparison). We use QuPA^{3}^{3}3This library is publicly available at https://try.quadrant.ai/qupa, a GPU implementation of population annealing. To obtain test set loglikelihoods we require , which we estimate with annealed importance sampling neal2001annealed ; salakhutdinov2008quantitative. We use 10,000 temperatures and 1,000 samples to ensure that the standard deviation of the
estimate is small ().We compare the performance of DVAE# against DVAE and DVAE++ in Table 1. We consider four neural net structures when examining the various smoothing models. Each structure is denoted “G —/” where G represent the number of groups in the approximate posterior and —/ indicates linear/nonlinear conditionals. The RBM prior for the structures “1 —/” is (i.e. ) and for structures “2/4 ” the RBM is (i.e. ).
We make several observations based on Table 1: i) Most baselines improve as increases. The improvements are generally larger for DVAE# as they optimize the IW bound. ii) Powerfunction smoothing improves the performance of DVAE++ over the original exponential smoothing. iii) DVAE# and DVAE++ both with powerfunction smoothing for optimizes a similar variational bound with same smoothing transformation. The main difference here is that DVAE# uses the marginal in the prior whereas DVAE++ has the joint . For this case, it can be seen that DVAE# usually outperforms DVAE++ . iv) Among the DVAE# variants, the Gaussian integral trick and Gaussian overlapping relaxation result in similar performance, and both are usually inferior to the other DVAE# relaxations. v) In DVAE#, the uniform+exp smoothing performs better than exponential smoothing alone. vi) DVAE# with the powerfunction smoothing results in the best generative models, and in most cases outperforms both DVAE and DVAE++.
. (c) The variance of the gradient of the objective function with respect to the logit of
is visualized for exponential and powerfunction smoothing transformations. Powerfunction smoothing tends to have lower variance than exponential smoothing. The artifact seen early in training is due to the warmup of KL. Models in (b) and (c) are trained for 100K iterations with batch size of 1,000.DVAE  DVAE++  DVAE#  

Struct.  K  SpikeExp  Exp  Power  Gauss. Int  Gaussian  Exp  Un+Exp  Power  
MNIST  1 —  1  89.000.09  90.430.06  89.120.05  92.140.12  91.330.13  90.550.11  89.570.08  89.350.06 
5  89.150.12  90.130.03  89.090.05  91.320.09  90.150.04  89.620.08  88.560.04  88.250.03  
25  89.200.13  89.920.07  89.040.07  91.180.21  89.550.10  89.270.09  88.020.04  87.670.07  
1  1  85.480.06  85.130.06  85.050.02  86.230.05  86.240.05  85.370.05  85.190.05  84.930.02  
5  85.290.03  85.130.09  85.290.10  84.990.03  84.910.07  84.830.03  84.470.02  84.210.02  
25  85.920.10  86.140.18  85.590.10  84.360.04  84.300.04  84.690.08  84.220.01  83.930.06  
2  1  83.970.04  84.150.07  83.620.04  84.300.05  84.350.04  83.960.06  83.540.06  83.370.02  
5  83.740.03  84.850.13  83.570.07  83.680.02  83.610.04  83.700.04  83.330.04  82.990.04  
25  84.190.21  85.490.12  83.580.15  83.390.04  83.260.04  83.760.04  83.300.04  82.850.03  
4  1  84.380.03  84.630.11  83.440.05  84.590.06  84.810.19  84.060.06  83.520.06  83.180.05  
5  83.930.07  85.410.09  83.170.09  83.890.09  84.200.15  84.150.05  83.410.04  82.950.07  
25  84.120.07  85.420.07  83.200.08  83.520.06  83.800.04  84.220.13  83.390.04  82.820.02  
OMNIGLOT  1 —  1  105.110.11  106.710.08  105.450.08  110.810.32  106.810.07  107.210.14  105.890.06  105.470.09 
5  104.680.21  106.830.09  105.340.05  112.260.70  106.160.11  106.860.10  104.940.05  104.420.09  
25  104.380.15  106.850.07  105.380.14  111.920.30  105.750.10  106.880.09  104.490.07  103.980.05  
1  1  102.950.07  101.840.08  101.880.06  103.500.06  102.740.08  102.230.08  101.860.06  101.700.01  
5  102.450.08  102.130.11  101.670.07  102.150.04  102.000.09  101.590.06  101.220.05  101.000.02  
25  102.740.05  102.660.09  101.800.15  101.420.04  101.600.09  101.480.04  100.930.07  100.600.05  
2  1  103.100.31  101.340.04  100.420.03  102.070.16  102.840.23  100.380.09  99.840.06  99.750.05  
5  100.880.13  100.550.09  99.510.05  100.850.02  101.430.11  99.930.07  99.570.06  99.240.05  
25  100.550.08  100.310.15  99.490.07  100.200.02  100.450.08  100.100.28  99.590.16  98.930.05  
4  1  104.630.47  101.580.22  100.420.08  102.910.25  103.430.10  100.850.12  99.920.11  99.650.09  
5  101.770.20  101.010.09  99.520.09  101.790.25  101.820.13  100.320.19  99.610.07  99.130.10  
25  100.890.13  100.370.09  99.430.14  100.730.08  100.970.21  99.920.30  99.360.09  98.880.09 
Given the superior performance of the models obtained using the meanfield approximation of Sec. 3.1.1 to , we investigate the accuracy of this approximation. In Fig. 2(a), we show that the meanfield model converges quickly by plotting the KL divergence of Eq. (6) with the number of meanfield iterations for a single . To assess the quality of the meanfield approximation, in Fig. 2(b) we compute the KL divergence for randomly selected s during training at different iterations for exponential and powerfunction smoothings with different s. As it can be seen, throughout the training the KL value is typically . For larger s, the KL value is smaller due to the stronger bias that imposes on .
Lastly, we demonstrate that the lower variance of powerfunction smoothing may contribute to its success. As noted in Fig. 1, powerfunction smoothing potentially has moderate gradient noise while still providing a good approximation of binary variables at large . We validate this hypothesis in Fig. 2(c) by measuring the variance of the derivative of the variational bound (with ) with respect to the logit of during training of a 2layer nonlinear model on MNIST. When comparing the exponential () to powerfunction smoothing () at the that performs best for each smoothing method, we find that powerfunction smoothing has significantly lower variance.
5 Conclusions
We have introduced two approaches for relaxing Boltzmann machines to continuous distributions, and shown that the resulting distributions can be trained as priors in DVAEs using an importanceweighted bound. We have proposed a generalization of overlapping transformations that removes the need for computing the inverse CDF analytically. Using this generalization, the mixture of powerfunction smoothing provides a good approximation of binary variables while the gradient noise remains moderate. In the case of sharp power smoothing, our model outperforms previous discrete VAEs.
References
 [1] Diederik Kingma and Max Welling. Autoencoding variational Bayes. In The International Conference on Learning Representations (ICLR), 2014.

[2]
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.
InInternational Conference on Machine Learning
, 2014.  [3] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
 [4] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2014.
 [5] Yuchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems. 2017.
 [6] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
 [7] Rafael GómezBombarelli, Jennifer N Wei, David Duvenaud, José Miguel HernándezLobato, Benjamín SánchezLengeling, Dennis Sheberla, Jorge AguileraIparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán AspuruGuzik. Automatic chemical design using a datadriven continuous representation of molecules. ACS Central Science, 2016.
 [8] Matt J Kusner, Brooks Paige, and José Miguel HernándezLobato. Grammar variational autoencoder. In International Conference on Machine Learning, 2017.
 [9] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning longterm structure in music. In International Conference on Machine Learning, 2018.
 [10] Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for conditional program generation. In The International Conference on Learning Representations, 2018.
 [11] Jörg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo Jimenez Rezende. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pages 3923–3932, 2017.

[12]
Nicolas Le Roux and Yoshua Bengio.
Representational power of restricted Boltzmann machines and deep belief networks.
Neural computation, 2008. 
[13]
Ruslan Salakhutdinov and Geoffrey E. Hinton.
Deep Boltzmann machines.
In
International Conference on Artificial Intelligence and Statistics
, 2009.  [14] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2008.
 [15] Tu Dinh Nguyen, Dinh Phung, Viet Huynh, and Trung Le. Supervised restricted Boltzmann machines. In UAI, 2017.
 [16] Brian Sallans and Geoffrey E. Hinton. Reinforcement learning with factored states and actions. J. Mach. Learn. Res., 5:1063–1088, December 2004.
 [17] Geoffrey E. Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [18] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 791–798, New York, NY, USA, 2007. ACM.
 [19] Jason Tyler Rolfe. Discrete variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.
 [20] Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. DVAE++: Discrete variational autoencoders with overlapping transformations. In International Conference on Machine Learning (ICML), 2018.
 [21] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In The International Conference on Learning Representations (ICLR), 2016.
 [22] John Hertz, Richard Palmer, and Anders Krogh. Introduction to the theory of neural computation. 1991.
 [23] J Hubbard. Calculation of partition functions. Physical Review Letters, 3(2):77, 1959.
 [24] Zakkula Govindarajulu. Characterization of the exponential and power distributions. Scandinavian Actuarial Journal, 1966(34):132–136, 1966.
 [25] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
 [26] Lars Maaløe, Marco Fraccaro, and Ole Winther. Semisupervised generation with clusteraware generative models. arXiv preprint arXiv:1704.00637, 2017.
 [27] Michalis Titsias RC AUEB and Miguel LázaroGredilla. Local expectation gradients for black box variational inference. In Advances in neural information processing systems, pages 2638–2646, 2015.
 [28] Seiya Tokui and Issei Sato. Evaluating the variance of likelihoodratio gradient estimators. In International Conference on Machine Learning, pages 3414–3423, 2017.
 [29] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumblesoftmax. In International Conference on Learning Representations, 2017.
 [30] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [31] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.

[32]
Chris J Maddison, Andriy Mnih, and Yee Whye Teh.
The concrete distribution: A continuous relaxation of discrete random variables.
In International Conference on Learning Representations (ICLR), 2017.  [33] Ronald J Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
 [34] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
 [35] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In The International Conference on Learning Representations (ICLR), 2016.
 [36] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
 [37] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha SohlDickstein. REBAR: Lowvariance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633, 2017.
 [38] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for blackbox gradient estimation. In International Conference on Learning Representations (ICLR), 2018.
 [39] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
 [40] Max Welling and Geoffrey E Hinton. A new learning algorithm for mean field Boltzmann machines. In International Conference on Artificial Neural Networks, pages 351–357. Springer, 2002.
 [41] Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles A Sutton. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 3194–3202, 2012.
 [42] Alex Graves. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
 [43] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.
 [44] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 [45] K Hukushima and Y Iba. Population annealing and its application to a spin glass. In AIP Conference Proceedings, volume 690, pages 200–206. AIP, 2003.
 [46] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
 [47] Radford M. Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
 [48] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.
Appendix A Population Annealing vs. Persistence Contrastive Divergence
In this section, we compare population annealing (PA) to persistence contrastive divergence (PCD) for sampling in the negative phase. In Table 2, we train DVAE# with the powerfunction smoothing on the binarized MNIST dataset using PA and PCD. As shown, PA results in a comparable generative model when there is one group of latent variables and better models in other cases.
Struct.  K  PCD  PA 

1 —  1  89.250.04  89.350.06 
5  88.180.08  88.250.03  
25  87.660.09  87.670.07  
1  1  84.950.05  84.930.02 
5  84.250.04  84.210.02  
25  83.910.05  83.930.06  
2  1  83.480.04  83.370.02 
5  83.120.04  82.990.04  
25  83.060.03  82.850.03  
4  1  83.620.06  83.180.05 
5  83.340.06  82.950.07  
25  83.180.05  82.820.02 
Appendix B On the Gradient Variance of the Powerfunction Smoothing
Our experiments show that powerfunction smoothing performs best because it provides a better approximation of the binary random variables. We demonstrate this qualitatively in Fig. 1 and quantitatively in Fig. 2(c) of the paper. This is also visualized in Fig. 3. Here, we generate samples from for using both the exponential and power smoothings with different values of ( for exponential, and for power smoothing). The value of is increasing from left to right on each curve. The mean of (for ) vs. the variance of is visualized in this figure. For a given gradient variance, power function smoothing provides a closer approximation to the binary variables.
Comments
There are no comments yet.