Boltzmann Encoded Adversarial Machines

by   Charles K. Fisher, et al.

Restricted Boltzmann Machines (RBMs) are a class of generative neural networks that are typically trained to maximize a log-likelihood objective function. We argue that likelihood-based training strategies may fail because the objective does not sufficiently penalize models that place a high probability in regions where the training data distribution has low probability. To overcome this problem, we introduce Boltzmann Encoded Adversarial Machines (BEAMs). A BEAM is an RBM trained against an adversary that uses the hidden layer activations of the RBM to discriminate between the training data and the probability distribution generated by the model. We present experiments demonstrating that BEAMs outperform RBMs and GANs on multiple benchmarks.



I Introduction

A machine learning model is generative if it learns to draw new samples from an unknown probability distribution. Generative models have two important applications. First, generative models enable simulations of systems with unknown, or very complicated, mechanistic laws. For example, generative models can be used to design molecular compounds with desired properties Kadurin et al. (2017). Second, in the process of learning to generate samples from a distribution, a generative model must learn a useful representation of the data. Therefore, generative models enable unsupervised learning with unlabeled data Hinton and Sejnowski (1999).

The last decade has produced revolutionary advances in machine learning, largely due to progress in training neural networks. Much of this progress has been on discriminative models rather than generative models. Still, neural generative models such as Restricted Boltzmann Machines (RBMs) Hinton and Salakhutdinov (2006); Salakhutdinov and Hinton (2009), Variational Autoencoders (VAEs) Kingma and Welling (2013); Rolfe (2016); Kuleshov and Ermon (2017), and Generative Adversarial Networks (GANs) Goodfellow et al. (2014) have demonstrated promising results on a number of problems. GANs, in particular, are generally regarded as the current state-of-the-art Karras et al. (2017).

Unlike most other generative models, GANs are trained to minimize a distance between the data and model distributions rather than to maximize the likelihood of the data under the model Arjovsky and Bottou (2017); Nowozin et al. (2016). As a result of the form of this distance function, and because they are built on feedforward neural networks, typical formulations of GANs can be trained using standard backpropagation Rumelhart et al. (1986). However, GANs have their drawbacks. GAN training can be difficult and unstable Arjovsky and Bottou (2017); Arjovsky et al. (2017a). Moreover, although one of the main advantages of GANs is that they can be trained end-to-end using backpropagation, recent state-of-the-art approaches have used a layerwise training strategy Karras et al. (2017) reminiscent of methods used to train Deep Boltzmann Machines Hinton and Salakhutdinov (2012).

Figure 1: Architecture of a BEAM. (a) The generator of a GAN is a feed-forward neural network that transforms random noise into an image, and the adversary is a feed-forward neural network that classifies the input image. (b) A BEAM uses an RBM generator trained to minimize an objective function that combines the negative log-likelihood and an adversarial loss. The adversarial loss is computed by a critic trained on the activations of the hidden units of the generator.

The popularity of RBM-based generative models, including Deep Belief Networks and Deep Boltzmann Machines, has faded in recent years. The charge is that other approaches, especially GANs, simply work better in practice. However, RBM-based architectures do have some advantages. For example, RBMs can be easily adapted for use on multimodal data sets Srivastava and Salakhutdinov (2012) and on time series Taylor et al. (2007); Taylor and Hinton (2009); Sutskever et al. (2009) without major modifications, and RBMs allow one to perform both generation and inference with a single model. Given that RBMs and derived models generally have sufficient representational power to learn essentially any distribution Le Roux and Bengio (2008), the difficulties must arise during training.

In this work, we take inspiration from GANs to propose a new method for training RBMs. We call the resulting model a Boltzmann Encoded Adversarial Machine (BEAM; see Figure 1). While the adversarial concept used in BEAMs is similar to GANs, there are some distinct features. The primary one is that the adversary operates on the hidden layer activations of the RBM. Because the latent variable representation from the RBM is a consolidated representation of the visible units, simple adversaries – even ones that do not need to be trained – are often sufficient to obtain good results. This makes training simple and stable. Furthermore, we obtain our best results by optimizing a convex combination of the log-likelihood and adversarial objectives. The component of the objective from the log-likelihood allows the training data to play an active role in determining the gradient (while it only plays a passive role as part of the discriminator in the adversarial gradient, as it also does in GANs).

BEAMs achieve excellent results on a variety of applications, from low dimensional benchmark datasets to higher dimensional applications such as image generation, outperforming GANs of similar or higher complexity. These results indicate that BEAMs provide a powerful approach to unsupervised learning.

This paper is structured as follows. We begin with a brief review of RBMs, then discuss some problems with maximum likelihood training of RBMs, and go on to define and describe BEAMs. Finally, we present the results of experiments comparing RBMs, GANs, and BEAMs and discuss their implications.

Figure 2: Comparison of distances between distributions. We consider the distance between a mixture of two Gaussian distributions separated by a given distance and a single Gaussian distribution with the same mean and standard deviation as the mixture. The forward KL divergence increases slowly as the separation increases, while the reverse KL divergence and discriminator divergence increase rapidly.

II Theory and Methods

II.1 Restricted Boltzmann Machines

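An RBM is an energy-based model with two layers of neurons. The visible units $v$ describe the data and the hidden units $h$ capture interactions between the visible units. The joint probability distribution is defined by an energy function:

$$p(v,h) = Z^{-1} e^{-E(v,h)}, \qquad E(v,h) = -\sum_i a_i(v_i) - \sum_\mu b_\mu(h_\mu) - \sum_{i\mu} W_{i\mu} \frac{v_i}{\sigma_i^2} \frac{h_\mu}{\epsilon_\mu^2}, \tag{1}$$

with a partition function $Z = \int dv\, dh\, e^{-E(v,h)}$. This formulation, where $a_i(\cdot)$ and $b_\mu(\cdot)$ are generic functions and $\sigma_i$ and $\epsilon_\mu$ are scale parameters, is a flexible way of writing a generic RBM that encompasses common models such as Bernoulli RBMs and Gaussian RBMs. The key feature of an RBM is the conditional independence of the layers, i.e. $p(v \mid h) = \prod_i p(v_i \mid h)$ and $p(h \mid v) = \prod_\mu p(h_\mu \mid v)$, which allows one to sample from the distribution using block Gibbs sampling.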
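To make block Gibbs sampling concrete, the sketch below alternates between the two conditional distributions of a Bernoulli-Bernoulli RBM. It is an illustration under stated assumptions only: binary units, unit scale parameters, and linear per-unit functions with biases `a` and `b`. The function names are hypothetical and are not taken from the authors' code.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_hidden(v, W, b, rng):
    """Draw all hidden units at once; they are independent given v."""
    h = []
    for mu in range(len(b)):
        field = b[mu] + sum(v[i] * W[i][mu] for i in range(len(v)))
        h.append(1 if rng.random() < sigmoid(field) else 0)
    return h


def sample_visible(h, W, a, rng):
    """Draw all visible units at once; they are independent given h."""
    v = []
    for i in range(len(a)):
        field = a[i] + sum(W[i][mu] * h[mu] for mu in range(len(h)))
        v.append(1 if rng.random() < sigmoid(field) else 0)
    return v


def gibbs_step(v, W, a, b, rng):
    """One block Gibbs update: sample h given v, then v given h."""
    h = sample_hidden(v, W, b, rng)
    return sample_visible(h, W, a, rng), h


# Usage: run a short chain on a tiny RBM with 3 visible and 2 hidden units.
rng = random.Random(0)
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
v = [1, 0, 1]
for _ in range(10):
    v, h = gibbs_step(v, W, [0.0, 0.0, 0.0], [0.0, 0.0], rng)
```

Because each layer is sampled in a single pass given the other, one Gibbs step costs two matrix-vector products regardless of the number of units.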

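RBMs are typically trained to maximize the log-likelihood $\mathcal{L}$ using algorithms such as Persistent Contrastive Divergence (PCD) Tieleman (2008); Hinton (2006). The derivative of the log-likelihood with respect to a model parameter $\theta$ takes the form Ackley et al. (1985):

$$\partial_\theta \mathcal{L} = \left\langle -\partial_\theta E(v,h) \right\rangle_{\mathrm{data}} - \left\langle -\partial_\theta E(v,h) \right\rangle_{\mathrm{model}}. \tag{2}$$
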
The two averages are computed using samples from the data set and samples drawn from the model by Gibbs sampling, respectively. We refer the reader to foundational works such as Hinton (2010) for more detail.
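As an illustration of the difference of averages, the sketch below estimates the gradient with respect to the weight matrix for a Bernoulli-Bernoulli RBM, for which the negative energy derivative with respect to a weight is the product of the connected visible and hidden units. The energy form and the function names are assumptions of this sketch. In PCD the model batch would come from persistent Gibbs chains.

```python
def mean_outer(vs, hs):
    """Average of the outer products v h^T over a batch of (v, h) pairs."""
    n_v, n_h = len(vs[0]), len(hs[0])
    out = [[0.0] * n_h for _ in range(n_v)]
    for v, h in zip(vs, hs):
        for i in range(n_v):
            for mu in range(n_h):
                out[i][mu] += v[i] * h[mu] / len(vs)
    return out


def weight_gradient(v_data, h_data, v_model, h_model):
    """Log-likelihood gradient for W: <v h^T>_data - <v h^T>_model."""
    positive = mean_outer(v_data, h_data)
    negative = mean_outer(v_model, h_model)
    return [[p - n for p, n in zip(p_row, n_row)]
            for p_row, n_row in zip(positive, negative)]


# Usage: one data sample and one fantasy particle, 2 visible units, 1 hidden unit.
grad = weight_gradient([[1, 0]], [[1]], [[0, 1]], [[1]])
```

The gradient vanishes when the data and model statistics agree, which is the fixed point of maximum likelihood training.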

II.2 The Problem with Maximum Likelihood

A generative model defined by parameters $\theta$ describes the probability $p_\theta(v)$ of observing a visible state $v$. Therefore, training a generative model involves minimizing a distance between the distribution of the data, $p_d(v)$, and the distribution defined by the model, $p_\theta(v)$. The traditional algorithms for training RBMs maximize the log-likelihood, which is equivalent to minimizing the forward Kullback-Leibler (KL) divergence Kullback and Leibler (1951):

$$D_{KL}(p_d \,\|\, p_\theta) = \int dv\, p_d(v) \log \frac{p_d(v)}{p_\theta(v)}. \tag{3}$$

To illustrate some problems with maximum likelihood, we will compare the forward KL divergence to the reverse KL divergence,

$$D_{KL}(p_\theta \,\|\, p_d) = \int dv\, p_\theta(v) \log \frac{p_\theta(v)}{p_d(v)}. \tag{4}$$

The forward KL divergence, $D_{KL}(p_d \,\|\, p_\theta)$, accumulates differences between the data and model distributions weighted by the probability under the data distribution. The reverse KL divergence, $D_{KL}(p_\theta \,\|\, p_d)$, accumulates differences between the data and model distributions weighted by the probability under the model distribution. As a result, the forward KL divergence strongly punishes models that underestimate the probability of the data, whereas the reverse KL divergence strongly punishes models that overestimate the probability of the data. Figure 2 illustrates the difference between the metrics on a simple problem.
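The asymmetry can be checked numerically on a toy problem of the kind described for Figure 2. The sketch below takes the data distribution to be an equal mixture of two unit-variance Gaussians separated by `sep`, and the model to be a single Gaussian with the same mean and variance. The unit component variance, the grid integration, and the function names are all assumptions made for illustration.

```python
import math


def log_normal_pdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2


def log_mixture_pdf(x, sep):
    """Log density of 0.5 N(-sep/2, 1) + 0.5 N(+sep/2, 1), via log-sum-exp."""
    a = log_normal_pdf(x, -sep / 2.0, 1.0)
    b = log_normal_pdf(x, sep / 2.0, 1.0)
    top = max(a, b)
    return top + math.log(0.5 * math.exp(a - top) + 0.5 * math.exp(b - top))


def kl_pair(sep, step=0.01):
    """Return (forward KL, reverse KL) between the mixture (data) and the
    moment-matched Gaussian (model), integrated on a uniform grid."""
    std_model = math.sqrt(1.0 + sep ** 2 / 4.0)
    half_width = 10.0 * std_model
    n_points = int(2.0 * half_width / step)
    forward = 0.0
    reverse = 0.0
    for i in range(n_points + 1):
        x = -half_width + i * step
        log_p = log_mixture_pdf(x, sep)
        log_q = log_normal_pdf(x, 0.0, std_model)
        forward += math.exp(log_p) * (log_p - log_q) * step
        reverse += math.exp(log_q) * (log_q - log_p) * step
    return forward, reverse


# Usage: compare the two divergences at a small and a large separation.
forward_small, reverse_small = kl_pair(2.0)
forward_large, reverse_large = kl_pair(6.0)
```

In this setup the forward divergence grows only logarithmically with the separation, while the reverse divergence grows roughly quadratically. The single Gaussian places mass between the modes, where the mixture has almost none, and only the reverse divergence weights that region heavily.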

There are a variety of sources of stochasticity that enter into the training of an RBM. For example, the model moments have to be estimated using random sampling by Markov Chain Monte Carlo methods, and the gradients are almost always computed from minibatches of data rather than the whole data set. The stochasticity implies that different models may become statistically indistinguishable if the differences in their log-likelihoods are smaller than the errors in estimating them. This creates an entropic force because there will be many more models with a small forward KL divergence, $D_{KL}(p_d \,\|\, p_\theta)$, than there are models with both a small $D_{KL}(p_d \,\|\, p_\theta)$ and a small reverse KL divergence, $D_{KL}(p_\theta \,\|\, p_d)$. As a result, training an RBM using a standard approach with PCD decreases $D_{KL}(p_d \,\|\, p_\theta)$ (as it should) but tends to increase $D_{KL}(p_\theta \,\|\, p_d)$. This leads to distributions with spurious modes and/or to distributions that are oversmoothed.

II.3 Advantages of Adversarial Training

One can imagine overcoming the limitations of maximum likelihood training of RBMs by minimizing a combination of the forward and reverse KL divergences. Unfortunately, computing the reverse KL divergence requires knowledge of $p_d(v)$, which is unknown. Therefore, we introduce a new type of f-divergence that we call a discriminator divergence:

$$D_D(p_d \,\|\, p_\theta) = -\int dv\, p_\theta(v) \log \frac{2\, p_d(v)}{p_d(v) + p_\theta(v)}. \tag{5}$$

Notice that the optimal discriminator between $p_d$ and $p_\theta$ will assign a posterior probability

$$p(\mathrm{data} \mid v) = \frac{p_d(v)}{p_d(v) + p_\theta(v)} \tag{6}$$

that the sample $v$ was drawn from the data distribution. Therefore, we can write the discriminator divergence as

$$D_D(p_d \,\|\, p_\theta) = -\left\langle \log \left( 2\, p(\mathrm{data} \mid v) \right) \right\rangle_{v \sim p_\theta} \tag{7}$$

to show that it measures the probability that the optimal discriminator will incorrectly classify a sample drawn from the model distribution as coming from the data distribution.

The discriminator divergence belongs to the class of f-divergences defined as $D_f(p \,\|\, q) = \int dx\, q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$. The function that defines the discriminator divergence is

$$f(t) = \log \frac{t + 1}{2t}, \tag{8}$$

which is convex with $f(1) = 0$ as required. It is easy to show that the discriminator divergence upper bounds the reverse KL divergence:

$$D_{KL}(p_\theta \,\|\, p_d) \le 2\, D_D(p_d \,\|\, p_\theta).$$

We introduce this relationship because we usually do not have access to $p_d(v)$ directly and cannot compute the reverse KL divergence. However, we can train a discriminator to approximate Equation 6 and, therefore, can approximate the discriminator divergence.
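The relationship between the two quantities can be checked on small discrete distributions. The sketch below assumes the discriminator divergence has the form $-\sum_v p_\theta(v) \log\left(2 p_d(v) / (p_d(v) + p_\theta(v))\right)$ and that the bound carries a factor of two, $D_{KL}(p_\theta \,\|\, p_d) \le 2 D_D$. Both are assumptions of this illustration, as are the function names. Under that form the factor follows from $(1+t)^2 \ge 4t$ with $t = p_d / p_\theta$.

```python
import math


def reverse_kl(p_data, p_model):
    """D_KL(p_model || p_data) for discrete distributions with p_data > 0."""
    return sum(q * math.log(q / p) for p, q in zip(p_data, p_model) if q > 0)


def discriminator_divergence(p_data, p_model):
    """Assumed form: -sum_v p_model(v) * log(2 p_data(v) / (p_data(v) + p_model(v)))."""
    return -sum(q * math.log(2.0 * p / (p + q))
                for p, q in zip(p_data, p_model) if q > 0)


# Usage: a model that overestimates the probability of the first state.
p_data = [0.5, 0.5]
p_model = [0.9, 0.1]
rkl = reverse_kl(p_data, p_model)
dd = discriminator_divergence(p_data, p_model)
```

The divergence is zero when the two distributions coincide, because the optimal discriminator then assigns probability one half to every sample.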

A generator that is able to trick the discriminator so that $p(\mathrm{data} \mid v) \approx 1/2$ for all samples drawn from $p_\theta$ will have a low discriminator divergence. The discriminator divergence closely mirrors the reverse KL divergence and strongly punishes models that overestimate the probability of the data (Figure 2). Therefore, as with GANs, we hypothesized that it may be possible to improve the training of RBMs using an adversary. Some previous research in this direction includes the Wasserstein RBM Montavon et al. (2016) and Associate Adversarial Networks Arici and Celikyilmaz (2016).

II.4 Boltzmann Encoded Adversarial Machines (BEAMs)

We introduce a method – called a Boltzmann Encoded Adversarial Machine (BEAM) – for training an RBM against an adversary. A BEAM minimizes a loss function that is a combination of the negative log-likelihood and an adversarial loss. The adversarial component ensures that BEAM training performs a simultaneous minimization of both the forward and reverse KL divergences, which prevents the oversmoothing problem observed with regular RBMs.

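The architecture of a BEAM is very simple, and is illustrated in Figure 1. The RBM (the generative model) is trained with an objective,

$$\mathcal{C} = -\gamma \mathcal{L} - (1 - \gamma) \mathcal{A}, \tag{9}$$

that includes a contribution from an adversarial term, $\mathcal{A}$, with $\gamma \in [0, 1]$ setting the relative weight of the likelihood and adversarial terms. In theory, the adversary could be any model that can be trained to approximate the optimal discriminator.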

We take inspiration from GANs and train the RBM against a critic function. However, we use a critic function that acts on the hidden unit activations rather than the visible units. That is, the adversary uses the same architecture and weights as the RBM, and encodes visible units into hidden unit activations. These hidden unit activations, computed for both the data and fantasy particles sampled from the RBM, are used by a critic $T(h)$ to estimate the distance between the data and model distributions. Thus, the BEAM adversarial term is

$$\mathcal{A} = \int dh\, p_\theta(h)\, T(h). \tag{10}$$

This term has a straightforward interpretation: for any sensible critic, it is minimizing the distance between the marginal distributions of the hidden units under the data and model distributions.

For example, suppose that we had access to the optimal discriminator (on the hidden units):

$$p(\mathrm{data} \mid h) = \frac{p_d(h)}{p_d(h) + p_\theta(h)}, \tag{11}$$

where $p_d(h) = \int dv\, p_\theta(h \mid v)\, p_d(v)$. Then, we could define a critic to minimize the discriminator divergence (on the hidden units) using $T(h) = \log p(\mathrm{data} \mid h)$. In practice, however, we found that we obtain better results with a linear critic:

$$T(h) = 2\, p(\mathrm{data} \mid h) - 1. \tag{12}$$

Therefore, all experiments that follow use a linear critic. We use the form $2\, p(\mathrm{data} \mid h) - 1$ so that the sign of the critic indicates the best guess of the optimal discriminator, but this choice is not important since it only ends up scaling the derivative by a factor of two.

In practice, of course, we don’t have access to the optimal discriminator. The usual remedy for GANs is to co-train a neural network to approximate it. In our case, we hypothesized that a simple approximation to the optimal discriminator will be sufficient because we are working with the hidden unit activities of the RBM generator rather than the visible units. Therefore, we simply approximate the optimal critic using nearest neighbor methods. In our examples, we store the data and fantasy particles from the previous minibatch and use a distance-weighted nearest neighbor approximation.

A BEAM can be trained using stochastic gradient descent by computing model averages from persistent fantasy particles in the same way as with maximum likelihood training of an RBM. The derivative of the adversarial term with respect to a model parameter $\theta$ is

$$\partial_\theta \mathcal{A} = -\mathrm{Cov}_{p_\theta}\left[ T, \partial_\theta E \right], \tag{13}$$

where the covariance is computed with respect to the model distribution $p_\theta$. A derivation of this result is presented in the Supplementary Material. It is also possible to define a critic on the visible units directly, or to use a method other than a nearest neighbor approximation. We present some comparisons of BEAMs with other critics in the Supplementary Material.
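A covariance-form gradient of this kind can be estimated directly from a batch of fantasy particles. The sketch below assumes a Bernoulli-Bernoulli RBM, for which the energy derivative with respect to a weight is minus the product of the connected units, so that the gradient for the weights is $\mathrm{Cov}[T, v h^T]$. It also treats the critic values as fixed numbers. Both choices, and the function name, are assumptions made for illustration.

```python
def adversarial_weight_gradient(vs, hs, critic_values):
    """Estimate Cov[T, v h^T] over fantasy particles (biased sample covariance).

    vs, hs: lists of visible and hidden states, one pair per particle.
    critic_values: the critic value T for each particle, held fixed.
    """
    n = len(vs)
    n_v, n_h = len(vs[0]), len(hs[0])
    mean_t = sum(critic_values) / n
    grad = [[0.0] * n_h for _ in range(n_v)]
    for v, h, t in zip(vs, hs, critic_values):
        for i in range(n_v):
            for mu in range(n_h):
                grad[i][mu] += (t - mean_t) * v[i] * h[mu] / n
    return grad


# Usage: two fantasy particles, one judged data-like (+1), one model-like (-1).
adv_grad = adversarial_weight_gradient([[1], [0]], [[1], [1]], [1.0, -1.0])
```

A critic that returns the same value for every particle gives a zero gradient, so the adversarial term only acts when the critic can tell some fantasy particles apart from others.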

In the context of most formulations of GANs, which use feed-forward neural networks for both the generator and the discriminator, one could say that BEAMs use the RBM as both the generator and as a feature extractor for the adversary. This double usage allows us to reuse a single set of fantasy particles for multiple steps of the training algorithm. Specifically, we maintain a single set of persistent fantasy particles that are updated a fixed number of times per gradient evaluation. The same set of fantasy particles is used to compute the log-likelihood derivative (Equation 2) and the adversarial derivative (Equation 13). Then, these fantasy particles replace the fantasy particles from the previous gradient evaluation in the nearest neighbor estimates of the critic value. Reusing the fantasy particles for each step means that BEAM training has roughly the same computational cost as training an RBM with PCD.

II.5 Nearest Neighbor Critics

Suppose $X = \{x_1, \ldots, x_n\}$ are i.i.d. samples from an unknown probability distribution with pdf $p$ in $\mathbb{R}^d$. One simple way to estimate $p$ at an arbitrary point $x_0$ is to make use of a $k$-nearest-neighbor estimate. Specifically: fix some positive integer $k$ and compute the $k$ nearest neighbors to $x_0$ in $X$. Define $r_k$ to be the distance between $x_0$ and the furthest of the nearest neighbors. Then estimate the density $p(x_0)$ to be the density of the uniform distribution on a ball of radius $r_k$. That is,

$$p(x_0) \approx \frac{k}{n} \frac{1}{V_d\, r_k^d}, \tag{14}$$

where $V_d$ is the volume of the unit ball in $\mathbb{R}^d$.

Now denote by $p_\theta$ and $p_d$ the unknown pdfs of the model and data distributions respectively. Suppose $X$ is a collection of i.i.d. samples exactly half of which are drawn from $p_\theta$ and half from $p_d$. We can use the same idea to estimate the ratio $p_d(x_0) / (p_d(x_0) + p_\theta(x_0))$. Fix some $k$ and compute the $k$ nearest neighbors in $X$, denoting by $r_k$ the distance to the furthest. Then we estimate the denominator as in (14). Let $j$ be the number of nearest neighbors which come from $p_d$ as opposed to $p_\theta$. The numerator then can be estimated as uniform on the same size ball with only $j/k$ of the density of the denominator. As a result the desired estimate is simply the ratio $j/k$.

We put this concept into action by defining the nearest-neighbor critic. Suppose that we have cached a minibatch of samples from the model and a minibatch of samples from the training dataset. For any new sample $x_0$ we can compute the $k$-nearest neighbors from the joined minibatches for some fixed $k$. Then the nearest-neighbor critic is defined as the function which assigns to $x_0$ the ratio $j/k$, in which $j$ is the number of nearest neighbors originating from the data minibatch as opposed to the model minibatch.


The distance-weighted nearest-neighbor critic is a generalization which attempts to add some continuity to the nearest-neighbor critic by applying an inverse distance weighting to the ratio count. Specifically, let $d_1, \ldots, d_k$ be the distances of the $k$-nearest neighbors in $X$ to some $x_0$, with $\{d_i\}_{i \in \mathrm{data}}$ the distances for the neighbors originating from the data samples and $\{d_i\}_{i \in \mathrm{model}}$ the distances for the neighbors originating from the model samples. Then the distance-weighted nearest-neighbor critic is defined as:

$$T(x_0) = \frac{\sum_{i \in \mathrm{data}} (d_i + \epsilon)^{-1}}{\sum_{i=1}^{k} (d_i + \epsilon)^{-1}},$$

where $\epsilon$ regularizes the inverse distance.
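The two critics described above can be sketched in a few lines. The version below returns the plain ratio for the nearest-neighbor critic and an inverse-distance-weighted ratio for the generalization. The exact normalization of the weighted critic, the default value of `eps`, and the function names are assumptions of this sketch rather than details confirmed by the text.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_labeled(x, data_batch, model_batch, k):
    """Return the k nearest (distance, label) pairs; label 1 = data, 0 = model."""
    labeled = [(euclidean(x, d), 1) for d in data_batch]
    labeled += [(euclidean(x, m), 0) for m in model_batch]
    labeled.sort(key=lambda pair: pair[0])
    return labeled[:k]


def knn_critic(x, data_batch, model_batch, k):
    """Fraction j/k of the k nearest neighbors that come from the data batch."""
    nearest = nearest_labeled(x, data_batch, model_batch, k)
    return sum(label for _, label in nearest) / k


def distance_weighted_critic(x, data_batch, model_batch, k, eps=1e-6):
    """Same ratio, with each neighbor weighted by 1 / (distance + eps)."""
    nearest = nearest_labeled(x, data_batch, model_batch, k)
    data_weight = sum(1.0 / (dist + eps) for dist, label in nearest if label == 1)
    total_weight = sum(1.0 / (dist + eps) for dist, _ in nearest)
    return data_weight / total_weight


# Usage: cached minibatches of hidden activations (toy 2-D values here).
data_batch = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
model_batch = [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
plain = knn_critic([0.1, 0.1], data_batch, model_batch, k=4)
weighted = distance_weighted_critic([0.1, 0.1], data_batch, model_batch, k=4)
```

With four neighbors the plain critic counts the distant model sample as much as the nearby data samples, while the weighted critic discounts it, which is the continuity the weighting is meant to add.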

II.6 Temperature Driven Sampling

Finally, we use a simple trick to improve the mixing of the RBM while sampling the fantasy particles. We assign each fantasy particle an independently sampled inverse temperature $\beta$ and define the probability as $p_\beta(v,h) = Z_\beta^{-1} e^{-\beta E(v,h)}$. The inverse temperature is drawn from an autoregressive Gamma process Gouriéroux and Jasiak (2006) with a specified mean, standard deviation, and autocorrelation. The mean is set to 1 and the autocorrelation coefficient is chosen close to 1; the specific values of the standard deviation and autocorrelation are noted in the Supplementary Material. The intuition behind this algorithm is similar to parallel tempering Swendsen and Wang (1986); Geyer (1991); Desjardins et al. (2010a); Brakel et al. (2012); Desjardins et al. (2010b, 2014). When $\beta$ is small, the fantasy particles will be able to explore the space quickly. Setting the mean to 1 ensures that the sampled distribution stays close to the true distribution, while setting the autocorrelation close to 1 ensures that the inverse temperatures evolve slowly relative to the fantasy particles, which can remain in quasi-equilibrium. Unlike parallel tempering, this driven sampling algorithm does not sample from the exact distribution of the RBM. Instead, the driven sampling algorithm samples from a similar distribution that has fatter tails. However, the driven sampling algorithm adds little computational overhead and generally improves training outcomes. Some additional details and simulations are provided in the Supplementary Material.
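One common construction of a first-order autoregressive Gamma process draws a Poisson count whose rate is proportional to the previous value, and then a Gamma variate whose shape is increased by that count. The sketch below uses this construction to generate a sequence of inverse temperatures with stationary mean 1. The parameterization in terms of a target standard deviation and autocorrelation, the illustrative values 0.5 and 0.8, and the function names are assumptions of this sketch, not the settings used in the paper.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def autoregressive_gamma(n_steps, std, autocorr, rng, start=1.0):
    """Inverse temperatures with stationary mean 1, standard deviation `std`,
    and lag-one autocorrelation `autocorr`."""
    shape = 1.0 / std ** 2                # Gamma shape of the stationary law
    scale = (1.0 - autocorr) * std ** 2   # Gamma scale of each conditional draw
    coupling = autocorr / scale           # Poisson rate per unit of the previous value
    beta = start
    path = []
    for _ in range(n_steps):
        count = sample_poisson(coupling * beta, rng)
        beta = rng.gammavariate(shape + count, scale)
        path.append(beta)
    return path


# Usage: one inverse-temperature trajectory for a single fantasy particle.
rng = random.Random(0)
betas = autoregressive_gamma(5000, std=0.5, autocorr=0.8, rng=rng)
mean_beta = sum(betas) / len(betas)
```

Each fantasy particle would carry its own trajectory, and the conditional distributions used in its Gibbs updates would be evaluated with the energy scaled by the current inverse temperature.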

II.7 Using KL Divergences to Monitor Training

We monitor both the forward and reverse KL divergences during training. Following Wang et al. (2009), let $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_m\}$ be samples drawn from densities $p$ and $q$. Let $\rho_i$ be the distance from $x_i$ to its nearest neighbor in $X$, and $\nu_i$ be the distance from $x_i$ to its nearest neighbor in $Y$. Then,

$$D_{KL}(p \,\|\, q) \approx \frac{d}{n} \sum_{i=1}^{n} \log \frac{\nu_i}{\rho_i} + \log \frac{m}{n - 1},$$

where $d$ is the dimension of the space (i.e., the number of visible units). The reverse KL divergence can be computed by reversing the identities of $X$ and $Y$. In practice, we monitor the KL divergences using a held-out validation set consisting of 10% of the data. For computational reasons, we compute the KL divergences on minibatches of the validation set and then average the values.
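A nearest-neighbor estimate of this kind is short enough to write out directly. The sketch below assumes the 1-nearest-neighbor form $\frac{d}{n} \sum_i \log(\nu_i / \rho_i) + \log\frac{m}{n-1}$ and uses a brute-force search. The function names are hypothetical, and repeated points (which give a zero distance) are not handled.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_kl_estimate(xs, ys):
    """Estimate D_KL(p || q) from samples xs ~ p and ys ~ q."""
    n, m, dim = len(xs), len(ys), len(xs[0])
    total = 0.0
    for i, x in enumerate(xs):
        # rho: distance to the nearest other point of the same sample.
        rho = min(euclidean(x, xs[j]) for j in range(n) if j != i)
        # nu: distance to the nearest point of the other sample.
        nu = min(euclidean(x, y) for y in ys)
        total += math.log(nu / rho)
    return dim / n * total + math.log(m / (n - 1))


# Usage: tiny 1-D samples; swapping the arguments gives the reverse divergence.
xs = [(0.0,), (1.0,)]
ys = [(0.5,), (3.0,)]
forward_estimate = knn_kl_estimate(xs, ys)
reverse_estimate = knn_kl_estimate(ys, xs)
```

The brute-force search costs a number of distance evaluations proportional to the product of the sample sizes, which is one practical reason to evaluate the estimate on minibatches and average.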

III Results

Figure 3: Comparison of generative models on mixtures of Gaussians. Three datasets constructed from mixtures of Gaussians: a 1-D mixture of two Gaussians, a 2-D mixture of eight Gaussians arranged in a circle, and a 2-D mixture of Gaussians arranged on a 5x5 grid. Distributions of fantasy particles from a standard RBM, a vanilla GAN, and a Wasserstein GAN (WGAN) are compared to distributions of fantasy particles from an RBM trained with a driven sampler and to a BEAM.

We present empirical results on BEAMs using some datasets that are commonly used to test generative models. We aim to demonstrate four key results:

  1. RBMs produce poor results because the reverse KL divergence increases during training even though the forward KL divergence decreases.

  2. BEAMs trained with a driven sampler minimize both the forward and reverse KL divergences, leading to better results than RBMs trained by standard methods.

  3. BEAMs produce results that are comparable to, or better than, GANs on multiple benchmarks.

  4. The simplicity of the adversary ensures that BEAM training is stable.

III.1 Mixture Models

Our first set of experiments is on a series of 1- and 2-dimensional Mixtures of Gaussians (MoGs) similar to those used in the Wasserstein GAN paper Arjovsky et al. (2017b). We compare the results from five different generative models. Models from the literature include a vanilla GAN Goodfellow et al. (2014); greydanus (2017), a Wasserstein GAN Arjovsky et al. (2017a, b); Arjovsky (2017), and a Gaussian-Bernoulli RBM Cho et al. (2013). Our models include a Gaussian-Bernoulli RBM trained with the driven sampler and a Gaussian-Bernoulli BEAM with equally weighted likelihood and adversarial losses. All of the RBM-based models have the exact same architecture. Details on the model architectures and training parameters are given in the Supplementary Material.

Figure 3 shows a comparison of fantasy particles from each of the generative models along with the corresponding data distributions. A standard RBM trained using persistent contrastive divergence with 100 update steps per gradient evaluation fails to learn that the data distribution has multiple modes. Instead, it spreads the model density across the support of the data distribution. The vanilla GAN and the WGAN are both able to learn the 1-D mixture of two Gaussians and the 2-D mixture of eight Gaussians, but fail on the 2-D MoGs arranged in a 5x5 grid. Surprisingly, our results with the vanilla GAN are significantly better than those reported in the literature Arjovsky et al. (2017b) and are comparable in quality to the results with WGAN. Training the Gaussian-Bernoulli RBM using the driven sampler leads to improvements over the standard RBM. Notably, the BEAM is the only model that learns all three datasets.

Training an RBM as a BEAM decreases both the forward and reverse KL divergences, as shown in the left panel of Figure 4 for the MoGs arranged in a 5x5 grid. In the early stages of training, the BEAM fantasy particles are spread out across the support of the data distribution, capturing the modes near the edge of the grid. These early epochs resemble the distributions obtained with GANs, which also concentrate density in the modes near the edge of the grid. As training progresses, the BEAM progressively learns to capture the modes near the center of the grid.

Figure 4: Training a BEAM on a 2-D Mixture of Gaussians (MoGs) arranged in a 5x5 grid. Left panel shows estimates of the forward KL divergence, $D_{KL}(p_d \,\|\, p_\theta)$, and the reverse KL divergence, $D_{KL}(p_\theta \,\|\, p_d)$, per training epoch. Right panels show distributions of fantasy particles at various epochs during training.

III.2 MNIST

The MNIST dataset of handwritten images LeCun and Cortes (2010) is one of the most widely used benchmarks in machine learning. We present results on MNIST with continuous, grayscale images, and MNIST with binary, black and white images.

We compare five different models on continuous MNIST: a non-convolutional (i.e., fully connected) GAN, a non-convolutional (i.e., fully connected) WGAN, a Gaussian-Bernoulli RBM, a Gaussian-Bernoulli RBM with a temperature driven sampler, and a Gaussian-Bernoulli BEAM. Details of the architectures and training parameters are given in the Supplementary Material. It is important to note that none of these models is designed to produce state-of-the-art results on MNIST; for example, one can get better results using convolutional, rather than fully connected, networks (see Supplementary Material). However, restricting the analyses to the chosen architectures provides a cleaner comparison of the different training approaches.

The critic in a BEAM uses the hidden unit activities as features, but these features are not useful during the early stages of training when there is little mutual information between the visible and hidden units of the generator. Therefore, we train the BEAM in two phases. For the first 25 epochs, we use regular maximum likelihood training with persistent contrastive divergence and driven sampling. After 25 epochs, we change the relative weights of the likelihood and the adversary in the loss function so that the adversarial term dominates the gradient, and train for an additional 30 epochs. The training dynamics are shown in Figure 5.

RBM-based architectures trained by maximum likelihood will decrease the forward KL divergence. This is shown clearly in Figure 5: the forward KL divergence decreases during training of the Gaussian-Bernoulli RBM, the Gaussian-Bernoulli RBM with a driven sampler, and the Gaussian-Bernoulli BEAM. However, the figure also clearly shows that the reverse KL divergence increases during training. The training metrics for the BEAM rapidly diverge from the RBMs once the adversary is turned on after epoch 25. The reverse KL divergence of the BEAM quickly drops towards zero while the reverse KL divergence of the RBMs continues to rise. By the end of training, the BEAM obtains comparable, or better, metrics than all other architectures.

Figure 5: Training metrics on continuous MNIST. The forward KL divergence, $D_{KL}(p_d \,\|\, p_\theta)$, and the reverse KL divergence, $D_{KL}(p_\theta \,\|\, p_d)$, during training on MNIST. Both divergences were estimated using an approach based on $k$-nearest neighbors Wang et al. (2009). Adversarial training for the BEAM begins after epoch 25 (vertical dashed line).

Fantasy particles for continuous MNIST are shown in the top row of Figure 6 along with a table of KL divergences at the end of training. The non-convolutional GAN, non-convolutional WGAN, and the BEAM have similar metrics at the end of training. The errors that they make, however, are qualitatively different. The GANs produce sharp images that are a bit blotchy, whereas the BEAM produces smooth images that are a bit blurry. The regular Gaussian-Bernoulli RBM fails to produce reasonable digits at all, whereas the Gaussian-Bernoulli RBM trained with the driven sampler is a bit better.

Figure 6: Comparison of MNIST fantasy particles. Sixteen particles sampled at random from each of the generators. The RBM fantasy particles were randomly initialized and sampled for 100 MCMC steps; the figure shows the values computed from the last iteration. Note the thick line in the table of KL divergences separating results on continuous MNIST from those on binary MNIST, indicating that these values are not comparable.

A regular Bernoulli-Bernoulli RBM performs a lot better on binary MNIST than a Gaussian-Bernoulli RBM does on continuous MNIST, as shown in the second row of Figure 6. Even though a Bernoulli-Bernoulli RBM is well-suited to modeling binary MNIST, it still learns a model distribution with a low forward KL divergence and a high reverse KL divergence, as shown in the table. Adversarial training of the generator as a Bernoulli-Bernoulli BEAM fixes this problem, leading to a better model.

We do not show any GANs for the binary MNIST problem. In general, it is difficult to train GANs on discrete data due to the inability to backpropagate through a discrete variable (though there are ways around this problem, as in Yu et al. (2017)). Thus, one advantage of a BEAM is that it is much easier to train on discrete data than a GAN and much easier to train on continuous data than a standard RBM.

Throughout, we have presented BEAMs as an adversarial approach to training RBMs where the hidden unit activities of the RBM are used as features for the critic. We claim that this allows us to use a simple classifier to approximate the optimal critic. However, it is possible to train an RBM against an adversary that uses the visible units directly (as in a GAN). Empirically, we have found that applying the critic to the hidden unit activities works better; see Figure S2 for an example.

III.3 Celebrity Faces

Figure 7: Comparison of CelebA fantasy particles. 36 fantasy images sampled from (a) the BEAM with the decoder network, (b) a deep convolutional GAN, and (c) a deep convolutional WGAN. Our implementations were chosen so that the models have very similar architectures. These architectures and training parameters are provided in the Supplementary Material. For comparison, (d) directly reproduces the CelebA results of a deep convolutional WGAN as reported in Li et al. (2017).

Natural images present a more complex dataset for which model performance can be easily determined. We use the CelebA dataset, consisting of pictures of celebrities’ faces, to demonstrate that BEAMs scale to more complex problems. This dataset requires convolutional architectures to obtain good performance. Because exploring convolutional RBMs is orthogonal to the purpose of this work, we use a separately trained convolutional autoencoder to extract features from the images. These features are used as the visible input to the BEAM.

The autoencoder is trained with sufficient depth and number of features to provide high-quality reconstructions of the data. This architecture forms the basis of the convolutions used in training the BEAM, DCGAN, and DCWGAN, as shown in Figure S6 in the Supplementary Material. Sample images from the dataset and their reconstructions are shown in Figure S7 in the Supplementary Material.

As with the MNIST examples, we train the CelebA BEAM in two phases. For the first 15 epochs, we use the log-likelihood objective function with persistent contrastive divergence and driven sampling. After this phase we train for an additional 50 epochs using the combined log-likelihood-adversarial objective function in Equation 9.

For comparison, we train a DCGAN and DCWGAN using the same convolutional architecture as the autoencoder that was used to extract features for BEAM training. That is, the DCGAN/DCWGAN generator uses an initial fully connected layer followed by the same architecture as the decoder of the autoencoder, and the DCGAN/DCWGAN critic uses the same architecture as the encoder of the autoencoder followed by a fully connected layer to a single unit. The DCGAN/DCWGAN share many architectural features with the autoencoder, but do not share any parameters. Instead, the DCGAN and the DCWGAN were trained end-to-end on CelebA.

Images are generated from the BEAM by sampling fantasy particles and passing them through the decoder. Example generated images from the BEAM, DCGAN, and DCWGAN are shown in Figure 7. The BEAM images are internally consistent, with clear features across each face. However, the images are a bit blurry – especially towards the corners of the image in the backgrounds. The images produced by the DCGAN and DCWGAN have sharper local features, but notably poorer global correlations. Although the images produced by the GANs are not particularly high-quality, they are qualitatively similar to results appearing in the literature. To illustrate this, we have directly reproduced fantasy images from a DCWGAN that were published in Li et al. (2017) (see Figure 7d).

We note that it is possible to obtain sharper features from the BEAM at the expense of less consistent images by optimizing training to lower the forward KL divergence at the expense of an increased reverse KL divergence. Furthermore, additional approaches such as centering layers and using deep models produce notably better images; see Figure S8 in the Supplementary Material for an example using centered layers.
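The trade-off between the two divergences can be illustrated on a toy discrete distribution. The sketch below is not taken from the paper's experiments: the three-state distributions and the names `kl`, `q_drop`, and `q_spread` are assumptions chosen for illustration, with "forward" meaning KL(data || model) and "reverse" meaning KL(model || data).

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy "data" distribution: two modes plus one low-probability state.
p_data = [0.49, 0.49, 0.02]
# Hypothetical model that drops the second mode.
q_drop = [0.96, 0.02, 0.02]
# Hypothetical model that spreads mass onto the low-probability state.
q_spread = [0.34, 0.33, 0.33]

models = {"drop": q_drop, "spread": q_spread}
forward = {name: kl(p_data, q) for name, q in models.items()}  # KL(data || model)
reverse = {name: kl(q, p_data) for name, q in models.items()}  # KL(model || data)

for name in models:
    print(f"{name:>6}: forward={forward[name]:.3f} reverse={reverse[name]:.3f}")
```

In this toy setting the forward divergence ranks the spread-out model as better, while the reverse divergence ranks it as worse because it places probability where the data has little. This is the sense in which lowering one divergence can raise the other.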

IV Discussion

We have introduced a novel formulation of RBMs, called BEAMs, that utilize an adversary acting on the hidden unit activations from the RBM to supplement the traditional likelihood-based training. The additional adversarial loss term ensures that training minimizes both the forward and reverse KL divergences, allowing the model to accurately learn distributions with multiple modes. We have shown that BEAMs excel at a variety of applications, outperforming GANs that use significantly larger computational budgets.

As the machine learning community increasingly turns its attention to unsupervised learning problems, it is valuable to place this work into a larger context. The deep learning revolution has driven tremendous advances on supervised learning problems, and a primary outcome is that feed-forward neural networks have become a powerful tool. GANs and variational autoencoders can be thought of as a natural extension of the broad learning capacity of neural networks and the flexibility of backpropagation, and are tools of choice in many applications. This is further supported by the software ecosystem for machine learning, which makes many sophisticated tools easily accessible.

RBMs played an active role in kicking off the deep learning revolution (Hinton and Salakhutdinov, 2006), but their development slowed with the increased focus on supervised learning and a general attitude that they were unsuited to more complex problems. However, there are reasons to believe that RBMs will be fundamental in advancing unsupervised learning:

  • Novel training algorithms and novel model architectures can dramatically improve performance.

  • RBMs have several analytic handles to understand models and develop training strategies.

  • Increased capacity to handle complex datasets can be developed through a progressive set of challenging applications.

We hope this work reinforces the promise of RBMs.


  • Kadurin et al. (2017) A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper,  and A. Zhavoronkov, Molecular Pharmaceutics 14, 3098 (2017).
  • Hinton and Sejnowski (1999) G. E. Hinton and T. J. Sejnowski, Unsupervised learning: foundations of neural computation (MIT Press, 1999).
  • Hinton and Salakhutdinov (2006) G. E. Hinton and R. R. Salakhutdinov, Science 313, 504 (2006).
  • Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. Hinton, in Artificial Intelligence and Statistics (2009) pp. 448–455.
  • Kingma and Welling (2013) D. P. Kingma and M. Welling, arXiv preprint arXiv:1312.6114  (2013).
  • Rolfe (2016) J. T. Rolfe, arXiv preprint arXiv:1609.02200  (2016).
  • Kuleshov and Ermon (2017) V. Kuleshov and S. Ermon, in Advances in Neural Information Processing Systems (2017) pp. 6737–6746.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,  and Y. Bengio, in Advances in neural information processing systems (2014) pp. 2672–2680.
  • Karras et al. (2017) T. Karras, T. Aila, S. Laine,  and J. Lehtinen, arXiv preprint arXiv:1710.10196  (2017).
  • Arjovsky and Bottou (2017) M. Arjovsky and L. Bottou, arXiv preprint arXiv:1701.04862  (2017).
  • Nowozin et al. (2016) S. Nowozin, B. Cseke,  and R. Tomioka, in Advances in Neural Information Processing Systems (2016) pp. 271–279.
  • Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton,  and R. J. Williams, Nature 323, 533 (1986).
  • Arjovsky et al. (2017a) M. Arjovsky, S. Chintala,  and L. Bottou, arXiv preprint arXiv:1701.07875  (2017a).
  • Hinton and Salakhutdinov (2012) G. E. Hinton and R. R. Salakhutdinov, in Advances in Neural Information Processing Systems (2012) pp. 2447–2455.
  • Srivastava and Salakhutdinov (2012) N. Srivastava and R. R. Salakhutdinov, in Advances in neural information processing systems (2012) pp. 2222–2230.
  • Taylor et al. (2007) G. W. Taylor, G. E. Hinton,  and S. T. Roweis, in Advances in neural information processing systems (2007) pp. 1345–1352.
  • Taylor and Hinton (2009) G. W. Taylor and G. E. Hinton, in Proceedings of the 26th annual international conference on machine learning (ACM, 2009) pp. 1025–1032.
  • Sutskever et al. (2009) I. Sutskever, G. E. Hinton,  and G. W. Taylor, in Advances in Neural Information Processing Systems (2009) pp. 1601–1608.
  • Le Roux and Bengio (2008) N. Le Roux and Y. Bengio, Neural computation 20, 1631 (2008).
  • Tieleman (2008) T. Tieleman, in Proceedings of the 25th international conference on Machine learning (ACM, 2008) pp. 1064–1071.
  • Hinton (2006) G. E. Hinton, Training 14 (2006).
  • Ackley et al. (1985) D. H. Ackley, G. E. Hinton,  and T. J. Sejnowski, Cognitive science 9, 147 (1985).
  • Hinton (2010) G. Hinton, Momentum 9, 926 (2010).
  • Kullback and Leibler (1951) S. Kullback and R. A. Leibler, The annals of mathematical statistics 22, 79 (1951).
  • Montavon et al. (2016) G. Montavon, K.-R. Müller,  and M. Cuturi, in Advances in Neural Information Processing Systems (2016) pp. 3718–3726.
  • Arici and Celikyilmaz (2016) T. Arici and A. Celikyilmaz, arXiv preprint arXiv:1611.06953  (2016).
  • Gouriéroux and Jasiak (2006) C. Gouriéroux and J. Jasiak, Journal of Forecasting 25, 129 (2006).
  • Swendsen and Wang (1986) R. H. Swendsen and J.-S. Wang, Physical Review Letters 57, 2607 (1986).
  • Geyer (1991) C. J. Geyer,  (1991).
  • Desjardins et al. (2010a) G. Desjardins, A. Courville, Y. Bengio, P. Vincent,  and O. Delalleau, in Proceedings of the thirteenth international conference on artificial intelligence and statistics (2010) pp. 145–152.
  • Brakel et al. (2012) P. Brakel, S. Dieleman,  and B. Schrauwen, Artificial Neural Networks and Machine Learning–ICANN 2012 , 92 (2012).
  • Desjardins et al. (2010b) G. Desjardins, A. Courville, Y. Bengio, P. Vincent,  and O. Delalleau, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (MIT Press Cambridge, MA, 2010) pp. 145–152.
  • Desjardins et al. (2014) G. Desjardins, H. Luo, A. Courville,  and Y. Bengio, arXiv preprint arXiv:1410.0123  (2014).
  • Wang et al. (2009) Q. Wang, S. R. Kulkarni,  and S. Verdú, IEEE Transactions on Information Theory 55, 2392 (2009).
  • Arjovsky et al. (2017b) M. Arjovsky, S. Chintala,  and L. Bottou, in Proceedings of the 34th International Conference on Machine Learning (2017) pp. 214–223.
  • greydanus (2017) greydanus, “Mnist gan,” (2017).
  • Arjovsky (2017) M. Arjovsky, “Wasserstein gan,” (2017).
  • Cho et al. (2013) K. H. Cho, T. Raiko,  and A. Ilin, in Neural Networks (IJCNN), The 2013 International Joint Conference on (IEEE, 2013) pp. 1–7.
  • LeCun and Cortes (2010) Y. LeCun and C. Cortes,  (2010).
  • Yu et al. (2017) L. Yu, W. Zhang, J. Wang,  and Y. Yu, in AAAI (2017) pp. 2852–2858.
  • Li et al. (2017) C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang,  and B. Póczos, in Advances in Neural Information Processing Systems (2017) pp. 2200–2210.

V Supplementary Material

V.1 Gradients of the Adversarial Loss

In a general adversarial approach to learning, we train a Boltzmann machine, , to minimize a compound objective function . The compound objective function represents a tradeoff between maximum likelihood learning () and adversarial learning (). Just as with maximum likelihood, the compound objective function can be optimized using stochastic gradient descent. Using a compound objective function helps to mitigate some of the instability problems that plague traditional GANs. For example, the gradient does not vanish even if the discriminator is completely untrained because there is always the term from the likelihood.

We need to compute the derivatives of the compound objective function in order to minimize it. The differential operator is linear, so we can distribute it across the two terms:


The first term can be computed from Equation 2 (Main Text). So, all we need to do is compute the second term. It turns out that derivatives of this form can be computed using a simple formula when the model is a Boltzmann machine.
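As a numerical illustration of the kind of formula meant here: for a model of the form p ∝ exp(−E), the gradient of the expectation of a fixed function T is a covariance between T and the energy gradient. The sketch below rests on assumptions that are not recoverable from the text above: the adversarial loss is taken to be A = −E_p[T], the model is a Bernoulli RBM with energy E = −a·v − b·h − vᵀWh, and `adversarial_grad_W` is a hypothetical helper name.

```python
import numpy as np

def adversarial_grad_W(v, h, critic, weights=None):
    """Estimate dA/dW for A = -E_p[T] as Cov_p[T, dE/dW].

    v: (n, n_vis) visible states, h: (n, n_hid) hidden states,
    critic: (n,) critic values T(v, h) for each state or sample.
    weights: optional probabilities for each row (defaults to 1/n,
    i.e. rows are treated as samples from the model).
    """
    v = np.asarray(v, dtype=float)
    h = np.asarray(h, dtype=float)
    t = np.asarray(critic, dtype=float)
    n = v.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)

    # For the assumed RBM energy, dE/dW_ij = -v_i h_j.
    dE = -np.einsum("ni,nj->nij", v, h)

    mean_T = np.sum(w * t)
    mean_dE = np.einsum("n,nij->ij", w, dE)
    mean_TdE = np.einsum("n,nij->ij", w * t, dE)

    # Covariance between the critic and the energy gradient.
    return mean_TdE - mean_T * mean_dE
```

With samples drawn from the model, the default uniform weights give a Monte Carlo estimate. Passing exact probabilities over all enumerated states of a tiny RBM makes it possible to compare the result against a finite-difference gradient.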

Let be a critic function and


be the associated adversarial loss. This formulation reduces to the adversarial loss for a BEAM when is independent of the visible units, but we derive the general case. We need to compute . First, we use the stochastic derivative trick:


We can write the model distribution of a Boltzmann machine as so that , with . Then, we have . Plugging this in, we find:
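One way to write out the steps described in this subsection is sketched below. The notation is assumed rather than recovered from the extracted text: p_θ(v,h) is the model distribution with energy E_θ and partition function Z_θ, T(v,h) is the critic, and the adversarial loss is taken to be the negative model average of the critic (the sign convention is an assumption).

```latex
% Assumed notation: p_\theta(v,h) = Z_\theta^{-1} e^{-E_\theta(v,h)},
% critic T(v,h), adversarial loss \mathcal{A} = -\mathbb{E}_{p_\theta}[T].
\begin{align}
\partial_\theta \mathcal{A}
  &= -\int dv\,dh\; T(v,h)\,\partial_\theta p_\theta(v,h)
   = -\mathbb{E}_{p_\theta}\!\left[ T\,\partial_\theta \log p_\theta \right], \\
\log p_\theta(v,h)
  &= -E_\theta(v,h) - \log Z_\theta,
  \qquad Z_\theta = \int dv\,dh\; e^{-E_\theta(v,h)}, \\
\partial_\theta \log p_\theta
  &= -\partial_\theta E_\theta
     + \mathbb{E}_{p_\theta}\!\left[ \partial_\theta E_\theta \right], \\
\partial_\theta \mathcal{A}
  &= \mathbb{E}_{p_\theta}\!\left[ T\,\partial_\theta E_\theta \right]
   - \mathbb{E}_{p_\theta}\!\left[ T \right]
     \mathbb{E}_{p_\theta}\!\left[ \partial_\theta E_\theta \right]
   = \mathrm{Cov}_{p_\theta}\!\left[ T,\ \partial_\theta E_\theta \right].
\end{align}
```

Under these assumptions, the gradient of the adversarial term is a covariance between the critic and the energy gradient, so it can be estimated from the same model samples that are used for the likelihood gradient.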


V.2 Details of Temperature Driven Sampling

Figure 8: Sampling with a driven sampler. Comparison of temperature driven sampling (TDS) to regular Gibbs sampling. The RBMs have a single Gaussian visible layer and a softmax hidden layer with 3 hidden units that encode the modes of a mixture of 3 Gaussians. The standard deviation of the inverse temperature was set to for the driven sampler.

Our approach to accelerated sampling, which we call Temperature Driven Sampling (TDS), greatly improves the ability to train Boltzmann machines without incurring significant additional computational cost. The algorithm is a variant of a sequential Monte Carlo sampler. A collection of samples is evolved independently using Gibbs sampling updates from the model. Note that this is not the same as running multiple chains for a parallel tempering algorithm, because each of the samples in the sequential Monte Carlo sampler is used to compute statistics, whereas parallel tempering uses only the samples from the chain at unit inverse temperature. Each of these samples has an inverse temperature that is drawn from a distribution with unit mean and a fixed variance. The inverse temperature of each sample is independently updated once for every Gibbs sampling iteration of the model; however, the updates are autocorrelated across time so that the inverse temperatures are slowly varying. As a result, the collection of samples is drawn from a distribution that is close to the model distribution, but with fatter tails. This allows for much faster mixing, while ensuring that the model averages (computed over the collection of samples) remain close approximations to averages computed from the model at unit inverse temperature.

Autocorrelation coefficient .
Variance of the distribution .
Current value of .
Set: and .
Draw .
Draw .
Algorithm 1 Sampling from an autocorrelated Gamma distribution.
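Since the formulas in Algorithm 1 did not survive extraction, the following is only a sketch of one standard construction that matches its outline (set two constants, draw once, draw again): a Poisson-mixed autoregressive Gamma update. The parameter choices `nu = 1 / var` and `c = (1 - phi) * var`, and the function name `update_inverse_temperature`, are assumptions, chosen so that the stationary distribution has mean 1, variance `var`, and lag-one autocorrelation `phi`.

```python
import numpy as np

def update_inverse_temperature(beta, phi, var, rng):
    """One autocorrelated Gamma update of the inverse temperature(s).

    beta: current value(s), scalar or array.
    phi:  autocorrelation coefficient, 0 <= phi < 1.
    var:  variance of the stationary distribution, var > 0.
    rng:  a numpy.random.Generator.
    """
    nu = 1.0 / var           # shape of the stationary Gamma distribution
    c = (1.0 - phi) * var    # scale of the conditional Gamma distribution
    # Poisson mixing variable that carries memory of the current value.
    z = rng.poisson(np.asarray(beta) * phi / c)
    # New value: Gamma with a shape shifted by the Poisson draw.
    return rng.gamma(shape=nu + z, scale=c)

# Usage: evolve many independent inverse temperatures.
rng = np.random.default_rng(0)
beta = np.ones(1000)
for _ in range(100):
    beta = update_inverse_temperature(beta, phi=0.9, var=0.1, rng=rng)
```

With `phi = 0` the update reduces to independent draws from a Gamma distribution with mean 1 and variance `var`.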

Details of the TDS algorithm are provided in Algorithms 1 and 2. Note that this algorithm includes a standard Gibbs sampling based sequential Monte Carlo sampler in the limit that the variance of the inverse temperature goes to zero. The samples drawn with the TDS algorithm are not samples from the equilibrium distribution of the Boltzmann machine. In principle, it is possible to reweight these samples to correct for the bias due to the varying temperature. In practice, we have not found that reweighting is necessary. An example of temperature driven sampling applied to a 3-mode MoG is shown in Figure 8.

Number of samples m.
Number of update steps k.
Autocorrelation coefficient for the inverse temperature.
Variance of the inverse temperature.
Randomly initialize the m samples.
Randomly initialize the m inverse temperatures.
for t = 1, …, k do
        for i = 1, …, m do
               Update the inverse temperature of sample i using Algorithm 1.
               Update sample i using Gibbs sampling.
        end for
end for
Algorithm 2 Temperature Driven Sampling.
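The loop in Algorithm 2 can be sketched for a small Bernoulli RBM as follows. This is an illustration under stated assumptions, not the authors' implementation: the inverse temperature is assumed to scale the energy (so it multiplies the conditional activations), the inverse temperature update is an assumed Poisson-mixed autoregressive Gamma step with mean 1, and `tds_sample` is a hypothetical name. The inner loop over samples is vectorized.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tds_sample(W, a, b, m, k, phi, var, seed=0):
    """Temperature driven sampling sketch for a Bernoulli RBM.

    W: (n_vis, n_hid) weights, a: visible biases, b: hidden biases.
    m: number of samples, k: number of update steps.
    phi: autocorrelation coefficient (0 <= phi < 1), var: variance (> 0).
    """
    rng = np.random.default_rng(seed)
    n_vis, n_hid = W.shape
    nu, c = 1.0 / var, (1.0 - phi) * var

    # Randomly initialize samples and inverse temperatures.
    v = rng.integers(0, 2, size=(m, n_vis)).astype(float)
    beta = rng.gamma(shape=nu, scale=var, size=m)  # mean 1, variance var

    for _ in range(k):
        # Update each inverse temperature (autocorrelated Gamma step).
        z = rng.poisson(beta * phi / c)
        beta = rng.gamma(shape=nu + z, scale=c)

        # One Gibbs sweep per sample at its own inverse temperature.
        p_h = sigmoid(beta[:, None] * (b + v @ W))
        h = (rng.random((m, n_hid)) < p_h).astype(float)
        p_v = sigmoid(beta[:, None] * (a + h @ W.T))
        v = (rng.random((m, n_vis)) < p_v).astype(float)

    return v, beta

# Usage with a small random RBM.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(4, 3))
samples, betas = tds_sample(W, np.zeros(4), np.zeros(3), m=50, k=20, phi=0.9, var=0.05)
```

All m samples are returned and used for statistics, in line with the description of the sampler given earlier in this section.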

V.3 Details on Models and Training

V.3.1 Gaussian Mixtures

Table 1 lays out the parameters of the Gaussian mixture comparison examples. It is interesting to note just how few parameters are required by the BEAM to model this data.

V.3.2 MNIST

Figure 9: Comparing a BEAM with the critic on the hidden layer to one with the critic on the visible layer. The KL divergences of two BEAMs trained on MNIST, differing only in whether the critic acts on the encoded data or directly on the visible data. Adversarial training begins after 25 epochs.

Figure 9 compares the training metrics obtained when the discriminator is trained on the hidden unit activities with those obtained when it is trained on the visible units. Both architectures use the same nearest neighbor classifier as the rest of our examples. The two training curves overlay exactly for the first 25 epochs while the generator is pre-trained with maximum likelihood learning. Once the discriminator is turned on, the reverse KL divergence decreases in both cases, but training the adversary on the hidden units decreases these metrics much more rapidly. Table 2 lays out the parameters of the models.
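To make the nearest neighbor critic concrete, the sketch below scores points in the hidden-activation space by a distance-weighted vote among their k nearest neighbors. The exact weighting used in the experiments is not stated in this section, so the inverse-distance weights, the ±1 labels, and the name `knn_critic` are all assumptions for illustration.

```python
import numpy as np

def knn_critic(query, data_h, model_h, k=5, eps=1e-8):
    """Distance-weighted k-nearest-neighbor critic (illustrative sketch).

    query:   (q, d) points to score (e.g. hidden activations).
    data_h:  (n, d) activations computed from training data (label +1).
    model_h: (n', d) activations computed from model samples (label -1).
    Returns scores in [-1, 1]; positive means "looks like data".
    """
    pool = np.vstack([data_h, model_h])
    labels = np.concatenate([np.ones(len(data_h)), -np.ones(len(model_h))])

    # Pairwise Euclidean distances between queries and the pooled points.
    d = np.linalg.norm(query[:, None, :] - pool[None, :, :], axis=2)

    # Indices and distances of the k nearest pooled points for each query.
    idx = np.argsort(d, axis=1)[:, :k]
    near = np.take_along_axis(d, idx, axis=1)

    # Inverse-distance weights (assumed weighting scheme).
    w = 1.0 / (near + eps)
    return np.sum(w * labels[idx], axis=1) / np.sum(w, axis=1)
```

Applying the same function to visible units instead of hidden activations only changes which arrays are passed in, which is the comparison that Figure 9 reports.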

Figure 10: Training progress on continuous MNIST with different critics. RF = Random Forest.

Figure 11: Fantasy particles for continuous MNIST with different critics.
Figure 12: Comparison of a non-convolutional GAN to a DCGAN on continuous MNIST.

V.3.3 Celebrity Faces

Figure 13: Architectures used in the CelebA experiment. The autoencoder is purely convolutional, with the encoder (decoder) shown by the stacks of convolutional (deconvolutional) layers. The BEAM uses the flattened encoded features as the visible units. The GAN/WGAN uses the same encoder (decoder) architectures for the discriminator (generator), with added single fully connected layers. The autoencoder, BEAM, and GANs are each trained fully independently. Table 3 lays out the parameters of the models.
Figure 14: Sample autoencoder reconstructions. The compression factor is () to ().

Figure 13 shows a diagram of the architectures used in the CelebA dataset experiments for the BEAM (including the autoencoder) and GAN/WGAN. The GANs share the same convolution architecture as the autoencoder, but are separately trained.

There is plenty of room to improve the quality of generated faces by employing more advanced RBM training techniques. For example, centering the RBMs tends to improve the variance in the generated faces and increases the definition of the hair and face edges.

Figure 15: BEAM vs. centered BEAM fantasy particles. Example fantasy particles generated by a BEAM using a centered visible layer.
Bimodal Gaussian samples, batch size
GAN/WGAN fully-connected with ReLU activations, WGAN weight clamping

generator dimensions critic dimensions epochs lr
RBM/BEAM distance-weighted nearest-neighbor critic , for BEAM
dims MCMC steps epochs lr
Radial Gaussian samples, batch size
GAN/WGAN fully-connected with ReLU activations, WGAN weight clamping
generator dimensions critic dimensions epochs lr
RBM/BEAM distance-weighted nearest-neighbor critic , for BEAM
dims MCMC steps epochs lr
Grid Gaussian samples, batch size
GAN/WGAN fully-connected with ReLU activations, WGAN weight clamping
generator dimensions critic dimensions epochs lr
RBM/BEAM distance-weighted nearest-neighbor critic , for BEAM
dims MCMC steps epochs lr

Table 1: Gaussian mixture architectures and hyperparameters. All GAN/WGAN models use ReLU activations between fully-connected layers. Network weights are initialized with normal distributions of standard deviation , with biases zero-initialized. The beta standard deviation for the driven sampler is set to for RBM, for driven RBM and BEAMs. The RBMs’ learning rates decrease according to a power-law decay, and all training uses ADAM optimization with beta .
MNIST samples, batch size
GAN/WGAN fully-connected with ReLU activations, sigmoid on generator, WGAN weight clamping
generator dimensions critic dimensions epochs lr
RBM/BEAM distance-weighted nearest-neighbor critic , for BEAM
dims MCMC steps epochs lr
ML, adv. ML, adv.

Table 2: MNIST architectures and hyperparameters. All GAN/WGAN models use ReLU activations between fully-connected layers. Generator and discriminator weights are initialized with normal distributions of standard deviation and resp., with biases zero-initialized. All training uses ADAM optimization with beta for the GANs and for the BEAM. For the BEAM, the beta standard deviation for the driven sampler is set to . The RBMs’ learning rates decrease according to a power-law decay.
CelebA samples, batch size
GAN/WGAN conv. with batch-norm and ReLU activations, WGAN weight clamping
generator dimensions critic dimensions epochs lr
BEAM distance-weighted nearest-neighbor critic , for BEAM
dims MCMC steps epochs lr
ML, adv. ML, adv.

Table 3: CelebA architectures and hyperparameters. All GAN/WGAN models use ReLU activations between fully-connected layers. Generator and discriminator weights are initialized with normal distributions of standard deviation , with biases zero-initialized. The beta standard deviation for the driven sampler is set to for RBM, for the BEAM. All training uses ADAM optimization with beta for the GANs and for the BEAM. The RBMs’ learning rates decrease according to a power-law decay.