Learning Energy-based Model with Flow-based Backbone by Neural Transport MCMC

06/12/2020 ∙ by Erik Nijkamp, et al. ∙ Google

Learning an energy-based model (EBM) requires MCMC sampling of the learned model as the inner loop of the learning algorithm. However, MCMC sampling of an EBM in data space generally does not mix, because the energy function, which is usually parametrized by a deep network, is highly multi-modal in the data space. This is a serious handicap for both the theory and practice of EBM. In this paper, we propose to learn the EBM with a flow-based model serving as a backbone, so that the EBM is a correction or an exponential tilting of the flow-based model. We show that the model has a particularly simple form in the space of the latent variables of the flow-based model, and MCMC sampling of the EBM in the latent space, which is a simple special case of neural transport MCMC, mixes well and traverses modes in the data space. This enables proper sampling and learning of EBM.


1 Introduction

The energy-based model (EBM) LeCun et al. (2006); Ngiam et al. (2011); Kim and Bengio (2016); Zhao et al. (2016); Xie et al. (2016); Gao et al. (2018); Kumar et al. (2019b); Nijkamp et al. (2019); Du and Mordatch (2019); Finn et al. (2016) defines an unnormalized probability density function on the observed data, such as images, via an energy function, so that the density is proportional to the exponential of the negative energy. Taking advantage of the approximation capacity of modern deep networks such as the convolutional network (ConvNet) LeCun et al. (1998); Krizhevsky et al. (2012), recent papers Xie et al. (2016); Gao et al. (2018); Kumar et al. (2019b); Nijkamp et al. (2019); Du and Mordatch (2019) parametrize the energy function by a ConvNet. The ConvNet-EBM is highly expressive, and the learned EBM can produce realistic synthesized examples.

The EBM can be learned by maximum likelihood, and the gradient-based maximum likelihood learning algorithm follows an “analysis by synthesis” scheme. In the synthesis step, synthesized examples are generated by sampling from the current model. In the analysis step, the model parameters are updated based on the statistical difference between the synthesized examples and the observed examples. The synthesis step usually requires Markov chain Monte Carlo (MCMC) sampling, and gradient-based sampling such as Langevin dynamics Langevin (1908) or Hamiltonian Monte Carlo (HMC) Neal (2011) can be conveniently implemented on current deep learning platforms, where gradients can be efficiently and automatically computed by back-propagation.

However, gradient-based MCMC sampling in the data space generally does not mix. The data distribution is typically highly multi-modal. To approximate such a distribution, the density function or the energy function of the ConvNet-EBM needs to be highly multi-modal as well. When sampling from such a multi-modal density in the data space, gradient-based MCMC tends to get trapped in local modes with little chance to traverse the modes freely, rendering the MCMC non-mixing. Without being able to generate fair examples from the model, the estimated gradient of the maximum likelihood learning can be very biased, and the learned model can be far from the maximum likelihood estimate (MLE). Even if we can learn the model by other means without resorting to MCMC sampling, e.g., by noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010); Gao et al. (2019), it is still necessary to be able to draw fair examples from the learned model for the purpose of model checking or for downstream applications based on the learned model.

Accepting the fact that MCMC sampling is not mixing, contrastive divergence Tieleman (2008) initializes finite-step MCMC from the observed examples, so that the learned model is admittedly biased from the MLE. Recently, Nijkamp et al. (2019) propose to initialize short-run MCMC from a fixed noise distribution, and show that even though the learned EBM is biased, the short-run MCMC can be considered a valid model that can generate realistic examples. This partially explains why the EBM learning algorithm can synthesize high quality examples even though the MCMC does not mix. However, the problem of non-mixing MCMC remains unsolved. Without proper MCMC sampling, the theory and practice of learning EBM rests on very shaky ground. The goal of this paper is to address this problem.

Figure 1: Demonstration of mixing MCMC with neural transport on a mixture of Gaussians as the target distribution. Top: Trajectories of Markov chains in data space x and in latent space z. Bottom: Density estimates with the exponentially tilted model p_θ of the underlying flow q_α.

We propose to learn the EBM with a flow-based model serving as a backbone model, so that the EBM is in the form of a correction, or an exponential tilting, of the flow-based model. Flow-based models have gained popularity in generative modeling Dinh et al. (2014, 2016); Kingma and Dhariwal (2018); Grathwohl et al. (2018); Behrmann et al. (2018); Kumar et al. (2019a); Tran et al. (2019) and variational inference Kingma and Welling (2013); Rezende and Mohamed (2015); Kingma et al. (2016); Kingma and Welling (2014); Khemakhem et al. (2019). Similar to the generator model Kingma and Welling (2013); Goodfellow et al. (2014), the flow-based model is based on a mapping from the latent space to the data space. However, unlike the generator model, the mapping in the flow-based model is one-to-one, with closed-form inversion and Jacobian. This leads to an explicit normalized density via change of variable. However, to ensure tractable inversion and Jacobian, the mapping in the flow-based model has to be a composition of a sequence of simple transformations of highly constrained forms. In order to approximate a complex distribution, it is necessary to compose a large number of such transformations. In our work, we propose to learn the EBM by correcting a relatively simple flow-based model with a relatively simple energy function parametrized by a free-form ConvNet. We show that the resulting EBM has a particularly simple form in the space of the latent variables of the flow-based model. MCMC sampling of the EBM in the latent space, which is a simple special case of neural transport MCMC Hoffman et al. (2019), mixes well and is able to traverse modes in the data space. This enables proper sampling and learning of EBM.

Our experiments show that it is possible to learn an EBM with a flow-based backbone, and that neural transport sampling of the learned EBM solves or greatly mitigates the non-mixing problem of MCMC.

2 Contributions and related work

Contributions. This paper tackles the problem of non-mixing of MCMC for sampling from an EBM. We propose to learn EBM with a flow-based backbone model. The resulting EBM in the latent space is of a simple form that is much more friendly to MCMC mixing.

The following are research themes in generative modeling and MCMC sampling that are closely related to our work.

Neural transport MCMC. Our work is inspired by neural transport sampling Hoffman et al. (2019). For an unnormalized target distribution, the neural transport sampler trains a flow-based model as a variational approximation to the target distribution, and then samples the target distribution in the space of the latent variables of the flow-based model via change of variable. In the latent space, the target distribution is close to the prior distribution of the latent variables of the flow-based model, which is usually a unimodal Gaussian white noise distribution. Consequently, the target distribution in the latent space is close to unimodal and is much more conducive to the mixing and fast convergence of MCMC than sampling in the original space Mangoubi and Smith (2017).

Our work is a simplified special case of this idea, where we learn the EBM as a correction of a pre-trained flow-based model, so that we do not need to train a separate flow-based approximation to the EBM. The energy function, which is a correction of the flow-based model, does not need to reproduce the content of the flow-based model, and thus can be kept relatively simple. Moreover, in the latent space, the resulting EBM takes on a very simple form where the inversion and Jacobian of the flow-based model disappear. This may allow for a more free-form flow-based model where the inversion and Jacobian do not need to be in closed form Grathwohl et al. (2018); Behrmann et al. (2018).

Energy-based corrections. Our model is based on an energy-based correction or an exponential tilting of a more tractable model. This idea has been explored in noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010); Gao et al. (2019) and introspective neural networks (INN) Tu (2007); Jin et al. (2017); Lazarow et al. (2017), where the correction is obtained by discriminative learning. Earlier works include Rosenfeld et al. (2001); Wang and Ou (2018). Correcting or refining a simpler and more tractable backbone model can be much easier than learning an EBM from scratch, because the EBM does not need to reproduce the knowledge learned by the backbone model. It also allows easier sampling of the EBM.

Latent space sampling. The non-mixing MCMC sampling of an EBM is a clear call for latent variables that represent the multiple modes of the original model distribution via an explicit top-down mapping, so that the distribution of the latent variables is less multi-modal. Earlier work in this direction includes Bengio et al. (2013); Brock et al. (2018); Kumar et al. (2019b). In this paper, we choose the flow-based model for its simplicity, because the distribution in the data space can be translated into the distribution in the latent space by a simple change of variable, without requiring integrating out extra dimensions as in the generator model.

3 Model and learning

3.1 Flow-based model

Let x be the input example, such as an image. A flow-based model is of the form

x = g_α(z),  z ∼ q_0(z),  (1)

where z is the latent vector of the same dimensionality as x, and q_0(z) is a known prior distribution such as the Gaussian white noise distribution. g_α is a composition of a sequence of invertible transformations whose inversions and log-determinants of the Jacobians can be obtained in closed form. As a result, these transformations are of highly constrained forms. α denotes the parameters. Let q_α(x) be the probability density of x under the transformation x = g_α(z). Then, according to the change of variable,

q_α(x) |dx| = q_0(z) |dz|,  (2)

where |dx| and |dz| are understood as the volumes of the infinitesimal local neighborhoods around x and z respectively under the mapping x = g_α(z). Then for a given x, z = g_α^{-1}(x), and

q_α(x) = q_0(g_α^{-1}(x)) |dz| / |dx|,  (3)

where the ratio between the volumes is the absolute value of the determinant of the Jacobian of g_α^{-1}.

Suppose we observe training examples x_i ∼ p_data, i = 1, ..., n, where p_data is the data distribution, which is typically highly multi-modal. We can learn α by MLE. For large n, the MLE of α minimizes the Kullback-Leibler divergence KL(p_data ‖ q_α). q_α strives to cover most of the modes of p_data, and the learned q_α tends to be more dispersed than p_data. In order for q_α to approximate p_data closely, it is usually necessary for g_α to be a composition of a large number of transformations of highly constrained forms with closed-form inversions and Jacobians. The learned mapping g_α transports the unimodal Gaussian white noise distribution q_0 to a highly multi-modal distribution in the data space as an approximation to the data distribution p_data.
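To make the change of variable in equations (1)-(3) concrete, the following NumPy sketch evaluates the density of a toy flow with a single invertible affine transformation. The map, its parameters, and the prior are illustrative assumptions for exposition, not the Glow backbone used in this paper.

```python
# Minimal sketch (not the paper's code): the change-of-variable density in
# equations (1)-(3) for a toy invertible flow x = g(z) = A z + b with a
# standard Gaussian prior q_0(z). All names and values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D = 2
A = np.array([[1.5, 0.0],
              [0.7, 0.8]])          # invertible (lower-triangular) weight
b = np.array([0.3, -0.2])           # bias

def g(z):                            # forward map z -> x
    return z @ A.T + b

def g_inv(x):                        # inverse map x -> z
    return np.linalg.solve(A, (x - b).T).T

def log_q0(z):                       # standard Gaussian prior log-density
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)

def log_q_alpha(x):                  # eq. (3): log q_0(g^{-1}(x)) + log |det dz/dx|
    z = g_inv(x)
    log_det_dzdx = -np.log(np.abs(np.linalg.det(A)))   # since dz/dx = A^{-1}
    return log_q0(z) + log_det_dzdx

# push prior samples forward and evaluate their density under the flow
z = rng.standard_normal((5, D))
x = g(z)
print(log_q_alpha(x))
```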

3.2 Energy-based model

An energy-based model (EBM) is defined as follows:

p_θ(x) = (1/Z(θ)) exp(f_θ(x)) q(x),  (4)

where q(x) is a reference measure, such as the uniform measure or a Gaussian white noise distribution as in Xie et al. (2016). f_θ(x) is defined by a bottom-up ConvNet whose parameters are denoted by θ. The normalizing constant or partition function Z(θ) = ∫ exp(f_θ(x)) q(x) dx is typically analytically intractable.

Suppose we observe training examples x_i ∼ p_data for i = 1, ..., n. For large n, the sample average over {x_i} approximates the expectation with respect to p_data. For notational convenience, we treat the sample average and the expectation as the same.

The log-likelihood is

L(θ) = (1/n) Σ_{i=1}^{n} log p_θ(x_i).  (5)

The derivative of the log-likelihood is

L'(θ) = E_{p_data}[∂f_θ(x)/∂θ] − E_{p_θ}[∂f_θ(x)/∂θ] ≈ (1/n) Σ_{i=1}^{n} ∂f_θ(x_i)/∂θ − (1/m) Σ_{i=1}^{m} ∂f_θ(x̃_i)/∂θ,  (6)

where x̃_i, i = 1, ..., m, are synthesized examples sampled from the current model p_θ.

The above equation leads to the “analysis by synthesis” learning algorithm. At iteration t, let θ_t be the current model parameters. We generate x̃_i ∼ p_{θ_t}(x) for i = 1, ..., m. Then we update θ_{t+1} = θ_t + η_t L'(θ_t), where η_t is the learning rate.

To generate synthesized examples from p_θ(x), we can use gradient-based MCMC sampling such as Langevin dynamics Langevin (1908) or Hamiltonian Monte Carlo (HMC) Neal (2011), where the gradient ∂f_θ(x)/∂x can be automatically computed. Since p_data is in general highly multi-modal, the learned p_θ or f_θ tends to be multi-modal as well. As a result, gradient-based MCMC tends to get trapped in the local modes of p_θ with little chance of mixing between the modes.
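As an illustration of the “analysis by synthesis” scheme with Langevin dynamics in data space, the following PyTorch sketch runs short-run Langevin chains and applies the gradient of equation (6). The toy 2-D data, the small MLP standing in for the ConvNet f_θ, and all hyper-parameters are assumptions made for exposition, not the settings of this paper.

```python
# Minimal PyTorch sketch (not the authors' code) of the "analysis by synthesis"
# update in eq. (6) with Langevin dynamics in data space. f_theta is a small MLP
# standing in for the ConvNet; the data are toy 2-D points.
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 2
f_theta = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

def langevin(x, n_steps=30, step=0.01):
    """Short-run Langevin chain ascending the negative energy f_theta."""
    x = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(f_theta(x).sum(), x)[0]
        x = x + 0.5 * step**2 * grad + step * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()

data = torch.randn(512, D) * 0.5 + torch.tensor([2.0, 0.0])   # toy observed data

for it in range(200):
    x_obs = data[torch.randint(0, data.shape[0], (64,))]
    x_syn = langevin(torch.randn(64, D))        # synthesis step (short-run MCMC)
    # analysis step: ascend E_data[f_theta] - E_model[f_theta] by minimizing its negative
    loss = f_theta(x_syn).mean() - f_theta(x_obs).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```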

3.3 Energy-based model with flow-based backbone

Instead of using the uniform or Gaussian white noise distribution for the reference distribution q(x) in the EBM in (4), we can use a relatively simple flow-based model q_α(x) as the reference model. q_α can be pre-trained by MLE and serves as the backbone of the model, so that the model is of the following form:

p_θ(x) = (1/Z(θ)) exp(f_θ(x)) q_α(x),  (7)

which is almost the same as (4), except that the reference distribution q(x) is a pre-trained flow-based model q_α(x). The resulting model is a correction or refinement of q_α, or an exponential tilting of q_α, and f_θ is a free-form ConvNet that parametrizes the correction. The overall negative energy is f_θ(x) + log q_α(x).

In the latent space of z, let p_θ(z) be the distribution of z under p_θ(x). Then

p_θ(z) |dz| = p_θ(x) |dx| = (1/Z(θ)) exp(f_θ(g_α(z))) q_α(x) |dx|.  (8)

Recalling equation (2), q_α(x) |dx| = q_0(z) |dz|, we have

p_θ(z) = (1/Z(θ)) exp(f_θ(g_α(z))) q_0(z).  (9)

p_θ(z) is an exponential tilting of the prior noise distribution q_0(z). It is of a very simple form that does not involve the Jacobian or the inversion of g_α.

We can also apply the above exponential tilting and change of variable scheme to the generator model, i.e., use the generator model as the backbone model. However, for the generator model, the marginal density q_α(x) is not in closed form, and after exponential tilting, the marginal p_θ(x) requires an intractable integral. See Appendix 6.2 for details. In comparison, the flow-based model is simpler and more explicit.

3.4 Learning by Hamiltonian neural transport sampling

Instead of sampling p_θ(x), we can sample p_θ(z) in (9). While q_α(x) is multi-modal, q_0(z) is unimodal. Since p_θ(x) is a correction of q_α(x), p_θ(z) is a correction of q_0(z), and can be much less multi-modal than p_θ(x) in the data space. After sampling z from p_θ(z), we can generate x = g_α(z).

The above MCMC sampling scheme is a special case of the neural transport MCMC proposed by Hoffman et al. (2019) for sampling from an EBM or the posterior distribution of a generative model. The basic idea is to train a flow-based model as a variational approximation to the target EBM, and to sample the EBM in the latent space of the flow-based model. In our case, since p_θ(x) is a correction of q_α(x), we can simply use g_α directly as the flow-based model in the neural transport sampler. The extra benefit is that the distribution p_θ(z) is of an even simpler form than p_θ(x), because it does not involve the inversion and Jacobian of g_α. As a result, we may use a flow-based backbone model of a more free form, such as one based on a residual network Behrmann et al. (2018); we leave this issue to future investigation.

We use Hamiltonian Monte Carlo (HMC) Neal (2011) to sample from p_θ(z). We can then learn θ by MLE according to equation (6). Algorithm 1 describes the detailed learning algorithm. We refer to Appendix 6.4 for details.

input : Learning iterations T, learning rate η, batch size m, pre-trained flow parameters α, initial correction parameters θ_0, initial latent variables {z_i}, observed examples {x_i}, number of MCMC steps K in each learning iteration.
output : Parameters θ_T.
for t = 0, ..., T − 1 do
       1. Update {z_i} by HMC with target distribution p_{θ_t}(z) in equation (9) for K steps. 2. Push the z-space samples forward through g_α to obtain synthesized examples x̃_i = g_α(z_i). 3. Draw m observed training examples {x_i}. 4. Update θ_{t+1} according to (6).
Algorithm 1 Learning the correction f_θ with Hamiltonian neural transport (NT).
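The following PyTorch sketch illustrates Algorithm 1 under simplifying assumptions: a fixed linear map stands in for the pre-trained flow g_α, a small MLP stands in for f_θ, and the HMC kernel is a bare-bones leapfrog implementation. None of these choices reproduce the paper's actual architecture or hyper-parameters.

```python
# Minimal PyTorch sketch (assumptions, not the authors' implementation) of
# Algorithm 1: HMC on the latent-space density in eq. (9),
#   log p_theta(z) = f_theta(g_alpha(z)) + log q_0(z) + const,
# followed by the parameter update of eq. (6).
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 2
g_alpha = nn.Linear(D, D)                         # stand-in for the pre-trained flow
for p in g_alpha.parameters():
    p.requires_grad_(False)
f_theta = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

def log_p(z):                                      # unnormalized log p_theta(z)
    return f_theta(g_alpha(z)).squeeze(-1) - 0.5 * (z**2).sum(-1)

def grad_log_p(z):
    z = z.detach().requires_grad_(True)
    return torch.autograd.grad(log_p(z).sum(), z)[0]

def hmc_step(z, step=0.1, n_leapfrog=10):
    p0 = torch.randn_like(z)
    zq, pq = z.clone(), p0 + 0.5 * step * grad_log_p(z)      # half kick
    for i in range(n_leapfrog):
        zq = zq + step * pq                                   # drift
        kick = step * grad_log_p(zq)
        pq = pq + (kick if i < n_leapfrog - 1 else 0.5 * kick)
    # Metropolis-Hastings correction on the Hamiltonian
    h0 = -log_p(z) + 0.5 * (p0**2).sum(-1)
    h1 = -log_p(zq) + 0.5 * (pq**2).sum(-1)
    accept = (torch.rand(z.shape[0]) < torch.exp(h0 - h1)).float().unsqueeze(-1)
    return accept * zq.detach() + (1 - accept) * z

data = torch.randn(512, D)                          # toy observed examples
z_persist = torch.randn(64, D)                      # persistent latent chains

for t in range(200):
    for _ in range(5):                              # K HMC steps
        z_persist = hmc_step(z_persist)
    x_syn = g_alpha(z_persist).detach()             # push forward to data space
    x_obs = data[torch.randint(0, 512, (64,))]
    loss = f_theta(x_syn).mean() - f_theta(x_obs).mean()   # negative of eq. (6)
    opt.zero_grad(); loss.backward(); opt.step()
```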

3.5 Learning by noise contrastive estimation

We may also learn the correction f_θ discriminatively, as in noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010) or introspective neural networks (INN) Tu (2007); Jin et al. (2017); Lazarow et al. (2017). Let {x_i} be the training examples, which are treated as positive examples, and let {x̃_i} be the examples generated from q_α(x), which are treated as negative examples. For each batch, let ρ be the proportion of positive examples, and 1 − ρ the proportion of negative examples. Then

P(positive | x) = ρ p_θ(x) / (ρ p_θ(x) + (1 − ρ) q_α(x)) = 1 / (1 + ((1 − ρ)/ρ) exp(−(f_θ(x) − b))),  (10)

where b = log Z(θ) is treated as a separate bias parameter. Then we can estimate θ and b by fitting a logistic regression on the positive and negative examples.

Note that NCE is the discriminator side of GAN. Similar to GAN, we can also improve the flow-based model based on the value function of GAN. This may further improve the NCE results.
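A minimal PyTorch sketch of the NCE variant, assuming the logistic form of (10) above: f_θ(x) minus a learned bias b = log Z(θ) serves as the logit separating observed examples from backbone samples. The toy data, the stand-in sampler for q_α, and the hyper-parameters are illustrative, not the paper's settings.

```python
# Illustrative sketch of learning the correction f_theta by NCE as in eq. (10):
# a logistic regression separating observed examples from samples of the
# backbone q_alpha, with the log partition function absorbed into a bias b.
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 2
f_theta = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, 1))
log_Z = nn.Parameter(torch.zeros(1))                  # bias b = log Z(theta)
opt = torch.optim.Adam(list(f_theta.parameters()) + [log_Z], lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def sample_q_alpha(n):                                # stand-in for flow sampling
    return torch.randn(n, D)

data = torch.randn(512, D) * 0.5 + 1.0                # toy observed examples
rho = 0.5                                             # proportion of positives

for it in range(500):
    x_pos = data[torch.randint(0, 512, (64,))]
    x_neg = sample_q_alpha(64)
    x = torch.cat([x_pos, x_neg], dim=0)
    y = torch.cat([torch.ones(64), torch.zeros(64)])
    # logit of P(positive | x) under the tilted model p_theta vs. q_alpha
    logits = f_theta(x).squeeze(-1) - log_Z + torch.log(torch.tensor(rho / (1 - rho)))
    loss = bce(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
```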

4 Experiments

In the subsequent empirical evaluations, we shall address the following questions:
(1) Is the mixing of HMC with neural transport apparent, both qualitatively and quantitatively?
(2) Does the exponential tilting with the correction term f_θ improve the quality of synthesis?
(3) In the latent space, does smooth interpolation remain feasible?
(4) In terms of ablation, what is the effect of the number of parameters of the flow-based backbone q_α?
(5) Is discriminative learning in the form of NCE an efficient alternative learning method?

4.1 Mixing

Gelman-Rubin. The Gelman-Rubin statistic Gelman et al. (1992); Brooks and Gelman (1998) measures the convergence of Markov chains to the target distribution. It is based on the notion that if multiple chains have converged, they should, by definition, appear “similar” to one another; otherwise, one or more chains have failed to converge. Specifically, the diagnostic recruits an analysis of variance to assess the difference between the between-chain and within-chain variances.

Let p denote the target distribution with mean μ and variance σ². Gelman et al. (1992) design two estimators of σ² and compare the square root of their ratio to 1. Consider m Markov chains of length n. Let W be the within-chain variance. The quantity W underestimates σ² due to the positive correlation within each Markov chain. Let V̂ = ((n − 1)/n) W + B/n be a mixture of the within-chain variance W and the between-chain variance B. The quantity V̂ will overestimate σ² if an over-dispersed initial distribution for the Markov chains was used Gelman et al. (1992). That is, W underestimates σ² while V̂ overestimates it. Both estimators are consistent for σ² as n → ∞ Vats and Knudson (2018). In light of this, the Gelman-Rubin statistic monitors convergence as the ratio R̂ = sqrt(V̂/W). If all chains converge to p, then R̂ → 1 as n → ∞. Before that, R̂ > 1. A value of R̂ close to 1 indicates approximate convergence Brooks and Gelman (1998). Figure 2(a) depicts R̂ for multiple long-run chains after a burn-in period. The mean value is close to 1, which we treat as approximate convergence to the target distribution.

(a) Gelman-Rubin statistic R̂ for the convergence of multiple long-run Markov chains, where R̂ close to 1 indicates approximate convergence.
(b) Auto-correlation of a single long-run Markov chain over time lag τ, with the mean depicted as a line and min/max as bands.
Figure 2: Diagnostics for the mixing of MCMC chains with target p_θ(z).
Figure 3: A single long-run Markov chain sampled by Hamiltonian neural transport, depicted at regular step intervals. Left: SVHN. Right: CelebA.

Auto-Correlation. MCMC sampling leads to auto-correlated samples due to the inherent Markovian dependence structure. The (sample) auto-correlation is the correlation between samples that are τ steps apart in time. Figure 2(b) shows the auto-correlation against increasing time lag τ. The auto-correlation of the Hamiltonian Markov chains with neural transport vanishes within many fewer steps than that of the over-damped Langevin sampler. This finding for a single long-run Markov chain is consistent with the Gelman-Rubin statistic assessing multiple Markov chains.
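For reference, the two diagnostics can be computed from stored chains as in the following NumPy sketch; the toy chains and the scalar (per-coordinate) treatment are simplifying assumptions.

```python
# Sketch (illustrative) of the two diagnostics used above: the Gelman-Rubin
# statistic R-hat computed from m chains of length n, and the sample
# auto-correlation of a single chain at a given lag.
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (m, n) with m chains of n scalar samples."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()                 # within-chain variance
    B = n * chain_means.var(ddof=1)                       # between-chain variance
    V_hat = (n - 1) / n * W + B / n                       # pooled variance estimate
    return np.sqrt(V_hat / W)                             # R-hat

def autocorrelation(x, lag):
    """Sample auto-correlation of a 1-D chain x at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# toy example: independent Gaussian "chains" should give R-hat close to 1
rng = np.random.default_rng(0)
chains = rng.standard_normal((4, 5000))
print(gelman_rubin(chains))                               # ~1.0
print([autocorrelation(chains[0], lag) for lag in (1, 10, 100)])
```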

Visual Inspection. Assume a Markov chain is run for a large number of steps with Hamiltonian neural transport. We then pull the Markov chain back into data space and visualize the long-run trajectory in Figure 3, with p_θ learned on the SVHN Netzer et al. (2011) and CelebA Liu et al. (2015) datasets. We observe that the Markov chain traverses local modes, which we consider a weak indication of mixing. Figure 4 contrasts the Markov chain that samples the EBM learned with short-run MCMC Nijkamp et al. (2019), which does not mix, against our method, in which the pulled-back chain mixes freely.

Figure 4: Long-run Markov chains for learned models without and with mixing. Top: Chains trapped in an over-saturated local mode; model learned by short-run MCMC Nijkamp et al. (2019) without mixing. Bottom: Chain freely traversing local modes; model learned by Hamiltonian neural transport with mixing.

4.2 Synthesis

We evaluate the quality of synthesis on four datasets: MNIST LeCun et al. (2010), SVHN Netzer et al. (2011), CelebA Liu et al. (2015), and CIFAR-10 Krizhevsky et al.. The qualitative results are depicted in Figures 5 and 6, which contrast generated samples from Glow against long-run Markov chains obtained by Hamiltonian neural transport. Table 1 compares the Fréchet Inception Distance (FID) Heusel et al. (2017) with the Inception v3 classifier Szegedy et al. (2016) on generated examples. Both qualitatively and quantitatively, we observe a significant improvement in the quality of synthesis with the exponential tilting of the reference distribution q_α by the correction f_θ.

(a) Samples drawn from the flow q_α by ancestral sampling.
(b) Samples drawn from p_θ by Hamiltonian neural transport.
Figure 5: Generated samples on SVHN.
(a) Samples drawn from the flow q_α by ancestral sampling.
(b) Samples drawn from p_θ by Hamiltonian neural transport.
Figure 6: Generated samples on CelebA.
Method MNIST SVHN CelebA CIFAR-10
VAE Kingma and Welling (2013) 32.86 49.72 48.27 106.37
ABP Han et al. (2017) 39.12 48.65 51.92 114.13
Glow (MLE) Kingma and Dhariwal (2018) 66.04 94.23 59.35 90.08
NCE-EBM (Ours) 36.52 79.84 51.73 -
NT-EBM (Ours) 21.32 48.01 46.38 78.12
Table 1: FID scores for generated examples.

4.3 Interpolation

Interpolation allows us to appraise the smoothness of the latent space. In particular, two samples z_1 and z_2 are drawn from the prior distribution q_0(z). We may spherically interpolate between them in z-space and then push forward into data space to assess g_α(z). To evaluate the tilted model p_θ, we run a magnetized form of the over-damped Langevin equation, for which we alter the negative energy to f_θ(g_α(z)) + log q_0(z) − m ‖z − z_2‖ with magnetization constant m Hill et al. (2019). Note that ∇_z ‖z − z_2‖ = (z − z_2)/‖z − z_2‖; thus, the magnetization term contributes a drift of uniform strength m pointing towards z_2. The resulting Langevin equation is z_{k+1} = z_k + (s²/2) ∇_z [f_θ(g_α(z_k)) + log q_0(z_k) − m ‖z_k − z_2‖] + s ε_k with step size s and Wiener process ε_k ∼ N(0, I). To find a low energy path from z_1 towards z_2, we set a small magnetization m, and perform a number of steps of the discretized, magnetized Langevin equation with small s. Figure 7 depicts the low-energy path in data space and the energy over time. The qualitatively smooth interpolation and narrow energy spectrum indicate that Langevin dynamics in latent space (with small magnetization) is able to traverse two arbitrary local modes, thus substantiating our claim that the underlying geometry is amenable to mixing.
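A sketch of the magnetized Langevin interpolation described above, assuming the magnetization term −m‖z − z_2‖ and toy stand-ins for g_α and f_θ; the step size, magnetization constant, and number of steps are illustrative.

```python
# Illustrative PyTorch sketch of the magnetized Langevin dynamics used for
# interpolation: the negative energy f_theta(g_alpha(z)) + log q_0(z) is
# augmented with a term -m * ||z - z2||, whose gradient pulls the chain
# towards z2 with uniform strength. g_alpha and f_theta are toy stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 2
g_alpha = nn.Linear(D, D)
f_theta = nn.Sequential(nn.Linear(D, 16), nn.Tanh(), nn.Linear(16, 1))

def neg_energy(z, z2, m):
    return (f_theta(g_alpha(z)).squeeze(-1)
            - 0.5 * (z**2).sum(-1)                 # log q_0(z) up to a constant
            - m * (z - z2).norm(dim=-1))           # magnetization towards z2

def magnetized_langevin(z1, z2, m=0.5, n_steps=500, step=0.05):
    z = z1.clone()
    path = [z.clone()]
    for _ in range(n_steps):
        z = z.detach().requires_grad_(True)
        grad = torch.autograd.grad(neg_energy(z, z2, m).sum(), z)[0]
        z = z + 0.5 * step**2 * grad + step * torch.randn_like(z)
        path.append(z.detach().clone())
    return torch.stack(path)

z1, z2 = torch.randn(1, D), torch.randn(1, D)
path = magnetized_langevin(z1, z2)
x_path = g_alpha(path.reshape(-1, D))      # push the low-energy path to data space
```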

Figure 7: Low energy path between z_1 and z_2 obtained by magnetized Langevin dynamics on MNIST. Top: Trajectory in data space. Bottom: Energy profile over time.

4.4 Ablation

We investigate the influence of the number of parameters of the flow-based backbone q_α on the quality of synthesis. Specifically, we show (1) the threshold at which a “large” q_α learned by MLE outperforms NT-EBM with a “small” tilted backbone, and (2) the minimal size of q_α that allows for learning by our method. Our method with a “medium” sized backbone significantly outperforms the “largest” Glow baseline.

Method Small Medium Large Largest
Glow (MLE) Kingma and Dhariwal (2018) 110.55 94.34 89.31 86.18
NT-EBM (Ours) 74.77 48.01 43.82 -
Table 2: FID scores for generated examples for Glow backbones of varying size (Small, Medium, Large, Largest) on SVHN.

4.5 Noise Contrastive Estimation

Noise Contrastive Estimation (NCE) is an efficient alternative to our neural transport learning method. We learn the correction f_θ according to (10), while sampling from the learned model with neural transport MCMC. Table 1 compares models learned with both methods. The long-run MCMC chains in models learned by NCE are conducive to mixing and remain of high visual quality; see Appendix 6.6. We leave improvements to this method to future investigations.

5 Conclusion

This paper proposes to learn EBM as a correction or exponential tilting of a flow-based model, so that neural transport MCMC sampling in the latent space of the flow-based model can mix well and traverse the modes in the data space.

Energy-based correction of a more tractable backbone model is a general modeling strategy that goes beyond correcting the flow-based model. Consider a latent EBM such as the Boltzmann machine Ackley et al. (1985), which is an undirected graphical model with a simple energy function defined on both the observed variables and multiple layers of latent variables. Instead of learning a latent EBM from scratch, we may learn a latent EBM as a correction of a top-down generation model such as the one in the Helmholtz machine Hinton et al. (1995), to correct for the conditional independence assumptions in the top-down model. We shall investigate this problem in future work.

Acknowledgments

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and XSEDE grant ASC170063. We thank Matthew D. Hoffman, Diederik P. Kingma, Alexander A. Alemi, and Will Grathwohl for helpful discussions.

References

  • [1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski (1985) A learning algorithm for boltzmann machines. Cognitive Science 9 (1), pp. 147–169. External Links: Link, Document Cited by: §5.
  • [2] C. Andrieu and J. Thoms (2008) A tutorial on adaptive mcmc. Statistics and computing 18 (4), pp. 343–373. Cited by: §6.4.
  • [3] J. Behrmann, D. Duvenaud, and J. Jacobsen (2018) Invertible residual networks. arXiv preprint arXiv:1811.00995. Cited by: §1, §2, §3.4.
  • [4] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai (2013) Better mixing via deep representations. In International Conference on Machine Learning, pp. 552–560. Cited by: §2.
  • [5] A. Beskos, N. Pillai, G. Roberts, J. Sanz-Serna, A. Stuart, et al. (2013) Optimal tuning of the hybrid monte carlo algorithm. Bernoulli 19 (5A), pp. 1501–1534. Cited by: §6.4.
  • [6] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §2.
  • [7] S. P. Brooks and A. Gelman (1998) General methods for monitoring convergence of iterative simulations. Journal of computational and graphical statistics 7 (4), pp. 434–455. Cited by: §4.1, §4.1, §6.7.
  • [8] R. T. Chen, J. Behrmann, and J. Jacobsen Residual flows: unbiased generative modeling with norm-learned i-resnets. Cited by: §6.3.
  • [9] L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §1.
  • [10] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §1.
  • [11] Y. Du and I. Mordatch (2019) Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689. Cited by: §1.
  • [12] C. Finn, P. Christiano, P. Abbeel, and S. Levine (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: §1.
  • [13] R. Gao, Y. Lu, J. Zhou, S. Zhu, and Y. Nian Wu (2018) Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9155–9164. Cited by: §1.
  • [14] R. Gao, E. Nijkamp, D. P. Kingma, Z. Xu, A. M. Dai, and Y. N. Wu (2019) Flow contrastive estimation of energy-based models. arXiv preprint arXiv:1912.00589. Cited by: §1, §2.
  • [15] A. Gelman, D. B. Rubin, et al. (1992) Inference from iterative simulation using multiple sequences. Statistical science 7 (4), pp. 457–472. Cited by: §4.1, §4.1.
  • [16] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. Cited by: §6.4.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [18] W. Grathwohl, R. T. Chen, J. Betterncourt, I. Sutskever, and D. Duvenaud (2018) Ffjord: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367. Cited by: §1, §2.
  • [19] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1, §2, §3.5.
  • [20] T. Han, Y. Lu, S. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pp. 1976–1984. External Links: Link Cited by: Table 1.
  • [21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.2, §6.4.
  • [22] M. Hill, E. Nijkamp, and S. Zhu (2019) Building a telescope to look into high-dimensional image spaces. Quarterly of Applied Mathematics 77 (2), pp. 269–321. Cited by: §4.3.
  • [23] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal (1995) The "wake-sleep" algorithm for unsupervised neural networks. Science 268 (5214), pp. 1158–1161. Cited by: §5.
  • [24] M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan (2019) NeuTra-lizing bad geometry in hamiltonian monte carlo using neural transport. arXiv preprint arXiv:1903.03704. Cited by: §1, §2, §3.4.
  • [25] L. Jin, J. Lazarow, and Z. Tu (2017) Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pp. 823–833. Cited by: §2, §3.5.
  • [26] I. Khemakhem, D. P. Kingma, and A. Hyvärinen (2019) Variational autoencoders and nonlinear ICA: a unifying framework. arXiv preprint arXiv:1907.04809. Cited by: §1.
  • [27] T. Kim and Y. Bengio (2016) Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439. Cited by: §1.
  • [28] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, Table 1.
  • [29] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §6.4.
  • [30] D. Kingma and M. Welling (2014) Efficient gradient-based inference through transformations between bayes nets and neural nets. In International Conference on Machine Learning, pp. 1782–1790. Cited by: §1.
  • [31] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §1, Table 1, Table 2, §6.3.
  • [32] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §1.
  • [33] A. Krizhevsky, V. Nair, and G. Hinton () CIFAR-10 (canadian institute for advanced research). . External Links: Link Cited by: §4.2.
  • [34] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [35] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma (2019) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434. Cited by: §1.
  • [36] R. Kumar, A. Goyal, A. Courville, and Y. Bengio (2019) Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508. Cited by: §1, §2.
  • [37] P. Langevin (1908) On the theory of brownian motion. Cited by: §1, §3.2.
  • [38] J. Lazarow, L. Jin, and Z. Tu (2017) Introspective neural networks for generative modeling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2774–2783. Cited by: §2, §3.5.
  • [39] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • [40] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §1.
  • [41] Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2. Cited by: §4.2.
  • [42] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §4.1, §4.2.
  • [43] O. Mangoubi and A. Smith (2017) Rapid mixing of hamiltonian monte carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114. Cited by: §2.
  • [44] R. M. Neal (2011) MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2. Cited by: §1, §3.2, §3.4.
  • [45] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.1, §4.2, §6.6.
  • [46] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng (2011) Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 1105–1112. Cited by: §1.
  • [47] E. Nijkamp, S. Zhu, and Y. N. Wu (2019) On learning non-convergent short-run mcmc toward energy-based model. arXiv preprint arXiv:1904.09770. Cited by: §1, §1, Figure 4, §4.1.
  • [48] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §6.3.
  • [49] D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §1.
  • [50] R. Rosenfeld, S. F. Chen, and X. Zhu (2001) Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. Computer Speech & Language 15 (1), pp. 55–73. Cited by: §2.
  • [51] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §4.2, §6.4.
  • [52] T. Tieleman (2008) Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pp. 1064–1071. Cited by: §1, §6.4.
  • [53] D. Tran, K. Vafa, K. K. Agrawal, L. Dinh, and B. Poole (2019) Discrete flows: invertible generative models of discrete data. arXiv preprint arXiv:1905.10347. Cited by: §1.
  • [54] Z. Tu (2007) Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2, §3.5.
  • [55] D. Vats and C. Knudson (2018) Revisiting the gelman-rubin diagnostic. arXiv preprint arXiv:1812.09384. Cited by: §4.1.
  • [56] B. Wang and Z. Ou (2018) Learning neural trans-dimensional random field language models with noise-contrastive estimation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6134–6138. Cited by: §2.
  • [57] J. Xie, Y. Lu, S. Zhu, and Y. Wu (2016) A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644. Cited by: §1, §3.2.
  • [58] J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §1.

6 Appendix

6.1 Change of variable

Under the invertible transformation x = g(z), let q_0(z) be the density of z, and let q(x) be the density of x. Let Δz be an infinitesimal neighborhood around z, and let Δx be an infinitesimal neighborhood around x, so that g maps Δz to Δx, and g^{-1} maps Δx to Δz. Then

Pr(z ∈ Δz) = Pr(x ∈ Δx),  (11)

Pr(z ∈ Δz) ≈ q_0(z) |Δz|, and Pr(x ∈ Δx) ≈ q(x) |Δx|, where |Δz| and |Δx| are the volumes of Δz and Δx respectively. Thus we have

q(x) = q_0(z) |Δz| / |Δx|,  (12)

where we ignore the higher-order o(|Δz|) and o(|Δx|) terms. This is the meaning of

q(x) = q_0(z) |dz| / |dx| = q_0(g^{-1}(x)) |det(∂g^{-1}(x)/∂x)|,  (13)

where |dz|/|dx| or |det(∂g^{-1}(x)/∂x)| is the absolute value of the determinant of the Jacobian of g^{-1}.

Equation (13) is a convenient starting point for deriving densities under change of variable.

6.2 Energy-based correction and change of variable for generator model

The generator model is of the form x = g_α(z) + ε, where z ∼ N(0, I_d), ε ∼ N(0, σ²I_D), D is the dimensionality of x, and d < D is the dimensionality of the latent vector z. Unlike the flow-based model, the marginal distribution of x involves an intractable integral.

We shall study the exponential tilting of the generator model using the simple equation (13) for change of variable. To that end, we let h = (z, ε), and let y = (x, z). Then

y = (g_α(z) + ε, z) = G(h)  (14)

is an invertible mapping between h and y. Let q(h) be the Gaussian white noise distribution of h = (z, ε) under the generator model. Let q(y) be the distribution of y = (x, z) under the generator model. Consider the change of variable between h and y. In parallel to equation (13), we have

q(y) = q(G^{-1}(y)) |det(∂G^{-1}(y)/∂y)|.  (15)

The marginal distribution q(x) = ∫ q(x, z) dz, which is intractable.

Suppose we exponentially tilt q(x) to

p_θ(x) = (1/Z(θ)) exp(f_θ(x)) q(x).  (16)

Again this can be translated into the space of y = (x, z), so that under p_θ,

p_θ(y) = (1/Z(θ)) exp(f_θ(x)) q(y).  (17)

Combining equations (15), (16), and (17), we have

p_θ(h) = (1/Z(θ)) exp(f_θ(x)) q(h),  (18)

that is, under the tilted model p_θ,

p_θ(z, ε) = (1/Z(θ)) exp(f_θ(g_α(z) + ε)) q(z, ε).  (19)

For x alone, the marginal distribution p_θ(x) cannot be obtained in closed form. In particular, for gradient-based sampling, we need to compute

∂/∂x log p_θ(x) = ∂/∂x f_θ(x) + ∂/∂x log q(x),  (20)
∂/∂x log q(x) = ∂/∂x log ∫ q(x | z) q(z) dz  (21)
             = E_{q(z|x)} [∂/∂x log q(x | z)].  (22)

That is, there is an inner loop for approximating the posterior expectation in (22). This is less convenient than the flow-based model.

6.3 Model architectures

For the Glow model q_α, we follow the setting of [31], with the per-dataset hyper-parameters listed below.

For the EBM correction f_θ, we use the ConvNet structures in Tables 3 and 4.

We use the following notation: conv(n) denotes a convolutional operation with n output feature maps and a bias term. We recruit the LipSwish [8] nonlinearity, where Swish [48] serves as the underlying activation function.

Specifically, we use the following hyper-parameters:

  1. MNIST: For Glow, , , . For EBM, .

  2. SVHN: For Glow, , , . For EBM, .

  3. CelebA: For Glow, , , . For EBM, .

  4. CIFAR-10: For Glow, , , . For EBM, .

Energy-based Model
Layers In-Out Size Stride
Input
conv(), 1
conv(), 2
conv(), 2
conv(), 2
conv(1) 1
Table 3: Network structures for EBM with data-space .
Energy-based Model
Layers In-Out Size Stride
Input
conv(), 1
conv(), 2
conv(), 2
conv(), 2
conv(), 2
conv(1) 1
Table 4: Network structures for EBM with data-space .

6.4 Training

Data. The training images are resized and scaled. We use 60,000, 70,000, 30,000, and 50,000 observed examples for MNIST, SVHN, CelebA, and CIFAR-10, respectively.

Optimization. The network parameters are initialized with Xavier initialization [16] and optimized using Adam [29]. For NT-EBM, separate learning rates and a fixed batch size are used for MNIST, SVHN, CelebA, and CIFAR-10. For NCE-EBM, separate learning rates and a fixed batch size are used for MNIST, SVHN, and CelebA. For both NT-EBM and NCE-EBM, training runs for a fixed maximum number of parameter updates.

HMC. We run Hamiltonian Monte Carlo (HMC) with persistent chains [52] initialized from the prior q_0(z), with a fixed number of MCMC steps and leapfrog integrator steps per update of the parameters θ. The discretization step-size follows a simple adaptive policy that multiplicatively increases or decreases the step-size of the inner kernel based on the value of the Metropolis-Hastings acceptance rate [2]. The target acceptance rate is set following [5]. Figure 8 depicts the MH acceptance rate and the adaptive step-size over time.
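A multiplicative adaptation policy of the kind described above can be sketched as follows; the target acceptance rate of 0.65 and the adaptation factor are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of a simple adaptive step-size policy: the HMC step size is
# multiplicatively increased when the Metropolis-Hastings acceptance rate is
# above the target and decreased when it is below.
def adapt_step_size(step_size, acceptance_rate, target=0.65, factor=1.02):
    if acceptance_rate > target:
        return step_size * factor       # accepting too often: take bolder steps
    return step_size / factor           # accepting too rarely: take smaller steps

# usage inside the training loop (hypothetical variable names):
# accept_rate = accepted.float().mean().item()
# hmc_step = adapt_step_size(hmc_step, accept_rate)
```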

Figure 8: Metropolis-Hastings acceptance rate (top) and adaptive step-size (bottom) over time.

FID. The Fréchet Inception Distance (FID) [21] with Inception v3 classifier [51] was computed on generated examples with observed examples as reference.

6.5 Synthesis

Figure 9 depicts samples from the pre-trained flow q_α and samples from p_θ learned by neural transport MCMC on the CIFAR-10 dataset.

(a) Samples drawn from the flow q_α by ancestral sampling.
(b) Samples drawn from p_θ by Hamiltonian neural transport.
Figure 9: Generated samples from a model learned by NT-EBM on CIFAR-10.

6.6 Noise contrastive estimation

For numerical stability, the noise contrastive estimation objective is rewritten in an equivalent, numerically stable form.

Figure 10 depicts samples from q_α (left) and samples from p_θ learned by our NCE algorithm, for which sampling is performed using Hamiltonian neural transport (right), on the CelebA dataset.

(a) Samples drawn from the flow q_α by ancestral sampling.
(b) Samples drawn from p_θ by Hamiltonian neural transport.
Figure 10: Generated samples from a model learned by NCE on CelebA.

We learn p_θ with our NCE-EBM algorithm on the SVHN [45] dataset and investigate the possibility of sampling from the learned model by Hamiltonian neural transport. We run a Markov chain for a large number of steps with Hamiltonian neural transport, pull it back into data space, and visualize the long-run trajectory in Figure 11. Note that the long-run Markov chain synthesizes realistic images with high diversity.

Figure 11: A single long-run Markov chain sampled by Hamiltonian neural transport for a model learned by NCE on SVHN, depicted at regular step intervals.

Notice that NCE is the discriminator side of GAN. Similar to GAN, we can also improve the flow-based model based on the value function of GAN. This may further improve the NCE results.

6.7 Sampling in data space

In Section 4.1, we analyzed the quality of mixing of multiple Markov chains based on the Gelman-Rubin statistic R̂. Recall that a value of R̂ close to 1 is considered approximate convergence [7]. We concluded that sampling with Hamiltonian neural transport exhibits a strong indication of mixing. In Figure 12, we contrast this result with HMC sampling in data space, which shows unfavorable mixing diagnostics.

(a) Gelman-Rubin statistic R̂ for the convergence of multiple long-run Markov chains, where R̂ close to 1 indicates approximate convergence.
(b) Auto-correlation of a single long-run Markov chain over time lag τ, with the mean depicted as a line and min/max as bands.
Figure 12: Diagnostics for the mixing of MCMC chains in data space.