1 Introduction
The energy-based model (EBM) LeCun et al. (2006); Ngiam et al. (2011); Kim and Bengio (2016); Zhao et al. (2016); Xie et al. (2016); Gao et al. (2018); Kumar et al. (2019b); Nijkamp et al. (2019); Du and Mordatch (2019); Finn et al. (2016) defines an unnormalized probability density function on observed data, such as images, via an energy function, so that the density is proportional to the exponential of the negative energy. Taking advantage of the approximation capacity of modern deep networks such as the convolutional network (ConvNet) LeCun et al. (1998); Krizhevsky et al. (2012), recent papers Xie et al. (2016); Gao et al. (2018); Kumar et al. (2019b); Nijkamp et al. (2019); Du and Mordatch (2019) parametrize the energy function by a ConvNet. The ConvNet-EBM is highly expressive, and the learned EBM can produce realistic synthesized examples.

The EBM can be learned by maximum likelihood, and the gradient-based maximum likelihood learning algorithm follows an "analysis by synthesis" scheme. In the synthesis step, synthesized examples are generated by sampling from the current model. In the analysis step, the model parameters are updated based on the statistical difference between the synthesized examples and the observed examples. The synthesis step usually requires Markov chain Monte Carlo (MCMC) sampling, and gradient-based samplers such as Langevin dynamics Langevin (1908) or Hamiltonian Monte Carlo (HMC) Neal (2011) can be conveniently implemented on current deep learning platforms, where gradients are computed efficiently and automatically by back-propagation.
However, gradient-based MCMC sampling in the data space generally does not mix. The data distribution is typically highly multi-modal. To approximate such a distribution, the density function or the energy function of the ConvNet-EBM needs to be highly multi-modal as well. When sampling from such a multi-modal density in the data space, gradient-based MCMC tends to get trapped in local modes with little chance of traversing the modes freely, rendering the MCMC non-mixing. Without the ability to generate fair examples from the model, the estimated gradient of the maximum likelihood learning can be very biased, and the learned model can be far from the maximum likelihood estimate (MLE). Even if we can learn the model by other means without resorting to MCMC sampling, e.g., by noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010); Gao et al. (2019), it is still necessary to be able to draw fair examples from the learned model for the purpose of model checking or downstream applications based on the learned model.

Accepting the fact that MCMC sampling does not mix, contrastive divergence Tieleman (2008) initializes a finite-step MCMC from the observed examples, so that the learned model is admittedly biased from the MLE. Recently, Nijkamp et al. (2019) proposed to initialize short-run MCMC from a fixed noise distribution, and showed that even though the learned EBM is biased, the short-run MCMC can be considered a valid model that generates realistic examples. This partially explains why EBM learning algorithms can synthesize high-quality examples even though the MCMC does not mix. However, the problem of non-mixing MCMC remains unsolved. Without proper MCMC sampling, the theory and practice of learning EBMs rest on shaky ground. The goal of this paper is to address this problem.

We propose to learn the EBM with a flow-based model as a backbone, so that the EBM is in the form of a correction, or an exponential tilting, of the flow-based model. Flow-based models have gained popularity in generative modeling Dinh et al. (2014, 2016); Kingma and Dhariwal (2018); Grathwohl et al. (2018); Behrmann et al. (2018); Kumar et al. (2019a); Tran et al. (2019) and variational inference Kingma and Welling (2013); Rezende and Mohamed (2015); Kingma et al. (2016); Kingma and Welling (2014); Khemakhem et al. (2019). Similar to the generator model Kingma and Welling (2013); Goodfellow et al. (2014), the flow-based model is based on a mapping from the latent space to the data space. However, unlike the generator model, the mapping in the flow-based model is one-to-one, with closed-form inversion and Jacobian. This leads to an explicit normalized density via change of variable. However, to ensure a tractable inversion and Jacobian, the mapping in the flow-based model has to be a composition of a sequence of simple transformations of highly constrained forms, and approximating a complex distribution requires composing a large number of such transformations. In our work, we propose to learn the EBM by correcting a relatively simple flow-based model with a relatively simple energy function parametrized by a free-form ConvNet. We show that the resulting EBM has a particularly simple form in the space of the latent variables of the flow-based model. MCMC sampling of the EBM in the latent space, which is a simple special case of neural transport MCMC Hoffman et al. (2019), mixes well and is able to traverse modes in the data space. This enables proper sampling and learning of the EBM.

Our experiments show that it is possible to learn an EBM with a flow-based backbone, and that neural transport sampling of the learned EBM solves, or greatly mitigates, the non-mixing problem of MCMC.
2 Contributions and related work
Contributions. This paper tackles the problem of non-mixing of MCMC for sampling from an EBM. We propose to learn the EBM with a flow-based backbone model. The resulting EBM in the latent space is of a simple form that is much more friendly to MCMC mixing.
The following are research themes in generative modeling and MCMC sampling that are closely related to our work.
Neural transport MCMC. Our work is inspired by the neural transport sampler Hoffman et al. (2019). For an unnormalized target distribution, the neural transport sampler trains a flow-based model as a variational approximation to the target distribution, and then samples the target distribution in the space of the latent variables of the flow-based model via change of variable. In the latent space, the target distribution is close to the prior distribution of the latent variables of the flow-based model, which is usually a unimodal Gaussian white noise distribution. Consequently, the target distribution in the latent space is close to unimodal and is much more conducive to the mixing and fast convergence of MCMC than sampling in the original space Mangoubi and Smith (2017).

Our work is a simplified special case of this idea: we learn the EBM as a correction of a pre-trained flow-based model, so that we do not need to train a separate flow-based approximation to the EBM. The energy function, being a correction of the flow-based model, does not need to reproduce the content of the flow-based model, and thus can be kept relatively simple. Moreover, in the latent space, the resulting EBM takes on a very simple form in which the inversion and Jacobian of the flow-based model disappear. This may allow more free-form flow-based models whose inversion and Jacobian need not be in closed form Grathwohl et al. (2018); Behrmann et al. (2018).
Energy-based corrections. Our model is based on an energy-based correction, or an exponential tilting, of a more tractable model. This idea has been explored in noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010); Gao et al. (2019) and introspective neural networks (INN) Tu (2007); Jin et al. (2017); Lazarow et al. (2017), where the correction is obtained by discriminative learning. Earlier works include Rosenfeld et al. (2001); Wang and Ou (2018). Correcting or refining a simpler and more tractable backbone model can be much easier than learning an EBM from scratch, because the EBM does not need to reproduce the knowledge already captured by the backbone model. It also allows easier sampling of the EBM.

Latent space sampling. Non-mixing MCMC sampling of an EBM is a clear call for latent variables that represent the multiple modes of the model distribution via an explicit top-down mapping, so that the distribution of the latent variables is less multi-modal. Earlier work in this direction includes Bengio et al. (2013); Brock et al. (2018); Kumar et al. (2019b). In this paper, we choose the flow-based model for its simplicity: the distribution in the data space translates into the distribution in the latent space by a simple change of variable, without requiring the integration over extra dimensions needed in the generator model.
3 Model and learning
3.1 Flow-based model
Let $x$ be the input example, such as an image. A flow-based model is of the form

(1) $x = g_\alpha(z), \quad z \sim q_0(z),$

where $z$ is the latent vector of the same dimensionality as $x$, and $q_0(z)$ is a known prior distribution such as the Gaussian white noise distribution. $g_\alpha$ is a composition of a sequence of invertible transformations whose inversions and log-determinants of the Jacobians can be obtained in closed form; as a result, these transformations are of highly constrained forms. $\alpha$ denotes the parameters. Let $q_\alpha(x)$ be the probability density of $x$ under the transformation $g_\alpha$. Then, according to the change of variable,

(2) $q_\alpha(x)\,|dx| = q_0(z)\,|dz|,$

where $|dx|$ and $|dz|$ are understood as the volumes of the infinitesimal local neighborhoods around $x$ and $z$ respectively under the mapping $x = g_\alpha(z)$. Then for a given $x$, $z = g_\alpha^{-1}(x)$, and

(3) $q_\alpha(x) = q_0\big(g_\alpha^{-1}(x)\big) \left| \det\left( \frac{\partial g_\alpha(z)}{\partial z} \right) \right|^{-1},$

where the ratio between the volumes, $|dz|/|dx|$, is the absolute value of the determinant of the Jacobian.
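To make (1)-(3) concrete, the following is a minimal sketch, ours rather than the paper's implementation, in which a toy element-wise affine map stands in for $g_\alpha$; all function names are illustrative.

```python
# Minimal sketch of the change-of-variable density in eq. (3), with a toy
# element-wise affine map x = exp(s) * z + t standing in for g_alpha.
import numpy as np

def g(z, s, t):
    """Toy invertible map x = g(z) = exp(s) * z + t (element-wise)."""
    return np.exp(s) * z + t

def g_inv(x, s, t):
    """Closed-form inverse z = g^{-1}(x)."""
    return (x - t) * np.exp(-s)

def log_q0(z):
    """Log-density of the standard Gaussian prior q0(z)."""
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)

def log_q_alpha(x, s, t):
    """log q_alpha(x) = log q0(g^{-1}(x)) - log |det(dx/dz)|.
    For this element-wise map, log |det(dx/dz)| = sum(s)."""
    z = g_inv(x, s, t)
    return log_q0(z) - np.sum(s)

# Example: evaluate the density of x = g(z) for a random z.
rng = np.random.default_rng(0)
z, s, t = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(4)
print(log_q_alpha(g(z, s, t), s, t))
```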
Suppose we observe training examples $x_i \sim p_{\rm data}$, $i = 1, \dots, n$, where $p_{\rm data}$ is the data distribution, which is typically highly multi-modal. We can learn $\alpha$ by MLE. For large $n$, the MLE of $\alpha$ approximately minimizes the Kullback-Leibler divergence $D_{\rm KL}(p_{\rm data} \| q_\alpha)$. $q_\alpha$ strives to cover most of the modes of $p_{\rm data}$, and the learned $q_\alpha$ tends to be more dispersed than $p_{\rm data}$. In order for $q_\alpha$ to approximate $p_{\rm data}$ closely, it is usually necessary for $g_\alpha$ to be a composition of a large number of transformations of highly constrained forms with closed-form inversions and Jacobians. The learned mapping $g_\alpha$ transports the unimodal Gaussian white noise distribution $q_0(z)$ to a highly multi-modal distribution in the data space as an approximation to the data distribution $p_{\rm data}$.

3.2 Energy-based model
An energy-based model (EBM) is defined as follows:

(4) $\pi_\theta(x) = \frac{1}{Z(\theta)} \exp\big(f_\theta(x)\big)\, q(x),$

where $q(x)$ is a reference measure, such as the uniform measure or a Gaussian white noise distribution as in Xie et al. (2016). $f_\theta(x)$ is defined by a bottom-up ConvNet whose parameters are denoted by $\theta$. The normalizing constant, or partition function, $Z(\theta) = \int \exp(f_\theta(x))\, q(x)\, dx$ is typically analytically intractable.
Suppose we observe training examples $x_i \sim p_{\rm data}$ for $i = 1, \dots, n$. For large $n$, the sample average over $\{x_i\}$ approximates the expectation with respect to $p_{\rm data}$. For notational convenience, we treat the sample average and the expectation as the same.
The log-likelihood is

(5) $L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \pi_\theta(x_i).$

Since $\frac{\partial}{\partial \theta} \log Z(\theta) = \mathbb{E}_{\pi_\theta}\big[\frac{\partial}{\partial \theta} f_\theta(x)\big]$, the derivative of the log-likelihood is

(6) $L'(\theta) = \mathbb{E}_{p_{\rm data}}\left[\frac{\partial}{\partial \theta} f_\theta(x)\right] - \mathbb{E}_{\pi_\theta}\left[\frac{\partial}{\partial \theta} f_\theta(x)\right] \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} f_\theta(x_i) - \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta} f_\theta(\tilde{x}_i),$

where $\tilde{x}_i$ for $i = 1, \dots, m$ are synthesized examples sampled from the current model $\pi_\theta$.

The above equation leads to the "analysis by synthesis" learning algorithm. At iteration $t$, let $\theta_t$ be the current model parameters. We generate $\tilde{x}_i \sim \pi_{\theta_t}$ for $i = 1, \dots, m$. Then we update $\theta_{t+1} = \theta_t + \eta_t L'(\theta_t)$, where $\eta_t$ is the learning rate.
To generate synthesized examples from $\pi_\theta$, we can use gradient-based MCMC sampling such as Langevin dynamics Langevin (1908) or Hamiltonian Monte Carlo (HMC) Neal (2011), where $\frac{\partial}{\partial x} f_\theta(x)$ can be computed automatically. Since $p_{\rm data}$ is in general highly multi-modal, the learned $f_\theta(x)$ or $\pi_\theta(x)$ tends to be multi-modal as well. As a result, gradient-based MCMC tends to get trapped in the local modes of $\pi_\theta(x)$ with little chance of mixing between the modes.
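The following is a minimal sketch of this learning loop with Langevin dynamics as the sampler; it is ours, not the paper's implementation, and the network `f`, the step sizes, and the step counts are illustrative assumptions.

```python
# Sketch of the "analysis by synthesis" update of eq. (6) with Langevin
# sampling. `f` is any ConvNet mapping a batch of images to per-example
# scalars f_theta(x); `opt` is an optimizer over its parameters.
import torch

def langevin(f, x, n_steps=100, step=0.01):
    """x_{k+1} = x_k + (step^2 / 2) * grad_x f(x_k) + step * noise."""
    x = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(f(x).sum(), x)[0]
        x = (x + 0.5 * step ** 2 * grad
             + step * torch.randn_like(x)).detach().requires_grad_(True)
    return x.detach()

def learning_step(f, opt, x_data, x_init):
    # Synthesis step: sample from the current model.
    x_syn = langevin(f, x_init)
    # Analysis step: descending this loss ascends L(theta) in eq. (6),
    # raising f on observed data and lowering it on synthesized data.
    loss = f(x_syn).mean() - f(x_data).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return x_syn
```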
3.3 Energy-based model with flow-based backbone
Instead of using the uniform or Gaussian white noise distribution as the reference distribution $q(x)$ in the EBM in (4), we can use a relatively simple flow-based model $q_\alpha(x)$ as the reference model. $q_\alpha(x)$ can be pre-trained by MLE and serves as the backbone of the model, so that the model is of the following form:

(7) $\pi_\theta(x) = \frac{1}{Z(\theta)} \exp\big(f_\theta(x)\big)\, q_\alpha(x),$

which is almost the same as (4), except that the reference distribution is the pre-trained flow-based model $q_\alpha(x)$. The resulting model is a correction or refinement of $q_\alpha(x)$, or an exponential tilting of $q_\alpha(x)$, and $f_\theta(x)$ is a free-form ConvNet that parametrizes the correction. The overall negative energy is $f_\theta(x) + \log q_\alpha(x)$.
In the latent space of $z$, let $p_\theta(z)$ be the distribution of $z$ under $\pi_\theta(x)$ via $x = g_\alpha(z)$. Then

(8) $p_\theta(z) = \pi_\theta\big(g_\alpha(z)\big) \left| \det\left( \frac{\partial g_\alpha(z)}{\partial z} \right) \right|.$

Recalling equation (2), $q_\alpha(x)\,|dx| = q_0(z)\,|dz|$, we have

(9) $p_\theta(z) = \frac{1}{Z(\theta)} \exp\big(f_\theta(g_\alpha(z))\big)\, q_0(z).$

$p_\theta(z)$ is an exponential tilting of the prior noise distribution $q_0(z)$. It is a very simple form that involves neither the Jacobian nor the inversion of $g_\alpha$.
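In code, the latent-space target of (9) reduces to a few lines. The following sketch assumes a pre-trained flow object `glow` with push-forward `glow.g` and a correction network `f_theta`; both names are ours, not the paper's.

```python
# Sketch of the latent-space negative energy of eq. (9):
# log p_theta(z) = f_theta(g_alpha(z)) + log q0(z) - log Z(theta),
# with no Jacobian or inversion terms.
import torch

def neg_energy_z(z, glow, f_theta):
    # Standard Gaussian prior q0(z), up to an additive constant.
    log_q0 = -0.5 * (z ** 2).sum(dim=tuple(range(1, z.dim())))
    return f_theta(glow.g(z)) + log_q0
```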
We can also apply the above exponential tilting and change of variable scheme to the generator model, i.e., using the generator model as the backbone model. However, for the generator model, the marginal density is not in closed form, and after exponential tilting, the marginal distribution of the latent vector requires an intractable integral. See Appendix 6.2 for details. In comparison, the flow-based model is simpler and more explicit.
3.4 Learning by Hamiltonian neural transport sampling
Instead of sampling $\pi_\theta(x)$ in the data space, we can sample $p_\theta(z)$ in (9) in the latent space. While $q_\alpha(x)$ is multi-modal, $q_0(z)$ is unimodal. Since $\pi_\theta(x)$ is a correction of $q_\alpha(x)$, $p_\theta(z)$ is a correction of $q_0(z)$, and can be much less multi-modal than $\pi_\theta(x)$ in the data space. After sampling $z \sim p_\theta(z)$, we can generate $x = g_\alpha(z)$.
The above MCMC sampling scheme is a special case of the neural transport MCMC proposed by Hoffman et al. (2019) for sampling from an EBM or the posterior distribution of a generative model. The basic idea is to train a flow-based model as a variational approximation to the target EBM, and to sample the EBM in the latent space of the flow-based model. In our case, since $\pi_\theta(x)$ is a correction of $q_\alpha(x)$, we can simply use $g_\alpha$ directly as the approximate flow-based model of the neural transport sampler. The extra benefit is that the latent-space distribution $p_\theta(z)$ is of an even simpler form than $\pi_\theta(x)$, because $p_\theta(z)$ does not involve the inversion and Jacobian of $g_\alpha$. As a result, we may use a flow-based backbone of a more free form, such as one based on a residual network Behrmann et al. (2018); we leave this issue to future investigation.
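A minimal single-chain sketch of the latent-space HMC sampler is given below, reusing the `neg_energy_z` function sketched in Section 3.3; the step size and leapfrog count are illustrative assumptions, not the paper's settings.

```python
# Sketch of one leapfrog-HMC transition targeting p_theta(z) of eq. (9)
# for a single chain; accepted samples are pushed forward by x = g_alpha(z).
import torch

def hmc_step(z, neg_energy, step=0.05, n_leap=10):
    def U(z):  # potential energy = -log p_theta(z), up to a constant
        return -neg_energy(z).sum()
    def grad_U(z):
        z = z.detach().requires_grad_(True)
        return torch.autograd.grad(U(z), z)[0]
    p0 = torch.randn_like(z)                    # resample momentum
    zn, p = z.clone(), p0 - 0.5 * step * grad_U(z)
    for i in range(n_leap):                     # leapfrog integration
        zn = zn + step * p
        if i < n_leap - 1:
            p = p - step * grad_U(zn)
    p = p - 0.5 * step * grad_U(zn)
    # Metropolis-Hastings correction on the Hamiltonian.
    H0 = U(z) + 0.5 * (p0 ** 2).sum()
    H1 = U(zn) + 0.5 * (p ** 2).sum()
    accept = torch.rand(()) < torch.exp(H0 - H1)
    return zn.detach() if accept else z
```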
3.5 Learning by noise contrastive estimation
We may also learn the correction $f_\theta(x)$ discriminatively, as in noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010) or introspective neural networks (INN) Tu (2007); Jin et al. (2017); Lazarow et al. (2017). Let $\{x_i\}$ be the training examples, which are treated as positive examples, and let $\{\tilde{x}_i\}$ be the examples generated from $q_\alpha(x)$, which are treated as negative examples. For each batch, let $\rho$ be the proportion of positive examples, and $1 - \rho$ the proportion of negative examples. Then

(10) $P(\text{positive} \mid x) = \frac{\rho\, \pi_\theta(x)}{\rho\, \pi_\theta(x) + (1 - \rho)\, q_\alpha(x)} = \frac{1}{1 + \exp\big({-(f_\theta(x) + c)}\big)},$

where $c = \log \frac{\rho}{1 - \rho} - \log Z(\theta)$ is treated as a separate bias parameter. Then we can estimate $\theta$ and $c$ by fitting a logistic regression on the positive and negative examples.
Note that NCE is the discriminator side of GAN. Similar to GAN, we can also improve the flow-based model based on the value function of GAN, which may further improve the NCE results.
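Since (10) is a logistic regression with logit $f_\theta(x) + c$, the objective can be implemented with a standard binary cross-entropy loss. The sketch below is ours and assumes equal batch proportions $\rho = 1/2$, with the class prior absorbed into the learned scalar bias `c`.

```python
# Sketch of the NCE objective of eq. (10): classify observed data
# (positives) against samples from the flow q_alpha (negatives) with
# logit f_theta(x) + c, where c absorbs -log Z(theta) and the prior.
import torch
import torch.nn.functional as F

def nce_loss(f_theta, c, x_pos, x_neg):
    logit_pos = f_theta(x_pos) + c   # positives: observed examples
    logit_neg = f_theta(x_neg) + c   # negatives: samples from q_alpha
    return (F.binary_cross_entropy_with_logits(logit_pos, torch.ones_like(logit_pos))
            + F.binary_cross_entropy_with_logits(logit_neg, torch.zeros_like(logit_neg)))
```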
4 Experiments
In the subsequent empirical evaluations, we shall address the following questions:
(1) Does HMC with neural transport mix, both qualitatively and quantitatively?
(2) Does the exponential tilting with the correction term improve the quality of synthesis?
(3) Does the latent space admit smooth interpolation?
(4) As an ablation, what is the effect of the number of parameters of the flow-based backbone $q_\alpha$?
(5) Is discriminative learning in the form of NCE an efficient alternative learning method?
4.1 Mixing
Gelman-Rubin. The Gelman-Rubin statistic Gelman et al. (1992); Brooks and Gelman (1998) measures the convergence of Markov chains to the target distribution. It is based on the notion that if multiple chains have converged, they should, by definition, appear "similar" to one another; otherwise, one or more chains have failed to converge. Specifically, the diagnostic uses an analysis of variance to assess the difference between the between-chain and within-chain variances.

Let $\pi$ denote the target distribution with mean $\mu$ and variance $\sigma^2$. Gelman et al. (1992) designs two estimators of $\sigma^2$ and compares the square root of their ratio to $1$. Consider $m$ Markov chains, each of length $n$. Let $W$ be the average within-chain variance; $W$ underestimates $\sigma^2$ due to the positive correlation within each Markov chain. Let $\hat{\sigma}^2_+ = \frac{n-1}{n} W + \frac{B}{n}$ be a mixture of the within-chain variance $W$ and the between-chain variance $B$; $\hat{\sigma}^2_+$ overestimates $\sigma^2$ if an over-dispersed initial distribution for the Markov chains is used Gelman et al. (1992). That is, $W$ underestimates $\sigma^2$ while $\hat{\sigma}^2_+$ overestimates it. Both estimators are consistent for $\sigma^2$ as $n \to \infty$ Vats and Knudson (2018). In light of this, the Gelman-Rubin statistic monitors convergence as the ratio $\hat{R} = \sqrt{\hat{\sigma}^2_+ / W}$. If all chains converge to $\pi$, then $\hat{R} \to 1$ as $n \to \infty$; before that, $\hat{R} > 1$. The heuristic $\hat{R} < 1.2$ indicates approximate convergence Brooks and Gelman (1998). Figure 1(a) depicts $\hat{R}$ for multiple long-run chains after burn-in; the mean value of $\hat{R}$ is close to 1, which we treat as approximate convergence to the target distribution.

Auto-correlation. MCMC sampling produces auto-correlated samples due to the inherent Markovian dependence structure. The (sample) auto-correlation is the correlation between samples $t$ steps apart in time. Figure 1(b) shows the auto-correlation against increasing time lag $t$. The auto-correlation of Hamiltonian Markov chains with neural transport vanishes within far fewer steps than that of the over-damped Langevin sampler. This finding for a single long-run Markov chain is consistent with the Gelman-Rubin statistic assessing multiple Markov chains.
Visual inspection. Suppose a Markov chain is run for a large number of steps with Hamiltonian neural transport. We then pull the Markov chains back into the data space and visualize the long-run trajectories in Figure 3, with $\pi_\theta$ learned on the SVHN Netzer et al. (2011) and CelebA Liu et al. (2015) datasets. We observe that the Markov chain traverses between local modes, which we consider a weak indication of mixing. Figure 4 contrasts a Markov chain sampling the EBM learned with short-run MCMC Nijkamp et al. (2019), which does not mix, against our method, in which the pulled-back chain mixes freely.
4.2 Synthesis
We evaluate the quality of synthesis on four datasets: MNIST LeCun et al. (2010), SVHN Netzer et al. (2011), CelebA Liu et al. (2015), and CIFAR-10 Krizhevsky et al. The qualitative results are depicted in Figure 6, which contrasts samples generated from Glow against long-run Markov chains obtained by Hamiltonian neural transport. Table 1 compares the Fréchet Inception Distance (FID) Heusel et al. (2017), computed with the Inception v3 classifier Szegedy et al. (2016), on generated examples. Both qualitatively and quantitatively, we observe a significant improvement in the quality of synthesis from exponentially tilting the reference distribution $q_\alpha(x)$ by the correction $f_\theta(x)$.

Table 1: FID scores for generated examples (lower is better).

Method | MNIST | SVHN | CelebA | CIFAR-10
VAE Kingma and Welling (2013) | 32.86 | 49.72 | 48.27 | 106.37
ABP Han et al. (2017) | 39.12 | 48.65 | 51.92 | 114.13
Glow (MLE) Kingma and Dhariwal (2018) | 66.04 | 94.23 | 59.35 | 90.08
NCE-EBM (ours) | 36.52 | 79.84 | 51.73 | —
NT-EBM (ours) | 21.32 | 48.01 | 46.38 | 78.12
4.3 Interpolation
Interpolation allows us to appraise the smoothness of the latent space. In particular, two samples $z_1$ and $z_2$ are drawn from the prior distribution $q_0(z)$. We may spherically interpolate between them in the latent space and then push forward into the data space to assess $g_\alpha$. To evaluate the tilted model $p_\theta(z)$, we run a magnetized form of the over-damped Langevin equation, in which we alter the negative energy $h(z) = f_\theta(g_\alpha(z)) + \log q_0(z)$ to $h(z) - \lambda \|z - z_2\|$ with magnetization constant $\lambda$ Hill et al. (2019). Note that $\nabla_z \|z - z_2\| = (z - z_2)/\|z - z_2\|$; thus, the magnetization term introduces a vector field pointing with uniform strength towards $z_2$. The resulting Langevin equation is $dz_t = \big( \nabla_z h(z_t) - \lambda \frac{z_t - z_2}{\|z_t - z_2\|} \big)\, dt + \sqrt{2}\, dW_t$ with Wiener process $W_t$. To find a low-energy path from $z_1$ towards $z_2$, we initialize at $z_1$, choose a small $\lambda$, and perform steps of the discretized, magnetized Langevin equation with a small step size. Figure 7 depicts the low-energy path in data space and the energy over time. The qualitatively smooth interpolation and narrow energy spectrum indicate that Langevin dynamics in the latent space (with small magnetization) is able to traverse two arbitrary local modes, substantiating our claim that the underlying geometry is amenable to mixing.
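A minimal sketch of the magnetized sampler follows; the names and constants are ours, and `neg_energy` is the latent-space negative energy $h(z)$ sketched in Section 3.3.

```python
# Sketch of magnetized Langevin interpolation: the negative energy h(z)
# is altered to h(z) - lam * ||z - z2||, whose gradient adds a
# unit-strength pull toward z2.
import torch

def magnetized_langevin(neg_energy, z1, z2, lam=0.1, step=0.01, n_steps=2000):
    z = z1.detach().clone()
    path = [z.clone()]
    for _ in range(n_steps):
        z = z.detach().requires_grad_(True)
        h = neg_energy(z).sum() - lam * (z - z2).norm()
        grad = torch.autograd.grad(h, z)[0]
        z = z + 0.5 * step ** 2 * grad + step * torch.randn_like(z)
        path.append(z.detach().clone())
    return path  # a low-energy path from z1 toward z2
```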
4.4 Ablation
We investigate the influence of the number of parameters of the flow-based backbone $q_\alpha(x)$ on the quality of synthesis. Specifically, we examine (1) the threshold at which a "large" $q_\alpha$ learned by MLE outperforms neural transport with a "small" tilted backbone, and (2) the minimal size of $q_\alpha$ that allows learning by our method. Our method with a "medium"-sized backbone significantly outperforms the "largest" Glow baseline.
Table 2: Ablation over the capacity of the Glow backbone $q_\alpha$ (FID; lower is better).

Method | Small | Medium | Large | Largest
Glow (MLE) Kingma and Dhariwal (2018) | 110.55 | 94.34 | 89.31 | 86.18
NT-EBM (ours) | 74.77 | 48.01 | 43.82 | —
4.5 Noise Contrastive Estimation
Noise contrastive estimation (NCE) is an efficient alternative to our neural transport learning method. We learn the correction $f_\theta(x)$ according to (10), while sampling from the learned model with neural transport MCMC. Table 1 compares the models obtained with both learning methods. The long-run MCMC chains in models learned by NCE are conducive to mixing and remain of high visual quality; see Appendix 6.6. We leave improvements of this method to future investigation.
5 Conclusion
This paper proposes to learn an EBM as a correction, or exponential tilting, of a flow-based model, so that neural transport MCMC sampling in the latent space of the flow-based model can mix well and traverse the modes in the data space.
Energy-based correction of a more tractable backbone model is a general modeling strategy that goes beyond correcting the flow-based model. Consider a latent EBM such as the Boltzmann machine Ackley et al. (1985), which is an undirected graphical model with a simple energy function defined on both the observed variables and multiple layers of latent variables. Instead of learning a latent EBM from scratch, we may learn it as a correction of a top-down generation model such as the one in the Helmholtz machine Hinton et al. (1995), to correct for the conditional independence assumptions in the top-down model. We shall investigate this problem in future work.

Acknowledgments
The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and XSEDE grant ASC170063. We thank Matthew D. Hoffman, Diederik P. Kingma, Alexander A. Alemi, and Will Grathwohl for helpful discussions.
References
[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9 (1), pp. 147–169.
[2] C. Andrieu and J. Thoms (2008) A tutorial on adaptive MCMC. Statistics and Computing 18 (4), pp. 343–373.
[3] J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J.-H. Jacobsen (2018) Invertible residual networks. arXiv preprint arXiv:1811.00995.
[4] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai (2013) Better mixing via deep representations. In International Conference on Machine Learning, pp. 552–560.
[5] A. Beskos, N. Pillai, G. Roberts, J.-M. Sanz-Serna, and A. Stuart (2013) Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli 19 (5A), pp. 1501–1534.
[6] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
[7] S. P. Brooks and A. Gelman (1998) General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7 (4), pp. 434–455.
[8] R. T. Q. Chen, J. Behrmann, D. Duvenaud, and J.-H. Jacobsen (2019) Residual flows: unbiased generative modeling with norm-learned i-ResNets.
[9] L. Dinh, D. Krueger, and Y. Bengio (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
[10] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
[11] Y. Du and I. Mordatch (2019) Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689.
[12] C. Finn, P. Christiano, P. Abbeel, and S. Levine (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
[13] R. Gao, Y. Lu, J. Zhou, S.-C. Zhu, and Y. N. Wu (2018) Learning generative ConvNets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9155–9164.
[14] R. Gao, E. Nijkamp, D. P. Kingma, Z. Xu, A. M. Dai, and Y. N. Wu (2019) Flow contrastive estimation of energy-based models. arXiv preprint arXiv:1912.00589.
[15] A. Gelman and D. B. Rubin (1992) Inference from iterative simulation using multiple sequences. Statistical Science 7 (4), pp. 457–472.
[16] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[18] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud (2018) FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.
[19] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304.
[20] T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 1976–1984.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
[22] M. Hill, E. Nijkamp, and S.-C. Zhu (2019) Building a telescope to look into high-dimensional image spaces. Quarterly of Applied Mathematics 77 (2), pp. 269–321.
[23] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal (1995) The "wake-sleep" algorithm for unsupervised neural networks. Science 268 (5214), pp. 1158–1161.
[24] M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan (2019) NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704.
[25] L. Jin, J. Lazarow, and Z. Tu (2017) Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pp. 823–833.
[26] I. Khemakhem, D. P. Kingma, and A. Hyvärinen (2019) Variational autoencoders and nonlinear ICA: a unifying framework. arXiv preprint arXiv:1907.04809.
[27] T. Kim and Y. Bengio (2016) Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.
[28] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[29] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).
[30] D. P. Kingma and M. Welling (2014) Efficient gradient-based inference through transformations between Bayes nets and neural nets. In International Conference on Machine Learning, pp. 1782–1790.
[31] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
[32] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
[33] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[35] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma (2019a) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434.
[36] R. Kumar, A. Goyal, A. Courville, and Y. Bengio (2019b) Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508.
[37] P. Langevin (1908) On the theory of Brownian motion.
[38] J. Lazarow, L. Jin, and Z. Tu (2017) Introspective neural networks for generative modeling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2774–2783.
[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[40] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting Structured Data 1 (0).
[41] Y. LeCun, C. Cortes, and C. J. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.
[42] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
[43] O. Mangoubi and A. Smith (2017) Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114.
[44] R. M. Neal (2011) MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.
[45] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning.
[46] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng (2011) Learning deep energy models. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1105–1112.
[47] E. Nijkamp, M. Hill, S.-C. Zhu, and Y. N. Wu (2019) On learning non-convergent short-run MCMC toward energy-based model. arXiv preprint arXiv:1904.09770.
[48] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941.
[49] D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
[50] R. Rosenfeld, S. F. Chen, and X. Zhu (2001) Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. Computer Speech & Language 15 (1), pp. 55–73.
[51] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[52] T. Tieleman (2008) Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pp. 1064–1071.
[53] D. Tran, K. Vafa, K. Agrawal, L. Dinh, and B. Poole (2019) Discrete flows: invertible generative models of discrete data. arXiv preprint arXiv:1905.10347.
[54] Z. Tu (2007) Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
[55] D. Vats and C. Knudson (2018) Revisiting the Gelman-Rubin diagnostic. arXiv preprint arXiv:1812.09384.
[56] B. Wang and Z. Ou (2018) Learning neural trans-dimensional random field language models with noise-contrastive estimation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6134–6138.
[57] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu (2016) A theory of generative ConvNet. In International Conference on Machine Learning, pp. 2635–2644.
[58] J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.
6 Appendix
6.1 Change of variable
Under the invertible transformation , let be the density of , and be the density of . Let be an infinitesimal neighborhood around , and let be an infinitesimal neighborhood around , so that maps to , and maps to . Then
(11) 
, and , where and are the volumes of and respectively. Thus we have
(12) 
where we ignore and terms. This is the meaning of
(13) 
where or is the determinant of the Jacobian of .
Equation (13) is a convenient starting point for deriving densities under change of variable.
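As a concrete illustration of equation (13), consider a one-dimensional affine map; this worked example is ours, not the paper's:

$$x = g(z) = \mu + \sigma z, \quad z \sim \mathcal{N}(0, 1), \quad \sigma > 0.$$

Then $z = g^{-1}(x) = (x - \mu)/\sigma$ and $|dz/dx| = 1/\sigma$, so

$$q(x) = q_0\!\left(\frac{x - \mu}{\sigma}\right) \cdot \frac{1}{\sigma} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

i.e., $x \sim \mathcal{N}(\mu, \sigma^2)$, as expected.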
6.2 Energybased correction and change of variable for generator model
The generator model is of the form $x = g(z) + \epsilon$, with $z \sim \mathcal{N}(0, I_d)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_D)$, where $D$ is the dimensionality of $x$ and $d < D$ is the dimensionality of the latent vector $z$. Unlike the flow-based model, the marginal distribution of $x$ involves an intractable integral.

We shall study the exponential tilting of the generator model using the simple equation (13) for change of variable. To that end, we let $\tilde{z} = (z, \epsilon)$, and let $\tilde{x} = (z, x)$. Then

(14) $\tilde{x} = (z, x) = (z, g(z) + \epsilon)$

is an invertible transformation of $\tilde{z}$ whose Jacobian has absolute determinant $1$. Let $q(\tilde{z}) = q(z)\, q(\epsilon)$ be the Gaussian white noise distribution of $\tilde{z}$ under the generator model. Let $q(\tilde{x})$ be the distribution of $\tilde{x}$ under the generator model. Consider the change of variable between $\tilde{z}$ and $\tilde{x}$. In parallel to equation (13), we have

(15) $q(\tilde{x}) = q(\tilde{z}) \left| \det\left( \frac{\partial \tilde{z}}{\partial \tilde{x}} \right) \right| = q(z)\, q(\epsilon).$

The marginal distribution $q(x) = \int q(\tilde{x})\, dz$, which is intractable.

Suppose we exponentially tilt $q(x)$ to

(16) $\pi(x) = \frac{1}{Z} \exp\big(f(x)\big)\, q(x).$

Again, this can be translated into the space of $\tilde{x}$, so that under $\pi$,

(17) $\pi(\tilde{x}) = \frac{1}{Z} \exp\big(f(x)\big)\, q(\tilde{x}).$

Combining equations (15), (16), and (17), we have

(18) $\pi(\tilde{z}) = \frac{1}{Z} \exp\big(f(g(z) + \epsilon)\big)\, q(\tilde{z}),$

that is, under the tilted model $\pi$,

(19) $(z, \epsilon) \sim \frac{1}{Z} \exp\big(f(g(z) + \epsilon)\big)\, q(z)\, q(\epsilon).$

For $z$, the marginal distribution $\pi(z) = \int \pi(z, \epsilon)\, d\epsilon$ cannot be obtained in closed form. In particular, for gradient-based sampling, we need to compute

(20) $\frac{\partial}{\partial z} \log \pi(z) = \frac{\partial}{\partial z} \log q(z) + \frac{\partial}{\partial z} \log \int \exp\big(f(g(z) + \epsilon)\big)\, q(\epsilon)\, d\epsilon$

(21) $= \frac{\partial}{\partial z} \log q(z) + \mathbb{E}_{\pi(\epsilon \mid z)}\left[ \frac{\partial}{\partial z} f\big(g(z) + \epsilon\big) \right]$

(22) $= -z + \mathbb{E}_{\pi(\epsilon \mid z)}\left[ \left( \frac{\partial g(z)}{\partial z} \right)^{\!\top} f'\big(g(z) + \epsilon\big) \right].$

That is, there is an inner loop for approximating the expectation over $\pi(\epsilon \mid z)$. This is less convenient than the flow-based model.
6.3 Model architectures
For the Glow model $q_\alpha(x)$, we follow the settings of [31], with the depth, number of levels, and width set per dataset as listed below.
For the EBM correction $f_\theta(x)$, we use the ConvNet structures in the tables below.
We use the following notation: conv($n$) denotes a convolutional operation with $n$ output feature maps and a bias term. As the activation function, we recruit the LipSwish non-linearity [8], a rescaled form of the Swish activation [48].
Specifically, we use the following hyper-parameters:

MNIST: For Glow, , , . For EBM, .

SVHN: For Glow, , , . For EBM, .

CelebA: For Glow, , , . For EBM, .

CIFAR10: For Glow, , , . For EBM, .
Energy-based Model $f_\theta$

Layers | In-Out Size | Stride
Input | – | –
conv(·), activation | – | 1
conv(·), activation | – | 2
conv(·), activation | – | 2
conv(·), activation | – | 2
conv(1) | – | 1

Energy-based Model $f_\theta$

Layers | In-Out Size | Stride
Input | – | –
conv(·), activation | – | 1
conv(·), activation | – | 2
conv(·), activation | – | 2
conv(·), activation | – | 2
conv(·), activation | – | 2
conv(1) | – | 1
6.4 Training
Data. The training images are resized and scaled. We use 60,000, 70,000, 30,000, and 50,000 observed examples for MNIST, SVHN, CelebA, and CIFAR-10, respectively.
Optimization. The network parameters are initialized with the Xavier scheme [16] and optimized using Adam [29]. For NT-EBM, we use dataset-specific learning rates for MNIST, SVHN, CelebA, and CIFAR-10, with a fixed batch size. For NCE-EBM, we likewise use dataset-specific learning rates for MNIST, SVHN, and CelebA, with a fixed batch size. Training is capped at a fixed maximum number of parameter updates for each method.
HMC. We run Hamiltonian Monte Carlo (HMC) with persistent chains [52] initialized from $q_0(z)$, with a fixed number of MCMC steps and leapfrog-integrator steps per update of the parameters $\theta$ of $f_\theta$. The discretization step size follows a simple adaptive policy that multiplicatively increases or decreases the step size of the inner kernel based on the value of the Metropolis-Hastings acceptance rate [2]. The target acceptance rate is set as recommended by [5]. Figure 8 depicts the MH acceptance rate and the adaptive step size over time.
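The following is a minimal sketch of such a multiplicative policy; the target rate and growth factor are illustrative placeholders, not the paper's exact values (0.65 is the asymptotically optimal HMC acceptance rate suggested by [5], used here as an assumption).

```python
# Sketch of a simple adaptive HMC step-size policy: multiplicatively grow
# the step size when the Metropolis-Hastings acceptance rate exceeds the
# target, and shrink it otherwise.
def adapt_step_size(step: float, accept_rate: float,
                    target: float = 0.65, factor: float = 1.02) -> float:
    return step * factor if accept_rate > target else step / factor
```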
6.5 Synthesis
Figure 9 depicts samples from the pre-trained flow $q_\alpha(x)$ and samples from $\pi_\theta(x)$ learned by neural transport MCMC on the CIFAR-10 dataset.
6.6 Noise contrastive estimation
For numerical stability, the noise contrastive estimation objective is rewritten in the equivalent form using $\log \sigma(u) = -\log(1 + e^{-u})$ and $\log(1 - \sigma(u)) = -\log(1 + e^{u})$.
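Concretely, with the logit $u = f_\theta(x) + c$ and assuming equal proportions of positive and negative examples (so that the class prior is absorbed into $c$), the objective of (10) can be written in the numerically stable form

$$V(\theta, c) = \mathbb{E}_{p_{\rm data}}\big[\log \sigma\big(f_\theta(x) + c\big)\big] + \mathbb{E}_{q_\alpha}\big[\log \sigma\big({-(f_\theta(\tilde{x}) + c)}\big)\big],$$

where $\log \sigma(u) = -\log(1 + e^{-u})$ is evaluated with a stable softplus, avoiding overflow for large $|u|$. This is a standard rewriting; the exact form used in training is otherwise as described above.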
Figure 10 depicts samples from $q_\alpha(x)$ (left) and samples from $\pi_\theta(x)$ learned by our NCE algorithm, with sampling performed by Hamiltonian neural transport (right), on the CelebA dataset.
We learn $\pi_\theta(x)$ with our NCE-EBM algorithm on the SVHN [45] dataset and investigate the possibility of sampling from the learned model by Hamiltonian neural transport. Suppose a Markov chain is run for a large number of steps with Hamiltonian neural transport. We then pull the Markov chains back into the data space and visualize the long-run trajectory in Figure 11. Note that the long-run Markov chains synthesize realistic images with high diversity.
6.7 Sampling in data space
In Section 4.1, we analyzed the quality of mixing of multiple Markov chains based on the Gelman-Rubin statistic $\hat{R}$. Recall that $\hat{R} < 1.2$ is considered approximate convergence [7]. We concluded that sampling with Hamiltonian neural transport exhibits a strong indication of mixing. In Figure 12, we contrast this result with HMC sampling in the data space, which exhibits unfavorable mixing diagnostics.