# On Learning Non-Convergent Short-Run MCMC Toward Energy-Based Model

This paper studies a curious phenomenon in learning an energy-based model (EBM) using MCMC. In each learning iteration, we generate synthesized examples by running a non-convergent, non-mixing, and non-persistent short-run MCMC toward the current model, always starting from the same initial distribution, such as the uniform noise distribution, and always running a fixed number of MCMC steps. After generating synthesized examples, we update the model parameters according to the maximum likelihood learning gradient, as if the synthesized examples were fair samples from the current model. We treat this non-convergent short-run MCMC as a learned generator model or a flow model, with the initial image serving as the latent variables, and discard the learned EBM. We provide arguments for treating the learned non-convergent short-run MCMC as a valid model. We show that the learned short-run MCMC is capable of generating realistic images. Moreover, unlike traditional EBM or MCMC, the learned short-run MCMC is also capable of reconstructing observed images and interpolating between different images, like a generator model or a flow model. The code can be found in the Appendix.


## 1 Introduction

### 1.1 Learning Energy-Based Model by MCMC Sampling

The maximum likelihood learning of the energy-based model (EBM) lecun2006tutorial ; zhu1998frame ; hinton2006unsupervised ; salakhutdinov2009deep ; lee2009convolutional ; Ng2011 ; Dai2015ICLR ; LuZhuWu2016 ; xie2016convnet ; zhao2016energy ; kim2016deep ; dai2017calibrating ; coopnets_pami , or the Gibbs distribution, follows what Grenander grenander2007pattern called the “analysis by synthesis” scheme. Within each learning iteration, we generate synthesized examples by sampling from the current model, and then update the model parameters based on the difference between the synthesized examples and the observed examples, so that eventually the synthesized examples match the observed examples in terms of some statistical properties defined by the model. To sample from the current EBM, we need to use Markov chain Monte Carlo (MCMC), such as the Gibbs sampler gibbs , Langevin dynamics, or Hamiltonian Monte Carlo neal2011mcmc .

Recent work that parametrizes the energy function by modern convolutional neural networks (ConvNets) lecun1998gradient ; krizhevsky2012imagenet suggests that the “analysis by synthesis” process can indeed generate highly realistic images. For instance, xie2016convnet initializes persistent MCMC from zero or noise images, and within each learning iteration, runs a finite-step MCMC starting from the synthesized examples generated in the previous learning iteration. The resulting learning and sampling process can generate realistic images. The convergence of such a persistent algorithm to the maximum likelihood estimate (MLE) has been studied by younes1991maximum as an MCMC version of stochastic approximation robbins1951stochastic . Alternatively, gao2018learning devises a non-persistent multi-grid short-run MCMC that is always initialized from the histogram extracted from versions of the observed images. Such a finite budget sampling scheme can be conveniently scaled up to big datasets.

Although the “analysis by synthesis” learning scheme is intuitively appealing, the sampling or synthesis step can be very challenging. The convergence of MCMC can be extremely slow or impractical, especially if the energy function is multi-modal, which is typically the case if the EBM is to approximate the complex data distribution, such as that of natural images. For such EBM, the MCMC usually does not mix, i.e., MCMC chains from different starting points tend to get trapped in different local modes instead of traversing modes and mixing with each other.

### 1.2 Short-Run MCMC as Generator or Flow Model

In this paper, we investigate a learning scheme that is apparently wrong with no hope of learning a valid model. Within each learning iteration, we run a non-convergent, non-mixing and non-persistent short-run MCMC, such as a small, fixed number of steps of Langevin dynamics, toward the current EBM. Here, we always initialize the non-persistent short-run MCMC from the same distribution, such as the uniform noise distribution, and we always run the same number of MCMC steps. We then update the model parameters as usual, as if the synthesized examples generated by the non-convergent and non-persistent noise-initialized short-run MCMC were fair samples generated from the current EBM. We show that, after the convergence of such a learning algorithm, the resulting noise-initialized short-run MCMC can generate realistic images; see Figures 1 and 2.

The short-run MCMC is not a valid sampler of the EBM because it is short-run. As a result, the learned EBM cannot be a valid model because it is learned based on a wrong sampler. Thus we learn a wrong sampler of a wrong model. However, the short-run MCMC can indeed generate realistic images. What is going on?

The goal of this paper is to understand the learned short-run MCMC. We provide arguments that it is a valid model for the data in terms of matching the statistical properties of the data distribution. We also show that the learned short-run MCMC can be used as a generative model, such as a generator model goodfellow2014generative ; kingma2013auto or a flow model dinh2014nice ; dinh2016density ; kingma2018glow , with the Langevin dynamics serving as a noise-injected residual network, with the initial image serving as the latent variables, and with the initial uniform noise distribution serving as the prior distribution of the latent variables. We show that unlike traditional EBM and MCMC, the learned short-run MCMC is capable of reconstructing the observed images and interpolating between different images, just as a generator or a flow model can. See Figures 3 and 4. This is very unconventional for EBM or MCMC, and it is due to the fact that the MCMC is non-convergent, non-mixing and non-persistent. In fact, our argument applies to the situation where the short-run MCMC does not need to have the EBM as the stationary distribution.

While the learned short-run MCMC can be used for synthesis, the above learning scheme can be generalized to tasks such as image inpainting, super-resolution, style transfer, or inverse optimal control ziebart2008maximum ; abbeel2004apprenticeship etc., using informative initial distributions and conditional energy functions.

## 2 Contributions and Related Work

This paper constitutes a conceptual shift, where we shift attention from learning EBM with unrealistic convergent MCMC to the non-convergent short-run MCMC. This is a break away from the long tradition of both EBM and MCMC. We provide theoretical and empirical evidence that the learned short-run MCMC is a valid generator or flow model. This conceptual shift frees us from the convergence issue of MCMC, and makes the short-run MCMC a reliable and efficient technology.

More generally, we shift the focus from energy-based model to energy-based dynamics. This appears to be consistent with the common practice of computational neuroscience krotov2019unsupervised , where researchers often directly start from the dynamics, such as attractor dynamics hopfield1982neural ; amit1989world ; poucet2005attractors whose express goal is to be trapped in a local mode. It is our hope that our work may help to understand the learning of such dynamics. We leave it to future work.

For short-run MCMC, contrastive divergence (CD) hinton is the most prominent framework for theoretical underpinning. The difference between CD and our study is that in our study, the short-run MCMC is initialized from noise, while CD initializes from observed images. CD has been generalized to persistent CD pcd , and has more recently been generalized to modified CD gao2018learning and adversarial CD kim2016deep ; dai2017calibrating ; han2018divergence . However, in all those CD-based frameworks, the goal is still to learn the EBM, whereas in our framework, we discard the learned EBM, and only keep the learned short-run MCMC.

Recently in xie2018learning ; xie2017synthesizing , the maximum likelihood learning algorithm has been understood as an adversarial scheme or herding welling2009herding . The focus there is on the EBM instead of the short-run MCMC, which is the target of our study.

Generalizing tu2007learning , TuNIPS ; TuCVPR17 ; TuCVPR18 developed an introspective learning method where the EBM is discriminatively learned, and the EBM is both a generative model and a discriminative model. TuNIPS ; TuCVPR17 ; TuCVPR18 used residual networks to parametrize the EBM. Recent work du_ebm scales the learning to large residual networks.

Unlike gao2018learning that uses non-persistent MCMC, past work on learning EBM tends to involve persistent MCMC LuZhuWu2016 ; xie2016convnet , with the hope that the persistent chains may lead to convergence to the MLE in parameter estimate, as well as convergence to the corresponding EBM in MCMC sampling. Such a hope may be unrealistic due to the highly multi-modal nature of the EBM. Compared to persistent MCMC, the non-persistent MCMC in our method is much more efficient and convenient. See the recent work nijkamp2019anatomy on a thorough diagnosis of various persistent and non-persistent, as well as convergent and non-convergent implementations of MCMC for learning EBM.

A separate generator model goodfellow2014generative ; radford2015unsupervised ; kingma2013auto ; RezendeICML2014 ; MnihGregor2014 can be recruited and learned jointly with an EBM kim2016deep ; dai2017calibrating ; coopnets_pami , where the generator model serves as an approximate sampler of the EBM. In our work, we do not recruit a separate sampler model. Instead we treat the learned short-run MCMC as the generator model and it shares the same set of parameters as the EBM. Meanwhile, we believe our theoretical understanding can also be applied to learning generator model jointly with EBM.

The variational walkback method goyal2017variational is an energy-free method that can directly learn an MCMC sampling process.

Our theoretical understanding of non-convergent MCMC is based on the generalized moment matching estimator. It is related to moment matching GAN li2017mmd ; however, we do not learn a separate generator model adversarially.

## 3 Non-Convergent Short-Run MCMC as Generator Model

### 3.1 Maximum Likelihood Learning of EBM

Let $x$ be the signal, such as an image. The energy-based model (EBM) is a Gibbs distribution

$$p_\theta(x) = \frac{1}{Z(\theta)} \exp(f_\theta(x)), \tag{1}$$

where we assume $x$ is within a bounded range. $f_\theta(x)$ is the negative energy and is parametrized by a bottom-up convolutional neural network (ConvNet) with weights $\theta$. $Z(\theta) = \int \exp(f_\theta(x))\,dx$ is the normalizing constant.

Suppose we observe training examples $x_i, i = 1, \dots, n \sim p_{\rm data}$, where $p_{\rm data}$ is the data distribution. For large $n$, the sample average over $\{x_i\}$ approximates the expectation with respect to $p_{\rm data}$. For notational convenience, we treat the sample average and the expectation as the same.

The log-likelihood is

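$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i) \doteq E_{p_{\rm data}}[\log p_\theta(x)]. \tag{2}$$

The derivative of the log-likelihood is

$$L'(\theta) = E_{p_{\rm data}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - E_{p_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \doteq \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} f_\theta(x_i) - \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} f_\theta(x_i^-), \tag{3}$$

where $x_i^- \sim p_\theta(x)$ for $i = 1, \dots, n$ are the generated examples from the current model $p_\theta$.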
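The second expectation in equation (3) comes from differentiating the log normalizing constant. The following short derivation is a sketch that assumes $Z(\theta) = \int \exp(f_\theta(x))\,dx$ as in equation (1) and that differentiation and integration can be exchanged:

```latex
\begin{aligned}
\frac{\partial}{\partial\theta} \log Z(\theta)
  &= \frac{1}{Z(\theta)} \int \exp(f_\theta(x))\, \frac{\partial}{\partial\theta} f_\theta(x)\, dx
   = E_{p_\theta}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right], \\
\frac{\partial}{\partial\theta} \log p_\theta(x)
  &= \frac{\partial}{\partial\theta} f_\theta(x)
   - E_{p_\theta}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right].
\end{aligned}
```

Taking the expectation of the second line under $p_{\rm data}$ gives the two terms of $L'(\theta)$.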

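The above equation leads to the “analysis by synthesis” learning algorithm. At iteration $t$, let $\theta_t$ be the current model parameters. We generate $x_i^- \sim p_{\theta_t}(x)$ for $i = 1, \dots, n$. Then we update $\theta_{t+1} = \theta_t + \eta_t L'(\theta_t)$, where $\eta_t$ is the learning rate.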

### 3.2 Short-Run MCMC

Generating synthesized examples requires MCMC, such as Langevin dynamics (or Hamiltonian Monte Carlo) neal2011mcmc , which iterates

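$$x_{\tau + \Delta\tau} = x_\tau + \frac{\Delta\tau}{2} f'_\theta(x_\tau) + \sqrt{\Delta\tau}\, U_\tau, \tag{4}$$

where $\tau$ indexes the time, $\Delta\tau$ is the discretization of time, and $U_\tau \sim {\rm N}(0, I)$ is the Gaussian noise term. $f'_\theta(x) = \partial f_\theta(x) / \partial x$ can be obtained by back-propagation. If $p_\theta$ is of low entropy or low temperature, the gradient term dominates the diffusion noise term, and the Langevin dynamics behaves like gradient descent on the energy $-f_\theta(x)$.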

If $p_\theta$ is multi-modal, then different chains tend to get trapped in different local modes, and they do not mix. Thus the convergence of the MCMC can be very slow, regardless of the initial distribution and the length of the Markov chain. This makes the maximum likelihood learning impractical. With the difficulty of the convergence of MCMC, even if we learn $p_\theta$ accurately, it may not be that useful because it is difficult to generate fair samples from $p_\theta$.

We propose to give up the sampling of $p_\theta$. Instead, we run a fixed number, e.g., $K$, steps of MCMC, toward $p_\theta$, starting from a fixed initial distribution, $p_0$, such as the uniform noise distribution. Let $M_\theta$ be the $K$-step MCMC transition kernel. Define

$$q_\theta(x) = (M_\theta p_0)(x) = \int p_0(z)\, M_\theta(x \mid z)\, dz, \tag{5}$$

which is the marginal distribution of the sample after running $K$-step MCMC from $p_0$.
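To make the definition of $q_\theta$ concrete, the following is a minimal sketch of drawing from a short-run Langevin sampler in one dimension. The double-well negative energy `f`, the `temperature`, the step size `dt`, and `K = 30` are illustrative assumptions and are not taken from the paper; only the update rule follows equation (4).

```python
import math
import random

def f_prime(x, temperature=0.05):
    """Gradient of the toy negative energy f(x) = -(x^2 - 1)^2 / (4 * temperature)."""
    return -x * (x * x - 1.0) / temperature

def short_run_langevin(z, K=30, dt=0.01, noise=True, rng=random):
    """Run K steps of the Langevin update in equation (4), starting from z."""
    x = z
    for _ in range(K):
        u = rng.gauss(0.0, 1.0) if noise else 0.0
        x = x + 0.5 * dt * f_prime(x) + math.sqrt(dt) * u
    return x

rng = random.Random(0)
zs = [rng.uniform(-2.0, 2.0) for _ in range(1000)]   # z ~ p_0 (uniform)
xs = [short_run_langevin(z, rng=rng) for z in zs]     # x ~ q_theta

# Fraction of chains that end on the same side of the barrier as they started.
same_side = sum((z > 0) == (x > 0) for z, x in zip(zs, xs)) / len(zs)
print(f"fraction ending on the starting side: {same_side:.3f}")
```

In this toy setting most chains stay in the mode nearest to their starting point, which is the non-mixing behavior the text describes: the output $x$ remains strongly tied to the initial $z$.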

According to the second law of thermodynamics cover2012elements , $\mathrm{KL}(q_\theta | p_\theta)$ decreases monotonically as $K$ increases, where $\mathrm{KL}(q | p) = E_q[\log(q(x)/p(x))]$ denotes the Kullback-Leibler divergence from $q$ to $p$. Thus $q_\theta$ can be considered a variational approximation to $p_\theta$, and $q_\theta \to p_\theta$ as $K \to \infty$, in theory. However, for multi-modal $p_\theta$, the second largest eigenvalue of $M_\theta$ may still be quite close to 1 even if $K$ is large diaconis1991geometric , so that the chain does not mix and the convergence is practically impossible no matter what $K$ is. Thus in general $q_\theta \neq p_\theta$, because of the finite steps in the Markov transition $M_\theta$. But both $q_\theta$ and $p_\theta$ are defined by the same set of parameters $\theta$, except that $p_\theta$ is an explicit unnormalized density, whereas $q_\theta$ is a generative process. If $K$ is small, $q_\theta$ can be of higher temperature or higher entropy than $p_\theta$.
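The monotone decrease of the divergence, and its slow decay for a multi-modal target, can be checked on a small discrete chain. The sketch below is an illustration under assumptions: a hypothetical 7-state bimodal target and a nearest-neighbour Metropolis kernel stand in for $p_\theta$ and the Langevin transition.

```python
import math

def kl(q, p):
    """KL(q|p) = E_q[log(q/p)] for discrete distributions (0 log 0 := 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def metropolis_kernel(p):
    """Nearest-neighbour Metropolis transition matrix with stationary distribution p."""
    n = len(p)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                M[i][j] = 0.5 * min(1.0, p[j] / p[i])
        M[i][i] = 1.0 - sum(M[i])   # rejected or out-of-range proposals stay put
    return M

def push_forward(q, M):
    """One MCMC step applied to the marginal: (qM)(j) = sum_i q(i) M(i, j)."""
    n = len(q)
    return [sum(q[i] * M[i][j] for i in range(n)) for j in range(n)]

# Hypothetical bimodal target with a low-probability bottleneck in the middle.
weights = [1.0, 20.0, 1.0, 0.01, 1.0, 5.0, 1.0]
p = [w / sum(weights) for w in weights]
M = metropolis_kernel(p)

q = [1.0 / len(p)] * len(p)         # p_0: uniform initial distribution
kls = [kl(q, p)]
for _ in range(200):
    q = push_forward(q, M)
    kls.append(kl(q, p))

for K in (0, 1, 10, 50, 200):
    print(f"K={K:3d}  KL(q_K | p) = {kls[K]:.4f}")
```

The divergence never increases from one step to the next, but because probability mass has to cross the bottleneck state to rebalance the two modes, one would expect the later decrease to be very slow.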

In this paper, instead of learning $p_\theta$, we treat $q_\theta$ to be the target of learning. After learning, we keep $q_\theta$, but we discard $p_\theta$. That is, the sole purpose of $p_\theta$ is to guide a $K$-step MCMC from $p_0$.

In fact, we do not even require $p_\theta$ to be the stationary or steady state distribution of the Markov transition kernel $M_\theta$. For instance, in the above Langevin dynamics, we can disable the noise term or change its variance. We can also tune the step size $\Delta\tau$ without worrying about the discretization error.

A common choice of $p_0$ is the uniform noise distribution, although other choices are also allowed. For tasks like super-resolution, inpainting, style transfer, etc., we may choose a more informative $p_0$ and conditional versions of the energy $f_\theta$.

### 3.3 Learning Short-Run MCMC

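The learning algorithm is as follows. Initialize $\theta_0$. At learning iteration $t$, let $\theta_t$ be the model parameters. We generate $x_i^- \sim q_{\theta_t}(x)$ for $i = 1, \dots, m$. Then we update $\theta_{t+1} = \theta_t + \eta_t \Delta(\theta_t)$, where

$$\Delta(\theta) = E_{p_{\rm data}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - E_{q_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \approx \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial\theta} f_\theta(x_i) - \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial\theta} f_\theta(x_i^-). \tag{6}$$

We assume that the algorithm converges so that $\Delta(\theta_t) \to 0$. At convergence, the resulting $\theta$ solves the estimating equation $\Delta(\theta) = 0$.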

To further improve training, we smooth $p_{\rm data}$ by convolution with a Gaussian white noise distribution, i.e., injecting additive noises $\epsilon_i \sim {\rm N}(0, \sigma^2 I)$ to observed examples $x_i \leftarrow x_i + \epsilon_i$ sonderby2016amortised ; roth2017stabilizing . This makes it easy for $\Delta(\theta_t)$ to converge to 0, especially if the number of MCMC steps, $K$, is small, so that the estimating equation $\Delta(\theta) = 0$ may not have a solution without smoothing $p_{\rm data}$.

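According to the contrastive divergence formulation hinton ; gao2018learning , the above learning algorithm approximately minimizes the contrastive divergence $\mathrm{KL}(p_{\rm data} | p_\theta) - \mathrm{KL}(q_\theta | p_\theta)$. More precisely, if the sampler is held fixed at $q_{\theta_t}$, then the gradient of $\mathrm{KL}(p_{\rm data} | p_\theta) - \mathrm{KL}(q_{\theta_t} | p_\theta)$ with respect to $\theta$, evaluated at $\theta = \theta_t$, equals $-\Delta(\theta_t)$. In the above contrastive divergence, the minimization of the first divergence leads to MLE. The second divergence measures the non-convergence, i.e., the gap between the short-run MCMC $q_\theta$ and the stationary distribution $p_\theta$. However, in CD, the target of learning is $p_\theta$, whereas we care about $q_\theta$.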
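The link between the update direction $\Delta(\theta)$ and the difference of two Kullback-Leibler divergences can be spelled out as follows. This is a sketch under two assumptions: the sampler distribution $q_{\theta_t}$ is held fixed while differentiating, and $\partial \log Z(\theta) / \partial\theta = E_{p_\theta}[\partial f_\theta(x) / \partial\theta]$.

```latex
\begin{aligned}
\frac{\partial}{\partial\theta}\, \mathrm{KL}(p_{\rm data} \,|\, p_\theta)
  &= -E_{p_{\rm data}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]
     + E_{p_\theta}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right], \\
\frac{\partial}{\partial\theta}\, \mathrm{KL}(q_{\theta_t} \,|\, p_\theta)
  &= -E_{q_{\theta_t}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]
     + E_{p_\theta}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right], \\
\frac{\partial}{\partial\theta}\Big[\mathrm{KL}(p_{\rm data} \,|\, p_\theta)
     - \mathrm{KL}(q_{\theta_t} \,|\, p_\theta)\Big]_{\theta = \theta_t}
  &= -E_{p_{\rm data}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]
     + E_{q_{\theta_t}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]
   = -\Delta(\theta_t).
\end{aligned}
```

The intractable expectation under $p_\theta$ cancels, which is why the update only needs samples from the data and from the short-run MCMC.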

The learning procedure in Algorithm 1 is simple. The code can be found in Appendix 7.3.

The key to the algorithm is that the generated $\{x_i^-\}$ are independent and fair samples from the model $q_\theta$. This is much simpler than the algorithms that involve persistent chains, where the generated samples are neither independent nor fair samples from the EBM. The theoretical framework for understanding the above algorithm is Robbins-Monro’s stochastic approximation robbins1951stochastic , which solves an equation in expectation based on independent random samples. This is exactly what our method seeks to accomplish: solving the estimating equation in terms of expectation under the model, based on independent fair samples from the model.
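The learning loop can be sketched on a one-dimensional toy problem where everything is checkable by hand. The choices below are assumptions made for illustration, not the paper's setup: the negative energy is $f_\theta(x) = -(x - \theta)^2/2$, the short-run dynamics is noise-free, and `K`, `step`, `m`, and the learning-rate schedule are arbitrary.

```python
import random

rng = random.Random(0)
K, step = 5, 0.1     # short-run length and gradient step size (toy values)
m = 64               # batch size

def grad_theta_f(x, theta):
    """d/d theta of the toy negative energy f_theta(x) = -(x - theta)^2 / 2."""
    return x - theta

def short_run(z, theta):
    """K noise-free Langevin (gradient) steps on f_theta, starting from z."""
    x = z
    for _ in range(K):
        x += step * (theta - x)          # f'_theta(x) = theta - x
    return x

data = [rng.gauss(1.0, 0.3) for _ in range(2000)]   # stand-in for p_data
data_mean = sum(data) / len(data)

theta = 0.0
for t in range(5000):
    batch = [rng.choice(data) for _ in range(m)]
    synth = [short_run(rng.uniform(-1.0, 1.0), theta) for _ in range(m)]
    delta = (sum(grad_theta_f(x, theta) for x in batch) / m
             - sum(grad_theta_f(x, theta) for x in synth) / m)
    theta += 5.0 / (t + 50) * delta      # decaying Robbins-Monro learning rate

samples = [short_run(rng.uniform(-1.0, 1.0), theta) for _ in range(20000)]
q_mean = sum(samples) / len(samples)
print(f"data mean        = {data_mean:.3f}")
print(f"short-run mean   = {q_mean:.3f}")
print(f"EBM mode (theta) = {theta:.3f}")
```

For this toy model the short-run samples have mean $(1 - r)\theta$ with $r = (1 - \text{step})^K$, so the estimating equation is solved at $\theta = \bar{x}/(1 - r)$. The learned sampler then matches the data mean while the EBM itself, centered at $\theta$, does not.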

### 3.4 Generator or Flow Model for Interpolation and Reconstruction

We may consider $q_\theta(x)$ to be a generative model,

$$z \sim p_0(z); \quad x = M_\theta(z, u), \tag{7}$$

where $u$ denotes all the randomness in the short-run MCMC. For the $K$-step Langevin dynamics, $M_\theta$ can be considered a $K$-layer noise-injected residual network. $z$ can be considered latent variables, and $p_0$ the prior distribution of $z$. Due to the non-convergence and non-mixing, $x$ can be highly dependent on $z$, and $z$ can be inferred from $x$. This is different from the convergent MCMC, where $x$ is independent of $z$. When the learning algorithm converges, the learned EBM tends to have low entropy and the Langevin dynamics behaves like gradient descent, where the noise terms are disabled, i.e., $u = 0$. In that case, we simply write $x = M_\theta(z)$.

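We can perform interpolation as follows. Generate $z_1$ and $z_2$ from $p_0(z)$. Let $z_\rho = \rho z_1 + \sqrt{1 - \rho^2}\, z_2$. This interpolation keeps the marginal variance of $z_\rho$ fixed. Let $x_\rho = M_\theta(z_\rho)$. Then $x_\rho$ is the interpolation of $x_1 = M_\theta(z_1)$ and $x_2 = M_\theta(z_2)$. Figure 8 displays $x_\rho$ for a sequence of $\rho \in [0, 1]$.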
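The variance-preserving interpolation of the latent variables can be sketched as below. The helper names `interpolate` and `variance` and the dimension `dim` are hypothetical; the formula is $z_\rho = \rho z_1 + \sqrt{1 - \rho^2}\, z_2$ with zero-mean uniform noise on $[-1, 1]$ as the assumed $p_0$.

```python
import math
import random

def interpolate(z1, z2, rho):
    """z_rho = rho * z1 + sqrt(1 - rho^2) * z2, applied elementwise."""
    c = math.sqrt(1.0 - rho * rho)
    return [rho * a + c * b for a, b in zip(z1, z2)]

def variance(v):
    """Empirical variance of the entries of v."""
    mean = sum(v) / len(v)
    return sum((a - mean) ** 2 for a in v) / len(v)

rng = random.Random(0)
dim = 50_000
z1 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
z2 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

for rho in (0.0, 0.3, 0.6, 0.9, 1.0):
    z_rho = interpolate(z1, z2, rho)
    print(f"rho={rho:.1f}  empirical variance={variance(z_rho):.4f}")
```

The endpoints recover $z_2$ (at $\rho = 0$) and $z_1$ (at $\rho = 1$), and the per-coordinate variance stays near $1/3$ throughout. Note that with a uniform $p_0$ only the variance is preserved: intermediate $z_\rho$ can fall slightly outside $[-1, 1]$ and is not exactly uniform.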

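For an observed image $x$, we can reconstruct $x$ by running gradient descent on the least squares loss function $L(z) = \|x - M_\theta(z)\|^2$, initializing from $z_0 \sim p_0(z)$, and iterating $z_{t+1} = z_t - \eta_t L'(z_t)$. Figure 8 displays the sequence of $x_t = M_\theta(z_t)$.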
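Reconstruction by gradient descent on the latent variables can be sketched with a toy deterministic short-run map. Everything here is an assumption for illustration: the map is $K$ noise-free gradient steps on $f(x) = -(x - \theta)^2/2$, for which $dM/dz$ is the constant `R`, so the gradient of the loss is available in closed form. With a ConvNet energy one would instead back-propagate through the $K$ steps.

```python
THETA, STEP, K = 2.0, 0.1, 5      # toy model parameter, step size, number of steps
R = (1.0 - STEP) ** K             # contraction factor of K gradient steps

def short_run(z):
    """Noise-free K-step dynamics on f(x) = -(x - THETA)^2 / 2, elementwise."""
    out = []
    for x in z:
        for _ in range(K):
            x += STEP * (THETA - x)
        out.append(x)
    return out

def loss_and_grad(x_obs, z):
    """L(z) = ||x_obs - M(z)||^2 and its gradient; dM/dz = R for this toy map."""
    residual = [xo - xm for xo, xm in zip(x_obs, short_run(z))]
    loss = sum(r * r for r in residual)
    grad = [-2.0 * R * r for r in residual]
    return loss, grad

x_obs = [0.3, -0.5, 1.2]          # hypothetical "observed image" with three pixels
z = [0.0, 0.0, 0.0]               # z_0 (fixed here; the text draws z_0 from p_0)
eta = 0.5                         # constant learning rate
losses = []
for t in range(100):
    loss, grad = loss_and_grad(x_obs, z)
    losses.append(loss)
    z = [zi - eta * gi for zi, gi in zip(z, grad)]

x_rec = short_run(z)
print("reconstruction:", [round(v, 4) for v in x_rec])
```

Because the toy map is invertible, the loss can be driven to zero. For a learned ConvNet energy the map is nonlinear and the reconstruction is only approximate.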

In general, $z \sim p_0(z)$; $x = M_\theta(z, u)$ defines an energy-based dynamics. $K$ does not need to be fixed. It can be a stopping time that depends on the past history of the dynamics. The dynamics can be made deterministic by setting $u = 0$. This includes the attractor dynamics popular in computational neuroscience hopfield1982neural ; amit1989world ; poucet2005attractors .

## 4 Understanding the Learned Short-Run MCMC

### 4.1 Exponential Family and Moment Matching Estimator

An early version of EBM is the FRAME (Filters, Random field, And Maximum Entropy) model zhu1998frame ; wu2000equivalence ; zhu1997GRADE , which is an exponential family model, where the features are the responses from a bank of filters. The deep FRAME model LuZhuWu2016 replaces the linear filters by the pre-trained ConvNet filters. This amounts to only learning the top layer weight parameters of the ConvNet. Specifically, $f_\theta(x) = \langle \theta, h(x) \rangle$, where $h(x)$ are the top-layer filter responses of a pre-trained ConvNet, and $\theta$ consists of the top-layer weight parameters. For such an $f_\theta(x)$, $\frac{\partial}{\partial\theta} f_\theta(x) = h(x)$. Then, the maximum likelihood estimator of $p_\theta$ is actually a moment matching estimator, i.e., $E_{p_{\hat\theta_{\rm MLE}}}[h(x)] = E_{p_{\rm data}}[h(x)]$. If we use the short-run MCMC learning algorithm, it will converge (assuming convergence is attainable) to a moment matching estimator, i.e., $E_{q_{\hat\theta_{\rm MME}}}[h(x)] = E_{p_{\rm data}}[h(x)]$. Thus, the learned model $q_{\hat\theta_{\rm MME}}(x)$ is a valid estimator in that it matches the data distribution in terms of sufficient statistics defined by the EBM.

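Consider two families of distributions: $\Omega = \{p : E_p[h(x)] = E_{p_{\rm data}}[h(x)]\}$, and $\Theta = \{p_\theta(x) = \exp(\langle \theta, h(x) \rangle)/Z(\theta), \forall \theta\}$. They are illustrated by two curves in Figure 5. $\Omega$ contains all the distributions that match the data distribution in terms of $E[h(x)]$. Both $p_{\hat\theta_{\rm MLE}}$ and $q_{\hat\theta_{\rm MME}}$ belong to $\Omega$, and of course $p_{\rm data}$ also belongs to $\Omega$. $\Theta$ contains all the EBMs with different values of the parameter $\theta$. The uniform distribution $p_0$ corresponds to $\theta = 0$, thus $p_0$ belongs to $\Theta$.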

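The EBM under $\hat\theta_{\rm MME}$, i.e., $p_{\hat\theta_{\rm MME}}$, does not belong to $\Omega$, and it may be quite far from $p_{\hat\theta_{\rm MLE}}$. In general, $E_{p_{\hat\theta_{\rm MME}}}[h(x)] \neq E_{p_{\rm data}}[h(x)]$, that is, the corresponding EBM does not match the data distribution as far as $h(x)$ is concerned. It can be much further from the uniform $p_0$ than $p_{\hat\theta_{\rm MLE}}$ is from $p_0$, and thus $p_{\hat\theta_{\rm MME}}$ may have a much lower entropy than $p_{\hat\theta_{\rm MLE}}$.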

Figure 5 illustrates the above idea. The red dotted line illustrates MCMC. Starting from $p_0$, $K$-step MCMC leads to $q_{\hat\theta_{\rm MME}}(x)$. If we continue to run MCMC for infinite steps, we will get to $p_{\hat\theta_{\rm MME}}$. Thus the role of $p_{\hat\theta_{\rm MME}}$ is to serve as an unreachable target to guide the $K$-step MCMC, which stops midway at $q_{\hat\theta_{\rm MME}}$. One can say that the short-run MCMC is a wrong sampler of a wrong model, but it itself is a valid model because it belongs to $\Omega$.

The MLE $p_{\hat\theta_{\rm MLE}}$ is the projection of $p_{\rm data}$ onto $\Theta$. Thus it belongs to $\Theta$. It also belongs to $\Omega$, as can be seen from the maximum likelihood estimating equation. Thus it is the intersection of $\Omega$ and $\Theta$. Among all the distributions in $\Omega$, $p_{\hat\theta_{\rm MLE}}$ is the closest to $p_0$. Thus it has the maximum entropy among all the distributions in $\Omega$.

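The above duality between maximum likelihood and maximum entropy follows from the following fact. Let $\hat{p} \in \Theta \cap \Omega$ be the intersection between $\Theta$ and $\Omega$. $\Omega$ and $\Theta$ are orthogonal in terms of the Kullback-Leibler divergence. For any $p_\theta \in \Theta$ and for any $p \in \Omega$, we have the Pythagorean property della1995inducing : $\mathrm{KL}(p | p_\theta) = \mathrm{KL}(p | \hat{p}) + \mathrm{KL}(\hat{p} | p_\theta)$. See Appendix 7.1 for a proof. Thus (1) $\mathrm{KL}(p_{\rm data} | p_\theta) \geq \mathrm{KL}(p_{\rm data} | \hat{p})$, i.e., $\hat{p}$ is MLE within $\Theta$. (2) $\mathrm{KL}(p | p_0) \geq \mathrm{KL}(\hat{p} | p_0)$, i.e., $\hat{p}$ has maximum entropy within $\Omega$.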
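The Pythagorean property can be verified numerically on a small exponential family. The sketch below assumes a hypothetical finite sample space $\{0, \dots, 4\}$ with the single feature $h(x) = x$; the moment-matching member $\hat{p}$ is found by bisection, and the helper names are illustrative.

```python
import math

STATES = [0, 1, 2, 3, 4]          # toy finite sample space; h(x) = x

def ebm(theta):
    """Exponential family p_theta(x) = exp(theta * h(x)) / Z(theta) with h(x) = x."""
    w = [math.exp(theta * x) for x in STATES]
    z = sum(w)
    return [wi / z for wi in w]

def mean_h(p):
    """E_p[h(x)]."""
    return sum(pi * x for pi, x in zip(p, STATES))

def kl(q, p):
    """KL(q|p) = E_q[log(q/p)]."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def moment_match(target, lo=-20.0, hi=20.0):
    """Bisection for theta with E_{p_theta}[h] = target (the mean increases in theta)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_h(ebm(mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = [0.10, 0.40, 0.05, 0.25, 0.20]      # an arbitrary distribution, defining Omega
theta_hat = moment_match(mean_h(p))
p_hat = ebm(theta_hat)                   # the member of Theta that matches E_p[h]

for theta in (-0.7, 0.0, 1.3):           # arbitrary members of Theta
    lhs = kl(p, ebm(theta))
    rhs = kl(p, p_hat) + kl(p_hat, ebm(theta))
    print(f"theta={theta:+.1f}  KL(p|p_theta)={lhs:.6f}  KL(p|p_hat)+KL(p_hat|p_theta)={rhs:.6f}")
```

The two columns agree up to floating-point error. The case $\theta = 0$ corresponds to the uniform $p_0$ and shows that $\hat{p}$ is no further from $p_0$ than $p$ is, which is the maximum entropy statement.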

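We can understand the learned $q_{\hat\theta_{\rm MME}}$ from two Pythagorean results.

(1) Pythagorean for the right triangle formed by $q_{\hat\theta_{\rm MME}}$, $p_{\hat\theta_{\rm MLE}}$, and $p_0$,

$$\mathrm{KL}(q_{\hat\theta_{\rm MME}} | p_{\hat\theta_{\rm MLE}}) = \mathrm{KL}(q_{\hat\theta_{\rm MME}} | p_0) - \mathrm{KL}(p_{\hat\theta_{\rm MLE}} | p_0) = H(p_{\hat\theta_{\rm MLE}}) - H(q_{\hat\theta_{\rm MME}}), \tag{8}$$

where $H(p) = -E_p[\log p(x)]$ is the entropy of $p$. See Appendix 7.1. Thus we want the entropy of $q_{\hat\theta_{\rm MME}}$ to be high in order for it to be a good approximation to $p_{\hat\theta_{\rm MLE}}$. Thus for small $K$, it is important to let $p_0$ be the uniform distribution, which has the maximum entropy.

(2) Pythagorean for the right triangle formed by $p_{\hat\theta_{\rm MME}}$, $q_{\hat\theta_{\rm MME}}$, and $p_{\hat\theta_{\rm MLE}}$,

$$\mathrm{KL}(q_{\hat\theta_{\rm MME}} | p_{\hat\theta_{\rm MME}}) = \mathrm{KL}(q_{\hat\theta_{\rm MME}} | p_{\hat\theta_{\rm MLE}}) + \mathrm{KL}(p_{\hat\theta_{\rm MLE}} | p_{\hat\theta_{\rm MME}}). \tag{9}$$

For fixed $\theta$, as $K$ increases, $\mathrm{KL}(q_\theta | p_\theta)$ decreases monotonically cover2012elements . The smaller $\mathrm{KL}(q_{\hat\theta_{\rm MME}} | p_{\hat\theta_{\rm MME}})$ is, the smaller $\mathrm{KL}(q_{\hat\theta_{\rm MME}} | p_{\hat\theta_{\rm MLE}})$ and $\mathrm{KL}(p_{\hat\theta_{\rm MLE}} | p_{\hat\theta_{\rm MME}})$ are. Thus, it is desirable to use large $K$ as long as we can afford the computational cost, to make both $q_{\hat\theta_{\rm MME}}$ and $p_{\hat\theta_{\rm MME}}$ close to $p_{\hat\theta_{\rm MLE}}$.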

### 4.2 General ConvNet-EBM and Generalized Moment Matching Estimator

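For a general ConvNet $f_\theta(x)$, the learning algorithm based on short-run MCMC solves the following estimating equation: $E_{p_{\rm data}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] = E_{q_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]$, whose solution is $\hat\theta_{\rm MME}$, which can be considered a generalized moment matching estimator that in general solves the following estimating equation: $E_{p_{\rm data}}[h(x, \theta)] = E_{q_\theta}[h(x, \theta)]$, where we generalize $h(x)$ in the original moment matching estimator to $h(x, \theta)$ that involves both $x$ and $\theta$. For our learning algorithm, $h(x, \theta) = \frac{\partial}{\partial\theta} f_\theta(x)$. That is, the learned $q_{\hat\theta_{\rm MME}}$ is still a valid estimator in the sense of matching to the data distribution. The above estimating equation can be solved by Robbins-Monro’s stochastic approximation robbins1951stochastic , as long as we can generate independent fair samples from $q_\theta$.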

In classical statistics, we often assume that the model is correct, i.e., $p_{\rm data}$ corresponds to a $q_{\theta_{\rm true}}$ for some true value $\theta_{\rm true}$. In that case, the generalized moment matching estimator $\hat\theta_{\rm MME}$ follows an asymptotic normal distribution centered at the true value $\theta_{\rm true}$. The variance of $\hat\theta_{\rm MME}$ depends on the choice of $h(x, \theta)$. The variance is minimized by the choice $h(x, \theta) = \frac{\partial}{\partial\theta} \log q_\theta(x)$, which corresponds to the maximum likelihood estimate of $q_\theta$, and which leads to the Cramer-Rao lower bound and Fisher information. See Appendix 7.2 for a brief explanation.

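$h(x, \theta) = \frac{\partial}{\partial\theta} f_\theta(x)$ is not equal to $\frac{\partial}{\partial\theta} \log q_\theta(x)$. Thus the learning algorithm will not give us the maximum likelihood estimate of $q_\theta$. However, the validity of the learned $q_\theta$ does not require $h(x, \theta)$ to be $\frac{\partial}{\partial\theta} \log q_\theta(x)$. In practice, one can never assume that the model is true. As a result, the optimality of the maximum likelihood may not hold, and there is no compelling reason that we must use MLE.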

The relationship between , , , and may still be illustrated by Figure 5, although we need to modify the definition of as all the distributions that can be parametrized by , i.e., , so that is solved at . For instance, within , each may correspond to a different implementation of MCMC.

## 5 Experimental Results

In this section, we will demonstrate (1) realistic synthesis, (2) smooth interpolation, (3) faithful reconstruction of observed examples, (4) the influence of hyperparameters.

$K$ denotes the number of MCMC steps in equation (4). $n_f$ denotes the number of output feature maps in the first layer of $f_\theta$.

We emphasize the simplicity of the algorithm and models, see Appendix 7.3 and 7.4, respectively.

### 5.1 Fidelity

We evaluate the fidelity of generated examples on various datasets, each reduced to observed examples. Figure 8 depicts generated samples for various datasets with Langevin steps for both training and evaluation. For CIFAR-10 we set the number of features , whereas for CelebA and LSUN we use . We use iterations of model updates, then gradually decrease the learning rate and injected noise for observed examples. Table 1 (a) compares the Inception Score (IS) salimans2016improved ; barratt2018note and Fréchet Inception Distance (FID) heusel2017gans with Inception v2 classifier szegedy2016rethinking on generated examples. Despite its simplicity, short-run MCMC is competitive.

### 5.2 Interpolation

We demonstrate interpolation between generated examples. We follow the procedure outlined in Section 3.4. Let $x_\rho = M_\theta(z_\rho)$, where $M_\theta$ denotes the $K$-step gradient descent. Figure 8 illustrates $x_\rho$ for a sequence of $\rho \in [0, 1]$ on CelebA. The interpolation appears smooth and the intermediate samples resemble realistic faces. The interpolation experiment highlights that the short-run MCMC does not mix, which is in fact an advantage instead of a disadvantage. The interpolation ability goes far beyond the capacity of EBM and convergent MCMC.

### 5.3 Reconstruction

We demonstrate reconstruction of observed examples. For short-run MCMC, we follow the procedure outlined in Section 3.4. For an observed image $x$, we reconstruct $x$ by running gradient descent on the least squares loss function $L(z) = \|x - M_\theta(z)\|^2$, initializing from $z_0 \sim p_0(z)$, and iterating $z_{t+1} = z_t - \eta_t L'(z_t)$. For VAE, reconstruction is readily available. For GAN, we perform Langevin inference of latent variables HanLu2016 ; CoopNets2016 . Figure 8 depicts faithful reconstruction. Table 1 (b) illustrates competitive reconstructions in terms of MSE (per pixel) for observed leave-out examples. Again, the reconstruction ability of the short-run MCMC is due to the fact that it is not mixing.

### 5.4 Influence of Hyperparameters

MCMC Steps. Table 2 depicts the influence of varying the number of MCMC steps $K$ while training on synthesis and on the average magnitude of the gradient $\|\frac{\partial}{\partial x} f_\theta(x)\|_2$ over $K$-step Langevin (4). We observe: (1) the quality of synthesis decreases with decreasing $K$, and (2) the shorter the MCMC, the colder the learned EBM, and the more dominant the gradient descent part of the Langevin. With small $K$, short-run MCMC fails “gracefully” in terms of synthesis. A choice of $K = 100$, the value used in the code of Section 7.3, appears reasonable.

Injected Noise. To stabilize training, we smooth $p_{\rm data}$ by injecting additive noises $\epsilon_i \sim {\rm N}(0, \sigma^2 I)$ to observed examples $x_i \leftarrow x_i + \epsilon_i$. Table 3 (a) depicts the influence of $\sigma^2$ on the fidelity of negative examples in terms of IS and FID. That is, when lowering $\sigma^2$, the fidelity of the examples improves. Hence, it is desirable to pick the smallest $\sigma^2$ while maintaining the stability of training. Further, to improve synthesis, we may gradually decrease the learning rate and anneal $\sigma^2$ while training.

Model Complexity. We investigate the influence of the number of output feature maps $n_f$ on generated samples. Table 3 (b) summarizes the quality of synthesis in terms of IS and FID. As the number of features $n_f$ increases, so does the quality of the synthesis. Hence, the quality of synthesis may scale with $n_f$ until the computational means are exhausted.

## 6 Conclusion

This paper provides a new way to think about and utilize MCMC. It shifts the focus from the EBM and impractical convergent MCMC to the practical and efficient short-run MCMC guided by EBM. The short-run MCMC is non-convergent, non-mixing, and non-persistent. Thus it is as bad as an MCMC can possibly be. However, the vice of non-convergence and non-mixing actually becomes a virtue in that the short-run MCMC is capable of reconstruction and interpolation, which is beyond the capacity of EBM and convergent MCMC. The main goal of this paper is to advocate the short-run MCMC as a valid generative model.

Our theoretical understanding of the short-run MCMC is based on the moment matching estimator. In the case of exponential family models, the learned short-run MCMC matches to the data distribution in terms of expectations of sufficient statistics, and it can be considered an approximation to the MLE of the EBM. In the general case of ConvNet-EBM, the learned short-run MCMC is a generalized moment matching estimator.

Despite our focus on short-run MCMC, we do not advocate abandoning EBM altogether. On the contrary, we ultimately aim to learn valid EBM. Hopefully, the non-convergent short-run MCMC studied in this paper may be useful in this endeavor. It is also our hope that our work may help to understand the learning of attractor dynamics popular in neuroscience.

## Acknowledgments

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; and ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Prof. Stu Geman and Prof. Xianfeng (David) Gu for helpful discussions.

## 7 Appendix

### 7.1 Proof of Pythagorean Identity

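For $p \in \Omega$, let $E_p[h(x)] = E_{p_{\rm data}}[h(x)] = \hat{h}$.

$$\begin{aligned} \mathrm{KL}(p | p_\theta) &= E_p[\log p(x) - \langle \theta, h(x) \rangle + \log Z(\theta)] && (10) \\ &= -H(p) - \langle \theta, \hat{h} \rangle + \log Z(\theta), && (11) \end{aligned}$$

where $H(p) = -E_p[\log p(x)]$ is the entropy of $p$.

For $\hat{p} \in \Omega \cap \Theta$,

$$\mathrm{KL}(\hat{p} | p_\theta) = -H(\hat{p}) - \langle \theta, \hat{h} \rangle + \log Z(\theta). \tag{12}$$

$$\begin{aligned} \mathrm{KL}(p | \hat{p}) &= E_p[\log p(x)] - E_p[\log \hat{p}(x)] && (13) \\ &= E_p[\log p(x)] - E_{\hat{p}}[\log \hat{p}(x)] && (14) \\ &= -H(p) + H(\hat{p}). && (15) \end{aligned}$$

Thus $\mathrm{KL}(p | p_\theta) = \mathrm{KL}(p | \hat{p}) + \mathrm{KL}(\hat{p} | p_\theta)$.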

### 7.2 Estimating Equation and Cramer-Rao Theory

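For a model $q_\theta$, we can estimate $\theta$ by solving the estimating equation $E_{q_\theta}[h(x, \theta)] = \frac{1}{n}\sum_{i=1}^{n} h(x_i, \theta)$. Assume the solution exists and let it be $\hat\theta$. Assume there exists $\theta_{\rm true}$ so that $p_{\rm data} = q_{\theta_{\rm true}}$. Let $c(\theta) = E_{q_\theta}[h(x, \theta)]$. We can change $h(x, \theta) \leftarrow h(x, \theta) - c(\theta)$. Then $E_{q_\theta}[h(x, \theta)] = 0$ for all $\theta$, and the estimating equation becomes $\frac{1}{n}\sum_{i=1}^{n} h(x_i, \theta) = 0$. A Taylor expansion around $\theta_{\rm true}$ gives us the asymptotic linear equation $\frac{1}{n}\sum_{i=1}^{n} [h(x_i, \theta_{\rm true}) + h'(x_i, \theta_{\rm true})(\theta - \theta_{\rm true})] = 0$, where $h'(x, \theta) = \frac{\partial}{\partial\theta} h(x, \theta)$. Thus the estimate $\hat\theta = \theta_{\rm true} - \left[\frac{1}{n}\sum_{i=1}^{n} h'(x_i, \theta_{\rm true})\right]^{-1} \left[\frac{1}{n}\sum_{i=1}^{n} h(x_i, \theta_{\rm true})\right]$, i.e., one-step Newton-Raphson update from $\theta_{\rm true}$. Since $E_{q_\theta}[h(x, \theta)] = 0$ for any $\theta$, including $\theta_{\rm true}$, the estimator $\hat\theta$ is asymptotically unbiased. The Cramer-Rao theory establishes that $\hat\theta$ has an asymptotic normal distribution, $\sqrt{n}(\hat\theta - \theta_{\rm true}) \sim {\rm N}(0, V)$, where $V = E_{\theta_{\rm true}}[h'(x, \theta_{\rm true})]^{-1}\, {\rm Var}_{\theta_{\rm true}}[h(x, \theta_{\rm true})]\, E_{\theta_{\rm true}}[h'(x, \theta_{\rm true})]^{-\top}$. $V$ is minimized if we take $h(x, \theta) = \frac{\partial}{\partial\theta} \log q_\theta(x)$, which leads to the maximum likelihood estimating equation, and the corresponding $V = I(\theta_{\rm true})^{-1}$, where $I$ is the Fisher information.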

### 7.3 Code

    import torch as t, torch.nn as nn
    import torchvision as tv, torchvision.transforms as tr

    seed = 1
    im_sz = 32    # image height and width
    sigma = 3e-2  # std of noise added to observed images; decrease until training is unstable
    n_ch = 3      # image channels
    m = 8**2      # batch size
    K = 100       # number of Langevin steps
    n_f = 64      # feature maps in the first layer; increase until compute is exhausted
    n_i = 10**5   # number of learning iterations

    t.manual_seed(seed)
    if t.cuda.is_available():
        t.cuda.manual_seed_all(seed)
    device = t.device('cuda' if t.cuda.is_available() else 'cpu')

    class F(nn.Module):
        """ConvNet f_theta(x): maps a batch of images to one scalar (negative energy) each."""
        def __init__(self, n_c=n_ch, n_f=n_f, l=0.2):
            super(F, self).__init__()
            self.f = nn.Sequential(
                nn.Conv2d(n_c, n_f, 3, 1, 1),
                nn.LeakyReLU(l),
                nn.Conv2d(n_f, n_f * 2, 4, 2, 1),
                nn.LeakyReLU(l),
                nn.Conv2d(n_f * 2, n_f * 4, 4, 2, 1),
                nn.LeakyReLU(l),
                nn.Conv2d(n_f * 4, n_f * 8, 4, 2, 1),
                nn.LeakyReLU(l),
                nn.Conv2d(n_f * 8, 1, 4, 1, 0))

        def forward(self, x):
            return self.f(x).squeeze()

    f = F().to(device)

    # Load CIFAR-10 into memory, scaled to [-1, 1]; download=True fetches it if missing.
    transform = tr.Compose([tr.Resize(im_sz), tr.ToTensor(), tr.Normalize((.5, .5, .5), (.5, .5, .5))])
    p_d = t.stack([x[0] for x in tv.datasets.CIFAR10(root='data/cifar10', transform=transform, download=True)]).to(device)

    noise = lambda x: x + sigma * t.randn_like(x)

    def sample_p_d():
        """Draw m observed images and smooth them with additive Gaussian noise."""
        p_d_i = t.LongTensor(m).random_(0, p_d.shape[0])
        return noise(p_d[p_d_i]).detach()

    # Initial distribution p_0: uniform noise on [-1, 1].
    sample_p_0 = lambda: t.FloatTensor(m, n_ch, im_sz, im_sz).uniform_(-1, 1).to(device)

    def sample_q(K=K):
        """Short-run MCMC: K Langevin steps toward f, always starting from p_0."""
        x_k = sample_p_0().requires_grad_(True)
        for k in range(K):
            f_prime = t.autograd.grad(f(x_k).sum(), [x_k], retain_graph=True)[0]
            x_k.data += f_prime + 1e-2 * t.randn_like(x_k)
        return x_k.detach()

    sqrt = lambda x: int(t.sqrt(t.Tensor([x])))
    plot = lambda p, x: tv.utils.save_image(t.clamp(x, -1., 1.), p, normalize=True, nrow=sqrt(m))

    optim = t.optim.Adam(f.parameters(), lr=1e-4, betas=[.9, .999])

    for i in range(n_i):
        x_p_d, x_q = sample_p_d(), sample_q()
        # Maximizing L follows the update direction Delta(theta) in equation (6).
        L = f(x_p_d).mean() - f(x_q).mean()
        optim.zero_grad()
        (-L).backward()
        optim.step()

        if i % 100 == 0:
            print('{:>6d} f(x_p_d)={:>14.9f} f(x_q)={:>14.9f}'.format(i, f(x_p_d).mean(), f(x_q).mean()))
            plot('x_q_{:>06d}.png'.format(i), x_q)

### 7.4 Model Architecture

We use the following notation. Convolutional operation with $n$ output feature maps and bias term. Leaky-ReLU nonlinearity with default leaky factor 0.2, matching `l=0.2` in the code of Section 7.3. In that code we set $n_f = 64$.