1.1 Learning Energy-Based Model by MCMC Sampling
The maximum likelihood learning of the energy-based model (EBM) lecun2006tutorial ; zhu1998frame ; hinton2006unsupervised ; salakhutdinov2009deep ; lee2009convolutional ; Ng2011 ; Dai2015ICLR ; LuZhuWu2016 ; xie2016convnet ; zhao2016energy ; kim2016deep ; dai2017calibrating ; coopnets_pami , or the Gibbs distribution, follows what Grenander grenander2007pattern
called the “analysis by synthesis” scheme. Within each learning iteration, we generate synthesized examples by sampling from the current model, and then update the model parameters based on the difference between the synthesized examples and the observed examples, so that eventually the synthesized examples match the observed examples in terms of some statistical properties defined by the model. To sample from the current EBM, we need Markov chain Monte Carlo (MCMC), such as the Gibbs sampler gibbs , Langevin dynamics, or Hamiltonian Monte Carlo neal2011mcmc .
Recent work that parametrizes the energy function by modern convolutional neural networks (ConvNets) lecun1998gradient ; krizhevsky2012imagenet suggests that the “analysis by synthesis” process can indeed generate highly realistic images. For instance, xie2016convnet initializes persistent MCMC chains from zero or noise images, and within each learning iteration, runs a finite-step MCMC starting from the synthesized examples generated in the previous learning iteration. The resulting learning and sampling process can generate realistic images. The convergence of such a persistent algorithm to the maximum likelihood estimate (MLE) has been studied by younes1991maximum as an MCMC version of stochastic approximation robbins1951stochastic . Alternatively, gao2018learning devises a non-persistent multi-grid short-run MCMC that is always initialized from a histogram extracted from versions of the observed images. Such a finite-budget sampling scheme can be conveniently scaled up to big datasets.
Although the “analysis by synthesis” learning scheme is intuitively appealing, the sampling or synthesis step can be very challenging. The convergence of MCMC can be extremely slow or impractical, especially if the energy function is multi-modal, which is typically the case if the EBM is to approximate a complex data distribution, such as that of natural images. For such an EBM, the MCMC usually does not mix: MCMC chains from different starting points tend to get trapped in different local modes instead of traversing modes and mixing with each other.
1.2 Short-Run MCMC as Generator or Flow Model
In this paper, we investigate a learning scheme that is apparently wrong, with no hope of learning a valid model. Within each learning iteration, we run a non-convergent, non-mixing and non-persistent short-run MCMC, such as $K$ steps of Langevin dynamics, toward the current EBM. Here, we always initialize the non-persistent short-run MCMC from the same distribution, such as the uniform noise distribution, and we always run the same number of MCMC steps. We then update the model parameters as usual, as if the synthesized examples generated by the non-convergent and non-persistent noise-initialized short-run MCMC were fair samples from the current EBM. We show that, after the convergence of such a learning algorithm, the resulting noise-initialized short-run MCMC can generate realistic images; see Figures 1 and 2.
The short-run MCMC is not a valid sampler of the EBM because it is short-run. As a result, the learned EBM cannot be a valid model because it is learned based on a wrong sampler. Thus we learn a wrong sampler of a wrong model. However, the short-run MCMC can indeed generate realistic images. What is going on?
The goal of this paper is to understand the learned short-run MCMC. We provide arguments that it is a valid model for the data in terms of matching the statistical properties of the data distribution. We also show that the learned short-run MCMC can be used as a generative model, such as a generator model goodfellow2014generative ; kingma2013auto or a flow model dinh2014nice ; dinh2016density ; kingma2018glow , with the Langevin dynamics serving as a noise-injected residual network, the initial image serving as the latent variables, and the initial uniform noise distribution serving as the prior distribution of the latent variables. We show that, unlike traditional EBM and MCMC, the learned short-run MCMC is capable of reconstructing observed images and interpolating between different images, just like a generator or a flow model can do; see Figures 3 and 4. This is very unconventional for EBM or MCMC, and it is due to the fact that the MCMC is non-convergent, non-mixing and non-persistent. In fact, our argument applies even to the situation where the short-run MCMC does not have the EBM as its stationary distribution.
While the learned short-run MCMC can be used for synthesis, the above learning scheme can be generalized to tasks such as image inpainting, super-resolution, style transfer, or inverse optimal control ziebart2008maximum ; abbeel2004apprenticeship , using informative initial distributions and conditional energy functions.
2 Contributions and Related Work
This paper constitutes a conceptual shift, in which we shift attention from learning EBMs with unrealistic convergent MCMC to the non-convergent short-run MCMC. This is a break from the long tradition of both EBM and MCMC. We provide theoretical and empirical evidence that the learned short-run MCMC is a valid generator or flow model. This conceptual shift frees us from the convergence issue of MCMC, and makes the short-run MCMC a reliable and efficient technology.
More generally, we shift the focus from energy-based model to energy-based dynamics. This appears to be consistent with the common practice of computational neuroscience krotov2019unsupervised , where researchers often directly start from the dynamics, such as attractor dynamics hopfield1982neural ; amit1989world ; poucet2005attractors whose express goal is to be trapped in a local mode. It is our hope that our work may help to understand the learning of such dynamics. We leave it to future work.
For short-run MCMC, contrastive divergence (CD) hinton is the most prominent framework for theoretical underpinning. The difference between CD and our study is that in our study, the short-run MCMC is initialized from noise, while CD initializes it from observed images. CD has been generalized to persistent CD pcd , and has more recently been generalized to modified CD gao2018learning and adversarial CD kim2016deep ; dai2017calibrating ; han2018divergence . However, in all those CD-based frameworks, the goal is still to learn the EBM, whereas in our framework, we discard the learned EBM and only keep the learned short-run MCMC.
Recently in xie2018learning ; xie2017synthesizing , the maximum likelihood learning algorithm has been understood as an adversarial scheme or herding welling2009herding . The focus there is on the EBM instead of the short-run MCMC, which is the target of our study.
Generalizing tu2007learning , TuNIPS ; TuCVPR17 ; TuCVPR18 developed an introspective learning method where the EBM is discriminatively learned, and the EBM is both a generative model and a discriminative model. TuNIPS ; TuCVPR17 ; TuCVPR18 used residual networks to parametrize the EBM. Recent work du_ebm scales the learning to large residual networks.
Unlike gao2018learning that uses non-persistent MCMC, past work on learning EBM tends to involve persistent MCMC LuZhuWu2016 ; xie2016convnet , with the hope that the persistent chains may lead to convergence to the MLE in parameter estimate, as well as convergence to the corresponding EBM in MCMC sampling. Such a hope may be unrealistic due to the highly multi-modal nature of the EBM. Compared to persistent MCMC, the non-persistent MCMC in our method is much more efficient and convenient. See the recent work nijkamp2019anatomy on a thorough diagnosis of various persistent and non-persistent, as well as convergent and non-convergent implementations of MCMC for learning EBM.
A separate generator model goodfellow2014generative ; radford2015unsupervised ; kingma2013auto ; RezendeICML2014 ; MnihGregor2014 can be recruited and learned jointly with an EBM kim2016deep ; dai2017calibrating ; coopnets_pami , where the generator model serves as an approximate sampler of the EBM. In our work, we do not recruit a separate sampler model. Instead we treat the learned short-run MCMC as the generator model and it shares the same set of parameters as the EBM. Meanwhile, we believe our theoretical understanding can also be applied to learning generator model jointly with EBM.
The variational walkback method goyal2017variational is an energy-free method that can directly learn an MCMC sampling process.
3 Non-Convergent Short-Run MCMC as Generator Model
3.1 Maximum Likelihood Learning of EBM
Let $x$ be the signal, such as an image. The energy-based model (EBM) is a Gibbs distribution
$$p_\theta(x) = \frac{1}{Z(\theta)} \exp\big(f_\theta(x)\big),$$
where we assume $x$ is within a bounded range. $f_\theta(x)$ is the negative energy and is parametrized by a bottom-up convolutional neural network (ConvNet) with weights $\theta$. $Z(\theta) = \int \exp(f_\theta(x))\, dx$ is the normalizing constant.
Suppose we observe training examples $x_i \sim p_{\rm data}$ for $i = 1, \dots, n$, where $p_{\rm data}$ is the data distribution. For large $n$, the sample average over $\{x_i\}$ approximates the expectation with respect to $p_{\rm data}$. For notational convenience, we treat the sample average and the expectation as the same.
The log-likelihood is
$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \doteq \mathrm{E}_{p_{\rm data}}\big[\log p_\theta(x)\big].$$
The derivative of the log-likelihood is
$$L'(\theta) = \mathrm{E}_{p_{\rm data}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathrm{E}_{p_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \doteq \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} f_\theta(x_i) - \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial\theta} f_\theta(x_i^{-}),$$
where $x_i^{-} \sim p_\theta(x)$ for $i = 1, \dots, m$ are the generated examples from the current model $p_\theta$.
The above equation leads to the “analysis by synthesis” learning algorithm. At iteration $t$, let $\theta_t$ be the current model parameters. We generate $x_i^{-} \sim p_{\theta_t}(x)$ for $i = 1, \dots, m$. Then we update $\theta_{t+1} = \theta_t + \eta_t L'(\theta_t)$, where $\eta_t$ is the learning rate.
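As a toy illustration (ours, not the paper's code), the two-term gradient estimate can be written in a few lines of NumPy, assuming an exponential-family sufficient statistic $h$ and assuming synthesized examples from the current model are already available:

```python
import numpy as np

def grad_log_likelihood(h, x_obs, x_syn):
    """Monte Carlo estimate of L'(theta) = E_data[h(x)] - E_model[h(x)].

    h     : sufficient statistic (vectorized over samples)
    x_obs : observed examples, approximating the data expectation
    x_syn : synthesized examples from the current model
    """
    return h(x_obs).mean(axis=0) - h(x_syn).mean(axis=0)

# With h(x) = x: if the model already matches the data mean,
# the estimated gradient is close to zero.
rng = np.random.default_rng(0)
x_obs = rng.normal(1.0, 0.5, size=1000)
x_syn = rng.normal(1.0, 0.5, size=1000)
g = grad_log_likelihood(lambda x: x, x_obs, x_syn)
```

The hard part, of course, is producing `x_syn`, which is where MCMC enters.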
3.2 Short-Run MCMC
Generating synthesized examples requires MCMC, such as Langevin dynamics (or Hamiltonian Monte Carlo) neal2011mcmc , which iterates
$$x_{\tau + \Delta\tau} = x_\tau + \frac{\Delta\tau}{2} f'_\theta(x_\tau) + \sqrt{\Delta\tau}\, U_\tau,$$
where $\tau$ indexes the time, $\Delta\tau$ is the discretization of time, and $U_\tau \sim \mathrm{N}(0, I)$ is the Gaussian noise term. $f'_\theta(x) = \partial f_\theta(x)/\partial x$ can be obtained by back-propagation. If $p_\theta$ is of low entropy or low temperature, the gradient term dominates the diffusion noise term, and the Langevin dynamics behaves like gradient descent.
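The update above can be sketched as follows (a toy illustration of ours, using a 1-D Gaussian target whose gradient is analytic rather than a ConvNet energy with back-propagation):

```python
import numpy as np

def langevin(grad_f, x, n_steps, dt, rng):
    """Langevin dynamics: x <- x + (dt/2) * f'(x) + sqrt(dt) * noise."""
    for _ in range(n_steps):
        x = x + 0.5 * dt * grad_f(x) + np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Toy target: negative energy f(x) = -x^2 / 2, i.e. a standard normal,
# with analytic gradient f'(x) = -x.
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=5000)   # 5000 independent chains
x = langevin(lambda v: -v, x0, n_steps=500, dt=0.1, rng=rng)
# The chains approximately reach the stationary N(0, 1) distribution.
```

For this uni-modal toy target the chains do converge; the paper's point is that for multi-modal ConvNet energies they do not.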
If $p_\theta$ is multi-modal, then different chains tend to get trapped in different local modes, and they do not mix. Thus the convergence of MCMC can be very slow, regardless of the initial distribution and the length of the Markov chain. This makes maximum likelihood learning impractical. Given the difficulty of the convergence of MCMC, even if we learn $p_\theta$ accurately, it may not be that useful, because it is difficult to generate fair samples from $p_\theta$.
We propose to give up the sampling of $p_\theta$. Instead, we run a fixed number $K$ of MCMC steps toward $p_\theta$, starting from a fixed initial distribution $p_0$, such as the uniform noise distribution. Let $M_\theta$ be the $K$-step MCMC transition kernel. Define
$$q_\theta(x) = (M_\theta p_0)(x) = \int p_0(z)\, M_\theta(z, x)\, dz,$$
which is the marginal distribution of the sample $x$ after running the $K$-step MCMC from $p_0$.
According to the second law of thermodynamics cover2012elements , $D_{\rm KL}(q_\theta \| p_\theta)$ decreases monotonically as $K$ increases, where
$$D_{\rm KL}(p \| q) = \mathrm{E}_p\big[\log p(x) - \log q(x)\big]$$
denotes the Kullback-Leibler divergence from $p$ to $q$. Thus $q_\theta$ can be considered a variational approximation to $p_\theta$, and $q_\theta \to p_\theta$ as $K \to \infty$, in theory. However, for multi-modal $p_\theta$, the second largest eigenvalue of $M_\theta$ may still be quite close to 1 even if $K$ is large diaconis1991geometric , so that the chain does not mix and the convergence is practically impossible no matter what $p_0$ is. Thus in general $q_\theta \neq p_\theta$, because of the finite number of steps $K$ in the Markov transition $M_\theta$. But both $p_\theta$ and $q_\theta$ are defined by the same set of parameters $\theta$, except that $p_\theta$ is an explicit unnormalized density, whereas $q_\theta$ is a generative process. If $K$ is small, $q_\theta$ can be of higher temperature or higher entropy than $p_\theta$.
In this paper, instead of learning $p_\theta$, we treat $q_\theta$ as the target of learning. After learning, we keep $q_\theta$, but we discard $p_\theta$. That is, the sole purpose of $p_\theta$ is to guide a $K$-step MCMC from $p_0$.
In fact, we do not even require $p_\theta$ to be the stationary or steady-state distribution of the Markov transition kernel $M_\theta$. For instance, in the above Langevin dynamics, we can disable the noise term or change its variance. We can also tune the step size $\Delta\tau$ without worrying about the discretization error.
A common choice of $p_0$ is the uniform noise distribution, although other choices are also allowed. For tasks like super-resolution, inpainting, style transfer, etc., we may choose more informative initial distributions and conditional versions of the energy $f_\theta$.
3.3 Learning Short-Run MCMC
The learning algorithm is as follows. Initialize $\theta_0$. At learning iteration $t$, let $\theta_t$ be the model parameters. We generate $x_i^{-} \sim q_{\theta_t}(x)$ for $i = 1, \dots, m$. Then we update $\theta_{t+1} = \theta_t + \eta_t \Delta(\theta_t)$, where
$$\Delta(\theta) = \mathrm{E}_{p_{\rm data}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathrm{E}_{q_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \doteq \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} f_\theta(x_i) - \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial\theta} f_\theta(x_i^{-}).$$
We assume that the algorithm converges, so that $\theta_t \to \hat\theta$. At convergence, the resulting $\hat\theta$ solves the estimating equation $\Delta(\theta) = 0$.
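The whole learning loop fits in a few lines. The sketch below (ours, with made-up toy settings) uses a 1-D exponential family $f_\theta(x) = \theta x$, for which the Langevin gradient is simply $\theta$. Note that this $p_\theta$ is not even normalizable on the real line, yet the noise-initialized $K$-step dynamics $q_\theta$ is perfectly well defined and learns to match the data mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(1.0, 0.5, size=2000)     # "observed" data

K, dt, m, lr = 30, 0.2, 2000, 0.1           # toy hyperparameters
theta = 0.0

def short_run(theta, m):
    x = rng.uniform(-3.0, 3.0, size=m)      # always initialize from p_0
    for _ in range(K):                      # always run exactly K steps
        x = x + 0.5 * dt * theta + np.sqrt(dt) * rng.standard_normal(m)
    return x

for t in range(300):
    x_syn = short_run(theta, m)
    # moment-matching update: E_data[h] - E_q[h] with h(x) = x
    theta = theta + lr * (x_obs.mean() - x_syn.mean())

x_final = short_run(theta, 20000)
# At convergence, the short-run samples match the data mean.
```

The chains are always restarted from $p_0$ (non-persistent) and always run for exactly `K` steps (non-convergent), mirroring the algorithm in this subsection.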
To further improve training, we smooth $p_{\rm data}$ by convolution with a Gaussian white noise distribution, i.e., injecting additive noise $\epsilon_i \sim \mathrm{N}(0, \sigma^2 I)$ to the observed examples, $x_i \leftarrow x_i + \epsilon_i$ sonderby2016amortised ; roth2017stabilizing . This makes it easy for $\Delta(\theta_t)$ to converge to 0, especially if the number of MCMC steps $K$ is small, in which case the estimating equation may not have a solution without smoothing $p_{\rm data}$.
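The smoothing step is just additive Gaussian noise on the data; a minimal sketch (the value of $\sigma$ here is illustrative, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(1.0, 0.5, size=100_000)   # stand-in "observed examples"
sigma = 0.1                                  # illustrative smoothing level

# Convolving p_data with N(0, sigma^2) = adding noise to each example;
# the variance of the smoothed data increases by sigma^2.
x_smoothed = x_obs + sigma * rng.standard_normal(x_obs.shape)
```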
According to the contrastive divergence formulation hinton ; gao2018learning , the above learning algorithm approximately minimizes the contrastive divergence $D_{\rm KL}(p_{\rm data} \| p_\theta) - D_{\rm KL}(q_\theta \| p_\theta)$, where the dependence of $q_\theta$ on $\theta$ is ignored when taking the gradient. In this contrastive divergence, the minimization of the first term leads to the MLE. The second term measures the non-convergence, i.e., the gap between the short-run MCMC $q_\theta$ and the stationary distribution $p_\theta$. However, in CD, the target of learning is $p_\theta$, whereas we care about $q_\theta$.
The key to the algorithm is that the generated $x_i^{-}$ are independent and fair samples from the model $q_\theta$. This is much simpler than the algorithms that involve persistent chains, where the generated samples are neither independent nor fair samples from the EBM. The theoretical framework for understanding the above algorithm is Robbins-Monro’s stochastic approximation robbins1951stochastic , which solves an equation in expectation based on independent random samples. This is exactly what our method does: it solves the estimating equation $\Delta(\theta) = 0$, defined in terms of an expectation under the model, based on independent fair samples from the model.
3.4 Generator or Flow Model for Interpolation and Reconstruction
We may consider $q_\theta$ to be a generative model,
$$z \sim p_0(z), \quad x = M_\theta(z, u),$$
where $u$ denotes all the randomness in the short-run MCMC. For the $K$-step Langevin dynamics, $M_\theta$ can be considered a $K$-layer noise-injected residual network. $z$ can be considered the latent variables, and $p_0$ the prior distribution of $z$. Due to the non-convergence and non-mixing, $x$ can be highly dependent on $z$, and $z$ can be inferred from $x$. This is different from the convergent MCMC, where $x$ is independent of $z$. When the learning algorithm converges, the learned EBM tends to have low entropy and the Langevin dynamics behaves like gradient descent, where the noise terms are disabled, i.e., $u = 0$. In that case, we simply write $x = M_\theta(z)$.
We can perform interpolation as follows. Generate $z_1$ and $z_2$ from $p_0$. Let $z_\rho = \rho z_1 + \sqrt{1 - \rho^2}\, z_2$. This interpolation keeps the marginal variance of $z_\rho$ fixed. Let $x_\rho = M_\theta(z_\rho)$. Then $x_\rho$ is the interpolation of $x_1 = M_\theta(z_1)$ and $x_2 = M_\theta(z_2)$. Figure 8 displays $x_\rho$ for a sequence of $\rho \in [0, 1]$.
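A minimal sketch of the latent interpolation (ours; it assumes the prior noise is standardized, in which case the mix $\rho z_1 + \sqrt{1-\rho^2}\, z_2$ keeps the marginal variance fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
z1 = rng.standard_normal(100_000)
z2 = rng.standard_normal(100_000)

rho = 0.3
z_rho = rho * z1 + np.sqrt(1.0 - rho ** 2) * z2   # variance-preserving mix
# z_rho would then be pushed through M_theta to get the interpolated image.
```

A naive linear mix $\rho z_1 + (1-\rho) z_2$ would shrink the variance mid-way; the square-root weighting avoids that.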
For an observed image $x$, we can reconstruct $x$ by running gradient descent on the least squares loss function $L(z) = \|x - M_\theta(z)\|^2$, initializing $z_0 \sim p_0$, and iterating $z_{t+1} = z_t - \eta_t L'(z_t)$. Figure 8 displays the sequence of reconstructions $M_\theta(z_t)$.
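The reconstruction procedure is plain gradient descent in latent space. In the sketch below (ours), the $K$-step dynamics $M_\theta$ is replaced by a fixed linear map so that the gradient of the loss is analytic; in the paper's setting it would come from back-propagation through the noise-free Langevin steps:

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.standard_normal((5, 3))         # stand-in "generator" M(z) = A @ z
z_true = rng.standard_normal(3)
x = A @ z_true                          # "observed" image

z = rng.standard_normal(3)              # initialize z_0 ~ p_0
lr = 0.01
for _ in range(20000):
    # L(z) = ||x - A z||^2, so L'(z) = -2 A^T (x - A z)
    z = z + lr * 2.0 * A.T @ (x - A @ z)
loss = float(np.sum((x - A @ z) ** 2))
```

The loss is driven close to zero, i.e., the observation is faithfully reconstructed from an inferred latent code.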
In general, $M_\theta$ defines an energy-based dynamics. $K$ does not need to be fixed. It can be a stopping time that depends on the past history of the dynamics. The dynamics can be made deterministic by setting $u = 0$. This includes the attractor dynamics popular in computational neuroscience hopfield1982neural ; amit1989world ; poucet2005attractors .
4 Understanding the Learned Short-Run MCMC
4.1 Exponential Family and Moment Matching Estimator
An early version of the EBM is the FRAME (Filters, Random field, And Maximum Entropy) model zhu1998frame ; wu2000equivalence ; zhu1997GRADE , which is an exponential family model in which the features are the responses from a bank of filters. The deep FRAME model LuZhuWu2016 replaces the linear filters by pre-trained ConvNet filters. This amounts to only learning the top-layer weight parameters of the ConvNet. Specifically, $f_\theta(x) = \langle \theta, h(x) \rangle$, where $h(x)$ denotes the top-layer filter responses of a pre-trained ConvNet, and $\theta$ consists of the top-layer weight parameters. For such an $f_\theta$, $\frac{\partial}{\partial\theta} f_\theta(x) = h(x)$. Then the maximum likelihood estimator of $\theta$ is actually a moment matching estimator, i.e., $\mathrm{E}_{p_{\hat\theta_{\rm MLE}}}[h(x)] = \mathrm{E}_{p_{\rm data}}[h(x)]$. If we use the short-run MCMC learning algorithm, it will converge (assuming convergence is attainable) to a moment matching estimator, i.e., $\mathrm{E}_{q_{\hat\theta}}[h(x)] = \mathrm{E}_{p_{\rm data}}[h(x)]$. Thus the learned model $q_{\hat\theta}$ is a valid estimator in that it matches the data distribution in terms of the sufficient statistics defined by the EBM.
Consider two families of distributions: $\Omega = \{p : \mathrm{E}_p[h(x)] = \mathrm{E}_{p_{\rm data}}[h(x)]\}$ and $\Theta = \{p_\theta, \forall \theta\}$. They are illustrated by two curves in Figure 5. $\Omega$ contains all the distributions that match the data distribution in terms of $h(x)$. Both $q_{\hat\theta}$ and $p_{\hat\theta_{\rm MLE}}$ belong to $\Omega$, and of course $p_{\rm data}$ also belongs to $\Omega$. $\Theta$ contains all the EBMs with different values of the parameter $\theta$. The uniform distribution $p_0$ corresponds to $\theta = 0$, thus $p_0$ belongs to $\Theta$.
The EBM under $\hat\theta$, i.e., $p_{\hat\theta}$, does not belong to $\Omega$, and it may be quite far from $p_{\rm data}$. In general, $\mathrm{E}_{p_{\hat\theta}}[h(x)] \neq \mathrm{E}_{p_{\rm data}}[h(x)]$; that is, the corresponding EBM does not match the data distribution as far as $h(x)$ is concerned. $p_{\hat\theta}$ can be much further from the uniform $p_0$ than $q_{\hat\theta}$ is from $p_0$, and thus $p_{\hat\theta}$ may have a much lower entropy than $q_{\hat\theta}$.
Figure 5 illustrates the above idea. The red dotted line illustrates MCMC. Starting from $p_0$, the $K$-step MCMC leads to $q_{\hat\theta}$. If we continue to run MCMC for infinitely many steps, we will get to $p_{\hat\theta}$. Thus the role of $p_{\hat\theta}$ is to serve as an unreachable target to guide the $K$-step MCMC, which stops at the mid-way $q_{\hat\theta}$. One can say that the short-run MCMC is a wrong sampler of a wrong model, but it itself is a valid model, because it belongs to $\Omega$.
The MLE $p_{\hat\theta_{\rm MLE}}$ is the projection of $p_{\rm data}$ onto $\Theta$. Thus it belongs to $\Theta$. It also belongs to $\Omega$, as can be seen from the maximum likelihood estimating equation. Thus it is the intersection of $\Omega$ and $\Theta$. Among all the distributions in $\Omega$, $p_{\hat\theta_{\rm MLE}}$ is the closest to $p_0$. Thus it has the maximum entropy among all the distributions in $\Omega$.
The above duality between maximum likelihood and maximum entropy follows from the following fact. Let $p_{\hat\theta_{\rm MLE}}$ be the intersection between $\Omega$ and $\Theta$. $\Omega$ and $\Theta$ are orthogonal in terms of the Kullback-Leibler divergence. For any $p \in \Omega$ and any $p_\theta \in \Theta$, we have the Pythagorean property della1995inducing : $D_{\rm KL}(p \| p_\theta) = D_{\rm KL}(p \| p_{\hat\theta_{\rm MLE}}) + D_{\rm KL}(p_{\hat\theta_{\rm MLE}} \| p_\theta)$. See Appendix 7.1 for a proof. Thus (1) $p_{\hat\theta_{\rm MLE}} = \arg\min_{p_\theta \in \Theta} D_{\rm KL}(p_{\rm data} \| p_\theta)$, i.e., $p_{\hat\theta_{\rm MLE}}$ is the MLE within $\Theta$. (2) $p_{\hat\theta_{\rm MLE}} = \arg\min_{p \in \Omega} D_{\rm KL}(p \| p_0)$, i.e., $p_{\hat\theta_{\rm MLE}}$ has the maximum entropy within $\Omega$.
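For the exponential family $f_\theta(x) = \langle \theta, h(x) \rangle$, the Pythagorean property can be checked in two lines (our sketch of the argument; the paper's Appendix 7.1 gives the proof). For $p \in \Omega$ and $p_\theta \in \Theta$,

```latex
\begin{aligned}
D_{\mathrm{KL}}(p \,\|\, p_\theta) - D_{\mathrm{KL}}(p \,\|\, p_{\hat\theta_{\mathrm{MLE}}})
&= \mathrm{E}_p\!\left[\langle \hat\theta_{\mathrm{MLE}} - \theta,\, h(x)\rangle\right]
   + \log \frac{Z(\theta)}{Z(\hat\theta_{\mathrm{MLE}})} \\
&= \mathrm{E}_{p_{\hat\theta_{\mathrm{MLE}}}}\!\left[\langle \hat\theta_{\mathrm{MLE}} - \theta,\, h(x)\rangle\right]
   + \log \frac{Z(\theta)}{Z(\hat\theta_{\mathrm{MLE}})}
 = D_{\mathrm{KL}}(p_{\hat\theta_{\mathrm{MLE}}} \,\|\, p_\theta),
\end{aligned}
```

where the second equality uses $\mathrm{E}_p[h(x)] = \mathrm{E}_{p_{\hat\theta_{\rm MLE}}}[h(x)]$, which holds because both $p$ and $p_{\hat\theta_{\rm MLE}}$ belong to $\Omega$.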
We can understand the learned $q_{\hat\theta}$ from two Pythagorean results.

(1) Pythagorean for the right triangle formed by $q_{\hat\theta}$, $p_{\hat\theta_{\rm MLE}}$, and $p_0$:
$$D_{\rm KL}(q_{\hat\theta} \| p_{\hat\theta_{\rm MLE}}) = D_{\rm KL}(q_{\hat\theta} \| p_0) - D_{\rm KL}(p_{\hat\theta_{\rm MLE}} \| p_0) = \mathrm{entropy}(p_{\hat\theta_{\rm MLE}}) - \mathrm{entropy}(q_{\hat\theta}),$$
where $\mathrm{entropy}(q) = -\mathrm{E}_q[\log q(x)]$ is the entropy of $q$. See Appendix 7.1. Thus we want the entropy of $q_{\hat\theta}$ to be high in order for it to be a good approximation to $p_{\hat\theta_{\rm MLE}}$. Thus for small $K$, it is important to let $p_0$ be the uniform distribution, which has the maximum entropy.
(2) Pythagorean for the right triangle formed by $q_{\hat\theta}$, $p_{\hat\theta_{\rm MLE}}$, and $p_{\hat\theta}$:
$$D_{\rm KL}(q_{\hat\theta} \| p_{\hat\theta}) = D_{\rm KL}(q_{\hat\theta} \| p_{\hat\theta_{\rm MLE}}) + D_{\rm KL}(p_{\hat\theta_{\rm MLE}} \| p_{\hat\theta}).$$
For fixed $\theta$, as $K$ increases, $D_{\rm KL}(q_\theta \| p_\theta)$ decreases monotonically cover2012elements . The smaller $D_{\rm KL}(q_{\hat\theta} \| p_{\hat\theta})$ is, the smaller $D_{\rm KL}(q_{\hat\theta} \| p_{\hat\theta_{\rm MLE}})$ and $D_{\rm KL}(p_{\hat\theta_{\rm MLE}} \| p_{\hat\theta})$ are. Thus it is desirable to use a large $K$, as long as we can afford the computational cost, to make both $q_{\hat\theta}$ and $p_{\hat\theta}$ close to $p_{\hat\theta_{\rm MLE}}$.
4.2 General ConvNet-EBM and Generalized Moment Matching Estimator
For a general ConvNet $f_\theta(x)$, the learning algorithm based on the short-run MCMC solves the following estimating equation:
$$\mathrm{E}_{p_{\rm data}}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] = \mathrm{E}_{q_\theta}\left[\frac{\partial}{\partial\theta} f_\theta(x)\right],$$
whose solution is $\hat\theta$. This can be considered a generalized moment matching estimator that in general solves the following estimating equation:
$$\mathrm{E}_{p_{\rm data}}[h(x, \theta)] = \mathrm{E}_{q_\theta}[h(x, \theta)],$$
where we generalize $h(x)$ in the original moment matching estimator to $h(x, \theta)$, which involves both $x$ and $\theta$. For our learning algorithm, $h(x, \theta) = \frac{\partial}{\partial\theta} f_\theta(x)$. That is, the learned $q_{\hat\theta}$ is still a valid estimator in the sense of matching the data distribution. The above estimating equation can be solved by Robbins-Monro’s stochastic approximation robbins1951stochastic , as long as we can generate independent fair samples from $q_\theta$.
In classical statistics, we often assume that the model is correct, i.e., $p_{\rm data}$ corresponds to $p_{\theta_{\rm true}}$ for some true value $\theta_{\rm true}$. In that case, the generalized moment matching estimator $\hat\theta$ follows an asymptotic normal distribution centered at the true value. The variance of $\hat\theta$ depends on the choice of $h(x, \theta)$. The variance is minimized by the choice $h(x, \theta) = \frac{\partial}{\partial\theta} \log p_\theta(x)$, which corresponds to the maximum likelihood estimate of $p_\theta$, and which leads to the Cramer-Rao lower bound and Fisher information. See Appendix 7.2 for a brief explanation.

Our $h(x, \theta) = \frac{\partial}{\partial\theta} f_\theta(x)$ is not equal to $\frac{\partial}{\partial\theta} \log p_\theta(x)$. Thus the learning algorithm will not give us the maximum likelihood estimate of $p_\theta$. However, the validity of the learned $q_{\hat\theta}$ does not require $\hat\theta$ to be $\hat\theta_{\rm MLE}$. In practice, one can never assume that the model is true. As a result, the optimality of maximum likelihood may not hold, and there is no compelling reason that we must use the MLE.
The relationship between $p_0$, $q_{\hat\theta}$, $p_{\hat\theta}$, and $p_{\rm data}$ may still be illustrated by Figure 5, although we need to modify the definition of $\Omega$ to be the set of all distributions that match $p_{\rm data}$ in terms of $h(x, \hat\theta)$, i.e., $\Omega = \{p : \mathrm{E}_p[h(x, \hat\theta)] = \mathrm{E}_{p_{\rm data}}[h(x, \hat\theta)]\}$, so that the estimating equation is solved at $\hat\theta$. For instance, within $\Omega$, each distribution may correspond to a different implementation of MCMC.
5 Experimental Results
In this section, we demonstrate (1) realistic synthesis, (2) smooth interpolation, (3) faithful reconstruction of observed examples, and (4) the influence of hyperparameters. $K$ denotes the number of MCMC steps in equation (4). $n_f$ denotes the number of output feature maps in the first layer of $f_\theta$.
We evaluate the fidelity of generated examples on various datasets, each reduced to a fixed number of observed examples. Figure 8 depicts generated samples for various datasets with $K$ Langevin steps for both training and evaluation. The number of feature maps $n_f$ is set separately for CIFAR-10, CelebA, and LSUN. We run the model updates for a large number of iterations, then gradually decrease the learning rate and the injected noise for the observed examples. Table 1 (a) compares the Inception Score (IS) salimans2016improved ; barratt2018note and Fréchet Inception Distance (FID) heusel2017gans with the Inception v2 classifier szegedy2016rethinking on generated examples. Despite its simplicity, short-run MCMC is competitive.
We demonstrate interpolation between generated examples, following the procedure outlined in Section 3.4. Let $x_\rho = M_\theta(z_\rho)$, where $M_\theta$ denotes the $K$-step gradient descent with $u = 0$. Figure 8 illustrates $x_\rho$ for a sequence of $\rho \in [0, 1]$ on CelebA. The interpolation appears smooth, and the intermediate samples resemble realistic faces. The interpolation experiment highlights that the short-run MCMC does not mix, which is in fact an advantage rather than a disadvantage. This interpolation ability goes far beyond the capacity of EBM and convergent MCMC.
We demonstrate reconstruction of observed examples. For short-run MCMC, we follow the procedure outlined in Section 3.4: for an observed image $x$, we reconstruct it by running gradient descent on the least squares loss function $L(z) = \|x - M_\theta(z)\|^2$, initializing $z_0 \sim p_0$, and iterating $z_{t+1} = z_t - \eta_t L'(z_t)$. For VAE, reconstruction is readily available. For GAN, we perform Langevin inference of the latent variables HanLu2016 ; CoopNets2016 . Figure 8 depicts faithful reconstructions. Table 1 (b) illustrates competitive reconstructions in terms of per-pixel MSE on held-out observed examples. Again, the reconstruction ability of the short-run MCMC is due to the fact that it is not mixing.
5.4 Influence of Hyperparameters
MCMC Steps. Table 2 depicts the influence of the number of MCMC steps $K$ used during training on synthesis quality and on the average gradient magnitude over the $K$-step Langevin dynamics (4). We observe: (1) the quality of synthesis decreases with decreasing $K$, and (2) the shorter the MCMC, the colder the learned EBM, and the more dominant the gradient descent part of the Langevin dynamics. With small $K$, the short-run MCMC fails “gracefully” in terms of synthesis. A moderately large $K$ appears reasonable.
Injected Noise. To stabilize training, we smooth $p_{\rm data}$ by injecting additive noise $\epsilon_i \sim \mathrm{N}(0, \sigma^2 I)$ to the observed examples $x_i$. Table 3 (a) depicts the influence of $\sigma$ on the fidelity of the negative examples in terms of IS and FID: when lowering $\sigma$, the fidelity of the examples improves. Hence, it is desirable to pick the smallest $\sigma$ that maintains the stability of training. Further, to improve synthesis, we may gradually decrease the learning rate and anneal $\sigma$ while training.
Model Complexity. We investigate the influence of the number of output feature maps $n_f$ on the generated samples, with $K$ fixed. Table 3 (b) summarizes the quality of synthesis in terms of IS and FID. As the number of feature maps increases, so does the quality of the synthesis. Hence, the quality of synthesis may scale with $n_f$ until the computational means are exhausted.
6 Conclusion

This paper provides a new way to think about and utilize MCMC. It shifts the focus from the EBM and impractical convergent MCMC to the practical and efficient short-run MCMC guided by the EBM. The short-run MCMC is non-convergent, non-mixing, and non-persistent. Thus it is as bad as an MCMC can possibly be. However, the vice of non-convergence and non-mixing actually becomes a virtue, in that the short-run MCMC is capable of reconstruction and interpolation, which is beyond the capacity of EBM and convergent MCMC. The main goal of this paper is to advocate the short-run MCMC as a valid generative model.
Our theoretical understanding of the short-run MCMC is based on the moment matching estimator. In the case of exponential family models, the learned short-run MCMC matches to the data distribution in terms of expectations of sufficient statistics, and it can be considered an approximation to the MLE of the EBM. In the general case of ConvNet-EBM, the learned short-run MCMC is a generalized moment matching estimator.
Despite our focus on short-run MCMC, we do not advocate abandoning EBM all together. On the contrary, we ultimately aim to learn valid EBM. Hopefully, the non-convergent short-run MCMC studied in this paper may be useful in this endeavor. It is also our hope that our work may help to understand the learning of attractor dynamics popular in neuroscience.
The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; and ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Prof. Stu Geman and Prof. Xianfeng (David) Gu for helpful discussions.
-  Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
-  DJ Amit and Modelling Brain Function. The world of attractor neural networks. Modeling Brain Function, 1989.
-  Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
-  Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
-  Jifeng Dai, Yang Lu, and Ying-Nian Wu. Generative modeling of convolutional neural networks. arXiv preprint arXiv:1412.6296, 2014.
-  Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.
-  Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. arXiv preprint cmp-lg/9506014, 1995.
-  Persi Diaconis and Daniel Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, 1(1):36–61, 1991.
-  Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
-  Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
-  Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
-  Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9155–9164, 2018.
-  Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6):721–741, 1984.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Anirudh Goyal Alias Parth Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pages 4392–4402, 2017.
-  Ulf Grenander and Michael I Miller. Pattern theory: from representation to inference. Oxford University Press, 2007.
-  Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In AAAI, volume 3, page 13, 2017.
-  Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inference model. arXiv preprint arXiv:1812.10907, 2018.
-  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, pages 1771–1800, 2002.
-  Geoffrey E Hinton, Simon Osindero, Max Welling, and Yee-Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.
-  John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
-  Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective learning for discriminative classification. In Advances in Neural Information Processing Systems, 2017.
-  Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  Dmitry Krotov and John J Hopfield. Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences, page 201820458, 2019.
-  Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2783, 2017.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
-  Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, pages 609–616, 2009.
-  Kwonjoon Lee, Weijian Xu, Fan Fan, and Zhuowen Tu. Wasserstein introspective neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
-  Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Learning FRAME models using CNN filters. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799, 2014.
-  Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
-  Jiquan Ngiam, Zhenghao Chen, Pang W Koh, and Andrew Y Ng. Learning deep energy models. In International Conference on Machine Learning, pages 1105–1112, 2011.
-  Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of mcmc-based maximum likelihood learning of energy-based models. arXiv, 2019.
-  Bruno Poucet and Etienne Save. Attractors in memory. Science, 308(5723):799–800, 2005.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
-  Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
-  Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in neural information processing systems, pages 2018–2028, 2017.
-  Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2009.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
-  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008.
-  Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
-  Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128. ACM, 2009.
-  Ying Nian Wu, Song Chun Zhu, and Xiuwen Liu. Equivalence of julesz ensembles and frame models. International Journal of Computer Vision, 38(3):247–265, 2000.
-  Jianwen Xie, Yang Lu, Ruiqi Gao, and Ying Nian Wu. Cooperative learning of energy-based model and latent variable model via mcmc teaching. In AAAI, 2018.
-  Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence (PAMI), 2018.
-  Jianwen Xie, Yang Lu, Song Chun Zhu, and Ying Nian Wu. A theory of generative convnet. International Conference on Machine Learning, 2016.
-  Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8629–8638, 2018.
-  Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7093–7101, 2017.
-  Laurent Younes. Maximum likelihood estimation for gibbsian fields. Lecture Notes-Monograph Series, pages 403–426, 1991.
-  Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
-  Song-Chun Zhu and David Mumford. Grade: Gibbs reaction and diffusion equations. In International Conference on Computer Vision, pages 847–854, 1998.
-  Song-Chun Zhu, Ying Nian Wu, and David Mumford. Filters, random fields and maximum entropy (frame): Toward a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.
-  Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008.
7.1 Proof of Pythagorean Identity
For a parameter vector $\theta$, let $p_\theta(x) = \frac{1}{Z(\theta)} \exp(\langle \theta, h(x) \rangle)$ be an exponential family model with sufficient statistics $h(x)$ and normalizing constant $Z(\theta)$. For any distribution $p$,
$$\mathrm{KL}(p \| p_\theta) = -H(p) - \langle \theta, \mathrm{E}_p[h(x)] \rangle + \log Z(\theta),$$
where $H(p)$ is the entropy of $p$. Let $\hat{\theta}$ satisfy the moment-matching condition $\mathrm{E}_{p_{\hat{\theta}}}[h(x)] = \mathrm{E}_p[h(x)]$. Evaluating the display above at $\theta$ and at $\hat{\theta}$ and subtracting gives
$$\mathrm{KL}(p \| p_\theta) - \mathrm{KL}(p \| p_{\hat{\theta}}) = -\langle \theta - \hat{\theta}, \mathrm{E}_p[h(x)] \rangle + \log Z(\theta) - \log Z(\hat{\theta}),$$
which, by the moment-matching condition, equals $-\langle \theta - \hat{\theta}, \mathrm{E}_{p_{\hat{\theta}}}[h(x)] \rangle + \log Z(\theta) - \log Z(\hat{\theta}) = \mathrm{KL}(p_{\hat{\theta}} \| p_\theta)$. Hence the Pythagorean identity
$$\mathrm{KL}(p \| p_\theta) = \mathrm{KL}(p \| p_{\hat{\theta}}) + \mathrm{KL}(p_{\hat{\theta}} \| p_\theta).$$
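The identity can be verified numerically for a small discrete exponential family. The following sketch (an illustration, not code from the paper; the support, data distribution, and feature $h(x) = x$ are arbitrary choices) moment-matches $\hat{\theta}$ by bisection and checks that the two sides of the identity agree:

```python
import numpy as np

# Exponential family over {0, 1, 2}: p_theta(x) proportional to exp(theta * x).
# If theta_hat moment-matches p, i.e. E_{p_theta_hat}[x] = E_p[x], then
# KL(p || p_theta) = KL(p || p_theta_hat) + KL(p_theta_hat || p_theta).

xs = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.2, 0.3])            # arbitrary data distribution

def p_theta(theta):
    w = np.exp(theta * xs)
    return w / w.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Solve E_{p_theta}[x] = E_p[x] for theta_hat by bisection
# (the mean is monotone increasing in theta).
target = float(p @ xs)
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    if p_theta(mid) @ xs < target:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2

theta = 1.3                               # any other parameter value
lhs = kl(p, p_theta(theta))
rhs = kl(p, p_theta(theta_hat)) + kl(p_theta(theta_hat), p_theta(theta))
print(abs(lhs - rhs))                     # ~0 up to bisection precision
```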
7.2 Estimating Equation and Cramer-Rao Theory
For a model $p_\theta(x)$, we can estimate $\theta$ from observations $x_1, \dots, x_n$ by solving the estimating equation $\frac{1}{n} \sum_{i=1}^{n} g(x_i, \theta) = 0$. Assume the solution exists and let it be $\hat{\theta}$. Assume there exists $\theta^*$ so that $\mathrm{E}_{\theta^*}[g(x, \theta^*)] = 0$. Let $\bar{g}(\theta) = \frac{1}{n} \sum_{i=1}^{n} g(x_i, \theta)$, so that the estimating equation becomes $\bar{g}(\hat{\theta}) = 0$. A Taylor expansion around $\theta^*$ gives us the asymptotic linear equation $\bar{g}(\theta^*) + A (\hat{\theta} - \theta^*) = 0$, where $A = \mathrm{E}_{\theta^*}[\partial g(x, \theta^*)/\partial \theta]$. Thus the estimate is $\hat{\theta} = \theta^* - A^{-1} \bar{g}(\theta^*)$, i.e., a one-step Newton-Raphson update from $\theta^*$. Since $\mathrm{E}_{\theta}[g(x, \theta)] = 0$ for any $\theta$, including $\theta^*$, the estimator is asymptotically unbiased. The Cramer-Rao theory establishes that $\hat{\theta}$ has an asymptotic normal distribution, $\sqrt{n}(\hat{\theta} - \theta^*) \rightarrow \mathrm{N}(0, V)$, where $V = A^{-1} B A^{-\top}$ with $B = \mathrm{Var}_{\theta^*}[g(x, \theta^*)]$. $V$ is minimized if we take $g(x, \theta) = \partial \log p_\theta(x)/\partial \theta$, which leads to the maximum likelihood estimating equation, and the corresponding $V = I(\theta^*)^{-1}$, where $I(\theta^*) = \mathrm{Var}_{\theta^*}[\partial \log p_{\theta^*}(x)/\partial \theta]$ is the Fisher information.
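The efficiency claim can be made concrete with a small Monte Carlo study (an illustrative sketch, not code from the paper; the Gaussian location model and both estimating equations are arbitrary choices). For $\mathrm{N}(\theta, 1)$ data, the score equation $g(x, \theta) = x - \theta$ is solved by the sample mean and attains the Fisher bound $1/n$, while the sign-based equation $g(x, \theta) = \mathrm{sign}(x - \theta)$ is solved by the sample median, with larger asymptotic variance $\pi/(2n)$:

```python
import numpy as np

# Compare two estimating equations for the location theta of N(theta, 1) data:
#   g1(x, theta) = x - theta         -> score / MLE equation (sample mean)
#   g2(x, theta) = sign(x - theta)   -> robust equation (sample median)
# Cramer-Rao: the score-based estimator has the smallest asymptotic variance,
# 1/n here, while the median's asymptotic variance is pi/(2n).

rng = np.random.default_rng(0)
theta_true, n, reps = 2.0, 400, 2000

means, medians = [], []
for _ in range(reps):
    x = rng.normal(theta_true, 1.0, size=n)
    means.append(x.mean())        # solves (1/n) sum g1(x_i, theta) = 0
    medians.append(np.median(x))  # solves (1/n) sum g2(x_i, theta) = 0

var_mean = float(np.var(means))
var_median = float(np.var(medians))
print(var_mean * n, var_median * n)  # ~1.0 versus ~pi/2
```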
7.4 Model Architecture
We use the following notation. Convolutional operation with output feature maps and a bias term. Leaky-ReLU nonlinearity with default leaky factor . We set .
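For reference, the two building blocks named above can be sketched as follows in plain numpy (a minimal illustration, not the paper's implementation; the stride-1 valid convolution, the leaky factor 0.2, and all shapes are assumptions, since the original values are not given here):

```python
import numpy as np

def conv2d(x, w, b):
    """Stride-1 valid convolution with bias.

    x: (H, W, Cin), w: (k, k, Cin, Cout), b: (Cout,)
    returns: (H - k + 1, W - k + 1, Cout)
    """
    k = w.shape[0]
    H, W, _ = x.shape
    out = np.empty((H - k + 1, W - k + 1, w.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            # Contract the patch against all output filters at once.
            out[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

def leaky_relu(x, alpha=0.2):
    """Leaky-ReLU: identity for positive inputs, slope alpha otherwise."""
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))            # toy input image
w = rng.normal(size=(3, 3, 3, 16))        # 16 output feature maps
y = leaky_relu(conv2d(x, w, np.zeros(16)))
print(y.shape)  # (6, 6, 16)
```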