1 Introduction
Deep generative models have found many applications, such as image and video synthesis and unsupervised or semi-supervised learning. Such models usually consist of one or more layers of latent variables organized in top-down architectures. Learning such latent variable models from training examples is a fundamental problem, and it is the problem this paper addresses.
Learning latent variable models usually requires inferring the latent variables based on their joint posterior distribution, i.e., the conditional distribution of the latent variables given each observed example. The inference typically requires Markov chain Monte Carlo (MCMC) such as Langevin dynamics [27] or Hamiltonian Monte Carlo (HMC) [30]. Such MCMC posterior sampling can be time-consuming and difficult to scale up to big training data. The convergence of MCMC sampling is also questionable even after a long running time.
An alternative to MCMC posterior sampling is variational inference, which employs a simple factorized distribution to approximate the posterior distribution. Classical variational inference methods assume an approximate posterior distribution for each observed example, where the approximate posterior distribution is parametrized by some variational parameters that are specific to each example and that can be optimized by minimizing the Kullback-Leibler divergence between the approximate distribution and the true posterior.
More recently, in the context of the generator network, which consists of a single latent vector at the top layer, the variational autoencoder (VAE) [25, 35] learns an extra inference network that maps each input example to the approximate posterior distribution, in particular, the mean vector and the diagonal variance-covariance matrix of the approximate multivariate Gaussian posterior. Unlike classical variational methods, the parameters of the inference network are shared by all the training and testing examples, and they are learned together with the parameters of the generator network by jointly maximizing a lower bound of the log-likelihood.
Despite the success of VAE, it has the following shortcomings. (1) It requires a separate inference model with a separate set of parameters. These parameters are to be learned together with the parameters of the generative model. (2) The design of the inference model is not automatic, especially for generative models with multiple layers of latent vectors, which may have complex relationships under their joint posterior distribution. It is a highly non-trivial task to design an inference model to adequately capture the explaining-away competitions and bottom-up and top-down interactions between multiple layers of latent variables.
In this article, we propose a simple method that lies in between MCMC posterior sampling and variational inference. Specifically, we propose to use a short run inference dynamics guided by the posterior distribution of the latent variables as an approximate inference engine. To make it more concrete, we employ a finite-step gradient descent on the negative log-posterior of the latent variables. For each training example, we initialize such a short run gradient flow from the prior distribution, such as a Gaussian or uniform noise distribution, and run a finite number (e.g., 20) of gradient descent updates. This amounts to a residual network or a recurrent neural network (RNN) that transforms the initial noise distribution into an approximate posterior distribution. We optimize the step size of the gradient flow by minimizing the Kullback-Leibler divergence between the approximate distribution produced by this short run inference dynamics and the true posterior distribution. Thanks to the computing capacities of modern deep learning platforms, it is possible to compute the approximate distribution and its entropy, and to optimize the aforementioned Kullback-Leibler divergence. This is similar to variational inference, except that the variational parameter is the step size of the gradient descent or, in general, the tuning parameters of the short run inference dynamics. Our experiments show that the proposed method outperforms the VAE in terms of reconstruction error and synthesis quality.
The proposed method is similar to MCMC posterior sampling, except that we focus on optimizing a short run version of MCMC sampling. Such short run inference dynamics is easily affordable on current deep learning platforms, and there is little difficulty in scaling it up to big datasets. The proposed method is also similar to variational inference, except that the inference model is simply a noise-initialized finite-step gradient flow, where the only extra parameters are the step sizes of the gradient flow or, in general, some tuning parameters of the short run inference dynamics.
One major advantage of the proposed method is that it is natural and automatic. For models with multiple layers of latent variables that may be organized in complex top-down architectures, the gradient descent update of the log-posterior of the latent variables can be automatically obtained on modern deep learning platforms. Such gradient descent update naturally integrates explaining-away competitions and bottom-up and top-down interactions between multiple layers of latent variables. By optimizing the step size of the gradient descent update using the variational criterion, it is possible to have a good approximate posterior distribution and an efficient inference engine. It thus enables researchers to explore sophisticated generative models without worrying about constructing the presumably more sophisticated inference models.
2 Contributions and related work
The following are contributions of our paper.
- We propose short run inference dynamics for inferring latent variables in deep generative models.
- We provide a method to determine the optimal step size or, in general, the tuning parameters of the short run inference dynamics.
- We demonstrate learning of multi-layer latent variable models with high quality samples and reconstructions.
The following are themes related to our work.
(1) Variational inference. As mentioned above, VAE [25, 35, 38, 11] is the prominent method for learning the generator network. The original inference model has been generalized to flow-based inference models such as normalizing flows. Our short run gradient flow can also be viewed as a flow model. The difference is that the gradient flow is obtained automatically from the log-posterior of the latent variables. There is no need to design an extra inference model or flow-based model for inference, and there is no need to learn an extra set of parameters except for some tuning parameters of the gradient descent algorithm, such as the step size. More importantly, the gradient descent update is readily available for models with multiple layers of latent vectors, whereas designing inference models for such generative models can be a highly non-trivial task.
(2) Alternating back-propagation. [12, 41] propose to learn the generator network by maximum likelihood, where the learning algorithm iterates the following two steps: (a) inferring the latent variables by Langevin dynamics that samples from the posterior distribution of the latent variables; (b) updating the model parameters based on the inferred latent variables. Both steps involve gradient computations based on back-propagation, hence the name "alternating back-propagation". In the training stage, in step (a), the Langevin dynamics is initialized from the values of the latent variables produced in the previous epoch. This is usually called a persistent chain in the literature. In our work, in step (a), we always initialize the finite-step (e.g., 20-step) gradient descent updates from the prior noise distribution. This can be called a non-persistent chain. It is also a non-convergent chain because we do not expect the chain to converge to the target distribution. But we do optimize the step size (or, in general, the tuning parameters) of the gradient descent update based on a variational criterion. This allows us to use very affordable short run inference dynamics for learning; moreover, the same short run inference dynamics can be used for testing examples as well.
(3) Short run MCMC for energy-based model. Recently, [32] propose to learn a short run MCMC for the energy-based model. An energy-based model is in the form of an unnormalized probability density function, where the log-density or the energy function is parametrized by a bottom-up neural network. [32] show that it is possible to learn a short run MCMC, such as 100-step Langevin dynamics, that can generate images of high synthesis quality, and they justify the learned short run MCMC as a valid flow-like model. However, the energy-based model itself is not well learned. Our method follows a similar strategy, but it is intended for approximately sampling from the posterior distribution of the latent variables. Our optimization method can help reduce the gap between the short run MCMC and the corresponding energy-based model, so that the energy-based model may also be properly learned.

(4) Attractor dynamics.
In computational neuroscience, the dynamics of neuron activities is often modeled by attractor dynamics [19, 2, 33]. However, the objective function of the attractor dynamics is often implicit, so it is unclear what computational problem the attractor dynamics is solving. For the attractor dynamics to be implemented in real time, the dynamics is necessarily a short run dynamics. Our short run inference dynamics is guided by a top-down model with a well-defined objective function in terms of the log-posterior of the latent variables. It may be connected to the attractor dynamics and help us understand the latter. We shall explore this direction in future work.

3 Deep generative models
3.1 Joint, marginal, and posterior distributions
Let x be the observed example, such as an image. Let z be the latent variables, which may consist of latent variables at multiple layers organized in a top-down architecture. We may consider z as forming an interpretation or explanation of x.
The joint distribution of (z, x) is p_θ(x, z) = p_θ(z) p_θ(x|z), where θ consists of the model parameters. The marginal distribution of x is p_θ(x) = ∫ p_θ(x, z) dz. Given x, the inference of z can be based on the posterior distribution p_θ(z|x) = p_θ(x, z)/p_θ(x).

3.2 Five families of models
To put our work in context, we review the following classes of models, which correspond to different ways of specifying p_θ(x, z) or p_θ(x).
(1) Generator network. This model assumes a d-dimensional noise vector z at the top layer. The prior distribution p(z) is known, such as z ∼ N(0, I_d), where I_d is the d-dimensional identity matrix. Given z, x = g_θ(z) + ε, where g_θ is a top-down convolutional neural network (sometimes called a deconvolutional network due to the top-down nature), and θ consists of all the weight and bias terms of this top-down network. ε is usually assumed to be Gaussian white noise with mean 0 and variance σ². Thus p_θ(x|z) is such that [x|z] ∼ N(g_θ(z), σ²I_D), where D is the dimensionality of x. For this model,

log p_θ(x, z) = log[p(z) p_θ(x|z)]  (1)
= −(1/(2σ²))‖x − g_θ(z)‖² − (1/2)‖z‖² + c,  (2)

where c is a constant independent of z.
In the generator network, units in the latent vector z may carry semantic meanings, in that if we vary the units of z, we can observe semantically meaningful variations in g_θ(z).
(2) Multi-layer generator network. While it is computationally convenient to have a latent noise vector at the top layer, it does not account for the fact that patterns can appear at multiple layers of compositions or abstractions (e.g., face → (eyes, nose, mouth) → (edges, corners) → pixels), where variations and randomness occur at multiple layers. To capture such a hierarchical structure, it is desirable to introduce multiple layers of latent variables organized in a top-down architecture. Specifically, we have z = (z_l, l = 1, ..., L), where layer L is the top layer, and layer 1 is the bottom layer above x. For notational simplicity, we let z_0 = x. We can then specify p_θ(x, z) as

p_θ(x, z) = p_θ(z_L) ∏_{l=0}^{L−1} p_θ(z_l | z_{l+1}).  (3)

One concrete example is [z_l | z_{l+1}] ∼ N(μ_l(z_{l+1}), diag(σ²_l(z_{l+1}))), l = 0, ..., L−1, where μ_l and diag(σ²_l) are the mean vector and the diagonal variance-covariance matrix of z_l respectively, and they are functions of z_{l+1}. θ collects all the parameters in these functions. log p_θ(x, z) can be obtained similarly to Equation (2).
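As a toy illustration of the top-down factorization in Equation (3), the following sketch draws ancestral samples from a hypothetical two-layer model with Gaussian conditionals; the dimensions, tanh mean functions, and weight matrices are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a 2-layer top-down model (L = 2):
# z_2 -> z_1 -> x = z_0, with Gaussian conditionals as in Eq. (3).
W2, b2 = rng.normal(size=(8, 4)), np.zeros(8)    # mean function for z_1 given z_2
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)  # mean function for x given z_1
sigma1, sigma0 = 0.1, 0.1                        # conditional std at each layer

def ancestral_sample():
    """Draw (x, z_1, z_2) ~ p(z_2) p(z_1|z_2) p(x|z_1) by top-down sampling."""
    z2 = rng.normal(size=4)                      # top layer: z_2 ~ N(0, I)
    mu1 = np.tanh(W2 @ z2 + b2)
    z1 = mu1 + sigma1 * rng.normal(size=8)       # z_1 | z_2 ~ N(mu_1(z_2), sigma_1^2 I)
    mu0 = np.tanh(W1 @ z1 + b1)
    x = mu0 + sigma0 * rng.normal(size=16)       # x | z_1 ~ N(mu_0(z_1), sigma_0^2 I)
    return x, z1, z2
```

Sampling from the joint is trivial here; the hard problem, addressed below, is going the other way, from x back to (z_1, z_2).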
(3) Energy-based model (EBM). The EBM [29, 31, 21, 43, 42, 9, 26, 32, 8, 20, 28] specifies p_θ(x) directly, and it does not involve any latent variables z. Specifically, p_θ(x) = (1/Z(θ)) exp(f_θ(x)), where f_θ(x) is parametrized by a bottom-up neural network and Z(θ) is the normalizing constant. −f_θ(x) is called the energy function.
(4) Latent energy-based model. The latent EBM [18, 37] specifies p_θ(x, z) directly, instead of specifying p_θ(z) and p_θ(x|z) separately as in the generator network or multi-layer generator network. Specifically, p_θ(x, z) = (1/Z(θ)) exp(−U_θ(x, z)), where U_θ(x, z) is the joint energy function. In general, both the generator model and its multi-layer version can be written in this form with a closed-form U_θ, thus they are special cases of latent EBM. However, unlike f_θ(x) of the EBM in (3), U_θ(x, z) in the latent EBM typically only involves singleton (unary) and pairwise (binary) terms for biological plausibility.
(5) Flow-based model. The flow-based model [6, 7, 23, 10, 3, 40] is similar to the generator network, except that z and x are of the same dimensionality, and x = g_θ(z) with an invertible g_θ. As a result, p_θ(x) can be obtained in closed form by the change of variable, similar to the EBM. But unlike the EBM, the density of the flow-based model is normalized, or in other words, the normalizing constant is in closed form. Computational feasibility requires that g_θ composes a sequence of transformations, thus a flow, whose inverse and Jacobian can be efficiently computed. Here we present the flow-based model as a generative model. It can also be used as an inference model to approximate the posterior [34, 24].
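For intuition about the change-of-variable density underlying flow-based models, here is a minimal numerical check on a hypothetical one-dimensional flow x = exp(z) with z ∼ N(0, 1) (an illustration, not a model used in the paper): the flow density is the prior density divided by the Jacobian, and it integrates to one.

```python
import numpy as np

# One-dimensional flow x = g(z) = exp(z), z ~ N(0, 1); g is invertible with
# z = log(x), so p(x) = p(z) |dg/dz|^{-1} evaluated at z = log(x).
def p_x(x):
    z = np.log(x)
    log_prior = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)  # log N(z; 0, 1)
    log_det = z                       # log |dg/dz| = log exp(z) = z
    return np.exp(log_prior - log_det)

# The resulting density (a log-normal) is properly normalized:
xs = np.linspace(1e-6, 80.0, 1_000_000)
mass = np.sum(p_x(xs)) * (xs[1] - xs[0])   # Riemann sum over (0, 80]
```

The same bookkeeping, with a matrix Jacobian in place of the scalar derivative, appears again in Section 4.3 when the short run dynamics is viewed as a flow.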
To connect the above models to our work, we focus on (1) and (2). However, our method is applicable to (3) and (4) as well. For (3), our optimization method can be used to optimize the short run MCMC that samples p_θ(x). Our short run inference dynamics is also related to the flow-based inference model in (5).
3.3 Learning and inference
Let p_data(x) be the data distribution that generates the example x. The learning of the parameters θ of p_θ(x) can be based on min_θ KL(p_data ‖ p_θ), where KL(p ‖ q) is the Kullback-Leibler divergence from p to q (the direction matters since KL is asymmetric). If we observe training examples {x_i, i = 1, ..., n}, the above minimization can be approximated by maximizing the log-likelihood

L(θ) = (1/n) ∑_{i=1}^n log p_θ(x_i),  (4)

which leads to the maximum likelihood estimate (MLE).
The gradient of the log-likelihood, ∇_θ log p_θ(x), can be computed according to the following identity:

∇_θ log p_θ(x) = (1/p_θ(x)) ∇_θ p_θ(x)  (5)
= (1/p_θ(x)) ∫ ∇_θ p_θ(x, z) dz  (6)
= E_{p_θ(z|x)}[∇_θ log p_θ(x, z)].  (7)

The above expectation can be approximated by Monte Carlo samples from p_θ(z|x).
For the generator network or its multi-layer version, log p_θ(x, z) is in closed form. For the latent EBM, ∇_θ log p_θ(x, z) = −∇_θ U_θ(x, z) + E_{p_θ(x, z)}[∇_θ U_θ(x, z)], where the expectation can be approximated by Monte Carlo samples from p_θ(x, z). In this work, we shall focus on the generator network and its multi-layer version. We shall leave the latent EBM to future work.
4 Short run inference dynamics
4.1 Langevin dynamics
Sampling from p_θ(z|x) usually requires MCMC. One convenient MCMC is Langevin dynamics [27], which iterates

z_{t+1} = z_t + s ∇_z log p_θ(z_t | x) + √(2s) e_t,  (8)

where e_t ∼ N(0, I_d), t indexes the time step of the Langevin dynamics, and s is the step size. The Langevin dynamics consists of a gradient descent term on −log p_θ(z|x). In the case of the generator network, this amounts to gradient descent on (1/(2σ²))‖x − g_θ(z)‖² + (1/2)‖z‖², which is a penalized reconstruction error. The Langevin dynamics also consists of a white noise diffusion term √(2s) e_t to create randomness for sampling from p_θ(z|x). For a small step size s, the marginal distribution of z_t will converge to p_θ(z|x) as t → ∞ regardless of the initial distribution of z_0. More specifically, let p_t(z) be the marginal distribution of z_t under the Langevin dynamics; then KL(p_t(z) ‖ p_θ(z|x)) → 0 monotonically, that is, by increasing t, we reduce KL(p_t(z) ‖ p_θ(z|x)). Thus MCMC is consistent with variational approximation: both seek to minimize KL(q ‖ p_θ(z|x)) over q within a certain class.
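The Langevin update (8) can be sketched on a toy linear generator x = Az + ε, for which the posterior p_θ(z|x) is Gaussian in closed form; the matrix A, the dimensions, and the step size below are hypothetical choices for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 2, 4, 0.5
A = rng.normal(size=(D, d))            # stand-in for a deep generator g_theta
x = A @ rng.normal(size=d) + sigma * rng.normal(size=D)

def grad_log_posterior(Z, x):
    # For x = A z + eps: grad_z log p(z|x) = A^T (x - A z) / sigma^2 - z
    return (x - Z @ A.T) @ A / sigma**2 - Z

# Langevin dynamics, Eq. (8): gradient term plus white-noise diffusion,
# run as many parallel chains initialized from the prior N(0, I).
n_chains, K, s = 1000, 2000, 0.01
Z = rng.normal(size=(n_chains, d))
for _ in range(K):
    Z = Z + s * grad_log_posterior(Z, x) + np.sqrt(2 * s) * rng.normal(size=Z.shape)

# Closed-form posterior for comparison: Sigma = (I + A^T A / sigma^2)^{-1}
Sigma = np.linalg.inv(np.eye(d) + A.T @ A / sigma**2)
mu = Sigma @ A.T @ x / sigma**2
```

With a small step size and enough steps, the empirical mean of the chains matches the closed-form posterior mean; the point of the next subsection is that such long chains are exactly what one cannot afford in practice.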
4.2 Noise initialized short run dynamics
It is impractical to run long chains to sample from p_θ(z|x). We thus propose the following short run inference dynamics.
z_0 ∼ p(z),  (9)
z_{t+1} = z_t + s ∇_z log p_θ(z_t | x), t = 0, ..., K−1,  (10)

where p(z) is the prior distribution of z. In the case of the generator network, p(z) is N(0, I_d). Starting from z_0, we run K (e.g., K = 20) steps of gradient descent on −log p_θ(z|x) with step size s, and we take z_K to be an approximate sample from p_θ(z|x).
Compared to the Langevin dynamics, we remove the white noise term √(2s) e_t. In practice, the noise term has minimal usefulness for the following two reasons. (1) In the beginning stage, when the reconstruction error ‖x − g_θ(z)‖² is big relative to σ², the gradient of log p_θ(z|x) dominates the noise term, which is negligible. (2) When the Langevin dynamics gets to a local mode, the noise term can hardly make the chain jump out of the local mode if the mode is deep. That is, running a long chain (e.g., K = 10,000 steps) is not only impractical; even if we could afford it, it still could not guarantee mode traversal and convergence to the target distribution. Thus we opt to run a short chain without the noise term, starting from the noise prior distribution p(z), so that the randomness or possible multi-modality comes from the initial z_0 instead of the almost useless noise term, and we optimize the step size s to let the distribution of z_K approximate the true posterior p_θ(z|x).
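A minimal sketch of the noise-free short run dynamics (9)-(10), again on a toy linear generator; all sizes and step sizes are illustrative assumptions, not tuned or variationally optimized values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 4, 10, 0.5
A = rng.normal(size=(D, d))                     # stand-in for the generator g_theta
x = A @ rng.normal(size=d) + sigma * rng.normal(size=D)

def short_run_infer(x, K=20, s=0.005):
    """Eqs. (9)-(10): z_0 ~ p(z), then K noise-free gradient ascent steps
    on log p_theta(z|x); z_K is taken as the approximate posterior sample."""
    z = rng.normal(size=d)                      # z_0 from the prior N(0, I_d)
    for _ in range(K):
        z = z + s * (A.T @ (x - A @ z) / sigma**2 - z)
    return z

z_hat = short_run_infer(x)
recon_err = np.linalg.norm(x - A @ z_hat)       # reconstruction error of z_K
```

Note that the randomness here comes only from the draw of z_0; calling `short_run_infer` repeatedly on the same x yields different approximate posterior samples.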
We can write the above dynamics as

z_K = z_0 + s ∑_{t=0}^{K−1} R(z_t),  (11)

where R(z) = ∇_z log p_θ(z|x), and where we omit x and θ in R(z) for simplicity of notation. For finite K, this dynamics is a K-layer residual network, or a K-step RNN. In a recent paper, [3] study the residual network as a flow-based model.
4.3 Flowlike inference model
To further simplify the notation, we may write the dynamics as

z_K = F(z_0),  (12)

where F composes the K steps of gradient descent. The above model can be considered a flow-based model, where F consists of a K-step flow of gradient updates. Let the distribution of z_K be q_s(z), where we include the notation s to make it explicit that the distribution of z_K depends on the step size s. Recall that the distribution of z_K also depends on θ and x, so that in full notation, we may write q_s(z) as q_{s,θ}(z|x).
By the change of variable,

q_s(z_K) = p(z_0) |det(∂F(z_0)/∂z_0)|^{−1}.  (13)

We may treat q_s(z) as a flow-like variational inference model.
4.4 Variational optimization of step size
We want to optimize the step size s so that q_s(z) best approximates the true posterior p_θ(z|x). This can be accomplished by

s* = argmin_s KL(q_s(z) ‖ p_θ(z|x)).  (14)

This is similar to variational approximation, with the step size s being the variational parameter.
KL(q_s(z) ‖ p_θ(z|x)) = E_{q_s}[log q_s(z)] − E_{q_s}[log p_θ(x, z)] + log p_θ(x),  (15)

where the last term is independent of s, and for the first two terms on the right hand side,

E_{q_s}[log p_θ(x, z)] = E_{p(z_0)}[log p_θ(x, F(z_0))],  (16)

and

E_{q_s}[log q_s(z)] = E_{p(z_0)}[log q_s(F(z_0))]  (17)
= E_{p(z_0)}[log p(z_0)] − E_{p(z_0)}[log |det(∂F(z_0)/∂z_0)|].  (18)
In the above equations, the expectations over p(z_0) can be approximated by random samples from p(z_0), e.g., N(0, I_d) in the case of the generator network.
In the above computations, we do not need to invert F, as is typical in flow-based variational approximation, but we need to compute the log-determinant of the Jacobian J = ∂F(z_0)/∂z_0. Fortunately, on modern deep learning platforms, such computation is easily feasible even if the dimension of z is very high. For a sampled z_0, after computing the matrix J, we can compute the eigenvalues of J^T J, so that log |det(J)| is half of the sum of the logs of these eigenvalues. Then we can optimize the step size s by minimizing KL(q_s(z) ‖ p_θ(z|x)) via a grid search or gradient descent. While we could optimize the step size s for each example x, in our work, we optimize an overall s that is shared by all the examples. Reverting to the full notation q_{s,θ}(z|x) for q_s(z), this means we minimize

(1/n) ∑_{i=1}^n KL(q_{s,θ}(z|x_i) ‖ p_θ(z|x_i))  (19)

over s.
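The log-determinant computation can be sketched numerically: on a toy linear generator the short run map F of Eq. (12) is affine, so the numerical Jacobian and the eigenvalues of J^T J can be checked against a closed-form answer. The model, dimensions, and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, D, sigma, s, K = 3, 6, 0.5, 0.01, 20
A = rng.normal(size=(D, d))             # stand-in for a deep generator
x = A @ rng.normal(size=d) + sigma * rng.normal(size=D)

def F(z0):
    """z_K = F(z_0): K noise-free gradient steps on log p(z|x), Eq. (12)."""
    z = z0.copy()
    for _ in range(K):
        z = z + s * (A.T @ (x - A @ z) / sigma**2 - z)
    return z

def log_abs_det_jacobian(F, z0, eps=1e-5):
    """Numerical Jacobian J = dF(z0)/dz0 (one column per coordinate), then
    log|det J| = 0.5 * sum(log eigvals(J^T J)), as in Section 4.4."""
    n = z0.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        J[:, i] = (F(z0 + e) - F(z0 - e)) / (2 * eps)  # central difference
    evals = np.linalg.eigvalsh(J.T @ J)  # symmetric PSD: real, nonnegative
    return 0.5 * np.sum(np.log(evals))

ld = log_abs_det_jacobian(F, np.zeros(d))
```

Here, because F is affine, J = (I − s H)^K with H = AᵀA/σ² + I, so the numerical value can be compared against K ∑ log|1 − s h_i| over the eigenvalues h_i of H; in the deep case one would use automatic differentiation instead of finite differences.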
Instead of using a constant step size s, the gradient descent may also be generalized to varying step sizes (s_t, t = 0, ..., K−1), which can be optimized jointly. We leave this to future work.
4.5 Learning with short run inference dynamics
Each learning iteration consists of the following two steps. (1) Update s by minimizing (19). (2) Update θ by

θ_{t+1} = θ_t + η_t (1/n) ∑_{i=1}^n E_{q_{s,θ_t}(z|x_i)}[∇_θ log p_{θ_t}(x_i, z)],  (20)

where η_t is the step size or learning rate, and the expectation over q_{s,θ_t}(z|x_i) (here we use the full notation q_{s,θ}(z|x) instead of the abbreviated notation q_s(z)) can be approximated by sampling from q_{s,θ_t}(z|x_i) using the noise-initialized K-step gradient descent.
Given s, the above iteration seeks to maximize

Q(θ) = (1/n) ∑_{i=1}^n E_{q_{s,θ_t}(z|x_i)}[log p_θ(x_i, z)]  (21)

by a single gradient ascent step in θ. In fact, as in an approximate Monte Carlo EM algorithm, we may maximize Q(θ) with multiple gradient ascent steps. Q(θ) is an approximation to the complete-data log-likelihood in the EM algorithm [5, 36].
Compared to the log-likelihood function L(θ) in (4), we have

L(θ) = (1/n) ∑_{i=1}^n log p_θ(x_i)  (22)
= Q(θ) − (1/n) ∑_{i=1}^n E_{q_{s,θ_t}(z|x_i)}[log q_{s,θ_t}(z|x_i)] + (1/n) ∑_{i=1}^n KL(q_{s,θ_t}(z|x_i) ‖ p_θ(z|x_i))  (23)
≥ Q(θ) + H,  (24)

where H = −(1/n) ∑_{i=1}^n E_{q_{s,θ_t}(z|x_i)}[log q_{s,θ_t}(z|x_i)] is the average entropy of the approximate posteriors. Since this entropy term has nothing to do with θ, maximizing Q(θ) is equivalent to maximizing Q(θ) + H, which is a lower bound of L(θ).
Thus each learning iteration can be interpreted as a joint maximization of the lower bound Q(θ) + H over θ and s. Specifically, step (1) maximizes it over s given θ, and step (2) seeks to maximize it over θ given s. This is similar to variational inference, with s being the variational parameter. The learning procedure is summarized in Algorithm 1.
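The overall learning loop can be sketched on a toy linear generator, alternating noise-initialized short run inference with a gradient ascent update of the generator weights as in (20); all dimensions, learning rates, and iteration counts are illustrative assumptions, and the step size is kept fixed here rather than variationally optimized.

```python
import numpy as np

rng = np.random.default_rng(4)
d, D, sigma = 2, 5, 0.5
A_true = rng.normal(size=(D, d))                 # hypothetical true generator
X = rng.normal(size=(20, d)) @ A_true.T + sigma * rng.normal(size=(20, D))

theta = 0.01 * rng.normal(size=(D, d))           # generator g_theta(z) = theta z
s_inf, K, lr = 0.01, 20, 0.002

def infer(x, theta):
    z = rng.normal(size=d)                       # noise-initialized, Eq. (9)
    for _ in range(K):                           # K-step gradient flow, Eq. (10)
        z = z + s_inf * (theta.T @ (x - theta @ z) / sigma**2 - z)
    return z

losses = []
for it in range(300):
    Z = np.stack([infer(x, theta) for x in X])   # step (1'): short run inference
    resid = X - Z @ theta.T
    # step (2): gradient ascent on Q(theta), i.e. Eq. (20) with
    # grad_theta log p(x, z) = (x - theta z) z^T / sigma^2
    theta = theta + lr * resid.T @ Z / (sigma**2 * len(X))
    losses.append(np.mean(resid**2))             # monitor reconstruction error
```

In the paper's algorithm the step size s would also be re-optimized each iteration by minimizing (19); here it is held fixed to keep the sketch short.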
5 General theoretical formulation for short run inference and synthesis
To put our work in the big picture of generative modeling and computing, let us consider a latent EBM p_θ(x, z) = (1/Z(θ)) exp(−U_θ(x, z)), which includes the generator model and its multi-layer version as well as the regular EBM as special cases. The maximum likelihood learning is to minimize KL(p_data ‖ p_θ), where p_θ(x) = ∫ p_θ(x, z) dz, and p_data is the data distribution that generates the data x. The expectation with respect to p_data can be approximated by the average over the observed examples {x_i}.
The learning gradient is based on

∇_θ L(θ) = E_{p_data}[∇_θ log p_θ(x)]  (25)
= E_{p_data} E_{p_θ(z|x)}[∇_θ log p_θ(x, z)]  (26)
= E_{p_data} E_{p_θ(z|x)}[−∇_θ U_θ(x, z)] − ∇_θ log Z(θ)  (27)
= E_{p_data} E_{p_θ(z|x)}[−∇_θ U_θ(x, z)] − E_{p_θ(x, z)}[−∇_θ U_θ(x, z)].  (28)

E_{p_θ(z|x)} requires the inference sampling from p_θ(z|x) in the so-called positive or wake phase (or clamped sampling) in the terminology of the Boltzmann machine [1] and the Helmholtz machine [4], while E_{p_θ(x, z)} requires the synthesis sampling from p_θ(x, z) in the so-called negative or sleep phase (or unclamped sampling) [17]. Both can be replaced by short run dynamics. Let q_{s_I,θ}(z|x) be the short run inference dynamics with tuning parameters s_I, and let q_{s_S,θ}(x, z) be the short run synthesis dynamics with tuning parameters s_S; then at learning iteration t with current θ_t, the learning gradient for θ seeks to minimize the following generalized version of the contrastive divergence:
D(θ) = KL(p_data(x) q_{s_I,θ_t}(z|x) ‖ p_θ(x, z))  (29)
− KL(q_{s_S,θ_t}(x, z) ‖ p_θ(x, z)).  (30)
Compared to maximum likelihood,

D(θ) = KL(p_data ‖ p_θ) + Δ_I(θ) − Δ_S(θ),  (31)

where

Δ_I(θ) = E_{p_data}[KL(q_{s_I,θ_t}(z|x) ‖ p_θ(z|x))],  (32)
Δ_S(θ) = KL(q_{s_S,θ_t}(x, z) ‖ p_θ(x, z)),  (33)

so that D(θ) is an approximation to KL(p_data ‖ p_θ). The learning gradient is

−∇_θ D(θ) = E_{p_data(x)} E_{q_{s_I,θ_t}(z|x)}[−∇_θ U_θ(x, z)]  (34)
− E_{q_{s_S,θ_t}(x, z)}[−∇_θ U_θ(x, z)].  (35)

Comparing (35) with (28), we replace p_θ(z|x) by the short run inference q_{s_I,θ_t}(z|x), and we replace p_θ(x, z) by the short run synthesis q_{s_S,θ_t}(x, z). In fact, (28) can be derived based on (35).
As to s_I and s_S, we seek to make the two perturbation terms small, which amounts to

min_{s_I} E_{p_data}[KL(q_{s_I,θ}(z|x) ‖ p_θ(z|x))],  (36)
min_{s_S} KL(q_{s_S,θ}(x, z) ‖ p_θ(x, z)).  (37)

Both are variational optimizations with respect to the tuning parameters of the short run inference and the short run synthesis, respectively. Overall, the learning iteration seeks to solve min_θ D(θ) with s_I and s_S optimized as above.
Compared to the original contrastive divergence [16], we initialize the short run synthesis from a fixed noise distribution, instead of the observed examples.
For the generator model or a purely top-down model, the second KL-divergence in the above generalized contrastive divergence disappears, since synthesis sampling can be accomplished by direct ancestral sampling. This is the case treated in this paper. For the regular EBM, z disappears, and with it the short run inference, so that we only have a short run synthesis q_{s_S,θ}(x); the learning is then based on D(θ) = KL(p_data ‖ p_θ) − KL(q_{s_S,θ_t} ‖ p_θ), with s_S optimized by minimizing KL(q_{s_S,θ}(x) ‖ p_θ(x)). See the recent work [32] for short run synthesis for the regular EBM. See also the recent work [13] for a divergence triangle formulation that is related to the above generalized contrastive divergence formulation.
6 Experiments
In this section, we demonstrate (1) realistic synthesis, (2) faithful reconstruction of observed examples, (3) inpainting of occluded images, (4) variational grid search and gradient descent on the step size, and (5) ablations on the number of latent layers and the number of inference steps. We emphasize the simplicity of the short run inference algorithm.
All the training image datasets are resized and scaled, with no further preprocessing. We train the models with parameter updates optimized by Adam [22]. The learning rate decays stepwise over the iterations. If not stated otherwise, we use a fixed number of short run inference steps, and σ is gradually annealed over the course of training.
The baselines are trained as ladder variational autoencoders [38] to improve the performance of the multi-layer latent variable models. We refer to the supplementary material for detailed model specifications.

6.1 Synthesis
We evaluate the fidelity of generated examples quantitatively on various datasets. Figure 3 depicts generated samples for various datasets obtained with short run inference. Figure 1 depicts generated samples on the CelebA dataset (see generated samples from a VAE with the same generator architecture in Figure 2). Table 1 compares the Fréchet Inception Distance (FID) [15] with the Inception v3 classifier [39] on generated examples. Despite its simplicity, short run inference dynamics is competitive with elaborate means of inference in VAE models.

Table 1: MSE and FID on MNIST, SVHN, and CelebA.

             MNIST          SVHN           CelebA
Models       MSE    FID     MSE    FID     MSE    FID
VAE, L=1     0.020  –       0.019  46.78   0.031  69.90
VAE, L=3     0.018  –       0.015  41.72   0.029  58.33
VAE, L=5     0.018  –       0.014  39.26   0.028  53.40
Ours, L=1    0.019  –       0.018  44.86   0.019  45.80
Ours, L=3    0.017  –       0.015  39.02   0.018  41.20
Ours, L=5    0.015  –       0.011  35.23   0.011  37.29
6.2 Reconstruction
We evaluate the accuracy of the learned inference dynamics by reconstructing test images. In contrast to traditional MCMC posterior sampling with persistent chains, short run inference with small K not only allows for efficient learning on training examples, but the same dynamics can also be recruited for testing examples.
6.3 Inpainting
Our method can “inpaint” occluded image regions. To recover the occluded pixels, the only required modification of (10) involves the computation of ∇_z log p_θ(z|x). For a fully observed image, the reconstruction term ‖x − g_θ(z)‖² is computed by summation over all pixels. For partially observed images, we only compute the summation over the observed pixels. Figure 5 depicts test images taken from the CelebA dataset for which a mask randomly occludes pixels in various patterns.
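The masked-gradient modification can be sketched on a toy linear generator (the mask pattern, sizes, and constants are illustrative assumptions): the reconstruction gradient is summed over observed pixels only, and the inferred z then fills in the occluded region.

```python
import numpy as np

rng = np.random.default_rng(3)
d, D, sigma = 4, 12, 0.5
A = rng.normal(size=(D, d))               # stand-in for the generator g_theta
x = A @ rng.normal(size=d) + sigma * rng.normal(size=D)

mask = np.ones(D, dtype=bool)
mask[:4] = False                          # occlude the first 4 "pixels"

def infer_masked(x, mask, K=100, s=0.005):
    """Short run inference (10) with the reconstruction term restricted
    to observed pixels: occluded pixels contribute no gradient."""
    z = rng.normal(size=d)                # z_0 from the prior
    for _ in range(K):
        resid = (x - A @ z) * mask        # zero out occluded entries
        z = z + s * (A.T @ resid / sigma**2 - z)
    return z

z_hat = infer_masked(x, mask)
x_inpaint = A @ z_hat                     # occluded region filled from inferred z
```

The occluded pixels never enter the objective; they are recovered purely through the generator, which ties all pixels to the shared latent code.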
6.4 Variational optimization of step size
The step size s in (10) may be optimized such that q_s(z) best approximates the true posterior p_θ(z|x). That is, we can optimize the step size by minimizing KL(q_s(z) ‖ p_θ(z|x)) via a grid search or gradient descent. As outlined in Section 4.4, we require the log-determinant of the Jacobian J = ∂F(z_0)/∂z_0. In reverse-mode auto-differentiation, we construct the Jacobian one row at a time by evaluating vector-Jacobian products. Then, we evaluate the eigenvalues of J^T J. As both steps are computed in a differentiable manner, we may compute the gradient with respect to the step size s.
Figure 6 depicts the optimal step size s over learning iterations as determined by grid search and by gradient descent on (19). For both grid search and gradient descent, the step size settles after a few learning iterations. Figure 7 details the optimization objective with respect to individual step sizes s_t.
6.5 Influence of number of layers and steps
Finally, Tables 2 and 3 show the influence of the number of latent layers L in the generator network and the number of steps K in the short run inference dynamics (10), respectively. Increasing the number of layers improves the quality of synthesis and reconstruction. Increasing the number of inference steps up to K = 20 results in significant improvements, while increasing K further appears to affect the scores only marginally.
7 Conclusion
This paper proposes to use short run inference dynamics to infer latent variables in deep generative models. The inference dynamics is always initialized from the prior noise distribution, followed by a small number (e.g., 20) of updates. It can be viewed as a residual network or an RNN. Similar to variational inference, the tuning parameters, such as the step size of the short run inference dynamics, are optimized by a variational criterion. Unlike variational inference, there is no need to design an extra inference model, which is usually a challenging task with multiple layers of latent variables. In fact, the short run dynamics is guided by the log-posterior distribution to regulate the explaining-away competitions and top-down feedback between the latent variables. The short run inference dynamics is easily affordable on current computing platforms and can easily be scaled up to big training data. It will enable researchers to design more sophisticated latent variable models without worrying about designing inference models.
The following are the directions that we shall explore in our future work.
(1) In this paper, we focus on the gradient descent update, and we optimize the step size. We shall explore more general updates and other tuning parameters. A simple extension is to consider varying step sizes, i.e., different step sizes at different iterations of the short run dynamics. Another extension is gradient descent with momentum, where the tuning parameters can be optimized. We may also consider second order optimization algorithms.
(2) In this paper, we focus on continuous latent variables or latent vectors. Continuous vector representations are commonly used in deep learning models. Discrete variables can be obtained from continuous variables. In our future work, we shall also explore short run inference for discrete latent variables explicitly.
(3) In this paper, we focus on topdown generative models. Our method can be applied to graphical models in general, including the latent energybased models. We shall also study these models using short run inference as well as short run synthesis in our future work.
Acknowledgments
The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF-18-1-0296; ONR MURI project N00014-16-1-2007; and XSEDE grant ASC170063. We thank NVIDIA for hardware donations.
References
 [1] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
 [2] Daniel J. Amit. Modeling brain function: the world of attractor neural networks, 1st Edition. Cambridge Univ. Press, 1989.
 [3] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, pages 573–582, 2019.
 [4] Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.
 [5] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
 [6] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Workshop Track Proceedings, 2015.
 [7] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017.
 [8] Chelsea Finn, Paul F. Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. CoRR, abs/1611.03852, 2016.
 [9] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative ConvNets via multi-grid modeling and sampling. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pages 9155–9164, 2018.
 [10] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, 2019.
 [11] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 1462–1471, 2015.
 [12] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pages 1976–1984, 2017.
 [13] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pages 8670–8679, 2019.
[14] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415, 2016.
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 6626–6637, 2017.
[16] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[17] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[18] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[19] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Sciences, volume 79, pages 2554–2558. National Acad Sciences, 1982.
[20] Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 823–833, 2017.
[21] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. CoRR, abs/1606.03439, 2016.
[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015.
[23] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 10236–10245, 2018.
 [24] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016.
[25] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, 2014.
[26] Rithesh Kumar, Anirudh Goyal, Aaron C. Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. CoRR, abs/1901.08508, 2019.
 [27] Paul Langevin. On the theory of Brownian motion. 1908.
[28] Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, pages 2793–2802, 2017.
[29] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[30] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
[31] Jiquan Ngiam, Zhenghao Chen, Pang Wei Koh, and Andrew Y. Ng. Learning deep energy models. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July 2, 2011, pages 1105–1112, 2011.
[32] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, Canada, 2019.
 [33] Bruno Poucet and Etienne Save. Attractors in memory. Science, 308(5723):799–800, 2005.
[34] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 1530–1538, 2015.
[35] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 1278–1286, 2014.
[36] Donald B. Rubin and Dorothy T. Thayer. EM algorithms for ML factor analysis. Psychometrika, 47(1):69–76, Mar 1982.
[37] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13–15, 2010, pages 693–700, 2010.
[38] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 3738–3746, 2016.
[39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pages 2818–2826, 2016.
 [40] Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, and Ben Poole. Discrete flows: Invertible generative models of discrete data. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, 2019.
[41] Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. Learning dynamic generator model by alternating back-propagation through time. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 – February 1, 2019, pages 5498–5507, 2019.
[42] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, pages 2635–2644, 2016.
[43] Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017.
8 Appendix
8.1 Model Specification
For the multi-layer generator model, we have z = (z_l, l = 1, …, L), for which layer l = L is the top layer, and layer l = 1 is the bottom layer close to x. For simplicity, let z_0 = x. Then p_θ(x, z) = p(z_L) ∏_{l=0}^{L−1} p_θ(z_l | z_{l+1}). In our case, we have p(z_L) = N(0, I) and p_θ(z_l | z_{l+1}) = N(μ_l(z_{l+1}), Σ_l(z_{l+1})), where μ_l and Σ_l are the mean vector and the diagonal variance-covariance matrix of z_l respectively, and they are functions of a deterministic feature h_l = g_l(z_{l+1}), where g_l are deterministic layers and μ_l, Σ_l are projection layers to preserve dimensionality. Each g_l is defined as two subsequent layers with GELU [14] activation functions and a skip connection; the skip connection is a linear layer with a subsequent activation. μ_l and Σ_l are a pair of linear projection layers to project to the dimensionality of z_l. Then z_l = μ_l(z_{l+1}) + Σ_l(z_{l+1})^{1/2} ε_l, where ε_l ∼ N(0, I). The final deterministic block is a layer projecting to the desired dimensionality of x. The range of x is bounded by a tanh nonlinearity.
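As a concrete illustration, one top-down sampling step z_l = μ_l(z_{l+1}) + Σ_l(z_{l+1})^{1/2} ε_l can be sketched as follows. This is a minimal NumPy stand-in, not the paper's exact architecture: the layer widths, the random weight initialization, and the log-variance parameterization of Σ_l are illustrative assumptions, and the skip connection is taken over the equal-dimensional deterministic block.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # Tanh approximation of the GELU activation [14].
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear(x, w, b):
    return x @ w + b

def make_params(d_in, d_out):
    # Small random weights for illustration only (no training here).
    return (rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))

def generator_layer(z_upper, params):
    """Sample z_l given z_{l+1}: deterministic block g_l, then mean/variance heads."""
    (w1, b1), (w2, b2), (wm, bm), (ws, bs) = params
    # g_l: two subsequent layers with GELU activations and a skip connection,
    # preserving the dimensionality of z_{l+1}.
    h = gelu(linear(z_upper, w1, b1))
    h = gelu(linear(h, w2, b2)) + z_upper
    # Projection heads: mean vector and diagonal log-variance of p(z_l | z_{l+1}).
    mu = linear(h, wm, bm)
    log_var = linear(h, ws, bs)
    # Reparameterized sample: z_l = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Draw the top layer z_L ~ N(0, I) and sample one layer below it.
d_upper, d_lower = 16, 64  # hypothetical dimensions for this sketch
params = [make_params(d_upper, d_upper), make_params(d_upper, d_upper),
          make_params(d_upper, d_lower), make_params(d_upper, d_lower)]
z_top = rng.standard_normal((1, d_upper))
z_next = generator_layer(z_top, params)
print(z_next.shape)  # (1, 64)
```

Ancestral sampling of the full model repeats this step from l = L − 1 down to the output, with the final deterministic block projecting to the dimensionality of x.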
Table 4 illustrates a specification of the latent layers z_l, their respective latent dimensions, and the number of channels per layer.
operation | dimensions | l
…         | …          | 3
…         | …          | 2
…         | …          | 2
…         | …          | 2
…         | …          | 1
…         | …          | 1
…         | …          | 1
…         | …          | 0
…         | …          | 0
…         | …          | 0