Learning Deep Generative Models with Short Run Inference Dynamics

12/04/2019 · Erik Nijkamp, et al.

This paper studies the fundamental problem of learning deep generative models that consist of one or more layers of latent variables organized in top-down architectures. Learning such a generative model requires inferring the latent variables for each training example based on the posterior distribution of these latent variables. The inference typically requires Markov chain Monte Carlo (MCMC), which can be time consuming. In this paper, we propose to use short run inference dynamics guided by the log-posterior, such as a finite-step gradient descent algorithm initialized from the prior distribution of the latent variables, as an approximate sampler of the posterior distribution. The step size of the gradient descent dynamics is optimized by minimizing the Kullback-Leibler divergence between the distribution produced by the short run inference dynamics and the posterior distribution. Our experiments show that the proposed method outperforms the variational auto-encoder (VAE) in terms of reconstruction error and synthesis quality. The advantage of the proposed method is that it is natural and automatic, even for models with multiple layers of latent variables.


1 Introduction

Deep generative models have seen many applications such as image and video synthesis, and unsupervised or semi-supervised learning. Such models usually consist of one or more layers of latent variables organized in top-down architectures. Learning such latent variable models from training examples is a fundamental problem, and this is the problem that this paper is concerned with.

Learning latent variable models usually requires inferring the latent variables based on their joint posterior distribution, i.e., the conditional distribution of the latent variables given each observed example. The inference typically requires Markov chain Monte Carlo (MCMC) such as Langevin dynamics [27] or Hamiltonian Monte Carlo (HMC) [30]. Such MCMC posterior sampling can be time consuming and can be difficult to scale up to big training data. The convergence of MCMC sampling is also questionable even after a long running time.

An alternative to MCMC posterior sampling is variational inference, which employs a simple factorized distribution to approximate the posterior distribution. Classical variational inference methods assume an approximate posterior distribution for each observed example, where the approximate posterior distribution is parametrized by some variational parameters that are specific to each example and that can be optimized by minimizing the Kullback-Leibler divergence between the approximate distribution and the true posterior.

More recently, in the context of the generator network, which consists of a single layer of latent vector at the top, the variational auto-encoder (VAE) [25, 35] learns an extra inference network that maps each input example to the approximate posterior distribution, in particular, the mean vector and the diagonal variance-covariance matrix of the approximate multivariate Gaussian posterior. Unlike classical variational methods, the parameters of the inference network are shared by all the training and testing examples, and they are learned together with the parameters of the generator network by jointly maximizing a lower bound of the log-likelihood.

Despite the success of VAE, it has the following shortcomings. (1) It requires a separate inference model with a separate set of parameters. These parameters are to be learned together with the parameters of the generative model. (2) The design of the inference model is not automatic, especially for generative models with multiple layers of latent vectors, which may have complex relationships under their joint posterior distribution. It is a highly non-trivial task to design an inference model to adequately capture the explaining-away competitions and bottom-up and top-down interactions between multiple layers of latent variables.

In this article, we propose a simple method that lies in between MCMC posterior sampling and variational inference. Specifically, we propose to use a short run inference dynamics guided by the posterior distribution of the latent variables as an approximate inference engine. To make it concrete, we employ a finite-step gradient descent on the negative log-posterior of the latent variables. For each training example, we initialize such a short run gradient flow from the prior distribution, such as a Gaussian or uniform noise distribution, and run a finite number (e.g., 20) of gradient descent updates. This amounts to a residual network or a recurrent neural network (RNN) that transforms the initial noise distribution into an approximate posterior distribution. We optimize the step size of the gradient flow by minimizing the Kullback-Leibler divergence between the approximate distribution produced by this short run inference dynamics and the true posterior distribution. Thanks to the computing capacities of modern deep learning platforms, it is possible to compute the approximate distribution and its entropy, and to optimize the aforementioned Kullback-Leibler divergence. This is similar to variational inference, except that the variational parameter is the step size of the gradient descent, or in general, the tuning parameters of the short run inference dynamics. Our experiments show that the proposed method outperforms the VAE in terms of reconstruction error and synthesis quality.

The proposed method is similar to MCMC posterior sampling, except that we focus on the optimization of a short run version of MCMC sampling. Such short run inference dynamics is easily affordable on current deep learning platforms, and it is not difficult to scale it up to big datasets. The proposed method is also similar to variational inference, except that the inference model is simply a noise-initialized finite-step gradient flow, where the only extra parameters are the step size of the gradient flow or, in general, some tuning parameters of the short run inference dynamics.

One major advantage of the proposed method is that it is natural and automatic. For models with multiple layers of latent variables that may be organized in complex top-down architectures, the gradient descent update of the log-posterior of the latent variables can be automatically obtained on modern deep learning platforms. Such gradient descent update naturally integrates explaining-away competitions and bottom-up and top-down interactions between multiple layers of latent variables. By optimizing the step size of the gradient descent update using the variational criterion, it is possible to have a good approximate posterior distribution and an efficient inference engine. It thus enables researchers to explore sophisticated generative models without worrying about constructing the presumably more sophisticated inference models.

2 Contributions and related work

The following are contributions of our paper.

  • We propose short run inference dynamics for inferring latent variables in deep generative models.

  • We provide a method to determine the optimal step size or in general tuning parameters of the short run inference dynamics.

  • We demonstrate learning of multi-layer latent variable models with high quality samples and reconstructions.

The following are themes related to our work.

(1) Variational inference. As mentioned above, VAE [25, 35, 38, 11] is the prominent method for learning the generator network. The original inference model has been generalized to flow-based inference models such as normalizing flows. Our short run gradient flow can also be viewed as a flow model. The difference is that the gradient flow is obtained automatically from the log-posterior of the latent variables. There is no need to design an extra inference model or flow-based model for inference, and there is no need to learn an extra set of parameters except some tuning parameters of the gradient descent algorithm, such as the step size. More importantly, the gradient descent update is readily available for models with multiple layers of latent vectors, whereas designing inference models for such generative models can be a highly non-trivial task.

(2) Alternating back-propagation. [12, 41] propose to learn the generator network by maximum likelihood, where the learning algorithm iterates the following two steps: (a) inferring the latent variables by Langevin dynamics that samples from the posterior distribution of the latent variables; (b) updating the model parameters based on the inferred latent variables. Both steps involve gradient computations based on back-propagation, thus the name “alternating back-propagation”. In the training stage, in step (a), the Langevin dynamics is initialized from the values of the latent variables produced in the previous epoch. This is usually called a persistent chain in the literature. In our work, in step (a), we always initialize the finite-step (e.g., 20-step) gradient descent updates from the prior noise distribution. This can be called a non-persistent chain. It is also a non-convergent chain, because we do not expect the chain to converge to the target distribution. But we do optimize the step size (or in general, the tuning parameters) of the gradient descent update based on a variational criterion. This allows us to use very affordable short run inference dynamics for learning, and moreover, the same short run inference dynamics can be used for testing examples as well.

(3) Short run MCMC for energy-based models. Recently, [32] proposes to learn short run MCMC for energy-based models. An energy-based model is in the form of an unnormalized probability density function, where the log-density or the energy function is parametrized by a bottom-up neural network. [32] shows that it is possible to learn short run MCMC, such as 100-step Langevin dynamics, that can generate images of high synthesis quality, and they justify the learned short run MCMC as a valid flow-like model. However, the energy-based model itself is not well learned. Our method follows a similar strategy, but it is intended for approximately sampling from the posterior distribution of the latent variables. Our optimization method can help reduce the gap between the short run MCMC and the corresponding energy-based model, so that the energy-based model may also be properly learned.

(4) Attractor dynamics. In computational neuroscience, the dynamics of neuron activities is often modeled by attractor dynamics [19, 2, 33]. However, the objective function of the attractor dynamics is often implicit, so it is unclear what computational problem the attractor dynamics is solving. For the attractor dynamics to be implemented in real time, the dynamics is necessarily a short run dynamics. Our short run inference dynamics is guided by a top-down model with a well-defined objective function in terms of the log-posterior of the latent variables. It may be connected to the attractor dynamics and help us understand the latter. We shall explore this direction in future work.

3 Deep generative models

3.1 Joint, marginal, and posterior distributions

Let $x$ be the observed example, such as an image. Let $z$ be the latent variables, which may consist of latent variables at multiple layers organized in a top-down architecture. We may consider $z$ as forming an interpretation or explanation of $x$.

The joint distribution of $(x, z)$ is $p_\theta(x, z)$, where $\theta$ consists of model parameters. The marginal distribution of $x$ is $p_\theta(x) = \int p_\theta(x, z) \, dz$. Given $x$, the inference of $z$ can be based on the posterior distribution $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$.

3.2 Five families of models

To put our work in context, we review the following classes of models, which correspond to different ways of specifying the joint distribution $p_\theta(x, z)$, or the prior $p_\theta(z)$ and the conditional $p_\theta(x \mid z)$.

(1) Generator network. This model assumes a $d$-dimensional latent noise vector $z$ at the top layer. The prior distribution $p(z)$ is known, such as $z \sim \mathrm{N}(0, I_d)$, where $I_d$ is the $d$-dimensional identity matrix. Given $z$, $x = g_\theta(z) + \epsilon$, where $g_\theta$ is a top-down convolutional neural network (sometimes called a deconvolutional network due to its top-down nature), and $\theta$ consists of all the weight and bias terms of this top-down network. $\epsilon$ is usually assumed to be Gaussian white noise with mean $0$ and variance $\sigma^2$. Thus $p_\theta(x \mid z)$ is such that $[x \mid z] \sim \mathrm{N}(g_\theta(z), \sigma^2 I_D)$, where $D$ is the dimensionality of $x$. For this model,

$\log p_\theta(x, z) = \log p(z) + \log p_\theta(x \mid z)$   (1)
$\quad = -\frac{1}{2\sigma^2} \|x - g_\theta(z)\|^2 - \frac{1}{2} \|z\|^2 + c,$   (2)

where $c$ is a constant independent of $z$ and $\theta$.

In the generator network, units in the latent vector $z$ may carry semantic meanings, in that if we vary the units of $z$, we can observe semantically meaningful variations in the generated $g_\theta(z)$.
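For concreteness, the log-joint in Equations (1)-(2) can be evaluated with a few lines of PyTorch. The following sketch is purely illustrative: the network architecture, latent dimension, and noise level $\sigma$ are placeholder assumptions, not the specification used in our experiments.

```python
import torch
import torch.nn as nn

# Toy top-down generator g_theta: z (d-dim) -> x (D-dim). Architecture is illustrative only.
d, D, sigma = 16, 784, 0.5
g = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))

def log_joint(x, z):
    """log p_theta(x, z) up to an additive constant, per Eq. (2):
    -||x - g(z)||^2 / (2 sigma^2) - ||z||^2 / 2."""
    recon = g(z)
    return -((x - recon) ** 2).sum(dim=-1) / (2 * sigma ** 2) - (z ** 2).sum(dim=-1) / 2

x = torch.randn(4, D)          # a dummy batch of observations
z = torch.randn(4, d)          # latent vectors from the N(0, I_d) prior
print(log_joint(x, z))         # one unnormalized log-joint value per example
```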

(2) Multi-layer generator network. While it is computationally convenient to have a latent noise vector at the top layer, it does not account for the fact that patterns can appear at multiple layers of compositions or abstractions (e.g., face → (eyes, nose, mouth) → (edges, corners) → pixels), where variations and randomness occur at multiple layers. To capture such a hierarchical structure, it is desirable to introduce multiple layers of latent variables organized in a top-down architecture. Specifically, we have $z = (z_l, l = 1, \dots, L)$, where layer $L$ is the top layer, and layer $1$ is the bottom layer above $x$. For notational simplicity, we let $z_0 = x$. We can then specify the joint distribution as

$p_\theta(x, z) = p(z_L) \prod_{l=0}^{L-1} p_\theta(z_l \mid z_{l+1}).$   (3)

One concrete example is $z_L \sim \mathrm{N}(0, I)$ and $[z_l \mid z_{l+1}] \sim \mathrm{N}(\mu_l(z_{l+1}), \sigma_l^2(z_{l+1}))$, where $\mu_l$ and $\sigma_l^2$ are the mean vector and the diagonal variance-covariance matrix of $z_l$ respectively, and they are functions of $z_{l+1}$. $\theta$ collects all the parameters in these functions. $\log p_\theta(x, z)$ can be obtained similarly to Equation (2).
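A minimal sketch of such a top-down factorization with $L = 2$ latent layers is given below; the mean and variance networks and all dimensions are placeholder choices for illustration only.

```python
import torch
import torch.nn as nn

d2, d1, D, sigma = 8, 32, 784, 0.5
# z_2 ~ N(0, I); [z_1 | z_2] ~ N(mu_1(z_2), diag(sig_1(z_2)^2)); [x | z_1] ~ N(g(z_1), sigma^2 I)
mu1_net = nn.Linear(d2, d1)
logsig1_net = nn.Linear(d2, d1)
g = nn.Sequential(nn.Linear(d1, 128), nn.ReLU(), nn.Linear(128, D))

def sample_ancestral(n):
    """Top-down (ancestral) sampling, layer by layer."""
    z2 = torch.randn(n, d2)
    z1 = mu1_net(z2) + logsig1_net(z2).exp() * torch.randn(n, d1)
    x = g(z1) + sigma * torch.randn(n, D)
    return x, (z1, z2)

def log_joint(x, z1, z2):
    """log p(z_2) + log p(z_1 | z_2) + log p(x | z_1), dropping additive constants."""
    logsig1 = logsig1_net(z2)
    lp_z2 = -(z2 ** 2).sum(-1) / 2
    lp_z1 = -(((z1 - mu1_net(z2)) / logsig1.exp()) ** 2).sum(-1) / 2 - logsig1.sum(-1)
    lp_x = -((x - g(z1)) ** 2).sum(-1) / (2 * sigma ** 2)
    return lp_z2 + lp_z1 + lp_x
```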

(3) Energy-based model (EBM). The EBM [29, 31, 21, 43, 42, 9, 26, 32, 8, 20, 28] specifies $p_\theta(x)$ directly, and it does not involve any latent variables $z$. Specifically, $p_\theta(x) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$, where $f_\theta(x)$ is parametrized by a bottom-up neural network and $Z(\theta)$ is the normalizing constant. $-f_\theta(x)$ is called the energy function.

(4) Latent energy-based model. The latent EBM [18, 37] specifies $p_\theta(x, z)$ directly, instead of specifying $p_\theta(z)$ and $p_\theta(x \mid z)$ separately as in the generator network or the multi-layer generator network. Specifically, $p_\theta(x, z) = \frac{1}{Z(\theta)} \exp(-E_\theta(x, z))$, where $E_\theta(x, z)$ is the joint energy function. In general, both the generator model and its multi-layer version can be written in this form with a closed-form $Z(\theta)$; thus they are special cases of the latent EBM. However, unlike $f_\theta(x)$ of the EBM in (3), $E_\theta(x, z)$ in the latent EBM typically only involves singleton (unitary) and pairwise (binary) terms for biological plausibility.

(5) Flow-based model. The flow-based model [6, 7, 23, 10, 3, 40] is similar to the generator network, except that $z$ and $x$ are of the same dimensionality, and $x = g_\theta(z)$ with an invertible $g_\theta$. As a result, $\log p_\theta(x)$ can be obtained in closed form, similar to the EBM. But unlike the EBM, the density of the flow-based model is normalized, or in other words, the normalizing constant is in closed form. Computational feasibility requires that $g_\theta$ composes a sequence of transformations, thus a flow, whose inverse and Jacobian can be efficiently computed. Here we present the flow-based model as a generative model. It can also be used as an inference model to approximate the posterior [34, 24].

To connect the above models to our work, we focus on (1) and (2). However, our method is applicable to (3) and (4) as well. For (3), our optimization method can be used to optimize the short run MCMC that samples from $p_\theta(x)$. Our short run inference dynamics is also related to the flow-based inference model in (5).

3.3 Learning and inference

Let $p_{\rm data}(x)$ be the data distribution that generates the example $x$. The learning of the parameters $\theta$ of $p_\theta(x)$ can be based on $\min_\theta \mathrm{KL}(p_{\rm data} \,\|\, p_\theta)$, where $\mathrm{KL}(p \,\|\, q)$ denotes the Kullback-Leibler divergence between $p$ and $q$ (or from $p$ to $q$, since it is asymmetric). If we observe training examples $\{x_i, i = 1, \dots, n\} \sim p_{\rm data}$, the above minimization can be approximated by maximizing the log-likelihood

$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i),$   (4)

which leads to the maximum likelihood estimate (MLE).

The gradient of the log-likelihood, $\nabla_\theta \log p_\theta(x)$, can be computed according to the following identity:

$\nabla_\theta \log p_\theta(x) = \frac{1}{p_\theta(x)} \nabla_\theta p_\theta(x)$   (5)
$\quad = \frac{1}{p_\theta(x)} \int \nabla_\theta p_\theta(x, z) \, dz$   (6)
$\quad = \mathrm{E}_{p_\theta(z \mid x)}\left[ \nabla_\theta \log p_\theta(x, z) \right].$   (7)

The above expectation can be approximated by Monte Carlo samples from $p_\theta(z \mid x)$.

For the generator network or its multi-layer version, $\nabla_\theta \log p_\theta(x, z)$ is in closed form. For the latent EBM, $\nabla_\theta \log p_\theta(x, z) = -\nabla_\theta E_\theta(x, z) + \mathrm{E}_{p_\theta(x, z)}[\nabla_\theta E_\theta(x, z)]$, where the expectation can be approximated by Monte Carlo samples from $p_\theta(x, z)$. In this work, we shall focus on the generator network and its multi-layer version. We shall leave the latent EBM to future work.

4 Short run inference dynamics

4.1 Langevin dynamics

Sampling from $p_\theta(z \mid x)$ usually requires MCMC. One convenient MCMC is Langevin dynamics [27], which iterates

$z_{k+1} = z_k + s \nabla_z \log p_\theta(z_k \mid x) + \sqrt{2s}\, e_k,$   (8)

where $e_k \sim \mathrm{N}(0, I_d)$, $k$ indexes the time step of the Langevin dynamics, and $s$ is the step size. The Langevin dynamics consists of a gradient descent term on $-\log p_\theta(z \mid x)$. In the case of the generator network, it amounts to gradient descent on $\|x - g_\theta(z)\|^2 / (2\sigma^2) + \|z\|^2 / 2$, which is a penalized reconstruction error. The Langevin dynamics also consists of a white noise diffusion term $\sqrt{2s}\, e_k$ to create randomness for sampling from $p_\theta(z \mid x)$. For a small step size $s$, the marginal distribution of $z_k$ will converge to $p_\theta(z \mid x)$ as $k \rightarrow \infty$ regardless of the initial distribution of $z_0$. More specifically, let $q_k(z)$ be the marginal distribution of $z_k$ of the Langevin dynamics; then $\mathrm{KL}(q_k(z) \,\|\, p_\theta(z \mid x))$ decreases monotonically, that is, by increasing $k$, we reduce $\mathrm{KL}(q_k(z) \,\|\, p_\theta(z \mid x))$.

Thus MCMC is consistent with variational approximation: both seek to minimize $\mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x))$ over $q$ within a certain class.
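For reference, a single Langevin update of Equation (8) can be sketched in PyTorch as follows; the step size and the toy target used to make the snippet self-contained are illustrative assumptions.

```python
import torch

def langevin_step(z, x, log_joint, s=0.05):
    """One Langevin update: z <- z + s * grad_z log p(x, z) + sqrt(2 s) * e,  e ~ N(0, I).
    Note grad_z log p(z|x) = grad_z log p(x, z), since log p(x) does not depend on z."""
    z = z.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_joint(x, z).sum(), z)[0]
    return z.detach() + s * grad + (2 * s) ** 0.5 * torch.randn_like(z)

# Tiny stand-in target so the snippet runs on its own: p(x, z) Gaussian in z.
log_joint = lambda x, z: -((x[:, :z.shape[1]] - z) ** 2).sum(-1) / 2 - (z ** 2).sum(-1) / 2
x, z = torch.randn(4, 10), torch.randn(4, 3)
for _ in range(100):
    z = langevin_step(z, x, log_joint)
```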

4.2 Noise initialized short run dynamics

It is impractical to run long chains to sample from $p_\theta(z \mid x)$. We thus propose the following short run inference dynamics:

$z_0 \sim p(z),$   (9)
$z_{k+1} = z_k + s \nabla_z \log p_\theta(z_k \mid x), \quad k = 0, 1, \dots, K-1,$   (10)

where $p(z)$ is the prior distribution of $z$. In the case of the generator network, $p(z)$ is $\mathrm{N}(0, I_d)$. Starting from $z_0$, we run $K$ (e.g., $K = 20$) steps of gradient descent on $-\log p_\theta(z \mid x)$ with step size $s$, and we take $z_K$ to be an approximate sample from $p_\theta(z \mid x)$.

Compared to the Langevin dynamics, we remove the white noise term $\sqrt{2s}\, e_k$. In practice, the noise term has minimal usefulness for the following two reasons. (1) In the beginning stage, when the reconstruction error is big relative to $\sigma^2$, the gradient of $\log p_\theta(z \mid x)$ dominates the noise term, which becomes negligible. (2) When the Langevin dynamics gets to a local mode, the noise term can hardly make the chain jump out of the local mode if the mode is deep. That is, running a long chain (e.g., $K$ = 10,000 steps) is not only impractical; even if we could afford it, it still could not guarantee mode traversal and convergence to the target distribution. Thus we opt to run a short chain without the noise term, starting from the noise prior distribution $p(z)$, so that the randomness or possible multi-modality comes from the initial $z_0$ instead of the almost useless noise term, and we optimize the step size $s$ to let the distribution of $z_K$ approximate the true posterior $p_\theta(z \mid x)$.
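A minimal sketch of the noise-initialized short run inference of Equations (9)-(10) is given below; the step size $s$ and number of steps $K$ are illustrative settings, and `log_joint` can be any differentiable $\log p_\theta(x, z)$, such as the toy one sketched in Section 3.2.

```python
import torch

def short_run_inference(x, log_joint, d, K=20, s=0.05):
    """Eqs. (9)-(10): z_0 ~ N(0, I_d); z_{k+1} = z_k + s * grad_z log p_theta(x, z_k)."""
    z = torch.randn(x.shape[0], d)           # initialize from the prior
    for _ in range(K):
        z = z.detach().requires_grad_(True)
        grad = torch.autograd.grad(log_joint(x, z).sum(), z)[0]
        z = z.detach() + s * grad            # gradient ascent on log p(x, z); no noise term
    return z                                 # approximate posterior sample z_K
```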

We can write the above dynamics as

$z_{k+1} = z_k + s R(z_k),$   (11)

where $R(z) = \nabla_z \log p_\theta(z \mid x)$, and where we omit $x$ and $\theta$ in $R$ for simplicity of notation. For finite $K$, this dynamics is a $K$-layer residual network, or a $K$-step RNN. In a recent paper, [3] studies the residual network as a flow-based model.

4.3 Flow-like inference model

To further simplify the notation, we may write the dynamics as

$z_K = F(z_0),$   (12)

where $F$ composes the $K$ steps of gradient descent. The above model can be considered a flow-based model, where $F$ consists of a $K$-step flow of gradient updates. Let the distribution of $z_K$ be $q_s(z)$, where we include the notation $s$ to make it explicit that the distribution of $z_K$ depends on the step size $s$. Recall that the distribution of $z_K$ also depends on $\theta$ and $x$, so that in full notation, we may write $q_s(z)$ as $q_{s, \theta}(z \mid x)$.

By the change of variable,

$\log q_s(z_K) = \log p(z_0) - \log \left| \det\!\left( \frac{\partial F(z_0)}{\partial z_0} \right) \right|.$   (13)

We may treat $q_s(z)$ as a flow-like variational inference model.

4.4 Variational optimization of step size

We want to optimize the step size $s$ so that $q_s(z)$ best approximates the true posterior $p_\theta(z \mid x)$. This can be accomplished by

$\min_s \mathrm{KL}(q_s(z) \,\|\, p_\theta(z \mid x)).$   (14)

This is similar to variational approximation, with the step size $s$ being the variational parameter. We have

$\mathrm{KL}(q_s(z) \,\|\, p_\theta(z \mid x)) = \mathrm{E}_{q_s}[\log q_s(z)] - \mathrm{E}_{q_s}[\log p_\theta(x, z)] + \log p_\theta(x),$   (15)

where the last term is independent of $s$, and for the first two terms on the right hand side,

$\mathrm{E}_{q_s}[\log q_s(z)] = \mathrm{E}_{p(z_0)}\left[ \log p(z_0) - \log \left| \det\!\left( \frac{\partial F(z_0)}{\partial z_0} \right) \right| \right],$   (16)

and

$\mathrm{E}_{q_s}[\log p_\theta(x, z)] = \mathrm{E}_{p(z_0)}\left[ \log p_\theta(x, F(z_0)) \right]$   (17)
$\quad = -\mathrm{E}_{p(z_0)}\left[ \frac{1}{2\sigma^2} \|x - g_\theta(F(z_0))\|^2 + \frac{1}{2} \|F(z_0)\|^2 \right] + c$   (18)

in the case of the generator network. In the above two equations, the expectations with respect to $p(z_0)$ can be approximated by random samples from $p(z_0)$, e.g., $\mathrm{N}(0, I_d)$ in the case of the generator network.

In the above computations, we do not need to invert $F$, as is typical in flow-based variational approximation, but we do need to compute the log-determinant of the Jacobian $\partial F(z_0)/\partial z_0$. Fortunately, on modern deep learning platforms, such computation is easily feasible even if the dimension of $z$ is very high. For a sampled $z_0$, after computing the matrix $J = \partial F(z_0)/\partial z_0$, we can compute the eigenvalues of $J$, so that the log-determinant is the sum of the logarithms of the eigenvalues. Then we can optimize the step size $s$ by minimizing $\mathrm{KL}(q_s(z) \,\|\, p_\theta(z \mid x))$ via a grid search or gradient descent.
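The following sketch illustrates this procedure on a toy generator: it estimates the objective in (15)-(18) by Monte Carlo up to the constant $\log p_\theta(x)$, computes the log-determinant through the Jacobian of the $K$-step map $F$, and selects $s$ by grid search. The architecture, dimensions, helper names (`make_F`, `kl_surrogate`), and grid are our own placeholder assumptions, and the dense Jacobian is only affordable here because the latent dimension is small.

```python
import torch

# --- toy generator and log-joint (placeholders, as in the earlier sketches) ---
d, D, sigma, K = 8, 64, 0.5, 20
g = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(), torch.nn.Linear(32, D))

def log_joint(x, z):
    return -((x - g(z)) ** 2).sum(-1) / (2 * sigma ** 2) - (z ** 2).sum(-1) / 2

def make_F(x, s):
    """F composes K gradient-ascent steps on log p(x, z); differentiable in z_0 (Eq. (12))."""
    def F(z0):
        z = z0
        for _ in range(K):
            grad = torch.autograd.grad(log_joint(x, z).sum(), z, create_graph=True)[0]
            z = z + s * grad
        return z
    return F

def kl_surrogate(x, s, n_samples=16):
    """Monte Carlo estimate of E[log q_s(z_K)] - E[log p(x, z_K)], Eqs. (15)-(18);
    this equals KL(q_s || p(z|x)) up to additive constants independent of s."""
    vals = []
    for _ in range(n_samples):
        z0 = torch.randn(d, requires_grad=True)
        F = make_F(x, s)
        zK = F(z0)
        J = torch.autograd.functional.jacobian(F, z0)         # d x d Jacobian of F
        log_det = torch.linalg.slogdet(J).logabsdet           # log |det dF/dz0|
        log_q = -(z0 ** 2).sum() / 2 - log_det                # change of variables, Eq. (13)
        vals.append(log_q - log_joint(x, zK.unsqueeze(0)).squeeze())
    return torch.stack(vals).mean()

x = torch.randn(1, D)                                         # a single dummy observation
grid = [0.01, 0.03, 0.05, 0.1]
best_s = min(grid, key=lambda s: kl_surrogate(x, s).item())   # grid search over the step size
print("selected step size:", best_s)
```

Because the whole estimate is built from differentiable operations, the same surrogate can also be minimized over $s$ by gradient descent instead of grid search.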

While we can optimize the step size $s$ for each example $x$, in our work we optimize an overall $s$ that is shared by all the examples. Reverting to the full notation $q_{s, \theta}(z \mid x)$ for $q_s(z)$, this means we minimize

$\mathrm{E}_{p_{\rm data}(x)}\left[ \mathrm{KL}(q_{s, \theta}(z \mid x) \,\|\, p_\theta(z \mid x)) \right]$   (19)

over $s$.

Instead of using a constant step size $s$, gradient descent with respect to $s$ may also be generalized to optimize over varying step sizes $(s_k, k = 0, \dots, K-1)$. We leave this to future work.

4.5 Learning with short run inference dynamics

Each learning iteration consists of the following two steps. (1) Update $s$ by minimizing (19). (2) Update $\theta$ by

$\theta_{t+1} = \theta_t + \eta_t \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}_{q_{s, \theta_t}(z \mid x_i)}\left[ \nabla_\theta \log p_{\theta_t}(x_i, z) \right],$   (20)

where $\eta_t$ is the step size or learning rate, and the expectation under $q_{s, \theta_t}(z \mid x_i)$ (here we use the full notation instead of the abbreviated notation $q_s(z)$) can be approximated by sampling from $q_{s, \theta_t}(z \mid x_i)$ using the noise-initialized $K$-step gradient descent.

Given $s$, the above iteration seeks to maximize

$Q(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}_{q_{s, \theta_t}(z \mid x_i)}\left[ \log p_\theta(x_i, z) \right]$   (21)

by a single gradient ascent step in $\theta$. In fact, as in an approximate Monte Carlo EM algorithm, we may maximize $Q(\theta)$ with multiple gradient ascent steps. $Q(\theta)$ is an approximation to the complete-data log-likelihood in the EM algorithm [5, 36].

Compared to the log-likelihood function $L(\theta)$ in (4), we have

$Q(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}_{q_{s, \theta_t}(z \mid x_i)}\left[ \log p_\theta(x_i) + \log p_\theta(z \mid x_i) \right]$   (22)
$\quad = L(\theta) - \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}(q_{s, \theta_t}(z \mid x_i) \,\|\, p_\theta(z \mid x_i))$   (23)
$\quad\quad + \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}_{q_{s, \theta_t}(z \mid x_i)}\left[ \log q_{s, \theta_t}(z \mid x_i) \right].$   (24)

Since the last term has nothing to do with $\theta$, maximizing $Q(\theta)$ is equivalent to maximizing $L(\theta) - \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}(q_{s, \theta_t}(z \mid x_i) \,\|\, p_\theta(z \mid x_i))$, which is a lower bound of $L(\theta)$.

Thus each learning iteration can be interpreted as a joint maximization of the above lower bound over $\theta$ and $s$. Specifically, step (1) maximizes it over $s$ given $\theta$, and step (2) seeks to maximize it over $\theta$ given $s$. This is similar to variational inference with $s$ being the variational parameter. The learning procedure is summarized in Algorithm 1.

input: learning iterations $T$, step size interval, learning rate $\eta$, initial weights $\theta_0$, observed examples $\{x_i\}_{i=1}^{n}$, batch size $m$, number of steps $K$, initial step size $s$.
output: weights $\theta_T$.
for $t = 0, \dots, T-1$ do
       1. Draw a mini-batch of observed examples $\{x_i\}_{i=1}^{m}$.
       2. Draw initial latent vectors $z_0 \sim p(z)$.
       3. Infer $z_K$ by $K$ steps of the dynamics (10) with step size $s$.
       4. Update $\theta$ according to (20).
       5. Every fixed number of iterations, update $s$ by minimizing (19).
Algorithm 1: Learning with short run inference dynamics.
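A rough sketch of Algorithm 1 for the toy single-layer generator used in the earlier sketches is given below; the dataset, optimizer settings, number of iterations, and the omitted step-size update are illustrative assumptions rather than our experimental configuration.

```python
import torch
import torch.nn as nn

d, D, sigma, K = 16, 784, 0.5, 20
g = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))
opt = torch.optim.Adam(g.parameters(), lr=1e-4)
s = 0.05                                         # initial step size
data = torch.randn(1000, D)                      # placeholder dataset

def log_joint(x, z):
    return -((x - g(z)) ** 2).sum(-1) / (2 * sigma ** 2) - (z ** 2).sum(-1) / 2

def short_run_inference(x, s):
    z = torch.randn(x.shape[0], d)               # step 2: z_0 ~ p(z)
    for _ in range(K):                           # step 3: K gradient steps, Eq. (10)
        z = z.detach().requires_grad_(True)
        grad = torch.autograd.grad(log_joint(x, z).sum(), z)[0]
        z = z.detach() + s * grad
    return z

for t in range(200):                             # learning iterations
    idx = torch.randint(0, data.shape[0], (64,))
    x = data[idx]                                # step 1: draw a mini-batch
    z = short_run_inference(x, s)                # steps 2-3: infer z_K
    loss = -log_joint(x, z).mean()               # step 4: ascend log p_theta(x, z_K), Eq. (20)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Step 5 (omitted here): every fixed number of iterations, update s by
    # minimizing (19), e.g. with the grid search sketched in Section 4.4.
```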

5 General theoretical formulation for short run inference and synthesis

To put our work in the big picture of generative modeling and computing, let us consider a latent EBM $p_\theta(x, z) = \frac{1}{Z(\theta)} \exp(-E_\theta(x, z))$, which includes the generator model and its multi-layer version, as well as the regular EBM, as special cases. Maximum likelihood learning is to minimize $\mathrm{KL}(p_{\rm data}(x) \,\|\, p_\theta(x))$, where $p_\theta(x) = \int p_\theta(x, z) \, dz$, and $p_{\rm data}$ is the data distribution that generates the data $x$. The expectation with respect to $p_{\rm data}$ can be approximated by the average over the observed examples $\{x_i\}$.

The learning gradient is based on

$\nabla_\theta \mathrm{E}_{p_{\rm data}(x)}\left[ \log p_\theta(x) \right] = \mathrm{E}_{p_{\rm data}(x)}\left[ \nabla_\theta \log p_\theta(x) \right]$   (25)
$\quad = \mathrm{E}_{p_{\rm data}(x)} \mathrm{E}_{p_\theta(z \mid x)}\left[ \nabla_\theta \log p_\theta(x, z) \right]$   (26)
$\quad = \mathrm{E}_{p_{\rm data}(x)} \mathrm{E}_{p_\theta(z \mid x)}\left[ -\nabla_\theta E_\theta(x, z) \right] - \nabla_\theta \log Z(\theta)$   (27)
$\quad = \mathrm{E}_{p_{\rm data}(x)} \mathrm{E}_{p_\theta(z \mid x)}\left[ -\nabla_\theta E_\theta(x, z) \right] - \mathrm{E}_{p_\theta(x, z)}\left[ -\nabla_\theta E_\theta(x, z) \right].$   (28)

The first term requires inference sampling from $p_\theta(z \mid x)$ in the so-called positive or wake phase (or clamped sampling) in the terminology of the Boltzmann machine [1] and Helmholtz machine [4], while the second term requires synthesis sampling from $p_\theta(x, z)$ in the so-called negative or sleep phase (or unclamped sampling) [17]. Both can be replaced by short run dynamics. Let $q_{s, \theta}(z \mid x)$ be the short-run inference dynamics with tuning parameters $s$, and let $q_{s', \theta}(x, z)$ be the short-run synthesis dynamics with tuning parameters $s'$. Then at learning iteration $t$ with current parameters $\theta_t$, the learning gradient for $\theta$

seeks to minimize the following generalized version of contrastive divergence:

$D_t(\theta) = \mathrm{KL}\big(p_{\rm data}(x) q_{s, \theta_t}(z \mid x) \,\big\|\, p_\theta(x, z)\big)$   (29)
$\quad\quad - \mathrm{KL}\big(q_{s', \theta_t}(x, z) \,\big\|\, p_\theta(x, z)\big).$   (30)

Compared to maximum likelihood, we have

$D_t(\theta) = \mathrm{KL}(p_{\rm data}(x) \,\|\, p_\theta(x)) + \Delta_t(\theta),$   (31)

where the perturbation term

$\Delta_t(\theta) = \mathrm{E}_{p_{\rm data}(x)}\left[ \mathrm{KL}(q_{s, \theta_t}(z \mid x) \,\|\, p_\theta(z \mid x)) \right]$   (32)
$\quad\quad - \mathrm{KL}(q_{s', \theta_t}(x, z) \,\|\, p_\theta(x, z))$   (33)

is expected to be small, so that $D_t(\theta)$ is an approximation to $\mathrm{KL}(p_{\rm data}(x) \,\|\, p_\theta(x))$. The learning gradient is

$-\nabla_\theta D_t(\theta) = \mathrm{E}_{p_{\rm data}(x)} \mathrm{E}_{q_{s, \theta_t}(z \mid x)}\left[ -\nabla_\theta E_\theta(x, z) \right]$   (34)
$\quad\quad - \mathrm{E}_{q_{s', \theta_t}(x, z)}\left[ -\nabla_\theta E_\theta(x, z) \right].$   (35)

Comparing (35) with (28), we replace the posterior $p_\theta(z \mid x)$ by the short run inference $q_{s, \theta_t}(z \mid x)$, and we replace the joint $p_\theta(x, z)$ by the short run synthesis $q_{s', \theta_t}(x, z)$. In fact, (28) can be derived based on (35).

As to $s$ and $s'$, we seek to make the two short run dynamics approximate their respective targets, which amounts to

$\min_s \mathrm{E}_{p_{\rm data}(x)}\left[ \mathrm{KL}(q_{s, \theta_t}(z \mid x) \,\|\, p_{\theta_t}(z \mid x)) \right],$   (36)
$\min_{s'} \mathrm{KL}(q_{s', \theta_t}(x, z) \,\|\, p_{\theta_t}(x, z)).$   (37)

Both are variational optimizations with respect to the tuning parameters of the short run inference and the short run synthesis respectively. Overall, the learning iteration alternates between updating $\theta$ based on the gradient (34)-(35) and updating $(s, s')$ by the variational optimizations (36)-(37).

Compared to the original contrastive divergence [16], we initialize the short run synthesis from a fixed noise distribution, instead of the observed examples.

For the generator model or a purely top-down model, the second KL-divergence in the above generalized contrastive divergence disappears, since synthesis sampling can be accomplished by direct ancestral sampling. This is the case treated in this paper. For the regular EBM, $z$ disappears, and with it the short run inference, so that we only have a short run synthesis $q_{s', \theta}(x)$; the learning is then based on $\mathrm{KL}(p_{\rm data}(x) \,\|\, p_\theta(x)) - \mathrm{KL}(q_{s', \theta_t}(x) \,\|\, p_\theta(x))$, and the variational optimization is over $s'$ alone. See the recent work [32] for short run synthesis for the regular EBM. See also the recent work [13] for a divergence triangle formulation that is related to the above generalized contrastive divergence formulation.
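To make the general formulation concrete, the sketch below estimates the positive (wake) and negative (sleep) phases with short run dynamics for a toy latent EBM. This is not the setting of our experiments, which use top-down generators with ancestral synthesis; the energy network, step sizes, and loss bookkeeping are illustrative assumptions only.

```python
import torch
import torch.nn as nn

D, d = 32, 8
energy = nn.Sequential(nn.Linear(D + d, 64), nn.GELU(), nn.Linear(64, 1))  # E_theta(x, z)
opt = torch.optim.Adam(energy.parameters(), lr=1e-4)
data = torch.randn(500, D)                       # placeholder observed examples

def E(x, z):
    return energy(torch.cat([x, z], dim=-1)).squeeze(-1)

def descend(E_fn, init, steps=20, s=0.05):
    """Short run gradient descent on the energy (i.e. ascent on log p) w.r.t. the given variables."""
    v = init
    for _ in range(steps):
        v = v.detach().requires_grad_(True)
        grad = torch.autograd.grad(E_fn(v).sum(), v)[0]
        v = v.detach() - s * grad
    return v

for t in range(100):
    x = data[torch.randint(0, data.shape[0], (64,))]
    # positive / wake phase: clamp x, short run inference of z initialized from the prior
    z_pos = descend(lambda z: E(x, z), torch.randn(64, d))
    # negative / sleep phase: short run synthesis of (x, z) jointly, initialized from noise
    xz_neg = descend(lambda v: E(v[:, :D], v[:, D:]), torch.randn(64, D + d))
    # descending this loss follows the ascent direction of Eqs. (34)-(35)
    loss = E(x, z_pos).mean() - E(xz_neg[:, :D], xz_neg[:, D:]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```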

6 Experiments

In this section, we demonstrate (1) realistic synthesis, (2) faithful reconstruction of observed examples, (3) inpainting of occluded images, (4) variational grid search and gradient descent on the step size, and (5) an ablation on the number of latent layers and short run inference steps. We emphasize the simplicity of the short run inference algorithm.

All training image datasets are resized and scaled with no further pre-processing. We train the models with parameter updates optimized by Adam [22]. The learning rate decays step-wise over the course of the iterations. If not stated otherwise, we use a fixed number $K$ of short run inference steps, and $\sigma$ is gradually annealed during training.

The baseline models are trained as ladder variational autoencoders [38] to improve the performance of multi-layer latent variable models. We refer to the supplementary material for detailed model specifications.

6.1 Synthesis

We evaluate the fidelity of generated examples quantitatively on various datasets, each reduced to a fixed number of observed examples. Figure 3 depicts generated samples for various datasets obtained with $K$ short run inference steps. Figure 1 depicts generated samples on the CelebA dataset (see generated samples from a VAE with the same generator architecture in Figure 2). Table 1 compares the Fréchet Inception Distance (FID) [15] computed with the Inception v3 classifier [39] on generated examples. Despite its simplicity, short run inference dynamics is competitive with elaborate means of inference in VAE models.

MNIST SVHN CelebA
Models MSE FID MSE FID MSE FID
VAE, L=1 0.020 - 0.019 46.78 0.031 69.90
VAE, L=3 0.018 - 0.015 41.72 0.029 58.33
VAE, L=5 0.018 - 0.014 39.26 0.028 53.40
Ours, L=1 0.019 - 0.018 44.86 0.019 45.80
Ours, L=3 0.017 - 0.015 39.02 0.018 41.20
Ours, L=5 0.015 - 0.011 35.23 0.011 37.29
Table 1: Comparison of generators with a varied number of latent layers $L$, learned by VAE and by short run inference, with respect to MSE of reconstructions and FID of generated samples for the MNIST, SVHN, and CelebA datasets. (FID on MNIST is not considered meaningful or reflective of human judgment of synthesis quality and is hence omitted.)
Figure 1: Generated samples from a generator learned by short run inference on CelebA.
Figure 2: Generated samples from a ladder VAE on CelebA.
Figure 3: Generated samples from generators learned by short run inference. From left to right: (1) MNIST, (2) SVHN, (3) CelebA.

6.2 Reconstruction

We evaluate the accuracy of the learned inference dynamics by reconstructing test images. In contrast to traditional MCMC posterior sampling with persistent chains, short run inference with small $K$ not only allows for efficient learning on training examples, but also allows the same dynamics to be recruited for testing examples.

Figure 4: Comparison of reconstructions between VAE and our method on CelebA. Top: original test images. Middle: reconstructions from VAE. Bottom: reconstructions from short run inference.

Figure 4 compares the reconstructions of generators learned by VAE and by short run inference on CelebA. The fidelity of the reconstructions by short run inference appears qualitatively improved over VAE, which is quantitatively confirmed by a consistently lower MSE in Table 1.

6.3 Inpainting

Figure 5: Inpainting on CelebA for varying occlusion masks. Top: original test images. Middle: occluded images. Bottom: test images inpainted by short run inference.

Our method can “inpaint” occluded image regions. To recover the occluded pixels, the only required modification of (10) involves the computation of the reconstruction error term $\|x - g_\theta(z)\|^2$. For a fully observed image, this term is computed by summation over all pixels. For partially observed images, we only compute the summation over the observed pixels. Figure 5 depicts test images taken from the CelebA dataset for which a mask randomly occludes pixels in various patterns.
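A sketch of the masked log-joint is given below; the mask convention (1 for observed pixels, 0 for occluded pixels) and the generic generator argument are illustrative assumptions.

```python
import torch

def masked_log_joint(x, z, g, mask, sigma=0.5):
    """Eq. (2) restricted to observed pixels: mask is 1 for observed, 0 for occluded entries of x."""
    sq_err = (x - g(z)) ** 2
    return -(mask * sq_err).sum(-1) / (2 * sigma ** 2) - (z ** 2).sum(-1) / 2

# Running the short run inference of Eq. (10) with this log-joint and then taking g(z_K)
# fills in the occluded region.
```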

6.4 Variational optimization of step size

The step size $s$ in (10) may be optimized such that $q_s(z)$ best approximates the true posterior $p_\theta(z \mid x)$. That is, we can optimize the step size by minimizing $\mathrm{KL}(q_s(z) \,\|\, p_\theta(z \mid x))$ via a grid search or gradient descent. As outlined in Section 4.4, we require the log-determinant of the Jacobian $\partial F(z_0)/\partial z_0$. In reverse-mode auto-differentiation, we construct the Jacobian one row at a time by evaluating vector-Jacobian products. Then, we evaluate the eigenvalues of the Jacobian. As both steps are computed in a differentiable manner, we may compute the gradient with respect to the step size $s$.
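The row-by-row construction can be sketched with vector-Jacobian products as follows; the map `F` here is a generic stand-in for the $K$-step short run dynamics, and the helper name is our own.

```python
import torch

def jacobian_rows_by_vjp(F, z0):
    """Build the d x d Jacobian dF/dz0 one row at a time using vector-Jacobian products,
    for a differentiable map F from a d-vector to a d-vector."""
    z0 = z0.detach().requires_grad_(True)
    out = F(z0)
    d = out.shape[0]
    rows = []
    for i in range(d):
        e = torch.zeros(d)
        e[i] = 1.0                               # unit vector selecting output coordinate i
        row = torch.autograd.grad(out, z0, grad_outputs=e,
                                  retain_graph=True, create_graph=True)[0]
        rows.append(row)                         # row i = gradient of out[i] w.r.t. z0
    J = torch.stack(rows)
    eigvals = torch.linalg.eigvals(J)            # complex in general
    log_det = torch.log(eigvals).sum().real      # real part recovers log |det J|
    return J, log_det                            # differentiable, so d(log_det)/ds is available

# usage on a stand-in map
F = lambda z: z + 0.1 * torch.tanh(z)
J, log_det = jacobian_rows_by_vjp(F, torch.randn(5))
```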

Figure 6 depicts the optimal step size over learning iterations as determined by grid search and by gradient descent on (19). For both grid search and gradient descent, the step size settles near a common value after a few learning iterations. Figure 7 details the optimization objective with respect to individual step sizes $s$.

Figure 6: Variationally optimal step size over epochs for three individual runs with varying random seeds. Top: grid search. Bottom: gradient descent.
Figure 7: Optimization objective for a range of step sizes $s$ over epochs.

6.5 Influence of number of layers and steps

Finally, Tables 2 and 3 show the influence of the number of latent layers $L$ of the generator network and the number of steps $K$ of the short run inference dynamics (10), respectively. Increasing the number of layers improves the quality of synthesis and reconstruction. Increasing the number of inference steps up to $K = 20$ results in significant improvements, while further increasing $K$ appears to affect the scores only marginally.

Table 2: Influence of the number of latent layers $L$ on FID and MSE for CelebA.
Table 3: Influence of the number of short run inference steps $K$ on FID and MSE for CelebA.

7 Conclusion

This paper proposes to use short run inference dynamics to infer latent variables in deep generative models. The inference dynamics is always initialized from the prior noise distribution, followed by a small number (e.g., 20) of updates. It can be viewed as a residual network or an RNN. Similar to variational inference, the tuning parameters, such as the step size of the short run inference dynamics, are optimized by a variational criterion. Unlike variational inference, there is no need to design an extra inference model, which is usually a challenging task with multiple layers of latent variables. In fact, the short run dynamics is guided by the log-posterior distribution, which regulates the explaining-away competitions and top-down feedback between the latent variables. The short run inference dynamics is easily affordable on current computing platforms and can easily be scaled up to big training data. It will enable researchers to design more sophisticated latent variable models without worrying about designing inference models.

The following are the directions that we shall explore in our future work.

(1) In this paper, we focus on the gradient descent update, and we optimize its step size. We shall explore more general updates and other tuning parameters. A simple extension is to consider varying step sizes, i.e., different step sizes at different iterations of the short run dynamics. Another extension is gradient descent with momentum, where the tuning parameters can also be optimized. We may also consider second-order optimization algorithms.

(2) In this paper, we focus on continuous latent variables or latent vectors. Continuous vector representations are commonly used in deep learning models. Discrete variables can be obtained from continuous variables. In our future work, we shall also explore short run inference for discrete latent variables explicitly.

(3) In this paper, we focus on top-down generative models. Our method can be applied to graphical models in general, including the latent energy-based models. We shall also study these models using short run inference as well as short run synthesis in our future work.

Acknowledgments

This work is supported by DARPA XAI project N66001-17-2-4029, ARO project W911NF1810296, ONR MURI project N00014-16-1-2007, and XSEDE grant ASC170063. We thank NVIDIA for hardware donations.

References

  • [1] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
  • [2] Daniel J. Amit. Modeling brain function: the world of attractor neural networks, 1st Edition. Cambridge Univ. Press, 1989.
  • [3] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 573–582, 2019.
  • [4] Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.
  • [5] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • [6] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.
  • [7] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [8] Chelsea Finn, Paul F. Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. CoRR, abs/1611.03852, 2016.
  • [9] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 9155–9164, 2018.
  • [10] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
  • [11] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1462–1471, 2015.
  • [12] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 1976–1984, 2017.
  • [13] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 8670–8679, 2019.
  • [14] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
  • [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6626–6637, 2017.
  • [16] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
  • [17] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
  • [18] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
  • [19] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the national academy of sciences, volume 79, pages 2554–2558. National Acad Sciences, 1982.
  • [20] Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 823–833, 2017.
  • [21] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. CoRR, abs/1606.03439, 2016.
  • [22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [23] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 10236–10245, 2018.
  • [24] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016.
  • [25] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [26] Rithesh Kumar, Anirudh Goyal, Aaron C. Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. CoRR, abs/1901.08508, 2019.
  • [27] Paul Langevin. On the theory of Brownian motion. 1908.
  • [28] Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2793–2802, 2017.
  • [29] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
  • [30] Radford M Neal. MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
  • [31] Jiquan Ngiam, Zhenghao Chen, Pang Wei Koh, and Andrew Y. Ng. Learning deep energy models. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1105–1112, 2011.
  • [32] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, Canada, 2019.
  • [33] Bruno Poucet and Etienne Save. Attractors in memory. Science, 308(5723):799–800, 2005.
  • [34] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1530–1538, 2015.
  • [35] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1278–1286, 2014.
  • [36] Donald B. Rubin and Dorothy T. Thayer. EM algorithms for ml factor analysis. Psychometrika, 47(1):69–76, Mar 1982.
  • [37] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 693–700, 2010.
  • [38] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3738–3746, 2016.
  • [39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826, 2016.
  • [40] Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, and Ben Poole. Discrete flows: Invertible generative models of discrete data. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, 2019.
  • [41] Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. Learning dynamic generator model by alternating back-propagation through time. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019., pages 5498–5507, 2019.
  • [42] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2635–2644, 2016.
  • [43] Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

8 Appendix

8.1 Model Specification

For the multi-layer generator model, we have $z = (z_l, l = 1, \dots, L)$, for which layer $L$ is the top layer and layer $1$ is the bottom layer close to $x$. For simplicity, let $z_0 = x$. Then, $p_\theta(x, z) = p(z_L) \prod_{l=0}^{L-1} p_\theta(z_l \mid z_{l+1})$. In our case, we have $z_L \sim \mathrm{N}(0, I)$ and $[z_l \mid z_{l+1}] \sim \mathrm{N}(\mu_l(z_{l+1}), \sigma_l^2(z_{l+1}))$, where $\mu_l$ and $\sigma_l^2$ are the mean vector and the diagonal variance-covariance matrix of $z_l$ respectively. They are functions of $z_{l+1}$, composed of deterministic blocks and projection layers that preserve dimensionality. Each deterministic block consists of two subsequent layers with GELU [14] activation functions and a skip connection, and the mean and variance heads are a pair of layers projecting to the dimensionality of $z_l$. The final deterministic block is a layer projecting to the desired dimensionality of $x$, and the range of $x$ is bounded by a final non-linearity.

Table 4 illustrates a specification with $L = 3$ latent layers.

Table 4: Specification of the multi-layer generator model with $L = 3$ latent layers, listing the operations and output dimensions of the top-down blocks from $z_3$ down to $x$.