# Learning Latent Space Energy-Based Prior Model

The generator model assumes that the observed example is generated by a low-dimensional latent vector via a top-down network, and the latent vector follows a simple and known prior distribution, such as uniform or Gaussian white noise distribution. While we can learn an expressive top-down network to map the prior distribution to the data distribution, we can also learn an expressive prior model instead of assuming a given prior distribution. This follows the philosophy of empirical Bayes where the prior model is learned from the observed data. We propose to learn an energy-based prior model for the latent vector, where the energy function is parametrized by a very simple multi-layer perceptron. Due to the low-dimensionality of the latent space, learning a latent space energy-based prior model proves to be both feasible and desirable. In this paper, we develop the maximum likelihood learning algorithm and its variation based on short-run Markov chain Monte Carlo sampling from the prior and the posterior distributions of the latent vector, and we show that the learned model exhibits strong performance in terms of image and text generation and anomaly detection.


## 1 Introduction

In recent years, deep generative models have achieved impressive successes in image and text generation. A particularly simple and powerful model is the generator model Kingma and Welling (2014); Goodfellow et al. (2014), which maps a low-dimensional latent vector to an image or text via a top-down network. The generator model was proposed in the contexts of the variational auto-encoder (VAE) Kingma and Welling (2014); Rezende et al. (2014) and generative adversarial networks (GAN) Goodfellow et al. (2014); Radford et al. (2016). In both frameworks, the generator model is jointly learned with a complementary model, such as the inference model in VAE and the discriminator model in GAN. More recently, in Han et al. (2017); Nijkamp et al. (2019b), the generator model has also been learned by maximum likelihood without resorting to a complementary model, where the inference is carried out by Markov chain Monte Carlo (MCMC) such as the Langevin dynamics. In this paper, we shall adopt the framework of maximum likelihood estimation (MLE), instead of GAN or VAE, so that the learning is simpler in the sense that we do not need to train a complementary network.

The expressive power of the generator network for image and text generation comes from the top-down network that maps a simple prior distribution to be close to the data distribution. Most existing papers Makhzani et al. (2015); Tolstikhin et al. (2017); Arjovsky et al. (2017); Dai et al. (2017); Turner et al. (2019); Kumar et al. (2019) assume that the latent vector follows a given simple prior distribution, such as an isotropic Gaussian white noise distribution or a uniform distribution. However, such an assumption may cause ineffective generator learning, as observed in Dai and Wipf (2019b); Tomczak and Welling (2018a). While we can increase the complexity of the top-down network to enhance the expressive power of the model, in this paper, we shall pursue a different strategy by following the philosophy of empirical Bayes: instead of assuming a given prior distribution for the latent vector, we learn a prior model from empirical observations.

Specifically, we assume the latent vector follows an energy-based model (EBM), or more specifically, an energy-based correction of the isotropic Gaussian white noise prior distribution. We call this model the latent space energy-based prior model. Such a prior model adds more expressive power to the generator model.

The MLE learning of the generator model with a latent space EBM prior involves MCMC sampling of latent vector from both the prior and posterior distributions. Parameters of the prior model can then be updated based on the statistical difference between samples from the two distributions. Parameters of the top-down network can be updated based on the samples from the posterior distribution as well as the observed data.

Compared to GAN, which involves delicate dueling between two networks, MLE learning is simpler and does not suffer from issues such as instability or mode collapse. As for VAE, for a generator model with a latent space EBM prior, VAE training is not easily applicable because of the intractability of the normalizing constant of the latent space EBM.

Although MLE learning does not require training a complementary model, it requires MCMC sampling from prior and posterior distributions of the learned model. However, because MCMC sampling is carried out in the low-dimensional latent space instead of the high-dimensional data space, it is easily affordable on modern computing platforms. Compared to EBM built directly on image or text, the latent space EBM prior can be much less multi-modal, because it can rely on the top-down network to map the prior distribution to the highly multi-modal data distribution. A less multi-modal EBM is more amenable to MCMC sampling.

Furthermore, in this paper, we propose to use short-run MCMC sampling Nijkamp et al. (2019a, 2020, c), i.e., we always initialize MCMC from the fixed Gaussian white noise distribution, and we always run a fixed and small number of steps, in both training and testing stages. Such a learning algorithm is simple and efficient. We formulate this learning algorithm as a perturbation of MLE learning in terms of both objective function and estimating equation, so that the learning algorithm has a solid theoretical foundation.

We test the proposed modeling, learning and computing method on tasks such as image synthesis, text generation, as well as anomaly detection. We show that our method is competitive with prior art.

Contributions. (1) We propose a generator model with a latent space energy-based prior model by following the empirical Bayes philosophy. (2) We develop the maximum likelihood learning algorithm based on MCMC sampling of the latent vector from the prior and posterior distributions. (3) We further develop an efficient modification of MLE learning based on short-run MCMC sampling Nijkamp et al. (2019a, 2020, c). (4) We provide theoretical foundation for learning driven by short-run MCMC. (5) We provide strong empirical results to corroborate the proposed method.

## 2 Modeling strategies and related work

We now put our work within the bigger picture of modeling and learning, and discuss related work.

Energy-based model and top-down generation model. A top-down model or a directed acyclic graphical (DAG) model is of a simple factorized form that is capable of ancestral sampling. The prototype of such a model is factor analysis Rubin and Thayer (1982), which has been generalized to independent component analysis Hyvärinen et al. (2004), sparse coding Olshausen and Field (1997), non-negative matrix factorization Lee and Seung (2001), etc. An early example of a multi-layer top-down model is the generation model of the Helmholtz machine Hinton et al. (1995). An energy-based model defines an unnormalized density or a Gibbs distribution. Prototypes of such models are the exponential family distribution, the Boltzmann machine Ackley et al. (1985), and the FRAME (Filters, Random field, And Maximum Entropy) model Zhu et al. (1998). Zhu (2003) contrasted these two classes of models, calling the top-down latent variable model the generative model, and the energy-based model the descriptive model. Guo et al. (2003) proposed to integrate the two models, where the top-down generation model generates textons, while the EBM prior accounts for the spatial placement and arrangement of the textons. Our model follows such a scheme.

The energy function in the EBM can be viewed as the objective function, the cost function, the constraints, or a critic Sutton and Barto (2018). It is easy to specify, although optimizing or sampling the energy function can be hard and may require iterative algorithms such as MCMC. The maximum likelihood learning of EBM can be interpreted as an adversarial scheme Wu et al. (2019); Han et al. (2020); Lazarow et al. (2017); Xie et al. (2017), where the MCMC serves as the generator and the energy function serves as an evaluator. However, unlike GAN, the maximum likelihood learning of EBM does not suffer from issues such as mode collapse.

The top-down generation model can be viewed as an actor Sutton and Barto (2018) that directly generates the samples. It is easy to sample from, although one may need a complex top-down model to generate high quality samples. Comparing the two models, the EBM can be more expressive than a top-down model of the same complexity, while a top-down model is much easier to sample from. Therefore, it is desirable to let EBM take over the top layers of the top-down model to make the model more expressive, while EBM learning is still feasible.

Energy-based correction of top-down model. The top-down model usually assumes independent nodes at the top layer and conditional independent nodes at subsequent layers. We can introduce energy terms at multiple layers to correct for the independence or conditional independence assumptions. This leads to a latent energy-based model. However, unlike undirected latent EBM, the energy-based correction is learned on top of a directed top-down model, and this can be easier than learning an undirected latent EBM from scratch. Our work is a simple example of this scheme where we correct the prior distribution. We can also correct the generation model.

From data space EBM to latent space EBM. EBM learned in the data space, such as the image space Xie et al. (2016); Lu et al. (2016); Han et al. (2019); Nijkamp et al. (2019a); Du and Mordatch (2019), can be highly multi-modal, and MCMC sampling can be difficult. In that case, we can introduce latent variables and learn an EBM in the latent space, while also learning a mapping from the latent space to the data space. Our work follows such a strategy. Earlier papers on this strategy are Zhu (2003); Guo et al. (2003); Bengio et al. (2013); Brock et al. (2018); Kumar et al. (2019). Learning an EBM in the latent space can be much more feasible than learning an EBM in the data space in terms of MCMC sampling, and much of the past work on EBM can be recast in the latent space.

Short-run MCMC. Recently, Nijkamp et al. (2019a) proposed to use short-run MCMC to sample from the EBM in data space. Nijkamp et al. (2019c) proposed to use short-run MCMC to sample the latent variables of a top-down generation model from their posterior distribution. Our work adopts short-run MCMC to sample from both the prior and the posterior of the latent variables. We also provide theoretical foundation for the learning algorithm with short-run MCMC sampling.

Generator model with flexible prior. A few variants of VAE attempt to address the mismatch between the prior and the aggregate posterior. VampPrior Tomczak and Welling (2018b) parameterizes the prior based on the posterior inference model, while Bauer and Mnih (2019) propose to construct rich priors using rejection sampling with a learned acceptance function, both yielding improved performance on grey-scale images. ARAE Zhao et al. (2018) learns an implicit prior distribution in the latent space with adversarial training and demonstrates superior performance on text generation. Recently, some papers resort to a two-stage approach Dai and Wipf (2019a); Ghosh et al. (2020). They first train a VAE or a deterministic autoencoder (with some form of regularization) in the data space. To enable generation from the model, they then fit a VAE or a Gaussian mixture to the posterior samples inferred by the first-stage model. An earlier model related to the two-stage approach is GLO Bojanowski et al. (2017), where a generator is trained with inference conducted by gradient descent on the latent variables instead of by a separate inference network. All of these prior models by and large follow the empirical Bayes philosophy, which is also one motivation of our work.

## 3 Model and algorithm

### 3.1 Model

Let x be an observed example such as an image or a piece of text, and let z be the latent vector, where z∈Rd. The joint distribution of (x,z) is

 pθ(x,z)=pα(z)pβ(x|z), (1)

where pα(z) is the prior model with parameters α, pβ(x|z) is the top-down generation model with parameters β, and θ=(α,β).

The prior model is formulated as an energy-based model,

 pα(z)=1Z(α)exp(fα(z))p0(z), (2)

where p0(z) is a reference distribution, assumed to be isotropic Gaussian in this paper. fα(z) is the negative energy and is parameterized by a small multi-layer perceptron with parameters α. Z(α)=∫exp(fα(z))p0(z)dz is the normalizing constant or partition function.

The prior model (2) can be interpreted as an energy-based correction or exponential tilting of the original prior distribution p0(z), which is the prior distribution in the generator model of VAE.
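As a concrete illustration of equation (2), the following sketch (in plain NumPy; the latent dimension, layer widths, initialization, and tanh nonlinearity are our own assumptions, not from the paper) evaluates the unnormalized log-density fα(z) + log p0(z). This is all that gradient-based MCMC needs, since log Z(α) does not depend on z.

```python
import numpy as np

# Hypothetical sketch of the latent-space EBM prior in equation (2):
# p_alpha(z) ∝ exp(f_alpha(z)) p_0(z), with p_0 = N(0, I_d) and f_alpha a
# small multi-layer perceptron. All sizes here are illustrative choices.

rng = np.random.default_rng(0)
d, h = 16, 64                                 # latent dim, hidden width (assumed)
W1, b1 = rng.normal(0, 0.1, (h, d)), np.zeros(h)
W2, b2 = rng.normal(0, 0.1, (1, h)), np.zeros(1)

def f_alpha(z):
    """Negative energy: a two-layer perceptron mapping R^d to R."""
    return float(W2 @ np.tanh(W1 @ z + b1) + b2)

def log_p0(z):
    """Log-density of the isotropic Gaussian reference distribution p_0(z)."""
    return -0.5 * z @ z - 0.5 * d * np.log(2 * np.pi)

def unnormalized_log_prior(z):
    """log p_alpha(z) + log Z(alpha): sufficient for Langevin sampling."""
    return f_alpha(z) + log_p0(z)

z = rng.normal(size=d)
score = unnormalized_log_prior(z)
```

Because the correction fα is a small network on a low-dimensional z, evaluating and differentiating this log-density is cheap compared to an EBM in the data space.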

The generation model is the same as the top-down network in VAE. For image modeling,

 x=gβ(z)+ϵ, (3)

where ϵ∼N(0,σ2ID), so that pβ(x|z)∼N(gβ(z),σ2ID). As in VAE, σ2 takes an assumed value. For text modeling, let x=(x(1),...,x(T)), where each x(t) is a token. Following the previous text VAE model Bowman et al. (2016), we define pβ(x|z) as a conditional autoregressive model,

 pβ(x|z)=T∏t=1pβ(x(t)|x(1),...,x(t−1),z) (4)

which is often parameterized by a recurrent network with parameters β.

In the original generator model, the top-down network gβ maps the unimodal prior distribution p0(z) to be close to the usually highly multi-modal data distribution. The prior model in (2) refines p0(z) so that gβ maps the prior model pα(z) to be closer to the data distribution. The prior model pα(z) does not need to be highly multi-modal because of the expressiveness of gβ.

The marginal distribution is pθ(x)=∫pθ(x,z)dz. The posterior distribution is pθ(z|x)=pθ(x,z)/pθ(x).

In the above model, we exponentially tilt p0(z). We can also exponentially tilt p0(x,z)=p0(z)pβ(x|z) to pθ(x,z)=1Z(θ)exp(fα(x,z))p0(x,z). Equivalently, we may also exponentially tilt p0(z,ϵ), as the mapping from ϵ to x is a change of variable. This leads to an EBM in both the latent space and the data space, which makes learning and sampling more complex. Therefore, we choose to only tilt p0(z) and leave pβ(x|z) as a directed top-down generation model.

### 3.2 Maximum likelihood

Suppose we observe training examples (xi,i=1,...,n). The log-likelihood function is

 L(θ)=n∑i=1logpθ(xi). (5)

The learning gradient can be calculated according to

 ∇θlogpθ(x) =Epθ(z|x)[∇θlogpθ(x,z)]=Epθ(z|x)[∇θ(logpα(z)+logpβ(x|z))]. (6)

See Appendix 6.3 and 6.4 for a detailed derivation.

For the prior model, ∇αlogpα(z)=∇αfα(z)−Epα(z)[∇αfα(z)]. Thus the learning gradient for an example x is

 δα(x)=∇αlogpθ(x)=Epθ(z|x)[∇αfα(z)]−Epα(z)[∇αfα(z)]. (7)

The above equation has an empirical Bayes nature. Epθ(z|x) is based on the empirical observation x, while Epα(z) is based on the current prior model. α is updated based on the difference between z inferred from the empirical observation x and z sampled from the current prior.
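For intuition, here is a toy Monte Carlo version of the gradient in equation (7), under the simplifying (hypothetical) assumption of a linear energy fα(z)=⟨α,z⟩, so that ∇αfα(z)=z and the gradient reduces to the difference between posterior and prior sample means. The "posterior" and "prior" samples below are stand-in Gaussians, not actual short-run chains.

```python
import numpy as np

# Toy Monte Carlo estimate of the prior-model gradient in equation (7),
# assuming a linear energy f_alpha(z) = <alpha, z>, hence grad_alpha f = z.
# z_post / z_prior are placeholders for samples from p_theta(z|x) and p_alpha(z).

rng = np.random.default_rng(1)
d, m = 8, 512                                  # latent dim, samples per expectation
z_post = rng.normal(loc=0.5, size=(m, d))      # pretend posterior samples
z_prior = rng.normal(loc=0.0, size=(m, d))     # pretend prior samples

# delta_alpha(x) = E_post[grad f] - E_prior[grad f] = mean(z_post) - mean(z_prior)
delta_alpha = z_post.mean(axis=0) - z_prior.mean(axis=0)
```

The update raises fα on latent codes inferred from data and lowers it on latent codes drawn from the current prior, shifting the prior toward the aggregate posterior in the spirit of empirical Bayes.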

For the generation model,

 δβ(x)=∇βlogpθ(x)=Epθ(z|x)[∇βlogpβ(x|z)], (8)

where logpβ(x|z)=−∥x−gβ(z)∥2/(2σ2)+constant for image modeling, or logpβ(x|z)=∑Tt=1logpβ(x(t)|x(1),...,x(t−1),z) for text modeling, which amounts to the reconstruction error.

Expectations in (7) and (8) require MCMC sampling of the prior model pα(z) and the posterior distribution pθ(z|x). We can use Langevin dynamics Langevin (1908); Zhu and Mumford (1998). For a target distribution π(z), the dynamics iterates

 zk+1=zk+s∇zlogπ(zk)+√2sϵk, (9)

where k indexes the time step of the Langevin dynamics, s is a small step size, and ϵk∼N(0,Id) is the Gaussian white noise. π(z) can be either pα(z) or pθ(z|x). In either case, ∇zlogπ(z) can be efficiently computed by back-propagation.
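A minimal sketch of the Langevin update in equation (9), run on a toy target whose score is available in closed form (for the actual model, ∇z log π would come from back-propagation through fα, and additionally through gβ for posterior sampling):

```python
import numpy as np

# Langevin dynamics, equation (9), on a toy target pi(z) = N(0, I_d), for
# which grad_z log pi(z) = -z. The target and step count are illustrative.

def langevin(grad_log_pi, z0, n_steps, step_size, rng):
    z = z0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=z.shape)
        z = z + step_size * grad_log_pi(z) + np.sqrt(2 * step_size) * noise
    return z

rng = np.random.default_rng(2)
d = 4
z0 = np.full(d, 5.0)                # start the chain far from the mode
zK = langevin(lambda z: -z, z0, n_steps=200, step_size=0.1, rng=rng)
```

With enough steps the chain forgets its initialization and zK is approximately distributed as the target; the model only changes which score function is plugged in.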

It is worth noting that VAE is not conveniently applicable here. Even if we have a tractable approximation to pθ(z|x) in the form of an inference network, we still need to handle the intractable logZ(α) term in logpα(z), whose gradient requires MCMC sampling from the prior.

### 3.3 Short-run MCMC

Convergence of Langevin dynamics to the target distribution requires infinite steps with infinitesimal step size, which is impractical. We thus propose to use short-run MCMC Nijkamp et al. (2019a, 2020, c) for approximate sampling. This is in agreement with the philosophy of variational inference, which accepts the intractability of the target distribution and seeks to approximate it by a simpler distribution. The difference is that we adopt short-run Langevin dynamics instead of learning a separate network for approximation.

The short-run Langevin dynamics is always initialized from the fixed initial distribution p0(z), and only runs a fixed and small number of steps K,

 z0∼p0(z),zk+1=zk+s∇zlogπ(zk)+√2sϵk,k=1,...,K. (10)

Denote the distribution of zK to be ~π(z). Because of the fixed p0(z) and the fixed K and s, the distribution ~π is well defined. In this paper, we put the ~ sign on top of symbols to denote distributions or quantities produced by short-run MCMC, and for simplicity, we omit the dependence on K and s in the notation. As shown in Cover and Thomas (2006), DKL(~π∥π) decreases to zero monotonically as K→∞.

Specifically, denote the distribution of zK to be ~pα(z) if the target is π=pα(z), and denote the distribution of zK to be ~pθ(z|x) if π=pθ(z|x). We can then replace pα(z) by ~pα(z) and replace pθ(z|x) by ~pθ(z|x) in equations (7) and (8), so that the learning gradients in equations (7) and (8) are modified to

 ~δα(x)=E~pθ(z|x)[∇αfα(z)]−E~pα(z)[∇αfα(z)], (11) ~δβ(x)=E~pθ(z|x)[∇βlogpβ(x|z)]. (12)

We then update α and β based on (11) and (12), where the expectations can be approximated by Monte Carlo samples.

The short-run MCMC sampling is always initialized from the same initial distribution , and always runs a fixed number of steps. This is the case for both training and testing stages, which share the same short-run MCMC sampling.
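The defining features of short-run MCMC in equation (10) — a fixed initializer p0 = N(0, Id) and a fixed, small step count K, identical at training and testing time — can be sketched as follows; the target, K, and step size are illustrative placeholders:

```python
import numpy as np

# Short-run MCMC as in equation (10): always initialize from the fixed
# p_0 = N(0, I_d) and always run exactly K steps. The distribution of z_K
# is the ~pi used in the modified gradients (11)-(12).

def short_run_sample(grad_log_pi, d, K, s, rng):
    z = rng.normal(size=d)                      # z_0 ~ p_0(z), fixed initializer
    for _ in range(K):
        z = z + s * grad_log_pi(z) + np.sqrt(2 * s) * rng.normal(size=d)
    return z                                    # z_K, a draw from ~pi(z)

rng = np.random.default_rng(3)
# Toy target: Gaussian with mean 2, so grad log pi(z) = -(z - 2).
samples = np.stack([
    short_run_sample(lambda z: -(z - 2.0), d=2, K=50, s=0.1, rng=rng)
    for _ in range(200)
])
```

Because p0, K, and s are all fixed, zK has a well-defined distribution ~π, which is exactly what the perturbed objective in Section 3.5 analyzes.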

### 3.4 Algorithm

The learning and sampling algorithm is described in Algorithm 1.

The posterior sampling and prior sampling correspond to the positive phase and negative phase of latent EBM Ackley et al. (1985). Learning the prior model and learning the generation model are based on mini-batch Monte Carlo approximations to (11) and (12) respectively.
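Since the algorithm box itself is not reproduced here, the following one-dimensional sketch shows the assumed structure of Algorithm 1: alternate short-run Langevin sampling from the prior and the posterior, then update α and β by Monte Carlo versions of (11) and (12). The linear energy fα(z) = αz and linear generator gβ(z) = βz with σ = 1 are toy choices that make all gradients available in closed form; they are not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, s, lr = 200, 30, 0.1, 0.05
z_true = rng.normal(1.0, 1.0, n)
x = 2.0 * z_true + rng.normal(size=n)          # toy data from a linear model

alpha, beta = 0.0, 0.5
for _ in range(100):
    # Prior short-run chains: grad_z log p_alpha(z) = alpha - z (tilted N(0,1)).
    zp = rng.normal(size=n)
    for _ in range(K):
        zp = zp + s * (alpha - zp) + np.sqrt(2 * s) * rng.normal(size=n)
    # Posterior short-run chains: add the reconstruction term beta * (x - beta z).
    zq = rng.normal(size=n)
    for _ in range(K):
        zq = zq + s * (alpha - zq + beta * (x - beta * zq)) \
               + np.sqrt(2 * s) * rng.normal(size=n)
    # Updates (11)-(12): grad_alpha f = z; grad_beta log p_beta = z (x - beta z).
    alpha += lr * (zq.mean() - zp.mean())
    beta += lr * (zq * (x - beta * zq)).mean()
```

The same two-chain structure carries over to the deep model, with the closed-form scores replaced by back-propagated gradients through fα and gβ.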

### 3.5 Theoretical understanding

The learning algorithm based on short-run MCMC sampling in Algorithm 1 is a modification or perturbation of maximum likelihood learning, where we replace pα(z) and pθ(z|x) by ~pα(z) and ~pθ(z|x) respectively. For theoretical underpinning, we should also understand this perturbation in terms of objective function and estimating equation.

In terms of objective function, define the Kullback-Leibler divergence DKL(p∥q)=Ep[log(p(z)/q(z))]. At iteration t, with fixed θt=(αt,βt), consider the following perturbation of the log-likelihood function of θ for an observation x,

 log~pθ(x)=logpθ(x)−DKL(~pθt(z|x)∥pθ(z|x))+DKL(~pαt(z)∥pα(z)). (13)

The above log~pθ(x) is a function of θ, while θt is fixed. Then

 ~δα(x)=∇αlog~pθ(x),~δβ(x)=∇βlog~pθ(x), (14)

where the derivative is taken at θt. See Appendix 6.8 for details. Thus the updating rule of Algorithm 1 follows the stochastic gradient (i.e., Monte Carlo approximation of the gradient) of a perturbation of the log-likelihood (~pθ(x) above is not necessarily a normalized density function any more). Equivalently, because θt is fixed, we can drop the entropies of ~pθt(z|x) and ~pαt(z) in the above Kullback-Leibler divergences, hence the updating rule follows the stochastic gradient of

 Q(θ)=L(θ)+n∑i=1[E~pθt(zi|xi)[logpθ(zi|xi)]−E~pαt(z)[logpα(z)]], (15)

where L(θ) is the total log-likelihood defined in equation (5), and the gradient is taken at θ=θt.

In equation (13), the first DKL term is related to variational inference, although we do not learn a separate inference model. The second DKL term is related to contrastive divergence Tieleman (2008), except that the short-run chain behind ~pαt(z) is initialized from p0(z). It serves to cancel the intractable logZ(α) term.

In terms of estimating equation, the stochastic gradient descent in Algorithm 1 is a Robbins-Monro stochastic approximation algorithm Robbins and Monro (1951) that solves the following estimating equation:

 1nn∑i=1~δα(xi)=1nn∑i=1E~pθ(zi|xi)[∇αfα(zi)]−E~pα(z)[∇αfα(z)]=0, (16) 1nn∑i=1~δβ(xi)=1nn∑i=1E~pθ(zi|xi)[∇βlogpβ(xi|zi)]=0. (17)

The solution to the above estimating equation defines an estimator of the parameters. Algorithm 1 converges to this estimator under the usual regularity conditions of Robbins-Monro Robbins and Monro (1951). If we replace ~pθ(z|x) by pθ(z|x), and ~pα(z) by pα(z), then the above estimating equation is the maximum likelihood estimating equation.

The above theoretical understanding in terms of objective function and estimating equation is more general than that of maximum likelihood, which is a special case where the number of steps K→∞ and the step size s→0 in the Langevin dynamics in equation (10). Our theoretical understanding is clearly more relevant in practice, where we can only afford finite K with non-zero s.

As to the step size s, we currently treat it as a tuning parameter. It can be more formally optimized by maximizing Q(θ) in equation (15), or by maximizing the average of log~pθ(x) defined in equation (13). We may also allow different step sizes for different steps of the short-run Langevin dynamics. We leave this issue to future investigations.

## 4 Experiments

We present a set of experiments which highlight the effectiveness of our proposed model with (1) excellent synthesis for both visual and textual data outperforming state-of-the-art baselines, (2) high expressiveness of the learned prior model for both data modalities, and (3) strong performance in anomaly detection.

For image data, we include SVHN Netzer et al. (2011), CelebA Liu et al. (2015), and CIFAR-10 Krizhevsky et al. For text data, we include PTB Marcus et al. (1993), Yahoo Yang et al. (2017), and SNLI Bowman et al. (2015). We refer to Appendix 7.1 for details.

### 4.1 Image modeling

We evaluate the quality of the generated and reconstructed images. If the model is well learned, the latent EBM will fit the generator posterior, which in turn renders realistic generated samples as well as faithful reconstructions. We compare our model with VAE Kingma and Welling (2014) and SRI Nijkamp et al. (2019b), which assume a fixed Gaussian prior distribution for the latent vector, and with two recent strong VAE variants, 2sVAE Dai and Wipf (2019a) and RAE Ghosh et al. (2020), whose prior distributions are learned with posterior samples in a second stage. We also compare with the multi-layer generator model (i.e., with 5 layers of latent vectors) of Nijkamp et al. (2019b), which admits a powerful learned prior on the bottom layer of the latent vector. We follow the protocol of Nijkamp et al. (2019b).

Generation. The generator network in our framework is well learned to generate samples that are realistic and share visual similarities with the training data. Qualitative results are shown in Figure 2. We further evaluate our model quantitatively using the Fréchet Inception Distance (FID) Lucic et al. (2017) in Table 1. It can be seen that our model achieves superior generation performance compared to the listed baseline models.

Reconstruction. We then evaluate the accuracy of the posterior Langevin process by testing image reconstruction. A well-formed posterior Langevin should not only help learn the latent EBM but also match the true posterior of the generator model. We quantitatively compare reconstructions of test images with the above baseline models using mean squared error (MSE). From Table 1, our proposed model achieves not only high generation quality but also accurate reconstructions.

### 4.2 Text modeling

We compare our model to related baselines, SA-VAE Kim et al. (2018), FB-VAE Li et al. (2019), and ARAE Zhao et al. (2018). SA-VAE optimizes posterior samples with gradient descent guided by the ELBO, resembling the short-run dynamics in our model. FB-VAE is a state-of-the-art VAE for text modeling. While SA-VAE and FB-VAE assume a fixed Gaussian prior, ARAE learns a latent sample generator as an implicit prior distribution, trained adversarially together with a discriminator. To evaluate the quality of the generated samples, we follow Zhao et al. (2018); Cífka et al. (2018) and adopt Forward Perplexity (FPPL) and Reverse Perplexity (RPPL). FPPL is the perplexity of the generated samples evaluated under a language model trained with real data, and measures the fluency of the synthesized sentences. RPPL is the perplexity of real data (the test partition) computed under a language model trained with the model-generated samples. Prior work employs it to measure the distributional coverage of a learned model, since a model suffering from mode collapse yields a high RPPL. FPPL and RPPL are displayed in Table 2. Our model outperforms all the baselines on both metrics, demonstrating the high fluency and diversity of the samples from our model. We also evaluate the reconstruction of our model against the baselines using negative log-likelihood (NLL). Our model performs similarly to FB-VAE and ARAE, while all three outperform SA-VAE.

### 4.3 Analysis of latent space

Short-run chains. We examine the exponential tilting of the reference prior p0(z) through Langevin samples initialized from p0(z) with target distribution pα(z). As the reference distribution p0(z) is an isotropic Gaussian, we expect the energy-based correction to tilt p0(z) into an irregular shape. In particular, the learning gradient in equation (11) may form shallow local modes for fα. Therefore, the trajectory of a Markov chain initialized from the reference distribution p0(z) with a well-learned target pα(z) should depict the transition towards synthesized examples of high quality, while the energy fluctuates around some constant. Figure 3 and Table 3 depict such transitions for image and textual data, respectively, both based on models trained with a fixed number of short-run steps. For image data, the quality of synthesis improves significantly with an increasing number of steps. For textual data, there is an enhancement in semantics and syntax along the chain, which is especially clear from step 0 to 40 (see Table 3).

Long-run chains. While the learning algorithm recruits short-run MCMC with a small fixed number of steps to sample from the target distribution, a well-learned model should allow for Markov chains with realistic synthesis when run for many more steps. We demonstrate such long-run Markov chains in Figure 4. The long-run chain samples in the data space are reasonable and do not exhibit the oversaturating issue of the long-run chain samples of recent EBMs in the data space (see the oversaturated examples in Figure 3 of Nijkamp et al. (2020)).

### 4.4 Anomaly detection

We evaluate our model through the lens of anomaly detection. If the generator and EBM are well learned, then the posterior pθ(z|x) would form a discriminative latent space with separated probability densities for normal and anomalous data. Samples from this latent space can then be used as discriminative features to detect anomalies. We perform posterior sampling on the learned model to obtain the latent samples, and use the unnormalized log-posterior logpθ(x,z) as our decision function.
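This decision function can be sketched as follows, with toy stand-ins for the learned networks (the linear "generator", zero energy, and σ below are placeholders): the unnormalized log-posterior is logpθ(x,z) = fα(z) + logp0(z) + logpβ(x|z), up to constants, evaluated at a latent sample z. In practice, z would be a posterior sample drawn per test example rather than a shared fixed code.

```python
import numpy as np

# Anomaly score sketch: unnormalized log-posterior log p_theta(x, z),
# dropping additive constants. f_alpha, g_beta, and sigma are placeholders.

def score(x, z, f_alpha, g_beta, sigma=1.0):
    log_prior = f_alpha(z) - 0.5 * z @ z         # f_alpha(z) + log p_0(z) + const
    r = x - g_beta(z)
    log_lik = -0.5 * (r @ r) / sigma ** 2        # Gaussian reconstruction term
    return log_prior + log_lik

rng = np.random.default_rng(5)
d, D = 4, 10
A = rng.normal(0, 0.3, (D, d))                   # toy linear "generator"
z = rng.normal(size=d)
x_normal = A @ z + 0.1 * rng.normal(size=D)      # near the model manifold
x_anom = x_normal + 5.0                          # shifted off-manifold example
s_normal = score(x_normal, z, lambda z: 0.0, lambda z: A @ z)
s_anom = score(x_anom, z, lambda z: 0.0, lambda z: A @ z)
```

A lower score flags an anomaly: off-manifold examples incur a large reconstruction penalty, which is what separates normal from anomalous data in the experiments.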

Following the protocol as in Kumar et al. (2019); Zenati et al. (2018), we make each digit class an anomaly and consider the remaining 9 digits as normal examples. Our model is trained with only normal data and tested with both normal and anomalous data. We compare with the BiGAN-based anomaly detection Zenati et al. (2018), MEG Kumar et al. (2019) and VAE using area under the precision-recall curve (AUPRC) as in Zenati et al. (2018). Table 4 shows the results.

### 4.5 Ablation study

We investigate a range of factors that potentially affect model performance, with SVHN as an example. The highlighted number in Tables 5, 6, and 7 is the FID score reported in the main text and compared to other baseline models. It is obtained from the model with the architecture and hyperparameters specified in Table 8 and Table 9, which serve as the reference configuration for the ablation study.

Fixed prior. We examine the expressivity endowed by the EBM prior by comparing it to models with a fixed isotropic Gaussian prior. The results are displayed in Table 5. The model with an EBM prior clearly outperforms the model with a fixed Gaussian prior and the same generator as the reference model. The fixed-Gaussian models exhibit improved performance as the generator complexity increases. However, they still underperform the model with an EBM prior, even when the fixed-Gaussian-prior model has a generator with four times more parameters than that of the reference model.

MCMC steps. We also study how the number of short-run MCMC steps for prior sampling and for posterior sampling affects performance. The left panel of Table 6 shows the results for varying the number of prior steps, and the right panel for varying the number of posterior steps. As the number of MCMC steps increases, we observe improved quality of synthesis in terms of FID.

Prior EBM and generator complexity. Table 7 displays the FID scores as a function of the number of hidden features of the prior EBM (nef) and the factor of the number of channels of the generator (ngf, also see Table 9). In general, enhanced model complexity leads to improved generation.

### 4.6 Computational cost

We note that our method, involving MCMC sampling, is more computationally costly than methods with amortized inference such as VAE, which, however, suffers from issues like inaccurate inference and the mismatch between the prior and the aggregate posterior. Several works involving MCMC sampling attempt to improve VAE by either enhancing posterior inference (SA-VAE Kim et al. (2018)) or constructing a flexible prior with rejection sampling (LARS Bauer and Mnih (2019)) within the original VAE framework. In contrast, we adopt a maximum likelihood learning approach with short-run MCMC sampling and follow the philosophy of empirical Bayes. Our approach trades feasible computational cost for an expressive prior and simple, accurate inference. Consider SVHN as an example: training our model on a single NVIDIA 1080Ti takes roughly four times as long to converge as VAE training. Bearing this feasible cost, our method improves over strong baselines such as 2sVAE Dai and Wipf (2019a) and RAE Ghosh et al. (2020) on image and text modeling and anomaly detection.

We have also explored avenues to improve training speed and found that a PyTorch extension, NVIDIA Apex, is able to improve the training time of our model by a factor of 2.5. We test our method with Apex training on a larger-scale dataset, CelebA. The learned model is able to synthesize examples with high fidelity (see Figure 1 for examples).

## 5 Conclusion

This paper proposes a generalization of the generator model, where the latent vector follows a latent space EBM, which is a refinement or correction of the independent Gaussian or uniform noise prior in the original generator model. We adopt a simple maximum likelihood framework for learning, and develop a practical modification of the maximum likelihood learning algorithm based on short-run MCMC sampling from the prior and posterior distributions of the latent vector. We also provide a theoretical underpinning of the resulting algorithm as a perturbation of the maximum likelihood learning in terms of objective function and estimating equation. Our method combines the best of both top-down generative model and undirected EBM.

EBM has many applications; however, its soundness and its power are limited by the difficulty of MCMC sampling. By moving from the data space to the latent space, MCMC-based learning of EBM becomes sound and feasible, and we may release the power of EBM in the latent space for many applications.

## 6 Appendix A: Theoretical derivations

In this section, we derive most of the equations in the main text. We take a step-by-step approach, starting from simple identities or results and gradually reaching the main results. Our derivations are unconventional, but they pertain more directly to our model and learning method.

### 6.1 A simple identity

Let $p_\theta(x)$ be a probability distribution with parameter $\theta$. A useful identity is

$$\mathrm{E}_{\theta}[\nabla_\theta \log p_\theta(x)] = 0, \tag{18}$$

where $\mathrm{E}_\theta$ (or $\mathrm{E}_{p_\theta(x)}$) is the expectation with respect to $p_\theta(x)$.

The proof is a one-liner:

$$\mathrm{E}_{\theta}[\nabla_\theta \log p_\theta(x)] = \int [\nabla_\theta \log p_\theta(x)]\, p_\theta(x)\, dx = \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0. \tag{19}$$

The above identity has generalized versions, such as the one underlying the policy gradient Sutton et al. (2000): $\nabla_\theta \mathrm{E}_\theta[R(x)] = \mathrm{E}_\theta[R(x) \nabla_\theta \log p_\theta(x)]$, for a function $R(x)$ that does not depend on $\theta$. By letting $R(x) = 1$, we get (18).
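As a quick sanity check, the identity (18) can be verified numerically on a toy model. The following sketch is our illustration, not part of the paper: it uses $p_\theta(x) = \mathcal{N}(\theta, 1)$, whose score function is $x - \theta$.

```python
import numpy as np

# Monte Carlo check of identity (18) on a toy model p_theta(x) = N(theta, 1),
# whose score function is d/dtheta log p_theta(x) = x - theta. The average
# score under p_theta itself should vanish up to Monte Carlo error.
rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)
score = x - theta
print(abs(score.mean()))  # close to 0
```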

### 6.2 Maximum likelihood estimating equation

The simple identity (18) also underlies the consistency of MLE. Suppose we observe $x_i \sim p_{\theta_{\mathrm{true}}}(x)$, $i = 1, \dots, n$, independently, where $\theta_{\mathrm{true}}$ is the true value of $\theta$. The log-likelihood is

$$L(\theta) = \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i). \tag{20}$$

The maximum likelihood estimating equation is

$$L'(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta \log p_\theta(x_i) = 0. \tag{21}$$

According to the law of large numbers, as $n \to \infty$, the above estimating equation converges to

$$\mathrm{E}_{\theta_{\mathrm{true}}}[\nabla_\theta \log p_\theta(x)] = 0, \tag{22}$$

where $\theta$ is the unknown value to be solved, while $\theta_{\mathrm{true}}$ is fixed. According to the simple identity (18), $\theta = \theta_{\mathrm{true}}$ solves the above estimating equation (22), no matter what $\theta_{\mathrm{true}}$ is. Thus, under regularity conditions such as identifiability of the model, the MLE converges to $\theta_{\mathrm{true}}$ in probability.

The optimality of the maximum likelihood estimating equation among all the asymptotically unbiased estimating equations can be established based on a further generalization of the simple identity (18).
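To make the consistency argument concrete, here is a small numerical sketch (our illustration, using an assumed exponential model $p_\theta(x) = \theta e^{-\theta x}$, not a model from the paper): the score is $1/\theta - x$, so the estimating equation (21) is solved by $\hat{\theta} = 1/\bar{x}$, which approaches $\theta_{\mathrm{true}}$ as $n$ grows.

```python
import numpy as np

# MLE via the estimating equation (21) for p_theta(x) = theta * exp(-theta * x).
# The score is 1/theta - x, so (1/n) * sum_i (1/theta - x_i) = 0 gives
# theta_hat = 1 / mean(x), which converges to theta_true (consistency).
rng = np.random.default_rng(0)
theta_true = 2.0
x = rng.exponential(scale=1.0 / theta_true, size=200_000)
theta_hat = 1.0 / x.mean()
print(theta_hat)  # close to theta_true = 2.0
```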

We shall justify our learning method with short-run MCMC in terms of an estimating equation, which is a perturbation of the maximum likelihood estimating equation.

### 6.3 MLE learning gradient for θ

Recall that $p_\theta(x) = \int p_\theta(x, z)\, dz$, where $p_\theta(x, z) = p_\alpha(z) p_\beta(x|z)$ and $\theta = (\alpha, \beta)$. The learning gradient for an observation $x$ is as follows:

$$\nabla_\theta \log p_\theta(x) = \mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x, z)] = \mathrm{E}_{p_\theta(z|x)}[\nabla_\theta (\log p_\alpha(z) + \log p_\beta(x|z))]. \tag{23}$$

The above identity is a simple consequence of the simple identity (18).

$$\begin{aligned} \mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x,z)] &= \mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(z|x) + \nabla_\theta \log p_\theta(x)] &(24)\\ &= \mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(z|x)] + \mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x)] &(25)\\ &= 0 + \nabla_\theta \log p_\theta(x), &(26) \end{aligned}$$

because $\mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(z|x)] = 0$ according to the simple identity (18), applied with $p_\theta(z|x)$ in place of $p_\theta(x)$, while $\mathrm{E}_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x)] = \nabla_\theta \log p_\theta(x)$ because what is inside the expectation only depends on $x$ but does not depend on $z$.

The above identity (23) is related to the EM algorithm Dempster et al. (1977), where $x$ is the observed data, $z$ is the missing data, and $\log p_\theta(x, z)$ is the complete-data log-likelihood.
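Identity (23) can be checked on a fully tractable latent-variable model (our toy example, not the paper's model): $z \sim \mathcal{N}(0,1)$ and $x|z \sim \mathcal{N}(bz, 1)$, so that both $\nabla_b \log p_b(x)$ and the posterior expectation on the right-hand side have closed forms.

```python
# Deterministic check of identity (23) on a linear Gaussian model:
# z ~ N(0,1), x|z ~ N(b*z, 1), hence marginally x ~ N(0, b^2 + 1) and the
# posterior is p(z|x) = N(b*x/(b^2+1), 1/(b^2+1)).
b, x = 0.8, 1.3
s2 = b**2 + 1.0                     # marginal variance of x
lhs = b * (x**2 - s2) / s2**2       # direct gradient of log N(x; 0, s2) in b
m, v = b * x / s2, 1.0 / s2         # posterior mean and variance of z given x
rhs = x * m - b * (v + m**2)        # E_{p(z|x)}[d/db log p(x|z)] = E[z*(x - b*z)]
print(lhs, rhs)  # the two sides agree
```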

### 6.4 MLE learning gradient for α

For the prior model $p_\alpha(z) = \frac{1}{Z(\alpha)} \exp(f_\alpha(z))\, p_0(z)$, where the reference distribution $p_0(z)$ does not depend on $\alpha$, we have $\nabla_\alpha \log p_\alpha(z) = \nabla_\alpha f_\alpha(z) - \nabla_\alpha \log Z(\alpha)$. Applying the simple identity (18), we have

$$\mathrm{E}_\alpha[\nabla_\alpha \log p_\alpha(z)] = \mathrm{E}_\alpha[\nabla_\alpha f_\alpha(z) - \nabla_\alpha \log Z(\alpha)] = \mathrm{E}_\alpha[\nabla_\alpha f_\alpha(z)] - \nabla_\alpha \log Z(\alpha) = 0. \tag{27}$$

Thus

$$\nabla_\alpha \log Z(\alpha) = \mathrm{E}_\alpha[\nabla_\alpha f_\alpha(z)]. \tag{28}$$

Hence the derivative of the log prior density is

$$\nabla_\alpha \log p_\alpha(z) = \nabla_\alpha f_\alpha(z) - \nabla_\alpha \log Z(\alpha) = \nabla_\alpha f_\alpha(z) - \mathrm{E}_\alpha[\nabla_\alpha f_\alpha(z)]. \tag{29}$$
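For a tractable example of identity (28) (our illustration, not from the paper), take $f_\alpha(z) = \alpha z$ with reference distribution $p_0(z) = \mathcal{N}(0,1)$. Then $Z(\alpha) = e^{\alpha^2/2}$, so $\nabla_\alpha \log Z(\alpha) = \alpha$, while $p_\alpha(z) = \mathcal{N}(\alpha, 1)$ and $\nabla_\alpha f_\alpha(z) = z$, so $\mathrm{E}_\alpha[\nabla_\alpha f_\alpha(z)] = \alpha$ as well.

```python
import numpy as np

# Monte Carlo check of identity (28) for f_alpha(z) = alpha * z, p_0 = N(0, 1):
# Z(alpha) = exp(alpha^2 / 2), so grad_alpha log Z(alpha) = alpha, and
# p_alpha(z) = N(alpha, 1), so E_alpha[grad_alpha f_alpha(z)] = E_alpha[z] = alpha.
rng = np.random.default_rng(0)
alpha = 0.7
z = rng.normal(loc=alpha, scale=1.0, size=1_000_000)  # exact samples from p_alpha
lhs = alpha        # closed-form grad_alpha log Z(alpha)
rhs = z.mean()     # Monte Carlo estimate of E_alpha[z]
print(lhs, rhs)
```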

According to equation (23) in the previous subsection, the learning gradient for $\alpha$ is

$$\begin{aligned} \nabla_\alpha \log p_\theta(x) &= \mathrm{E}_{p_\theta(z|x)}[\nabla_\alpha \log p_\alpha(z)] &(30)\\ &= \mathrm{E}_{p_\theta(z|x)}\big[\nabla_\alpha f_\alpha(z) - \mathrm{E}_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)]\big] &(31)\\ &= \mathrm{E}_{p_\theta(z|x)}[\nabla_\alpha f_\alpha(z)] - \mathrm{E}_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)]. &(32) \end{aligned}$$
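In the learning algorithm, the two expectations in (32) are approximated by short-run MCMC samples. The following is a minimal sketch of short-run Langevin sampling from the prior $p_\alpha(z) \propto \exp(f_\alpha(z))\, p_0(z)$, again with the toy choice $f_\alpha(z) = \alpha z$ and $p_0 = \mathcal{N}(0,1)$ so that the target is $\mathcal{N}(\alpha, 1)$; the step size and chain length are illustrative choices, not the paper's settings.

```python
import numpy as np

# Short-run Langevin sampling from p_alpha(z) ∝ exp(f_alpha(z)) p_0(z), with
# f_alpha(z) = alpha * z and p_0 = N(0,1), so the target is N(alpha, 1).
# Chains are initialized from p_0; step size and length are illustrative.
rng = np.random.default_rng(0)
alpha, step, n_steps, n_chains = 0.7, 0.3, 200, 10_000

def grad_log_target(z):
    # grad_z [f_alpha(z) + log p_0(z)] = alpha - z for this toy case
    return alpha - z

z = rng.normal(size=n_chains)  # initialize from the reference p_0
for _ in range(n_steps):
    z = z + 0.5 * step**2 * grad_log_target(z) + step * rng.normal(size=n_chains)
print(z.mean(), z.var())  # near the target's mean alpha and variance 1
```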

### 6.5 Re-deriving the simple identity in terms of $D_{\mathrm{KL}}$

We shall provide a theoretical understanding of the learning method with short-run MCMC in terms of Kullback-Leibler divergences. We start from some simple results.

The simple identity (18) also follows from Kullback-Leibler divergence. Consider

$$D(\theta) = D_{\mathrm{KL}}(p_{\theta^*}(x) \,\|\, p_\theta(x)), \tag{33}$$

as a function of $\theta$ with $\theta^*$ fixed. Suppose the model is identifiable; then $D(\theta)$ achieves its minimum 0 at $\theta = \theta^*$, thus $D'(\theta^*) = 0$. Meanwhile,

$$D'(\theta) = -\mathrm{E}_{\theta^*}[\nabla_\theta \log p_\theta(x)]. \tag{34}$$

Thus

$$\mathrm{E}_{\theta^*}[\nabla_\theta \log p_{\theta^*}(x)] = 0. \tag{35}$$

Since $\theta^*$ is arbitrary in the above derivation, we can replace it by a generic $\theta$, i.e.,

$$\mathrm{E}_{\theta}[\nabla_\theta \log p_\theta(x)] = 0, \tag{36}$$

which is the simple identity (18).

As a notational convention, for a function $f(\theta)$, we write $f'(\theta_t) = f'(\theta)|_{\theta = \theta_t}$, i.e., the derivative of $f(\theta)$ at $\theta_t$.

### 6.6 Re-deriving the MLE learning gradient in terms of perturbation by $D_{\mathrm{KL}}$ terms

We now re-derive the MLE learning gradient in terms of a perturbation of the log-likelihood by Kullback-Leibler divergence terms. The learning method with short-run MCMC can then be easily understood.

At iteration $t$, fixing $\theta_t = (\alpha_t, \beta_t)$, we want to calculate the gradient of the log-likelihood function for an observation $x$, $\nabla_\theta \log p_\theta(x)$, at $\theta_t$. Consider the following perturbation of the log-likelihood

$$l(\theta) = \log p_\theta(x) - D_{\mathrm{KL}}(p_{\theta_t}(z|x) \,\|\, p_\theta(z|x)) + D_{\mathrm{KL}}(p_{\alpha_t}(z) \,\|\, p_\alpha(z)). \tag{37}$$

In the above, $D_{\mathrm{KL}}(p_{\theta_t}(z|x) \,\|\, p_\theta(z|x))$, as a function of $\theta$ with $\theta_t$ fixed, is minimized at $\theta = \theta_t$, thus its derivative at $\theta_t$ is 0. Similarly, $D_{\mathrm{KL}}(p_{\alpha_t}(z) \,\|\, p_\alpha(z))$, as a function of $\alpha$ with $\alpha_t$ fixed, is minimized at $\alpha = \alpha_t$, thus its derivative at $\alpha_t$ is 0. Thus

$$\nabla_\theta \log p_{\theta_t}(x) = l'(\theta_t). \tag{38}$$

We now unpack $l(\theta)$ so that we can obtain its derivative at $\theta_t$.

$$\begin{aligned} l(\theta) &= \log p_\theta(x) + \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_\theta(z|x)] - \mathrm{E}_{p_{\alpha_t}(z)}[\log p_\alpha(z)] + c &(39)\\ &= \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_\theta(x,z)] - \mathrm{E}_{p_{\alpha_t}(z)}[\log p_\alpha(z)] + c &(40)\\ &= \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_\alpha(z) + \log p_\beta(x|z)] - \mathrm{E}_{p_{\alpha_t}(z)}[\log p_\alpha(z)] + c &(41)\\ &= \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_\alpha(z)] - \mathrm{E}_{p_{\alpha_t}(z)}[\log p_\alpha(z)] + \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_\beta(x|z)] + c &(42)\\ &= \mathrm{E}_{p_{\theta_t}(z|x)}[f_\alpha(z)] - \mathrm{E}_{p_{\alpha_t}(z)}[f_\alpha(z)] + \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_\beta(x|z)] + c + c', &(43) \end{aligned}$$

where the $\log Z(\alpha)$ term gets canceled, and

$$\begin{aligned} c &= -\mathrm{E}_{p_{\theta_t}(z|x)}[\log p_{\theta_t}(z|x)] + \mathrm{E}_{p_{\alpha_t}(z)}[\log p_{\alpha_t}(z)], &(44)\\ c' &= \mathrm{E}_{p_{\theta_t}(z|x)}[\log p_0(z)] - \mathrm{E}_{p_{\alpha_t}(z)}[\log p_0(z)] &(45) \end{aligned}$$

do not depend on $\theta$. $c$ consists of two entropy terms. Now taking the derivative of $l(\theta)$ at $\theta_t$, we have

$$\begin{aligned} \delta_{\alpha_t}(x) &= \nabla_\alpha l(\theta_t) = \mathrm{E}_{p_{\theta_t}(z|x)}[\nabla_\alpha f_{\alpha_t}(z)] - \mathrm{E}_{p_{\alpha_t}(z)}[\nabla_\alpha f_{\alpha_t}(z)], &(46)\\ \delta_{\beta_t}(x) &= \nabla_\beta l(\theta_t) = \mathrm{E}_{p_{\theta_t}(z|x)}[\nabla_\beta \log p_{\beta_t}(x|z)]. &(47) \end{aligned}$$

In the above, we calculate the gradient of $\log p_\theta(x)$ at $\theta_t$. Since $\theta_t$ is arbitrary in the above derivation, if we replace $\theta_t$ by a generic $\theta$, we get the gradient of $\log p_\theta(x)$ at a generic $\theta$, i.e.,

$$\begin{aligned} \delta_\alpha(x) &= \nabla_\alpha \log p_\theta(x) = \mathrm{E}_{p_\theta(z|x)}[\nabla_\alpha f_\alpha(z)] - \mathrm{E}_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)], &(48)\\ \delta_\beta(x) &= \nabla_\beta \log p_\theta(x) = \mathrm{E}_{p_\theta(z|x)}[\nabla_\beta \log p_\beta(x|z)]. &(49) \end{aligned}$$
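As a closed-form check of (48) (our toy example, not the paper's model): with prior $p_\alpha(z) = \mathcal{N}(\alpha, 1)$ (i.e., $f_\alpha(z) = \alpha z$ with $p_0 = \mathcal{N}(0,1)$) and generator $x|z \sim \mathcal{N}(z, 1)$, the marginal is $x \sim \mathcal{N}(\alpha, 2)$, so $\nabla_\alpha \log p_\theta(x) = (x - \alpha)/2$, while the right-hand side of (48) is the posterior mean of $z$ minus the prior mean of $z$.

```python
# Deterministic check of (48): prior p_alpha(z) = N(alpha, 1) (f_alpha(z) = alpha*z,
# p_0 = N(0,1)), generator x|z ~ N(z, 1). Then x ~ N(alpha, 2) marginally, the
# posterior is p(z|x) = N((x + alpha)/2, 1/2), and grad_alpha f_alpha(z) = z.
alpha, x = 0.4, 1.1
lhs = (x - alpha) / 2.0            # grad_alpha log p_theta(x), closed form
post_mean = (x + alpha) / 2.0      # E_{p(z|x)}[z]
prior_mean = alpha                 # E_{p_alpha(z)}[z]
rhs = post_mean - prior_mean       # right-hand side of (48)
print(lhs, rhs)  # the two sides agree
```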

The above calculations are related to the EM algorithm and the learning of energy-based models.

In the EM algorithm Dempster et al. (1977), the surrogate $Q(\theta|\theta_t)$ serves as a substitute for the observed-data log-likelihood $\log p_\theta(x)$, where

$$Q(\theta|\theta_t) = \log p_\theta(x) - D_{\mathrm{KL}}(p_{\theta_t}(z|x) \,\|\, p_\theta(z|x)), \tag{50}$$

and $Q(\theta|\theta_t) \leq \log p_\theta(x)$, i.e., $Q(\theta|\theta_t)$ is a lower bound of $\log p_\theta(x)$, or minorizes the latter. $Q(\theta|\theta_t)$ and $\log p_\theta(x)$ touch each other at $\theta_t$, and they are co-tangent at $\theta_t$. Thus the derivative of $Q(\theta|\theta_t)$ at $\theta_t$ is the same as the derivative of $\log p_\theta(x)$ at $\theta_t$.

In EBM, $D_{\mathrm{KL}}(p_{\alpha_t}(z) \,\|\, p_\alpha(z))$ serves to cancel the $\log Z(\alpha)$ term in the EBM prior, and is related to the second divergence term in contrastive divergence.

### 6.7 Maximum likelihood estimating equation for $\theta = (\alpha, \beta)$

The MLE estimating equation is

$$\frac{1}{n} \sum_{i=1}^n \nabla_\theta \log p_\theta(x_i) = 0. \tag{51}$$

Based on (48) and (49), the estimating equations are

$$\frac{1}{n} \sum_{i=1}^n \delta_\alpha(x_i) = \frac{1}{n} \sum_{i=1}^n \mathrm{E}_{p_\theta(z_i|x_i)}[\nabla_\alpha f_\alpha(z_i)] - \mathrm{E}_{p_\alpha(z)}[\nabla_\alpha f_\alpha(z)] = 0, \tag{52}$$

$$\frac{1}{n} \sum_{i=1}^n \delta_\beta(x_i) = \frac{1}{n} \sum_{i=1}^n \mathrm{E}_{p_\theta(z_i|x_i)}[\nabla_\beta \log p_\beta(x_i|z_i)] = 0. \tag{53}$$