1 Introduction
Recently, flow-based models (henceforth simply called flow models) have gained popularity as a type of deep generative model dinh2014nice ; dinh2016density ; kingma2018Glow ; grathwohl2018ffjord ; behrmann2018invertible ; kumar2019videoflow ; tran2019discrete and for use in variational inference kingma2013auto ; rezende2015variational ; kingma2016improved .
Flow models have two properties that set them apart from other types of deep generative models: (1) they allow for efficient evaluation of the density function, and (2) they allow for efficient sampling from the model. Efficient evaluation of the log-density allows flow models to be directly optimized towards the log-likelihood objective, unlike variational autoencoders (VAEs) kingma2013auto ; rezende2014stochastic , which are optimized towards a bound on the log-likelihood, and generative adversarial networks (GANs) goodfellow2014generative . Autoregressive models graves2013generating ; oord2016wavenet ; salimans2017pixelcnn , on the other hand, are (in principle) inefficient to sample from, since synthesis requires computation that is proportional to the dimensionality of the data.

These properties of efficient density evaluation and efficient sampling are typically viewed as advantageous. However, they have a potential downside: these properties also act as assumptions on the true data distribution that the model is trying to capture. By choosing a flow model, one assumes that the true data distribution is in principle simple to sample from, and is computationally efficient to normalize. In addition, flow models assume that the data is generated by a finite sequence of invertible functions. If these assumptions do not hold, flow-based models can result in a poor fit.
On the other end of the spectrum of deep generative models lies the family of energy-based models (EBMs) lecun2006tutorial ; ngiam2011learning ; kim2016deep ; zhao2016energy ; xie2016theory ; gao2018learning ; kumar2019maximum ; nijkamp2019learning ; du2019implicit ; finn2016connection . Energy-based models define an unnormalized density that is the exponential of the negative energy function. The energy function is directly defined as a (learned) scalar function of the input, and is often parameterized by a neural network, such as a convolutional network lecun1998gradient ; krizhevsky2012imagenet . Evaluation of the density function for a given datapoint involves calculating a normalizing constant, which requires an intractable integral. Sampling from EBMs is expensive and requires approximation as well, such as computationally expensive Markov Chain Monte Carlo (MCMC) sampling. EBMs therefore make neither of the two assumptions above: they do not assume that the density of the data is easily normalized, and they do not assume efficient synthesis. Moreover, they do not constrain the data distribution to be generated by invertible functions.
Contrasting an EBM with a flow model, the former is on the side of representation, where different layers represent features of different complexities, whereas the latter is on the side of learned computation, where each layer, or each transformation, is like a step in a computation. The EBM is like an objective function or a target distribution, whereas the flow model is like a finite-step iterative algorithm or a sampler. The EBM can be simpler and more flexible in form than the flow model, which is highly constrained, and thus the EBM may capture the modes of the data distribution more accurately than the flow model.
In contrast, the flow model is capable of direct generation via ancestral sampling, which is sorely lacking in an EBM. It may thus be desirable to train the two models jointly, combining the best of both worlds. This is the goal of this paper.
Our joint training method is inspired by the noise contrastive estimation (NCE) of gutmann2010noise , where an EBM is learned discriminatively by classifying the real data and the data generated by a noise model. In NCE, the noise model must have an explicit normalized density function. Moreover, it is desirable for the noise distribution to be close to the data distribution for accurate estimation of the EBM; in practice, however, the noise distribution can be far away from the data distribution. A flow model can transform, or transport, the noise distribution to a distribution closer to the data distribution. With the advent of strong flow-based generative models dinh2014nice ; dinh2016density ; kingma2018Glow , it is natural to recruit the flow model as the contrast distribution for noise contrastive estimation of the EBM.

However, even a flow-based model pre-trained by maximum likelihood estimation (MLE) on the data distribution may not be strong enough as a contrast distribution, in the sense that the synthesized examples it generates may still be easily distinguished from the real examples by a classifier based on an EBM. Thus, we want the flow model to be a stronger contrast, or a stronger training opponent, for the EBM. To achieve this goal, we can simply use the same objective function as NCE, which is the log-likelihood of the logistic regression for classification. While NCE updates the EBM by maximizing this objective function, we can also update the flow model by minimizing the same objective function, making the classification task harder for the EBM. Such an update of the flow model combines MLE and variational approximation, and helps correct the over-dispersion of MLE. If the EBM is close to the data distribution, this amounts to minimizing the Jensen-Shannon divergence (JSD) goodfellow2014generative between the data distribution and the flow model. In this sense, the learning scheme relates closely to GANs goodfellow2014generative . However, unlike GANs, which learn a generator that defines an implicit probability density function via a low-dimensional latent vector, our method learns two probabilistic models with explicit probability densities (a normalized one and an unnormalized one).
The contributions of our paper are as follows. We explore a parameter estimation method that couples the estimation of an EBM and a flow model using a shared objective function. It improves NCE with a flow-transformed noise distribution, and it modifies MLE of the flow model to approximate JSD minimization, which helps correct the over-dispersion of MLE. Experiments on 2D synthetic data show that the learned EBM achieves accurate density estimation with a much simpler network structure than the flow model. On real image datasets, we demonstrate a significant improvement in the synthesis quality of the flow model, and the effectiveness of unsupervised feature learning by the energy-based model. Furthermore, we show that the proposed method can easily be adapted to semi-supervised learning, achieving performance comparable to state-of-the-art semi-supervised methods.
2 Related work
For learning the energy-based model by MLE, the main difficulty lies in drawing fair samples from the current model. A prominent approximation of MLE is the contrastive divergence (CD) hinton2002training framework, which requires MCMC initialized from the data distribution. CD has been generalized to persistent CD tieleman2008training , and more recently to modified CD gao2018learning and adversarial CD kim2016deep ; dai2017calibrating ; han2018divergence with modern CNN structures. nijkamp2019learning ; du2019implicit scale up sampling-based methods to large image datasets with white noise as the starting point of sampling. However, these sampling-based methods may still have difficulty traversing different modes of the learned model, which may result in a biased model, and may take a long time to converge. An advantage of noise contrastive estimation (NCE), and of our adaptive version of it, is that it avoids MCMC sampling in the estimation of the energy-based model, by turning the estimation problem into a classification problem.
Generalizing from tu2007learning , jin2017introspective ; lazarow2017introspective ; lee2018wasserstein developed an introspective parameter estimation method, where the EBM is discriminatively learned and composed of a sequence of discriminative models obtained through the learning process.
NCE and its variants have gained popularity in natural language processing (NLP) he2016training ; oualil2017batch ; baltescu2014pragmatic ; bose2018adversarial . mnih2012fast ; mnih2013learning applied NCE to log-bilinear models, and in vaswani2013decoding NCE is applied to neural probabilistic language models. NCE has shown effectiveness in typical NLP tasks such as word embeddings mikolov2013distributed and order embeddings vendrov2015order .

In the context of inverse reinforcement learning, levine2013guided proposes a guided policy search method, and finn2016connection connects it to GAN. Our method is closely related to this method, where the energy function can be viewed as the cost function, and the flow model can be viewed as the unrolled policy.

3 Learning method
3.1 Energybased model
Let $x$ be the input variable, such as an image. We use $p_\theta(x)$ to denote a model's probability density function of $x$ with parameter $\theta$. The energy-based model (EBM) is defined as follows:

$$p_\theta(x) = \frac{1}{Z(\theta)} \exp\left(f_\theta(x)\right), \qquad (1)$$

where $f_\theta(x)$ is defined by a bottom-up convolutional neural network whose parameters are denoted by $\theta$. The normalizing constant $Z(\theta) = \int \exp(f_\theta(x))\,dx$ is intractable to compute exactly for high-dimensional $x$.

3.1.1 Maximum likelihood estimation
The energy-based model in eqn. 1 can be estimated from unlabeled data by maximum likelihood estimation (MLE). Suppose we observe training examples $x_1, \ldots, x_n$ from an unknown true distribution $p_{\mathrm{data}}(x)$. We can view this dataset as forming an empirical data distribution, and thus the expectation with respect to $p_{\mathrm{data}}$ can be approximated by averaging over the training examples. In MLE, we seek to maximize the log-likelihood function

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \approx \mathbb{E}_{p_{\mathrm{data}}}\left[\log p_\theta(x)\right]. \qquad (2)$$

Maximizing the log-likelihood function is equivalent to minimizing the Kullback-Leibler divergence $\mathrm{KL}(p_{\mathrm{data}} \| p_\theta)$ for large $n$. Its gradient can be written as:

$$L'(\theta) = \mathbb{E}_{p_{\mathrm{data}}}\left[\frac{\partial}{\partial \theta} f_\theta(x)\right] - \mathbb{E}_{p_\theta}\left[\frac{\partial}{\partial \theta} f_\theta(x)\right], \qquad (3)$$
which is the difference between the expectations of the gradient of $f_\theta(x)$ under $p_{\mathrm{data}}$ and $p_\theta$ respectively. The expectations can be approximated by averaging over the observed examples and over synthesized samples generated from the current model, respectively. The difficulty lies in the fact that sampling from $p_\theta$ requires MCMC, such as Hamiltonian Monte Carlo or Langevin dynamics behrmann2018invertible ; zhu1998grade , which may take a long time to converge, especially on high-dimensional and multi-modal spaces such as the space of images.
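To make the gradient in eqn. 3 concrete, the following toy sketch checks it numerically in one dimension. Everything here is an illustrative assumption: the quadratic energy, the grid discretization, and exact sampling on the grid standing in for the MCMC a real EBM would need. At the true parameter, the two expectations agree and the gradient is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy EBM on a 1-D grid: f_theta(x) = theta1*x + theta2*x^2, an unnormalized
# Gaussian. Exact sampling on the grid stands in for MCMC.
xs = np.linspace(-5.0, 5.0, 1001)

def model_probs(theta):
    logits = theta[0] * xs + theta[1] * xs**2
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta_star = np.array([1.0, -0.5])            # true model: roughly N(1, 1)
data = rng.choice(xs, size=50_000, p=model_probs(theta_star))

def grad_loglik(theta, n_model=50_000):
    # grad = E_data[df/dtheta] - E_model[df/dtheta], with df/dtheta = (x, x^2)
    samples = rng.choice(xs, size=n_model, p=model_probs(theta))
    g_data = np.array([data.mean(), (data**2).mean()])
    g_model = np.array([samples.mean(), (samples**2).mean()])
    return g_data - g_model

# At theta = theta_star the two expectations agree, so the gradient is near zero.
print(grad_loglik(theta_star))
```

For any other parameter value, the mismatch between the two expectations gives the ascent direction toward the data statistics.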
The MLE of $p_\theta$ seeks to cover all the modes of $p_{\mathrm{data}}$. Given the flexibility of the model form of $f_\theta$, the MLE of $p_\theta$ has the chance to approximate $p_{\mathrm{data}}$ reasonably well.
3.1.2 Noise contrastive estimation
Noise contrastive estimation (NCE) gutmann2010noise can be used to learn the EBM, by including the normalizing constant as another learnable parameter. Specifically, for an energy-based model $p_\theta(x) = \frac{1}{Z(\theta)}\exp(f_\theta(x))$, we define $p_\theta(x) = \exp(f_\theta(x) - c)$, where $c = \log Z(\theta)$. $c$ is now treated as a free parameter, and is included into $\theta$. Suppose we observe training examples $x_1, \ldots, x_n$, and we have generated examples $\tilde{x}_1, \ldots, \tilde{x}_n$ from a noise distribution $q(x)$. Then $\theta$ can be estimated by maximizing the following objective function, where the expectations are approximated by averages over $\{x_i\}$ and $\{\tilde{x}_i\}$:

$$J(\theta) = \mathbb{E}_{p_{\mathrm{data}}}\left[\log \frac{p_\theta(x)}{p_\theta(x) + q(x)}\right] + \mathbb{E}_{q}\left[\log \frac{q(\tilde{x})}{p_\theta(\tilde{x}) + q(\tilde{x})}\right], \qquad (4)$$
which transforms the estimation of the EBM into a classification problem.
The objective function connects to logistic regression in supervised learning in the following sense. Suppose for each training or generated example $x$ we assign a binary class label $y$: $y = 1$ if $x$ is from the training dataset, and $y = 0$ if $x$ is generated from $q(x)$. In logistic regression, the posterior probabilities of the classes given the data $x$ are estimated. As the data distribution is unknown, the class-conditional probability $p(x \mid y = 1)$ is modeled with $p_\theta(x)$, and $p(x \mid y = 0)$ is modeled by $q(x)$. Suppose we assume equal probabilities for the two class labels, i.e., $p(y=1) = p(y=0) = 1/2$. Then we obtain the posterior probabilities:

$$p(y = 1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + q(x)}, \qquad p(y = 0 \mid x) = \frac{q(x)}{p_\theta(x) + q(x)}. \qquad (5)$$

The class labels $y$ are Bernoulli-distributed, so that the log-likelihood of the parameter $\theta$ becomes

$$\ell(\theta) = \sum_{i=1}^{n} \log p(y = 1 \mid x_i) + \sum_{i=1}^{n} \log p(y = 0 \mid \tilde{x}_i), \qquad (6)$$

which is, up to a factor of $n$, an approximation of eqn. 4.
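The classification view of NCE can be sketched in one dimension. Everything in this sketch is an illustrative assumption (the quadratic energy, the standard-Gaussian noise, the optimizer settings): gradient ascent on the logistic log-likelihood, with the log density ratio as the logit, recovers the log-density of the data distribution N(1, 1) up to the free normalizing parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# EBM: p_theta(x) = exp(f(x) - c) with f(x) = t1*x + t2*x^2; c is a free parameter.
def log_p(x, theta):
    t1, t2, c = theta
    return t1 * x + t2 * x**2 - c

def log_q(x):  # noise distribution: standard Gaussian with tractable density
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def feats(x):  # d log_p / d theta
    return np.stack([x, x**2, -np.ones_like(x)], axis=1)

data = rng.normal(1.0, 1.0, 10_000)    # true distribution N(1, 1)
noise = rng.normal(0.0, 1.0, 10_000)

theta = np.zeros(3)
for _ in range(2000):                  # ascend the NCE objective (logistic regression)
    s_d = 1.0 / (1.0 + np.exp(-(log_p(data, theta) - log_q(data))))
    s_n = 1.0 / (1.0 + np.exp(-(log_p(noise, theta) - log_q(noise))))
    grad = ((1 - s_d)[:, None] * feats(data)).mean(0) \
         - (s_n[:, None] * feats(noise)).mean(0)
    theta += 0.1 * grad

# N(1, 1) has log density x - x^2/2 + const, so t1 -> ~1 and t2 -> ~-0.5.
print(theta)
```

The objective is concave in the parameters (it is logistic regression in the features $(x, x^2, -1)$), so plain gradient ascent suffices here.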
The choice of the noise distribution $q$ is a design issue. Generally speaking, we expect $q$ to satisfy the following: (1) it has an analytically tractable expression of a normalized density; (2) it is easy to draw samples from; (3) it is close to the data distribution. In practice, (3) is important for learning a model over high-dimensional data. If $q$ is not close to the data distribution, the classification problem is too easy and does not require $p_\theta$ to learn much about the modality of the data.

3.2 Flow-based model
A flow model is of the form

$$x = g_\alpha(z), \qquad z \sim q_0(z), \qquad (7)$$

where $q_0(z)$ is a known noise distribution. $g_\alpha$ is a composition of a sequence of invertible transformations where the log-determinants of the Jacobians of the transformations can be explicitly obtained. $\alpha$ denotes the parameters. Let $q_\alpha(x)$ be the probability density of the model at a datapoint $x$ with parameter $\alpha$. Then under the change of variables, $q_\alpha(x)$ can be expressed as

$$q_\alpha(x) = q_0\left(g_\alpha^{-1}(x)\right) \left| \det\left(\frac{\partial g_\alpha^{-1}(x)}{\partial x}\right) \right|. \qquad (8)$$
More specifically, suppose $g_\alpha$ is composed of a sequence of transformations $g_\alpha = g_1 \circ g_2 \circ \cdots \circ g_m$. The relation between $z$ and $x$ can be written as $z \leftrightarrow h_1 \leftrightarrow h_2 \leftrightarrow \cdots \leftrightarrow h_{m-1} \leftrightarrow x$. And thus we have

$$\log q_\alpha(x) = \log q_0(z) + \sum_{i=1}^{m} \log \left| \det\left(\frac{\partial h_{i-1}}{\partial h_i}\right) \right|, \qquad (9)$$

where we define $h_0 \triangleq z$ and $h_m \triangleq x$ for conciseness. With carefully designed transformations, as explored in flow-based methods, the determinant of the Jacobian matrix can be very simple to compute. The key idea is to choose transformations whose Jacobian is a triangular matrix, so that the determinant becomes the product of the diagonal entries:

$$\left| \det\left(\frac{\partial h_{i-1}}{\partial h_i}\right) \right| = \prod_j \left| \left(\frac{\partial h_{i-1}}{\partial h_i}\right)_{jj} \right|. \qquad (10)$$
The following are the two scenarios for estimating $\alpha$:

(1) Generative modeling by MLE dinh2014nice ; dinh2016density ; kingma2018Glow ; grathwohl2018ffjord ; behrmann2018invertible ; kumar2019videoflow ; tran2019discrete , based on $\min_\alpha \mathrm{KL}(p_{\mathrm{data}} \| q_\alpha)$, where again the expectation under $p_{\mathrm{data}}$ can be approximated by an average over the observed examples.

(2) Variational approximation to an unnormalized target density $p_\theta$ kingma2013auto ; rezende2015variational ; kingma2016improved ; kingma2014efficient ; khemakhem2019variational , based on $\min_\alpha \mathrm{KL}(q_\alpha \| p_\theta)$, where

$$\mathrm{KL}(q_\alpha \| p_\theta) = \mathbb{E}_{q_\alpha}\left[\log q_\alpha(x)\right] - \mathbb{E}_{q_\alpha}\left[f_\theta(x)\right] + \log Z(\theta) \qquad (11)$$

is, up to the constant $\log Z(\theta)$, the difference between energy and negative entropy, i.e., we want $q_\alpha$ to have low energy but high entropy. This divergence can be calculated without inversion of $g_\alpha$, since the expectations can be approximated by averaging over $x = g_\alpha(z)$, $z \sim q_0(z)$.

When $q_\alpha$ appears on the right of the KL divergence, as in (1), it is forced to cover most of the modes of $p_{\mathrm{data}}$. When $q_\alpha$ appears on the left of the KL divergence, as in (2), it tends to chase the major modes of $p_\theta$ while ignoring the minor modes murphy2012machine ; fox2012tutorial . As shown in the following section, our proposed method learns a flow model by combining (1) and (2).
3.3 Flow Contrastive Estimation
A natural improvement to NCE is to transform the noise so that the resulting distribution is closer to the data distribution. This is exactly what a flow model achieves. A flow model is of the form $x = g_\alpha(z)$, where $z \sim q_0(z)$, a known noise distribution. $g_\alpha$ is a composition of a sequence of invertible transformations, and $\alpha$ denotes the parameters. Let $q_\alpha(x)$ be the probability density of $x$. It fulfills requirements (1) and (2) of NCE. However, in practice, we find that a pre-trained $q_\alpha$, such as one learned by MLE, is not strong enough for learning an EBM, because the synthesized data from the MLE of $q_\alpha$ can still be easily distinguished from the real data by an EBM. Thus, we propose to iteratively train the EBM and the flow model, in which case the flow model is adaptively adjusted to become a stronger contrast distribution, or a stronger training opponent, for the EBM. This is achieved by a parameter estimation scheme similar to GAN, where $p_\theta$ and $q_\alpha$ play a minimax game with a unified value function: $\min_\alpha \max_\theta V(\theta, \alpha)$,

$$V(\theta, \alpha) = \mathbb{E}_{p_{\mathrm{data}}}\left[\log \frac{p_\theta(x)}{p_\theta(x) + q_\alpha(x)}\right] + \mathbb{E}_{z \sim q_0}\left[\log \frac{q_\alpha(g_\alpha(z))}{p_\theta(g_\alpha(z)) + q_\alpha(g_\alpha(z))}\right], \qquad (12)$$

where $\mathbb{E}_{p_{\mathrm{data}}}$ is approximated by averaging over observed samples $\{x_i\}$, while $\mathbb{E}_{z \sim q_0}$ is approximated by averaging over negative samples $\{\tilde{x}_i\}$ drawn from $q_\alpha$, with $\tilde{x}_i = g_\alpha(z_i)$, $z_i \sim q_0(z)$ independently for $i = 1, \ldots, n$. In the experiments, we choose Glow kingma2018Glow as the flow-based model. The algorithm can start either from a randomly initialized Glow model or from one pre-trained by MLE. Here we assume equal prior probabilities for observed samples and negative samples. This can easily be modified to a situation where we assign a higher prior probability to the negative samples, given the fact that we have access to an infinite amount of free negative samples.
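The minimax scheme above can be sketched in one dimension. Everything here is an illustrative assumption, not the paper's architecture: a quadratic energy stands in for the CNN, a scale-shift transform of Gaussian noise stands in for Glow, and finite differences stand in for backpropagation through the flow. Alternating NCE ascent steps for the EBM with descent steps for the flow on the same value function moves the flow toward the data distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

data = rng.normal(2.0, 0.5, 2000)   # observed samples from N(2, 0.5^2)
z = rng.normal(size=2000)           # fixed noise batch (common random numbers)

def log_p(x, th):                   # EBM: exp(t1*x + t2*x^2 - c), c a free parameter
    return th[0] * x + th[1] * x**2 - th[2]

def log_q(x, al):                   # flow density: x = m + exp(ls) * z, z ~ N(0, 1)
    m, ls = al
    return -0.5 * ((x - m) * np.exp(-ls))**2 - ls - 0.5 * np.log(2.0 * np.pi)

def value(th, al):                  # the shared value function
    fake = al[0] + np.exp(al[1]) * z
    r_d = log_p(data, th) - log_q(data, al)
    r_f = log_p(fake, th) - log_q(fake, al)
    return -np.logaddexp(0.0, -r_d).mean() - np.logaddexp(0.0, r_f).mean()

def ebm_grad(th, al):               # analytic logistic-regression gradient in theta
    fake = al[0] + np.exp(al[1]) * z
    feats = lambda x: np.stack([x, x**2, -np.ones_like(x)], axis=1)
    s_d = 1.0 / (1.0 + np.exp(-(log_p(data, th) - log_q(data, al))))
    s_f = 1.0 / (1.0 + np.exp(-(log_p(fake, th) - log_q(fake, al))))
    return ((1 - s_d)[:, None] * feats(data)).mean(0) \
         - (s_f[:, None] * feats(fake)).mean(0)

th, al = np.zeros(3), np.array([0.0, 0.0])
for _ in range(500):
    for _ in range(10):             # EBM step: ascend the value function (NCE update)
        th += 0.1 * ebm_grad(th, al)
    g, eps = np.zeros(2), 1e-5      # flow step: descend it (finite differences)
    for i in range(2):
        d = np.zeros(2); d[i] = eps
        g[i] = (value(th, al + d) - value(th, al - d)) / (2 * eps)
    al -= 0.05 * g

print(al[0], np.exp(al[1]))         # flow mean and scale move toward 2.0 and 0.5
```

Note that, unlike a GAN generator, the flow's own density enters the value function, so its update receives both an MLE-like signal on real data and a mode-chasing signal on its own samples.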
The objective function can be interpreted from the following perspectives:
(1) Noise contrastive estimation for the EBM. The update of $\theta$ can be seen as noise contrastive estimation of $p_\theta$, but with a flow-transformed noise distribution $q_\alpha$ which is adaptively updated. The training is essentially a logistic regression. However, unlike regular logistic regression for classification, for each $x_i$ or $\tilde{x}_i$, we must include $\log q_\alpha(x_i)$ or $\log q_\alpha(\tilde{x}_i)$ as an example-dependent bias term. This forces $p_\theta$ to replicate $q_\alpha$ in addition to distinguishing between $p_{\mathrm{data}}$ and $q_\alpha$, so that $p_\theta(x)$ is in general larger than $q_\alpha(x)$ for $x \sim p_{\mathrm{data}}$, and is in general smaller than $q_\alpha(x)$ for $x \sim q_\alpha$.
(2) Minimization of Jensen-Shannon divergence for the flow model. If $p_\theta$ is close to the data distribution, then the update of $\alpha$ approximately minimizes the Jensen-Shannon divergence between the flow model $q_\alpha$ and the data distribution $p_{\mathrm{data}}$:

$$\mathrm{JSD}(q_\alpha \| p_{\mathrm{data}}) = \frac{1}{2}\,\mathrm{KL}\!\left(p_{\mathrm{data}} \,\Big\|\, \frac{p_{\mathrm{data}} + q_\alpha}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(q_\alpha \,\Big\|\, \frac{p_{\mathrm{data}} + q_\alpha}{2}\right). \qquad (13)$$

In this regime, the gradient of the value function w.r.t. $\alpha$ approximates the gradient of this divergence. The gradient of the first term resembles that of MLE, which forces $q_\alpha$ to cover the modes of the data distribution, and tends to lead to an over-dispersed model, as also pointed out in kingma2018Glow . The gradient of the second term is similar to that of the reverse Kullback-Leibler divergence between $q_\alpha$ and $p_\theta$, or a variational approximation of $p_\theta$ by $q_\alpha$, which forces $q_\alpha$ to chase the modes of $p_\theta$ murphy2012machine ; fox2012tutorial . This may help correct the over-dispersion of MLE, and combines the two scenarios of estimating the flow-based model described in section 3.2.
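For intuition, the basic properties of the JSD (symmetry, zero at equality, an upper bound of log 2) are easy to verify numerically on small discrete distributions; this sketch uses the convention with 1/2 weights on each KL term, and the two example distributions are arbitrary.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions p and q.
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    # Jensen-Shannon divergence: average KL to the mixture midpoint.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

# JSD is symmetric, zero iff p == q, and bounded above by log 2.
print(jsd(p, q), jsd(q, p), jsd(p, p))
```

Unlike either KL direction alone, the JSD is finite even when the supports of the two distributions barely overlap, which is part of why it is a convenient training target.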
(3) Connection with GAN. Our parameter estimation scheme is closely related to GAN. In GAN, the discriminator $D$ and generator $G$ play a minimax game: $\min_G \max_D V(D, G)$,

$$V(D, G) = \mathbb{E}_{p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim q_0}\left[\log\left(1 - D(G(z))\right)\right]. \qquad (14)$$

The discriminator is learning the probability ratio $p_{\mathrm{data}}(x) / (p_{\mathrm{data}}(x) + q(x))$, which is about the difference between $p_{\mathrm{data}}$ and the generator density $q$ finn2016connection . In the end, if the generator learns to perfectly replicate $p_{\mathrm{data}}$, then the discriminator ends up with a random guess. However, in our method, the ratio is explicitly modeled by $p_\theta$ and $q_\alpha$. $p_\theta$ must contain all the learned knowledge in $q_\alpha$, in addition to the difference between $p_{\mathrm{data}}$ and $q_\alpha$. In the end, we learn two explicit probability distributions, $p_\theta$ and $q_\alpha$, as approximations to $p_{\mathrm{data}}$.
Henceforth we simply refer to the proposed method as flow contrastive estimation, or FCE.
3.4 Semi-supervised learning
A class-conditional energy-based model can be transformed into a discriminative model in the following sense. Suppose there are $K$ categories $k = 1, \ldots, K$, and the model learns a distinct density $p_{\theta_k}(x)$ for each. The networks $f_{\theta_k}$ may share common lower layers, but with different top layers. Let $\rho_k$ be the prior probability of category $k$, for $k = 1, \ldots, K$. Then the posterior probability for classifying $x$ to the category $k$ is a softmax multi-class classifier

$$p(k \mid x) = \frac{\exp\left(f_{\theta_k}(x) + b_k\right)}{\sum_{k'=1}^{K} \exp\left(f_{\theta_{k'}}(x) + b_{k'}\right)}, \qquad (15)$$

where $b_k = \log \rho_k - \log Z(\theta_k)$.
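The correspondence can be sketched as follows; the tiny per-class score networks and all names are illustrative assumptions. With equal priors, $K$ class-conditional energies turn into a classifier simply by taking a softmax over the per-class scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: K class-conditional EBMs p_k(x) ~ exp(f_k(x) - c_k) with equal priors.
# The posterior over classes is a softmax over the K scores f_k(x) - c_k.
K, d = 3, 5
W = rng.normal(size=(K, d))   # illustrative per-class top layers (shared trunk omitted)
c = rng.normal(size=K)        # per-class log normalizing constants

def scores(x):
    # K scalar scores, playing the role of f_k(x) - c_k.
    return W @ np.tanh(x) - c

def posterior(x):
    s = scores(x)
    e = np.exp(s - s.max())   # numerically stable softmax
    return e / e.sum()

x = rng.normal(size=d)
post = posterior(x)
print(post, post.sum())       # a proper distribution over the K classes
```

Because the softmax is invariant to a shared additive constant, any normalization shared across classes cancels; only the per-class offsets matter for classification.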
Given this correspondence, we can modify FCE to do semi-supervised learning. Specifically, assume $\{(x_i, y_i)\}$ are observed examples with known labels, and $\{x_j\}$ are observed unlabeled examples. For each category $k$, we can assume that the class-conditional EBM is of the form

$$p_{\theta_k}(x) = \exp\left(f_{\theta_k}(x) - c_k\right), \qquad (16)$$

where the $f_{\theta_k}$ share all weights except for the top layer, and we assume equal prior probability for each category. Let $\theta$ denote all the parameters of the class-conditional EBMs $\{p_{\theta_k}\}$. For labeled examples, we can maximize the conditional posterior probability of the label $y_i$, given $x_i$ and the fact that $x_i$ is an observed example (instead of an example generated from $q_\alpha$). By Bayes rule, this leads to maximizing the following objective function over $\theta$:

$$\sum_i \log p(y_i \mid x_i) = \sum_i \log \frac{\exp\left(f_{\theta_{y_i}}(x_i) - c_{y_i}\right)}{\sum_{k=1}^{K} \exp\left(f_{\theta_k}(x_i) - c_k\right)}, \qquad (17)$$

which is similar to a classifier in softmax form.

For unlabeled examples, the probability can be defined by an unconditional EBM, which is in the form of a mixture model:

$$p_\theta(x) = \frac{1}{K} \sum_{k=1}^{K} p_{\theta_k}(x). \qquad (18)$$

Together with the examples generated from $q_\alpha$, we can define the same value function as eqn. 12 for the unlabeled examples. The joint estimation algorithm alternates two steps: (1) update $\theta$ by maximizing the value function (with $p_\theta$ given by eqn. 18) together with the labeled objective of eqn. 17; (2) update $\alpha$ by minimizing the value function. Due to the flexibility of the EBM, $f_{\theta_k}$ can be defined by any existing state-of-the-art network structure designed for semi-supervised learning.
4 Experiments
For FCE, we adaptively adjust the numbers of updates for the EBM and Glow: we first update the EBM for a few iterations until the classification accuracy rises above an upper threshold, and then we update Glow until the accuracy falls below a lower threshold. We use the Adam optimizer kingma2014adam for the EBM and Adamax kingma2014adam for the Glow model.
4.1 Density estimation on 2D synthetic data
Figure 1 demonstrates the results of FCE on several 2D distributions, where FCE starts from a randomly initialized Glow. The learned EBM can fit multi-modal distributions accurately, and forms a better fit than Glow learned by either FCE or MLE. Notably, the EBM is defined by a much simpler network structure than Glow: Glow is built from a stack of affine coupling layers, each of which amounts to several fully-connected layers, while the energy-based model is defined by a small fully-connected network with the same width as Glow. Another interesting finding is that the EBM can fit the distributions well even if the flow model is not a perfect contrastive distribution.
(Figure 1: columns show the data distribution, Glow trained by MLE, Glow trained by FCE, and the EBM trained by FCE.)
For the distribution depicted in the first row of Figure 1, which is a mixture of eight Gaussian distributions, we can compare the densities estimated by the learned models with the ground-truth densities. Figure 2 shows the mean squared error of the estimated log-density as a function of the number of training iterations of the EBM. We show the results of FCE starting either from a randomly initialized Glow ('rand') or from a Glow model pre-trained by MLE ('trained'), and compare with NCE with a Gaussian noise distribution. FCE starting from a randomly initialized Glow converges in fewer iterations, and both settings of FCE achieve a lower error than NCE.
4.2 Learning on real image datasets
We conduct experiments on the Street View House Numbers (SVHN) netzer2011reading , CIFAR-10 krizhevsky2009learning and CelebA liu2015faceattributes datasets. We resized the CelebA images to a fixed resolution and held out a subset of images as a test set. We initialize FCE with a Glow model pre-trained by MLE, for the sake of efficiency. We again emphasize the simplicity of the EBM model structure compared to Glow. See Supplementary A for detailed model architectures. For Glow, the depth per level kingma2018Glow is set separately for SVHN, CelebA and CIFAR-10. Figure 3 depicts synthesized examples from the learned Glow models. To evaluate the fidelity of synthesized examples, Table 1 summarizes the Fréchet Inception Distance (FID) heusel2017gans of the synthesized examples, computed with the Inception V3 szegedy2016rethinking classifier. The fidelity is significantly improved compared to Glow trained by MLE (see Supplementary B for qualitative comparisons), and is competitive with the other generative models. In Table 2, we report the average negative log-likelihood (bits per dimension) on the test sets. The log-likelihood of the learned EBM is based on the estimated normalizing constant (i.e., a parameter of the model) and should be taken with a grain of salt. For the learned Glow model, the log-likelihood estimated with FCE is slightly lower than that of the Glow model trained with MLE.
Method  SVHN  CIFAR-10  CelebA
VAE kingma2013auto  57.25  78.41  38.76
DCGAN radford2015unsupervised  21.40  37.70  12.50
Glow kingma2018Glow  41.70  45.99  23.32
FCE (Ours)  20.19  37.30  12.21
Model  SVHN  CIFAR-10  CelebA
Glow-MLE  2.17  3.35  3.49
Glow-FCE (Ours)  2.25  3.45  3.54
EBM-FCE (Ours)  2.15  3.27  3.40
4.3 Unsupervised feature learning
To further explore the EBM learned with FCE, we perform unsupervised feature learning with features from a learned EBM. Specifically, we first conduct FCE on the entire training set of SVHN in an unsupervised way. Then, we extract the top-layer feature maps from the learned EBM, and train a linear classifier on top of the extracted features using only a subset of the training images and their corresponding labels. Figure 4 shows the classification accuracy as a function of the number of labeled examples. We also compare our method with a supervised model that has the same structure as the EBM and is trained only on the same subset of labeled examples each time. We observe that FCE outperforms the supervised model when the number of labeled examples is small.
Next we combine features from multiple layers. Specifically, following the procedure outlined in radford2015unsupervised , the features from the top three convolutional layers are max-pooled and concatenated to form a feature vector. A regularized L2-SVM is then trained on these features with a subset of training examples and the corresponding labels. Table 3 summarizes the results for several sizes of labeled subsets of the training set. In the top part of the table, we compare with methods that estimate an EBM or a discriminative model coupled with a generator network. In the middle part of the table, we compare with methods that learn an EBM with contrastive divergence (CD) and modified versions of CD. For fair comparison, we use the same model structure for the EBMs or discriminative models used in all the methods. The results indicate that FCE outperforms these methods in terms of the effectiveness of the learned features.

Method  # of labeled data
Wasserstein GAN wasserstein  43.15  38.00  32.56
DDGM kim2016deep  44.99  34.26  27.44
DCGAN radford2015unsupervised  38.59  32.51  29.37
Persistent CD tieleman2008training  45.74  39.47  34.18
One-step CD hinton2002training  44.38  35.87  30.45
Multi-grid sampling gao2018learning  30.23  26.54  22.83
FCE (Ours)  27.07  24.12  22.05
4.4 Semi-supervised learning
Recall that in section 3.4 we showed that FCE can be generalized to perform semi-supervised learning. We emphasize that for semi-supervised learning, FCE does not only learn a classification boundary or a posterior label distribution. Instead, the algorithm ends up with estimated probability distributions for the observed examples belonging to each of the categories. Figure 5 illustrates this point by showing the learning process on a 2D example, where the data distribution consists of two twisted spirals belonging to two categories. Seven labeled points are provided for each category. As training proceeds, the unconditional EBM, which is in the form of a mixture of the class-conditional EBMs, learns to capture all the modes of the data distribution. Meanwhile, by maximizing the objective function (eqn. 17), the model is forced to assign the learned modes to different categories, resulting in two well-separated class-conditional EBMs. As shown in Figure 5, within a single mode of one category, the EBM tends to learn a smoothly connected cluster, which is often what we desire in semi-supervised learning.
We then test the proposed method on a dataset of real images. Following the setting in miyato2018virtual , we use two types of CNN structures ('Conv-small' and 'Conv-large') for the EBMs, which are commonly used in state-of-the-art semi-supervised learning methods. See Supplementary A for detailed model structures. We start FCE from a pre-trained Glow model. Before the joint training starts, the EBMs are first trained for a number of iterations with the Glow model fixed. In practice, this helps the EBMs keep pace with the pre-trained Glow model, and equips the EBMs with reasonable classification ability. We report the performance at this stage as 'FCE-init'. Also, since virtual adversarial training (VAT) miyato2018virtual has been demonstrated to be an effective regularization method for semi-supervised learning, we consider adopting it as an additional loss for learning the EBMs. More specifically, the loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. 'FCE + VAT' indicates training with VAT.
Table 4 summarizes the results of semi-supervised learning on the SVHN dataset without data augmentation. We report the mean error rates and standard deviations over three runs. All the methods listed in the table belong to the family of semi-supervised learning methods. Our method achieves performance competitive with these state-of-the-art methods. The 'FCE + VAT' results show that the effectiveness of FCE does not overlap much with existing semi-supervised methods, and thus they can be combined to further boost performance.
Method  # of labeled data
SWWAE zhao2015stacked  23.56
Skip DGM maaloe2016auxiliary  16.61
Auxiliary DGM maaloe2016auxiliary  22.86
GAN with FM salimans2016improved  18.44  8.11
VAT-Conv-small miyato2018virtual  6.83
(on Conv-small as used in salimans2016improved ; miyato2018virtual )
FCE-init  9.42  8.50
FCE  7.05  6.35
Π model laine2016temporal  7.05  5.43
VAT-Conv-large miyato2018virtual  8.98  5.77
(on Conv-large as used in laine2016temporal ; miyato2018virtual )
FCE-init  8.86  7.60
FCE  6.86  5.54
FCE + VAT  4.47  3.87
5 Conclusion
This paper explores the joint training of an energy-based model with a flow-based model, combining the representational flexibility of the energy-based model with the computational tractability of the flow-based model. We may consider the learned energy-based model as the learned representation, and the learned flow-based model as the learned computation. The method can be considered an adaptive version of noise contrastive estimation, where the noise is transformed by a flow model to make its distribution closer to the data distribution, and thus a stronger contrast for the energy-based model. Meanwhile, the flow-based model is updated adaptively throughout the learning process, under the same adversarial value function.
In future work, we intend to generalize the joint training method by combining the energy-based model with other normalized probabilistic models, such as autoregressive models. We also intend to explore flow contrastive estimation of energy-based models with more interpretable energy functions, e.g., by incorporating sparsity constraints or latent variables into the energy function.
Acknowledgments
The work is partially supported by DARPA XAI project N660011724029. We thank Pavel Sountsov, Alex Alemi and Srinivas Vasudevan for their helpful discussions.
References
 [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [2] Paul Baltescu and Phil Blunsom. Pragmatic neural language modelling in machine translation. arXiv preprint arXiv:1412.7119, 2014.
 [3] Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
 [4] Avishek Joey Bose, Huan Ling, and Yanshuai Cao. Adversarial contrastive estimation. arXiv preprint arXiv:1805.03642, 2018.
 [5] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.
 [6] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 [7] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 [8] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
 [9] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

 [10] Charles W Fox and Stephen J Roberts. A tutorial on variational Bayesian inference. Artificial Intelligence Review, 38(2):85–95, 2012.
 [11] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9155–9164, 2018.
 [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [13] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
 [14] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 [15] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 [16] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inference model. arXiv preprint arXiv:1812.10907, 2018.
 [17] Tianxing He, Yu Zhang, Jasha Droppo, and Kai Yu. On training bidirectional neural network language model with noise contrastive estimation. In 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5. IEEE, 2016.
 [18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
 [19] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
 [20] Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, pages 823–833, 2017.
 [21] Ilyes Khemakhem, Diederik P Kingma, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. arXiv preprint arXiv:1907.04809, 2019.
 [22] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

 [23] Diederik Kingma and Max Welling. In International Conference on Machine Learning, pages 1782–1790, 2014.
 [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [25] Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
 [26] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
 [27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [28] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [30] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. Videoflow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
 [31] Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
 [32] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
 [33] Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In Proceedings of the IEEE International Conference on Computer Vision, pages 2774–2783, 2017.
 [34] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [35] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
 [36] Kwonjoon Lee, Weijian Xu, Fan Fan, and Zhuowen Tu. Wasserstein introspective neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3702–3711, 2018.
 [37] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
 [38] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August, 15:2018, 2018.
 [39] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
 [40] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.
 [41] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [42] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
 [43] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pages 2265–2273, 2013.
 [44] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
 [45] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
 [46] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
 [47] Jiquan Ngiam, Zhenghao Chen, Pang W Koh, and Andrew Y Ng. Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML11), pages 1105–1112, 2011.
 [48] Erik Nijkamp, Song-Chun Zhu, and Ying Nian Wu. On learning non-convergent short-run MCMC toward energy-based model. arXiv preprint arXiv:1904.09770, 2019.
 [49] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 [50] Youssef Oualil and Dietrich Klakow. A batch noise contrastive estimation approach for training large vocabulary language models. arXiv preprint arXiv:1708.05997, 2017.
 [51] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [52] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 [53] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [54] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
 [55] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 [56] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
 [57] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

 [58] Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
 [59] Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, and Ben Poole. Discrete flows: Invertible generative models of discrete data. arXiv preprint arXiv:1905.10347, 2019.
 [60] Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
 [61] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. Decoding with largescale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, 2013.
 [62] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
 [63] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644, 2016.
 [64] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where autoencoders. arXiv preprint arXiv:1506.02351, 2015.
 [65] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
 [66] Song Chun Zhu and David Mumford. Grade: Gibbs reaction and diffusion equations. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 847–854. IEEE, 1998.
Appendix A Model architectures
Table 5 summarizes the EBM architectures used in unsupervised learning (subsections 4.1–4.3). The slope of all leaky ReLU (lReLU) maas2013rectifier functions is set to . For semi-supervised learning from a 2D example (subsection 4.4), we use the same EBM structure as the one used in unsupervised learning from 2D examples, except that for the top fully connected layer, we change the number of output channels to , so as to model the EBMs of the two categories respectively. Table 6 summarizes the EBM architectures used in semi-supervised learning from SVHN (subsection 4.4). After each convolutional layer, a weight normalization salimans2016weight layer and a leaky ReLU layer are added. The slope of the leaky ReLU functions is set to . A weight normalization layer is added after the top fully connected layer.

2D data | SVHN / CIFAR-10
fc. lReLU | conv. lReLU, stride
fc. lReLU | conv. lReLU, stride
fc. lReLU | conv. lReLU, stride
fc. | conv., stride
Conv-small | Conv-large

dropout, | dropout,
conv., stride | conv., stride
conv., stride | conv., stride
conv., stride | conv., stride
dropout, | dropout,
conv., stride | conv., stride
conv., stride | conv., stride
conv., stride | conv., stride
dropout, | dropout,
conv., stride | conv., stride
conv., stride | conv., stride
conv., stride | conv., stride
global max pool, | global max pool,
fc. | fc.
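The weight normalization and leaky ReLU components used in the EBM layers above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's code: the helper names are ours, and the leaky ReLU slope of 0.2 is an assumed placeholder, since the actual slope value is elided in the text above.

```python
import math

def weight_norm(v, g):
    """Weight normalization (Salimans & Kingma, 2016): reparameterize a
    weight vector as w = g * v / ||v||, so that the direction (v) and
    the magnitude (g) of the weights are learned separately."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

def leaky_relu(x, slope=0.2):
    """Leaky ReLU (lReLU); the slope 0.2 is an illustrative choice."""
    return x if x >= 0.0 else slope * x

def fc_lrelu_layer(x, rows, gains, biases, slope=0.2):
    """One fully connected layer with weight normalization followed by a
    leaky ReLU, mirroring the layer pattern in Tables 5 and 6. `rows`
    holds one unnormalized weight vector per output unit."""
    out = []
    for v, g, b in zip(rows, gains, biases):
        w = weight_norm(v, g)
        pre = sum(wi * xi for wi, xi in zip(w, x)) + b
        out.append(leaky_relu(pre, slope))
    return out
```

Note that after weight normalization, the norm of each effective weight vector equals its gain `g` exactly, which is what decouples scale from direction during optimization.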
For the Glow model, we follow the settings of kingma2018Glow . The architecture is multi-scale, with a number of levels; each level contains a number of flow blocks. Each block has three convolutional layers (or fully connected layers) with a given channel width, and a ReLU activation is added after each of the first two layers. Table 7 summarizes these hyperparameters for the different datasets.
Dataset | Levels | Blocks per level | Width | Layer type | Coupling

2D data | 1 | 10 | 128 | fc | affine
SVHN | 3 | 8 | 512 | conv | additive
CelebA | 3 | 16 | 512 | conv | additive
CIFAR-10 | 3 | 32 | 512 | conv | additive
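To make the additive vs. affine coupling distinction in the last column concrete, here is a minimal sketch of the two coupling transforms (additive coupling from NICE, affine coupling from Real NVP), written in plain Python with the shift and scale networks passed in as functions. The function names are illustrative, not from the paper's implementation.

```python
import math

def additive_coupling(x1, x2, t):
    """Additive coupling (NICE): y1 = x1, y2 = x2 + t(x1).
    Volume preserving, so log|det J| = 0."""
    return x1, [a + b for a, b in zip(x2, t(x1))]

def additive_coupling_inv(y1, y2, t):
    """Exact inverse of additive coupling: x2 = y2 - t(y1)."""
    return y1, [a - b for a, b in zip(y2, t(y1))]

def affine_coupling(x1, x2, s, t):
    """Affine coupling (Real NVP): y1 = x1, y2 = x2 * exp(s(x1)) + t(x1).
    Returns (y1, y2, logdet) where log|det J| = sum(s(x1))."""
    s1, t1 = s(x1), t(x1)
    y2 = [a * math.exp(si) + ti for a, si, ti in zip(x2, s1, t1)]
    return x1, y2, sum(s1)
```

In both cases the inverse needs only forward evaluations of `t` (and `s`), which is what makes coupling-based flows efficient to invert; the additive variant trades the learnable scale for a zero log-determinant.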