1 Introduction
The exponential family is one of the most important classes of distributions in statistics and machine learning. It possesses a number of useful properties
(Brown, 1986) and includes many commonly used distributions, and the notably powerful undirected graphical models (Wainwright and Jordan, 2008)(LeCun et al., 2006; Wu et al., 2018), , Markov random fields (Kinderman and Snell, 1980), conditional random fields (Lafferty et al., 2001), and language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013). Despite the flexibility of the exponential family, a majority of models in this family are intractable due to the partition function having no known analytic form. This leads to difficulties in evaluating, sampling and learning of the model, which hinders its application in practice. In this paper, we try to provide an answer to the longstanding fundamental question:How to design a simple yet effective algorithm for estimating the exponential family distributions?
There have been many attempts in the literature to answer this question. Since the maximum likelihood estimator (MLE) has already been wellstudied with many desirable statistical properties, , consistency and asymptotic unbiasedness and normality (Brown, 1986), a major research direction lies in approximating the MLE of the general exponential family. Contrastive divergence (CD) (Hinton, 2002) and its variants (Tieleman and Hinton, 2009; Du and Mordatch, 2019)
are representative algorithms: rather than evaluating the gradient of the partition function, which also involves an intractable integral, CD exploits a stochastic gradient estimator with samples generated from a few Markov chain Monte Carlo (MCMC) steps. The major demerits of the CD algorithm and its variants are twofold:
i), it requires careful design of the transition kernel in MCMC, which is not a trivial task itself; more severely, ii), the stochastic gradient is biased, which may lead to an inferior estimate.Because of these difficulties with MLE, many other learning approaches have been introduced, including pseudolikelihood (Besag, 1975), score matching (Hyvärinen, 2005), noisecontrastive estimation (Gutmann and Hyvärinen, 2010; Ceylan and Gutmann, 2018), and minimum probability flow (SohlDickstein et al., 2011). Instead of approximating the MLE, these algorithms introduce alternative learning objectives that naturally sidestep the calculation of the partition function.
Pseudolikelihood estimators (Besag, 1975)
factorize the joint distribution by the product of conditional distributions, in which each factor is the distribution of a single random variable conditioned on the others. Even these conditional distributions can lack partition functions in analytic form for general exponential family distributions, however.
Score matching (Hyvärinen, 2005) instead minimizes the Fisher divergence between the empirical distribution and the model, in order to learn the parameters. The Fisher divergence involves only the derivatives of the potential function in the exponential family, and does not require computation of the partition function. Fisher score matching method requires the calculation of third order derivatives for the optimization procedure, however, which can be prohibitive for complex models (Kingma and LeCun, 2010).
Noisecontrastive estimation (Gutmann and Hyvärinen, 2010; Ceylan and Gutmann, 2018)
recasts the problem as ratio estimation between the target distribution and a predefined auxiliary noise distribution. The ratio can then be estimated by logistic regression. The quality of the estimator and its statistical properties strongly rely on the choice of the auxiliary noise distribution, which needs to cover the support of the data, have an analytical expression, and be easy to sample. Such requirements are unlikely to be satisfied for many applications, especially in high dimensions.
Minimum probability flow (SohlDickstein et al., 2011) leverages the observation that in an ideal case, the empirical distribution is the stationary distribution of the dynamics constructed from the optimal exponential family. MPF then estimates the model by matching these two distributions. Since the parameter in the model is implicitly represented through the dynamics, the objective also avoids the calculation of the partition function. While the idea behind the MPF is inspiring, the challenge remains to construct sampling dynamics for efficient learning.
In this paper, we introduce a novel algorithm, Adversarial Dynamics Embedding (ADE), for learning and sampling from a maximum likelihood estimator, with a view to both computational and statistical efficiency. Our algorithm is largely motivated by the recent work of Dai et al. (2018a), who exploit a primaldual view of the MLE that allows to bypass the computation of the intractable partition function. Such a view provides a natural objective for jointly learning both a sampler and a model, as an alternative to the expensive and biased MCMC steps in the CD algorithm. The parametrization of the dual sampler should be flexible to reduce the extra approximation error and density tractable for entropy computation in the objective. Dai et al. (2018a) consider the transport mapping for the dual distribution, leading to difficulty in entropy estimation, and thus requiring extra auxiliary distribution to be learned with additional computational and memory cost.
Inspired by the properties of Hamiltonian MonteCarlo (HMC) (Neal et al., 2011), ,

HMC forms stationary distribution with independent potential and kinetic variables;

HMC can approximate the exponential family arbitrary closely,
we consider the augmented distribution with auxiliary kinetic variable as latent variable subsec:argumented_mle and introduce dynamics embedding to parametrize the sampler by mimicking the sampling algorithms in subsec:dyna_embed. Due to the independence between potential and kinetic variables, the posterior of latent kinetic variable has explicit closedform, eliminating the extra variational inference for latent kinetic variable. The dynamic embedding represents the dual samplers with tractable entropy via the primal model parameters, and thus enrich the flexibility of sampler without introducing extra parameters. Finally, the coupled primal model and dual sampler will be learned adversarially, promoting the learning sample efficiency as described in subsec:coupled_learning.
We further demonstrate that the proposed estimator contains CD, pseudolikelihood, score matching, nonlocal contrastive objectives, noisecontrastive estimation, and minimum probability flow as special cases with handdesigned dual samplers in sec:connection. Our numerical experiments in sec:experiments show that it also outperforms existing stateoftheart estimators empirically.
2 Preliminaries
We provide a brief introduction to the technical background that is needed in the derivation of the new algorithm, including exponential family, dynamicsbased MCMC sampler, and primaldual view of MLE.
2.1 Exponential Family and Energybased Model
The natural form of the exponential family over is defined as
(1) 
where and . The sufficient statistic
can be general parametric model, , neural networks. The
are the parameters that can be learned from observed data. The exponential family has be generalized to infinite dimensional case by settinig belongs to some RKHSs (Sriperumbudur et al., 2017; Sutherland et al., 2017; Dai et al., 2018a). The energybased model (LeCun et al., 2006) can be viewed as a special case of the exponential family by setting with .Traditionally, the is prefixed and covers the support , which is difficult in practical highdimension problems. In this paper, we focus on learning with together, ,
(2) 
Given samples and denoting as the valid parametrization family, the maximum loglikelihood estimation of the exponential family can be written as
(3) 
which leads to the gradient as
(4) 
Since and are intractable, solving the MLE of general exponential family is extremely difficult.
2.2 Dynamicsbased MCMC
The dynamicsbased MCMC is a general and effective tool for inference. The basic idea of dynamicsbased MCMC is to represent the target distribution as the solution to some (stochastic) differential equations. Therefore, samples of the target distribution can be obtained by simulating along the dynamics defined by the differential equations.
Hamiltonian MonteCarlo (HMC) (Neal et al., 2011) is a representative algorithm in this category, which exploits the wellknown Hamiltonian dynamics. Specifically, given a target distribution , the Hamiltonian is defined as
(5) 
where is the kinetic energy. The Hamiltonian dynamics generate over time , following
(6) 
Asymptotically as , the follows the target distribution. In practical HMC, to reduce the error from discretization, an acceptancerejection step can be introduced.
2.3 The PrimalDual View of MLE
To address the intractability of the partition function in MLE (3), the Fenchel duality of has been exploited (Rockafellar, 1970; Wainwright and Jordan, 2008; Dai et al., 2018a), , [Fenchel dual of logpartition, Proposition 5.1 (Wainwright and Jordan, 2008)] Let , we have:
(7)  
(8) 
where denotes the space of distributions, .
Plugging the Fenchel dual of into the MLE (3), we achieve a saddlepoint reformulation of as
(9) 
which bypasses the explicit computation of the partition function. Another byproduct of the primaldual view is that the dual distribution can be used in the inference stage, which usually requires executing relative expensive sampling algorithms in vanilla estimators.
The dual sampler plays a vital role in the primaldual formulation of the MLE for exponential family estimation in (9). To achieve better performance, we have several principal requirements in parameterizing the dual distribution:

the parametrization family needs to be flexible enough to diminish the error from solving the inner minimization problem;

the entropy of the parametrized dual distribution should be tractable.
Moreover, as shown in thm:fenchel_log_partition, the optimal dual sampler is totally decided by primal model ,

the parametrized dual sampler should explicitly incorporate the primal model .
Such dependence will reduce both the memory cost and sample complexity in the learning procedure. In the ideal case, the dual sampler is solely represented via , and thus, no extra auxiliary component for dual sampler need to be learned.
A wide spectrum of techniques for distribution parametrization has been developed, such as the reparametrized latent variable models (Kingma and Welling, 2013; Rezende et al., 2014), transport mapping (Goodfellow et al., 2014), and normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016), . However, none of these offtheshelf parametrization satisfies the requirements about flexibility and density tractability simultaneously, not even to mention the dependency requirement. Specifically, the density value in transport mapping is intractable. On the other hand, the reparametrized trick restricts the distributions lie in known density form and the flowbased models rely on invertible operations, both limiting the expressive ability. Meanwhile, the reparametrized latent variable model requires extra variational posterior to be learned.
3 Adversarial Dynamics Embedding
In this section, we consider augmenting the original exponential family with kinetic variables. The primaldual view of MLE for such new augmented model will naturally introduce an adversarial dual sampler with momentums as the latent variables that can be parametrized by dynamic embedding, satisfying all three requirements above without extra variational distribution introduced.
We start with the embeddings of classical Hamiltonian dynamics (Neal et al., 2011; Caterini et al., 2018) for dual distribution parametrization as an example, and discuss its generalized dynamics in latent space and the stochastic Langevin dynamics. This technique can be extended to other dynamics, each with its own advantages. We provide these extensions in Appendix B.
3.1 PrimalDual View of Augmented MLE
As we discussed, it is difficult to find a parametrization of in (9) satisfying all the three requirements simultaneously. Therefore, instead of directly working on (9) for original model, inspired with HMC, we consider the augmented exponential family with auxiliary momentum variable, ,
(10)  
The MLE of such model is
(11)  
where the last equation comes from the definition of true posterior . By exploiting the independence between and in defined in (10), we have Applying the Fenchel dual to of the augmented model (10), we derive the primaldual view of (11), leading to the objective,
(12) 
Remark (momentum as latent variable):
The in (12) contains latent variable . One can exploit latent variable model for in (9). However, the in (9) requires marginalization, which is intractable in general cases, and usually estimated via variational inference with extra posterior introduced. By directly considering the specifically designed augmented model, (12) eliminates such extra variational steps.
Similarly, one can introduce multiple momentums as the hierarchical latent variable augmented model, ,
leading to the objective as
(13) 
3.2 Representing Dual Sampler via Primal Model
In this section, we introduce the Hamiltonian dynamic embedding representation of distributions for dual sampler, as well as its generalization and special instantiation.
The vanilla HMC is derived by discretizing the Hamiltonian dynamics defined in (6) with leapfrog integrator. Specifically, in a single time step, the sample moves according to
(14) 
where is defined as the leapfrog stepsize. Denote the onestep leapfrog as and assume the , after iterations, we obtain
(15) 
which can be viewed as a neural network with special architecture. We term this Hamiltonian (HMC) dynamics embedding, by which the dual sampler is represented by the primal model, , the potential function .
The flexibility of the HMC embedding for parameterizing the distribution actually comes from the nature of the dynamicsbased samplers. In the limit the case, the proposed neural network (15) reduces to a gradient flow, whose stationary distribution is exactly the model distribution
Formally, the approximation ability of the HMC embedding can be justified, [HMC embeddings as gradient flow] For a continuous time with infinitesimal stepsize , the density of the particles , denoted as , follows FokkerPlanck equation
(16) 
with . Then as . Due to the space limit, the details of the proofs are postpone in appendix:derivation_sec_mle. As we can see, the neural network composed by HMC embedding can approximate well to exponential family distributions on continuous variables.
Remark (Generalized HMC dynamics in latent space):
the leapfrog operation in the vanilla HMC is directly working in the original observation space, which is highdimensional and noisy. We generalize the leapfrog update rule in latent space and form the new dynamics, ,
(17)  
(18)  
(19) 
where denotes the elementwise product. Specifically, the and rescales and coordinately. The can be understood as projecting the gradient information to the an essential latent space. Then, for updateing , the latent momentum is projected back to original space by . With such generalized leapfrog updates, the dynamic system avoids operating the highdimensional noisy input and become more efficient. We emphasize that the proposed generalized leapfrog parametrization (17) is different from the one used in Levy et al. (2017), which is inspired by the realNVP flow (Dinh et al., 2016).
By embedding the generalized HMC (17), we have a flexible layer . The will be learned in addition to the stepsize. In fact, the vanilla HMC layer is a special case of the by setting as zero and as identity function.
Remark (Stochastic Langevin dynamics):
we can recover the stochastic Langevin dynamics from leapfrog step by resampling momentum in every step. Specifically, the sample moves according to
(20) 
where . As we can see, the stochastic Langevin dynamics resample to replace the momentum in leapfrog (14), ignoring the gradient accumulation. After iteration, we obtain
(21) 
as the derived neural network. Similarly, we can also generalize the stochastic Langevin updates by introducing and in corresponding positions.
On the other hand, the density value of the proposed neural network is also tractable, leading to tractable entropy estimation in (12) and (13). In particular, we have [Density value evaluation] If , after vanilla HMC steps (14), we have
(22) 
For the from the generalized leapfrog steps (17), we have
(23) 
where and denotes
(24) 
For the from the stochastic Langevin dynamics (20) with , we have
(25) 
The proof of thm:hmc_density can be found in appendix:derivation_sec_mle.
As we can see, the proposed dynamics embedding satisfies all three requirements: it covers a flexible family of distributions with tractable density value, meanwhile, it couples the dual sampler with primal model, which leads to memory and sample efficient learning algorithm, as we will introduce in next section.
3.3 Coupled Model and Sampler Learning
Plug the step Hamiltonian dynamics embedded parametrization (14) or (17) into the primaldual MLE of the augmented model (12) and apply the density value evaluation in (22) and (23), respectively, we obtain our ultimate adversarial objective,
(26) 
Here denotes the learnable components in the dynamics embedding, , initialization , the stepsize in the HMC/Langevin updates, and the adaptive part in the generalized HMC, which is learned adversarially. The parametrization of initial distribution is discussed in appendix:practical. For the step stochastic Langevin dynamics embedded parametrization (20), we apply the density value (25) to (13), which leads to a similar saddlepoint optimization with hierarchical momentums.
We can now use stochastic gradient descent to estimate
for the exponential families as well as the the parameters in the dynamics embedding adversarially. Note that the generated sample is a function of , the gradient w.r.t. should also take these variables into account as backpropagation through time (BPTT), ,(27) 
We illustrate the MLE via HMC adversarial dynamics embedding (ADE) in alg:mle_de.
The same technique can be applied to other parametrization of dual sampler via alternative dynamics embeddings, as listed in appendix:variants, leading to different architectures and the entropy term evaluations. Consider the dynamics embedding as an adaptive sampler that can be automatically learned w.r.t. different models and datasets, then, the updates for in can be understood as learning to sample.
4 Connections to Other Estimators
We establish the connections between the proposed estimator with existing approaches. In summary, the proposed algorithm is equipped with an adaptive sampler; while the existing algorithms can be recast as the special cases of our estimator with some handdesigned samplers, and thus, may lead to extra error and inferior solution.
4.1 Connection to Contrastive Divergence
The CD algorithm (Hinton, 2002) is a special case of the proposed algorithm. By thm:fenchel_log_partition, the optimal solution to the inner optimization is . Applying Danskin’s theorem (Bertsekas, 1995), the gradient of w.r.t. is
(28) 
To estimate the integral , the CD algorithm approximates the negative term in (28) stochastically with a finite MCMC step away from empirical data.
In the proposed dual sampler, by setting to be the empirical distribution and eliminating the sampling learning, the dynamic embedding will collapse to CD with HMC steps if we remove gradient through the sampler, , ignoring the third term in (27). Similarly, the persistent CD (PCD) (Tieleman, 2008) and recent ensemble CD (Du and Mordatch, 2019) can also be recast as special cases by setting the negative sampler to be MCMC with initial samples from previous model and ensemble of MCMC samplers, respectively.
From this perspective, the CD and PCD algorithms induce errors not only from the sampler, but also from the gradient backpropagation truncation. The proposed algorithm escapes these sources of bias by learning to sample, and by adopting true gradients, respectively. Therefore, the proposed estimator is expected to achieve better performance than CD as demonstrated in the empirical experiments sec:real_exp.
4.2 Connection to Score Matching
The score matching (Hyvärinen, 2005) estimates the exponential family by minimizing the Fisher divergence, leading to optimization,
(29) 
As explained in Hyvärinen (2007), the objective (29) can be derived as the ndorder Taylor approximation of the MLE with step Langevin Monte Carlo as the dual sampler. Specifically, the Langevin Monte Carlo generates samples via
then, a simple Taylor expansion gives
Plug such into the negative expectation in , leading to
which is exactly the defined in (29).
Therefore, the score matching can be viewed as applying Taylar expansion approximation with fixed step Langevin sampler in our framework, which is compared in sec:synthetic_exp.
4.3 Connection to PseudoLikelihood and Conditional Composite Likelihood
The pseudolikelihood estimation (Besag, 1975) is a special case of the proposed algorithm by restricting the parametrization of the dual distribution. Specifically, denote the with , instead of directly maximizing likelihood, the pseudolikelihood estimator is maximizing
(30) 
Then, the is updated by the following the gradient of , ,
The pseudolikelihood estimator can be recast as a special case of the proposed framework if we fix the dual sampler as i), sample uniformly; ii), sample and mask ; iii), sample and compose .
The conditional composite likelihood (Lindsay, 1988) is a generalization of pseudolikelihood by maximizing
(31) 
where and . Similarly, the composite likelihood is updating with prefixed conditional block sampler for negative sampling.
Same as CD, the prefixed sampler and the biased gradient in pseudolikelihood and composite likelihood estimator will induce extra errors and lead to inferior solution. Moreover, the pseudolikelihood may not applicable to the general exponential family with continuous variables, whose conditional distribution is also intractable.
4.4 Connection to Nonlocal Contrastive Objectives
The nonlocal contrastive estimator (Vickrey et al., 2010) is obtained by maximizing
(32) 
where denotes some prefixed partition of , , and with . The objective (32) leads to the update direction as
(33) 
where with as the empirical distribution and , . Therefore, the nonlocal contrastive objective is a special case of the proposed framework with the dual sampler as i), sample uniformly from ; ii), sample conditional on according to ; iii), sample within . Such negative sampling method is also not applicable to the general exponential family with continuous variables.
4.5 Connection to Minimum Probability Flow
In the continuous state model, the minimum probability flow (SohlDickstein et al., 2011) estimates the exponential family by maximizing
where is a handdesigned symmetric transition kernel based on the potential function , , Hamiltonian or Langevin simulation. Then, the MPF update direction can be rewritten as
(34) 
where . The probability flow operator actually defines a Markov chain sampler that achieves the following balance equation,
Similar to CD and score matching, the MPF exploits the step MCMC. Moreover, the gradient in MPF also considers the effects in sampler as the third term in (34). Therefore, the MPF can be recast as a special case of our algorithm with the prefixed dual sampler as and .
4.6 Connection to NoiseContrastive Estimator
Instead of directly estimating the in the exponential family, Gutmann and Hyvärinen (2010) propose the noisecontrastive estimation (NCE) for the density ratio between the exponential family and some user defined reference distribution , from which the parameter can be reconstructed. Specifically, the NCE considers an alternative representation of exponential family distribution as , which explicitly enforces . The NCE is obtained by maximizing
(35) 
where . Then, we have the gradient of as
(36) 
The negative sampler in the eq:nce_grad can be understood as an approximate importance sampling algorithm where the proposal is and the rewighting part is . As the approaching , the will approach the true ratio , and thus, the negative samples will converge to true model samples.
The NCE can be understood as learning an important sampler. However, the performance of NCE highly relies on the quality , , the choice of . It is required to cover the support of , which is nontrivial in practical highdimensional applications.
5 Related Work
The parametrization of the dual sampler should be both flexible enough and density tractable to achieve better performance. Pioneering works are limited in either one aspect or other. Kim and Bengio (2016) parameterize the sampler via a deep directed graphical model, whose approximation ability is limited to known distributions. Meanwhile, they fit by minimizing the reverse divergence to avoid the entropy calculation, leading to unclear relationship to MLE. Due to the difficulty of the entropy term for general transport mapping parametrization, a variety of approximate surrogates have been proposed to relax the density value tractability requirement. Liu and Wang (2017) learn the sampler to mimic the Stein variational gradient descent sampling procedure; Dai et al. (2017)
propose algorithms relying on either a heuristic approximation or a upper bound of the entropy, with extra auxiliary component introduced to be learned;
Dai et al. (2018a)apply a second Fenchel dual representation to reformulate the entropy term, at the cost of introducing another auxiliary function to be estimated. Meanwhile, the second Fenchel duality parametrization relies on a proposal distribution with the same support for numerical stability, which is impractical for highdimensional data. In contrast to these existing methods, the proposed dynamics embedding achieves both flexibility and tractability of entropy estimation with less independent auxiliary parameters introduced. Moreover, the dynamics embedding avoids the design of proposal, which is naturally suitable for highdimension data.
One of our major contributions is learning a sampling strategy for the exponential family estimation through the primaldual view of MLE. The proposed algorithm shares some similarities with recent advances in meta learning for sampling (Levy et al., 2017; Feng et al., 2017; Song et al., 2017; Gong et al., 2018), in which the sampler is parametrized via neural network and will be learned through certain objectives. However, we emphasize that the most significant difference lies in the ultimate goals: we focus on exponential family model estimation and the learned sampler is only introduced to assist with this objective. By contrast, the learning to sample techniques are targeting on learning a fixed model that is already given. This fundamentally distinguishes the proposed ADE from methods that only learns samplers, leading to totally different learning criterion and algorithm updates, , the primal model will be learned back through the learned sampler, from which perspective the proposed algorithm can be understood as metalearning.
6 Experiments
6.1 Synthetic experiments
(a) 2spirals  (b) Banana  (c) circles  (d) cos  (e) Cosine  (f) Funnel  (g) swissroll 
(h) line  (i) moons  (j) Multiring  (k) pinwheel  (l) Ring  (m) Spiral  (n) Uniform 
Learned samplers in odd row and potential function
in even row from different synthetic datasets.We parametrize the potential function
with fully connected multilayer perceptron with 3 hidden layers. Each hidden layer has
hidden units. We use ReLU to do the nonlinear activation in each hidden layer. For baseline methods, we compare with score matching (SM) estimator
(Hyvärinen, 2005) and primaldual MLE with the normalizing flow (Rezende and Mohamed, 2015) parametrized dual sampler (NF). Note that although the normalizing flow can be used to perform maximum likelihood estimation (MLE) to fit the data (Rezende and Mohamed, 2015), here the flow is asked to help train the exponential model . Using normalizing flow to learn the exponential family model can be viewed as the special case of our method, which has only parameterized as flow, but no HMC dynamics is involved. Thus this is also served as one ablation study for the proposed ADE.For our method, we use normalizing flow as the initial proposal distribution . Then we perform several steps of stochastic Langevin operation to get the final samples . is the distribution for sampling momentum. We clip the norm of when updating , and clip when updating . The coefficient in (26) is tuned in . For the NF baseline, we tune the number of layers in . For our ADE, we fix the number of normalizing flow layers to be , and then perform at most steps of dynamics updates. So finally, the number of steps for sampling is comparable, while the ADE maintains less memory cost.
In this section, we compare the proposed ADE with other baseline methods on several synthetic datasets. For visualization purpose, all the samples are in space. The dataset generators are collect from several opensource projects ^{1}^{1}1https://github.com/rtqichen/ffjord ^{2}^{2}2https://github.com/kevinwli/deepkexpfam. During training, we use this generator to generate the data from the true distribution on the fly. To get a quantitative comparison, we also generate 1,000 data samples for heldout evaluation.
Dataset  SM  NF  ADE 

2spirals  5.09  0.69  0.76 
Banana  8.10  0.88  1.13 
circles  4.90  0.76  1.30 
cos  10.36  0.91  0.58 
Cosine  8.34  2.15  1.00 
Funnel  13.07  0.92  1.02 
swissroll  19.93  1.97  0.97 
line  10.28  0.39  0.53 
moons  41.34  0.80  1.04 
Multiring  2.01  0.30  1.05 
pinwheel  18.41  3.01  0.83 
Ring  9.22  161.89  1.38 
Spiral  9.48  5.96  0.62 
Uniform  5.88  0.00  1.08 
In fig:synthetic, we visualize the learned distribution using both the learned dual sampler and also the exponential model , where is a constant that is tuned within [0.01, 10]. Overall one can see the sampler almost perfectly recover the distribution, and the learned almost captures the landscape of the distribution. We also visualize the convergence behavior in fig:synthetic_sampler. We can see the samples are converging to the true data distribution smoothly. As the learned sampler depends on the , so this figure also suggests the nice convergence behavior of indirectly.
The quantitative comparison of the sampler is illustrated in tab:synthetic_mmd. For NF and ADE, we use 1,000 samples from their sampler to calculate MMD with gaussian kernel. The kernel bandwidth is chosen using median trick (Dai et al., 2016). For SM, since there is no such sampler available, we directly use vanilla HMC to get sample from the learned model , and use them to estimate MMD. From the table we can see, our approach gets the best MMD in all cases. This demonstrates the effectiveness of dynamics embedding, as comparing to only using the normalizing flow model.
6.2 Realworld Image Datasets
(a) Samples on MNIST  (b) Histogram on MNIST  (c) Samples on CIFAR10  (d) Histogram on CIFAR10 
To further verify that the proposed adversarial dynamics embedding can learn complicated exponential family and the corresponding sampler well for realworld images, we trained models on MNIST and CIFAR10 dataset. For both datasets, we use the CNN architecture for discriminator as used in Miyato et al. (2018), and spectral normalization is added to all discriminator layers. Specifically, for the discriminator in CIFAR10 experiments, we replace all downsampling operations by average pooling, as in Du and Mordatch (2019). The detailed architectures are shown in subsec:real_archi.
We parametrize the initial distribution with neural nets as deep Gaussian latent variable model (Deep LVM) specified in appendix:practical. After several HMC steps on this initial distribution, instead of clipping the images, we pass the sample to function to obtain the final sampled images.
We used the standard spectral normalization on the discriminator to stabilize the training process, and Adam with learning rate to optimize our model. We trained the models with iterations with batch size being . The step sizes for our HMC sampler are independently learned for all HMC dimensions but shared among all time steps, and the values are all initialized to . We set the number of HMC steps to . The coefficient of the entropy regularization term is set to and that of the
regularization on the momentum vector in the last HMC step is set to
.Model  Inception Score 

WGANGP (Gulrajani et al., 2017)  6.50 
Spectral GAN (Miyato et al., 2018)  7.42 
Langevin PCD (Du and Mordatch, 2019)  6.02 
Langevin PCD (10 ensemble) (Du and Mordatch, 2019)  6.78 
ADE: Deep LVM init w/o HMC  6.51 
ADE: Deep LVM init w/ HMC  7.43 
We report the inception scores in tab:inception_score. For our ADE, we trained two models with Deep LVM as the initial dual sampler, one with HMC samplers and one without that. We observed that the HMC sampler greatly improves the performance of the samples generated by initial Deep LVM along and enables the generator to be comparable with Spectral GAN. Moreover, we note that the initial Deep LVM helps sample generation, compared to the Langevin PCD models reported in (Du and Mordatch, 2019).
We show some samples of generated images in fig:real_images(a) and (c). More sampled images can be found in subsec:more_images. In addition, we also plot the potential distribution (unnormalized) of generated samples and that of the real images on MNIST and CIFAR10
(using 1000 data points for each) in fig:real_images(b) and (d). The distributions of energies of both generated images and real iamges mostly overlap with each other, showing that the discriminators learn the desirable distributions. Also, with simple importance sampling and proposal distribution being uniform distribution on
( is the dimension of images), the log likelihood (in nats) on CIFAR10 is estimated to be around .7 Conclusion
We proposed a novel method to efficiently construct maximum likelihood estimator (MLE) for general exponential families. Specifically, by utilizing the nice properties of primaldual formulation of the MLE of the augmented exponential family distribution with kinetic variables, we incorporate dynamicsbased distribution parametrization for dual sampler into the estimation process in a fully differentiable way. Such coupling of the primal model and the dual sampler benefits each other, and thus, achieving both better model estimation and efficient inference at the same time. Our empirical results on both synthetic and real data validate the feasibility of our proposed dynamics embedding. In future work, we will extend this method to other graphical model learning problems.
Acknowledgements
NH is supported in part by NSFCRII1755829, NSFCMMI1761699, and NCSA Faculty Fellowship. LS is supported in part by NSF IIS1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS1350983, NSF IIS1639792 EAGER, NSF IIS1841351 EAGER, NSF CCF1836822, NSF CNS1704701, ONR N000141512340, Intel ISTC, NVIDIA, Amazon AWS, Siemens and Google Cloud.
References
 Bertsekas (1995) D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
 Besag (1975) J. Besag. Statistical analysis of nonlattice data. The Statistician, 24:179–195, 1975.
 Boutsidis et al. (2017) Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, EugeniaMaria Kontopoulou, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533:95–117, 2017.
 Brown (1986) Lawrence D. Brown. Fundamentals of Statistical Exponential Families, volume 9 of Lecture notesmonograph series. Institute of Mathematical Statistics, Hayward, Calif, 1986.
 Caterini et al. (2018) Anthony L Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational autoencoder. arXiv preprint arXiv:1805.11328, 2018.
 Ceylan and Gutmann (2018) Ciwan Ceylan and Michael U Gutmann. Conditional noisecontrastive estimation of unnormalised models. arXiv preprint arXiv:1806.03664, 2018.

Dai et al. (2016)
Bo Dai, Niao He, Hanjun Dai, and Le Song.
Provable bayesian inference via particle mirror descent.
InProceedings of the 19th International Conference on Artificial Intelligence and Statistics
, pages 985–994, 2016.  Dai et al. (2018a) Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. 2018a. URL http://arxiv.org/abs/1811.02228.
 Dai et al. (2018b) Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao Xiao, and Le Song. Coupled variational bayes via optimization embedding. In NIPS, 2018b.
 Dai et al. (2017) Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energybased generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.
 Dinh et al. (2016) Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
 Du and Mordatch (2019) Yilun Du and Igor Mordatch. Implicit generation and generalization in energybased models. arXiv preprint arXiv:1903.08689, 2019.
 Feng et al. (2017) Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized stein variational gradient descent. arXiv preprint arXiv:1707.06626, 2017.
 Gong et al. (2018) Wenbo Gong, Yingzhen Li, and José Miguel HernándezLobato. Metalearning for stochastic gradient mcmc. arXiv preprint arXiv:1806.04522, 2018.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Grathwohl et al. (2018) Will Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Duvenaud. Ffjord: Freeform continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
 Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noisecontrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 Han et al. (2015) Insu Han, Dmitry Malioutov, and Jinwoo Shin. Largescale logdeterminant computation through stochastic chebyshev expansions. In International Conference on Machine Learning, pages 908–917, 2015.
 Hinton (2002) Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 Hyvärinen (2005) A. Hyvärinen. Estimation of nonnormalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.
 Hyvärinen (2007) Aapo Hyvärinen. Some extensions of score matching. Computational statistics & data analysis, 51(5):2499–2512, 2007.
 Kim and Bengio (2016) Taesup Kim and Yoshua Bengio. Deep directed generative models with energybased probability estimation. arXiv preprint arXiv:1606.03439, 2016.
 Kinderman and Snell (1980) R. Kinderman and J. L. Snell. Markov Random Fields and their applications. Amer. Math. Soc., Providence, RI, 1980.
 Kingma and Dhariwal (2018) Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
 Kingma and LeCun (2010) Diederik P Kingma and Yann LeCun. Regularized estimation of image statistics by score matching. In NIPS, 2010.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 Lafferty et al. (2001) J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic modeling for segmenting and labeling sequence data. In Proceedings of International Conference on Machine Learning, volume 18, pages 282–289, San Francisco, CA, 2001. Morgan Kaufmann.
 LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energybased learning. Predicting structured data, 1(0), 2006.
 Levy et al. (2017) Daniel Levy, Matthew D Hoffman, and Jascha SohlDickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
 Lindsay (1988) B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221–239, 1988.
 Liu and Wang (2017) Qiang Liu and Dilin Wang. Learning deep energy models: Contrastive divergence vs. amortized mle. arXiv preprint arXiv:1707.00797, 2017.
 Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 Mnih and Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noisecontrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.
 Mnih and Teh (2012) Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
 Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Rezende et al. (2014)
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.
In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 1278–1286, 2014.  Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 Rockafellar (1970) R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton, NJ, 1970.
 SohlDickstein et al. (2011) Jascha SohlDickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 905–912. Omnipress, 2011.
 Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. Anicemc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
 Sriperumbudur et al. (2017) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017.
 Sutherland et al. (2017) Dougal J Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Efficient and principled score estimation with nystr” om kernel exponential families. arXiv preprint arXiv:1705.08360, 2017.

Tieleman (2008)
Tijmen Tieleman.
Training restricted Boltzmann machines using approximations to the likelihood gradient.
In Proceedings of the International Conference on Machine Learning, 2008.  Tieleman and Hinton (2009) Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.
 Vickrey et al. (2010) David Vickrey, Cliff ChiungYu Lin, and Daphne Koller. Nonlocal contrastive objectives. In Proceedings of the International Conference on Machine Learning, 2010.
 Wainwright and Jordan (2008) M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1 – 2):1–305, 2008.
 Wu et al. (2018) Ying Nian Wu, Jianwen Xie, Yang Lu, and SongChun Zhu. Sparse and deep generalizations of the frame model. Annals of Mathematical Sciences and Applications, 3(1):211–254, 2018.
 Zhang et al. (2018) Linfeng Zhang, Weinan E, and Lei Wang. MongeAmpère flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.

Zhu and Mumford (1998)
Song Chun Zhu and David Mumford.
Grade: Gibbs reaction and diffusion equations.
In
Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271)
, pages 847–854. IEEE, 1998.
Appendix A Proof of Theorems in sec:mle
thm:grad_flow (HMC embeddings as gradient flow) For a continuous time with infinitesimal stepsize , the density of the particles , denoted as , follows FokkerPlanck equation
(37) 
with . Then as . The first part of the theorem is trivial. When , the HMC follows the dynamical system
By applying the FokkerPlanck equation, we obtain
(38) 
To show that the stationary distribution of such dynamical system converges to , recall the fact that
(39) 
The FokkerPlanck equation can be rewritten as
(40) 
Substitute into (40) and notice
we have , , is the stationary distribution.