The exponential family is one of the most important classes of distributions in statistics and machine learning. It possesses a number of useful properties(Brown, 1986) and includes many commonly used distributions, and the notably powerful undirected graphical models (Wainwright and Jordan, 2008)2006; Wu et al., 2018), , Markov random fields (Kinderman and Snell, 1980), conditional random fields (Lafferty et al., 2001), and language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013). Despite the flexibility of the exponential family, a majority of models in this family are intractable due to the partition function having no known analytic form. This leads to difficulties in evaluating, sampling and learning of the model, which hinders its application in practice. In this paper, we try to provide an answer to the longstanding fundamental question:
How to design a simple yet effective algorithm for estimating the exponential family distributions?
There have been many attempts in the literature to answer this question. Since the maximum likelihood estimator (MLE) has already been well-studied with many desirable statistical properties, , consistency and asymptotic unbiasedness and normality (Brown, 1986), a major research direction lies in approximating the MLE of the general exponential family. Contrastive divergence (CD) (Hinton, 2002) and its variants (Tieleman and Hinton, 2009; Du and Mordatch, 2019)
are representative algorithms: rather than evaluating the gradient of the partition function, which also involves an intractable integral, CD exploits a stochastic gradient estimator with samples generated from a few Markov chain Monte Carlo (MCMC) steps. The major demerits of the CD algorithm and its variants are twofold:i), it requires careful design of the transition kernel in MCMC, which is not a trivial task itself; more severely, ii), the stochastic gradient is biased, which may lead to an inferior estimate.
Because of these difficulties with MLE, many other learning approaches have been introduced, including pseudo-likelihood (Besag, 1975), score matching (Hyvärinen, 2005), noise-contrastive estimation (Gutmann and Hyvärinen, 2010; Ceylan and Gutmann, 2018), and minimum probability flow (Sohl-Dickstein et al., 2011). Instead of approximating the MLE, these algorithms introduce alternative learning objectives that naturally sidestep the calculation of the partition function.
Pseudo-likelihood estimators (Besag, 1975)
factorize the joint distribution by the product of conditional distributions, in which each factor is the distribution of a single random variable conditioned on the others. Even these conditional distributions can lack partition functions in analytic form for general exponential family distributions, however.
Score matching (Hyvärinen, 2005) instead minimizes the Fisher divergence between the empirical distribution and the model, in order to learn the parameters. The Fisher divergence involves only the derivatives of the potential function in the exponential family, and does not require computation of the partition function. Fisher score matching method requires the calculation of third order derivatives for the optimization procedure, however, which can be prohibitive for complex models (Kingma and LeCun, 2010).
recasts the problem as ratio estimation between the target distribution and a pre-defined auxiliary noise distribution. The ratio can then be estimated by logistic regression. The quality of the estimator and its statistical properties strongly rely on the choice of the auxiliary noise distribution, which needs to cover the support of the data, have an analytical expression, and be easy to sample. Such requirements are unlikely to be satisfied for many applications, especially in high dimensions.
Minimum probability flow (Sohl-Dickstein et al., 2011) leverages the observation that in an ideal case, the empirical distribution is the stationary distribution of the dynamics constructed from the optimal exponential family. MPF then estimates the model by matching these two distributions. Since the parameter in the model is implicitly represented through the dynamics, the objective also avoids the calculation of the partition function. While the idea behind the MPF is inspiring, the challenge remains to construct sampling dynamics for efficient learning.
In this paper, we introduce a novel algorithm, Adversarial Dynamics Embedding (ADE), for learning and sampling from a maximum likelihood estimator, with a view to both computational and statistical efficiency. Our algorithm is largely motivated by the recent work of Dai et al. (2018a), who exploit a primal-dual view of the MLE that allows to bypass the computation of the intractable -partition function. Such a view provides a natural objective for jointly learning both a sampler and a model, as an alternative to the expensive and biased MCMC steps in the CD algorithm. The parametrization of the dual sampler should be flexible to reduce the extra approximation error and density tractable for entropy computation in the objective. Dai et al. (2018a) consider the transport mapping for the dual distribution, leading to difficulty in entropy estimation, and thus requiring extra auxiliary distribution to be learned with additional computational and memory cost.
Inspired by the properties of Hamiltonian Monte-Carlo (HMC) (Neal et al., 2011), ,
HMC forms stationary distribution with independent potential and kinetic variables;
HMC can approximate the exponential family arbitrary closely,
we consider the augmented distribution with auxiliary kinetic variable as latent variable subsec:argumented_mle and introduce dynamics embedding to parametrize the sampler by mimicking the sampling algorithms in subsec:dyna_embed. Due to the independence between potential and kinetic variables, the posterior of latent kinetic variable has explicit closed-form, eliminating the extra variational inference for latent kinetic variable. The dynamic embedding represents the dual samplers with tractable entropy via the primal model parameters, and thus enrich the flexibility of sampler without introducing extra parameters. Finally, the coupled primal model and dual sampler will be learned adversarially, promoting the learning sample efficiency as described in subsec:coupled_learning.
We further demonstrate that the proposed estimator contains CD, pseudo-likelihood, score matching, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow as special cases with hand-designed dual samplers in sec:connection. Our numerical experiments in sec:experiments show that it also outperforms existing state-of-the-art estimators empirically.
We provide a brief introduction to the technical background that is needed in the derivation of the new algorithm, including exponential family, dynamics-based MCMC sampler, and primal-dual view of MLE.
2.1 Exponential Family and Energy-based Model
The natural form of the exponential family over is defined as
where and . The sufficient statistic
can be general parametric model, , neural networks. Theare the parameters that can be learned from observed data. The exponential family has be generalized to infinite dimensional case by settinig belongs to some RKHSs (Sriperumbudur et al., 2017; Sutherland et al., 2017; Dai et al., 2018a). The energy-based model (LeCun et al., 2006) can be viewed as a special case of the exponential family by setting with .
Traditionally, the is prefixed and covers the support , which is difficult in practical high-dimension problems. In this paper, we focus on learning with together, ,
Given samples and denoting as the valid parametrization family, the maximum log-likelihood estimation of the exponential family can be written as
which leads to the gradient as
Since and are intractable, solving the MLE of general exponential family is extremely difficult.
2.2 Dynamics-based MCMC
The dynamics-based MCMC is a general and effective tool for inference. The basic idea of dynamics-based MCMC is to represent the target distribution as the solution to some (stochastic) differential equations. Therefore, samples of the target distribution can be obtained by simulating along the dynamics defined by the differential equations.
Hamiltonian Monte-Carlo (HMC) (Neal et al., 2011) is a representative algorithm in this category, which exploits the well-known Hamiltonian dynamics. Specifically, given a target distribution , the Hamiltonian is defined as
where is the kinetic energy. The Hamiltonian dynamics generate over time , following
Asymptotically as , the follows the target distribution. In practical HMC, to reduce the error from discretization, an acceptance-rejection step can be introduced.
2.3 The Primal-Dual View of MLE
To address the intractability of the -partition function in MLE (3), the Fenchel duality of has been exploited (Rockafellar, 1970; Wainwright and Jordan, 2008; Dai et al., 2018a), , [Fenchel dual of log-partition, Proposition 5.1 (Wainwright and Jordan, 2008)] Let , we have:
where denotes the space of distributions, .
Plugging the Fenchel dual of into the MLE (3), we achieve a saddle-point reformulation of as
which bypasses the explicit computation of the partition function. Another by-product of the primal-dual view is that the dual distribution can be used in the inference stage, which usually requires executing relative expensive sampling algorithms in vanilla estimators.
The dual sampler plays a vital role in the primal-dual formulation of the MLE for exponential family estimation in (9). To achieve better performance, we have several principal requirements in parameterizing the dual distribution:
the parametrization family needs to be flexible enough to diminish the error from solving the inner minimization problem;
the entropy of the parametrized dual distribution should be tractable.
Moreover, as shown in thm:fenchel_log_partition, the optimal dual sampler is totally decided by primal model ,
the parametrized dual sampler should explicitly incorporate the primal model .
Such dependence will reduce both the memory cost and sample complexity in the learning procedure. In the ideal case, the dual sampler is solely represented via , and thus, no extra auxiliary component for dual sampler need to be learned.
A wide spectrum of techniques for distribution parametrization has been developed, such as the reparametrized latent variable models (Kingma and Welling, 2013; Rezende et al., 2014), transport mapping (Goodfellow et al., 2014), and normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016), . However, none of these off-the-shelf parametrization satisfies the requirements about flexibility and density tractability simultaneously, not even to mention the dependency requirement. Specifically, the density value in transport mapping is intractable. On the other hand, the reparametrized trick restricts the distributions lie in known density form and the flow-based models rely on invertible operations, both limiting the expressive ability. Meanwhile, the reparametrized latent variable model requires extra variational posterior to be learned.
3 Adversarial Dynamics Embedding
In this section, we consider augmenting the original exponential family with kinetic variables. The primal-dual view of MLE for such new augmented model will naturally introduce an adversarial dual sampler with momentums as the latent variables that can be parametrized by dynamic embedding, satisfying all three requirements above without extra variational distribution introduced.
We start with the embeddings of classical Hamiltonian dynamics (Neal et al., 2011; Caterini et al., 2018) for dual distribution parametrization as an example, and discuss its generalized dynamics in latent space and the stochastic Langevin dynamics. This technique can be extended to other dynamics, each with its own advantages. We provide these extensions in Appendix B.
3.1 Primal-Dual View of Augmented MLE
As we discussed, it is difficult to find a parametrization of in (9) satisfying all the three requirements simultaneously. Therefore, instead of directly working on (9) for original model, inspired with HMC, we consider the augmented exponential family with auxiliary momentum variable, ,
The MLE of such model is
where the last equation comes from the definition of true posterior . By exploiting the independence between and in defined in (10), we have Applying the Fenchel dual to of the augmented model (10), we derive the primal-dual view of (11), leading to the objective,
Remark (momentum as latent variable):
The in (12) contains latent variable . One can exploit latent variable model for in (9). However, the in (9) requires marginalization, which is intractable in general cases, and usually estimated via variational inference with extra posterior introduced. By directly considering the specifically designed augmented model, (12) eliminates such extra variational steps.
Similarly, one can introduce multiple momentums as the hierarchical latent variable augmented model, ,
leading to the objective as
3.2 Representing Dual Sampler via Primal Model
In this section, we introduce the Hamiltonian dynamic embedding representation of distributions for dual sampler, as well as its generalization and special instantiation.
The vanilla HMC is derived by discretizing the Hamiltonian dynamics defined in (6) with leapfrog integrator. Specifically, in a single time step, the sample moves according to
where is defined as the leapfrog stepsize. Denote the one-step leapfrog as and assume the , after iterations, we obtain
which can be viewed as a neural network with special architecture. We term this Hamiltonian (HMC) dynamics embedding, by which the dual sampler is represented by the primal model, , the potential function .
The flexibility of the HMC embedding for parameterizing the distribution actually comes from the nature of the dynamics-based samplers. In the limit the case, the proposed neural network (15) reduces to a gradient flow, whose stationary distribution is exactly the model distribution
Formally, the approximation ability of the HMC embedding can be justified, [HMC embeddings as gradient flow] For a continuous time with infinitesimal stepsize , the density of the particles , denoted as , follows Fokker-Planck equation
with . Then as . Due to the space limit, the details of the proofs are postpone in appendix:derivation_sec_mle. As we can see, the neural network composed by HMC embedding can approximate well to exponential family distributions on continuous variables.
Remark (Generalized HMC dynamics in latent space):
the leapfrog operation in the vanilla HMC is directly working in the original observation space, which is high-dimensional and noisy. We generalize the leapfrog update rule in latent space and form the new dynamics, ,
where denotes the element-wise product. Specifically, the and rescales and coordinately. The can be understood as projecting the gradient information to the an essential latent space. Then, for updateing , the latent momentum is projected back to original space by . With such generalized leapfrog updates, the dynamic system avoids operating the high-dimensional noisy input and become more efficient. We emphasize that the proposed generalized leapfrog parametrization (17) is different from the one used in Levy et al. (2017), which is inspired by the real-NVP flow (Dinh et al., 2016).
By embedding the generalized HMC (17), we have a flexible layer . The will be learned in addition to the stepsize. In fact, the vanilla HMC layer is a special case of the by setting as zero and as identity function.
Remark (Stochastic Langevin dynamics):
we can recover the stochastic Langevin dynamics from leapfrog step by resampling momentum in every step. Specifically, the sample moves according to
where . As we can see, the stochastic Langevin dynamics resample to replace the momentum in leapfrog (14), ignoring the gradient accumulation. After -iteration, we obtain
as the derived neural network. Similarly, we can also generalize the stochastic Langevin updates by introducing and in corresponding positions.
On the other hand, the density value of the proposed neural network is also tractable, leading to tractable entropy estimation in (12) and (13). In particular, we have [Density value evaluation] If , after vanilla HMC steps (14), we have
For the from the generalized leapfrog steps (17), we have
where and denotes
For the from the stochastic Langevin dynamics (20) with , we have
The proof of thm:hmc_density can be found in appendix:derivation_sec_mle.
As we can see, the proposed dynamics embedding satisfies all three requirements: it covers a flexible family of distributions with tractable density value, meanwhile, it couples the dual sampler with primal model, which leads to memory and sample efficient learning algorithm, as we will introduce in next section.
3.3 Coupled Model and Sampler Learning
Plug the -step Hamiltonian dynamics embedded parametrization (14) or (17) into the primal-dual MLE of the augmented model (12) and apply the density value evaluation in (22) and (23), respectively, we obtain our ultimate adversarial objective,
Here denotes the learnable components in the dynamics embedding, , initialization , the stepsize in the HMC/Langevin updates, and the adaptive part in the generalized HMC, which is learned adversarially. The parametrization of initial distribution is discussed in appendix:practical. For the -step stochastic Langevin dynamics embedded parametrization (20), we apply the density value (25) to (13), which leads to a similar saddle-point optimization with hierarchical momentums.
We can now use stochastic gradient descent to estimatefor the exponential families as well as the the parameters in the dynamics embedding adversarially. Note that the generated sample is a function of , the gradient w.r.t. should also take these variables into account as back-propagation through time (BPTT), ,
We illustrate the MLE via HMC adversarial dynamics embedding (ADE) in alg:mle_de.
The same technique can be applied to other parametrization of dual sampler via alternative dynamics embeddings, as listed in appendix:variants, leading to different architectures and the entropy term evaluations. Consider the dynamics embedding as an adaptive sampler that can be automatically learned w.r.t. different models and datasets, then, the updates for in can be understood as learning to sample.
4 Connections to Other Estimators
We establish the connections between the proposed estimator with existing approaches. In summary, the proposed algorithm is equipped with an adaptive sampler; while the existing algorithms can be recast as the special cases of our estimator with some hand-designed samplers, and thus, may lead to extra error and inferior solution.
4.1 Connection to Contrastive Divergence
The CD algorithm (Hinton, 2002) is a special case of the proposed algorithm. By thm:fenchel_log_partition, the optimal solution to the inner optimization is . Applying Danskin’s theorem (Bertsekas, 1995), the gradient of w.r.t. is
To estimate the integral , the CD algorithm approximates the negative term in (28) stochastically with a finite MCMC step away from empirical data.
In the proposed dual sampler, by setting to be the empirical distribution and eliminating the sampling learning, the dynamic embedding will collapse to CD with -HMC steps if we remove gradient through the sampler, , ignoring the third term in (27). Similarly, the persistent CD (PCD) (Tieleman, 2008) and recent ensemble CD (Du and Mordatch, 2019) can also be recast as special cases by setting the negative sampler to be MCMC with initial samples from previous model and ensemble of MCMC samplers, respectively.
From this perspective, the CD and PCD algorithms induce errors not only from the sampler, but also from the gradient back-propagation truncation. The proposed algorithm escapes these sources of bias by learning to sample, and by adopting true gradients, respectively. Therefore, the proposed estimator is expected to achieve better performance than CD as demonstrated in the empirical experiments sec:real_exp.
4.2 Connection to Score Matching
The score matching (Hyvärinen, 2005) estimates the exponential family by minimizing the Fisher divergence, leading to optimization,
As explained in Hyvärinen (2007), the objective (29) can be derived as the nd-order Taylor approximation of the MLE with -step Langevin Monte Carlo as the dual sampler. Specifically, the Langevin Monte Carlo generates samples via
then, a simple Taylor expansion gives
Plug such into the negative expectation in , leading to
which is exactly the defined in (29).
Therefore, the score matching can be viewed as applying Taylar expansion approximation with fixed -step Langevin sampler in our framework, which is compared in sec:synthetic_exp.
4.3 Connection to Pseudo-Likelihood and Conditional Composite Likelihood
The pseudo-likelihood estimation (Besag, 1975) is a special case of the proposed algorithm by restricting the parametrization of the dual distribution. Specifically, denote the with , instead of directly maximizing likelihood, the pseudo-likelihood estimator is maximizing
Then, the is updated by the following the gradient of , ,
The pseudo-likelihood estimator can be recast as a special case of the proposed framework if we fix the dual sampler as i), sample uniformly; ii), sample and mask ; iii), sample and compose .
The conditional composite likelihood (Lindsay, 1988) is a generalization of pseudo-likelihood by maximizing
where and . Similarly, the composite likelihood is updating with prefixed conditional block sampler for negative sampling.
Same as CD, the prefixed sampler and the biased gradient in pseudo-likelihood and composite likelihood estimator will induce extra errors and lead to inferior solution. Moreover, the pseudo-likelihood may not applicable to the general exponential family with continuous variables, whose conditional distribution is also intractable.
4.4 Connection to Non-local Contrastive Objectives
The non-local contrastive estimator (Vickrey et al., 2010) is obtained by maximizing
where denotes some prefixed partition of , , and with . The objective (32) leads to the update direction as
where with as the empirical distribution and , . Therefore, the non-local contrastive objective is a special case of the proposed framework with the dual sampler as i), sample uniformly from ; ii), sample conditional on according to ; iii), sample within . Such negative sampling method is also not applicable to the general exponential family with continuous variables.
4.5 Connection to Minimum Probability Flow
In the continuous state model, the minimum probability flow (Sohl-Dickstein et al., 2011) estimates the exponential family by maximizing
where is a hand-designed symmetric transition kernel based on the potential function , , Hamiltonian or Langevin simulation. Then, the MPF update direction can be rewritten as
where . The probability flow operator actually defines a Markov chain sampler that achieves the following balance equation,
Similar to CD and score matching, the MPF exploits the -step MCMC. Moreover, the gradient in MPF also considers the effects in sampler as the third term in (34). Therefore, the MPF can be recast as a special case of our algorithm with the prefixed dual sampler as and .
4.6 Connection to Noise-Contrastive Estimator
Instead of directly estimating the in the exponential family, Gutmann and Hyvärinen (2010) propose the noise-contrastive estimation (NCE) for the density ratio between the exponential family and some user defined reference distribution , from which the parameter can be reconstructed. Specifically, the NCE considers an alternative representation of exponential family distribution as , which explicitly enforces . The NCE is obtained by maximizing
where . Then, we have the gradient of as
The negative sampler in the eq:nce_grad can be understood as an approximate importance sampling algorithm where the proposal is and the rewighting part is . As the approaching , the will approach the true ratio , and thus, the negative samples will converge to true model samples.
The NCE can be understood as learning an important sampler. However, the performance of NCE highly relies on the quality , , the choice of . It is required to cover the support of , which is non-trivial in practical high-dimensional applications.
5 Related Work
The parametrization of the dual sampler should be both flexible enough and density tractable to achieve better performance. Pioneering works are limited in either one aspect or other. Kim and Bengio (2016) parameterize the sampler via a deep directed graphical model, whose approximation ability is limited to known distributions. Meanwhile, they fit by minimizing the reverse -divergence to avoid the entropy calculation, leading to unclear relationship to MLE. Due to the difficulty of the entropy term for general transport mapping parametrization, a variety of approximate surrogates have been proposed to relax the density value tractability requirement. Liu and Wang (2017) learn the sampler to mimic the Stein variational gradient descent sampling procedure; Dai et al. (2017)
propose algorithms relying on either a heuristic approximation or a upper bound of the entropy, with extra auxiliary component introduced to be learned;Dai et al. (2018a)
apply a second Fenchel dual representation to reformulate the entropy term, at the cost of introducing another auxiliary function to be estimated. Meanwhile, the second Fenchel duality parametrization relies on a proposal distribution with the same support for numerical stability, which is impractical for high-dimensional data. In contrast to these existing methods, the proposed dynamics embedding achieves both flexibility and tractability of entropy estimation with less independent auxiliary parameters introduced. Moreover, the dynamics embedding avoids the design of proposal, which is naturally suitable for high-dimension data.
One of our major contributions is learning a sampling strategy for the exponential family estimation through the primal-dual view of MLE. The proposed algorithm shares some similarities with recent advances in meta learning for sampling (Levy et al., 2017; Feng et al., 2017; Song et al., 2017; Gong et al., 2018), in which the sampler is parametrized via neural network and will be learned through certain objectives. However, we emphasize that the most significant difference lies in the ultimate goals: we focus on exponential family model estimation and the learned sampler is only introduced to assist with this objective. By contrast, the learning to sample techniques are targeting on learning a fixed model that is already given. This fundamentally distinguishes the proposed ADE from methods that only learns samplers, leading to totally different learning criterion and algorithm updates, , the primal model will be learned back through the learned sampler, from which perspective the proposed algorithm can be understood as meta-learning.
6.1 Synthetic experiments
|(a) 2spirals||(b) Banana||(c) circles||(d) cos||(e) Cosine||(f) Funnel||(g) swissroll|
|(h) line||(i) moons||(j) Multiring||(k) pinwheel||(l) Ring||(m) Spiral||(n) Uniform|
Learned samplers in odd row and potential functionin even row from different synthetic datasets.
We parametrize the potential function
with fully connected multi-layer perceptron with 3 hidden layers. Each hidden layer has
hidden units. We use ReLU to do the nonlinear activation in each hidden layer. For baseline methods, we compare with score matching (SM) estimator(Hyvärinen, 2005) and primal-dual MLE with the normalizing flow (Rezende and Mohamed, 2015) parametrized dual sampler (NF). Note that although the normalizing flow can be used to perform maximum likelihood estimation (MLE) to fit the data (Rezende and Mohamed, 2015), here the flow is asked to help train the exponential model . Using normalizing flow to learn the exponential family model can be viewed as the special case of our method, which has only parameterized as flow, but no HMC dynamics is involved. Thus this is also served as one ablation study for the proposed ADE.
For our method, we use normalizing flow as the initial proposal distribution . Then we perform several steps of stochastic Langevin operation to get the final samples . is the distribution for sampling momentum. We clip the norm of when updating , and clip when updating . The coefficient in (26) is tuned in . For the NF baseline, we tune the number of layers in . For our ADE, we fix the number of normalizing flow layers to be , and then perform at most steps of dynamics updates. So finally, the number of steps for sampling is comparable, while the ADE maintains less memory cost.
In this section, we compare the proposed ADE with other baseline methods on several synthetic datasets. For visualization purpose, all the samples are in space. The dataset generators are collect from several open-source projects 111https://github.com/rtqichen/ffjord 222https://github.com/kevin-w-li/deep-kexpfam. During training, we use this generator to generate the data from the true distribution on the fly. To get a quantitative comparison, we also generate 1,000 data samples for held-out evaluation.
In fig:synthetic, we visualize the learned distribution using both the learned dual sampler and also the exponential model , where is a constant that is tuned within [0.01, 10]. Overall one can see the sampler almost perfectly recover the distribution, and the learned almost captures the landscape of the distribution. We also visualize the convergence behavior in fig:synthetic_sampler. We can see the samples are converging to the true data distribution smoothly. As the learned sampler depends on the , so this figure also suggests the nice convergence behavior of indirectly.
The quantitative comparison of the sampler is illustrated in tab:synthetic_mmd. For NF and ADE, we use 1,000 samples from their sampler to calculate MMD with gaussian kernel. The kernel bandwidth is chosen using median trick (Dai et al., 2016). For SM, since there is no such sampler available, we directly use vanilla HMC to get sample from the learned model , and use them to estimate MMD. From the table we can see, our approach gets the best MMD in all cases. This demonstrates the effectiveness of dynamics embedding, as comparing to only using the normalizing flow model.
6.2 Real-world Image Datasets
|(a) Samples on MNIST||(b) Histogram on MNIST||(c) Samples on CIFAR-10||(d) Histogram on CIFAR-10|
To further verify that the proposed adversarial dynamics embedding can learn complicated exponential family and the corresponding sampler well for real-world images, we trained models on MNIST and CIFAR-10 dataset. For both datasets, we use the CNN architecture for discriminator as used in Miyato et al. (2018), and spectral normalization is added to all discriminator layers. Specifically, for the discriminator in CIFAR-10 experiments, we replace all downsampling operations by average pooling, as in Du and Mordatch (2019). The detailed architectures are shown in subsec:real_archi.
We parametrize the initial distribution with neural nets as deep Gaussian latent variable model (Deep LVM) specified in appendix:practical. After several HMC steps on this initial distribution, instead of clipping the images, we pass the sample to function to obtain the final sampled images.
We used the standard spectral normalization on the discriminator to stabilize the training process, and Adam with learning rate to optimize our model. We trained the models with iterations with batch size being . The step sizes for our HMC sampler are independently learned for all HMC dimensions but shared among all time steps, and the values are all initialized to . We set the number of HMC steps to . The coefficient of the entropy regularization term is set to and that of the
regularization on the momentum vector in the last HMC step is set to.
|WGAN-GP (Gulrajani et al., 2017)||6.50|
|Spectral GAN (Miyato et al., 2018)||7.42|
|Langevin PCD (Du and Mordatch, 2019)||6.02|
|Langevin PCD (10 ensemble) (Du and Mordatch, 2019)||6.78|
|ADE: Deep LVM init w/o HMC||6.51|
|ADE: Deep LVM init w/ HMC||7.43|
We report the inception scores in tab:inception_score. For our ADE, we trained two models with Deep LVM as the initial dual sampler, one with HMC samplers and one without that. We observed that the HMC sampler greatly improves the performance of the samples generated by initial Deep LVM along and enables the generator to be comparable with Spectral GAN. Moreover, we note that the initial Deep LVM helps sample generation, compared to the Langevin PCD models reported in (Du and Mordatch, 2019).
We show some samples of generated images in fig:real_images(a) and (c). More sampled images can be found in subsec:more_images. In addition, we also plot the potential distribution (unnormalized) of generated samples and that of the real images on MNIST and CIFAR-10
(using 1000 data points for each) in fig:real_images(b) and (d). The distributions of energies of both generated images and real iamges mostly overlap with each other, showing that the discriminators learn the desirable distributions. Also, with simple importance sampling and proposal distribution being uniform distribution on( is the dimension of images), the log likelihood (in nats) on CIFAR-10 is estimated to be around .
We proposed a novel method to efficiently construct maximum likelihood estimator (MLE) for general exponential families. Specifically, by utilizing the nice properties of primal-dual formulation of the MLE of the augmented exponential family distribution with kinetic variables, we incorporate dynamics-based distribution parametrization for dual sampler into the estimation process in a fully differentiable way. Such coupling of the primal model and the dual sampler benefits each other, and thus, achieving both better model estimation and efficient inference at the same time. Our empirical results on both synthetic and real data validate the feasibility of our proposed dynamics embedding. In future work, we will extend this method to other graphical model learning problems.
NH is supported in part by NSF-CRII-1755829, NSF-CMMI-1761699, and NCSA Faculty Fellowship. LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF IIS-1841351 EAGER, NSF CCF-1836822, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA, Amazon AWS, Siemens and Google Cloud.
- Bertsekas (1995) D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
- Besag (1975) J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179–195, 1975.
- Boutsidis et al. (2017) Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, Eugenia-Maria Kontopoulou, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533:95–117, 2017.
- Brown (1986) Lawrence D. Brown. Fundamentals of Statistical Exponential Families, volume 9 of Lecture notes-monograph series. Institute of Mathematical Statistics, Hayward, Calif, 1986.
- Caterini et al. (2018) Anthony L Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational auto-encoder. arXiv preprint arXiv:1805.11328, 2018.
- Ceylan and Gutmann (2018) Ciwan Ceylan and Michael U Gutmann. Conditional noise-contrastive estimation of unnormalised models. arXiv preprint arXiv:1806.03664, 2018.
Dai et al. (2016)
Bo Dai, Niao He, Hanjun Dai, and Le Song.
Provable bayesian inference via particle mirror descent.In
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 985–994, 2016.
- Dai et al. (2018a) Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. 2018a. URL http://arxiv.org/abs/1811.02228.
- Dai et al. (2018b) Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao Xiao, and Le Song. Coupled variational bayes via optimization embedding. In NIPS, 2018b.
- Dai et al. (2017) Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.
- Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Du and Mordatch (2019) Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
- Feng et al. (2017) Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized stein variational gradient descent. arXiv preprint arXiv:1707.06626, 2017.
- Gong et al. (2018) Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient mcmc. arXiv preprint arXiv:1806.04522, 2018.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Grathwohl et al. (2018) Will Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
- Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
- Han et al. (2015) Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation through stochastic chebyshev expansions. In International Conference on Machine Learning, pages 908–917, 2015.
- Hinton (2002) Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
- Hyvärinen (2005) A. Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.
- Hyvärinen (2007) Aapo Hyvärinen. Some extensions of score matching. Computational statistics & data analysis, 51(5):2499–2512, 2007.
- Kim and Bengio (2016) Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
- Kinderman and Snell (1980) R. Kinderman and J. L. Snell. Markov Random Fields and their applications. Amer. Math. Soc., Providence, RI, 1980.
- Kingma and Dhariwal (2018) Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
- Kingma and LeCun (2010) Diederik P Kingma and Yann LeCun. Regularized estimation of image statistics by score matching. In NIPS, 2010.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
- Lafferty et al. (2001) J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic modeling for segmenting and labeling sequence data. In Proceedings of International Conference on Machine Learning, volume 18, pages 282–289, San Francisco, CA, 2001. Morgan Kaufmann.
- LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
- Levy et al. (2017) Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
- Lindsay (1988) B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221–239, 1988.
- Liu and Wang (2017) Qiang Liu and Dilin Wang. Learning deep energy models: Contrastive divergence vs. amortized mle. arXiv preprint arXiv:1707.00797, 2017.
- Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- Mnih and Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.
- Mnih and Teh (2012) Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
- Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
Rezende et al. (2014)
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278–1286, 2014.
- Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Rockafellar (1970) R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton, NJ, 1970.
- Sohl-Dickstein et al. (2011) Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 905–912. Omnipress, 2011.
- Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
- Sriperumbudur et al. (2017) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017.
- Sutherland et al. (2017) Dougal J Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Efficient and principled score estimation with nystr” om kernel exponential families. arXiv preprint arXiv:1705.08360, 2017.
Training restricted Boltzmann machines using approximations to the likelihood gradient.In Proceedings of the International Conference on Machine Learning, 2008.
- Tieleman and Hinton (2009) Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.
- Vickrey et al. (2010) David Vickrey, Cliff Chiung-Yu Lin, and Daphne Koller. Non-local contrastive objectives. In Proceedings of the International Conference on Machine Learning, 2010.
- Wainwright and Jordan (2008) M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1 – 2):1–305, 2008.
- Wu et al. (2018) Ying Nian Wu, Jianwen Xie, Yang Lu, and Song-Chun Zhu. Sparse and deep generalizations of the frame model. Annals of Mathematical Sciences and Applications, 3(1):211–254, 2018.
- Zhang et al. (2018) Linfeng Zhang, Weinan E, and Lei Wang. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.
Zhu and Mumford (1998)
Song Chun Zhu and David Mumford.
Grade: Gibbs reaction and diffusion equations.
Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 847–854. IEEE, 1998.
Appendix A Proof of Theorems in sec:mle
thm:grad_flow (HMC embeddings as gradient flow) For a continuous time with infinitesimal stepsize , the density of the particles , denoted as , follows Fokker-Planck equation
with . Then as . The first part of the theorem is trivial. When , the HMC follows the dynamical system
By applying the Fokker-Planck equation, we obtain
To show that the stationary distribution of such dynamical system converges to , recall the fact that
The Fokker-Planck equation can be rewritten as
Substitute into (40) and notice
we have , , is the stationary distribution.
thm:hmc_density (Density value evaluation) If , after vanilla HMC steps (14), we have