Exponential Family Estimation via Adversarial Dynamics Embedding

04/27/2019 ∙ by Bo Dai, et al. ∙ Google 28

We present an efficient algorithm for maximum likelihood estimation (MLE) of the general exponential family, even in cases when the energy function is represented by a deep neural network. We consider the primal-dual view of the MLE for the kinectics augmented model, which naturally introduces an adversarial dual sampler. The sampler will be represented by a novel neural network architectures, dynamics embeddings, mimicking the dynamical-based samplers, e.g., Hamiltonian Monte-Carlo and its variants. The dynamics embedding parametrization inherits the flexibility from HMC, and provides tractable entropy estimation of the augmented model. Meanwhile, it couples the adversarial dual samplers with the primal model, reducing memory and sample complexity. We further show that several existing estimators, including contrastive divergence (Hinton, 2002), score matching (Hyvärinen, 2005), pseudo-likelihood (Besag, 1975), noise-contrastive estimation (Gutmann and Hyvärinen, 2010), non-local contrastive objectives (Vickrey et al., 2010), and minimum probability flow (Sohl-Dickstein et al., 2011), can be recast as the special cases of the proposed method with different prefixed dual samplers. Finally, we empirically demonstrate the superiority of the proposed estimator against existing state-of-the-art methods on synthetic and real-world benchmarks.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 12

page 14

page 25

page 26

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The exponential family is one of the most important classes of distributions in statistics and machine learning. It possesses a number of useful properties 

(Brown, 1986) and includes many commonly used distributions, and the notably powerful undirected graphical models (Wainwright and Jordan, 2008)

and energy-based models 

(LeCun et al., 2006; Wu et al., 2018), , Markov random fields (Kinderman and Snell, 1980), conditional random fields (Lafferty et al., 2001), and language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013). Despite the flexibility of the exponential family, a majority of models in this family are intractable due to the partition function having no known analytic form. This leads to difficulties in evaluating, sampling and learning of the model, which hinders its application in practice. In this paper, we try to provide an answer to the longstanding fundamental question:

How to design a simple yet effective algorithm for estimating the exponential family distributions?

There have been many attempts in the literature to answer this question. Since the maximum likelihood estimator (MLE) has already been well-studied with many desirable statistical properties, , consistency and asymptotic unbiasedness and normality (Brown, 1986), a major research direction lies in approximating the MLE of the general exponential family. Contrastive divergence (CD) (Hinton, 2002) and its variants (Tieleman and Hinton, 2009; Du and Mordatch, 2019)

are representative algorithms: rather than evaluating the gradient of the partition function, which also involves an intractable integral, CD exploits a stochastic gradient estimator with samples generated from a few Markov chain Monte Carlo (MCMC) steps. The major demerits of the CD algorithm and its variants are twofold:

i), it requires careful design of the transition kernel in MCMC, which is not a trivial task itself; more severely, ii), the stochastic gradient is biased, which may lead to an inferior estimate.

Because of these difficulties with MLE, many other learning approaches have been introduced, including pseudo-likelihood (Besag, 1975), score matching (Hyvärinen, 2005), noise-contrastive estimation (Gutmann and Hyvärinen, 2010; Ceylan and Gutmann, 2018), and minimum probability flow (Sohl-Dickstein et al., 2011). Instead of approximating the MLE, these algorithms introduce alternative learning objectives that naturally sidestep the calculation of the partition function.

Pseudo-likelihood estimators (Besag, 1975)

factorize the joint distribution by the product of conditional distributions, in which each factor is the distribution of a single random variable conditioned on the others. Even these conditional distributions can lack partition functions in analytic form for general exponential family distributions, however.

Score matching (Hyvärinen, 2005) instead minimizes the Fisher divergence between the empirical distribution and the model, in order to learn the parameters. The Fisher divergence involves only the derivatives of the potential function in the exponential family, and does not require computation of the partition function. Fisher score matching method requires the calculation of third order derivatives for the optimization procedure, however, which can be prohibitive for complex models (Kingma and LeCun, 2010).

Noise-contrastive estimation (Gutmann and Hyvärinen, 2010; Ceylan and Gutmann, 2018)

recasts the problem as ratio estimation between the target distribution and a pre-defined auxiliary noise distribution. The ratio can then be estimated by logistic regression. The quality of the estimator and its statistical properties strongly rely on the choice of the auxiliary noise distribution, which needs to cover the support of the data, have an analytical expression, and be easy to sample. Such requirements are unlikely to be satisfied for many applications, especially in high dimensions.

Minimum probability flow (Sohl-Dickstein et al., 2011) leverages the observation that in an ideal case, the empirical distribution is the stationary distribution of the dynamics constructed from the optimal exponential family. MPF then estimates the model by matching these two distributions. Since the parameter in the model is implicitly represented through the dynamics, the objective also avoids the calculation of the partition function. While the idea behind the MPF is inspiring, the challenge remains to construct sampling dynamics for efficient learning.

In this paper, we introduce a novel algorithm, Adversarial Dynamics Embedding (ADE), for learning and sampling from a maximum likelihood estimator, with a view to both computational and statistical efficiency. Our algorithm is largely motivated by the recent work of Dai et al. (2018a), who exploit a primal-dual view of the MLE that allows to bypass the computation of the intractable -partition function. Such a view provides a natural objective for jointly learning both a sampler and a model, as an alternative to the expensive and biased MCMC steps in the CD algorithm. The parametrization of the dual sampler should be flexible to reduce the extra approximation error and density tractable for entropy computation in the objective. Dai et al. (2018a) consider the transport mapping for the dual distribution, leading to difficulty in entropy estimation, and thus requiring extra auxiliary distribution to be learned with additional computational and memory cost.

Inspired by the properties of Hamiltonian Monte-Carlo (HMC) (Neal et al., 2011), ,

  • HMC forms stationary distribution with independent potential and kinetic variables;

  • HMC can approximate the exponential family arbitrary closely,

we consider the augmented distribution with auxiliary kinetic variable as latent variable subsec:argumented_mle and introduce dynamics embedding to parametrize the sampler by mimicking the sampling algorithms in subsec:dyna_embed. Due to the independence between potential and kinetic variables, the posterior of latent kinetic variable has explicit closed-form, eliminating the extra variational inference for latent kinetic variable. The dynamic embedding represents the dual samplers with tractable entropy via the primal model parameters, and thus enrich the flexibility of sampler without introducing extra parameters. Finally, the coupled primal model and dual sampler will be learned adversarially, promoting the learning sample efficiency as described in subsec:coupled_learning.

We further demonstrate that the proposed estimator contains CD, pseudo-likelihood, score matching, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow as special cases with hand-designed dual samplers in sec:connection. Our numerical experiments in sec:experiments show that it also outperforms existing state-of-the-art estimators empirically.

2 Preliminaries

We provide a brief introduction to the technical background that is needed in the derivation of the new algorithm, including exponential family, dynamics-based MCMC sampler, and primal-dual view of MLE.

2.1 Exponential Family and Energy-based Model

The natural form of the exponential family over is defined as

(1)

where and . The sufficient statistic

can be general parametric model, , neural networks. The

are the parameters that can be learned from observed data. The exponential family has be generalized to infinite dimensional case by settinig belongs to some RKHSs (Sriperumbudur et al., 2017; Sutherland et al., 2017; Dai et al., 2018a). The energy-based model (LeCun et al., 2006) can be viewed as a special case of the exponential family by setting with .

Traditionally, the is prefixed and covers the support , which is difficult in practical high-dimension problems. In this paper, we focus on learning with together, ,

(2)

Given samples and denoting as the valid parametrization family, the maximum log-likelihood estimation of the exponential family can be written as

(3)

which leads to the gradient as

(4)

Since and are intractable, solving the MLE of general exponential family is extremely difficult.

2.2 Dynamics-based MCMC

The dynamics-based MCMC is a general and effective tool for inference. The basic idea of dynamics-based MCMC is to represent the target distribution as the solution to some (stochastic) differential equations. Therefore, samples of the target distribution can be obtained by simulating along the dynamics defined by the differential equations.

Hamiltonian Monte-Carlo (HMC) (Neal et al., 2011) is a representative algorithm in this category, which exploits the well-known Hamiltonian dynamics. Specifically, given a target distribution , the Hamiltonian is defined as

(5)

where is the kinetic energy. The Hamiltonian dynamics generate over time , following

(6)

Asymptotically as , the follows the target distribution. In practical HMC, to reduce the error from discretization, an acceptance-rejection step can be introduced.

The finite-step dynamics-based MCMC sampler can be used for approximating in (4), which leads to the CD algorithm (Hinton, 2002; Zhu and Mumford, 1998).

2.3 The Primal-Dual View of MLE

To address the intractability of the -partition function in MLE (3), the Fenchel duality of has been exploited (Rockafellar, 1970; Wainwright and Jordan, 2008; Dai et al., 2018a), , [Fenchel dual of log-partition, Proposition 5.1 (Wainwright and Jordan, 2008)] Let , we have:

(7)
(8)

where denotes the space of distributions, .

Plugging the Fenchel dual of into the MLE (3), we achieve a saddle-point reformulation of as

(9)

which bypasses the explicit computation of the partition function. Another by-product of the primal-dual view is that the dual distribution can be used in the inference stage, which usually requires executing relative expensive sampling algorithms in vanilla estimators.

The dual sampler plays a vital role in the primal-dual formulation of the MLE for exponential family estimation in (9). To achieve better performance, we have several principal requirements in parameterizing the dual distribution:

  • the parametrization family needs to be flexible enough to diminish the error from solving the inner minimization problem;

  • the entropy of the parametrized dual distribution should be tractable.

Moreover, as shown in thm:fenchel_log_partition, the optimal dual sampler is totally decided by primal model ,

  • the parametrized dual sampler should explicitly incorporate the primal model .

Such dependence will reduce both the memory cost and sample complexity in the learning procedure. In the ideal case, the dual sampler is solely represented via , and thus, no extra auxiliary component for dual sampler need to be learned.

A wide spectrum of techniques for distribution parametrization has been developed, such as the reparametrized latent variable models (Kingma and Welling, 2013; Rezende et al., 2014), transport mapping (Goodfellow et al., 2014), and normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016), . However, none of these off-the-shelf parametrization satisfies the requirements about flexibility and density tractability simultaneously, not even to mention the dependency requirement. Specifically, the density value in transport mapping is intractable. On the other hand, the reparametrized trick restricts the distributions lie in known density form and the flow-based models rely on invertible operations, both limiting the expressive ability. Meanwhile, the reparametrized latent variable model requires extra variational posterior to be learned.

3 Adversarial Dynamics Embedding

In this section, we consider augmenting the original exponential family with kinetic variables. The primal-dual view of MLE for such new augmented model will naturally introduce an adversarial dual sampler with momentums as the latent variables that can be parametrized by dynamic embedding, satisfying all three requirements above without extra variational distribution introduced.

We start with the embeddings of classical Hamiltonian dynamics (Neal et al., 2011; Caterini et al., 2018) for dual distribution parametrization as an example, and discuss its generalized dynamics in latent space and the stochastic Langevin dynamics. This technique can be extended to other dynamics, each with its own advantages. We provide these extensions in Appendix B.

3.1 Primal-Dual View of Augmented MLE

As we discussed, it is difficult to find a parametrization of in (9) satisfying all the three requirements simultaneously. Therefore, instead of directly working on (9) for original model, inspired with HMC, we consider the augmented exponential family with auxiliary momentum variable, ,

(10)

The MLE of such model is

(11)

where the last equation comes from the definition of true posterior . By exploiting the independence between and in defined in (10), we have Applying the Fenchel dual to of the augmented model (10), we derive the primal-dual view of (11), leading to the objective,

(12)
Remark (momentum as latent variable):

The in (12) contains latent variable . One can exploit latent variable model for in (9). However, the in (9) requires marginalization, which is intractable in general cases, and usually estimated via variational inference with extra posterior introduced. By directly considering the specifically designed augmented model, (12) eliminates such extra variational steps.

Similarly, one can introduce multiple momentums as the hierarchical latent variable augmented model, ,

leading to the objective as

(13)

3.2 Representing Dual Sampler via Primal Model

In this section, we introduce the Hamiltonian dynamic embedding representation of distributions for dual sampler, as well as its generalization and special instantiation.

The vanilla HMC is derived by discretizing the Hamiltonian dynamics defined in (6) with leapfrog integrator. Specifically, in a single time step, the sample moves according to

(14)

where is defined as the leapfrog stepsize. Denote the one-step leapfrog as and assume the , after iterations, we obtain

(15)

which can be viewed as a neural network with special architecture. We term this Hamiltonian (HMC) dynamics embedding, by which the dual sampler is represented by the primal model, , the potential function .

The flexibility of the HMC embedding for parameterizing the distribution actually comes from the nature of the dynamics-based samplers. In the limit the case, the proposed neural network (15) reduces to a gradient flow, whose stationary distribution is exactly the model distribution

Formally, the approximation ability of the HMC embedding can be justified, [HMC embeddings as gradient flow] For a continuous time with infinitesimal stepsize , the density of the particles , denoted as , follows Fokker-Planck equation

(16)

with . Then as . Due to the space limit, the details of the proofs are postpone in appendix:derivation_sec_mle. As we can see, the neural network composed by HMC embedding can approximate well to exponential family distributions on continuous variables.

Remark (Generalized HMC dynamics in latent space):

the leapfrog operation in the vanilla HMC is directly working in the original observation space, which is high-dimensional and noisy. We generalize the leapfrog update rule in latent space and form the new dynamics, ,

(17)
(18)
(19)

where denotes the element-wise product. Specifically, the and rescales and coordinately. The can be understood as projecting the gradient information to the an essential latent space. Then, for updateing , the latent momentum is projected back to original space by . With such generalized leapfrog updates, the dynamic system avoids operating the high-dimensional noisy input and become more efficient. We emphasize that the proposed generalized leapfrog parametrization (17) is different from the one used in Levy et al. (2017), which is inspired by the real-NVP flow (Dinh et al., 2016).

By embedding the generalized HMC (17), we have a flexible layer . The will be learned in addition to the stepsize. In fact, the vanilla HMC layer is a special case of the by setting as zero and as identity function.

Remark (Stochastic Langevin dynamics):

we can recover the stochastic Langevin dynamics from leapfrog step by resampling momentum in every step. Specifically, the sample moves according to

(20)

where . As we can see, the stochastic Langevin dynamics resample to replace the momentum in leapfrog (14), ignoring the gradient accumulation. After -iteration, we obtain

(21)

as the derived neural network. Similarly, we can also generalize the stochastic Langevin updates by introducing and in corresponding positions.

On the other hand, the density value of the proposed neural network is also tractable, leading to tractable entropy estimation in (12) and (13). In particular, we have [Density value evaluation] If , after vanilla HMC steps (14), we have

(22)

For the from the generalized leapfrog steps (17), we have

(23)

where and denotes

(24)

For the from the stochastic Langevin dynamics (20) with , we have

(25)

The proof of thm:hmc_density can be found in appendix:derivation_sec_mle.

As we can see, the proposed dynamics embedding satisfies all three requirements: it covers a flexible family of distributions with tractable density value, meanwhile, it couples the dual sampler with primal model, which leads to memory and sample efficient learning algorithm, as we will introduce in next section.

3.3 Coupled Model and Sampler Learning

1:  Initialize randomly, set length of steps .
2:  for iteration  do
3:     Sample mini-batch from dataset and from .
4:     for iteration  do
5:        Compute for each pair of .
6:     end for
7:     [Learning the sampler]
8:     [Estimating the exponential family] .
9:  end for
Algorithm 1 MLE via Adversarial Dynamics Embedding (ADE)

Plug the -step Hamiltonian dynamics embedded parametrization (14) or (17) into the primal-dual MLE of the augmented model (12) and apply the density value evaluation in (22) and (23), respectively, we obtain our ultimate adversarial objective,

(26)

Here denotes the learnable components in the dynamics embedding, , initialization , the stepsize in the HMC/Langevin updates, and the adaptive part in the generalized HMC, which is learned adversarially. The parametrization of initial distribution is discussed in appendix:practical. For the -step stochastic Langevin dynamics embedded parametrization (20), we apply the density value (25) to (13), which leads to a similar saddle-point optimization with hierarchical momentums.

We can now use stochastic gradient descent to estimate

for the exponential families as well as the the parameters in the dynamics embedding adversarially. Note that the generated sample is a function of , the gradient w.r.t. should also take these variables into account as back-propagation through time (BPTT), ,

(27)

We illustrate the MLE via HMC adversarial dynamics embedding (ADE) in alg:mle_de.

The same technique can be applied to other parametrization of dual sampler via alternative dynamics embeddings, as listed in appendix:variants, leading to different architectures and the entropy term evaluations. Consider the dynamics embedding as an adaptive sampler that can be automatically learned w.r.t. different models and datasets, then, the updates for in can be understood as learning to sample.

4 Connections to Other Estimators

We establish the connections between the proposed estimator with existing approaches. In summary, the proposed algorithm is equipped with an adaptive sampler; while the existing algorithms can be recast as the special cases of our estimator with some hand-designed samplers, and thus, may lead to extra error and inferior solution.

4.1 Connection to Contrastive Divergence

The CD algorithm (Hinton, 2002) is a special case of the proposed algorithm. By thm:fenchel_log_partition, the optimal solution to the inner optimization is . Applying Danskin’s theorem (Bertsekas, 1995), the gradient of w.r.t. is

(28)

To estimate the integral , the CD algorithm approximates the negative term in (28) stochastically with a finite MCMC step away from empirical data.

In the proposed dual sampler, by setting to be the empirical distribution and eliminating the sampling learning, the dynamic embedding will collapse to CD with -HMC steps if we remove gradient through the sampler, , ignoring the third term in (27). Similarly, the persistent CD (PCD) (Tieleman, 2008) and recent ensemble CD (Du and Mordatch, 2019) can also be recast as special cases by setting the negative sampler to be MCMC with initial samples from previous model and ensemble of MCMC samplers, respectively.

From this perspective, the CD and PCD algorithms induce errors not only from the sampler, but also from the gradient back-propagation truncation. The proposed algorithm escapes these sources of bias by learning to sample, and by adopting true gradients, respectively. Therefore, the proposed estimator is expected to achieve better performance than CD as demonstrated in the empirical experiments sec:real_exp.

4.2 Connection to Score Matching

The score matching (Hyvärinen, 2005) estimates the exponential family by minimizing the Fisher divergence, leading to optimization,

(29)

As explained in Hyvärinen (2007), the objective (29) can be derived as the nd-order Taylor approximation of the MLE with -step Langevin Monte Carlo as the dual sampler. Specifically, the Langevin Monte Carlo generates samples via

then, a simple Taylor expansion gives

Plug such into the negative expectation in , leading to

which is exactly the defined in (29).

Therefore, the score matching can be viewed as applying Taylar expansion approximation with fixed -step Langevin sampler in our framework, which is compared in sec:synthetic_exp.

4.3 Connection to Pseudo-Likelihood and Conditional Composite Likelihood

The pseudo-likelihood estimation (Besag, 1975) is a special case of the proposed algorithm by restricting the parametrization of the dual distribution. Specifically, denote the with , instead of directly maximizing likelihood, the pseudo-likelihood estimator is maximizing

(30)

Then, the is updated by the following the gradient of , ,

The pseudo-likelihood estimator can be recast as a special case of the proposed framework if we fix the dual sampler as i), sample uniformly; ii), sample and mask ; iii), sample and compose .

The conditional composite likelihood (Lindsay, 1988) is a generalization of pseudo-likelihood by maximizing

(31)

where and . Similarly, the composite likelihood is updating with prefixed conditional block sampler for negative sampling.

Same as CD, the prefixed sampler and the biased gradient in pseudo-likelihood and composite likelihood estimator will induce extra errors and lead to inferior solution. Moreover, the pseudo-likelihood may not applicable to the general exponential family with continuous variables, whose conditional distribution is also intractable.

4.4 Connection to Non-local Contrastive Objectives

The non-local contrastive estimator (Vickrey et al., 2010) is obtained by maximizing

(32)

where denotes some prefixed partition of , , and with . The objective (32) leads to the update direction as

(33)

where with as the empirical distribution and , . Therefore, the non-local contrastive objective is a special case of the proposed framework with the dual sampler as i), sample uniformly from ; ii), sample conditional on according to ; iii), sample within . Such negative sampling method is also not applicable to the general exponential family with continuous variables.

4.5 Connection to Minimum Probability Flow

In the continuous state model, the minimum probability flow (Sohl-Dickstein et al., 2011) estimates the exponential family by maximizing

where is a hand-designed symmetric transition kernel based on the potential function , , Hamiltonian or Langevin simulation. Then, the MPF update direction can be rewritten as

(34)

where . The probability flow operator actually defines a Markov chain sampler that achieves the following balance equation,

Similar to CD and score matching, the MPF exploits the -step MCMC. Moreover, the gradient in MPF also considers the effects in sampler as the third term in (34). Therefore, the MPF can be recast as a special case of our algorithm with the prefixed dual sampler as and .

4.6 Connection to Noise-Contrastive Estimator

Instead of directly estimating the in the exponential family, Gutmann and Hyvärinen (2010) propose the noise-contrastive estimation (NCE) for the density ratio between the exponential family and some user defined reference distribution , from which the parameter can be reconstructed. Specifically, the NCE considers an alternative representation of exponential family distribution as , which explicitly enforces . The NCE is obtained by maximizing

(35)

where . Then, we have the gradient of as

(36)

The negative sampler in the eq:nce_grad can be understood as an approximate importance sampling algorithm where the proposal is and the rewighting part is . As the approaching , the will approach the true ratio , and thus, the negative samples will converge to true model samples.

The NCE can be understood as learning an important sampler. However, the performance of NCE highly relies on the quality , , the choice of . It is required to cover the support of , which is non-trivial in practical high-dimensional applications.

5 Related Work

The parametrization of the dual sampler should be both flexible enough and density tractable to achieve better performance. Pioneering works are limited in either one aspect or other. Kim and Bengio (2016) parameterize the sampler via a deep directed graphical model, whose approximation ability is limited to known distributions. Meanwhile, they fit by minimizing the reverse -divergence to avoid the entropy calculation, leading to unclear relationship to MLE. Due to the difficulty of the entropy term for general transport mapping parametrization, a variety of approximate surrogates have been proposed to relax the density value tractability requirement. Liu and Wang (2017) learn the sampler to mimic the Stein variational gradient descent sampling procedure; Dai et al. (2017)

propose algorithms relying on either a heuristic approximation or a upper bound of the entropy, with extra auxiliary component introduced to be learned; 

Dai et al. (2018a)

apply a second Fenchel dual representation to reformulate the entropy term, at the cost of introducing another auxiliary function to be estimated. Meanwhile, the second Fenchel duality parametrization relies on a proposal distribution with the same support for numerical stability, which is impractical for high-dimensional data. In contrast to these existing methods, the proposed dynamics embedding achieves both flexibility and tractability of entropy estimation with less independent auxiliary parameters introduced. Moreover, the dynamics embedding avoids the design of proposal, which is naturally suitable for high-dimension data.

One of our major contributions is learning a sampling strategy for the exponential family estimation through the primal-dual view of MLE. The proposed algorithm shares some similarities with recent advances in meta learning for sampling (Levy et al., 2017; Feng et al., 2017; Song et al., 2017; Gong et al., 2018), in which the sampler is parametrized via neural network and will be learned through certain objectives. However, we emphasize that the most significant difference lies in the ultimate goals: we focus on exponential family model estimation and the learned sampler is only introduced to assist with this objective. By contrast, the learning to sample techniques are targeting on learning a fixed model that is already given. This fundamentally distinguishes the proposed ADE from methods that only learns samplers, leading to totally different learning criterion and algorithm updates, , the primal model will be learned back through the learned sampler, from which perspective the proposed algorithm can be understood as meta-learning.

6 Experiments

6.1 Synthetic experiments

(a) 2spirals (b) Banana (c) circles (d) cos (e) Cosine (f) Funnel (g) swissroll

(h) line (i) moons (j) Multiring (k) pinwheel (l) Ring (m) Spiral (n) Uniform
Figure 1:

Learned samplers in odd row and potential function

in even row from different synthetic datasets.
Figure 2: Convergence behavior of sampler on 2spirals, Multiring, pinwheel, Spiral synthetic datasets.

We parametrize the potential function

with fully connected multi-layer perceptron with 3 hidden layers. Each hidden layer has

hidden units. We use ReLU to do the nonlinear activation in each hidden layer. For baseline methods, we compare with score matching (SM) estimator 

(Hyvärinen, 2005) and primal-dual MLE with the normalizing flow (Rezende and Mohamed, 2015) parametrized dual sampler (NF). Note that although the normalizing flow can be used to perform maximum likelihood estimation (MLE) to fit the data (Rezende and Mohamed, 2015), here the flow is asked to help train the exponential model . Using normalizing flow to learn the exponential family model can be viewed as the special case of our method, which has only parameterized as flow, but no HMC dynamics is involved. Thus this is also served as one ablation study for the proposed ADE.

For our method, we use normalizing flow as the initial proposal distribution . Then we perform several steps of stochastic Langevin operation to get the final samples . is the distribution for sampling momentum. We clip the norm of when updating , and clip when updating . The coefficient in (26) is tuned in . For the NF baseline, we tune the number of layers in . For our ADE, we fix the number of normalizing flow layers to be , and then perform at most steps of dynamics updates. So finally, the number of steps for sampling is comparable, while the ADE maintains less memory cost.

In this section, we compare the proposed ADE with other baseline methods on several synthetic datasets. For visualization purpose, all the samples are in space. The dataset generators are collect from several open-source projects 111https://github.com/rtqichen/ffjord 222https://github.com/kevin-w-li/deep-kexpfam. During training, we use this generator to generate the data from the true distribution on the fly. To get a quantitative comparison, we also generate 1,000 data samples for held-out evaluation.

Dataset SM NF ADE
2spirals 5.09 0.69 -0.76
Banana 8.10 0.88 -1.13
circles 4.90 0.76 -1.30
cos 10.36 0.91 -0.58
Cosine 8.34 2.15 -1.00
Funnel 13.07 -0.92 -1.02
swissroll 19.93 1.97 -0.97
line 10.28 0.39 -0.53
moons 41.34 0.80 -1.04
Multiring 2.01 0.30 -1.05
pinwheel 18.41 3.01 -0.83
Ring 9.22 161.89 -1.38
Spiral 9.48 5.96 -0.62
Uniform 5.88 0.00 -1.08
Table 1: Quantitative comparison on synthetic data using maximum mean discrepancy (MMD ).

In fig:synthetic, we visualize the learned distribution using both the learned dual sampler and also the exponential model , where is a constant that is tuned within [0.01, 10]. Overall one can see the sampler almost perfectly recover the distribution, and the learned almost captures the landscape of the distribution. We also visualize the convergence behavior in fig:synthetic_sampler. We can see the samples are converging to the true data distribution smoothly. As the learned sampler depends on the , so this figure also suggests the nice convergence behavior of indirectly.

The quantitative comparison of the sampler is illustrated in tab:synthetic_mmd. For NF and ADE, we use 1,000 samples from their sampler to calculate MMD with gaussian kernel. The kernel bandwidth is chosen using median trick (Dai et al., 2016). For SM, since there is no such sampler available, we directly use vanilla HMC to get sample from the learned model , and use them to estimate MMD. From the table we can see, our approach gets the best MMD in all cases. This demonstrates the effectiveness of dynamics embedding, as comparing to only using the normalizing flow model.

6.2 Real-world Image Datasets

(a) Samples on MNIST (b) Histogram on MNIST (c) Samples on CIFAR-10 (d) Histogram on CIFAR-10
Figure 3: The generated images on MNIST and CIFAR-10 and the comparison between energies of generated samples and real images. The blue histogram illustates the distribution of on generated samples, and the orange histogram is generated by on testing samples. As we can see, the learned potential function matches the empirical dataset well.

To further verify that the proposed adversarial dynamics embedding can learn complicated exponential family and the corresponding sampler well for real-world images, we trained models on MNIST and CIFAR-10 dataset. For both datasets, we use the CNN architecture for discriminator as used in Miyato et al. (2018), and spectral normalization is added to all discriminator layers. Specifically, for the discriminator in CIFAR-10 experiments, we replace all downsampling operations by average pooling, as in Du and Mordatch (2019). The detailed architectures are shown in subsec:real_archi.

We parametrize the initial distribution with neural nets as deep Gaussian latent variable model (Deep LVM) specified in appendix:practical. After several HMC steps on this initial distribution, instead of clipping the images, we pass the sample to function to obtain the final sampled images.

We used the standard spectral normalization on the discriminator to stabilize the training process, and Adam with learning rate to optimize our model. We trained the models with iterations with batch size being . The step sizes for our HMC sampler are independently learned for all HMC dimensions but shared among all time steps, and the values are all initialized to . We set the number of HMC steps to . The coefficient of the entropy regularization term is set to and that of the

regularization on the momentum vector in the last HMC step is set to

.

Model Inception Score
WGAN-GP (Gulrajani et al., 2017) 6.50
Spectral GAN (Miyato et al., 2018) 7.42
Langevin PCD (Du and Mordatch, 2019) 6.02
Langevin PCD (10 ensemble) (Du and Mordatch, 2019) 6.78
ADE: Deep LVM init w/o HMC 6.51
ADE: Deep LVM init w/ HMC 7.43
Table 2: Inception scores of different models on CIFAR-10 (unconditional).

We report the inception scores in tab:inception_score. For our ADE, we trained two models with Deep LVM as the initial dual sampler, one with HMC samplers and one without that. We observed that the HMC sampler greatly improves the performance of the samples generated by initial Deep LVM along and enables the generator to be comparable with Spectral GAN. Moreover, we note that the initial Deep LVM helps sample generation, compared to the Langevin PCD models reported in (Du and Mordatch, 2019).

We show some samples of generated images in fig:real_images(a) and (c). More sampled images can be found in subsec:more_images. In addition, we also plot the potential distribution (unnormalized) of generated samples and that of the real images on MNIST and CIFAR-10

 (using 1000 data points for each) in fig:real_images(b) and (d). The distributions of energies of both generated images and real iamges mostly overlap with each other, showing that the discriminators learn the desirable distributions. Also, with simple importance sampling and proposal distribution being uniform distribution on

( is the dimension of images), the log likelihood (in nats) on CIFAR-10 is estimated to be around .

7 Conclusion

We proposed a novel method to efficiently construct maximum likelihood estimator (MLE) for general exponential families. Specifically, by utilizing the nice properties of primal-dual formulation of the MLE of the augmented exponential family distribution with kinetic variables, we incorporate dynamics-based distribution parametrization for dual sampler into the estimation process in a fully differentiable way. Such coupling of the primal model and the dual sampler benefits each other, and thus, achieving both better model estimation and efficient inference at the same time. Our empirical results on both synthetic and real data validate the feasibility of our proposed dynamics embedding. In future work, we will extend this method to other graphical model learning problems.

Acknowledgements

NH is supported in part by NSF-CRII-1755829, NSF-CMMI-1761699, and NCSA Faculty Fellowship. LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF IIS-1841351 EAGER, NSF CCF-1836822, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA, Amazon AWS, Siemens and Google Cloud.

References

Appendix A Proof of Theorems in sec:mle

thm:grad_flow (HMC embeddings as gradient flow) For a continuous time with infinitesimal stepsize , the density of the particles , denoted as , follows Fokker-Planck equation

(37)

with . Then as . The first part of the theorem is trivial. When , the HMC follows the dynamical system

By applying the Fokker-Planck equation, we obtain

(38)

To show that the stationary distribution of such dynamical system converges to , recall the fact that

(39)

The Fokker-Planck equation can be rewritten as

(40)

Substitute into (40) and notice

we have , , is the stationary distribution.

thm:hmc_density (Density value evaluation) If , after vanilla HMC steps (14), we have

For the from the generalized leapfrog steps (