# Kernel Exponential Family Estimation via Doubly Dual Embedding

We investigate penalized maximum log-likelihood estimation for exponential family distributions whose natural parameter resides in a reproducing kernel Hilbert space. Key to our approach is a novel technique, doubly dual embedding, that avoids computation of the partition function. This technique also allows the development of a flexible sampling strategy that amortizes the cost of Monte-Carlo sampling in the inference stage. The resulting estimator can be easily generalized to kernel conditional exponential families. We furthermore establish a connection between infinite-dimensional exponential family estimation and MMD-GANs, revealing a new perspective for understanding GANs. Compared to current score matching based estimators, the proposed method improves both memory and time efficiency while enjoying stronger statistical properties, such as fully capturing smoothness in its statistical convergence rate while the score matching estimator appears to saturate. Finally, we show that the proposed estimator can empirically outperform state-of-the-art methods in both kernel exponential family estimation and its conditional extension.

• 97 publications
• 32 publications
• 76 publications
• 101 publications
• 66 publications
• 33 publications
12/12/2013

### Density Estimation in Infinite Dimensional Exponential Families

In this paper, we consider an infinite dimensional exponential family, P...
05/23/2017

### Efficient and principled score estimation with Nyström kernel exponential families

We propose a fast method with statistical guarantees for learning an exp...
04/27/2019

### Exponential Family Estimation via Adversarial Dynamics Embedding

We present an efficient algorithm for maximum likelihood estimation (MLE...
11/15/2017

### Kernel Conditional Exponential Family

A nonparametric family of conditional distributions is introduced, which...
07/15/2021

### Statistical inference using Regularized M-estimation in the reproducing kernel Hilbert space for handling missing data

Imputation and propensity score weighting are two popular techniques for...
11/20/2018

### Learning deep kernels for exponential family densities

The kernel exponential family is a rich class of distributions,which can...
01/13/2021

### Denoising Score Matching with Random Fourier Features

The density estimation is one of the core problems in statistics. Despit...

## 1 Introduction

The exponential family is one of the most important classes of distributions in statistics and machine learning. The exponential family possesses a number of useful properties

(Brown, 1986)and includes many commonly used distributions with finite-dimensional natural parameters, such as the Gaussian, Poisson and multinomial distributions, to name a few. It is natural to consider generalizing the richness and flexibility of the exponential family to an infinite-dimensional parameterization via reproducing kernel Hilbert spaces (RKHS) (Canu and Smola, 2006)

Maximum log-likelihood estimation (MLE) has already been well-studied in the case of finite-dimensional exponential families, where desirable statistical properties such as asymptotic unbiasedness, consistency and asymptotic normality have been established. However, it is difficult to extend MLE to the infinite-dimensional case. Beyond the intractability of evaluating the partition function for a general exponential family, the necessary conditions for maximizing log-likelihood might not be feasible; that is, there may be no solution to the KKT conditions in the infinite-dimensional case (Pistone and Rogantin, 1999; Fukumizu, 2009). To address this problem, Barron and Sheu (1991); Gu and Qiu (1993); Fukumizu (2009) considered several ways to regularize the function space by constructing (a series of) finite-dimensional spaces that approximate the original RKHS, yielding a tractable estimator in the restricted finite-dimensional space. However, as Sriperumbudur et al. (2017) note, even with the finite-dimension approximation, these algorithms are still expensive as every update requires Monte-Carlo sampling from a current model to compute the partition function.

An alternative score matching based estimator has recently been introduced by Sriperumbudur et al. (2017). This approach replaces the Kullback–Leibler () with the Fisher divergence, defined by the expected squared distance between the score of the model (, the derivative of the -density) and the target distribution score. By minimizing the Tikhonov-regularized Fisher divergence, Sriperumbudur et al. (2017) develop a computable estimator for the infinite-dimensional exponential family that also obtains a consistency guarantee. Recently, this method has been generalized to conditional infinite-dimensional exponential family estimation (Arbel and Gretton, 2017). Although score matching avoids the generally intractable integral, it requires computing and saving the first- and second-order derivatives of the reproducing kernel for each dimension on each sample. For samples with features, this results in memory and time cost respectively, which becomes prohibitive for datasets with only moderate sizes of and . To alleviate this cost, Sutherland et al. (2017) utilize the -rank Nystöm approximation to the kernel matrix, reducing memory and time complexity to and respectively. Although this reduces the cost dependence on sample size, the dependence on

is unaffected, hence the estimator remains unsuitable for high-dimensional data. Estimating a general exponential family model by either MLE or score matching also generally requires some form of Monte-Carlo sampling, such as MCMC or HMC, to perform inference, which significantly adds to the computational cost.

In summary, both the MLE and score matching based estimators for the infinite-dimensional exponential family incur significant computational overhead, particularly in high-dimensional applications.

In this paper, we revisit penalized MLE for the kernel exponential family and propose a new estimation strategy. Instead of solving the log-likelihood equation directly, as in existing MLE methods, we exploit a doubly dual embedding technique that leads to a novel saddle-point reformulation for the MLE (along with its conditional distribution generalization) in sec:dual_mle. We then propose a stochastic algorithm for the new view of penalized MLE in sec:alg. Since the proposed estimator is based on penalized MLE, it does not require the first- and second-order derivatives of the kernel, as in score matching, greatly reducing the memory and time cost. Moreover, since the saddle point reformulation of penalized MLE avoids the intractable integral, the need for a Monte-Carlo step is also bypassed, thus accelerating learning. This approach also allows a flexibly parameterized sampler to be learned simultaneously, therefore, further reducing inference cost. We present the consistency and rate of the new estimation strategy in the well-specified case, , the true density belongs to the kernel exponential family, and an algorithm convergence guarantee in sec:theory. We demonstrate the empirical advantages of the proposed algorithm in sec:experiments, comparing to state-of-the-art estimators for both the kernel exponential family and its conditional extension (Sutherland et al., 2017; Arbel and Gretton, 2017).

## 2 Preliminaries

We first provide a preliminary introduction to the exponential family and Fenchel duality, which will play vital roles in the derivation of the new estimator.

### 2.1 Exponential family

The natural form of the exponential family over with the sufficient statistics is defined as

 pf\rbrx=p0(x)exp\rbrλf(x)−A\rbrλf, (1)

where , , and . In this paper, we mainly focus on the case where is a reproducing Hilbert kernel space (RKHS) with kernel , such that

. However, we emphasize that the proposed algorithm in sec:alg can be easily applied to arbitrary differentiable function parametrizations, such as deep neural networks.

Given samples , a model with a finite-dimensional parameterization can be learned via maximum log-likelihood estimation (MLE),

 maxf∈\Fcal1NN∑i=1logpf(xi)=ˆ\EE\Dcal\sbrλf(x)+logp0\rbrx−A(λf),

where denotes the empirical expectation over . The MLE eq:mle is well-studied and has nice properties. However, the MLE is known to be “ill-posed” in the infinite-dimensional case, since the optimal solution might not be exactly achievable in the representable space. Therefore, penalized MLE has been introduced for the finite (Dudík et al., 2007) and infinite dimensional Gu and Qiu (1993); Altun and Smola (2006)

exponential families respectively. Such regularization essentially relaxes the moment matching constraints in terms of some norm, as shown later, guaranteeing the existence of a solution. In this paper, we will also focus on MLE with RKHS norm regularization.

One useful theoretical property of MLE for the exponential family is convexity w.r.t. . With convex regularization, ,

, one can use stochastic gradient descent to recover a unique global optimum. Let

denote RKHS norm penalized log-likelihood, ,

 L\rbrf\defeq1NN∑i=1logpf(xi)−η2\nbrf2\Hcal. (2)

where denotes the regularization parameter. The gradient of w.r.t. can be computed as

 ∇fL=ˆ\EE\Dcal\sbrλf∇f(x)−∇fA(λf)−ηf. (3)

To calculate the in eq:grad, we denote and expand the partition function by definition,

 ∇fA(λf) = 1Z(λf)∇f∫exp\rbrλf(x)p0\rbrxdx (4) = ∫p0\rbrxexp\rbrλf(x)Z(λf)∇fλf(x)dx = \EEpf(x)\sbr∇fλf(x).

We may approximate the

by MCMC samples, which leads to the Contrastive Divergence (CD) algorithm

(Hinton, 2002).

To avoid costly MCMC sampling in estimating the gradient, Sriperumbudur et al. (2017) construct an estimator based on score matching instead of MLE, which minimizes the penalized Fisher divergence. Plugging the kernel exponential family into the empirical Fisher divergence, the optimization reduces to

 J\rbrf\defeqλ22⟨f,\Chatf⟩\Hcal+λ⟨f,^δ⟩\Hcal+η2\nbrf2\Hcal, (5)

where

 \Chat \defeq 1nn∑i=1d∑j=1∂jk\rbrxi,⋅⊗∂jk\rbrxi,⋅, ^δ \defeq 1nn∑i=1d∑j=1∂jk\rbrxi,⋅\rbr∂jlogp0\rbrxi+∂2jk\rbrxi,⋅.

As we can see, the score matching objective (5) does not involve the intractable integral in . However, such an estimator requires the computation of first- and second-order of derivatives of kernel for each dimension on each data, leading to memory and time cost of and respectively. This quickly becomes prohibitive for even modest and .

The same difficulty also appears in the score matching based estimator in (Arbel and Gretton, 2017) for the conditional exponential family, which is defined as

 p\rbry|x=p0\rbryexp\rbrλf\rbrx,y−Ax\rbrλf,f∈\Fcal (6)

where , , , and , is defined as .

In this paper, we consider such that is in RKHS for . Denoting such that , we can derive its kernel function following Micchelli and Pontil (2005); Arbel and Gretton (2017). By the Riesz representation theorem, and , there exists a linear operator such that

 ⟨h,\Tcalx⟩\Hcaly=⟨\Tcal,Γxh⟩\HcalΩx,∀\Tcal∈\HcalΩx.

Then, the kernel can be defined by composing with its dual, , and the function .

### 2.2 Convex conjugate and Fenchel duality

Denote as a function , then its convex conjugate function is defined as

 h∗(u)=supv∈\RRd{u⊤v−h(v)}.

If is proper, convex and lower semicontinuous, the conjugate function, , is also proper, convex and lower semicontinuous. Moreover, and are dual to each other, , . Such a relationship is known as Fenchel duality (Hiriart-Urruty and Lemaréchal, 2012). By the conjugate function, we can represent the by as,

 h(v)=supu∈\RRd{v⊤u−h∗(u)}.

The supremum achieves if , or equivalently .

## 3 A Saddle-Point Formulation of Penalized MLE

As discussed in sec:preliminary, the penalized MLE for the exponential family involves computing the -partition functions, and , which are intractable in general. In this section, we first introduce a saddle-point reformulation of the penalized MLE of the exponential family, using Fenchel duality to bypass these intractable quantities. This approach can also be generalized to the conditional exponential family. First, observe that we can rewrite the log-partition function via Fenchel duality as follows. [Fenchel duality of log-partition]

 A\rbrλf = maxq∈\Pcalλ\innerq(x)f\rbrx−KL\rbrq||p0, (7) pf\rbrx = \argmaxq∈\Pcalλ\innerq(x)f(x)−KL\rbrq||p0, (8)

where denotes the space of distributions and . Denote , which is strongly concave w.r.t. , the optimal can be obtained by setting the

 logq∗(x)∝λf\rbrx+logp0(x).

Consider the , we have

 q∗(x)=p0(x)exp\rbrλf(x)−A\rbrλf=pf(x),

which is the second conclusion. Plugging the optimal solution to , we obtain the maximum as , which is exactly , achieving the first conclusion. Therefore, plugging the Fenchel dual of into the penalized MLE, we achieve a saddle-point optimization, ,

 maxf∈\FcalL\rbrf∝minq∈\Pcalˆ\EE\Dcal\sbrf(x)−\EEq(x)\sbrf(x)−η2\nbrf2\Hcal+1λKL\rbrq||p0ℓ(f,q).

The saddle-point reformulation of the penalized MLE in (3) resembles the optimization in MMD GAN (Li et al., 2017; Bińkowski et al., 2018). In fact, the dual problem of the penalized MLE of the exponential family is a KL-regularized MMD GAN with a special design of the kernel family. Alternatively, if was a Wasserstein- function, the optimization would resemble the Wasserstein GAN (Arjovsky et al., 2017).

Next, we consider the duality properties. [weak and strong duality] For general case where is continuous, weak duality holds, ,

 maxf∈\Fcalminq∈\Pcalℓ(f,q)⩽minq∈\Pcalmaxf∈\Fcalℓ(f,q).

If is convex and compact, . is some RKHS, then strong duality holds for

 maxf∈\Fcalminq∈\Pcalℓ(f,q)=minq∈\Pcalmaxf∈\Fcalℓ(f,q). (9)

thm:minmax_switch can be obtained by directly applying the minimax theorem (Sion et al., 1958). We refer to the - problem in (9) as the primal problem, while the - form as the its dual form.

#### Remark (connections to MMD GAN):

Consider the dual problem with the kernel learning, ,

 minq∈\Pcalmaxf∈\Fcalϕ,ϕˆ\EE\Dcal\sbrf(x)−\EEq(x)\sbrf(x)−η2\nbrf2\Hcal+1λKL\rbrq||p0. (10)

where we involve the parameters of the kernel to be learned in the optimization. By setting the gradient , for fixed , we obtain the optimal witness function in the RKHS, , which leads to

 minq∈\Pcalmaxϕˆ\EE\sbrkϕ\rbrx,x′−2ˆ\EE\EEq\sbrkϕ\rbrx,x′+\EEq\sbrkϕ\rbrx,x′MMDϕ\rbr\Dcal,q+2ηλKL\rbrq||p0. (11)

It is clear that this is the -divergence regularized MMD GAN. Thus, with -divergence regularization, the MMD GAN learns an infinite-dimension exponential family in an adaptive RKHS. Such a novel perspective bridges GAN and exponential family estimation, which appears to be of independent interest and potentially brings a new connection to the GAN literature for further theoretical development.

#### Remark (connections to Maximum Entropy Moment Matching):

Altun and Smola (2006); Dudík et al. (2007) discuss the maximum entropy moment matching method for distribution estimation,

 minq∈\Pcal KL\rbrq||p0 (12) \st \nbr\EEq\sbrk\rbrx,⋅−ˆ\EE\sbrk\rbrx,⋅2\Hcalk⩽η2,

whose dual problem is exactly the penalized MLE (2). Interestingly, the proposed saddle-point formulation (3) shares the solution to (12). From the maximum entropy view, it is clear that the penalty relaxes the moment matching constraints. However, the algorithms provided in Altun and Smola (2006); Dudík et al. (2007) simply ignore the difficulty in computing the expectation in , which is not practical, especially when is infinite-dimensional.

Similar to thm:fenchel_log_partition, we can also represent by its Fenchel dual,

 Ax\rbrλf = (13) pf\rbry|x = \argmaxq(⋅|x)∈\Pcalλ\innerq(x)f(x)−KL\rbrq||p0. (14)

Then, we can recover the penalized MLE for the conditional exponential family as

 maxf∈\Fcalminq\rbr⋅|x∈\Pcalˆ\EE\Dcal\sbrf(x,y)−\EEq(y|x)\sbrf(x,y)−η2\nbrf2\Hcal+1λKL\rbrq||p0 (15)

The only difference between (3) and (15) is that the dual distribution for the conditional exponential family is now a conditional distribution.

The saddle-point reformulations of penalized MLE in (3) and (15) bypass the difficulty in the partition function. Therefore, it is very natural to consider learning the exponential family by solving the saddle-point problem (3) with an appropriate parametrized dual distribution . This approach is referred to as the “dual embedding” technique in Dai et al. (2016), which requires:

• the parametrization family should be flexible enough to reduce the extra approximation error;

• the parametrized representation should be able to provide density value.

As we will see in sec:theory, the flexibility of the parametrization of the dual distribution will have a significant effect on the consistency of the estimator.

One can of course use the kernel density estimator (KDE) as the dual distribution parametrization, which leads to convex-concave preservation. However, KDE will easily fail when approximating high-dimension data. Applying the reparametrization trick with a suitable class of probability distributions

(Kingma and Welling, 2013; Rezende et al., 2014) can also be used to parametrize the dual distribution. However, the class of such parametrized distributions is typically restricted to simple known distributions, which will not be able to approximate the true solution, leading to a potentially huge approximation bias. At the other end of the spectrum, the distribution family generated by transport mapping is sufficiently flexible to model smooth distributions (Goodfellow et al., 2014; Arjovsky et al., 2017). However, the density value of such distributions, , , is not available for -divergence computation, and thus, is not applicable for parameterizing the dual distribution. Recently, flow-based parametrized density functions (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016) have been proposed for a trade-off between the flexibility and tractability. However, the expressive power of existing flow-based models remains overly restrictive even in our synthetic example and cannot be directly applied to conditional models.

### 3.1 Doubly Dual Embedding for MLE

Transport mapping is very flexible for generating smooth distributions. However, a major difficulty is that it lacks the ability to obtain the density value , making the computation of the -divergence impossible. To enjoy the flexibility of transport mapping and avoid the calculation of , we introduce doubly dual embedding, which will achieve a delicate balance between flexibility and tractability.

First, noting that -divergence is also a convex function, we consider the Fenchel dual of -divergence (Nguyen et al., 2008), ,

 KL\rbrq||p0=maxν\EEq\sbrν(x)−\EEp0\sbrexp\rbrν(x)+1, (16) logq\rbrxp0\rbrx=\argmaxν\EEq\sbrν(x)−\EEp0\sbrexp\rbrν(x)+1, (17)

with . One can see in the dual representation of the -divergence that the introduction of the auxiliary optimization variable eliminates the explicit appearance of in (16), which makes the transport mapping parametrization for become applicable. Since there is no extra restriction on , we can use arbitrary smooth function approximation for , such as kernel functions or neural networks.

Applying the dual representation of -divergence into the dual problem of the saddle-point reformulation (3), we obtain the ultimate saddle-point optimization for the estimation of the kernel exponential family,

 minq∈\Pcalmaxf,ν∈\Fcal~ℓ\rbrf,ν,q\defeqˆ\EE\Dcal\sbrf−\EEq\sbrf−η2\nbrf2\Hcal+1λ\rbr\EEq\sbrν−\EEp0\sbrexp\rbrν.

It should be emphasized that although other works have applied Fenchel duality to -divergence (Nguyen et al., 2008; Nowozin et al., 2016), the approach taken here is different in considering , whereas those works consider . It is also obvious in the view of the optimization we obtained in (3.1), comparing to the objective in -GAN.

#### Remark (Extension for kernel conditional exponential family):

Similarly, we can apply the doubly dual embedding technique to the penalized MLE of the kernel conditional exponential family, which leads to

 minq∈\Pcalxmaxf,ν∈\Fcalˆ\EE\Dcal\sbrf−\EEq\rbry|x,x∼\Dcal\sbrf−η2\nbrf2\Hcal+1λ\rbr\EEq\rbry|x,x∼\Dcal\sbrν−\EEp0\rbry\sbrexp\rbrν.

In summary, with the doubly dual embedding technique, we achieve a saddle-point reformulation of penalized MLE that bypasses the difficulty of handling the intractable partition function while allowing great flexibility in parameterizing the dual distribution.

## 4 Practical Algorithm

In this section, we introduce the transport mapping parametrization for the dual distribution, , and adapt the stochastic gradient descent for solving the optimization. Since these two optimizations are almost the same, for simplicity of exposition, we only illustrate the algorithm for (3.1). We emphasize the algorithm can be easily applied to (3.1).

Denote the parameters in the transport mapping for as , such that and . We illustrate the algorithm with a kernel represented . There are many alternative choices of parametrization of , , neural networks—the proposed algorithm is still applicable to the parametrization as long as it is differentiable. We abuse notation somewhat by using as in (3.1).

With the kernel parametrized , the inner maximization over and is a standard concave optimization on. We can solve it using existing algorithms to achieve the global optimal solution. Due to the existence of the expectation, we will use the stochastic function gradient descent for scalability. Given , following the definition of functional gradients Kivinen et al. (2004); Dai et al. (2014), we have

 ζf\rbr⋅\defeq∇f~ℓ\rbrf,ν,wg=ˆ\EE\Dcal\sbrk\rbrx,⋅−\EEξ\sbrk\rbrgwg\rbrξ0,⋅−ηf\rbr⋅. (18)
 ζν\rbr⋅\defeq∇ν~ℓ\rbrf,ν,wg1λ\rbr\EEξ\sbrk\rbrgwg\rbrξ,⋅−\EEp0\sbrexp\rbrν\rbrxk\rbrx,⋅. (19)

In the -th iteration, given sample , and , the update rule for and will be

 fk+1\rbr⋅=\rbr1−ητkfk\rbr⋅+τk\rbrk\rbrx,⋅−k\rbrgwg\rbrξ,⋅,
 νk+1\rbr⋅=νk\rbr⋅+τkλ\rbrk\rbrgwg\rbrξ,⋅−exp\rbrνk\rbrx′k\rbrx′,⋅,

where denotes the step-size.

Then, we consider the update rule for the parameters in dual transport mapping embedding. [Dual gradient] Denoting as and , we have

 ∇wg\Lhat\rbrwg=−\EEξ\sbr∇wgf∗wg\rbrgwg\rbrξ+1λ\EEξ\sbr∇wgν∗wg\rbrgwg\rbrξ. (20)

Proof details are given in appendix:subsec:dual_grad. With this gradient estimator, we can apply stochastic gradient descent to update iteratively. We summarize the updates for and in alg:sgd_double_dual and 2.

Compared to score matching based estimators Sriperumbudur et al. (2017); Sutherland et al. (2017), although the convexity no longer holds for the saddle-point estimator, the doubly dual embedding estimator avoids representing with derivatives of the kernel, which avoids the memory cost dependence on the square of dimension. In particular, the proposed estimator for based on the saddle-point reformulation of penalized MLE reduces the memory cost from to where denotes the number of iterations in alg:sgd_f. In terms of time cost, we exploit the stochastic update, which is naturally suitable for large-scale datasets, avoiding the matrix inverse computation in score matching estimator whose cost will be . Moreover, we also learn the dual distribution simultaneously, which can easily generate samples from the learned exponential family. This can therefore be used for inference, saving the Monte-Carlo sampling in the inference stage.

#### Remark (random feature extension):

Memory cost is the well-known bottleneck for applying kernel methods to large-scale problems. When we set , the memory cost will be , which is prohibitive for millions of data points. Random feature approximation (Rahimi and Recht, 2008; Dai et al., 2014; Bach, 2015) can be utilized for scaling up kernel methods. The proposed alg:sgd_double_dual and alg:sgd_f are also compatible with random feature approximation, and hence, applicable to large-scale problems in the same way. With random features, we can further reduce the memory cost of to . However, even with a random feature approximation, the score matching based estimator will still require memory. One can also learn random features by back-propagation, which leads to a neural networks extension. Due to space limitations, the details of this variant of alg:sgd_double_dual are given in Appendix B.

## 5 Theoretical Analysis

In this section, we will first provide the analysis of consistency and the sample complexity of the proposed estimator based on the saddle-point reformulation of the penalized MLE in the well-specified case, where the true density is assumed to be in the kernel (conditional) exponential family, following Sutherland et al. (2017); Arbel and Gretton (2017). Then, we consider the convergence property of the proposed algorithm. We mainly focus on the analysis for . The results can be easily extended to kernel conditional exponential family .

### 5.1 Statistical Consistency

We explicitly consider approximation error from the dual embedding in the consistency of the proposed estimator. We first establish some notation that will be used in the analysis. For simplicity, we set and improperly in the exponential family expression (1). We denote as the ground-true distribution with its true potential function and as the optimal primal and dual solution to the saddle point reformulation of the penalized MLE (3). The parametrized dual space is denoted as . We denote as the exponential family generated by . We have the consistency results as Assume the spectrum of kernel decays sufficiently homogeneously in rate . With some other mild assumptions listed in appendix:subsec:consistency, we have as and ,

 KL\rbrp∗||p\ftil+KL\rbrp\ftil||p∗=\Ocalp∗\rbrn−1η−1r+η+ϵ2approx,

where . Therefore, when setting , converges to in terms of Jensen-Shannon divergence at rate For the details of the assumptions and the proof, please refer to appendix:subsec:consistency. Recall the connection between the proposed model and MMD GAN as discussed in sec:dual_mle. thm:consistency also provides a learnability guarantee for a class of GAN models as a byproduct. The most significant difference of the bound provided above, compared to Gu and Qiu (1993); Altun and Smola (2006), is the explicit consideration of the bias from the dual parametrization. Moreover, instead of the Rademacher complexity used in the sample complexity results of Altun and Smola (2006), our result exploits the spectral decay of the kernel, which is more directly connected to properties of the RKHS.

From thm:consistency we can clearly see the effect of the parametrization of the dual distribution: if the parametric family of the dual distribution is simple, the optimization for the saddle-point problem may become easy, however, will dominate the error. The other extreme case is to also use the kernel exponential family to parametrize the dual distribution, then, will reduce to , however, the optimization will be difficult to handle. The saddle-point reformulation provides us the opportunity to balance the difficulty of the optimization with approximation error.

The statistical consistency rate of our estimator and the score matching estimator (Sriperumbudur et al., 2017) are derived under different assumptions, therefore, they are not directly comparable. However, since the smoothness is not fully captured by the score matching based estimator in Sriperumbudur et al. (2017), it only achieves even if is infinitely smooth. While under the case that dual distribution parametrization is relative flexible, , is negligible, and the the spectrum of the kernel decay rate , the proposed estimator will converge in rate , which is significantly more efficient than the score matching method.

### 5.2 Algorithm Convergence

It is well-known that the stochastic gradient descent converges for saddle-point problem with convex-concave property (Nemirovski et al., 2009). However, for better the dual parametrization to reduce in thm:consistency, we parameterize the dual distribution with the nonlinear transport mapping, which breaks the convexity. In fact, by thm:dual_gradient, we obtain the unbiased gradient w.r.t. . Therefore, the proposed alg:sgd_double_dual can be understood as applying the stochastic gradient descent for the non-convex dual minimization problem, , . From such a view, we can prove the sublinearly convergence rate to a stationary point when stepsize is diminishing following Ghadimi and Lan (2013); Dai et al. (2017). We list the result below for completeness.

[Ghadimi and Lan (2013)] Assume that the parametrized objective is

-Lipschitz and variance of its stochastic gradient is bounded by

. Let the algorithm run for iterations with stepsize for some and output . Setting the candidate solution to be randomly chosen from such that , then it holds that where represents the distance of the initial solution to the optimal solution. The above result implies that under the choice of the parametrization of and , the proposed alg:sgd_double_dual converges sublinearly to a stationary point, whose rate will depend on the smoothing parameter.

## 6 Experiments

In this section, we compare the proposed doubly dual embedding (DDE) with the current state-of-the-art score matching estimators for kernel exponential family (KEF) Sutherland et al. (2017) and its conditional extension (KCEF) Arbel and Gretton (2017), respectively, as well as several competitors. We test the proposed estimator empirically following their setting. We use Gaussian RBF kernel for both exponential family and its conditional extension, , with the bandwidth set by median-trick (Dai et al., 2014). For a fair comparison, we follow Sutherland et al. (2017) and Arbel and Gretton (2017) to set the for kernel exponential family and its conditional extension, respectively. The dual variables are parametrized by MLP with layers.

#### Density estimation

We evaluate the DDE on the synthetic datasets, including ring, grid and two moons, where the first two are used in Sutherland et al. (2017), and the last one is from Rezende and Mohamed (2015). The ring dataset contains the points uniformly sampled along three circles with radii and noise in the radial direction and extra dimensions. The -dim grid dataset contains samples from mixture of Gaussians. Each center lies on one dimension in the -dimension hypercube. The two moons dataset is sampled from the exponential family with potential function as . We use samples for training, and for testing (grid) or (ring, two moons) samples, following Sutherland et al. (2017).

We visualize the samples generated by the learned sampler, and compare it with the training datasets in the first row in fig:vis. The learned is also plotted in the second row in fig:vis. The DDE learned models generate samples that cover the training data, showing the ability of the DDE for estimating kernel exponential family on complicated data.

Then, we demonstrate the convergence of the DDE in fig:alg_conv on rings dataset. We initialize with a random dual distribution. As the algorithm iterates, the distribution converges to the target true distribution, justifying the convergence guarantees. More results on -dimensional grid and two moons can be found in fig:alg_conv_more in appendix:more_exp. The DDE algorithm behaves similarly on these two datasets.

Finally, we compare the MMD between the generated samples from the learned model to the training data with the current state-of-the-art method for KEF (Sutherland et al., 2017), which performs better than KDE and other alternatives (Strathmann et al., 2015). We use HMC to generate the samples from the KEF learned model, while in the DDE, we can bypass the HMC step by the learned sampler. The computation time for inference is listed in tab:mmd_synthetic. It shows the DDE improves the inference efficiency in orders. The MMD comparison is listed in tab:mmd_synthetic. The proposed DDE estimator performs significantly better than the best performances by KEF in terms of MMD.

#### Conditional model extension.

In this part of the experiment, models are trained to estimate the conditional distribution on the benchmark datasets for studying these methods (Arbel and Gretton, 2017; Sugiyama et al., 2010). We centered and normalized the data and randmly split the datasets into a training and a testing set with equal size, as in Arbel and Gretton (2017). We evaluate the performances by the negative -likelihood. Besides the score matching based KCEF, we also compared with LS-CDE and -KDE, introduced in (Sugiyama et al., 2010). The empirical results are summarized in tab:r_benchmark.

Although these datasets are low-dimensional with few samples and the KCEF uses the anisotropic RBF kernel (, different bandwidth in each dimension, making the experiments preferable to the KCEF), the proposed DDE still outperforms the competitors on six datasets significantly, and achieves comparable performance on the rest, even though it uses a simple isotropic RBF kernel. This further demonstrates the statistical power of the proposed DDE.

## 7 Conclusion

In this paper, we exploit the doubly dual embedding to reformulate the penalized MLE to a novel saddle-point optimization, which bypasses the intractable integration and provides flexibility in parameterizing dual. The novel view reveals a unique understanding of GANs and leads to a practical algorithm, which achieves state-of-the-art performance. We also establish the statistical consistency and algorithm convergence guarantee for the proposed algorithm.

## Appendix A Proof Details

 ∇wg\Lhat\rbrwg=−\EEξ\sbr∇wgf∗\rbrgwg\rbrξ+1λ\EEξ\sbr∇wgν∗\rbrgwg\rbrξ.

The conclusion can be proved by chain rule and the optimality conditions.

Specifically, notice that the are implicit functions of , we can calculate the gradient of w.r.t.

 ∇wg\Lhat\rbrwg = ˆ\EE\Dcal\sbr∇ff∗wg∇wgf∗wg−\EEξ\sbr∇gf\rbrg\rbrξ∇wgg−\EEq\sbr∇ff∗wg∇wgf∗wg−η2∇f\nbrf∗wg2\Hcal∇wgf∗wg +1λ\rbr\EEξ\sbr∇gν∗wg\rbrg\rbrξ∇wgg+\EEq\sbr∇νν∗wg∇wgν∗wg−\EEp0\sbrexp\rbrν∗wg∇νν∗wg∇wgν∗wg = \rbrˆ\EE\Dcal\sbr∇ff∗wg−\EEq\sbr∇ff∗wg−η2∇f\nbrf∗wg2\Hcal0∇wgf∗wg−\EEξ\sbr∇gf\rbrg\rbrξ∇wgg +1λ\EEξ\sbr∇gν∗wg\rbrg\rbrξ∇wgg+1λ\rbr\EEq\sbr∇νν∗wg−\EEp0\sbrexp\rbrν∗wg∇νν∗wg0∇wgν∗wg = −\EEξ\sbr∇wgf∗\rbrg\rbrξ+1λ\EEξ\sbr∇wgν∗\rbrgwg\rbrξ,

where the second equations come from the fact are optimal and are not functions of .

### a.2 Proof for thm:consistency

The proof of thm:consistency mainly follows the technique in Gu and Qiu (1993) with extra consideration of the approximation error from the dual embedding.

We first define some notations that will be used in the proof. We denote , which induces the norm denoted as . We introduce as the maximizer to defined as

 \Ltil\rbrh\defeqˆ\EE\Dcal\sbrh−\EEp∗\sbrh−12\nbrh−f∗2p∗−η2\nbrh2\Hcal.

The proof relies on decomposing the error into two parts: i) the error between and ; and ii) the error between and .

By Mercer decomposition (König, 1986), we can expand as

 k\rbrx,x′=∞∑l=1ζlψl\rbrxψl\rbrx′,

With the eigen-decomposition, we can rewrite function as . Then, we have and .

We make the following standard assumptions: There exists such that , . The eigenvalues of the kernel decay sufficiently homogeneously with rate , ,