1 Introduction
The exponential family is one of the most important classes of distributions in statistics and machine learning. The exponential family possesses a number of useful properties
(Brown, 1986)and includes many commonly used distributions with finitedimensional natural parameters, such as the Gaussian, Poisson and multinomial distributions, to name a few. It is natural to consider generalizing the richness and flexibility of the exponential family to an infinitedimensional parameterization via reproducing kernel Hilbert spaces (RKHS) (Canu and Smola, 2006)Maximum loglikelihood estimation (MLE) has already been wellstudied in the case of finitedimensional exponential families, where desirable statistical properties such as asymptotic unbiasedness, consistency and asymptotic normality have been established. However, it is difficult to extend MLE to the infinitedimensional case. Beyond the intractability of evaluating the partition function for a general exponential family, the necessary conditions for maximizing loglikelihood might not be feasible; that is, there may be no solution to the KKT conditions in the infinitedimensional case (Pistone and Rogantin, 1999; Fukumizu, 2009). To address this problem, Barron and Sheu (1991); Gu and Qiu (1993); Fukumizu (2009) considered several ways to regularize the function space by constructing (a series of) finitedimensional spaces that approximate the original RKHS, yielding a tractable estimator in the restricted finitedimensional space. However, as Sriperumbudur et al. (2017) note, even with the finitedimension approximation, these algorithms are still expensive as every update requires MonteCarlo sampling from a current model to compute the partition function.
An alternative score matching based estimator has recently been introduced by Sriperumbudur et al. (2017). This approach replaces the Kullback–Leibler () with the Fisher divergence, defined by the expected squared distance between the score of the model (, the derivative of the density) and the target distribution score. By minimizing the Tikhonovregularized Fisher divergence, Sriperumbudur et al. (2017) develop a computable estimator for the infinitedimensional exponential family that also obtains a consistency guarantee. Recently, this method has been generalized to conditional infinitedimensional exponential family estimation (Arbel and Gretton, 2017). Although score matching avoids the generally intractable integral, it requires computing and saving the first and secondorder derivatives of the reproducing kernel for each dimension on each sample. For samples with features, this results in memory and time cost respectively, which becomes prohibitive for datasets with only moderate sizes of and . To alleviate this cost, Sutherland et al. (2017) utilize the rank Nystöm approximation to the kernel matrix, reducing memory and time complexity to and respectively. Although this reduces the cost dependence on sample size, the dependence on
is unaffected, hence the estimator remains unsuitable for highdimensional data. Estimating a general exponential family model by either MLE or score matching also generally requires some form of MonteCarlo sampling, such as MCMC or HMC, to perform inference, which significantly adds to the computational cost.
In summary, both the MLE and score matching based estimators for the infinitedimensional exponential family incur significant computational overhead, particularly in highdimensional applications.
In this paper, we revisit penalized MLE for the kernel exponential family and propose a new estimation strategy. Instead of solving the loglikelihood equation directly, as in existing MLE methods, we exploit a doubly dual embedding technique that leads to a novel saddlepoint reformulation for the MLE (along with its conditional distribution generalization) in sec:dual_mle. We then propose a stochastic algorithm for the new view of penalized MLE in sec:alg. Since the proposed estimator is based on penalized MLE, it does not require the first and secondorder derivatives of the kernel, as in score matching, greatly reducing the memory and time cost. Moreover, since the saddle point reformulation of penalized MLE avoids the intractable integral, the need for a MonteCarlo step is also bypassed, thus accelerating learning. This approach also allows a flexibly parameterized sampler to be learned simultaneously, therefore, further reducing inference cost. We present the consistency and rate of the new estimation strategy in the wellspecified case, , the true density belongs to the kernel exponential family, and an algorithm convergence guarantee in sec:theory. We demonstrate the empirical advantages of the proposed algorithm in sec:experiments, comparing to stateoftheart estimators for both the kernel exponential family and its conditional extension (Sutherland et al., 2017; Arbel and Gretton, 2017).
2 Preliminaries
We first provide a preliminary introduction to the exponential family and Fenchel duality, which will play vital roles in the derivation of the new estimator.
2.1 Exponential family
The natural form of the exponential family over with the sufficient statistics is defined as
(1) 
where , , and . In this paper, we mainly focus on the case where is a reproducing Hilbert kernel space (RKHS) with kernel , such that
. However, we emphasize that the proposed algorithm in sec:alg can be easily applied to arbitrary differentiable function parametrizations, such as deep neural networks.
Given samples , a model with a finitedimensional parameterization can be learned via maximum loglikelihood estimation (MLE),
where denotes the empirical expectation over . The MLE eq:mle is wellstudied and has nice properties. However, the MLE is known to be “illposed” in the infinitedimensional case, since the optimal solution might not be exactly achievable in the representable space. Therefore, penalized MLE has been introduced for the finite (Dudík et al., 2007) and infinite dimensional Gu and Qiu (1993); Altun and Smola (2006)
exponential families respectively. Such regularization essentially relaxes the moment matching constraints in terms of some norm, as shown later, guaranteeing the existence of a solution. In this paper, we will also focus on MLE with RKHS norm regularization.
One useful theoretical property of MLE for the exponential family is convexity w.r.t. . With convex regularization, ,
, one can use stochastic gradient descent to recover a unique global optimum. Let
denote RKHS norm penalized loglikelihood, ,(2) 
where denotes the regularization parameter. The gradient of w.r.t. can be computed as
(3) 
To calculate the in eq:grad, we denote and expand the partition function by definition,
(4)  
We may approximate the
by MCMC samples, which leads to the Contrastive Divergence (CD) algorithm
(Hinton, 2002).To avoid costly MCMC sampling in estimating the gradient, Sriperumbudur et al. (2017) construct an estimator based on score matching instead of MLE, which minimizes the penalized Fisher divergence. Plugging the kernel exponential family into the empirical Fisher divergence, the optimization reduces to
(5) 
where
As we can see, the score matching objective (5) does not involve the intractable integral in . However, such an estimator requires the computation of first and secondorder of derivatives of kernel for each dimension on each data, leading to memory and time cost of and respectively. This quickly becomes prohibitive for even modest and .
The same difficulty also appears in the score matching based estimator in (Arbel and Gretton, 2017) for the conditional exponential family, which is defined as
(6) 
where , , , and , is defined as .
In this paper, we consider such that is in RKHS for . Denoting such that , we can derive its kernel function following Micchelli and Pontil (2005); Arbel and Gretton (2017). By the Riesz representation theorem, and , there exists a linear operator such that
Then, the kernel can be defined by composing with its dual, , and the function .
2.2 Convex conjugate and Fenchel duality
Denote as a function , then its convex conjugate function is defined as
If is proper, convex and lower semicontinuous, the conjugate function, , is also proper, convex and lower semicontinuous. Moreover, and are dual to each other, , . Such a relationship is known as Fenchel duality (HiriartUrruty and Lemaréchal, 2012). By the conjugate function, we can represent the by as,
The supremum achieves if , or equivalently .
3 A SaddlePoint Formulation of Penalized MLE
As discussed in sec:preliminary, the penalized MLE for the exponential family involves computing the partition functions, and , which are intractable in general. In this section, we first introduce a saddlepoint reformulation of the penalized MLE of the exponential family, using Fenchel duality to bypass these intractable quantities. This approach can also be generalized to the conditional exponential family. First, observe that we can rewrite the logpartition function via Fenchel duality as follows. [Fenchel duality of logpartition]
(7)  
(8) 
where denotes the space of distributions and . Denote , which is strongly concave w.r.t. , the optimal can be obtained by setting the
Consider the , we have
which is the second conclusion. Plugging the optimal solution to , we obtain the maximum as , which is exactly , achieving the first conclusion. Therefore, plugging the Fenchel dual of into the penalized MLE, we achieve a saddlepoint optimization, ,
The saddlepoint reformulation of the penalized MLE in (3) resembles the optimization in MMD GAN (Li et al., 2017; Bińkowski et al., 2018). In fact, the dual problem of the penalized MLE of the exponential family is a KLregularized MMD GAN with a special design of the kernel family. Alternatively, if was a Wasserstein function, the optimization would resemble the Wasserstein GAN (Arjovsky et al., 2017).
Next, we consider the duality properties. [weak and strong duality] For general case where is continuous, weak duality holds, ,
If is convex and compact, . is some RKHS, then strong duality holds for
(9) 
thm:minmax_switch can be obtained by directly applying the minimax theorem (Sion et al., 1958). We refer to the  problem in (9) as the primal problem, while the  form as the its dual form.
Remark (connections to MMD GAN):
Consider the dual problem with the kernel learning, ,
(10) 
where we involve the parameters of the kernel to be learned in the optimization. By setting the gradient , for fixed , we obtain the optimal witness function in the RKHS, , which leads to
(11) 
It is clear that this is the divergence regularized MMD GAN. Thus, with divergence regularization, the MMD GAN learns an infinitedimension exponential family in an adaptive RKHS. Such a novel perspective bridges GAN and exponential family estimation, which appears to be of independent interest and potentially brings a new connection to the GAN literature for further theoretical development.
Remark (connections to Maximum Entropy Moment Matching):
Altun and Smola (2006); Dudík et al. (2007) discuss the maximum entropy moment matching method for distribution estimation,
(12)  
whose dual problem is exactly the penalized MLE (2). Interestingly, the proposed saddlepoint formulation (3) shares the solution to (12). From the maximum entropy view, it is clear that the penalty relaxes the moment matching constraints. However, the algorithms provided in Altun and Smola (2006); Dudík et al. (2007) simply ignore the difficulty in computing the expectation in , which is not practical, especially when is infinitedimensional.
Similar to thm:fenchel_log_partition, we can also represent by its Fenchel dual,
(13)  
(14) 
Then, we can recover the penalized MLE for the conditional exponential family as
(15) 
The only difference between (3) and (15) is that the dual distribution for the conditional exponential family is now a conditional distribution.
The saddlepoint reformulations of penalized MLE in (3) and (15) bypass the difficulty in the partition function. Therefore, it is very natural to consider learning the exponential family by solving the saddlepoint problem (3) with an appropriate parametrized dual distribution . This approach is referred to as the “dual embedding” technique in Dai et al. (2016), which requires:

the parametrization family should be flexible enough to reduce the extra approximation error;

the parametrized representation should be able to provide density value.
As we will see in sec:theory, the flexibility of the parametrization of the dual distribution will have a significant effect on the consistency of the estimator.
One can of course use the kernel density estimator (KDE) as the dual distribution parametrization, which leads to convexconcave preservation. However, KDE will easily fail when approximating highdimension data. Applying the reparametrization trick with a suitable class of probability distributions
(Kingma and Welling, 2013; Rezende et al., 2014) can also be used to parametrize the dual distribution. However, the class of such parametrized distributions is typically restricted to simple known distributions, which will not be able to approximate the true solution, leading to a potentially huge approximation bias. At the other end of the spectrum, the distribution family generated by transport mapping is sufficiently flexible to model smooth distributions (Goodfellow et al., 2014; Arjovsky et al., 2017). However, the density value of such distributions, , , is not available for divergence computation, and thus, is not applicable for parameterizing the dual distribution. Recently, flowbased parametrized density functions (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016) have been proposed for a tradeoff between the flexibility and tractability. However, the expressive power of existing flowbased models remains overly restrictive even in our synthetic example and cannot be directly applied to conditional models.3.1 Doubly Dual Embedding for MLE
Transport mapping is very flexible for generating smooth distributions. However, a major difficulty is that it lacks the ability to obtain the density value , making the computation of the divergence impossible. To enjoy the flexibility of transport mapping and avoid the calculation of , we introduce doubly dual embedding, which will achieve a delicate balance between flexibility and tractability.
First, noting that divergence is also a convex function, we consider the Fenchel dual of divergence (Nguyen et al., 2008), ,
(16)  
(17) 
with . One can see in the dual representation of the divergence that the introduction of the auxiliary optimization variable eliminates the explicit appearance of in (16), which makes the transport mapping parametrization for become applicable. Since there is no extra restriction on , we can use arbitrary smooth function approximation for , such as kernel functions or neural networks.
Applying the dual representation of divergence into the dual problem of the saddlepoint reformulation (3), we obtain the ultimate saddlepoint optimization for the estimation of the kernel exponential family,
It should be emphasized that although other works have applied Fenchel duality to divergence (Nguyen et al., 2008; Nowozin et al., 2016), the approach taken here is different in considering , whereas those works consider . It is also obvious in the view of the optimization we obtained in (3.1), comparing to the objective in GAN.
Remark (Extension for kernel conditional exponential family):
Similarly, we can apply the doubly dual embedding technique to the penalized MLE of the kernel conditional exponential family, which leads to
In summary, with the doubly dual embedding technique, we achieve a saddlepoint reformulation of penalized MLE that bypasses the difficulty of handling the intractable partition function while allowing great flexibility in parameterizing the dual distribution.
4 Practical Algorithm
In this section, we introduce the transport mapping parametrization for the dual distribution, , and adapt the stochastic gradient descent for solving the optimization. Since these two optimizations are almost the same, for simplicity of exposition, we only illustrate the algorithm for (3.1). We emphasize the algorithm can be easily applied to (3.1).
Denote the parameters in the transport mapping for as , such that and . We illustrate the algorithm with a kernel represented . There are many alternative choices of parametrization of , , neural networks—the proposed algorithm is still applicable to the parametrization as long as it is differentiable. We abuse notation somewhat by using as in (3.1).
With the kernel parametrized , the inner maximization over and is a standard concave optimization on. We can solve it using existing algorithms to achieve the global optimal solution. Due to the existence of the expectation, we will use the stochastic function gradient descent for scalability. Given , following the definition of functional gradients Kivinen et al. (2004); Dai et al. (2014), we have
(18) 
(19) 
In the th iteration, given sample , and , the update rule for and will be
where denotes the stepsize.
Then, we consider the update rule for the parameters in dual transport mapping embedding. [Dual gradient] Denoting as and , we have
(20) 
Proof details are given in appendix:subsec:dual_grad. With this gradient estimator, we can apply stochastic gradient descent to update iteratively. We summarize the updates for and in alg:sgd_double_dual and 2.
Compared to score matching based estimators Sriperumbudur et al. (2017); Sutherland et al. (2017), although the convexity no longer holds for the saddlepoint estimator, the doubly dual embedding estimator avoids representing with derivatives of the kernel, which avoids the memory cost dependence on the square of dimension. In particular, the proposed estimator for based on the saddlepoint reformulation of penalized MLE reduces the memory cost from to where denotes the number of iterations in alg:sgd_f. In terms of time cost, we exploit the stochastic update, which is naturally suitable for largescale datasets, avoiding the matrix inverse computation in score matching estimator whose cost will be . Moreover, we also learn the dual distribution simultaneously, which can easily generate samples from the learned exponential family. This can therefore be used for inference, saving the MonteCarlo sampling in the inference stage.
Remark (random feature extension):
Memory cost is the wellknown bottleneck for applying kernel methods to largescale problems. When we set , the memory cost will be , which is prohibitive for millions of data points. Random feature approximation (Rahimi and Recht, 2008; Dai et al., 2014; Bach, 2015) can be utilized for scaling up kernel methods. The proposed alg:sgd_double_dual and alg:sgd_f are also compatible with random feature approximation, and hence, applicable to largescale problems in the same way. With random features, we can further reduce the memory cost of to . However, even with a random feature approximation, the score matching based estimator will still require memory. One can also learn random features by backpropagation, which leads to a neural networks extension. Due to space limitations, the details of this variant of alg:sgd_double_dual are given in Appendix B.
5 Theoretical Analysis
In this section, we will first provide the analysis of consistency and the sample complexity of the proposed estimator based on the saddlepoint reformulation of the penalized MLE in the wellspecified case, where the true density is assumed to be in the kernel (conditional) exponential family, following Sutherland et al. (2017); Arbel and Gretton (2017). Then, we consider the convergence property of the proposed algorithm. We mainly focus on the analysis for . The results can be easily extended to kernel conditional exponential family .
5.1 Statistical Consistency
We explicitly consider approximation error from the dual embedding in the consistency of the proposed estimator. We first establish some notation that will be used in the analysis. For simplicity, we set and improperly in the exponential family expression (1). We denote as the groundtrue distribution with its true potential function and as the optimal primal and dual solution to the saddle point reformulation of the penalized MLE (3). The parametrized dual space is denoted as . We denote as the exponential family generated by . We have the consistency results as Assume the spectrum of kernel decays sufficiently homogeneously in rate . With some other mild assumptions listed in appendix:subsec:consistency, we have as and ,
where . Therefore, when setting , converges to in terms of JensenShannon divergence at rate For the details of the assumptions and the proof, please refer to appendix:subsec:consistency. Recall the connection between the proposed model and MMD GAN as discussed in sec:dual_mle. thm:consistency also provides a learnability guarantee for a class of GAN models as a byproduct. The most significant difference of the bound provided above, compared to Gu and Qiu (1993); Altun and Smola (2006), is the explicit consideration of the bias from the dual parametrization. Moreover, instead of the Rademacher complexity used in the sample complexity results of Altun and Smola (2006), our result exploits the spectral decay of the kernel, which is more directly connected to properties of the RKHS.
From thm:consistency we can clearly see the effect of the parametrization of the dual distribution: if the parametric family of the dual distribution is simple, the optimization for the saddlepoint problem may become easy, however, will dominate the error. The other extreme case is to also use the kernel exponential family to parametrize the dual distribution, then, will reduce to , however, the optimization will be difficult to handle. The saddlepoint reformulation provides us the opportunity to balance the difficulty of the optimization with approximation error.
The statistical consistency rate of our estimator and the score matching estimator (Sriperumbudur et al., 2017) are derived under different assumptions, therefore, they are not directly comparable. However, since the smoothness is not fully captured by the score matching based estimator in Sriperumbudur et al. (2017), it only achieves even if is infinitely smooth. While under the case that dual distribution parametrization is relative flexible, , is negligible, and the the spectrum of the kernel decay rate , the proposed estimator will converge in rate , which is significantly more efficient than the score matching method.
5.2 Algorithm Convergence
It is wellknown that the stochastic gradient descent converges for saddlepoint problem with convexconcave property (Nemirovski et al., 2009). However, for better the dual parametrization to reduce in thm:consistency, we parameterize the dual distribution with the nonlinear transport mapping, which breaks the convexity. In fact, by thm:dual_gradient, we obtain the unbiased gradient w.r.t. . Therefore, the proposed alg:sgd_double_dual can be understood as applying the stochastic gradient descent for the nonconvex dual minimization problem, , . From such a view, we can prove the sublinearly convergence rate to a stationary point when stepsize is diminishing following Ghadimi and Lan (2013); Dai et al. (2017). We list the result below for completeness.
[Ghadimi and Lan (2013)] Assume that the parametrized objective is
Lipschitz and variance of its stochastic gradient is bounded by
. Let the algorithm run for iterations with stepsize for some and output . Setting the candidate solution to be randomly chosen from such that , then it holds that where represents the distance of the initial solution to the optimal solution. The above result implies that under the choice of the parametrization of and , the proposed alg:sgd_double_dual converges sublinearly to a stationary point, whose rate will depend on the smoothing parameter.6 Experiments
In this section, we compare the proposed doubly dual embedding (DDE) with the current stateoftheart score matching estimators for kernel exponential family (KEF) Sutherland et al. (2017)^{1}^{1}1https://github.com/karlnapf/nystromkexpfam and its conditional extension (KCEF) Arbel and Gretton (2017)^{2}^{2}2https://github.com/MichaelArbel/KCEF, respectively, as well as several competitors. We test the proposed estimator empirically following their setting. We use Gaussian RBF kernel for both exponential family and its conditional extension, , with the bandwidth set by mediantrick (Dai et al., 2014). For a fair comparison, we follow Sutherland et al. (2017) and Arbel and Gretton (2017) to set the for kernel exponential family and its conditional extension, respectively. The dual variables are parametrized by MLP with layers.
Density estimation
We evaluate the DDE on the synthetic datasets, including ring, grid and two moons, where the first two are used in Sutherland et al. (2017), and the last one is from Rezende and Mohamed (2015). The ring dataset contains the points uniformly sampled along three circles with radii and noise in the radial direction and extra dimensions. The dim grid dataset contains samples from mixture of Gaussians. Each center lies on one dimension in the dimension hypercube. The two moons dataset is sampled from the exponential family with potential function as . We use samples for training, and for testing (grid) or (ring, two moons) samples, following Sutherland et al. (2017).
(a) initialization  (b) th  (c) th 
(d) th  (c) th  (d) th 
We visualize the samples generated by the learned sampler, and compare it with the training datasets in the first row in fig:vis. The learned is also plotted in the second row in fig:vis. The DDE learned models generate samples that cover the training data, showing the ability of the DDE for estimating kernel exponential family on complicated data.
Then, we demonstrate the convergence of the DDE in fig:alg_conv on rings dataset. We initialize with a random dual distribution. As the algorithm iterates, the distribution converges to the target true distribution, justifying the convergence guarantees. More results on dimensional grid and two moons can be found in fig:alg_conv_more in appendix:more_exp. The DDE algorithm behaves similarly on these two datasets.
MMD  time (s)  

Datasets  DDE  KEF  DDEsample  KEF+HMC 
two moons  0.11  1.32  0.07  13.04 
ring  1.53  3.19  0.07  14.12 
grid  1.32  40.39  0.04  4.43 
Finally, we compare the MMD between the generated samples from the learned model to the training data with the current stateoftheart method for KEF (Sutherland et al., 2017), which performs better than KDE and other alternatives (Strathmann et al., 2015). We use HMC to generate the samples from the KEF learned model, while in the DDE, we can bypass the HMC step by the learned sampler. The computation time for inference is listed in tab:mmd_synthetic. It shows the DDE improves the inference efficiency in orders. The MMD comparison is listed in tab:mmd_synthetic. The proposed DDE estimator performs significantly better than the best performances by KEF in terms of MMD.
Conditional model extension.
In this part of the experiment, models are trained to estimate the conditional distribution on the benchmark datasets for studying these methods (Arbel and Gretton, 2017; Sugiyama et al., 2010). We centered and normalized the data and randmly split the datasets into a training and a testing set with equal size, as in Arbel and Gretton (2017). We evaluate the performances by the negative likelihood. Besides the score matching based KCEF, we also compared with LSCDE and KDE, introduced in (Sugiyama et al., 2010). The empirical results are summarized in tab:r_benchmark.
DDE  KCEF  KDE  LSCDE  

geyser  0.54 0.07  1.21 0.04  1.11 0.02  0.7 0.01 
caution  0.92 0.33  0.99 0.01  1.25 0.19  1.19 0.02 
ftcollinssnow  1.44 0.12  1.46 0.0  1.53 0.05  1.56 0.01 
highway  1.27 0.16  1.17 0.01  2.24 0.64  1.98 0.04 
snowgeese  0.81 0.19  0.72 0.02  1.35 0.17  1.39 0.05 
GAGurine  0.46 0.16  0.46 0.0  0.92 0.05  0.7 0.01 
topo  1.07 0.22  0.67 0.01  1.18 0.09  0.83 0.0 
CobarOre  1.37 0.08  3.42 0.03  1.65 0.09  1.61 0.02 
mcycle  0.72 0.19  0.56 0.01  1.25 0.23  0.93 0.01 
BigMac2003  0.43 0.34  0.59 0.01  1.29 0.14  1.63 0.03 
cpus  0.85 0.17  N/A  1.01 0.10  1.04 0.07 
crabs  0.51 0.20  N/A  0.99 0.09  0.07 0.11 
birthwt  1.18 0.09  1.18 0.13  1.48 0.01  1.43 0.01 
gilgais  0.64 0.05  0.65 0.08  1.35 0.03  0.73 0.05 
UN3  0.99 0.07  1.15 0.21  1.78 0.14  1.42 0.12 
ufc  1.02 0.03  0.96 0.14  1.40 0.02  1.03 0.01 
Although these datasets are lowdimensional with few samples and the KCEF uses the anisotropic RBF kernel (, different bandwidth in each dimension, making the experiments preferable to the KCEF), the proposed DDE still outperforms the competitors on six datasets significantly, and achieves comparable performance on the rest, even though it uses a simple isotropic RBF kernel. This further demonstrates the statistical power of the proposed DDE.
7 Conclusion
In this paper, we exploit the doubly dual embedding to reformulate the penalized MLE to a novel saddlepoint optimization, which bypasses the intractable integration and provides flexibility in parameterizing dual. The novel view reveals a unique understanding of GANs and leads to a practical algorithm, which achieves stateoftheart performance. We also establish the statistical consistency and algorithm convergence guarantee for the proposed algorithm.
References

Altun and Smola (2006)
Y. Altun and A. J. Smola.
Unifying divergence minimization and statistical inference via convex
duality.
In H.U. Simon and G. Lugosi, editors,
Proc. Annual Conf. Computational Learning Theory
, LNCS, pages 139–153. Springer, 2006.  Arbel and Gretton (2017) Michael Arbel and Arthur Gretton. Kernel conditional exponential family. arXiv preprint arXiv:1711.05363, 2017.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 Bach (2015) Francis R. Bach. On the equivalence between quadrature rules and random features. CoRR, abs/1502.06800, 2015. URL http://arxiv.org/abs/1502.06800.
 Barron and Sheu (1991) Andrew R Barron and ChyongHwa Sheu. Approximation of density functions by sequences of exponential families. The Annals of Statistics, pages 1347–1369, 1991.
 Bińkowski et al. (2018) Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
 Brown (1986) Lawrence D. Brown. Fundamentals of Statistical Exponential Families, volume 9 of Lecture notesmonograph series. Institute of Mathematical Statistics, Hayward, Calif, 1986.
 Canu and Smola (2006) S. Canu and A. Smola. Kernel methods and the exponential family. 69(79):714–720, 2006.
 Dai et al. (2014) B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. In Neural Information Processing Systems, 2014.
 Dai et al. (2016) Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. CoRR, abs/1607.04579, 2016.
 Dai et al. (2017) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2017. URL http://arxiv.org/abs/1712.10285.
 Devinatz (1953) A. Devinatz. Integral representation of pd functions. Trans. AMS, 74(1):56–77, 1953.
 Dinh et al. (2016) Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
 Dudík et al. (2007) Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8:1217–1260, 2007.
 Fukumizu (2009) K. Fukumizu. Exponential manifold by reproducing kernel Hilbert spaces, page 291–306. Cambridge University Press, 2009.
 Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic firstand zerothorder methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Gu and Qiu (1993) Chong Gu and Chunfu Qiu. Smoothing spline density estimation: Theory. The Annals of Statistics, pages 217–234, 1993.
 Hein and Bousquet (2004) M. Hein and O. Bousquet. Kernels, associated structures, and generalizations. Technical Report 127, Max Planck Institute for Biological Cybernetics, 2004.
 Hinton (2002) Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 HiriartUrruty and Lemaréchal (2012) JeanBaptiste HiriartUrruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 Kivinen et al. (2004) J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), Aug 2004.
 König (1986) H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.
 Li et al. (2017) ChunLiang Li, WeiCheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. Mmd gan: Towards deeper understanding of moment matching network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2203–2213. Curran Associates, Inc., 2017.

Micchelli and Pontil (2005)
C.A. Micchelli and M. Pontil.
On learning vectorvalued functions.
Neural Computation, 17(1):177–204, 2005.  Nemirovski et al. (2009) A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574–1609, January 2009. ISSN 10526234.
 Nguyen et al. (2008) X.L. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems 20, pages 1089–1096. MIT Press, Cambridge, MA, 2008.
 Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. fgan: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
 Pistone and Rogantin (1999) Giovanni Pistone and Maria Piera Rogantin. The exponential statistical manifold: mean parameters, orthogonality and space transformations. Bernoulli, 5(4):721–760, 1999.
 Rahimi and Recht (2008) A. Rahimi and B. Recht. Random features for largescale kernel machines. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.
 Rahimi and Recht (2009) A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Neural Information Processing Systems, 2009.

Rezende et al. (2014)
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.
In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 1278–1286, 2014.  Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 Sion et al. (1958) Maurice Sion et al. On general minimax theorems. Pacific Journal of mathematics, 8(1):171–176, 1958.
 Sriperumbudur et al. (2017) Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017.
 Strathmann et al. (2015) Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton. Gradientfree hamiltonian monte carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems, pages 955–963, 2015.
 Sugiyama et al. (2010) Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. Conditional density estimation via leastsquares density ratio estimation. In AISTATS, pages 781–788, 2010.
 Sutherland et al. (2017) Dougal J Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Efficient and principled score estimation with nystr” om kernel exponential families. arXiv preprint arXiv:1705.08360, 2017.
Appendix A Proof Details
a.1 Proof of thm:dual_gradient
thm:dual_gradient (Dual gradient) Denoted as and , we have
The conclusion can be proved by chain rule and the optimality conditions.
Specifically, notice that the are implicit functions of , we can calculate the gradient of w.r.t.
where the second equations come from the fact are optimal and are not functions of .
a.2 Proof for thm:consistency
The proof of thm:consistency mainly follows the technique in Gu and Qiu (1993) with extra consideration of the approximation error from the dual embedding.
We first define some notations that will be used in the proof. We denote , which induces the norm denoted as . We introduce as the maximizer to defined as
The proof relies on decomposing the error into two parts: i) the error between and ; and ii) the error between and .
By Mercer decomposition (König, 1986), we can expand as
With the eigendecomposition, we can rewrite function as . Then, we have and .
We make the following standard assumptions: There exists such that , . The eigenvalues of the kernel decay sufficiently homogeneously with rate , ,