Introduction
Divergences and distances between multivariate probability distributions play a central role in many mathematical, engineering, and scientific fields ranging from statistical physics, large deviations theory, and uncertainty quantification to information theory, statistics, and machine learning. Variational representation formulas for divergences, also referred to as dual formulations, convert divergence calculation into an optimization problem over a function space and offer a valuable mathematical tool to build, train, and analyze probabilistic models and measure similarity between data collections. Typical examples of variational representations are, among others, the Legendre transformation (LT) of an divergence [1, 2], the DonskerVaradhan (DV) formula for the KullbackLeibler (KL) divergence [3, 4] and the RubinsteinKantorovich duality formula for Wasserstein distance [5]. Variational representations have been used in statistical mechanics and interacting particles [6], large deviations [4], divergence estimation [7, 8, 9], determining independence through mutual information estimation [10], adversarial learning of generative models [11, 12, 13], uncertainty quantification (UQ) of stochastic processes [14, 15], bounding risk in probably approximately correct (PAC) learning [16, 17, 18], as well as in parameter estimation [19].
There are two main mathematical ingredients needed for the construction of a variational formula. First, the function space where the optimal solution will be searched for and, second, the representation expression, called here the ‘objective functional’, whose optimization leads to the value of the divergence. Crucial practical advantages of variational formulas in statistics and machine learning include: a) they do not require an explicit form of the probability distributions (or their likelihood ratio); related probabilistic quantities can be approximated by statistical estimators over available data; b) they can exploit the capacity of rich regression models such as neural networks to efficiently search the function space for optimal solutions; the optimal solution is typically related to likelihood ratio.
A single divergence can be derived from several different objective functionals. The key contribution of this paper is a systematic methodology that uses families of transformations (e.g., shifts, affine, and powers) to build new, tighter variational representations for divergences by creating improved objective functionals, as described in our main Theorem 1. This idea is both simple and powerful; it provides a general framework that unifies many of the previous variational formulas in the literature, reveals new connections between them, drives the derivation of new variational formulas, and has practical implications in terms of accelerated statistical training, learning, and estimation from data.
Striking consequences of the proposed framework include: (i) the connection between LTbased KL, the DV representation formula, and to a new, improved DVtype formula, (ii) a concrete representation of the abstract objective functional in [9], and (iii) a derivation of new representation formulas for divergence and connections with a recently derived, DVtype variational representation of Rényi divergences.
The improved objective functionals constructed via our framework have the same optimal solution, but they are tighter in the sense that the same approximation of the optimum will provide a better approximation of the divergence, i.e., they are flatter around the optimal solution. We propose to employ (functional) Hessians of the objective functionals to quantify and order relative tightness gains between different variational representations of divergences, in terms of the local curvature around the optimal solution.
Finally, we demonstrate that these tighter representation formulas can accelerate numerical optimization and estimation of divergences in a series of synthetic and real examples, such as the statistical estimation of fdivergences and mutual information, including cases with highdimensional, real data (in excess of 700 dimensions). Similarly to [10], we parameterize the function space using neural networks, hence the (parameter) optimization is efficiently performed with a backpropagation algorithm. We find that the improved, tighter representation formulas converge several times faster than the initial representation formula, often by nearly an order of magnitude in highdimensional problems.
1 Tightening the Variational Representation of fDivergences
Background
Define to be the set of all convex functions with . If (resp. b) is finite, we extend to (resp. ) by continuity and set for . Such functions are appropriate for defining divergences, , which have the variational characterization
(1) 
where denotes the set of bounded measurable functions, [20, 8]. Under appropriate assumptions (see Theorem 4.4 in [20]), the maximum is achieved at
(2) 
There are cases, such as divergence with , where can be given a meaningful finite value even if (see [21]). However, the right hand side of (1) is always if and so here we use the convention that when . The special case of KullbackLeibler (KL) divergence (i.e., with ) has a wellknown alternative variational representation, the DonskerVaradhan (DV) variational formula
(3) 
It is known that the objective functional in (3) is tighter than that of (1), in the sense that for all , [9]. In this paper, we present a general procedure for obtaining tighter variational representations of any divergence, for which the transition from (1) to (3) is just one special case and where we can derive even tighter representations than (3).
Theoretical Results
Our method for deriving tighter variational representations for divergences is described in the following Theorem; a proof can be found in Section 4.
Theorem 1.
Let and suppose . With defined by Eq. (2) (when it exists), let be a family of functions (the test functions) with
(4) 
Consider any family of transformations
(5) 
that includes the identity map. Then
(6)  
(7) 
and the maximum in (6) is achieved at . Furthermore, the objective functional, , in the variational representation (6) is tighter than the objective functional in (1), in the sense that
(8) 
for all .
(i) The main new insight and primary mathematical tool in this paper is formula (6), which allows for the objective functional to be ‘improved/tightened’ using any appropriate family of transformations ; examples of such families are discussed in the next subsection. Eq. (6) is a simple but farreaching idea that reveals connections between many known variational representations and also leads to the derivation of new ones (see below, starting with Eq. (12)). (ii) The extension of Eq. (1) from to is useful because the exact optimizer, , is generally unbounded; various versions of this extension can be found in the literature [20, 8]. This extension is needed to justify the computation of variational derivatives around the optimum, see Section 2. It also implies that one does not need to impose boundedness condition via a cutoff function when employing neuralbased statistical estimation.(iii) The generalization of Eq. (1) to a family, , that contains the optimizer is the natural next step; again, see [20, 8]. It provides a great deal of flexibility in adapting the proposed variational representation (6) to different divergences and informs the algorithmic implementation. We use this idea several times to restrict the optimization to, e.g., positive functions for the divergences and finite dimensional submanifolds for exponential families, see Section 3. (iv) If is a group under composition then the objective functional (7) is invariant under the family of transformations , i.e., for all . (v) If is a metric space with the Borel algebra then one can replace with (bounded continuous functions) and with (continuous functions) in Theorem 1. (vi) The auxiliary optimization problem in Eq. (7) is often computed analytically; alternatively and due to its low dimensionality, the corresponding optimization can easily fit without significant additional computational cost, within gradient descent algorithms seeking for the optimal solution in .
Families of Transformations
Next, we present several useful families of transformations that, in conjunction with Theorem 1, yield tighter variational formulas.

Identity: leads to what we call the standard (or LT) divergence variational formula given by (1).

Scaling Transformation: , , which lead to the scaling or improved variational formula
(10) 
Affine Transformation: The above two cases can be combined into a two parameter family .

Power Transformation: , , which lead to the power or improved variational formula
(11) As with the affine transformations, the power transformations can be combined with the shift and/or scaling transformations to form a multiparameter family. These are related to the wellknown BoxCox transformation [23], used in statistics to transform nonnormal data sets to approximately normal.
Deriving New and Existing Variational Representations
Here we explore several specific cases of the above framework, focusing primarily on examples where the optimization over the shift and/or scaling parameter can be done analytically. In doing so, we produce new variational representations, as well as uncover connections with previously known variational formulas. We do not claim these examples are exhaustive. Nevertheless, they cover many important cases and illustrate the power and flexibility of Theorem 1.

DonskerVaradhan formula (KLdivergence with shift transformations): KLdivergence is the divergence corresponding to , which has Legendre transform . The maximum over the shift transformations in (6) occurs at , and hence
(12) The result is the objective functional in the wellknown DonskerVaradhan variational formula (3) and so this framework provides the connection between (3) and (1). We also note that in [24] a connection is derived between (3) and (1) by using a logarithmic change of variables in the function space, based on (2) for the KL case.

Improved DonskerVaradhan (KLDivergence with affine transformations): Introducing a scaling parameter into Eq. (12), i.e., optimizing over all affine transformations in (6), one finds the new KL variational representation
(13) The inclusions implies that (13) is tighter than DV (3). Calculations and numerical results that quantify this improved tightness are found in Section 2. The optimization over in Eq. (13) cannot be evaluated analytically in general, but it can be done numerically (as discussed in Section 3) and can also be approximated analytically, leading to an alternative variational characterization of the KL divergence; see Appendix B for details.

Connection with the results of [9]: In Theorem 1 of [9] the following improved variational formula was derived:
(14) This is another special case of our framework, as can be seen by first rewriting the LegendreFenchel transform and then using Theorem 4.2 in [22]:
(15) Hence the variational formula (14) is in fact the same as the improved variational formula (9).

Exponential Families: If and are members of a parametric family then the set of test functions, , in (6) can be reduced to a finite dimensional manifold. For instance, if and are members of the same exponential family with
the vector of sufficient statistics then the explicit optimizer
lies on an dimensional manifold of functions, parameterized by the sufficient statistics and constants, , , and computation of the divergence reduces to the following finitedimensional optimization problem:(16) This variational representation can be further combined with any appropriate family of transformations ; see Appendix D for details.

Divergences (scaling transformations): The divergences are the family of divergences corresponding to . See [21] for properties, related families, and further references. It includes the KL, Hellinger and divergences as special cases, see [25], is closely related to the Tsallis entropies [26], and appears also in the context of information geometry, see [27]. By optimizing (6) over the family of scaling transformations, (restricted to ), we obtain a new variational representation of the divergences:
(17) where if and if . Eq. (17) has the exact optimizers . Theorem 1 guarantees that the objective functionals in the new variational representations (17) are tighter than that of the standard divergence objective functional from (1). See Appendix A for details on the calculations that lead to Eq. (17), as well as for connections to the KL divergence in the limits as .

Variational representations of Rényi divergences: Eq. (17) leads to a variational characterization of Rényi divergences. Using the known connection between the and Rényi divergences, along with Eq. (17) and the change of variables (see Appendix A for details) one obtains
(18) This constitutes an independent derivation of the Renyi variational formula derived in [28, 24], while in the asymptotic limit one recovers (3). The Renyi variational formula (18) was also used in [29] to construct cumulantbased, generative adversarial networks.

Divergence: We use Theorem 1 to provide a new tight variational perspective on the classical HammersleyChapmanRobbins bound for the divergence. The divergence is a special case of the divergences, . Here we optimize (6) over to obtain
(19) where and the maximum is achieved at . Eq. (19) implies the HammersleyChapmanRobbins bound for the divergence (see, e.g., Eq. 4.13 in [30]) and shows tightness over the set . The objective functional in (19
) was proposed as loss function for
GANs, [31]; thus, (19) provides a complete and rigorous justification for this choice. Finally, if we instead optimize (6) over , we obtain the objective functional for derived in [24], which is thus less tight than (19); see Appendix A.3. 
Connections with Uncertainty Quantification: The improved DV representation (13) provides an alternative and arguably more general derivation of model uncertainty bounds derived recently in [14, 15]. These results quantify the effects of model uncertainty by bounding expectations of an observable under an alternative model, , in terms of the behavior under a baseline model and the model discrepancy, measured by . Specifically, (13) implies after straightforward manipulation that [14]. More generally, we can obtain similar UQ bounds when model discrepancy is measured by an fdivergence by using in (6) and performing the analogous manipulations:
(20) The HammersleyChapmanRobbins bound can also be viewed as a special case of (20) in this UQ context. Similarly, UQ bounds for risksensitive functionals in terms of Rényi divergences, which were obtained recently in [32], readily follow from (18) after appropriate manipulations and an optimization over .

Divergences (scaling and power transformations): For , Combining the scaling and power family of transformations yields
(21) The optimization over scalings was evaluated as in Eq. (17) but the optimization over the power transformations, , cannot be done analytically. In practice, can be included as an additional parameter in a numerical optimization procedure; see Section 3 for further discussion.
2 Variational Derivatives and Tightness Gains
In Theorem 1, we established the general methodology for building tighter variational representations of divergences, by constructing suitable objective functionals . Here we will quantify relative tightness gains corresponding to different transformation families : for all such families the maximizer in (6) is always given by (2). Therefore, our approach relies on building quadratic variational approximations of each objective functional (7) around the common maximizer , and subsequently comparing the corresponding (variational) Hessians; see Figure 1 for a demonstration. Specifically, using that the maximum occurs at , an asymptotic expansion yields
(22) 
where we define and is any functional perturbation of the maximizer . The second order term , i.e., the variational Hessian, is necessarily nonpositive and determines the behavior in a neighborhood of the maximizer. By comparing for different families , we can quantify the ‘tightness gains’ provided by different transformation families. All calculations can be made rigorous under appropriate assumptions, but here we choose to operate formally to keep the discussion brief.
We focus our analysis on affine tranformations, , but a similar analysis can be performed for any family with a smooth, finitedimensional parameterization. The standard divergence variational formula (1) corresponds to containing only the identity, and we write the corresponding objective functional as . Specializing (7) to the affine case, we define the functional
(23)  
which leads to four different objective functionals and variational representations of the divergence
(24)  
and the corresponding Hessians , , , and . Next we compute and compare these variational Hessians for the important case of KL divergence, where and . Detailed computations for general divergences can be found in Appendix C.
Tightness gains for KL divergence:
(25)  
(26)  
(27) 
corresponding to (1), (3), and (13), respectively. The gains inherent in the inclusions in Theorem 1 are quantified by comparing the variational curvatures Eq. (25), Eq. (26), and Eq. (27
); note that they are progressively smaller in magnitude. Curvature computations demonstrate how one can make rigorous and precisely quantify heuristics such as those presented in Figure 1 from
[9]. Our Hessian computations here also quantify and extend to divergences the accuracy gains observed in the neural estimation of mutual information in [10]. Figure 1 shows a simple numerical example using 1D Gaussians: , , with perturbations in the directions (top) and (bottom). Optimizing over all affine transformations (red curves) provides significant curvature gains when compared to optimization over only shifts (blue curves), i.e., the improved DV proposed in (13) compared to the classical DV objective functional (3) and even more so compared to the Legendre transform case, (1) (black curves).3 Numerical Examples: Faster Statistical Estimation and Learning
Next we discuss practical implications of using tighter variational representations developed in Theorem 1, focusing on accelerating neuralbased statistical learning and estimation. In recent works, variational representations such as (1) or (3) were used to estimate fdivergences and likelihood ratios based solely on available data [8]. This variational perspective proved also to be a crucial mathematical step in training generative adversarial networks (GAN) [11, 12, 13] and towards developing neuralbased estimators for mutual information, [10], taking advantage of the ability of neural networks to search efficiently through function spaces.
Improved variational formulas for statistical estimation and learning were previously studied in: i) [9], using Eq. (14) and assuming a Hilbert space (RKHS) function space, ii) [10], where the DV and LT formulas for the KL divergence were used to estimate mutual information and improve learning. Both of these implicitly rely on the improved variational formula (see Eq. (12) and Eq. (15)). Our Theorem 1 provides a broad generalization of these ideas to other transformation families, allows for practical implementation of the method in [9] to more general functions space parametrizations (e.g., neural networks), and generalizes the ideas in [10] to other divergences beyond KL, where it can provide improved mutual information estimators based on (6).
In the following, we employ the outcomes from Sections 1 & 2 and build several variational neural network estimators, in the general spirit of [8, 10]. We demonstrate the performance improvements that result from representations such as Eq. (6). We start with the heuristic observation, illustrated in Figure 1, that tighter representations can improve the accuracy of statistical estimators for fdivergences, in the sense that the same approximation of the optimal will provide a better approximation of the divergence. Moreover, tighter variational formulas can lead to faster convergence of the search algorithm, as we now motivate: suppose one minimizes a convex function by the simple gradient descent algorithm . If is Lipschitz (i.e., the Hessian is bounded by ) then this algorithm converges if , and the analysis suggests the optimal learning rate of and leads to the error bound (see, e.g., Theorem 3.3 in [33]). If has the same optimizer and optimal value, but has a smaller Hessian bound, , then the optimal learning rate, is larger, and the error bound after an equal number of steps is smaller, i.e., the use of in place of can lead to faster convergence.
The above argument is only heuristic; the constant learning rate algorithm is far from optimal in most cases and the above analysis does not capture the complexity of the current setting. Nonetheless, it does provide important insight into the numerical results, presented below, which demonstrate that, in practice, the improved variational formulas do generally lead to faster convergence of the estimators, letting all other factors be equal.
The examples below employ the new variational formulas (17) and (21) for divergence. We focus on the wellknown Hellinger divergence () but consider other cases as well (see also Appendix E
). Computations were done in TensorFlow using the AdamOptimizer
[34], an adaptive learningrate SGD optimizer, with all methods given the same initial learning rate. When working with neuralnetwork based estimators of divergence, we enforce positivity of the test functions (see (17)) via the parameterization whereis a neural network family with ReLU activation functions. If the optimization over a parameterized family of transformations,
, cannot be done analytically then we solve the minimization problem (6)(7) by performing stochastic gradient descent (SGD) on the full collection of parameters,
. In such cases, our method can be thought of as enhancement of the neural network structure. The nested nature of the minimization over and also allows for more sophisticated methods (not explored here), e.g., for each one can perform several SGD steps for , thus solving the (generally low dimensional) problem (7) to high accuracy, before performing another SGD step for in (6); this is reminiscent of multiscale numerical methods [35].HellingerMINE
In Figure 2, we present the computation of Hellinger mutual information (HellingerMI), , via neural network optimization. Typically the divergence in mutual information is chosen to be KL. However, one can consider a whole array of different fdivergences for this purpose, see for instance [36]. Here, and are correlated dimensional Gaussians with componentwise correlation . The results demonstrate that, for a given computational budget (i.e., fixed number of SGD iterations) the improved variational formulas (red and blue) yield more accurate results, i.e., they converge faster than the standard divergence method (1) (black). Moreover, optimizing over both scalings and powers (red) provides a nontrivial improvement over the scalingimproved method (blue). This is a generalization of the findings in [10], which compared the DV variational formula (3) with (1) for the KL divergence. We emphasize that despite the lack of an analytical formula for the optimization over , the inclusion of this single additional parameter in the variational formula leads to a clear performance gain.
Submanifold Parameterization for Exponential Families
Our method allows for a great deal of flexibility in the choice of function space parameterization. In a ‘smalldata’ setting, the assumption of an exponential family structure can serve as an effective regularization. We illustrate this with Figure 4, which shows the estimation of the divergence with between dimensional Gaussians using a data set of 5000 samples from each distribution for SGD (minibatch size of 100) and using another 5000 samples for Monte Carlo estimation of the value of the objective functional. Using the submanifold estimation formula (16) and its improved variant (see Appendix D) we obtain the magenta and red curves, respectively. In blue, we show the result from the improved variational formula (17) and in black we show the result using the standard divergence objective functional (1); both use neural network families with one fully connected hidden layer (5 nodes). The number of nodes was chosen so that all methods use approximately the same number of parameters. The neural network parameterization converges faster, but ends up with a larger bias than the submanifold parameterization. The improved variational formulas lead to faster convergence than the standard variational formula in both cases as expected by our theory.
dimensional Gaussians with randomly generated variances and one of the means randomly perturbed from zero. We compare the convergence performance between the neural network and submanifold parameterizations. Relative error was averaged over 50 runs.
MNIST Dataset
As a final example, we illustrate the accelerated speed of convergence on highdimensional ( dimensional) realistic data by estimating the Hellinger divergence between two distributions obtained by (iid) randomly translating the MNIST handwritten digits image dataset [37]. This provides an effective test case wherein we know the exact answer (). Figure 4 shows the error, as a function of the number of SGD iterations, and once again demonstrates that the improved variational formulas lead to faster convergence; in this case, nearly one order of magnitude fewer SGD iterations are required to reach an accuracy of when using the tighter objective functionals. In practice, this means that one can more quickly detect whether or not the two data streams are in fact coming from the same distribution.
Concluding, we state that although the proposed optimization framework was applied on statistical learning and estimation, it can be of broader interest, among others, in epistemic uncertainty quantification [14], in coarsegraining and model reduction [38, 39, 40], as well as in PAC learning [18] and adversarial learning [11, 13].
4 Proof of Theorem 1
In this section we provide a detailed proof of Theorem 1 from the main text. For the convenience of the reader, we will recall the relevant definitions and notation below.
Let be probability measures on a measurable space and, for any define to be the set of convex functions with . If (resp. ) is finite, we extend to (resp. ) by continuity and set for . The result is a convex, lower semicontinuous function, . The divergence of with respect to is defined by
(28) 
Our starting point is the the following variational characterization [20, 8]:
(29) 
where denotes the set of bounded measurable functions and
(30) 
is the Legendre transform.
Remark 1.
Note that . This implies is bounded below for , and hence is defined.
The technical aspects of the proof of Theorem 1 revolve around ensuring that all of the required expectations and operations are well defined (without requiring any arbitrary convention regarding the definition of ). Modulo those details, the derivation of Eq. (6) is quite simple. As a first step, we show that Eq. (29) can be extended to certain unbounded . This is similar to results in [20, 8] but we will prove explicit conditions for which the expectations exist. To do this, we will need the following lemmas:
Lemma 1.
Let . Then one of the following holds

is bounded below.

The set is of the form or for some and is nondecreasing.
Proof.
Suppose is not bounded below. Take with . We know and so and hence . is convex so it we let then this implies .
To show is nondecreasing, suppose that we have with . Taking as above, find an such that and . is convex and so, letting , we have
(31) 
This is a contradiction, hence is nondecreasing. ∎
Lemma 2.
Let and suppose is bounded below or . Then for all .
Remark 2.
We use the notation , , for the decomposition of a function into its positive and negative parts.
Proof.
Fix . If is bounded below then is bounded above and the result is trivial, so suppose not. Lemma 1 then implies that is nondecreasing. General properties of Legendre transforms on the real line imply that is convex, lower semicontinuous, and is continuous on . Hence there exists such that on and on (note that ). Define , so that , , and
(32) 
Hence .
Now define . is bounded above and so and we can use Eq. (29) to find
(33) 
We have pointwise, , and , so we can use the dominated convergence theorem to obtain
(34) 
(here it was important that we are in the case where ). We also have , hence (recall we are in the case where is nondecreasing) and for large enough we have for all . is continuous on , hence so
(35) 
Therefore the monotone convergence theorem implies , and so
(36) 
We therefore conclude that . ∎
We can now prove that Eq. (29) can be extended to .
Theorem 2.
Let and suppose either is bounded below or . Then
(37) 
where the objective functional is valued in .
Proof.
Lemma 2 implies for all and so the objective functional in Eq. (37) is valued in . If we can show for all then the claimed result will follow by using Eq. (29).
Fix . If or then the required bound is trivial, so suppose not. Then a.s. We are in the case where , and so and a.s. as well.
In summary, it suffices to show in the case where , , , . To do this, fix and define . and so Eq. (29) gives
(38) 
We have pointwise and , therefore the dominated convergence theorem .
We have and is continuous on , therefore pointwise. We also have
(39) 
hence the dominated convergence theorem implies . Combining these gives
(40) 
and so
(41) 
This proves the claim. ∎
We now prove Theorem 1 from the main text, which we restate below. For completeness, we also provide a derivation of the formula for the optimizer, , which was obtained in [20].
Theorem 3.
Let and suppose either is bounded below or . Then:

Suppose , is , is strictly increasing, and one of the following holds:


and if the value (resp. ) is achieved then

Comments
There are no comments yet.