1 Introduction
Information-theoretic divergences are widely used to quantify the notion of ‘distance’ between probability measures; commonly used examples include the Kullback-Leibler divergence (i.e., KL-divergence or relative entropy), f-divergences, and Rényi divergences. The computation and estimation of divergences is important in many applications, including independent component analysis [1], medical image registration [2, 3], genomic clustering [4], the information bottleneck method [5], independence testing [6], as well as in the analysis and design of generative adversarial networks (GANs) [7, 8, 9, 10, 11]. Estimation of divergences is known to be a difficult problem [12, 13]. Likelihood-ratio methods such as those in [14] are known to work best in low dimensions. However, recent work has shown that variational representations of divergences can be used to construct statistical estimators for the KL-divergence [15], and more general divergences [16, 17, 18], that scale well with dimension.
In this work we develop a new variational characterization for the family of Rényi divergences, R_α, α ∈ ℝ∖{0,1}, and study its use for statistical estimation. Specifically, we will prove

(1)  R_α(Q‖P) = sup_{φ∈B(Ω)} { (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}] },

where α ∈ ℝ∖{0,1}, P and Q are probability measures on a measurable space (Ω, ℳ), and B(Ω) denotes the set of bounded measurable real-valued functions on Ω; see Theorem 1 below. We also prove a version of (1) where the supremum is taken over all measurable functions. This is useful in the construction of statistical estimators and allows us to obtain a formula for the optimizer; see Theorem 2 and Section 4. Eq. (1) can be viewed as an extension of the well-known Donsker-Varadhan variational formula [19, 20],
(2)  R(Q‖P) = sup_{φ∈B(Ω)} { E_Q[φ] − log E_P[e^φ] },

for the relative entropy, R(Q‖P) = E_Q[log(dQ/dP)], to the full family of Rényi divergences. In particular, (2) is (formally) the α → 1 limit of (1).
Note that the objective functionals in both optimization problems (1) and (2) depend on P and Q only through the expectation of certain functions of φ. As a result, the objective functionals can be estimated in a straightforward manner using only samples from P and Q. This property was key in the use of Eq. (2) for the statistical estimation of the KL-divergence and applications to GANs in [15]. We will similarly take advantage of this property to construct statistical estimators for the Rényi divergences; see Section 4.
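As a simple illustration of this sample-based structure (our own sketch, not an experiment from the paper), consider two one-dimensional Gaussians Q = N(0.5, 1) and P = N(0, 1). Here the likelihood ratio is known in closed form, so we can plug the optimizer φ = log(dQ/dP) into the objective of (1) and estimate both expectations purely from samples; for equal-variance Gaussians, under the normalization used here, R_α(Q‖P) = (μ_Q − μ_P)²/(2σ²) for every α.

```python
import numpy as np

def renyi_dv_objective(alpha, phi, q_samples, p_samples):
    """Sample-based objective of formula (1):
    1/(alpha-1) * log E_Q[e^{(alpha-1) phi}] - 1/alpha * log E_P[e^{alpha phi}]."""
    t1 = np.log(np.mean(np.exp((alpha - 1.0) * phi(q_samples)))) / (alpha - 1.0)
    t2 = np.log(np.mean(np.exp(alpha * phi(p_samples)))) / alpha
    return t1 - t2

rng = np.random.default_rng(0)
mu_q, mu_p, sigma = 0.5, 0.0, 1.0
q = rng.normal(mu_q, sigma, 200_000)  # samples from Q
p = rng.normal(mu_p, sigma, 200_000)  # samples from P

# The optimizer phi* = log(dQ/dP); for equal-variance Gaussians it is linear in x.
phi_star = lambda x: (mu_q - mu_p) * x / sigma**2 - (mu_q**2 - mu_p**2) / (2 * sigma**2)

exact = (mu_q - mu_p) ** 2 / (2 * sigma**2)  # = 0.125, independent of alpha here
for alpha in [0.5, 2.0]:
    est = renyi_dv_objective(alpha, phi_star, q, p)
    print(f"alpha={alpha}: estimate={est:.4f}, exact={exact:.4f}")
```

Only expectations of functions of φ under P and Q appear, which is what makes the estimator feasible when densities are unknown.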
2 Background on Rényi Divergences
The family of Rényi divergences, first introduced by Rényi in [21], provides a means of quantifying the discrepancy between two probability measures Q and P on a measurable space (Ω, ℳ) that is especially sensitive to the relative tail behavior of the distributions. The Rényi divergence of order α ∈ (0,1) ∪ (1,∞) between Q and P, denoted R_α(Q‖P), can be defined as follows: Let ν be a sigma-finite positive measure with Q ≪ ν and P ≪ ν. Then

(3)  R_α(Q‖P) = (1/(α(α−1))) log ∫ (dQ/dν)^α (dP/dν)^{1−α} dν.
Such a ν always exists (e.g., ν = P + Q) and it can be shown that the definition (3) does not depend on the choice of ν. The R_α satisfy the divergence property R_α(Q‖P) ≥ 0, with equality if and only if Q = P. In this sense, the Rényi divergences provide a notion of ‘distance’ between probability measures. Note, however, that Rényi divergences are not symmetric; rather, they satisfy

(4)  R_α(Q‖P) = R_{1−α}(P‖Q).
Eq. (4) is used to extend the definition of R_α to α < 0. Rényi divergences are connected to the KL-divergence, R(Q‖P), through the following limiting formulas:

(5)  lim_{α→1} R_α(Q‖P) = R(Q‖P),

and if R(P‖Q) < ∞ or if R_α(Q‖P) < ∞ for some α < 0 then

(6)  lim_{α→0} R_α(Q‖P) = R(P‖Q).
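The defining formula (3) and the symmetry (4) can be checked numerically; the following is an illustrative sketch (not from the paper) that discretizes ν = Lebesgue measure on a grid for two explicit Gaussian densities.

```python
import numpy as np

x = np.linspace(-25.0, 25.0, 200_001)
dx = x[1] - x[0]

def gauss(x, mu, s):
    return np.exp(-((x - mu) ** 2) / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

q = gauss(x, 1.0, 1.0)   # density of Q = N(1, 1)
p = gauss(x, 0.0, 2.0)   # density of P = N(0, 4)

def renyi(alpha, dens_q, dens_p):
    # Definition (3): R_alpha(Q||P) = 1/(alpha(alpha-1)) * log \int q^alpha p^(1-alpha) dnu
    integral = np.sum(dens_q**alpha * dens_p ** (1.0 - alpha)) * dx
    return np.log(integral) / (alpha * (alpha - 1.0))

alpha = 0.3
lhs = renyi(alpha, q, p)        # R_alpha(Q||P)
rhs = renyi(1.0 - alpha, p, q)  # R_{1-alpha}(P||Q); equals lhs by the symmetry (4)
print(lhs, rhs)
```

The two printed values agree to machine precision, and both are strictly positive since Q ≠ P, consistent with the divergence property.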
See [22] for a detailed discussion of Rényi divergences and proofs of these (and many other) properties. Note, however, that our definition of the Rényi divergences is related to theirs by D_α = αR_α. Explicit formulas for the Rényi divergence between members of many common parametric families can be found in [23]. Rényi divergences are also connected with the family of f-divergences; see [24].
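For instance, the Gaussian case admits a closed form. The sketch below (our own addition) implements the standard Gaussian Rényi formula of the type given in [23], stated in the D_α normalization and converted via R_α = D_α/α; it is a useful source of exact reference values.

```python
import numpy as np

def renyi_gauss(alpha, mu0, S0, mu1, S1):
    """Closed-form R_alpha(N(mu0,S0) || N(mu1,S1)) under the normalization
    R_alpha = D_alpha / alpha, using the Gaussian formula of the type in [23]:
    D_alpha = (alpha/2) dmu^T Sa^{-1} dmu
              - 1/(2(alpha-1)) * log( |Sa| / (|S0|^{1-alpha} |S1|^alpha) ),
    where Sa = alpha*S1 + (1-alpha)*S0 (valid when Sa is positive definite)."""
    Sa = alpha * S1 + (1.0 - alpha) * S0
    dmu = mu0 - mu1
    quad = 0.5 * alpha * dmu @ np.linalg.solve(Sa, dmu)
    _, ld_a = np.linalg.slogdet(Sa)
    _, ld_0 = np.linalg.slogdet(S0)
    _, ld_1 = np.linalg.slogdet(S1)
    log_term = (ld_a - (1.0 - alpha) * ld_0 - alpha * ld_1) / (2.0 * (alpha - 1.0))
    return (quad - log_term) / alpha

rho = 0.5
S_joint = np.array([[1.0, rho], [rho, 1.0]])  # joint covariance of (X, Y)
S_prod = np.eye(2)                            # product of the marginals
mu = np.zeros(2)

# Renyi divergence between the joint and the product of marginals;
# as alpha -> 1 this recovers the KL mutual information -0.5*log(1 - rho^2).
print(renyi_gauss(0.5, mu, S_joint, mu, S_prod))
print(-0.5 * np.log(1 - rho**2))  # approx 0.1438
```

Evaluating the formula at α near 1 reproduces the KL mutual information, matching the limit (5).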
3 Variational Formulas for the Rényi Divergences
The aim of this paper is to derive variational characterizations of the Rényi divergences that generalize the Donsker-Varadhan variational formula (2). Our main result is the following:
Theorem 1 (Rényi-Donsker-Varadhan Variational Formula)
Let P and Q be probability measures on (Ω, ℳ) and α ∈ ℝ, α ∉ {0, 1}. Then

(7)  R_α(Q‖P) = sup_{φ∈B(Ω)} { (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}] }.
If Ω is a metric space with the Borel sigma algebra then one can replace B(Ω) in (7) with C_b(Ω), the space of bounded continuous functions on Ω, or with Lip_b(Ω), the space of bounded Lipschitz functions on Ω.
Note that functions in Lip_b(Ω) can have arbitrarily large (but finite) Lipschitz constants. The proof of Theorem 1 can be found in Section 5.
Remark 1
An obvious consequence of Eq. (7) is that any φ ∈ B(Ω) provides a lower bound on the Rényi divergences as follows:

(8)  R_α(Q‖P) ≥ (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}].

With appropriate conventions regarding possible infinities, this can be extended to unbounded functions, and hence the variational formula (7) holds when B(Ω) is replaced by ℳ(Ω), the set of all real-valued measurable functions on Ω. We also obtain a formula for the optimizer.
Theorem 2
Let P and Q be probability measures on (Ω, ℳ) and α ∈ (0,1) ∪ (1,∞). Then, with appropriate conventions regarding infinities,

(9)  R_α(Q‖P) = sup_{φ∈ℳ(Ω)} { (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}] }.

If Q ≪ P then the supremum in (9) is achieved at φ* = log(dQ/dP).
4 Estimation of Rényi Divergences
The estimation of divergences in high dimensions is a difficult but important problem, e.g., for independence testing [6] and the development of GANs [7, 8, 9, 10, 11]. Likelihood-ratio methods for estimating divergences are known to be effective primarily in low dimensions (see [14] as well as Figure 1 in [15] and further references therein). In contrast, variational methods for the KL- and f-divergences have proven effective in a range of high-dimensional systems [15, 18]. In this section we demonstrate that the variational characterizations (7) and (9) similarly lead to effective estimators for Rényi divergences. We mirror the approach in [15], which used the Donsker-Varadhan variational formula to estimate KL-based mutual information between high-dimensional distributions. It should be noted that high-dimensional problems still pose a considerable challenge in general; this is due in part to the problem of sampling rare events. However, existing Monte Carlo methods for sampling rare events (see, e.g., [25, 26, 27]) are still applicable here.
The estimators for Rényi divergences are constructed by solving the optimization problem in either Theorem 1 or Theorem 2 via stochastic gradient descent (SGD). More specifically, first note that the objective functionals in Eq. (7) and Eq. (9) do not involve the likelihood ratio dQ/dP, unlike the formula in the definition of the Rényi divergences (3). Rather, (7) and (9) depend on P and Q only through the expectation of certain functions of φ. This allows the objective functionals to be estimated in a straightforward manner using only samples from P and Q. Once a parametrization of the function space has been chosen (e.g., neural networks), the optimum can then be searched for via any SGD algorithm. The exact optimizer is generally unbounded and so Theorem 2, which optimizes over unbounded functions, is especially useful in this context.

4.1 Example: Estimating Rényi-Based Mutual Information
Here we demonstrate the utility of (7) and (9) by estimating the Rényi mutual information,

(10)  I_α(X; Y) ≜ R_α(P_{X,Y} ‖ P_X ⊗ P_Y),    (Rényi-MI)

between random variables X and Y; this should be compared with [15], which used the Donsker-Varadhan variational formula to estimate the KL mutual information, and [18], which considered f-divergences. (Mutual information is typically defined in terms of the KL-divergence, but one can consider many alternative divergences; see, e.g., [28].) We parameterized the function space via a neural network family φ_θ, θ ∈ Θ, with ReLU activation functions; Theorem 2 implies that one does not need to impose boundedness on the family. The expectations were estimated using i.i.d. samples from P_{X,Y} and P_X ⊗ P_Y and we used the Adam optimizer [29], an adaptive learning-rate SGD algorithm, to search for the optimum. In Figure 1 we show the results of estimating the Rényi-MI where X and Y are correlated d-dimensional Gaussians with componentwise correlation ρ.

4.2 Example: Estimating General Rényi Divergences
More generally, the technique described above can be used to estimate R_α(Q‖P) for any α, Q, and P, provided only that one has i.i.d. samples from both distributions. The ability to estimate such divergences is a key tool in the development of Rényi-based GANs [11]. Here we illustrate this ability by estimating the Rényi divergence between two example distributions. We again parameterize the function space by a neural network and optimize via SGD, as described in Section 4.1. In Figure 2 we show the results of estimating the Rényi divergence, as a function of α.
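The estimation procedure of Sections 4.1 and 4.2 can be sketched in a few lines. The paper's experiments use neural networks trained with Adam; as a minimal self-contained illustration (our own simplification, not the paper's code, and assuming SciPy is available) we instead use a two-parameter linear family φ_θ(x) = θ₀ + θ₁x — which contains the exact optimizer log(dQ/dP) for equal-variance Gaussians — and a generic derivative-free optimizer in place of SGD.

```python
import numpy as np
from scipy.optimize import minimize

def objective(theta, alpha, xq, xp):
    """Sample-based objective of (7)/(9) for the linear family phi(x) = theta[0] + theta[1]*x."""
    phi_q = theta[0] + theta[1] * xq
    phi_p = theta[0] + theta[1] * xp
    t1 = np.log(np.mean(np.exp((alpha - 1.0) * phi_q))) / (alpha - 1.0)
    t2 = np.log(np.mean(np.exp(alpha * phi_p))) / alpha
    return t1 - t2

rng = np.random.default_rng(1)
xq = rng.normal(0.5, 1.0, 100_000)   # i.i.d. samples from Q = N(0.5, 1)
xp = rng.normal(0.0, 1.0, 100_000)   # i.i.d. samples from P = N(0, 1)

alpha = 2.0
# Maximize the objective over theta; any optimizer can play the role of SGD here.
res = minimize(lambda th: -objective(th, alpha, xq, xp), x0=[0.0, 0.0], method="Nelder-Mead")
estimate = -res.fun
print(estimate)  # exact value for this pair is (0.5)^2 / 2 = 0.125
```

Note that the objective is invariant under φ → φ + c, so the intercept θ₀ is not identifiable; this does not affect the value of the supremum.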
5 Proofs of the Rényi-Donsker-Varadhan Variational Formulas
The starting point for the proof of Theorem 1 is the following variational formula, proven in [30]: Let P be a probability measure on (Ω, ℳ), g ∈ B(Ω), and α ∈ (0,1) ∪ (1,∞). Then

(11)  (1/α) log E_P[e^{αg}] = sup_{Q∈𝒫(Ω)} { (1/(α−1)) log E_Q[e^{(α−1)g}] − R_α(Q‖P) },

where the optimization is over 𝒫(Ω), the set of all probability measures on (Ω, ℳ) (this follows from Eq. (1.3) of [30] after a suitable substitution). Though Eq. (11) is not a Legendre transform, it is still in some sense a ‘dual’ version of (7); this is reminiscent of the duality between the Donsker-Varadhan variational formula (2) and the Gibbs variational principle (see Proposition 1.4.2 in [20]). Eq. (11) was previously used in [30, 31, 32] to derive uncertainty quantification bounds on risk-sensitive quantities (e.g., rare events or large deviations estimates) and in [33] to derive PAC-Bayesian bounds.
In fact, we will not require the full strength of (11). We will only need the following bound, valid for g ∈ B(Ω) and Q ∈ 𝒫(Ω):

(12)  (1/(α−1)) log E_Q[e^{(α−1)g}] − (1/α) log E_P[e^{αg}] ≤ R_α(Q‖P).

To keep our argument self-contained, we include a proof of Eq. (12) below (our proof is adapted from the proof of (11) found in Section 4 of [30]).
Proof 1 (Proof of Eq. (12))
We separate the proof into two cases.
1) α > 1: If Q is not absolutely continuous with respect to P then R_α(Q‖P) = ∞ (see Eq. (3)) and the result is trivial, so assume Q ≪ P. For g ∈ B(Ω) we can use Hölder’s inequality with conjugate exponents α and α/(α−1) to obtain

(13)  E_Q[e^{(α−1)g}] = E_P[e^{(α−1)g} (dQ/dP)] ≤ E_P[e^{αg}]^{(α−1)/α} E_P[(dQ/dP)^α]^{1/α}.

In this case, the definition (3) implies R_α(Q‖P) = (1/(α(α−1))) log E_P[(dQ/dP)^α], and so, after taking logarithms in (13) and dividing by α−1, we have proven the claimed bound (12).
2) α ∈ (0, 1): Let ν, q = dQ/dν, and p = dP/dν be as in the definition (3) and define I_Q = ∫ q e^{(α−1)g} dν and I_P = ∫ p e^{αg} dν. We have

(14)  I_Q = E_Q[e^{(α−1)g}] and I_P = E_P[e^{αg}].

Using Hölder’s inequality for the measure ν, the conjugate exponents 1/α and 1/(1−α), and the functions (q e^{(α−1)g})^α and (p e^{αg})^{1−α}, we find

(15)  ∫ q^α p^{1−α} dν = ∫ (q e^{(α−1)g})^α (p e^{αg})^{1−α} dν ≤ I_Q^α I_P^{1−α}.

Taking the logarithm of both sides, dividing by α(α−1) (which is negative), and using Eq. (14), we arrive at

(16)  R_α(Q‖P) ≥ (1/(α−1)) log E_Q[e^{(α−1)g}] − (1/α) log E_P[e^{αg}].

This implies the claimed bound (12) and completes the proof.
We now use Eq. (12) to derive the variational formula (7). The argument is inspired by the proof of the Donsker-Varadhan variational formula from Appendix C.2 in [20].
Proof 2 (Proof of Theorem 1)
If one can show Eq. (7) for all α ∈ (0, 1) and all α > 1, then, using Eq. (4) and reindexing φ → −φ in the supremum, one finds that Eq. (7) also holds for all α < 0. So we only need to consider the cases α ∈ (0, 1) and α > 1.
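To make the reindexing step explicit (a sketch of the computation, using the notation of (7)): for α < 0 we have 1 − α > 1, so applying Eq. (7) to R_{1−α}(P‖Q) and substituting ψ = −φ gives

```latex
R_\alpha(Q\|P) \overset{(4)}{=} R_{1-\alpha}(P\|Q)
  = \sup_{\psi \in B(\Omega)} \left\{ \frac{1}{-\alpha}\log E_P\!\left[e^{-\alpha\psi}\right]
      - \frac{1}{1-\alpha}\log E_Q\!\left[e^{(1-\alpha)\psi}\right] \right\}
  % substitute \psi = -\phi, a bijection of B(\Omega):
  = \sup_{\phi \in B(\Omega)} \left\{ \frac{1}{\alpha-1}\log E_Q\!\left[e^{(\alpha-1)\phi}\right]
      - \frac{1}{\alpha}\log E_P\!\left[e^{\alpha\phi}\right] \right\},
```

which is exactly the right-hand side of (7) at order α.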
Eq. (12) immediately implies

(17)  sup_{φ∈B(Ω)} H_α[φ] ≤ R_α(Q‖P), where

(18)  H_α[φ] ≜ (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}]

denotes the objective functional on the right-hand side of (7).
We separate the proof of the reverse inequality into three cases.
1) α > 1 and Q not absolutely continuous with respect to P: We will show sup_{φ∈B(Ω)} H_α[φ] = ∞ (writing H_α[φ] for the objective functional in (18)), which will prove the desired inequality. To do this, take a measurable set A with Q(A) > 0 but P(A) = 0 and define φ_M = M·1_A. The definition (18) implies

(19)  H_α[φ_M] = (1/(α−1)) log( e^{(α−1)M} Q(A) + Q(A^c) ) − (1/α) log( e^{αM} P(A) + P(A^c) )

(20)       ≥ M + (1/(α−1)) log Q(A).

The lower bound goes to ∞ as M → ∞ (here it is key that α > 1) and therefore we have the claimed result.
2) α > 1 and Q ≪ P: In this case we can take ν = P in Eq. (3) and write

(21)  R_α(Q‖P) = (1/(α(α−1))) log E_P[(dQ/dP)^α].

Define

(22)  g_M = min{ log(dQ/dP), M },  M > 0,

and φ_{M,N} = max{ g_M, −N }, N > 0. These are bounded and so Eq. (18) implies

(23)  sup_{φ∈B(Ω)} H_α[φ] ≥ (1/(α−1)) log E_Q[e^{(α−1)φ_{M,N}}] − (1/α) log E_P[e^{αφ_{M,N}}].

Using the dominated convergence theorem to take N → ∞, we find

(24)  sup_{φ∈B(Ω)} H_α[φ] ≥ (1/(α−1)) log E_Q[min{dQ/dP, e^M}^{α−1}] − (1/α) log E_P[min{dQ/dP, e^M}^α]
      ≥ (1/(α(α−1))) log E_P[min{dQ/dP, e^M}^α].

To obtain the last line we used E_Q[min{dQ/dP, e^M}^{α−1}] ≥ E_P[min{dQ/dP, e^M}^α], which follows from dQ = (dQ/dP) dP.
We have min{dQ/dP, e^M}^α ↗ (dQ/dP)^α as M → ∞, and so the monotone convergence theorem implies

(25)  sup_{φ∈B(Ω)} H_α[φ] ≥ (1/(α(α−1))) log E_P[(dQ/dP)^α] = R_α(Q‖P).

This proves the claimed result for case 2.
3) α ∈ (0, 1): In this case definition (3) becomes

(26)  R_α(Q‖P) = (1/(α(α−1))) log ∫ q^α p^{1−α} dν,

where ν is any sigma-finite positive measure for which Q ≪ ν and P ≪ ν, q = dQ/dν, and p = dP/dν. Define g_M via Eq. (22) (with dQ/dP replaced by q/p) and let φ_{M,N} = max{ g_M, −N }, where log(q/p) is defined to be ∞ if p = 0 and −∞ if p > 0 and q = 0. The φ_{M,N} are bounded, hence (18) implies

(27)  sup_{φ∈B(Ω)} H_α[φ] ≥ (1/(α−1)) log E_Q[e^{(α−1)φ_{M,N}}] − (1/α) log E_P[e^{αφ_{M,N}}].

We can take N → ∞ using the dominated convergence theorem (here it is critical that α < 1, which guarantees an integrable dominating function) to find

(28)  sup_{φ∈B(Ω)} H_α[φ] ≥ (1/(α−1)) log E_Q[min{q/p, e^M}^{α−1}] − (1/α) log E_P[min{q/p, e^M}^α].

(Note that the second term is always finite.) The dominated convergence theorem can be used on the second term to take M → ∞ and obtain

(29)  sup_{φ∈B(Ω)} H_α[φ] ≥ (1/(α−1)) log ∫ q^α p^{1−α} dν − (1/α) log ∫ q^α p^{1−α} dν = R_α(Q‖P).

Therefore the claim is proven in case 3, and the proof of Eq. (7) is complete.
Now suppose Ω is a metric space with the Borel sigma algebra. Define the probability measure ν = (P + Q)/2 and let φ ∈ B(Ω). Lusin’s theorem (see, e.g., Appendix D in [34]) implies that for all ε > 0 there exists a closed set C_ε ⊂ Ω such that ν(Ω∖C_ε) < ε and φ|_{C_ε} is continuous. By the Tietze extension theorem (see, e.g., Theorem 4.16 in [35]) there exists φ_ε ∈ C_b(Ω) with ‖φ_ε‖_∞ ≤ ‖φ‖_∞ and φ_ε = φ on C_ε. Therefore

(30)  |E_Q[e^{(α−1)φ_ε}] − E_Q[e^{(α−1)φ}]| = |E_Q[(e^{(α−1)φ_ε} − e^{(α−1)φ}) 1_{Ω∖C_ε}]| → 0

as ε → 0⁺. Similarly, we have E_P[e^{αφ_ε}] → E_P[e^{αφ}]. Hence, writing H_α for the objective functional in (18),

(31)  sup_{ψ∈C_b(Ω)} H_α[ψ] ≥ lim_{ε→0⁺} H_α[φ_ε] = H_α[φ].

φ ∈ B(Ω) was arbitrary and so we have proven

(32)  sup_{ψ∈C_b(Ω)} H_α[ψ] ≥ sup_{φ∈B(Ω)} H_α[φ].

The reverse inequality is trivial. Therefore we have shown that one can replace B(Ω) with C_b(Ω) in Eq. (7). To see that one can further replace C_b(Ω) with Lip_b(Ω), use the fact that every φ ∈ C_b(Ω) is the pointwise limit of Lipschitz functions φ_n ∈ Lip_b(Ω) with ‖φ_n‖_∞ ≤ ‖φ‖_∞ (see Box 1.5 on page 6 of [36]). The result then follows from a similar computation to the above, this time using the dominated convergence theorem.
Next we prove Theorem 2, which extends the variational formula to unbounded functions.
Proof 3 (Proof of Theorem 2)
We need to show that

(33)  (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}] ≤ R_α(Q‖P)

for all φ ∈ ℳ(Ω). The equality (9) then follows by combining Eq. (33) with Theorem 1.
To prove the bound (33) we start by fixing φ ∈ ℳ(Ω) and defining the truncated functions

(34)  φ_{n,m} = min{ max{ φ, −m }, n },  n, m ∈ ℕ.

These are bounded and so Theorem 1 implies

(35)  (1/(α−1)) log E_Q[e^{(α−1)φ_{n,m}}] − (1/α) log E_P[e^{αφ_{n,m}}] ≤ R_α(Q‖P).
We now consider three cases, based on the value of α.
1) α > 1: If E_P[e^{αφ}] = ∞ then Eq. (33) is trivial (this is true even if E_Q[e^{(α−1)φ}] = ∞, due to our convention that ∞ − ∞ ≜ −∞), so suppose E_P[e^{αφ}] < ∞. When α > 1, Eq. (35) involves integrals of the form ∫ e^{cφ_{n,m}} dμ where c > 0 and μ is a probability measure. We have e^{cφ_{n,m}} → e^{cφ_n} pointwise as m → ∞, where φ_n ≜ min{φ, n}, and e^{cφ_{n,m}} ≤ e^{cn} for all m. Therefore the dominated convergence theorem implies

(36)  lim_{m→∞} ∫ e^{cφ_{n,m}} dμ = ∫ e^{cφ_n} dμ.

We have e^{cφ_n} ↗ e^{cφ} as n → ∞ and hence the monotone convergence theorem yields

(37)  lim_{n→∞} ∫ e^{cφ_n} dμ = ∫ e^{cφ} dμ.

Therefore we can take the iterated limit m → ∞, then n → ∞, of Eq. (35) to obtain

(38)  (1/(α−1)) log E_Q[e^{(α−1)φ}] − (1/α) log E_P[e^{αφ}] ≤ R_α(Q‖P)

(note that we are in the subcase where the second term is finite, and so this is true even if E_Q[e^{(α−1)φ}] = ∞). This proves the claim in case 1.
3) α ∈ (0, 1): If either E_Q[e^{(α−1)φ}] = ∞ or E_P[e^{αφ}] = ∞ then the bound (33) is again trivial, so suppose they are both finite. For n, m ∈ ℕ we can bound e^{(α−1)φ_{n,m}} ≤ max{e^{(α−1)φ}, 1} and e^{αφ_{n,m}} ≤ max{e^{αφ}, 1}, both of which are integrable (with respect to Q and P respectively). Therefore the dominated convergence theorem implies that

(39)  lim_{n,m→∞} E_Q[e^{(α−1)φ_{n,m}}] = E_Q[e^{(α−1)φ}] and lim_{n,m→∞} E_P[e^{αφ_{n,m}}] = E_P[e^{αφ}].

This proves Eq. (33) in case 3 and thus completes the proof of Eq. (9).
Acknowledgments
The research of J.B., M.K., and L.R.-B. was partially supported by NSF TRIPODS CISE-1934846. The research of M.K. and L.R.-B. was partially supported by the National Science Foundation (NSF) under the grant DMS-1515712 and by the Air Force Office of Scientific Research (AFOSR) under the grant FA9550-18-1-0214. The research of P.D. was supported in part by the National Science Foundation (NSF) under the grant DMS-1904992 and by the Air Force Office of Scientific Research (AFOSR) under the grant FA9550-18-1-0214. The research of J.W. was partially supported by the Defense Advanced Research Projects Agency (DARPA) EQUiPS program under the grant W911NF-15-2-0122.
References
 [1] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, ser. Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control. Wiley, 2004.
 [2] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE Trans Med Imaging, vol. 16, no. 2, pp. 187–198, 1997.
 [3] N. Kwak and Chong-Ho Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
 [4] A. Butte and I. S. Kohane, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” Pac Symp Biocomput., pp. 418–429, 2000.
 [5] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” 2000.
 [6] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014. [Online]. Available: https://www.pnas.org/content/111/9/3354
 [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, p. 2672–2680.
 [8] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, p. 271–279.

 [9] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 214–223.
 [10] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5769–5779.
 [11] Y. Pantazis, D. Paul, M. Fasoulakis, Y. Stylianou, and M. A. Katsoulakis, “Cumulant GAN,” arXiv:2006.06625, 2020.
 [12] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003. [Online]. Available: https://doi.org/10.1162/089976603321780272

 [13] S. Gao, G. V. Steeg, and A. Galstyan, “Efficient estimation of mutual information for strongly dependent variables,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon and S. V. N. Vishwanathan, Eds., vol. 38. San Diego, California, USA: PMLR, 09–12 May 2015, pp. 277–286. [Online]. Available: http://proceedings.mlr.press/v38/gao15.html
 [14] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins, “Nonparametric von Mises estimators for entropies, divergences and mutual informations,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 397–405.
 [15] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 531–540. [Online]. Available: http://proceedings.mlr.press/v80/belghazi18a.html
 [16] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
 [17] A. Ruderman, M. D. Reid, D. García-García, and J. Petterson, “Tighter variational representations of f-divergences via restriction to probability measures,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, ser. ICML’12. Madison, WI, USA: Omnipress, 2012, p. 1155–1162.
 [18] J. Birrell, M. A. Katsoulakis, and Y. Pantazis, “Optimizing variational representations of divergences and accelerating their statistical estimation,” arXiv e-prints, p. arXiv:2006.08781, Jun. 2020.
 [19] M. D. Donsker and S. R. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. IV,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160360204
 [20] P. Dupuis and R. Ellis., A Weak Convergence Approach to the Theory of Large Deviations, ser. Wiley series in probability and statistics. New York: John Wiley & Sons, 1997, a WileyInterscience Publication. [Online]. Available: http://opac.inria.fr/record=b1092351
 [21] A. Rényi, “On measures of entropy and information,” HUNGARIAN ACADEMY OF SCIENCES Budapest Hungary, Tech. Rep., 1961.
 [22] T. Van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” Information Theory, IEEE Transactions on, vol. 60, no. 7, pp. 3797–3820, 2014.
 [23] M. Gil, F. Alajaji, and T. Linder, “Rényi divergence measures for commonly used univariate continuous distributions,” Information Sciences, vol. 249, pp. 124 – 131, 2013.
 [24] F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, Oct 2006.
 [25] G. Rubino, B. Tuffin et al., Rare event simulation using Monte Carlo methods. Wiley Online Library, 2009, vol. 73.
 [26] J. Bucklew, Introduction to Rare Event Simulation, ser. Springer Series in Statistics. Springer New York, 2013.

 [27] A. Budhiraja and P. Dupuis, Analysis and Approximation of Rare Events: Representations and Weak Convergence Methods, ser. Probability Theory and Stochastic Modelling. Springer US, 2019.
 [28] S. Rahman, “The f-sensitivity index,” SIAM/ASA Journal on Uncertainty Quantification, vol. 4, no. 1, pp. 130–162, 2016.
 [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
 [30] R. Atar, K. Chowdhary, and P. Dupuis, “Robust bounds on risksensitive functionals via Rényi divergence,” SIAM/ASA Journal on Uncertainty Quantification, vol. 3, no. 1, pp. 18–33, 2015.
 [31] P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and L. Rey-Bellet, “Sensitivity analysis for rare events based on Rényi divergence,” arXiv e-prints, p. arXiv:1805.06917, May 2018.
 [32] R. Atar, A. Budhiraja, P. Dupuis, and R. Wu, “Robust bounds and optimization at the large deviations scale for queueing models via Rényi divergence,” 2020.
 [33] L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy, “PAC-Bayesian bounds based on the Rényi divergence,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Gretton and C. C. Robert, Eds., vol. 51. Cadiz, Spain: PMLR, 09–11 May 2016, pp. 435–444. [Online]. Available: http://proceedings.mlr.press/v51/begin16.html

 [34] R. Dudley, Uniform Central Limit Theorems, ser. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2014.
 [35] G. Folland, Real Analysis: Modern Techniques and Their Applications, ser. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley, 2013.
 [36] F. Santambrogio, Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling, ser. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015.