Information-theoretic divergences are widely used to quantify the notion of ‘distance’ between probability measures; commonly used examples include the Kullback-Leibler divergence (i.e., KL-divergence or relative entropy), $f$-divergences, and Rényi divergences. The computation and estimation of divergences is important in many applications, including independent component analysis, medical image registration [3], genomic clustering, the information bottleneck method, and independence testing, as well as in the analysis and design of generative adversarial networks (GANs) [7, 8, 9, 10, 11].
Estimation of divergences is known to be a difficult problem [12, 13], and likelihood-ratio methods are known to work best in low dimensions. However, recent work has shown that variational representations of divergences can be used to construct statistical estimators, for the KL-divergence and more general $f$-divergences [16, 17, 18], that scale well with dimension.
In this work we develop a new variational characterization of the family of Rényi divergences and study its use for statistical estimation. Specifically, we will prove
\[ R_\alpha(Q\|P) = \sup_{\phi \in \mathcal{M}_b(\Omega)} \left\{ \frac{1}{\alpha-1} \log E_Q\!\left[e^{(\alpha-1)\phi}\right] - \frac{1}{\alpha} \log E_P\!\left[e^{\alpha\phi}\right] \right\}, \tag{1} \]
where $Q$ and $P$ are probability measures on $\Omega$, $\alpha \in (0,1) \cup (1,\infty)$, and $\mathcal{M}_b(\Omega)$ denotes the set of bounded measurable real-valued functions on $\Omega$; see Theorem 1 below. We also prove a version of (1) where the supremum is taken over all measurable functions. This is useful in the construction of statistical estimators and allows us to obtain a formula for the optimizer; see Theorem 2 and Section 4. Eq. (1) can be viewed as an extension of the well-known Donsker-Varadhan variational formula [19, 20],
\[ D(Q\|P) = \sup_{\phi \in \mathcal{M}_b(\Omega)} \left\{ E_Q[\phi] - \log E_P\!\left[e^{\phi}\right] \right\}. \tag{2} \]
Note that the objective functionals in both optimization problems (1) and (2) depend on $Q$ and $P$ only through expectations of certain functions of $\phi$. As a result, the objective functionals can be estimated in a straightforward manner using only samples from $Q$ and $P$. This property was key in the use of Eq. (2) for the statistical estimation of the KL-divergence and in applications to GANs. We will similarly take advantage of this property to construct statistical estimators for the Rényi divergences; see Section 4.
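To make the sample-based estimation concrete, here is a minimal numpy sketch of the Donsker-Varadhan objective $E_Q[\phi] - \log E_P[e^{\phi}]$ for the KL-divergence, using a hypothetical one-parameter family $\phi_\theta(x) = \theta x$ and a grid search in place of a neural network and SGD. For $Q = N(1,1)$ and $P = N(0,1)$ the exact optimizer $\log(dQ/dP)(x) = x - 1/2$ lies in this family up to an additive constant (to which the objective is insensitive), and $D(Q\|P) = 1/2$.

```python
import numpy as np

# Sketch: estimate KL(Q||P) via the Donsker-Varadhan objective using only samples.
# The linear family phi_theta(x) = theta * x is a stand-in for a neural network;
# the objective is invariant under phi -> phi + c, so this family contains an optimizer.
rng = np.random.default_rng(0)
y_q = rng.normal(1.0, 1.0, 100_000)  # i.i.d. samples from Q = N(1, 1)
x_p = rng.normal(0.0, 1.0, 100_000)  # i.i.d. samples from P = N(0, 1)

def dv_objective(theta):
    # E_Q[phi] - log E_P[exp(phi)], with expectations replaced by sample averages
    return np.mean(theta * y_q) - np.log(np.mean(np.exp(theta * x_p)))

kl_est = max(dv_objective(t) for t in np.linspace(0.0, 2.0, 201))
```

Any fixed $\theta$ already yields a statistical lower bound on $D(Q\|P)$, so maximizing over the family can only tighten the bound, up to sampling error.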
2 Background on Rényi Divergences
The family of Rényi divergences, first introduced by Rényi, provides a means of quantifying the discrepancy between two probability measures $Q$ and $P$ on a measurable space $(\Omega, \mathcal{M})$ that is especially sensitive to the relative tail behavior of the distributions. The Rényi divergence of order $\alpha$, $\alpha \in (0,1) \cup (1,\infty)$, between $Q$ and $P$, denoted $R_\alpha(Q\|P)$, can be defined as follows: Let $\nu$ be a sigma-finite positive measure with $Q \ll \nu$ and $P \ll \nu$. Then
\[ R_\alpha(Q\|P) = \frac{1}{\alpha(\alpha-1)} \log \int \left(\frac{dQ}{d\nu}\right)^{\alpha} \left(\frac{dP}{d\nu}\right)^{1-\alpha} d\nu. \tag{3} \]
Such a $\nu$ always exists (e.g., $\nu = (P+Q)/2$) and it can be shown that the definition (3) does not depend on the choice of $\nu$. The $R_\alpha$ satisfy the following divergence property: $R_\alpha(Q\|P) \ge 0$ with equality if and only if $Q = P$. In this sense, the Rényi divergences provide a notion of ‘distance’ between probability measures. Note, however, that Rényi divergences are not symmetric, but rather they satisfy
\[ R_\alpha(Q\|P) = R_{1-\alpha}(P\|Q). \tag{4} \]
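The definition can be checked numerically. The sketch below evaluates the defining integral in (3) with $\nu$ taken to be Lebesgue measure for one-dimensional Gaussians; the normalization $R_\alpha = \frac{1}{\alpha(\alpha-1)}\log \int q^\alpha p^{1-\alpha}\,d\nu$ is an assumption of this reconstruction. Under that normalization, for equal-variance Gaussians one has $R_\alpha(Q\|P) = (\mu_Q - \mu_P)^2/(2\sigma^2)$ for every $\alpha$, and the divergence property $R_\alpha \ge 0$ with equality iff $Q = P$ can be observed directly.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def renyi(alpha, mu_q, mu_p, sigma=1.0):
    # Defining integral with nu = Lebesgue measure, evaluated by a Riemann sum;
    # the normalization R_alpha = log(I) / (alpha * (alpha - 1)) is assumed here.
    x = np.linspace(-15.0, 15.0, 200_001)
    dx = x[1] - x[0]
    q = gauss_pdf(x, mu_q, sigma)
    p = gauss_pdf(x, mu_p, sigma)
    integral = np.sum(q**alpha * p ** (1.0 - alpha)) * dx
    return np.log(integral) / (alpha * (alpha - 1.0))

# Divergence property: renyi(a, m, m) == 0, while renyi(a, m1, m2) > 0 for m1 != m2.
```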
Eq. (4) is used to extend the definition of $R_\alpha(Q\|P)$ to $\alpha < 0$. Rényi divergences are connected to the KL-divergence, $D(Q\|P)$, through the following limiting formulas:
\[ \lim_{\alpha \searrow 0} R_\alpha(Q\|P) = D(P\|Q), \qquad \lim_{\alpha \nearrow 1} R_\alpha(Q\|P) = D(Q\|P), \tag{5} \]
and if $D(Q\|P) = \infty$ or if $R_\beta(Q\|P) < \infty$ for some $\beta > 1$ then
\[ \lim_{\alpha \searrow 1} R_\alpha(Q\|P) = D(Q\|P). \tag{6} \]
See van Erven and Harremoës for a detailed discussion of Rényi divergences and proofs of these (and many other) properties. Note, however, that our definition of the Rényi divergences is related to theirs by $D_\alpha = \alpha R_\alpha$. Explicit formulas for the Rényi divergence between members of many common parametric families can be found in the work of Gil, Alajaji, and Linder. Rényi divergences are also connected with the family of $f$-divergences; see Liese and Vajda.
3 Variational Formulas for the Rényi Divergences
The aim of this paper is to derive variational characterizations of the Rényi divergences that generalize the Donsker-Varadhan variational formula (2). Our main result is the following:
Theorem 1 (Rényi-Donsker-Varadhan Variational Formula)
Let $Q$ and $P$ be probability measures on $(\Omega, \mathcal{M})$ and let $\alpha > 0$, $\alpha \neq 1$. Then
\[ R_\alpha(Q\|P) = \sup_{\phi \in \mathcal{M}_b(\Omega)} \left\{ \frac{1}{\alpha-1} \log E_Q\!\left[e^{(\alpha-1)\phi}\right] - \frac{1}{\alpha} \log E_P\!\left[e^{\alpha\phi}\right] \right\}. \tag{7} \]
If $\Omega$ is a metric space with the Borel sigma algebra then one can replace $\mathcal{M}_b(\Omega)$ in (7) with $C_b(\Omega)$, the space of bounded continuous functions on $\Omega$, or with $\mathrm{Lip}_b(\Omega)$, the space of bounded Lipschitz functions on $\Omega$.
An obvious consequence of Eq. (7) is that any $\phi \in \mathcal{M}_b(\Omega)$ provides a lower bound on the Rényi divergence:
\[ R_\alpha(Q\|P) \ge \frac{1}{\alpha-1} \log E_Q\!\left[e^{(\alpha-1)\phi}\right] - \frac{1}{\alpha} \log E_P\!\left[e^{\alpha\phi}\right]. \tag{8} \]
With appropriate conventions regarding possible infinities, this can be extended to unbounded functions, and hence the variational formula (7) holds when $\mathcal{M}_b(\Omega)$ is replaced by $\mathcal{M}(\Omega)$, the set of all real-valued measurable functions on $\Omega$. We also obtain a formula for the optimizer.
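As a consistency check on the optimizer (a sketch, assuming the normalization $R_\alpha(Q\|P) = \frac{1}{\alpha(\alpha-1)}\log E_P[(dQ/dP)^\alpha]$ and $Q \ll P$), substituting $\phi^* = \log(dQ/dP)$ into the objective functional recovers the divergence:

```latex
\begin{align*}
\frac{1}{\alpha-1}\log E_Q\!\left[e^{(\alpha-1)\phi^*}\right]
  &= \frac{1}{\alpha-1}\log E_P\!\left[\tfrac{dQ}{dP}\left(\tfrac{dQ}{dP}\right)^{\alpha-1}\right]
   = \frac{1}{\alpha-1}\log E_P\!\left[\left(\tfrac{dQ}{dP}\right)^{\alpha}\right],\\
\frac{1}{\alpha}\log E_P\!\left[e^{\alpha\phi^*}\right]
  &= \frac{1}{\alpha}\log E_P\!\left[\left(\tfrac{dQ}{dP}\right)^{\alpha}\right].
\end{align*}
```

Subtracting gives $\left(\frac{1}{\alpha-1} - \frac{1}{\alpha}\right)\log E_P[(dQ/dP)^\alpha] = \frac{1}{\alpha(\alpha-1)}\log E_P[(dQ/dP)^\alpha] = R_\alpha(Q\|P)$, so $\phi^*$, which is generally unbounded, attains the supremum.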
4 Estimation of Rényi Divergences
The estimation of divergences in high dimensions is a difficult but important problem, e.g., for independence testing and the development of GANs [7, 8, 9, 10, 11]. Likelihood-ratio methods for estimating divergences are known to be effective primarily in low dimensions (see the estimation literature cited above and further references therein). In contrast, variational methods for the KL and $f$-divergences have proven effective in a range of high-dimensional systems [15, 18]. In this section we demonstrate that the variational characterizations (7) and (9) similarly lead to effective estimators for Rényi divergences. We mirror the approach of the mutual information neural estimation (MINE) method of Belghazi et al., which used the Donsker-Varadhan variational formula to estimate KL-based mutual information between high-dimensional distributions. It should be noted that high-dimensional problems still pose a considerable challenge in general; this is due in part to the problem of sampling rare events. However, existing Monte Carlo methods for sampling rare events (see, e.g., [25, 26, 27]) are still applicable here.
The resulting objectives can be maximized via stochastic gradient descent (SGD). More specifically, first note that the objective functionals in Eq. (7) and Eq. (9) do not involve the likelihood ratio $dQ/dP$, unlike the formula in the definition of the Rényi divergences (3). Rather, (7) and (9) depend on $Q$ and $P$ only through expectations of certain functions of $\phi$. This allows the objective functionals to be estimated in a straightforward manner using only samples from $Q$ and $P$. Once a parametrization of the function space has been chosen (e.g., neural networks), the optimum can then be searched for via any SGD algorithm. The exact optimizer is generally unbounded, and so Theorem 2, which optimizes over unbounded functions, is especially useful in this context.
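A minimal illustration of the sample-based objective (a sketch of the variational objective in (7), with a grid search over a hypothetical linear family $\phi_\theta(x) = \theta x$ standing in for a neural network and SGD):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.5
y_q = rng.normal(1.0, 1.0, 50_000)  # i.i.d. samples from Q = N(1, 1)
x_p = rng.normal(0.0, 1.0, 50_000)  # i.i.d. samples from P = N(0, 1)

def objective(theta):
    # (alpha-1)^{-1} log E_Q[exp((alpha-1) phi)] - alpha^{-1} log E_P[exp(alpha phi)],
    # with phi(x) = theta * x and expectations replaced by sample averages.
    t_q = np.log(np.mean(np.exp((alpha - 1.0) * theta * y_q))) / (alpha - 1.0)
    t_p = np.log(np.mean(np.exp(alpha * theta * x_p))) / alpha
    return t_q - t_p

# The objective is invariant under phi -> phi + c, and the exact optimizer
# log(dQ/dP)(x) = x - 1/2 is affine, so the slope-only family suffices here.
est = max(objective(t) for t in np.linspace(0.0, 2.0, 201))
```

Under the normalization assumed in this sketch the exact value is $R_{1/2}(Q\|P) = 1/2$, and the estimate converges to it as the sample size grows.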
4.1 Example: Estimating Rényi-Based Mutual Information
We illustrate the method by estimating the Rényi-based mutual information between random variables $X$ and $Y$; this should be compared with the MINE approach of Belghazi et al., which used the Donsker-Varadhan variational formula to estimate KL-based mutual information, and with work that considered $f$-divergences. (Mutual information is typically defined in terms of the KL-divergence, but one can consider many alternative divergences.) We parameterized the function space via a neural network family; Theorem 2 implies that one does not need to impose boundedness on the family. The expectations were estimated using i.i.d. samples from the joint distribution and the product of the marginals, and we used the Adam optimizer, an adaptive-learning-rate SGD algorithm, to search for the optimum. In Figure 1 we show the results of estimating the Rényi-MI where $X$ and $Y$ are correlated $d$-dimensional Gaussians with component-wise correlation $\rho$.
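The Gaussian experiment can be miniaturized in a few lines. The sketch below takes the Rényi-based mutual information to be $R_\alpha$ between the joint distribution and the product of marginals (an assumption, by analogy with the KL case), draws product samples by permuting one coordinate, and uses a hypothetical bilinear family $\phi_\theta(x,y) = \theta x y$ in place of a neural network:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n, rho = 0.5, 100_000, 0.9
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)  # corr(x, y) = rho
y_perm = rng.permutation(y)  # permuting y breaks dependence: samples from P_X x P_Y

def objective(theta):
    phi_joint = theta * x * y       # phi evaluated on joint samples (plays Q)
    phi_prod = theta * x * y_perm   # phi evaluated on product samples (plays P)
    t_q = np.log(np.mean(np.exp((alpha - 1.0) * phi_joint))) / (alpha - 1.0)
    t_p = np.log(np.mean(np.exp(alpha * phi_prod))) / alpha
    return t_q - t_p

# Keep theta moderate so the exponential moments stay finite for Gaussian inputs.
mi_est = max(objective(t) for t in np.linspace(0.0, 1.2, 121))
```

Each fixed $\theta$ yields a statistical lower bound on the Rényi-MI: at $\rho = 0$ the estimate is near zero, while strong correlation yields a clearly positive value.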
4.2 Example: Estimating General Rényi Divergences
More generally, the technique described above can be used to estimate $R_\alpha(Q\|P)$ for any $\alpha$ and any pair of distributions $Q$ and $P$, provided only that one has i.i.d. samples from both. The ability to estimate such divergences is a key tool in the development of Rényi-based GANs. Here we illustrate this ability by estimating the Rényi divergence between two example distributions. We again parameterize the function space by a neural network and optimize via SGD, as described in Section 4.1. In Figure 2 we show the results of estimating the Rényi divergence as a function of $\alpha$.
5 Proofs of the Rényi-Donsker-Varadhan Variational Formulas
Our proof is based on a known dual formula for the Rényi divergences, in which the optimization is over all probability measures $Q$ on $\Omega$ (see Eq. (1.3) of Atar, Chowdhary, and Dupuis). Though Eq. (11) is not a Legendre transform, it is still in some sense a ‘dual’ version of (7); this is reminiscent of the duality between the Donsker-Varadhan variational formula (2) and the Gibbs variational principle (see Proposition 1.4.2 in Dupuis and Ellis). Eq. (11) was previously used in [30, 31, 32] to derive uncertainty quantification bounds on risk-sensitive quantities (e.g., rare events or large deviations estimates) and by Bégin et al. to derive PAC-Bayesian bounds.
In fact, we will not require the full strength of (11). We will only need the following bound for $\phi \in \mathcal{M}_b(\Omega)$, $\alpha \in (0,1) \cup (1,\infty)$:
\[ \frac{1}{\alpha-1} \log E_Q\!\left[e^{(\alpha-1)\phi}\right] - \frac{1}{\alpha} \log E_P\!\left[e^{\alpha\phi}\right] \le R_\alpha(Q\|P). \tag{12} \]
Proof 1 (Proof of Eq. (12))
We separate the proof into two cases.
1) $\alpha > 1$: If $R_\alpha(Q\|P) = \infty$ the result is trivial (see Eq. (3)), so assume $R_\alpha(Q\|P) < \infty$. For $\phi \in \mathcal{M}_b(\Omega)$ we can use Hölder’s inequality with conjugate exponents $\alpha/(\alpha-1)$ and $\alpha$ to obtain
2) $\alpha \in (0,1)$: Let $\nu$, $q = dQ/d\nu$, and $p = dP/d\nu$ be as in the definition (3). We have
Using Hölder’s inequality for the measure $\nu$, the conjugate exponents $1/\alpha$ and $1/(1-\alpha)$, and the functions $\left(e^{(\alpha-1)\phi} q\right)^{\alpha}$ and $\left(e^{\alpha\phi} p\right)^{1-\alpha}$ we find
Taking the logarithm of both sides, dividing by $\alpha(\alpha-1)$ (which is negative), and using Eq. (14) we arrive at
This implies the claimed bound (12) and completes the proof.
Proof 2 (Proof of Theorem 1)
Eq. (12) immediately implies
We separate the proof of the reverse inequality into three cases.
1) $\alpha > 1$ and $Q \not\ll P$: We will show that the supremum is $+\infty$, which will prove the desired inequality. To do this, take a measurable set $A$ with $Q(A) > 0$ but $P(A) = 0$ and define $\phi_n = n 1_A$. The definition (18) implies
The lower bound goes to $+\infty$ as $n \to \infty$ (here it is key that $\alpha > 1$) and therefore we have the claimed result.
2) $\alpha > 1$ and $Q \ll P$: In this case we can take $\nu = P$ in Eq. (3) and write
These functions are bounded and so Eq. (18) implies
Using the dominated convergence theorem to take $n \to \infty$, we find
To obtain the last line we used $Q \ll P$.
Letting $m \to \infty$, the monotone convergence theorem implies
This proves the claimed result for case 2.
3) $\alpha \in (0,1)$: In this case the definition (3) becomes
We can take $n \to \infty$ using the dominated convergence theorem (here it is critical that $\alpha \in (0,1)$) to find
(Note that the second term is always finite.) The dominated convergence theorem can be used on the second term to obtain
Therefore the claim is proven in case 3, and the proof of Eq. (7) is complete.
Now suppose $\Omega$ is a metric space with the Borel sigma algebra. Define the probability measure $\nu = (P+Q)/2$ and let $\phi \in \mathcal{M}_b(\Omega)$. Lusin’s theorem (see, e.g., Appendix D in Dudley) implies that for all $\varepsilon > 0$ there exists a closed set $F_\varepsilon$ such that $\nu(F_\varepsilon^c) < \varepsilon$ and $\phi|_{F_\varepsilon}$ is continuous. By the Tietze extension theorem (see, e.g., Theorem 4.16 in Folland) there exists $\phi_\varepsilon \in C_b(\Omega)$ with $\|\phi_\varepsilon\|_\infty \le \|\phi\|_\infty$ and $\phi_\varepsilon = \phi$ on $F_\varepsilon$. Therefore
as $\varepsilon \to 0$. Similarly, we have the corresponding convergence for the expectation under $P$. Hence
$\varepsilon > 0$ was arbitrary and so we have proven
The reverse inequality is trivial. Therefore we have shown that one can replace $\mathcal{M}_b(\Omega)$ with $C_b(\Omega)$ in Eq. (7). To see that one can further replace $C_b(\Omega)$ with $\mathrm{Lip}_b(\Omega)$, use the fact that every $\psi \in C_b(\Omega)$ is the pointwise limit of Lipschitz functions $\psi_n$ with $\|\psi_n\|_\infty \le \|\psi\|_\infty$ (see Box 1.5 on page 6 of Santambrogio). The result then follows from a computation similar to the one above, this time using the dominated convergence theorem.
Next we prove Theorem 2, which extends the variational formula to unbounded functions.
Proof 3 (Proof of Theorem 2)
We need to show that
To prove the bound (33) we start by fixing $\phi \in \mathcal{M}(\Omega)$ and defining the truncated functions
These are bounded and so Theorem 1 implies
We now consider three cases, based on the value of $\alpha$.
1) $\alpha > 1$: If $E_P[e^{\alpha\phi}] = \infty$ then Eq. (33) is trivial (this is true even if $E_Q[e^{(\alpha-1)\phi}] = \infty$, due to our convention that $\infty - \infty = -\infty$), so suppose $E_P[e^{\alpha\phi}] < \infty$. In this case Eq. (35) involves integrals of exponentials of the truncated functions, which are dominated by integrable functions. Therefore the dominated convergence theorem implies
Letting the remaining truncation level tend to infinity, the monotone convergence theorem yields
Therefore we can take the iterated limit of Eq. (35) to obtain
(note that we are in the sub-case where the second term is finite, and so this is true even if the first term is infinite). This proves the claim in case 1.
3) $\alpha \in (0,1)$: If either $E_Q[e^{(\alpha-1)\phi}] = \infty$ or $E_P[e^{\alpha\phi}] = \infty$ then the bound (33) is again trivial, so suppose they are both finite. The exponentials of the truncated functions are then dominated by integrable functions, and therefore the dominated convergence theorem implies that
If $\Omega$ is a metric space then we already know from Theorem 1 that one can restrict the supremum to $C_b(\Omega)$ or to $\mathrm{Lip}_b(\Omega)$ without changing its value. Combined with Eq. (9), this implies that the value is unchanged if one restricts to any set of functions between $\mathrm{Lip}_b(\Omega)$ and $\mathcal{M}(\Omega)$; in particular, one can restrict to $C(\Omega)$ or to $\mathrm{Lip}(\Omega)$.
The research of J.B., M.K., and L. R.-B. was partially supported by NSF TRIPODS CISE-1934846. The research of M. K. and L. R.-B. was partially supported by the National Science Foundation (NSF) under the grant DMS-1515712 and by the Air Force Office of Scientific Research (AFOSR) under the grant FA-9550-18-1-0214. The research of P.D. was supported in part by the National Science Foundation (NSF) under the grant DMS-1904992 and by the Air Force Office of Scientific Research (AFOSR) under the grant FA-9550-18-1-0214. The research of J.W. was partially supported by the Defense Advanced Research Projects Agency (DARPA) EQUiPS program under the grant W911NF1520122.
-  A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, ser. Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control. Wiley, 2004.
-  F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE Trans Med Imaging, vol. 16, no. 2, pp. 187–198, 1997.
-  N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
-  A. J. Butte and I. S. Kohane, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” Pac Symp Biocomput., pp. 418–429, 2000.
-  N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv:physics/0004057, 2000.
-  J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014. [Online]. Available: https://www.pnas.org/content/111/9/3354
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, p. 2672–2680.
-  S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, p. 271–279.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 214–223.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5769–5779.
-  Y. Pantazis, D. Paul, M. Fasoulakis, Y. Stylianou, and M. A. Katsoulakis, “Cumulant GAN,” arXiv:2006.06625, 2020.
-  L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003. [Online]. Available: https://doi.org/10.1162/089976603321780272
-  S. Gao, G. V. Steeg, and A. Galstyan, “Efficient estimation of mutual information for strongly dependent variables,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon and S. V. N. Vishwanathan, Eds., vol. 38. San Diego, California, USA: PMLR, 09–12 May 2015, pp. 277–286. [Online]. Available: http://proceedings.mlr.press/v38/gao15.html
-  K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins, “Nonparametric von Mises estimators for entropies, divergences and mutual informations,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 397–405.
-  M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 531–540. [Online]. Available: http://proceedings.mlr.press/v80/belghazi18a.html
-  X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
-  A. Ruderman, M. D. Reid, D. García-García, and J. Petterson, “Tighter variational representations of f-divergences via restriction to probability measures,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, ser. ICML’12. Madison, WI, USA: Omnipress, 2012, p. 1155–1162.
-  J. Birrell, M. A. Katsoulakis, and Y. Pantazis, “Optimizing variational representations of divergences and accelerating their statistical estimation,” arXiv e-prints, p. arXiv:2006.08781, Jun. 2020.
-  M. D. Donsker and S. R. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. IV,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160360204
-  P. Dupuis and R. Ellis, A Weak Convergence Approach to the Theory of Large Deviations, ser. Wiley Series in Probability and Statistics. New York: John Wiley & Sons, 1997, a Wiley-Interscience Publication. [Online]. Available: http://opac.inria.fr/record=b1092351
-  A. Rényi, “On measures of entropy and information,” HUNGARIAN ACADEMY OF SCIENCES Budapest Hungary, Tech. Rep., 1961.
-  T. Van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
-  M. Gil, F. Alajaji, and T. Linder, “Rényi divergence measures for commonly used univariate continuous distributions,” Information Sciences, vol. 249, pp. 124–131, 2013.
-  F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, Oct 2006.
-  G. Rubino, B. Tuffin et al., Rare event simulation using Monte Carlo methods. Wiley Online Library, 2009, vol. 73.
-  J. Bucklew, Introduction to Rare Event Simulation, ser. Springer Series in Statistics. Springer New York, 2013.
-  A. Budhiraja and P. Dupuis, Analysis and Approximation of Rare Events: Representations and Weak Convergence Methods, ser. Probability Theory and Stochastic Modelling. Springer US, 2019.
-  S. Rahman, “The f-sensitivity index,” SIAM/ASA Journal on Uncertainty Quantification, vol. 4, no. 1, pp. 130–162, 2016.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
-  R. Atar, K. Chowdhary, and P. Dupuis, “Robust bounds on risk-sensitive functionals via Rényi divergence,” SIAM/ASA Journal on Uncertainty Quantification, vol. 3, no. 1, pp. 18–33, 2015.
-  P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and L. Rey-Bellet, “Sensitivity Analysis for Rare Events based on Rényi Divergence,” arXiv e-prints, p. arXiv:1805.06917, May 2018.
-  R. Atar, A. Budhiraja, P. Dupuis, and R. Wu, “Robust bounds and optimization at the large deviations scale for queueing models via Rényi divergence,” 2020.
-  L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy, “PAC-Bayesian bounds based on the Rényi divergence,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Gretton and C. C. Robert, Eds., vol. 51. Cadiz, Spain: PMLR, 09–11 May 2016, pp. 435–444. [Online]. Available: http://proceedings.mlr.press/v51/begin16.html
-  R. M. Dudley, Uniform Central Limit Theorems, ser. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2014.
-  G. Folland, Real Analysis: Modern Techniques and Their Applications, ser. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley, 2013.
-  F. Santambrogio, Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling, ser. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015.