A Variational Formula for Rényi Divergences

07/07/2020 · Jeremiah Birrell et al. · University of Massachusetts Amherst and Brown University

We derive a new variational formula for the Rényi family of divergences, R_α(Q‖P), generalizing the classical Donsker-Varadhan variational formula for the Kullback-Leibler divergence. The objective functional in this new variational representation is expressed in terms of expectations under Q and P, and hence can be estimated using samples from the two distributions. We illustrate the utility of such a variational formula by constructing neural-network estimators for the Rényi divergences.


1 Introduction

Information-theoretic divergences are widely used to quantify the notion of ‘distance’ between probability measures; commonly used examples include the Kullback-Leibler divergence (i.e., KL-divergence or relative entropy), f-divergences, and Rényi divergences. The computation and estimation of divergences is important in many applications, including independent component analysis [1], medical image registration [2], feature selection [3], genomic clustering [4], the information bottleneck method [5], independence testing [6], as well as in the analysis and design of generative adversarial networks (GANs) [7, 8, 9, 10, 11].

Estimation of divergences is known to be a difficult problem [12, 13]. Likelihood ratio methods such as those in [14] are known to work best in low dimensions. However, recent work has shown that variational representations of divergences can be used to construct statistical estimators for the KL-divergence [15], and more general f-divergences [16, 17, 18], that scale well with dimension.

In this work we develop a new variational characterization for the family of Rényi divergences, R_α(Q‖P), and study its use for statistical estimation. Specifically, we will prove

R_α(Q‖P) = sup_{φ∈B(Ω)} { (α−1)^{−1} log E_Q[e^{(α−1)φ}] − α^{−1} log E_P[e^{αφ}] },   (1)

where α ≠ 0, 1, Q and P are probability measures on a measurable space (Ω, M), and B(Ω) denotes the set of bounded measurable real-valued functions on Ω; see Theorem 1 below. We also prove a version of (1) where the supremum is taken over all measurable functions. This is useful in the construction of statistical estimators and allows us to obtain a formula for the optimizer; see Theorem 2 and Section 4. Eq. (1) can be viewed as an extension of the well-known Donsker-Varadhan variational formula [19, 20],

R(Q‖P) = sup_{φ∈B(Ω)} { E_Q[φ] − log E_P[e^{φ}] },   (2)

for the relative entropy, R(Q‖P) = E_Q[log(dQ/dP)], to the full family of Rényi divergences. In particular, (2) is (formally) the α → 1 limit of (1).

Note that the objective functionals in both optimization problems (1) and (2) depend on Q and P only through the expectation of certain functions of φ under these two measures. As a result, the objective functionals can be estimated in a straightforward manner using only samples from Q and P. This property was key in the use of Eq. (2) for the statistical estimation of the KL-divergence and applications to GANs in [15]. We will similarly take advantage of this property to construct statistical estimators for the Rényi divergences; see Section 4.
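To make this concrete, the following minimal NumPy sketch (ours, not part of the paper; the linear test function and the Gaussian samples are placeholder choices) shows how the objective in (1) can be estimated from samples for a fixed φ:

```python
import numpy as np

def renyi_dv_objective(phi, x_q, x_p, alpha):
    """Monte Carlo estimate of the objective in Eq. (1) for a fixed test
    function phi, using samples x_q ~ Q and x_p ~ P (requires alpha != 0, 1)."""
    term_q = np.log(np.mean(np.exp((alpha - 1.0) * phi(x_q)))) / (alpha - 1.0)
    term_p = np.log(np.mean(np.exp(alpha * phi(x_p)))) / alpha
    return term_q - term_p  # any fixed phi yields a lower-bound estimate of R_alpha(Q||P)

# Placeholder example: a fixed (suboptimal) linear test function on 1-d Gaussian samples.
rng = np.random.default_rng(0)
x_q = rng.normal(loc=1.0, scale=1.0, size=100_000)  # samples from Q
x_p = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples from P
print(renyi_dv_objective(lambda x: 0.5 * x, x_q, x_p, alpha=2.0))
```

By Eq. (8) below, the value returned for any such fixed φ is (up to statistical error) a lower bound on R_α(Q‖P); the estimators of Section 4 are obtained by maximizing this quantity over a neural-network family.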

2 Background on Rényi Divergences

The family of Rényi divergences, first introduced by Rényi in [21], provides a means of quantifying the discrepancy between two probability measures Q and P on a measurable space (Ω, M) that is especially sensitive to the relative tail behavior of the distributions. The Rényi divergence of order α, α ∈ (0,1) ∪ (1,∞), between Q and P, denoted R_α(Q‖P), can be defined as follows: Let ν be a sigma-finite positive measure with Q ≪ ν and P ≪ ν. Then

R_α(Q‖P) = (α(α−1))^{−1} log ∫ (dQ/dν)^α (dP/dν)^{1−α} dν.   (3)

Such a ν always exists (e.g., ν = (Q+P)/2) and it can be shown that the definition (3) does not depend on the choice of ν. The R_α satisfy the following divergence property: R_α(Q‖P) ≥ 0 with equality if and only if Q = P. In this sense, the Rényi divergences provide a notion of ‘distance’ between probability measures. Note, however, that Rényi divergences are not symmetric, but rather they satisfy

R_α(Q‖P) = R_{1−α}(P‖Q).   (4)

Eq. (4) is used to extend the definition of R_α to α < 0. Rényi divergences are connected to the KL-divergence, R(Q‖P) = E_Q[log(dQ/dP)], through the following limiting formulas:

lim_{α↗1} R_α(Q‖P) = R(Q‖P),   (5)

and if R(Q‖P) = ∞ or if R_α(Q‖P) < ∞ for some α > 1 then

lim_{α↘1} R_α(Q‖P) = R(Q‖P).   (6)

See [22] for a detailed discussion of Rényi divergences and proofs of these (and many other) properties. Note, however, that our definition of the Rényi divergences is related to theirs by R_α(Q‖P) = α^{−1} D_α(Q‖P), where D_α denotes the Rényi divergence as defined in [22]. Explicit formulas for the Rényi divergence between members of many common parametric families can be found in [23]. Rényi divergences are also connected with the family of f-divergences; see [24].
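As a concrete illustration (a worked example we add here, combining the equal-variance Gaussian formula collected in [23] with the normalization R_α = α^{−1} D_α):

```latex
% Renyi divergence between two Gaussians with equal variance, in our normalization:
R_\alpha\!\left(N(m_1,\sigma^2)\,\middle\|\,N(m_2,\sigma^2)\right)
  = \frac{1}{\alpha}\, D_\alpha\!\left(N(m_1,\sigma^2)\,\middle\|\,N(m_2,\sigma^2)\right)
  = \frac{1}{\alpha}\cdot\frac{\alpha\,(m_1-m_2)^2}{2\sigma^2}
  = \frac{(m_1-m_2)^2}{2\sigma^2}.
```

In particular, with the normalization used here the equal-variance Gaussian case is independent of α.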

3 Variational Formulas for the Rényi Divergences

The aim of this paper is to derive variational characterizations of the Rényi divergences that generalize the Donsker-Varadhan variational formula (2). Our main result is the following:

Theorem 1 (Rényi-Donsker-Varadhan Variational Formula)

Let Q and P be probability measures on (Ω, M) and α ∈ ℝ, α ≠ 0, 1. Then

R_α(Q‖P) = sup_{φ∈B(Ω)} { (α−1)^{−1} log E_Q[e^{(α−1)φ}] − α^{−1} log E_P[e^{αφ}] }.   (7)

If Ω is a metric space with the Borel sigma algebra then one can replace B(Ω) in (7) with C_b(Ω), the space of bounded continuous functions on Ω, or with Lip_b(Ω), the space of bounded Lipschitz functions on Ω.

Note that functions in Lip_b(Ω) can have arbitrarily large (but finite) Lipschitz constants. The proof of Theorem 1 can be found in Section 5.

Remark 1

Formally taking the limit α → 1 in Eq. (7), one recovers the Donsker-Varadhan variational formula, Eq. (2). Similarly, taking α → 0 and reindexing the supremum via φ → −φ, one obtains the Donsker-Varadhan variational formula for R(P‖Q).
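For the reader's convenience, here is the formal computation behind the α → 1 limit (an expansion we add here; it is not part of the original remark):

```latex
% Formal alpha -> 1 limit of the two terms in the objective of Eq. (7), for fixed bounded phi:
\lim_{\alpha\to 1}\frac{1}{\alpha-1}\log \mathbb{E}_Q\!\left[e^{(\alpha-1)\varphi}\right]
   = \lim_{\alpha\to 1}\frac{1}{\alpha-1}\log\!\Big(1+(\alpha-1)\,\mathbb{E}_Q[\varphi]+O\big((\alpha-1)^2\big)\Big)
   = \mathbb{E}_Q[\varphi],
\qquad
\lim_{\alpha\to 1}\frac{1}{\alpha}\log \mathbb{E}_P\!\left[e^{\alpha\varphi}\right]
   = \log \mathbb{E}_P\!\left[e^{\varphi}\right].
```

Hence the objective in (7) formally tends to E_Q[φ] − log E_P[e^{φ}], which is exactly the Donsker-Varadhan objective in (2).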

An obvious consequence of Eq. (7) is that any φ ∈ B(Ω) provides a lower bound on the Rényi divergences as follows:

R_α(Q‖P) ≥ (α−1)^{−1} log E_Q[e^{(α−1)φ}] − α^{−1} log E_P[e^{αφ}].   (8)

With appropriate conventions regarding possible infinities, this can be extended to unbounded functions and hence the variational formula (7) holds when B(Ω) is replaced by M(Ω), the set of all real-valued measurable functions on Ω. We also obtain a formula for the optimizer.

Theorem 2

Let Q and P be probability measures on (Ω, M) and α ∈ ℝ, α ≠ 0, 1. Then

R_α(Q‖P) = sup_{φ∈M(Ω)} { (α−1)^{−1} log E_Q[e^{(α−1)φ}] − α^{−1} log E_P[e^{αφ}] },   (9)

where we interpret log(0) = −∞ and ∞ − ∞ = −∞. On a metric space Ω with the Borel sigma algebra, Eq. (9) holds with M(Ω) replaced by either C(Ω), the space of continuous functions on Ω, or by Lip(Ω), the space of Lipschitz functions on Ω. If P ≪ Q, Q ≪ P, and R_α(Q‖P) < ∞ then the supremum in (9) is achieved at φ* = log(dQ/dP).

The proof of Theorem 2 can also be found in Section 5.

Remark 2

If the optimizer, φ*, of (9) exists then one can replace M(Ω) in (9) with any collection of functions that is known a priori to contain φ*. This is especially useful when working with parametric families; see [18] for further discussion of this idea in the context of f-divergences.
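For instance (an illustration we add here, not taken from the text): if Q and P are Gaussians with a common covariance, the optimizer φ* = log(dQ/dP) from Theorem 2 is an affine function, so by Remark 2 one may restrict the supremum in (9) to the family of affine functions.

```latex
% Optimizer for Q = N(m, \sigma^2 I_d) and P = N(0, \sigma^2 I_d):
\varphi^*(x) \;=\; \log\frac{dQ}{dP}(x)
  \;=\; \frac{\langle m, x\rangle}{\sigma^2} \;-\; \frac{\|m\|^2}{2\sigma^2},
\qquad \text{an affine function of } x.
```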

4 Estimation of Rényi Divergences

The estimation of divergences in high dimensions is a difficult but important problem, e.g., for independence testing [6] and the development of GANs [7, 8, 9, 10, 11]. Likelihood-ratio methods for estimating divergences are known to be effective primarily in low dimensions (see [14] as well as Figure 1 in [15] and further references therein). In contrast, variational methods for the KL- and f-divergences have proven effective in a range of high-dimensional systems [15, 18]. In this section we demonstrate that the variational characterizations (7) and (9) similarly lead to effective estimators for Rényi divergences. We mirror the approach in [15], which used the Donsker-Varadhan variational formula to estimate KL-based mutual information between high-dimensional distributions. It should be noted that high-dimensional problems still pose a considerable challenge in general; this is due in part to the problem of sampling rare events. However, existing Monte Carlo methods for sampling rare events (see, e.g., [25, 26, 27]) are still applicable here.

The estimators for Rényi divergences are constructed by solving the optimization problem in either Theorem 1 or Theorem 2 via stochastic gradient descent (SGD). More specifically, first note that the objective functionals in Eq. (7) and Eq. (9) do not involve the likelihood ratio dQ/dP, unlike the formula in the definition of the Rényi divergences (3). Rather, (7) and (9) depend on Q and P only through the expectation of certain functions of φ. This allows the objective functionals to be estimated in a straightforward manner using only samples from Q and P. Once a parametrization of the function space has been chosen (e.g., neural networks), the optimum can then be searched for via any SGD algorithm. The exact optimizer is generally unbounded and so Theorem 2, which optimizes over unbounded functions, is especially useful in this context.
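The following PyTorch sketch (our own illustration, not the authors' code; the network architecture, optimizer settings, and function names are assumptions) implements this recipe: a ReLU network φ_θ is trained by stochastic gradient ascent on a minibatch Monte Carlo estimate of the objective in (9).

```python
import torch
import torch.nn as nn

def renyi_objective(phi_q, phi_p, alpha):
    """Monte Carlo estimate of the Renyi-Donsker-Varadhan objective, Eq. (9),
    given phi evaluated on minibatches from Q and P (alpha != 0, 1)."""
    term_q = torch.logsumexp((alpha - 1.0) * phi_q, dim=0) - torch.log(torch.tensor(float(len(phi_q))))
    term_p = torch.logsumexp(alpha * phi_p, dim=0) - torch.log(torch.tensor(float(len(phi_p))))
    return term_q / (alpha - 1.0) - term_p / alpha

def estimate_renyi(sample_q, sample_p, alpha, steps=10_000, batch=1000, lr=1e-3):
    """Train a small ReLU network phi_theta by SGD (Adam) and return the final
    objective value, which estimates R_alpha(Q || P)."""
    dim = sample_q.shape[1]
    phi = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = torch.optim.Adam(phi.parameters(), lr=lr)
    for _ in range(steps):
        iq = torch.randint(len(sample_q), (batch,))
        ip = torch.randint(len(sample_p), (batch,))
        loss = -renyi_objective(phi(sample_q[iq]).squeeze(-1),
                                phi(sample_p[ip]).squeeze(-1), alpha)  # ascend the objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return renyi_objective(phi(sample_q).squeeze(-1), phi(sample_p).squeeze(-1), alpha).item()
```

The use of logsumexp rather than a direct log-of-mean is a standard numerical-stability choice on our part and is not prescribed by the paper.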

4.1 Example: Estimating Rényi-Based Mutual Information

Here we demonstrate the utility of (7) and (9) by estimating the Rényi mutual information,

R_α(P_{X,Y} ‖ P_X ⊗ P_Y),   (Rényi-MI) (10)

between random variables X and Y, where P_{X,Y} is their joint distribution and P_X ⊗ P_Y is the product of the marginals; this should be compared with [15], which used the Donsker-Varadhan variational formula to estimate KL mutual information, and [18], which considered f-divergences. (Mutual information is typically defined in terms of the KL-divergence, but one can consider many alternative divergences; see, e.g., [28].) We parameterized the function space via a neural network family {φ_θ}, parameterized by weights θ, with ReLU activation functions; Theorem 2 implies that one does not need to impose boundedness on the family. The expectations were estimated using i.i.d. samples from P_{X,Y} and P_X ⊗ P_Y, and we used the Adam optimizer [29], an adaptive learning-rate SGD algorithm, to search for the optimum. In Figure 1 we show the results of estimating the Rényi-MI when X and Y are correlated d-dimensional Gaussians with component-wise correlation ρ.
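The samples needed for this experiment can be generated as in the following sketch (ours; the dimension d = 10 and correlation ρ = 0.5 are placeholder values, not the settings used in Figure 1). Joint samples play the role of Q = P_{X,Y} and shuffled pairs play the role of P = P_X ⊗ P_Y.

```python
import torch

def correlated_gaussian_batches(d, rho, n):
    """Return samples from the joint P_{XY} (componentwise-correlated Gaussians)
    and from the product of marginals P_X x P_Y, as inputs to the MI estimator."""
    x = torch.randn(n, d)
    y = rho * x + (1.0 - rho**2) ** 0.5 * torch.randn(n, d)  # corr(X_i, Y_i) = rho
    joint = torch.cat([x, y], dim=1)                          # samples from Q = P_{XY}
    y_shuffled = y[torch.randperm(n)]                         # break the dependence
    product = torch.cat([x, y_shuffled], dim=1)               # samples from P = P_X x P_Y
    return joint, product

# e.g., feed these into estimate_renyi(joint, product, alpha) from the sketch above
joint, product = correlated_gaussian_batches(d=10, rho=0.5, n=100_000)
```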

Figure 1: Estimation of the Rényi-based mutual information (10) between d-dimensional correlated Gaussians with component-wise correlation ρ. We used a fully-connected neural network with one hidden layer of 256 nodes, and training was performed with a minibatch size of 1000. We show the estimated Rényi-MI after 10000 steps of SGD, averaged over 20 runs. The inset shows the relative error for a single run as a function of the number of SGD iterations. Computations were done in TensorFlow.

Figure 2: Estimation of R_α as a function of α between the two distributions considered in Section 4.2. We used a fully-connected neural network with two hidden layers, consisting of 512 and 16 nodes respectively, and training was performed with a minibatch size of 1000. We show the Rényi divergence estimates after 100000 steps of SGD, averaged over 20 runs. The inset shows the relative error for a single run as a function of the number of SGD iterations. Computations were done in TensorFlow.

4.2 Example: Estimating General Rényi Divergences

More generally, the technique described above can be used to estimate R_α(Q‖P) for any α and any pair of distributions Q and P, provided only that one has i.i.d. samples from both distributions. The ability to estimate such divergences is a key tool in the development of Rényi-based GANs [11]. Here we illustrate this ability by estimating the Rényi divergence between two members of a parametric family of distributions; see Figure 2. We again parameterize the function space by a neural network and optimize via SGD, as described in Section 4.1. In Figure 2 we show the resulting estimates of the Rényi divergence as a function of α.
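As a usage sketch (ours; the shifted-Gaussian pair below is a stand-in for the distributions of Figure 2, chosen because Section 2 gives its exact Rényi divergence as a reference):

```python
import torch

# Stand-in distributions with a known reference value (not the pair used in Figure 2):
# Q = N(m, I_5) with m = (1,...,1) and P = N(0, I_5), so R_alpha(Q||P) = ||m||^2 / 2 = 2.5
# for every alpha, by the equal-variance Gaussian formula of Section 2.
q_samples = torch.randn(100_000, 5) + 1.0
p_samples = torch.randn(100_000, 5)

# estimate_renyi is the training routine sketched in Section 4 above.
for alpha in [0.5, 2.0, 5.0]:
    print(alpha, estimate_renyi(q_samples, p_samples, alpha, steps=5000))
```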

5 Proofs of the Rényi-Donsker-Varadhan Variational Formulas

The starting point for the proof of Theorem 1 is the following variational formula, proven in [30]: Let P be a probability measure on (Ω, M), g ∈ B(Ω), and α > 1. Then

α^{−1} log E_P[e^{αg}] = sup_Q { (α−1)^{−1} log E_Q[e^{(α−1)g}] − R_α(Q‖P) },   (11)

where the optimization is over all probability measures, Q, on (Ω, M) (this is Eq. (1.3) of [30], after an appropriate change of notation). Though Eq. (11) is not a Legendre transform, it is still in some sense a ‘dual’ version of (7); this is reminiscent of the duality between the Donsker-Varadhan variational formula (2) and the Gibbs variational principle (see Proposition 1.4.2 in [20]). Eq. (11) was previously used in [30, 31, 32] to derive uncertainty quantification bounds on risk-sensitive quantities (e.g., rare events or large deviations estimates) and in [33] to derive PAC-Bayesian bounds.

In fact, we will not require the full strength of (11). We will only need the following bound: for all φ ∈ B(Ω) and α ∈ (0,1) ∪ (1,∞),

R_α(Q‖P) ≥ (α−1)^{−1} log E_Q[e^{(α−1)φ}] − α^{−1} log E_P[e^{αφ}].   (12)

To keep our argument self-contained, we include a proof of Eq. (12) below (our proof is adapted from the proof of (11) found in Section 4 of [30]).

Proof 1 (Proof of Eq. (12))

We separate the proof into two cases.
1) α > 1: If R_α(Q‖P) = ∞ the result is trivial, so assume R_α(Q‖P) < ∞; by the definition (3), this implies Q ≪ P. Taking ν = P, we can use Hölder's inequality with conjugate exponents α/(α−1) and α to obtain

∫ e^{(α−1)φ} dQ = ∫ e^{(α−1)φ} (dQ/dP) dP ≤ ( ∫ e^{αφ} dP )^{(α−1)/α} ( ∫ (dQ/dP)^α dP )^{1/α}.   (13)

In this case, the definition (3) implies ∫ (dQ/dP)^α dP = e^{α(α−1) R_α(Q‖P)}, and so, after taking logarithms and dividing by α − 1 > 0, we have proven the claimed bound (12).

2) α ∈ (0,1): Let ν be as in the definition (3) and define q = dQ/dν, p = dP/dν. We have

e^{α(α−1) R_α(Q‖P)} = ∫ q^α p^{1−α} dν.   (14)

Using Hölder's inequality for the measure ν, the conjugate exponents 1/α and 1/(1−α), and the functions (e^{(α−1)φ} q)^α and (e^{αφ} p)^{1−α}, we find

∫ q^α p^{1−α} dν ≤ ( ∫ e^{(α−1)φ} dQ )^α ( ∫ e^{αφ} dP )^{1−α}.   (15)

Taking the logarithm of both sides, dividing by α(α−1) (which is negative), and using Eq. (14) we arrive at

R_α(Q‖P) ≥ (α−1)^{−1} log ∫ e^{(α−1)φ} dQ − α^{−1} log ∫ e^{αφ} dP.   (16)

This implies the claimed bound (12) and completes the proof.

We now use Eq. (12) to derive the variational formula (7). The argument is inspired by the proof of the Donsker-Varadhan variational formula from Appendix C.2 in [20].

Proof 2 (Proof of Theorem 1)

If one can show Eq. (7) for all α > 0, α ≠ 1, and all φ ∈ B(Ω) then, using Eq. (4) and reindexing the supremum via φ → −φ, one finds that Eq. (7) also holds for all α < 0. So we only need to consider the cases α ∈ (0,1) and α > 1.

Eq. (12) immediately implies

(17)
(18)

We separate the proof of the reverse inequality into three cases.
1) α > 1 and Q is not absolutely continuous with respect to P: We will show that the supremum in (7) equals +∞ = R_α(Q‖P), which will prove the desired inequality. To do this, take a measurable set A with Q(A) > 0 but P(A) = 0 and consider the functions n·1_A, n ∈ ℕ. The definition (18) implies

(19)
(20)

The lower bound goes to +∞ as n → ∞ (here it is key that α > 1) and therefore we have the claimed result.

2) α > 1 and Q ≪ P: In this case we can take ν = P in Eq. (3) and write

(21)

Define

(22)

and . These are bounded and so Eq. (18) implies

(23)

Define . Using the dominated convergence theorem to take , we find

(24)

To obtain the last line we used .

We have as , and so the monotone convergence theorem implies

(25)

This proves the claimed result for case 2.

3) α ∈ (0,1): In this case the definition (3) becomes

(26)

where ν is any sigma-finite positive measure for which Q ≪ ν and P ≪ ν. Define, via Eq. (22), the truncations of log(q/p), where log(q/p) is defined to be −∞ if q = 0 and +∞ if q > 0 and p = 0. These truncations are bounded, hence (18) implies

(27)

We can take the limit using the dominated convergence theorem (here it is critical that α < 1) to find

(28)

(Note that the second term is always finite.) The dominated convergence theorem can be used on the second term to obtain

(29)

Therefore the claim is proven in case 3, and the proof of Eq. (7) is complete.

Now suppose Ω is a metric space with the Borel sigma algebra. Define the probability measure ν = (Q+P)/2 and let φ ∈ B(Ω). Lusin's theorem (see, e.g., Appendix D in [34]) implies that for every ε > 0 there exists a closed set F ⊂ Ω such that ν(F^c) < ε and φ|_F is continuous. By the Tietze Extension Theorem (see, e.g., Theorem 4.16 in [35]) there exists φ_ε ∈ C_b(Ω) with ‖φ_ε‖_∞ ≤ ‖φ‖_∞ and φ_ε = φ on F. Therefore

(30)

as ε → 0. Similarly, the corresponding expectation under P converges. Hence

(31)

The function φ ∈ B(Ω) was arbitrary and so we have proven

(32)

The reverse inequality is trivial. Therefore we have shown that one can replace B(Ω) with C_b(Ω) in Eq. (7). To see that one can further replace C_b(Ω) with Lip_b(Ω), use the fact that every φ ∈ C_b(Ω) is the pointwise limit of a sequence of Lipschitz functions, φ_n, with ‖φ_n‖_∞ ≤ ‖φ‖_∞ (see Box 1.5 on page 6 of [36]). The result then follows from a computation similar to the one above, this time using the dominated convergence theorem.

Next we prove Theorem 2, which extends the variational formula to unbounded functions.

Proof 3 (Proof of Theorem 2)

We need to show that

R_α(Q‖P) ≥ (α−1)^{−1} log E_Q[e^{(α−1)φ}] − α^{−1} log E_P[e^{αφ}]   (33)

for all φ ∈ M(Ω). The equality (9) then follows by combining Eq. (33) with Theorem 1.

To prove the bound (33) we start by fixing truncation levels and defining the truncated functions

(34)

These are bounded and so Theorem 1 implies

(35)

We now consider three cases, based on the value of α.

1) α > 1: If E_P[e^{αφ}] = ∞ then Eq. (33) is trivial (this is true even if E_Q[e^{(α−1)φ}] = ∞, due to our convention that ∞ − ∞ = −∞), so suppose E_P[e^{αφ}] < ∞. In this case, Eq. (35) involves integrals of exponentials of the truncated functions against probability measures, and these integrands are dominated, uniformly in one of the truncation levels, by integrable functions. Therefore the dominated convergence theorem implies

(36)

We have as and hence the monotone convergence theorem yields

(37)

Therefore we can take the iterated limit of Eq. (35) to obtain

(38)

(note that we are in the sub-case where the second term is finite, and so this is true even if the first term is infinite). This proves the claim in case 1.

2) α < 0: Use Eq. (4) and then apply the result of case 1 to the function −φ to arrive at Eq. (33).

3) α ∈ (0,1): If either E_Q[e^{(α−1)φ}] = ∞ or E_P[e^{αφ}] = ∞ then the bound (33) is again trivial, so suppose they are both finite. For the truncated functions, both integrands can then be bounded by integrable functions. Therefore the dominated convergence theorem implies that

(39)

This proves Eq. (33) in case 3 and thus completes the proof of Eq. (9).

If Ω is a metric space then we already know from Theorem 1 that one can restrict the supremum to Lip_b(Ω) without changing its value. Combined with Eq. (9), this implies that the value is unchanged if one restricts to any set of functions between Lip_b(Ω) and M(Ω); in particular, one can restrict to C(Ω) or to Lip(Ω).

If P ≪ Q, Q ≪ P, and R_α(Q‖P) < ∞ then dQ/dP is positive and finite P-a.s., and so φ* ≡ log(dQ/dP) ∈ M(Ω). By taking ν = P in (3) (and, for α < 0, using the definition (4)) we find

R_α(Q‖P) = (α(α−1))^{−1} log E_P[(dQ/dP)^α].   (40)

Letting φ = φ*, it is straightforward to show by direct calculation that

(α−1)^{−1} log E_Q[e^{(α−1)φ*}] − α^{−1} log E_P[e^{αφ*}] = (α(α−1))^{−1} log E_P[(dQ/dP)^α] = R_α(Q‖P),   (41)

where we used E_Q[(dQ/dP)^{α−1}] = E_P[(dQ/dP)^α]. Therefore φ* is an optimizer. This completes the proof.

Acknowledgments

The research of J.B., M.K., and L. R.-B. was partially supported by NSF TRIPODS CISE-1934846. The research of M. K. and L. R.-B. was partially supported by the National Science Foundation (NSF) under the grant DMS-1515712 and by the Air Force Office of Scientific Research (AFOSR) under the grant FA-9550-18-1-0214. The research of P.D. was supported in part by the National Science Foundation (NSF) under the grant DMS-1904992 and by the Air Force Office of Scientific Research (AFOSR) under the grant FA-9550-18-1-0214. The research of J.W. was partially supported by the Defense Advanced Research Projects Agency (DARPA) EQUiPS program under the grant W911NF1520122.

References

  • [1] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, ser. Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control.   Wiley, 2004.
  • [2] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE Trans Med Imaging, vol. 16, no. 2, pp. 187–198, 1997.
  • [3] N. Kwak and Chong-Ho Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
  • [4] A. J. Butte and I. S. Kohane, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” Pac Symp Biocomput., pp. 418–429, 2000.
  • [5] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” 2000.
  • [6] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014. [Online]. Available: https://www.pnas.org/content/111/9/3354
  • [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14.   Cambridge, MA, USA: MIT Press, 2014, p. 2672–2680.
  • [8] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16.   Red Hook, NY, USA: Curran Associates Inc., 2016, p. 271–279.
  • [9] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 214–223.

  • [10] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5769–5779.
  • [11] Y. Pantazis, D. Paul, M. Fasoulakis, Y. Stylianou, and M. A. Katsoulakis, “Cumulant GAN,” arXiv:2006.06625, 2020.
  • [12] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003. [Online]. Available: https://doi.org/10.1162/089976603321780272
  • [13] S. Gao, G. V. Steeg, and A. Galstyan, “Efficient estimation of mutual information for strongly dependent variables,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon and S. V. N. Vishwanathan, Eds., vol. 38.   San Diego, California, USA: PMLR, 09–12 May 2015, pp. 277–286. [Online]. Available: http://proceedings.mlr.press/v38/gao15.html
  • [14] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins, “Nonparametric von Mises estimators for entropies, divergences and mutual informations,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds.   Curran Associates, Inc., 2015, pp. 397–405.
  • [15] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   Stockholmsmässan, Stockholm Sweden: PMLR, 10–15 Jul 2018, pp. 531–540. [Online]. Available: http://proceedings.mlr.press/v80/belghazi18a.html
  • [16] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
  • [17] A. Ruderman, M. D. Reid, D. García-García, and J. Petterson, “Tighter variational representations of f-divergences via restriction to probability measures,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, ser. ICML’12.   Madison, WI, USA: Omnipress, 2012, p. 1155–1162.
  • [18] J. Birrell, M. A. Katsoulakis, and Y. Pantazis, “Optimizing variational representations of divergences and accelerating their statistical estimation,” arXiv e-prints, p. arXiv:2006.08781, Jun. 2020.
  • [19] M. D. Donsker and S. R. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. IV,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160360204
  • [20] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory of Large Deviations, ser. Wiley Series in Probability and Statistics.   New York: John Wiley & Sons, 1997, a Wiley-Interscience Publication. [Online]. Available: http://opac.inria.fr/record=b1092351
  • [21] A. Rényi, “On measures of entropy and information,” HUNGARIAN ACADEMY OF SCIENCES Budapest Hungary, Tech. Rep., 1961.
  • [22] T. Van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” Information Theory, IEEE Transactions on, vol. 60, no. 7, pp. 3797–3820, 2014.
  • [23] M. Gil, F. Alajaji, and T. Linder, “Rényi divergence measures for commonly used univariate continuous distributions,” Information Sciences, vol. 249, pp. 124 – 131, 2013.
  • [24] F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, Oct 2006.
  • [25] G. Rubino, B. Tuffin et al., Rare event simulation using Monte Carlo methods.   Wiley Online Library, 2009, vol. 73.
  • [26] J. Bucklew, Introduction to Rare Event Simulation, ser. Springer Series in Statistics.   Springer New York, 2013.
  • [27] A. Budhiraja and P. Dupuis, Analysis and Approximation of Rare Events: Representations and Weak Convergence Methods, ser. Probability Theory and Stochastic Modelling.   Springer US, 2019.

  • [28] S. Rahman, “The f-sensitivity index,” SIAM/ASA Journal on Uncertainty Quantification, vol. 4, no. 1, pp. 130–162, 2016.
  • [29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  • [30] R. Atar, K. Chowdhary, and P. Dupuis, “Robust bounds on risk-sensitive functionals via Rényi divergence,” SIAM/ASA Journal on Uncertainty Quantification, vol. 3, no. 1, pp. 18–33, 2015.
  • [31] P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and L. Rey-Bellet, “Sensitivity Analysis for Rare Events based on Rényi Divergence,” arXiv e-prints, p. arXiv:1805.06917, May 2018.
  • [32] R. Atar, A. Budhiraja, P. Dupuis, and R. Wu, “Robust bounds and optimization at the large deviations scale for queueing models via Rényi divergence,” 2020.
  • [33] L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy, “PAC-Bayesian bounds based on the Rényi divergence,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, A. Gretton and C. C. Robert, Eds., vol. 51.   Cadiz, Spain: PMLR, 09–11 May 2016, pp. 435–444. [Online]. Available: http://proceedings.mlr.press/v51/begin16.html
  • [34] R. Dudley, Uniform Central Limit Theorems, ser. Cambridge Studies in Advanced Mathematics.   Cambridge University Press, 2014.
  • [35] G. Folland, Real Analysis: Modern Techniques and Their Applications, ser. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts.   Wiley, 2013.
  • [36] F. Santambrogio, Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling, ser. Progress in Nonlinear Differential Equations and Their Applications.   Springer International Publishing, 2015.