
# On the Rényi Cross-Entropy

The Rényi cross-entropy measure between two distributions, a generalization of the Shannon cross-entropy, was recently used as a loss function for the improved design of deep learning generative adversarial networks. In this work, we examine the properties of this measure and derive closed-form expressions for it when one of the distributions is fixed and when both distributions belong to the exponential family. We also analytically determine a formula for the cross-entropy rate for stationary Gaussian processes and for finite-alphabet Markov sources.


## I Introduction

The Rényi entropy [11] of order $\alpha$ of a discrete distribution (probability mass function) $p$ with finite support $S$, defined as

$$H_\alpha(p)=\frac{1}{1-\alpha}\ln\sum_{x\in S}p(x)^{\alpha}$$

for $\alpha>0$, $\alpha\neq 1$, is a generalization of the Shannon entropy, $H(p)$, in that $\lim_{\alpha\to 1}H_\alpha(p)=H(p)$. (For ease of reference, a table summarising the Shannon entropy and cross-entropy measures as well as the Kullback-Leibler (KL) divergence is provided in Appendix A.) Similarly, the Rényi divergence (of order $\alpha$) between two discrete distributions $p$ and $q$ with common finite support $S$, given by

$$D_\alpha(p\|q)=\frac{1}{\alpha-1}\ln\sum_{x\in S}p(x)^{\alpha}q(x)^{1-\alpha},$$

reduces to the KL divergence, $D_{\mathrm{KL}}(p\|q)$, as $\alpha\to 1$.

Since the introduction of these measures, several other Rényi-type information measures have been put forward, each obeying the condition that its limit as $\alpha$ goes to one reduces to a Shannon-type information measure (e.g., see [16] and the references therein for three different order-$\alpha$ extensions of Shannon's mutual information due to Sibson, Arimoto and Csiszár).

Many of these definitions admit natural counterparts in the (absolutely) continuous case (i.e., when the involved distributions have a probability density function (pdf)), giving rise to information measures such as the Rényi differential entropy for pdf $p$ with support $S$,

$$h_\alpha(p)=\frac{1}{1-\alpha}\ln\int_S p(x)^{\alpha}\,dx,$$

and the Rényi (differential) divergence between pdfs $p$ and $q$ with common support $S$,

$$D_\alpha(p\|q)=\frac{1}{\alpha-1}\ln\int_S p(x)^{\alpha}q(x)^{1-\alpha}\,dx.$$

The Rényi cross-entropy between distributions $p$ and $q$ is an analogous generalization of the Shannon cross-entropy $H(p;q)$. Two definitions for this measure have been suggested. In [12], mirroring the fact that Shannon's cross-entropy satisfies $H(p;q)=D_{\mathrm{KL}}(p\|q)+H(p)$, the authors define the Rényi cross-entropy as

$$\tilde H_\alpha(p;q):=D_\alpha(p\|q)+H_\alpha(p). \tag{1}$$

In contrast, prior to [12], the authors of [15] introduced the Rényi cross-entropy in their study of the so-called shifted Rényi measures (expressed as the logarithm of weighted generalized power means). Specifically, upon simplifying Definition 6 in [15], their expression for the Rényi cross-entropy between distributions $p$ and $q$ is given by

$$H_\alpha(p;q):=\frac{1}{1-\alpha}\ln\sum_{x\in S}p(x)q(x)^{\alpha-1}. \tag{2}$$

For the continuous case, the definition in (2) can be readily converted to yield the Rényi differential cross-entropy between pdfs $p$ and $q$:

$$h_\alpha(p;q):=\frac{1}{1-\alpha}\ln\int_S p(x)q(x)^{\alpha-1}\,dx. \tag{3}$$
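To make the difference between the two definitions concrete, the following is a minimal sketch that evaluates (1) and (2) for a pair of small probability mass functions and compares both with the Shannon cross-entropy near $\alpha=1$. The function names and the example distributions are illustrative choices, not taken from the cited works.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) of a pmf given as a list of probabilities."""
    return math.log(sum(px ** alpha for px in p)) / (1.0 - alpha)

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p||q) for pmfs on a common finite support."""
    s = sum(px ** alpha * qx ** (1.0 - alpha) for px, qx in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def renyi_cross_entropy_sum(p, q, alpha):
    """Definition (1): D_alpha(p||q) + H_alpha(p)."""
    return renyi_divergence(p, q, alpha) + renyi_entropy(p, alpha)

def renyi_cross_entropy(p, q, alpha):
    """Definition (2): (1/(1-alpha)) * ln sum_x p(x) q(x)^(alpha-1)."""
    s = sum(px * qx ** (alpha - 1.0) for px, qx in zip(p, q))
    return math.log(s) / (1.0 - alpha)

def shannon_cross_entropy(p, q):
    """Shannon cross-entropy H(p;q) = -sum_x p(x) ln q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q))

# Hypothetical example distributions on a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

for alpha in (0.5, 0.999, 1.001, 2.0):
    print(alpha,
          renyi_cross_entropy_sum(p, q, alpha),
          renyi_cross_entropy(p, q, alpha))
print("Shannon:", shannon_cross_entropy(p, q))
```

In this example the two definitions agree with the Shannon value only near $\alpha=1$ and differ noticeably at $\alpha=2$, which is why the choice of definition matters in what follows.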

As the Rényi differential divergence and entropy were already calculated for numerous distributions in [5] and [14], respectively, determining the Rényi differential cross-entropy using the definition in (1) is straightforward. As such, this paper’s focus is to establish closed-form expressions of the Rényi differential cross-entropy as defined in (3) for various distributions, as well as to derive the Rényi cross-entropy rate for two important classes of sources with memory, Gaussian and Markov sources.

Motivation for determining formulae for the Rényi cross-entropy extends beyond idle curiosity. The Shannon differential cross-entropy was used as a loss function for the design of deep learning generative adversarial networks (GANs) [6]. Recently, the Rényi differential cross-entropy measures in (3) and (1) were used in [1, 2] and [12], respectively, to generalize the original GAN loss function. It is shown in [1] and [2] that the resulting Rényi-centric generalized loss function preserves the equilibrium point satisfied by the original GAN based on the Jensen-Rényi divergence [8], a natural extension of the Jensen-Shannon divergence [9]. In [12], a different Rényi-type generalized loss function is obtained and is shown to benefit from stability properties. Improved stability and system performance are shown in [1, 2] and [12] by virtue of the parameter $\alpha$, which can be judiciously used to fine-tune the adopted generalized loss functions; these recover the original GAN loss function as $\alpha\to 1$.

The rest of this paper is organised as follows. In Section II, basic properties of the Rényi cross-entropy are examined. In Section III, the Rényi differential cross-entropy for members of the exponential family is calculated. In Section IV, the Rényi differential cross-entropy between two different distributions is obtained. In Section V, the Rényi differential cross-entropy rate is derived for stationary Gaussian sources. Finally in Section VI, the Rényi cross-entropy rate is established for finite-alphabet time-invariant Markov sources.

## II Basic Properties of the Rényi cross-entropy and differential cross-entropy

For the Rényi cross-entropy to deserve its name it would be preferable that it satisfies at least two key properties: it reduces to the Rényi entropy when $p=q$ and its limit as $\alpha$ goes to one is the Shannon cross-entropy. Similarly, it is desirable that the Rényi differential cross-entropy reduces to the Rényi differential entropy when $p=q$ and its limit as $\alpha$ tends to one yields the Shannon differential cross-entropy. In both cases, the former property is trivial, and the latter property was proven in [2] for the continuous case under some finiteness conditions (in the discrete case, the result holds directly via L'Hôpital's rule).
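For completeness, the L'Hôpital step in the discrete case can be sketched as follows, using only definition (2); the auxiliary function $g$ is introduced here purely for illustration. Let $g(\alpha):=\ln\sum_{x\in S}p(x)q(x)^{\alpha-1}$, so that $H_\alpha(p;q)=g(\alpha)/(1-\alpha)$ and $g(1)=\ln\sum_{x\in S}p(x)=0$. Both numerator and denominator vanish at $\alpha=1$, hence

$$\lim_{\alpha\to1}H_\alpha(p;q)=\lim_{\alpha\to1}\frac{g'(\alpha)}{-1}=-\lim_{\alpha\to1}\frac{\sum_{x\in S}p(x)q(x)^{\alpha-1}\ln q(x)}{\sum_{x\in S}p(x)q(x)^{\alpha-1}}=-\sum_{x\in S}p(x)\ln q(x)=H(p;q).$$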

It is also proven in [2] that the Rényi differential cross-entropy is non-increasing in $\alpha$ by showing that its derivative with respect to $\alpha$ is non-positive. The same monotonicity property holds in the discrete case.
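The properties discussed so far can be spot-checked numerically in the discrete case. The sketch below uses hypothetical example distributions and an arbitrary grid of orders. It checks that definition (2) collapses to the Rényi entropy when the two distributions coincide, and that the values do not increase with $\alpha$. It is a sanity check on one example, not a proof.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) of a pmf given as a list of probabilities."""
    return math.log(sum(px ** alpha for px in p)) / (1.0 - alpha)

def renyi_cross_entropy(p, q, alpha):
    """Definition (2) of the Rényi cross-entropy for pmfs on a common support."""
    s = sum(px * qx ** (alpha - 1.0) for px, qx in zip(p, q))
    return math.log(s) / (1.0 - alpha)

# Hypothetical example distributions and orders (alpha = 1 is excluded).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
alphas = [0.25, 0.5, 0.75, 1.5, 2.0, 4.0, 8.0]

# Property 1: with q = p, definition (2) reduces to the Rényi entropy.
gaps = [abs(renyi_cross_entropy(p, p, a) - renyi_entropy(p, a)) for a in alphas]

# Property 2: the cross-entropy should not increase as alpha grows.
values = [renyi_cross_entropy(p, q, a) for a in alphas]
is_non_increasing = all(v1 >= v2 - 1e-12 for v1, v2 in zip(values, values[1:]))

print("largest gap to Rényi entropy:", max(gaps))
print("values over increasing alpha:", values)
print("non-increasing:", is_non_increasing)
```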

Like its Shannon counterpart, the Rényi cross-entropy is non-negative ($H_\alpha(p;q)\geq 0$), while the Rényi differential cross-entropy can be negative. This is easily verified when, for example, $p$ and $q$ are both Gaussian (normal) distributions with zero mean and a sufficiently small common variance, and parallels the same lack of non-negativity of the Shannon differential cross-entropy.

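We close this section by deriving the cross-entropy limit, $\lim_{\alpha\to\infty}H_\alpha(p;q)$. To begin with, for any positive constant $\tilde c$, we have

$$\begin{aligned}
\lim_{\alpha\to\infty}\frac{1}{1-\alpha}\ln\sum_{x\in S}\tilde c\,q(x)^{\alpha-1}
&=\lim_{\alpha\to\infty}\frac{1}{1-\alpha}\ln\tilde c+\lim_{\alpha\to\infty}\frac{1}{1-\alpha}\ln\sum_{x\in S}q(x)^{\alpha-1}\\
&=\lim_{\beta\to\infty}\frac{1-\beta}{-\beta}\cdot\frac{1}{1-\beta}\ln\sum_{x\in S}q(x)^{\beta}\qquad(\beta=\alpha-1)\\
&=\lim_{\beta\to\infty}H_\beta(q)=-\ln q_M,
\end{aligned}\tag{4}$$

where $q_M:=\max_{x\in S}q(x)$ and where we have used the fact that for the Rényi entropy, $\lim_{\alpha\to\infty}H_\alpha(q)=-\ln q_M$. Now, denoting the minimum and maximum values of $p(x)$ over $S$ by $p_m$ and $p_M$, respectively, we have that for $\alpha>1$ (so that $\frac{1}{1-\alpha}<0$),

$$\frac{1}{1-\alpha}\ln\sum_{x\in S}p_M\,q(x)^{\alpha-1}\leq\frac{1}{1-\alpha}\ln\sum_{x\in S}p(x)q(x)^{\alpha-1}\quad\text{and}\quad\frac{1}{1-\alpha}\ln\sum_{x\in S}p(x)q(x)^{\alpha-1}\leq\frac{1}{1-\alpha}\ln\sum_{x\in S}p_m\,q(x)^{\alpha-1},$$

and hence by (4), applied with $\tilde c=p_M$ and $\tilde c=p_m$, we obtain

$$\lim_{\alpha\to\infty}H_\alpha(p;q)=-\ln q_M. \tag{5}$$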
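The limit in (5) can be observed numerically. The sketch below evaluates definition (2) for increasing orders on hypothetical example distributions. To stay numerically stable at large $\alpha$ it factors the largest value of $q$ out of the sum, which is an implementation choice of this sketch rather than part of the derivation.

```python
import math

def renyi_cross_entropy(p, q, alpha):
    """Definition (2), evaluated stably for large alpha by factoring out max(q)."""
    q_max = max(q)
    # sum_x p(x) q(x)^(alpha-1) = q_max^(alpha-1) * scaled
    scaled = sum(px * (qx / q_max) ** (alpha - 1.0) for px, qx in zip(p, q))
    return -math.log(q_max) + math.log(scaled) / (1.0 - alpha)

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

limit = -math.log(max(q))  # right-hand side of (5)
for alpha in (2.0, 10.0, 100.0, 1e4):
    print(alpha, renyi_cross_entropy(p, q, alpha), limit)

value_large = renyi_cross_entropy(p, q, 1e4)
```

The printed values approach $-\ln q_M$ regardless of $p$, as long as $p$ puts positive mass where $q$ attains its maximum.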

## III Rényi Differential Cross-Entropy for Exponential Family Distributions

A probability distribution on $\mathbb{R}$ or $\mathbb{R}^n$ with parameter $\theta$ is said to belong to the exponential family (e.g., see [3]) if on its support $S$ it admits a pdf of the form

$$f(x)=c(\theta)\,b(x)\exp\big(\eta(\theta)\cdot T(x)\big),\qquad x\in S, \tag{6}$$

for some real-valued (measurable) functions $c$, $b$, $\eta$ and $T$. (Note that $\theta$ and consequently $\eta$ can be vectors in cases where the distribution admits multiple parameters.)

Here $\eta:=\eta(\theta)$ is known as the natural parameter of the distribution, $T(x)$ is the sufficient statistic and $c(\theta)$ is the normalization constant in the sense that for all $\theta$ within the parameter space

$$\int_S b(x)\exp\big(\eta(\theta)\cdot T(x)\big)\,dx=c(\theta)^{-1}.$$

The pdf in (6) can also be written as

$$f(x)=b(x)\exp\big(\eta\cdot T(x)+A(\eta)\big), \tag{7}$$

where $A(\eta)=\ln c(\theta)$. Examples of distributions in the exponential family include the Gaussian, Beta, and exponential distributions.

###### Lemma 1.

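Let $f_1$ and $f_2$ be pdfs of the same type in the exponential family with natural parameters $\eta_1$ and $\eta_2$, respectively. Define $f_h$ as being of the same type as $f_1$ and $f_2$ but with natural parameter $\eta_h:=\eta_1+(\alpha-1)\eta_2$. Then

$$h_\alpha(f_1;f_2)=\frac{A(\eta_1)-A(\eta_h)+\ln E_h}{1-\alpha}-A(\eta_2), \tag{8}$$

where $E_h:=\mathbb{E}_{f_h}\big[b(X)^{\alpha-1}\big]=\int_S b(x)^{\alpha-1}f_h(x)\,dx$.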

###### Proof.

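Using (7), we have

$$\begin{aligned}
f_1(x)f_2(x)^{\alpha-1}
&=b(x)\exp\big(\eta_1\cdot T(x)+A(\eta_1)\big)\cdot\Big(b(x)\exp\big(\eta_2\cdot T(x)+A(\eta_2)\big)\Big)^{\alpha-1}\\
&=b(x)^{\alpha}\exp\big((\eta_1+(\alpha-1)\eta_2)\cdot T(x)\big)\cdot\exp\big(A(\eta_1)+(\alpha-1)A(\eta_2)\big)\\
&=b(x)^{\alpha}\exp\big(\eta_h\cdot T(x)+A(\eta_h)\big)\cdot\exp\big(A(\eta_1)+(\alpha-1)A(\eta_2)-A(\eta_h)\big)\\
&=b(x)^{\alpha-1}f_h(x)\exp\big(A(\eta_1)+(\alpha-1)A(\eta_2)-A(\eta_h)\big).
\end{aligned}$$

Thus,

$$\begin{aligned}
\int_S f_1(x)f_2(x)^{\alpha-1}\,dx
&=\int_S b(x)^{\alpha-1}f_h(x)\,dx\cdot\exp\big(A(\eta_1)+(\alpha-1)A(\eta_2)-A(\eta_h)\big)\\
&=\exp\big(A(\eta_1)+(\alpha-1)A(\eta_2)-A(\eta_h)\big)\,E_h,
\end{aligned}$$

and therefore,

$$h_\alpha(f_1;f_2)=\frac{A(\eta_1)-A(\eta_h)+\ln E_h}{1-\alpha}-A(\eta_2).$$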

###### Remark.

If $b(x)=b$ is a constant for all $x\in S$, then

$$\frac{\ln E_h}{1-\alpha}=-\ln b.$$

In many cases, we have that $b(x)=1$ on $S$, and thus the term $\frac{\ln E_h}{1-\alpha}$ disappears in (8).

Table I lists Rényi differential cross-entropy expressions we derived using Lemma 1 for some common distributions in the exponential family (which we describe in Appendix B for convenience). In the table, the subscript $i$ is used to denote that a parameter belongs to pdf $f_i$, $i=1,2$.
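As an illustration of how Lemma 1 is applied, the sketch below works through the exponential distribution, assuming the standard parameterisation $f(x)=\lambda e^{-\lambda x}$ on $[0,\infty)$, so that $b(x)=1$, $\eta=-\lambda$, $T(x)=x$ and $A(\eta)=\ln(-\eta)$. It compares the closed form from (8) with a direct numerical evaluation of (3). The rate values, the truncation point of the integral and the helper names are assumptions of this sketch.

```python
import math

def integrate(f, a, b, n=100_000):
    """Composite trapezoidal rule on [a, b] with n sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

def log_normalizer(eta):
    """A(eta) for the exponential pdf lambda*exp(-lambda*x), where eta = -lambda."""
    return math.log(-eta)

def cross_entropy_lemma1(lam1, lam2, alpha):
    """Closed form (8) for two exponential pdfs; b(x) = 1, so ln E_h = 0."""
    eta1, eta2 = -lam1, -lam2
    eta_h = eta1 + (alpha - 1.0) * eta2
    if eta_h >= 0.0:
        raise ValueError("eta_h must be negative for f_h to be a valid pdf")
    return ((log_normalizer(eta1) - log_normalizer(eta_h)) / (1.0 - alpha)
            - log_normalizer(eta2))

def cross_entropy_numeric(lam1, lam2, alpha, upper=40.0):
    """Definition (3) by numerical integration on [0, upper]."""
    def integrand(x):
        f1 = lam1 * math.exp(-lam1 * x)
        f2 = lam2 * math.exp(-lam2 * x)
        return f1 * f2 ** (alpha - 1.0)
    return math.log(integrate(integrand, 0.0, upper)) / (1.0 - alpha)

# Hypothetical rates for f_1 and f_2.
lam1, lam2 = 1.5, 0.7
for alpha in (0.5, 2.0):
    print(alpha,
          cross_entropy_lemma1(lam1, lam2, alpha),
          cross_entropy_numeric(lam1, lam2, alpha))
```

The explicit check on $\eta_h$ reflects the fact that (8) only makes sense when $\eta_1+(\alpha-1)\eta_2$ is itself a valid natural parameter.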

## IV Rényi differential Cross-Entropy between different distributions

Let $p$ and $q$ be pdfs with common support $S\subseteq\mathbb{R}$. Below are some general formulae for the differential Rényi cross-entropy between one specific (common) distribution and any general distribution. If $S$ is an interval below, then $|S|$ denotes its length.

### IV-A Distribution q is uniform

Let $q(x)=\frac{1}{|S|}$ for $x\in S$. Then

$$h_\alpha(p;q)=\frac{1}{1-\alpha}\ln\int_S p(x)q(x)^{\alpha-1}\,dx=\ln|S|.$$

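### IV-B Distribution p is uniform

Now suppose $p$ is uniformly distributed on $S$. Then

$$h_\alpha(p;q)=\frac{1}{1-\alpha}\ln\int_S p(x)q(x)^{\alpha-1}\,dx=\frac{1}{1-\alpha}\ln\frac{1}{|S|}+\frac{2-\alpha}{1-\alpha}\,h_{\alpha-1}(q),$$

where the last term uses $\ln\int_S q(x)^{\alpha-1}\,dx=(2-\alpha)\,h_{\alpha-1}(q)$, valid for $\alpha>1$, $\alpha\neq 2$.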

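### IV-C Distribution q is exponentially distributed

Suppose that $S=[0,\infty)$ and $q$ is exponential with parameter $\lambda$. Suppose also that the moment generating function (MGF) of $p$, $M_p(t)$, exists. We have

$$\begin{aligned}
h_\alpha(p;q)&=\frac{1}{1-\alpha}\ln\int_S p(x)q(x)^{\alpha-1}\,dx\\
&=\frac{1}{1-\alpha}\ln\mathbb{E}_p\big[q(X)^{\alpha-1}\big]\\
&=\frac{1}{1-\alpha}\ln\mathbb{E}_p\big[(\lambda\exp(-\lambda X))^{\alpha-1}\big]\\
&=-\ln\lambda+\frac{1}{1-\alpha}\ln M_p\big(\lambda(1-\alpha)\big).
\end{aligned}$$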
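As a worked instance of the MGF formula for an exponential $q$, the sketch below takes $p$ to be a Gamma pdf with shape 2 and rate $\mu$, whose MGF is $(\mu/(\mu-t))^2$ for $t<\mu$. It compares $-\ln\lambda+\frac{1}{1-\alpha}\ln M_p(\lambda(1-\alpha))$ with a direct numerical evaluation of (3). The choice of $p$, the parameter values and the helper names are assumptions made for illustration.

```python
import math

def integrate(f, a, b, n=100_000):
    """Composite trapezoidal rule on [a, b] with n sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

def gamma2_pdf(x, mu):
    """Gamma pdf with shape 2 and rate mu: mu^2 * x * exp(-mu * x) on [0, inf)."""
    return mu * mu * x * math.exp(-mu * x)

def gamma2_mgf(t, mu):
    """MGF of the Gamma(shape 2, rate mu) distribution, valid for t < mu."""
    if t >= mu:
        raise ValueError("the MGF only exists for t < mu")
    return (mu / (mu - t)) ** 2

def cross_entropy_mgf(lam, mu, alpha):
    """-ln(lambda) + (1/(1-alpha)) * ln M_p(lambda * (1 - alpha))."""
    return -math.log(lam) + math.log(gamma2_mgf(lam * (1.0 - alpha), mu)) / (1.0 - alpha)

def cross_entropy_numeric(lam, mu, alpha, upper=40.0):
    """Definition (3) by numerical integration on [0, upper]."""
    def integrand(x):
        return gamma2_pdf(x, mu) * (lam * math.exp(-lam * x)) ** (alpha - 1.0)
    return math.log(integrate(integrand, 0.0, upper)) / (1.0 - alpha)

# Hypothetical parameters: q is exponential with rate lam, p is Gamma(2, mu).
lam, mu = 0.7, 1.5
for alpha in (0.5, 2.0):
    print(alpha, cross_entropy_mgf(lam, mu, alpha), cross_entropy_numeric(lam, mu, alpha))
```

For $\alpha<1$ the MGF is evaluated at a positive argument, so the formula is only usable when $\lambda(1-\alpha)$ lies inside the region where $M_p$ exists.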

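### IV-D Distribution q is Gaussian

Now assume that $q$ is a Gaussian (normal) distribution with mean $\mu$ and variance $\sigma^2$, and that the MGF of $Y:=(X-\mu)^2$, $M_Y(t)$, exists, where $X$ is a random variable with distribution $p$. Then

$$\begin{aligned}
h_\alpha(p;q)&=\frac{1}{1-\alpha}\ln\mathbb{E}_p\big[q(X)^{\alpha-1}\big]\\
&=\frac{1}{1-\alpha}\ln\left(\big(\sigma\sqrt{2\pi}\big)^{1-\alpha}\,\mathbb{E}\left[\exp\left(\frac{(1-\alpha)Y}{2\sigma^2}\right)\right]\right)\\
&=\ln\big(\sigma\sqrt{2\pi}\big)+\frac{1}{1-\alpha}\ln M_Y\left(\frac{1-\alpha}{2\sigma^2}\right).
\end{aligned}$$

The case where $q$ is a half-normal distribution can be directly derived from the above. Given that $q$ is a half-normal distribution, on its support its pdf is the same as that of a zero-mean normal distribution times 2. Hence if the support of $p$ is $[0,\infty)$, then with $Y=X^2$,

$$h_\alpha(p;q)=\ln\left(\sigma\sqrt{\frac{\pi}{2}}\right)+\frac{1}{1-\alpha}\ln M_Y\left(\frac{1-\alpha}{2\sigma^2}\right).$$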
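The Gaussian-$q$ formula can be checked on a case where the MGF is available in closed form. The sketch below assumes that $Y$ denotes the squared deviation $(X-\mu)^2$ and takes $p$ to be Gaussian with the same mean $\mu$ as $q$ and variance $s^2$, so that $M_Y(t)=(1-2s^2t)^{-1/2}$ for $t<1/(2s^2)$. The parameter values and helper names are illustrative assumptions.

```python
import math

def integrate(f, a, b, n=100_000):
    """Composite trapezoidal rule on [a, b] with n sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mgf_squared_deviation(t, s2):
    """MGF of Y = (X - mu)^2 for X ~ N(mu, s2); valid for t < 1 / (2 * s2)."""
    if t >= 1.0 / (2.0 * s2):
        raise ValueError("the MGF only exists for t < 1 / (2 * s2)")
    return (1.0 - 2.0 * s2 * t) ** -0.5

def cross_entropy_mgf(s2, sigma2, alpha):
    """ln(sigma * sqrt(2 pi)) + (1/(1-alpha)) * ln M_Y((1 - alpha) / (2 sigma^2))."""
    t = (1.0 - alpha) / (2.0 * sigma2)
    return (0.5 * math.log(2.0 * math.pi * sigma2)
            + math.log(mgf_squared_deviation(t, s2)) / (1.0 - alpha))

def cross_entropy_numeric(mean, s2, sigma2, alpha, half_width=12.0):
    """Definition (3) by numerical integration around the common mean."""
    def integrand(x):
        return gaussian_pdf(x, mean, s2) * gaussian_pdf(x, mean, sigma2) ** (alpha - 1.0)
    value = integrate(integrand, mean - half_width, mean + half_width)
    return math.log(value) / (1.0 - alpha)

# Hypothetical parameters: p = N(0.5, 1.44), q = N(0.5, 0.64).
mean, s2, sigma2 = 0.5, 1.44, 0.64
for alpha in (0.8, 2.0):
    print(alpha, cross_entropy_mgf(s2, sigma2, alpha),
          cross_entropy_numeric(mean, s2, sigma2, alpha))
```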

## V Rényi Differential Cross-Entropy Rate for Stationary Gaussian Processes

###### Lemma 2.

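The Rényi differential cross-entropy between two zero-mean $n$-dimensional multivariate Gaussian distributions with invertible covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively, is given by

$$h_\alpha(p;q)=\frac{\ln\big(|\Sigma_1||S|\big)}{2\alpha-2}+\frac12\ln|\Sigma_2|+\frac n2\ln 2\pi, \tag{9}$$

where $S:=\Sigma_1^{-1}+(\alpha-1)\Sigma_2^{-1}$.

###### Proof.

Recall that the pdf of a multivariate Gaussian with mean $\mathbf{0}$ and invertible covariance matrix $\Sigma$ is given by:

$$f(x)=\frac{\exp\left(-\frac12x^{T}\Sigma^{-1}x\right)}{(2\pi)^{n/2}|\Sigma|^{1/2}}$$

for $x\in\mathbb{R}^n$. Note that this distribution is a member of the exponential family, where $\eta=-\frac12\Sigma^{-1}$, $T(x)=xx^{T}$ (with $\eta\cdot T(x)=\operatorname{tr}(\eta\,xx^{T})=-\frac12x^{T}\Sigma^{-1}x$), and $A(\eta)=\frac12\ln|-2\eta|-\frac n2\ln 2\pi$. Hence, by Lemma 1 with $b(x)=1$ and $|-2\eta_i|=|\Sigma_i^{-1}|$, the Rényi differential cross-entropy between two zero-mean multivariate Gaussian distributions with covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively, is

$$\begin{aligned}
h_\alpha(p;q)&=\frac{1}{1-\alpha}\left(\frac12\ln\big|\Sigma_1^{-1}\big|-\frac12\ln\big|\Sigma_1^{-1}+(\alpha-1)\Sigma_2^{-1}\big|\right)-\frac12\ln\big|\Sigma_2^{-1}\big|-\ln(2\pi)^{-\frac n2}\\
&=\frac{\ln\big(|\Sigma_1||S|\big)}{2\alpha-2}+\frac12\ln|\Sigma_2|+\frac n2\ln 2\pi.
\end{aligned}$$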

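Let $\{X_n\}$ and $\{Y_n\}$ be stationary zero-mean Gaussian processes. For a given $n$, $X^n$ and $Y^n$ are multivariate Gaussian random variables with mean 0 and covariance matrices $\Sigma_{X^n}$ and $\Sigma_{Y^n}$, respectively. Since $\{X_n\}$ and $\{Y_n\}$ are stationary, their covariance matrices are Toeplitz. Furthermore, $B_n:=\Sigma_{Y^n}+(\alpha-1)\Sigma_{X^n}$ is Toeplitz.

###### Lemma 3.

Let $\tilde f(\lambda)$, $\tilde g(\lambda)$ and $\tilde h(\lambda)$ be the power spectral densities of $\{X_n\}$, $\{Y_n\}$ and the zero-mean Gaussian process with covariance matrix $B_n$, respectively.

Then the Rényi differential cross-entropy rate between $\{X_n\}$ and $\{Y_n\}$, $\lim_{n\to\infty}\frac1n h_\alpha(X^n;Y^n)$, is given by

$$\frac{\ln 2\pi}{2}+\frac{1}{4\pi(1-\alpha)}\int_0^{2\pi}\big[(2-\alpha)\ln\tilde g(\lambda)-\ln\tilde h(\lambda)\big]\,d\lambda.$$

###### Proof.

From Lemma 2, we first note that $S=\Sigma_{X^n}^{-1}B_n\Sigma_{Y^n}^{-1}$. With this in mind, the normalized Rényi differential cross-entropy $\frac1n h_\alpha(X^n;Y^n)$ can be rewritten using (9) as

$$\begin{aligned}
&\frac1n\left(\frac{\ln\big(|\Sigma_{X^n}||\Sigma_{X^n}^{-1}B_n\Sigma_{Y^n}^{-1}|\big)}{2(\alpha-1)}+\frac12\ln|\Sigma_{Y^n}|+\frac n2\ln 2\pi\right)\\
&\quad=\frac{\ln 2\pi}{2}+\frac{1}{2n}\left(\frac{\ln\big(|\Sigma_{X^n}||\Sigma_{X^n}^{-1}||B_n||\Sigma_{Y^n}^{-1}|\big)}{\alpha-1}+\ln|\Sigma_{Y^n}|\right)\\
&\quad=\frac{\ln 2\pi}{2}+\frac{1}{2n}\left(\frac{\ln|B_n|-\ln|\Sigma_{Y^n}|}{\alpha-1}+\ln|\Sigma_{Y^n}|\right)\\
&\quad=\frac{\ln 2\pi}{2}+\frac{1}{2n(1-\alpha)}\Big((2-\alpha)\ln|\Sigma_{Y^n}|-\ln|B_n|\Big).
\end{aligned}$$

It was proven in [7] that for a sequence of Toeplitz matrices $\{T_n\}$ with spectral density $t(\lambda)$ such that $\ln t(\lambda)$ is Riemann integrable, one has

$$\lim_{n\to\infty}\frac1n\ln|T_n|=\frac{1}{2\pi}\int_0^{2\pi}\ln t(\lambda)\,d\lambda.$$

We therefore obtain that the Rényi differential cross-entropy rate is given by

$$\frac{\ln 2\pi}{2}+\frac{1}{4\pi(1-\alpha)}\int_0^{2\pi}\big[(2-\alpha)\ln\tilde g(\lambda)-\ln\tilde h(\lambda)\big]\,d\lambda.$$

Note that $\tilde h(\lambda)=\tilde g(\lambda)+(\alpha-1)\tilde f(\lambda)$. ∎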
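The convergence to the spectral formula can be illustrated numerically. The sketch below assumes two stationary AR(1) Gaussian processes (a hypothetical choice) with autocovariance $\sigma^2\rho^{|k|}/(1-\rho^2)$ and power spectral density $\sigma^2/(1-2\rho\cos\lambda+\rho^2)$. It evaluates the finite-$n$ expression from the last line of the proof via log-determinants and compares it with the integral in Lemma 3, with the covariance matrix of the third process taken as $\Sigma_{Y^n}+(\alpha-1)\Sigma_{X^n}$. All helper names are illustrative.

```python
import math

def ar1_covariance(n, rho, sigma2):
    """Toeplitz covariance of a stationary AR(1) process."""
    scale = sigma2 / (1.0 - rho * rho)
    return [[scale * rho ** abs(i - j) for j in range(n)] for i in range(n)]

def ar1_psd(lam, rho, sigma2):
    """Power spectral density of the same AR(1) process."""
    return sigma2 / (1.0 - 2.0 * rho * math.cos(lam) + rho * rho)

def log_det_spd(a):
    """ln det of a symmetric positive-definite matrix via Cholesky factorisation."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(acc)
                log_det += 2.0 * math.log(low[i][j])
            else:
                low[i][j] = acc / low[j][j]
    return log_det

def finite_n_value(n, alpha, cov_x, cov_y):
    """(1/n) h_alpha(X^n; Y^n) from the last line of the proof."""
    b = [[cov_y[i][j] + (alpha - 1.0) * cov_x[i][j] for j in range(n)] for i in range(n)]
    bracket = (2.0 - alpha) * log_det_spd(cov_y) - log_det_spd(b)
    return 0.5 * math.log(2.0 * math.pi) + bracket / (2.0 * n * (1.0 - alpha))

def spectral_rate(alpha, psd_x, psd_y, m=4000):
    """Lemma 3 integral, approximated by the midpoint rule with m points."""
    step = 2.0 * math.pi / m
    integral = 0.0
    for k in range(m):
        lam = (k + 0.5) * step
        g = psd_y(lam)
        h = g + (alpha - 1.0) * psd_x(lam)
        integral += ((2.0 - alpha) * math.log(g) - math.log(h)) * step
    return 0.5 * math.log(2.0 * math.pi) + integral / (4.0 * math.pi * (1.0 - alpha))

# Hypothetical AR(1) parameters for {X_n} and {Y_n}.
alpha, n = 1.5, 40
rho_x, var_x = 0.5, 1.0
rho_y, var_y = -0.3, 2.0
finite = finite_n_value(n, alpha, ar1_covariance(n, rho_x, var_x),
                        ar1_covariance(n, rho_y, var_y))
limit = spectral_rate(alpha, lambda l: ar1_psd(l, rho_x, var_x),
                      lambda l: ar1_psd(l, rho_y, var_y))
print(finite, limit)
```

The two printed numbers should be close, with a gap that shrinks as $n$ grows.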

## VI Rényi cross-entropy rate for Markov sources

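Consider two time-invariant Markov sources $\{X_n\}$ and $\{Y_n\}$ with common finite alphabet $S$ and with transition distributions $P(\cdot|\cdot)$ and $Q(\cdot|\cdot)$, respectively. Then for any $n\geq 1$, their $n$-dimensional joint distributions are given by

$$p^{(n)}(i^n)=P(i_n|i_{n-1})P(i_{n-1}|i_{n-2})\cdots P(i_2|i_1)\,p(i_1)$$

and

$$q^{(n)}(i^n)=Q(i_n|i_{n-1})Q(i_{n-1}|i_{n-2})\cdots Q(i_2|i_1)\,q(i_1),$$

respectively, with arbitrary initial distributions $p(i_1)$ and $q(i_1)$, $i_1\in S$. Define the Rényi cross-entropy rate between $\{X_n\}$ and $\{Y_n\}$ as

$$\lim_{n\to\infty}\frac1nH_\alpha(X^n;Y^n)=\lim_{n\to\infty}\frac1n\,\frac{1}{1-\alpha}\ln\left(\sum_{i^n\in S^n}p^{(n)}(i^n)\,q^{(n)}(i^n)^{\alpha-1}\right).$$

Note that by defining the matrix $R$ using the formula

$$R_{ij}=P(j|i)\,Q(j|i)^{\alpha-1}$$

and the row vector $\mathbf{s}$ as having components $s_i=p(i)\,q(i)^{\alpha-1}$, the Rényi cross-entropy rate can be written as

$$\lim_{n\to\infty}\frac1n\,\frac{1}{1-\alpha}\ln\mathbf{s}R^{n-1}\mathbf{1}, \tag{10}$$

where $\mathbf{1}$ is a column vector whose dimension is the cardinality of the alphabet and with all its entries equal to 1.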

A result derived by [10] for the Rényi divergence between Markov sources can thus be used to find the Rényi cross-entropy rate for Markov sources.

###### Lemma 4.

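Let $\{X_n\}$, $\{Y_n\}$, $\mathbf{s}$ and $R$ be defined as above. If $R$ is irreducible, then

$$\lim_{n\to\infty}\frac1nH_\alpha(X^n;Y^n)=\frac{\ln\lambda}{1-\alpha}, \tag{11}$$

where $\lambda$ is the largest positive eigenvalue of $R$.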

###### Proof.

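Since the non-negative matrix $R$ is irreducible, by the Frobenius theorem (e.g., cf. [13, 4]), it has a largest positive eigenvalue $\lambda$ with associated positive eigenvector $\mathbf{b}$. Let $b_m$ and $b_M$ be the minimum and maximum elements, respectively, of $\mathbf{b}$. Then due to the non-negativity of $\mathbf{s}$ and $R$,

$$\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}=\mathbf{s}R^{n-1}\mathbf{b}\leq\mathbf{s}R^{n-1}\mathbf{1}\,b_M,$$

where $\cdot$ denotes the Euclidean inner product. Similarly, $\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}\geq\mathbf{s}R^{n-1}\mathbf{1}\,b_m$. As a result,

$$\frac1n\ln\frac{\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}}{b_M}\leq\frac1n\ln\mathbf{s}R^{n-1}\mathbf{1}\leq\frac1n\ln\frac{\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}}{b_m}.$$

Note that for all $n$, $\frac{\mathbf{s}\cdot\mathbf{b}}{b_M}$ is a constant. Thus

$$\lim_{n\to\infty}\frac1n\ln\frac{\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}}{b_M}=\lim_{n\to\infty}\frac{n-1}{n}\ln\lambda+\lim_{n\to\infty}\frac1n\ln\frac{\mathbf{s}\cdot\mathbf{b}}{b_M}=\ln\lambda.$$

Similarly, we have

$$\lim_{n\to\infty}\frac1n\ln\frac{\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}}{b_m}=\ln\lambda.$$

Hence,

$$\lim_{n\to\infty}\frac1nH_\alpha(X^n;Y^n)=\lim_{n\to\infty}\frac{1}{n(1-\alpha)}\ln\frac{\lambda^{n-1}\,\mathbf{s}\cdot\mathbf{b}}{b_m}=\frac{\ln\lambda}{1-\alpha}.$$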
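The matrix form (10) and the eigenvalue limit (11) can be illustrated on a two-state example. The sketch below stores each transition matrix row-stochastically, so entry `[i][j]` is the probability of moving from state `i` to state `j`. This indexing is a choice of this sketch, as are the numerical values and helper names. It first checks the matrix expression against a brute-force sum over all sequences of a short length, and then compares the normalized cross-entropy at a moderate $n$ with $\ln\lambda/(1-\alpha)$, using the closed-form largest eigenvalue of a $2\times2$ matrix.

```python
import itertools
import math

def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def cross_entropy_matrix_form(s, r, n, alpha):
    """(1/(1-alpha)) * ln(s R^(n-1) 1), i.e. H_alpha(X^n; Y^n) in matrix form."""
    v = [1.0] * len(s)
    for _ in range(n - 1):
        v = mat_vec(r, v)  # after k steps, v = R^k 1
    return math.log(sum(si * vi for si, vi in zip(s, v))) / (1.0 - alpha)

def cross_entropy_brute_force(p0, q0, P, Q, n, alpha):
    """Definition (2) applied to the n-dimensional joint distributions."""
    total = 0.0
    for seq in itertools.product(range(len(p0)), repeat=n):
        pn, qn = p0[seq[0]], q0[seq[0]]
        for a, b in zip(seq, seq[1:]):
            pn *= P[a][b]
            qn *= Q[a][b]
        total += pn * qn ** (alpha - 1.0)
    return math.log(total) / (1.0 - alpha)

def perron_eigenvalue_2x2(r):
    """Largest eigenvalue of a non-negative 2x2 matrix."""
    tr = r[0][0] + r[1][1]
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    return 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))

# Hypothetical two-state sources.
alpha = 2.0
P = [[0.9, 0.1], [0.4, 0.6]]
Q = [[0.7, 0.3], [0.2, 0.8]]
p0, q0 = [0.5, 0.5], [0.3, 0.7]

R = [[P[i][j] * Q[i][j] ** (alpha - 1.0) for j in range(2)] for i in range(2)]
s = [p0[i] * q0[i] ** (alpha - 1.0) for i in range(2)]

rate_limit = math.log(perron_eigenvalue_2x2(R)) / (1.0 - alpha)
n = 200
rate_n = cross_entropy_matrix_form(s, R, n, alpha) / n
print(rate_n, rate_limit)
```

Because a matrix and its transpose share eigenvalues, the limiting rate does not depend on whether $R$ is indexed by (current, next) or (next, current) state; only the finite-$n$ matrix product does.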

Another technique can be borrowed from [10] to generalize Lemma 4 to the case where $R$ is reducible. First, $R$ is rewritten in the canonical form detailed in Proposition 1 of [10]. Consider the largest positive eigenvalue of each self-communicating sub-matrix of $R$ and, for each inessential class, the largest positive eigenvalue of each class that is reachable from it. With $\lambda$ defined from these eigenvalues following [10], (11) holds.

## Appendix B: Distributions listed in Table I

Notes

• $B(\cdot,\cdot)$ is the Beta function.

• $\Gamma(\cdot)$ is the Gamma function.

## References

• [1] H. Bhatia, W. Paul, F. Alajaji, B. Gharesifard, and P. Burlina (2020) Rényi generative adversarial networks. arXiv:2006.02479v1.
• [2] H. Bhatia, W. Paul, F. Alajaji, B. Gharesifard, and P. Burlina (2021) Least kth-order and Rényi generative adversarial networks. Neural Computation 33 (9), pp. 2473–2510.
• [3] G. Casella and R. L. Berger (2002) Statistical inference. Cengage Learning.
• [4] R. G. Gallager (1996) Discrete stochastic processes. Springer.
• [5] M. Gil, F. Alajaji, and T. Linder (2013) Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences 249, pp. 124–131.
• [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems, Vol. 27, pp. 2672–2680.
• [7] R. Gray (2001) Toeplitz and circulant matrices: a review. Foundations and Trends® in Communications and Information Theory 2.
• [8] P. A. Kluza (2019) On Jensen-Rényi and Jeffreys-Rényi type f-divergences induced by convex functions. Physica A: Statistical Mechanics and its Applications.
• [9] J. Lin (1991) Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37, pp. 145–151.
• [10] Z. Rached, F. Alajaji, and L. L. Campbell (2001) Rényi's divergence and entropy rates for finite alphabet Markov sources. IEEE Transactions on Information Theory 47 (4), pp. 1553–1561.
• [11] A. Rényi (1961) On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 547–561.
• [12] A. Sarraf and Y. Nie (2021) RGAN: Rényi generative adversarial network. SN Computer Science 2 (1), pp. 17.
• [13] E. Seneta (2006) Non-negative matrices and Markov chains. Springer Science & Business Media.
• [14] K. Song (2001) Rényi information, loglikelihood and an intrinsic distribution measure. J. Statistical Planning and Inference 93.
• [15] F. J. Valverde-Albacete and C. Peláez-Moreno (2019) The case for shifting the Rényi entropy. Entropy 21, pp. 1–21.
• [16] S. Verdú (2015) α-Mutual information. In Proceedings of the IEEE Information Theory and Applications Workshop, pp. 1–6.