 # Two remarks on generalized entropy power inequalities

This note contributes to the understanding of generalized entropy power inequalities. Our main goal is to construct a counter-example regarding monotonicity and entropy comparison of weighted sums of independent identically distributed log-concave random variables. We also present a complex analogue of a recent dependent entropy power inequality of Hao and Jog, and give a very simple proof.

## 1. Introduction

The differential entropy (or simply entropy, henceforth, since we have no need to deal with discrete entropy in this note) of a random vector $X$ with density $f$ on $\mathbb{R}^d$ is defined as

$$h(X) = -\int_{\mathbb{R}^d} f \log f,$$

provided that this integral exists. When the variance of a real-valued random variable $X$ is kept fixed, it is a long known fact that the differential entropy is maximized by taking $X$ to be Gaussian. A related functional is the entropy power of $X$, defined by $N(X) = e^{2h(X)/d}$. As is usual, we abuse notation and write $h(X)$ and $N(X)$, even though these are functionals depending only on the density of $X$ and not on its random realization.
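As a sanity check on these definitions, the following minimal sketch approximates $-\int f \log f$ for a one-dimensional Gaussian by a midpoint rule and compares it with the closed form $\frac{1}{2}\log(2\pi e \sigma^2)$. The helper names, the integration window and the normalization $N = e^{2h/d}$ of the entropy power are assumptions made for illustration only.

```python
import math

def gaussian_pdf(x, sigma):
    """Density of a centered Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo, hi, steps=20_000):
    """Midpoint-rule approximation of -integral of f log f over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0.0:  # skip points where the density underflows to zero
            total -= fx * math.log(fx) * dx
    return total

sigma = 1.5
h_numeric = differential_entropy(lambda x: gaussian_pdf(x, sigma), -12 * sigma, 12 * sigma)
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Assumed normalization of the entropy power in dimension d = 1: N = exp(2h).
entropy_power = math.exp(2 * h_numeric)
print(h_numeric, h_closed, entropy_power)
```

Under this normalization the entropy power of a Gaussian is proportional to its variance, which is why the entropy power inequality holds with equality for independent Gaussians with proportional covariances.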

The entropy power inequality is a fundamental inequality in both Information Theory and Probability, stated first by Shannon and proved by Stam. It states that for any two independent random vectors $X$ and $Y$ in $\mathbb{R}^d$ such that the entropies of $X$, $Y$ and $X+Y$ exist,

$$N(X+Y) \ge N(X) + N(Y).$$

In fact, it holds without even assuming the existence of entropies as long as we set an entropy power to 0 whenever the corresponding entropy does not exist, as has been noted in the literature. One reason for the importance of this inequality in Probability Theory comes from its close connection to the Central Limit Theorem (see, e.g., [21, 25]). It is also closely related to the Brunn-Minkowski inequality, and thereby to results in Convex Geometry and Geometric Functional Analysis (see, e.g., [7, 31]).

An immediate consequence of the above formulation of the entropy power inequality is its extension to $n$ summands: if $X_1, \dots, X_n$ are independent random vectors, then $N(X_1 + \dots + X_n) \ge \sum_{i=1}^n N(X_i)$. Suppose the random vectors $X_i$ are not merely independent but also identically distributed, and that $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$; these are the normalized partial sums that appear in the vanilla version of the Central Limit Theorem. Then one concludes from the entropy power inequality together with the scaling property $N(aX) = a^2 N(X)$ that $N(S_n) \ge N(S_1)$, or equivalently that

$$h(S_n) \ge h(S_1). \tag{1}$$

There are several refinements or generalizations of the inequality (1) that one may consider. In 2004, Artstein, Ball, Barthe and Naor proved (see [26, 38, 35, 13] for simpler proofs and [27, 28] for extensions) that in fact, one has monotonicity of entropy along the Central Limit Theorem, i.e., $h(S_n)$ is a monotonically increasing sequence. If $N(0,1)$ is the standard normal distribution, Barron had proved much earlier that $h(S_n) \to h(N(0,1))$ as long as $X_1$ has mean 0, variance 1, and $h(X_1) > -\infty$. Thus one has the monotone convergence of $h(S_n)$ to the Gaussian entropy, which is the maximum entropy possible under the moment constraints. By standard arguments, the convergence of entropies is equivalent to the relative entropy between the distribution of $S_n$ and the standard Gaussian distribution converging to 0, and this in turn implies not just convergence in distribution but also convergence in total variation. This is the way in which entropy illuminates the Central Limit Theorem.

A different variant of the inequality (1) was recently given by Hao and Jog, whose paper may be consulted for motivation and proper discussion. A random vector $X = (X_1, \dots, X_n)$ in $\mathbb{R}^n$ is called unconditional if for every choice of signs $\eta_1, \dots, \eta_n \in \{-1, +1\}$, the vector $(\eta_1 X_1, \dots, \eta_n X_n)$ has the same distribution as $X$. Hao and Jog proved that if $X$ is an unconditional random vector in $\mathbb{R}^n$, then $\frac{1}{n} h(X) \le h\left(\frac{X_1 + \dots + X_n}{\sqrt{n}}\right)$. If $X$ has independent and identically distributed components instead of being unconditional, this is precisely $h(S_n) \ge h(S_1)$ for real-valued random variables $X_i$ (i.e., in dimension $d = 1$).

The goal of this note is to shed further light on both of these generalized entropy power inequalities. We now explain precisely how we do so.

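To motivate our first result, we first recall the notion of Schur-concavity. One vector $a = (a_1, \dots, a_n)$ in $\mathbb{R}^n$ is majorised by another one $b = (b_1, \dots, b_n)$, usually denoted $a \prec b$, if the nonincreasing rearrangements $a_1^* \ge \dots \ge a_n^*$ and $b_1^* \ge \dots \ge b_n^*$ of $a$ and $b$ satisfy the inequalities $\sum_{j=1}^k a_j^* \le \sum_{j=1}^k b_j^*$ for each $1 \le k \le n-1$ and $\sum_{j=1}^n a_j = \sum_{j=1}^n b_j$. For instance, any vector $a$ with nonnegative coordinates adding up to 1 is majorised by the vector $(1, 0, \dots, 0)$ and majorises the vector $(\frac{1}{n}, \dots, \frac{1}{n})$. Let $\Phi: \Delta_n \to \mathbb{R}$, where $\Delta_n = \{a \in [0,1]^n : a_1 + \dots + a_n = 1\}$ is the standard simplex. We say that $\Phi$ is Schur-concave if $\Phi(a) \ge \Phi(b)$ when $a \prec b$. Clearly, if $\Phi$ is Schur-concave, then one has $\Phi(\frac{1}{n}, \dots, \frac{1}{n}) \ge \Phi(a) \ge \Phi(1, 0, \dots, 0)$ for any $a \in \Delta_n$.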
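The majorisation order can be checked mechanically from its definition. The sketch below is one way to do so; the function name `is_majorised_by` and the floating-point tolerance are illustrative choices, not part of the note.

```python
def is_majorised_by(a, b, tol=1e-12):
    """Return True if the vector a is majorised by the vector b (a < b in the majorisation order)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Total sums must agree.
    if abs(sum(a) - sum(b)) > tol:
        return False
    # Compare partial sums of the nonincreasing rearrangements.
    a_sorted = sorted(a, reverse=True)
    b_sorted = sorted(b, reverse=True)
    partial_a = partial_b = 0.0
    for x, y in zip(a_sorted[:-1], b_sorted[:-1]):
        partial_a += x
        partial_b += y
        if partial_a > partial_b + tol:
            return False
    return True

a = [0.5, 0.3, 0.2]
print(is_majorised_by(a, [1.0, 0.0, 0.0]))        # a is majorised by a vertex of the simplex
print(is_majorised_by([1 / 3, 1 / 3, 1 / 3], a))  # the barycentre is majorised by a
```

Both calls illustrate the two extreme cases mentioned in the text: every point of the simplex sits between the barycentre and a vertex in the majorisation order.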

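Suppose $X_1, \dots, X_n$ are i.i.d. copies of a random variable $X$ with finite entropy, and we define

$$\Phi(a) = h\left(\sum_{i=1}^n \sqrt{a_i}\, X_i\right) \tag{2}$$

for $a \in \Delta_n$. Then the inequality (1) simply says that $\Phi(\frac{1}{n}, \dots, \frac{1}{n}) \ge \Phi(1, 0, \dots, 0)$, while the monotonicity of entropy in the Central Limit Theorem says that $\Phi(\frac{1}{n}, \dots, \frac{1}{n}) \ge \Phi(\frac{1}{n-1}, \dots, \frac{1}{n-1}, 0)$. Both these properties would be implied by (but in themselves are strictly weaker than) Schur-concavity. Thus one is led to the natural question: Is the function $\Phi$ defined in (2) a Schur-concave function? For $n = 2$, this would imply in particular that $h(\sqrt{\lambda} X_1 + \sqrt{1-\lambda} X_2)$ is maximized over $\lambda \in [0,1]$ when $\lambda = \frac{1}{2}$. The question on the Schur-concavity of $\Phi$ had been floating around for at least a decade, until a counterexample was constructed showing that $\Phi$ cannot be Schur-concave even for $n = 2$. It was conjectured, however, that for $n = 2$, the Schur-concavity should hold if the random variable $X$ has a log-concave distribution, i.e., if $X_1$ and $X_2$ are independent, identically distributed, log-concave random variables, the function $\lambda \mapsto h(\sqrt{\lambda} X_1 + \sqrt{1-\lambda} X_2)$ should be nondecreasing on $[0, \frac{1}{2}]$. More generally, one may ask: if $X_1, \dots, X_n$ are i.i.d. copies of a log-concave random variable $X$, is it true that $h(\sum a_i X_i) \ge h(\sum b_i X_i)$ when $(a_1^2, \dots, a_n^2) \prec (b_1^2, \dots, b_n^2)$? Equivalently, is $\Phi$ Schur-concave when $X$ is log-concave?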

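Our first result implies that the answer to this question is negative. The way we show this is the following: since $(1, \frac{1}{n}, \dots, \frac{1}{n}) \prec (1, \frac{1}{n-1}, \dots, \frac{1}{n-1}, 0)$, if Schur-concavity held, then the sequence $h\left(X_0 + \frac{X_1 + \dots + X_n}{\sqrt{n}}\right)$ would be nondecreasing and as it converges to $h(X_0 + G)$, where $G$ is an independent Gaussian random variable with the same variance as $X_0$, we would have in particular that $h\left(X_0 + \frac{X_1 + \dots + X_n}{\sqrt{n}}\right) \le h(X_0 + G)$. We construct examples where the opposite holds.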

###### Theorem 1.

There exists a symmetric log-concave random variable $X_0$ with variance 1 such that if $X_1, X_2, \dots$ are its independent copies and $n$ is large enough, we have

$$h\left(X_0 + \frac{X_1 + \dots + X_n}{\sqrt{n}}\right) > h(X_0 + Z),$$

where $Z$ is a standard Gaussian random variable, independent of the $X_i$. Consequently, even if $X$ is drawn from a symmetric, log-concave distribution, the function $\Phi$ defined in (2) is not Schur-concave.

Here by a symmetric distribution, we mean one whose density $f$ satisfies $f(-x) = f(x)$ for each $x \in \mathbb{R}$.

In contrast to Theorem 1, $\Phi$ does turn out to be Schur-concave if the distribution of $X$ is a symmetric Gaussian mixture, as recently shown. We suspect that Schur-concavity also holds for uniform distributions on intervals.

Theorem 1 can be compared with the afore-mentioned monotonicity of entropy property of the Central Limit Theorem. It also provides an example of two independent symmetric log-concave random variables $X$ and $Y$ with the same variance such that $h(X+Y) > h(X+Z)$, where $Z$ is a Gaussian random variable with the same variance as $X$ and $Y$, independent of them, which is again in contrast to symmetric Gaussian mixtures. The interesting question of whether, for two i.i.d. summands, swapping one for a Gaussian with the same variance increases entropy, remains open.

Our proof of Theorem 1 is based on sophisticated and remarkable Edgeworth-type expansions recently developed by Bobkov, Chistyakov and Götze en route to obtaining precise rates of convergence in the entropic central limit theorem, and is detailed in Section 2.

The second contribution of this note is an exploration of a technique to prove inequalities akin to the entropy power inequality by using symmetries and invariance properties of entropy. It is folklore that when $X_1$ and $X_2$ are i.i.d. from a symmetric distribution, one can deduce the inequality $h(S_2) \ge h(S_1)$ in an extremely simple fashion (in contrast to any full proof of the entropy power inequality, which tends to require relatively sophisticated machinery, either going through Fisher information or optimal transport or rearrangement theory or functional inequalities). In Section 3, we will recall this simple proof, and also deduce some variants of the inequality by playing with this basic idea of using invariance, including a complex analogue of a recent entropy power inequality for dependent random variables obtained by Hao and Jog.

###### Theorem 2.

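Let $X = (X_1, \dots, X_n)$ be a random vector in $\mathbb{C}^n$ which is complex-unconditional, that is, for all complex numbers $z_1, \dots, z_n$ such that $|z_j| = 1$ for every $j$, the vector $(z_1 X_1, \dots, z_n X_n)$ has the same distribution as $X$. Then

$$\frac{1}{n} h(X) \le h\left(\frac{X_1 + \dots + X_n}{\sqrt{n}}\right).$$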

Our proof of Theorem 2, which is essentially trivial thanks to the existence of complex Hadamard matrices, is in contrast to the proof given by Hao and Jog for the real case, which proves a Fisher information inequality as an intermediary step.

We make some remarks on complementary results in the literature. Firstly, in contrast to the failure of Schur-concavity of $\Phi$ implied by Theorem 1, the function $\Xi: \Delta_n \to \mathbb{R}$ defined by $\Xi(a) = h\left(\sum a_i X_i\right)$ for i.i.d. copies $X_i$ of a random variable $X$, is actually Schur-convex when $X$ is log-concave. This is an instance of a reverse entropy power inequality, many more of which are discussed in the literature. Note that the weighted sums that appear in the definition of $\Phi$ are relevant to the Central Limit Theorem because they have fixed variance, unlike the weighted sums that appear in the definition of $\Xi$.

Secondly, motivated by the analogies with Convex Geometry mentioned earlier, one may ask if the function $V_B: \Delta_n \to \mathbb{R}$ defined by $V_B(a) = \mathrm{vol}_d\left(\sum_{i=1}^n a_i B\right)$, is Schur-concave for any Borel set $B \subset \mathbb{R}^d$, where $\mathrm{vol}_d$ denotes the Lebesgue measure on $\mathbb{R}^d$ and the notation for summation is overloaded as usual to also denote Minkowski summation of sets. (Note that unless $B$ is convex, $(a_1 + a_2) B$ is a subset of, but generally not equal to, $a_1 B + a_2 B$.) The Brunn-Minkowski inequality implies that $V_B(a) \ge V_B(1, 0, \dots, 0)$. The inequality $V_B(\frac{1}{n}, \dots, \frac{1}{n}) \ge V_B(\frac{1}{n-1}, \dots, \frac{1}{n-1}, 0)$, which is the geometric analogue of the monotonicity of entropy in the Central Limit Theorem, was conjectured to hold. However, it was later shown that this inequality fails to hold, and therefore $V_B$ cannot be Schur-concave, for arbitrary Borel sets $B$. Note that if $B$ is convex, $V_B$ is trivially Schur-concave, since it is a constant function equal to $\mathrm{vol}_d(B)$.

Finally, it has recently been observed in [40, 33, 32] that majorization ideas are very useful in understanding entropy power inequalities in discrete settings, such as on the integers or on cyclic groups of prime order.

## 2. Failure of Schur-concavity

Recall that a probability density $f$ on $\mathbb{R}$ is said to be log-concave if it is of the form $f = e^{-V}$ for a convex function $V: \mathbb{R} \to \mathbb{R} \cup \{\infty\}$. Log-concave distributions emerge naturally from the interplay between information theory and convex geometry, and have recently been a very fruitful and active topic of research.

This section is devoted to a proof of Theorem 1, which in particular falsifies the Schur-concavity of $\Phi$ defined by (2) even when the distribution under consideration is log-concave.

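Let us denote

$$Z_n = \frac{X_1 + \dots + X_n}{\sqrt{n}},$$

and let $p_n$ be the density of $Z_n$ and let $\varphi$ be the density of $Z$. Since $X_0$ is assumed to be log-concave, it satisfies $E|X_0|^s < \infty$ for all $s > 0$. According to the Edgeworth-type expansion of Bobkov, Chistyakov and Götze (Theorem 3.2 in Chapter 3), we have (with any $m \le s < m+1$)

$$(1 + |x|^m)\left(p_n(x) - \varphi_m(x)\right) = o\left(n^{-\frac{s-2}{2}}\right) \quad \text{uniformly in } x,$$

where

$$\varphi_m(x) = \varphi(x) + \sum_{k=1}^{m-2} q_k(x)\, n^{-k/2}.$$

Here the functions $q_k$ are given by

$$q_k(x) = \varphi(x) \sum H_{k+2j}(x)\, \frac{1}{r_1! \cdots r_k!} \left(\frac{\gamma_3}{3!}\right)^{r_1} \cdots \left(\frac{\gamma_{k+2}}{(k+2)!}\right)^{r_k},$$

where $H_n$ are Hermite polynomials,

$$H_n(x) = (-1)^n e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2},$$

and the summation runs over all nonnegative integer solutions $(r_1, \dots, r_k)$ to the equation $r_1 + 2 r_2 + \dots + k r_k = k$, and one uses the notation $j = r_1 + \dots + r_k$. The numbers $\gamma_k$ are the cumulants of $X_0$, namely

$$\gamma_k = i^{-k} \frac{d^k}{dt^k} \log E e^{itX_0} \Big|_{t=0}.$$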

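Let us calculate $\varphi_4$. Under our assumption (symmetry of $X_0$ and $EX_0^2 = 1$), we have $\gamma_3 = 0$ and $\gamma_4 = EX_0^4 - 3$. Therefore $q_1 = 0$ and

$$q_2 = \frac{1}{4!} \gamma_4 \varphi H_4 = \frac{1}{4!} \gamma_4 \varphi^{(4)}, \qquad \varphi_4 = \varphi + \frac{1}{n} \cdot \frac{1}{4!} \left(EX_0^4 - 3\right) \varphi^{(4)}.$$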
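The Hermite polynomials entering the correction terms can be generated from the three-term recurrence $H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)$, which is equivalent to the derivative formula above. The sketch below (function names are illustrative) recovers $H_4(x) = x^4 - 6x^2 + 3$ and evaluates the correction $q_2 = \frac{1}{4!}\gamma_4 \varphi H_4$ for a given fourth cumulant.

```python
import math

def hermite(n):
    """Coefficients, lowest degree first, of the probabilists' Hermite polynomial H_n."""
    prev, cur = [1], [0, 1]  # H_0 = 1, H_1 = x
    if n == 0:
        return prev
    for k in range(1, n):
        shifted = [0] + cur  # coefficients of x * H_k
        nxt = [c - k * (prev[i] if i < len(prev) else 0) for i, c in enumerate(shifted)]
        prev, cur = cur, nxt
    return cur

def evaluate(coeffs, x):
    """Evaluate a polynomial given by its coefficients (lowest degree first)."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def q2(x, gamma4):
    """Second Edgeworth correction for a symmetric law with unit variance."""
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return gamma4 / 24 * phi * evaluate(hermite(4), x)

print(hermite(4))  # coefficients of 3 - 6 x^2 + x^4
```

Since $q_2$ is proportional to $\gamma_4 = EX_0^4 - 3$, the sign of the first-order correction to the Gaussian density is governed entirely by the excess kurtosis of $X_0$.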

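We get that for any $\varepsilon \in (0,1)$

$$(1 + x^4)\left(p_n(x) - \varphi_4(x)\right) = o\left(n^{-\frac{3-\varepsilon}{2}}\right), \quad \text{uniformly in } x.$$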

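Let $f$ be the density of $X_0$. Let us assume that it is of the form $f = \varphi + \delta$, where $\delta$ is even, smooth and compactly supported (say, supported in $[-2,-1] \cup [1,2]$) with bounded derivatives. Moreover, we assume that $\frac{1}{2}\varphi \le f \le 2\varphi$ and that $|\delta| \le \frac{1}{4}$. Multiplying $\delta$ by a very small constant we can ensure that $f$ is log-concave.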

We are going to use Theorem 1.3 from . To check the assumptions of this theorem, we first observe that for any we have

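$$D_\alpha(Z_1 \| Z) = \frac{1}{\alpha - 1} \log\left(\int \left(\frac{\varphi + \delta}{\varphi}\right)^{\alpha} \varphi\right) < \infty,$$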

since has bounded support. We have to show that for sufficiently big there is

 EetX0

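Since $X_0$ is symmetric, we can assume that $t > 0$. Then

$$E e^{tX_0} = e^{t^2/2} + \sum_{k=1}^{\infty} \frac{t^{2k}}{(2k)!} \int x^{2k} \delta(x)\, dx \le e^{t^2/2} + \sum_{k=1}^{\infty} \frac{t^{2k}}{(2k)!}\, 2^{2k} \int_{-2}^{2} |\delta(x)|\, dx,$$

where we have used the fact that $\int \delta = 0$, $\delta$ has a bounded support contained in $[-2,2]$ and $|\delta| \le \frac{1}{4}$. We conclude that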

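$$|p_n(x) - \varphi(x)| \le \frac{C_0}{n} e^{-x^2/64}$$

and thus

$$p_n(x) \le \varphi(x) + \frac{C_0}{n} e^{-x^2/64} \le C_0 e^{-x^2/C_0}.$$

(In this proof $C$ and $c$ denote sufficiently large and sufficiently small universal constants that may change from one line to another. On the other hand, $C_0$, $c_0$ and $c_1$ denote constants that may depend on the distribution of $X_0$.) Moreover, for $|x| \le c_0 \sqrt{\log n}$ we have

$$p_n(x) \ge \varphi(x) - \frac{C_0}{n} e^{-x^2/64} \ge \frac{1}{n} e^{-x^2/64},$$

so

$$p_n(x) \ge \frac{c_0}{n^{C_0}}, \quad \text{for } |x| \le c_0 \sqrt{\log n}.$$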

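Let us define $h_n = p_n - \varphi_4$. Note that $\varphi_4 = \varphi + \frac{c_1}{n} \varphi^{(4)}$, where $c_1 = \frac{1}{4!}\left(EX_0^4 - 3\right)$. We have

$$\begin{aligned}
\int f * p_n \log f * p_n &= \int \left(f * \varphi + \frac{c_1}{n} f * \varphi^{(4)} + f * h_n\right) \log f * p_n \\
&= \int f * \varphi \log f * p_n + \frac{c_1}{n} \int f * \varphi^{(4)} \log f * p_n + \int f * h_n \log f * p_n \\
&= I_1 + I_2 + I_3.
\end{aligned}$$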

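We first bound $I_3$. Note that

$$(f * h_n)(x) \le 2 (\varphi * |h_n|)(x) \le o(n^{-5/4}) \int e^{-y^2/2} \frac{1}{1 + (x-y)^4}\, dy.$$

Assuming without loss of generality that $x > 0$, we have

$$\begin{aligned}
\int e^{-y^2/2} \frac{1}{1 + (x-y)^4}\, dy &\le \int_{y \in [\frac{1}{2}x, 2x]} + \int_{y \notin [\frac{1}{2}x, 2x]} \\
&\le \int_{y \in [\frac{1}{2}x, 2x]} e^{-x^2/8}\, dy + \frac{1}{1 + \frac{1}{16}x^4} \int_{y \notin [\frac{1}{2}x, 2x]} e^{-y^2/2}\, dy \\
&\le \frac{3}{2} x e^{-x^2/8} + \frac{\sqrt{2\pi}}{1 + \frac{1}{16}x^4} \le \frac{C}{1 + x^4}.
\end{aligned}$$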

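We also have

$$(f * p_n)(x) \le 2 (\varphi * p_n)(x) \le C_0 \left(\varphi * e^{-y^2/C}\right)(x) \le C_0.$$

Moreover, assuming without loss of generality that $x > 0$,

$$\begin{aligned}
(f * p_n)(x) \ge \frac{1}{2} (\varphi * p_n)(x) &\ge \frac{c_0}{n^{C_0}} \int_{0 \le y \le c_0\sqrt{\log n}} e^{-(x-y)^2/2}\, dy \\
&\ge \frac{c_0}{n^{C_0}} \int_{0 \le y \le c_0\sqrt{\log n}} e^{-x^2/2} e^{-y^2/2}\, dy \ge \frac{c_0}{n^{C_0}} e^{-x^2/2}.
\end{aligned}$$

Thus

$$|\log f * p_n(x)| \le \log\left(C_0 n^{C_0} e^{x^2/2}\right).$$

As a consequence

$$\begin{aligned}
I_3 &\le o(n^{-5/4}) \int \frac{1}{1 + x^4} |\log f * p_n(x)|\, dx \\
&\le o(n^{-5/4}) \int \frac{1}{1 + x^4} \log\left(C_0 n^{C_0} e^{x^2/2}\right) dx = o(n^{-5/4} \log n).
\end{aligned}$$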

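For $I_2$ fix $\beta > 0$ and observe that

$$\begin{aligned}
\left| \int_{|x| \ge \beta\sqrt{\log n}} f * \varphi^{(4)} \log(f * p_n) \right| &\le 2 \left| \int_{|x| \ge \beta\sqrt{\log n}} \varphi * |\varphi^{(4)}| \log(f * p_n) \right| \\
&\le C_0 \int_{|x| \ge \beta\sqrt{\log n}} (1 + x^4) e^{-x^2/4} \log\left(C_0 n^{C_0} e^{x^2/2}\right) = o(n^{-c}).
\end{aligned}$$

Hence,

$$\int f * \varphi^{(4)} \log f * p_n = \int_{|x| \le \beta\sqrt{\log n}} + \int_{|x| > \beta\sqrt{\log n}} = \int_{|x| \le \beta\sqrt{\log n}} f * \varphi^{(4)} \log f * p_n + o(1).$$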

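Writing $p_n = \varphi + r_n$ we get

$$\int_{|x| \le \beta\sqrt{\log n}} f * \varphi^{(4)} \log f * p_n = \int_{|x| \le \beta\sqrt{\log n}} (f * \varphi)^{(4)} \left[\log(f * \varphi) + \log\left(1 + \frac{f * r_n}{f * \varphi}\right)\right].$$

Here

$$\int_{|x| \le \beta\sqrt{\log n}} (f * \varphi)^{(4)} \log(f * \varphi) = \int (f * \varphi)^{(4)} \log(f * \varphi) + o(1)$$

and

$$\int_{|x| \le \beta\sqrt{\log n}} (f * \varphi)^{(4)} \log\left(1 + \frac{f * r_n}{f * \varphi}\right) = o(1),$$

since for $|x| \le \beta\sqrt{\log n}$ with $\beta$ sufficiently small we have

$$\left| \frac{f * r_n}{f * \varphi} \right| \le \frac{4}{n} \left| \frac{\varphi * \left(C_0 e^{-x^2/C_0}\right)}{\varphi * \varphi} \right| \le \frac{C_0}{n} e^{C_0 x^2} \le \frac{C}{\sqrt{n}}.$$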

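By Jensen's inequality,

$$I_1 = \int f * \varphi \log f * p_n \le \int f * \varphi \log f * \varphi.$$

Putting these things together we get

$$\int f * p_n \log f * p_n \le \int f * \varphi \log f * \varphi + \frac{c_1}{n} \int (f * \varphi)^{(4)} \log(f * \varphi) + o(n^{-1}).$$

In terms of entropies, this reads

$$h(X_0 + Z) \le h(X_0 + Z_n) + \frac{1}{n} \cdot \frac{1}{4!} \left(EX_0^4 - 3\right) \int (f * \varphi)^{(4)} \log(f * \varphi) + o(n^{-1}).$$

It is therefore enough to construct $X_0$ (satisfying all previous conditions) such that

$$\left(EX_0^4 - 3\right) \int (f * \varphi)^{(4)} \log(f * \varphi) < 0.$$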

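It actually suffices to construct $g$ such that $\int g = \int g x^2 = \int g x^4 = 0$ but the function $f = \varphi + \varepsilon g$ satisfies

$$\int (f * \varphi)'''' \log(f * \varphi) > 0$$

for small $\varepsilon$. Then we perturb $g$ a bit to get $EX_0^4 < 3$ instead of $EX_0^4 = 3$. This can be done without affecting log-concavity.

Let $\varphi_2 = \varphi * \varphi$. We have

$$\begin{aligned}
\int (f * \varphi)'''' \log(f * \varphi) &= \int (\varphi_2 + \varepsilon \varphi * g)'''' \log(\varphi_2 + \varepsilon \varphi * g) \\
&\approx \int (\varphi_2 + \varepsilon \varphi * g)'''' \left(\log(\varphi_2) + \varepsilon \frac{\varphi * g}{\varphi_2} - \frac{1}{2} \varepsilon^2 \left(\frac{\varphi * g}{\varphi_2}\right)^2\right).
\end{aligned}$$

The leading term and the term in front of $\varepsilon$ vanish (thanks to $g$ being orthogonal to $1, x^2, x^4$). The term in front of $\varepsilon^2$ is equal to

$$J = \int \frac{(\varphi * g)'''' (\varphi * g)}{\varphi_2} - \frac{1}{2} \int \frac{\varphi_2'''' (\varphi * g)^2}{\varphi_2^2} = J_1 - J_2.$$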

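The first integral is equal to

$$J_1 = \iiint 2\sqrt{\pi}\, e^{x^2/4}\, g''''(s) g(t)\, \frac{1}{2\pi} e^{-(x-s)^2/2} e^{-(x-t)^2/2}\, dx\, ds\, dt.$$

Now,

$$\int 2\sqrt{\pi}\, e^{x^2/4}\, \frac{1}{2\pi} e^{-(x-s)^2/2} e^{-(x-t)^2/2}\, dx = \frac{2 e^{\frac{1}{6}(-s^2 + 4st - t^2)}}{\sqrt{3}}.$$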
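The Gaussian integral above is easy to confirm numerically. The following sketch compares a midpoint-rule approximation of the left-hand side with the closed form on the right for one arbitrary pair $(s, t)$; the integration window and step count are illustrative choices.

```python
import math

def kernel_integral(s, t, lo=-20.0, hi=20.0, steps=20_000):
    """Midpoint-rule value of the x-integral of 2*sqrt(pi)*e^{x^2/4} * phi(x-s) * phi(x-t)."""
    dx = (hi - lo) / steps
    prefactor = 2 * math.sqrt(math.pi) / (2 * math.pi)
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        # Combine all exponents before exponentiating to avoid overflow.
        exponent = x * x / 4 - (x - s) ** 2 / 2 - (x - t) ** 2 / 2
        total += prefactor * math.exp(exponent) * dx
    return total

def kernel_closed_form(s, t):
    """Closed form 2 * exp((-s^2 + 4st - t^2) / 6) / sqrt(3)."""
    return 2 * math.exp((-s * s + 4 * s * t - t * t) / 6) / math.sqrt(3)

s, t = 1.2, -0.7
print(kernel_integral(s, t), kernel_closed_form(s, t))
```

The identity follows from completing the square: the combined exponent is $-\frac{3}{4}x^2 + (s+t)x - \frac{1}{2}(s^2 + t^2)$, a Gaussian in $x$ with variance $\frac{2}{3}$.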

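Therefore,

$$J_1 = \frac{2}{\sqrt{3}} \iint e^{\frac{1}{6}(-s^2 + 4st - t^2)}\, g''''(s) g(t)\, ds\, dt.$$

If we integrate the first integral four times by parts we get

$$J_1 = \frac{2}{81\sqrt{3}} \iint e^{\frac{1}{6}(-s^2 + 4st - t^2)} \left[27 + s^4 - 8s^3 t - 72t^2 + 16t^4 - 8st(-9 + 4t^2) + 6s^2(-3 + 4t^2)\right] g(s) g(t)\, ds\, dt.$$

Moreover,

$$\frac{1}{2} \cdot \frac{\varphi_2''''}{\varphi_2^2} = \frac{\sqrt{\pi}}{16} \left(12 - 12x^2 + x^4\right) e^{x^2/4},$$

so we get

$$J_2 = \iiint \frac{\sqrt{\pi}}{16} \left(12 - 12x^2 + x^4\right) e^{x^2/4}\, g(s) g(t)\, \frac{1}{2\pi} e^{-(x-s)^2/2} e^{-(x-t)^2/2}\, dx\, ds\, dt.$$

Since

$$\int \frac{\sqrt{\pi}}{16} \left(12 - 12x^2 + x^4\right) e^{x^2/4}\, \frac{1}{2\pi} e^{-(x-s)^2/2} e^{-(x-t)^2/2}\, dx = \frac{1}{81\sqrt{3}}\, e^{\frac{1}{6}(-s^2 + 4st - t^2)} \left[27 + (s+t)^2\left(-18 + (s+t)^2\right)\right],$$

we arrive at

$$J_2 = \iint \frac{1}{81\sqrt{3}}\, e^{\frac{1}{6}(-s^2 + 4st - t^2)} \left[27 + (s+t)^2\left(-18 + (s+t)^2\right)\right] g(s) g(t)\, ds\, dt.$$

Thus $J$ becomes

$$J = \frac{1}{81\sqrt{3}} \iint e^{\frac{1}{6}(-s^2 + 4st - t^2)} \left[27 + s^4 - 20s^3 t - 126t^2 + 31t^4 + 6s^2(-3 + 7t^2) + s(180t - 68t^3)\right] g(s) g(t)\, ds\, dt.$$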

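The function

$$g(s) = \left(\frac{7280}{69}|s|^3 - \frac{11025}{23}s^2 + \frac{49000}{69}|s| - \frac{7875}{23}\right) 1_{[1,2]}(|s|)$$

satisfies $\int g = \int g x^2 = \int g x^4 = 0$. Numerical computations show that for this $g$, $J > 0$.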
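The vanishing of the moments of orders 0, 2 and 4 of $g$ can be confirmed in exact rational arithmetic. The sketch below assumes the coefficients are the fractions $\frac{7280}{69}$, $-\frac{11025}{23}$, $\frac{49000}{69}$ and $-\frac{7875}{23}$, and uses the evenness of $g$ to reduce each moment to an integral over $[1,2]$; the names `COEFFS` and `moment` are illustrative.

```python
from fractions import Fraction

# Coefficients of the cubic in |s| defining g on 1 <= |s| <= 2, indexed by the power of |s|.
COEFFS = {
    3: Fraction(7280, 69),
    2: Fraction(-11025, 23),
    1: Fraction(49000, 69),
    0: Fraction(-7875, 23),
}

def moment(k):
    """Exact integral of g(s) * s^k over the real line, for even k."""
    if k % 2:
        raise ValueError("odd moments vanish by symmetry; pass an even k")
    # Integral of s^(j+k) over [1, 2] equals (2^(j+k+1) - 1) / (j+k+1).
    half = sum(c * Fraction(2 ** (j + k + 1) - 1, j + k + 1) for j, c in COEFFS.items())
    return 2 * half  # g is even, so the negative half-line contributes equally

for k in (0, 2, 4):
    print(k, moment(k))
```

These three conditions are what make the perturbed density $\varphi + \varepsilon g$ integrate to 1 and keep its second and fourth moments equal to those of the standard Gaussian.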

## 3. Entropy power inequalities under symmetries

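The heart of the folklore proof of $h(S_2) \ge h(S_1)$ for symmetric distributions is that for possibly dependent random variables $X_1$ and $X_2$, the invariance of differential entropy under linear maps of determinant $\pm 1$ combined with subadditivity imply that

$$h(X_1, X_2) = h\left(\frac{X_1 + X_2}{\sqrt{2}}, \frac{X_1 - X_2}{\sqrt{2}}\right) \le h\left(\frac{X_1 + X_2}{\sqrt{2}}\right) + h\left(\frac{X_1 - X_2}{\sqrt{2}}\right).$$

If the distribution of $X_1 + X_2$ is the same as that of $X_1 - X_2$, we deduce that

$$h\left(\frac{X_1 + X_2}{\sqrt{2}}\right) \ge \frac{h(X_1, X_2)}{2}. \tag{3}$$

If, furthermore, $X_1$ and $X_2$ are i.i.d., then $h(X_1, X_2) = 2h(X_1)$, yielding $h(S_2) \ge h(S_1)$. Note that under the i.i.d. assumption, the requirement that the distributions of $X_1 + X_2$ and $X_1 - X_2$ coincide is equivalent to the requirement that $X_1$ (or $X_2$) has a symmetric distribution.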
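The rotation-plus-subadditivity step can be illustrated with closed-form Gaussian entropies. The sketch below assumes a centered bivariate Gaussian with unit variances and correlation `rho` (an illustrative choice, not taken from the note): the rotated coordinates have variances $1 + \rho$ and $1 - \rho$ and are independent, so subadditivity holds with equality after the rotation, while it is strict before it.

```python
import math

def gaussian_entropy(variance):
    """Entropy of a one-dimensional Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

def joint_entropy(rho):
    """Entropy of a centered bivariate Gaussian with unit variances and correlation rho."""
    return math.log(2 * math.pi * math.e) + 0.5 * math.log(1 - rho * rho)

rho = 0.3
joint = joint_entropy(rho)
# Marginals of the rotated vector ((X1 + X2)/sqrt(2), (X1 - X2)/sqrt(2)).
rotated_bound = gaussian_entropy(1 + rho) + gaussian_entropy(1 - rho)
# Marginals of the original vector (X1, X2).
plain_bound = 2 * gaussian_entropy(1.0)
print(joint, rotated_bound, plain_bound)
```

For this Gaussian pair the two rotated coordinates have different laws unless `rho` is 0, so inequality (3) itself only applies in the uncorrelated case; the example is meant to show the first display only.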

Without assuming symmetry but assuming independence, we can use the fact from  that for independent random variables to deduce . In the i.i.d. case, the improved bound holds , which implies . These bounds are, however, not particularly interesting since they are weaker than the classical entropy power inequality; if they had recovered it, these ideas would have represented by far its most elementary proof.

Hao and Jog generalized the inequality (3) to the case where one has $n$ random variables, under a natural $n$-variable extension of the distributional requirement, namely unconditionality. However, they used a proof that goes through Fisher information inequalities, similar to the original Stam proof of the full entropy power inequality. The main observation of this section is simply that under certain circumstances, one can give a direct and simple proof of the Hao-Jog inequality, as well as others like it, akin to the 2-line proof of the inequality (3) given above. The “certain circumstances” have to do with the existence of appropriate linear transformations that respect certain symmetries, specifically Hadamard matrices.

Let us first outline how this works in the real case. Suppose $n$ is a dimension for which there exists a Hadamard matrix, namely, an $n \times n$ matrix with all its entries being 1 or $-1$, and its rows forming an orthogonal set of vectors. Dividing each row by its length $\sqrt{n}$ results in an orthogonal matrix $O$, all of whose entries are $\pm\frac{1}{\sqrt{n}}$. By unconditionality, each coordinate of the vector $OX$ has the same distribution as $\frac{X_1 + \dots + X_n}{\sqrt{n}}$. Hence

$$h(X) = h(OX) \le \sum_{j=1}^n h\left((OX)_j\right) = n\, h\left(\frac{X_1 + \dots + X_n}{\sqrt{n}}\right),$$

where the inequality follows from subadditivity of entropy. This is exactly the Hao-Jog inequality for those dimensions where a Hadamard matrix exists. It would be interesting to find a way around the dimensional restriction, but we do not currently have a way of doing so.
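For dimensions that are powers of 2, a Hadamard matrix can be written down explicitly via the classical Sylvester doubling construction. The sketch below builds one and checks that its rows are mutually orthogonal with squared length $n$; the function names are illustrative, and this construction covers only $n = 2^k$, not every dimension where Hadamard matrices exist.

```python
def sylvester_hadamard(k):
    """Return the 2^k x 2^k Hadamard matrix obtained by Sylvester doubling."""
    H = [[1]]
    for _ in range(k):
        # Block form [[H, H], [H, -H]].
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def rows_orthogonal(H):
    """Check that distinct rows are orthogonal and every row has squared length n."""
    n = len(H)
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(H[i], H[j]))
            if dot != (n if i == j else 0):
                return False
    return True

H8 = sylvester_hadamard(3)
print(len(H8), rows_orthogonal(H8))
```

Dividing `H8` by $\sqrt{8}$ gives an orthogonal matrix whose rows are sign patterns scaled by $\frac{1}{\sqrt{8}}$, which is the form of the matrix $O$ used in the argument above.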

As is well known, other than the dimensions 1 and 2, Hadamard matrices may only exist for dimensions that are multiples of 4. As of this date, Hadamard matrices are known to exist for all multiples of 4 up to 664, and it is a major open problem whether they in fact exist for all multiples of 4. (Incidentally, we note that the question of existence of Hadamard matrices can actually be formulated in the entropy language. Indeed, Hadamard matrices are precisely those that saturate the obvious bound for the entropy of an orthogonal matrix.)

In contrast, complex Hadamard matrices exist in every dimension. A complex Hadamard matrix of order $n$ is an $n \times n$ matrix with complex entries all of which have modulus 1, and whose rows form an orthogonal set of vectors in $\mathbb{C}^n$. To see that complex Hadamard matrices always exist, we merely exhibit the Fourier matrices, which are a well known example of them: these are defined by the entries $F_{j,k} = e^{2\pi i (j-1)(k-1)/n}$ for $j, k = 1, \dots, n$, and are related to the discrete Fourier transform (DFT) matrices. Complex Hadamard matrices play an important role in quantum information theory. They also yield Theorem 2.
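To make the existence claim concrete, the sketch below builds the normalized Fourier matrix in a dimension that is not a multiple of 4 and checks numerically that every entry has modulus $\frac{1}{\sqrt{n}}$ and that the rows are orthonormal. The function names and the choice $n = 5$ are illustrative.

```python
import cmath
import math

def fourier_unitary(n):
    """Normalized Fourier matrix with entries exp(2*pi*i*j*k/n) / sqrt(n), j, k = 0..n-1."""
    return [[cmath.exp(2j * math.pi * j * k / n) / math.sqrt(n) for k in range(n)]
            for j in range(n)]

def max_unitarity_defect(U):
    """Largest deviation of the row Gram matrix of U from the identity."""
    n = len(U)
    worst = 0.0
    for a in range(n):
        for b in range(n):
            inner = sum(U[a][k] * U[b][k].conjugate() for k in range(n))
            worst = max(worst, abs(inner - (1 if a == b else 0)))
    return worst

U5 = fourier_unitary(5)
print(max_unitarity_defect(U5))
print(max(abs(abs(u) - 1 / math.sqrt(5)) for row in U5 for u in row))
```

Both printed values should be at the level of floating-point rounding, confirming that a unitary matrix with entries of equal modulus is available even for $n = 5$, where no real Hadamard matrix exists.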

Proof of Theorem 2. Take any $n \times n$ unitary matrix $U$ all of whose entries are complex numbers of the same modulus $\frac{1}{\sqrt{n}}$; such matrices are easily constructed by multiplying a complex Hadamard matrix by $n^{-1/2}$. (For instance, one could take $U = \frac{1}{\sqrt{n}} \left[e^{2\pi i kl/n}\right]_{k,l}$.) By complex-unconditionality, each coordinate of the vector $UX$ has the same distribution, the same as $\frac{X_1 + \dots + X_n}{\sqrt{n}}$. Therefore, by subadditivity,

 h(X)=h(U