1 Introduction and notation
In this paper, we consider the entropy of Poisson–binomial random variables (sums of independent Bernoulli random variables). Given parameters $p_1, \ldots, p_n$ (where $0 \le p_i \le 1$) we will write $f_{p_1, \ldots, p_n}$ for the probability mass function of the random variable $S = B_1 + \cdots + B_n$, where $B_1, \ldots, B_n$ are independent with $B_i \sim$ Bernoulli$(p_i)$. We can write the Shannon entropy as a function of the parameters:
$H(p_1, \ldots, p_n) = - \sum_{k=0}^{n} f_{p_1, \ldots, p_n}(k) \log f_{p_1, \ldots, p_n}(k).$ (1)
Shepp and Olkin [16] made the following conjecture “on the basis of numerical calculations and verification in the special cases ”:
Conjecture 1.1 (Shepp–Olkin monotonicity conjecture).
If all $p_i \le 1/2$ then $H(p_1, \ldots, p_n)$ is a non-decreasing function of each $p_i$.
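Before turning to the proof, the conjecture is easy to test numerically for small $n$. The following sketch (our own illustration, not part of the paper; all function names are ours) computes the Poisson–binomial mass function by iterated convolution and spot-checks that the entropy does not decrease when any single $p_i \le 1/2$ is increased.

```python
import math
import random

def poisson_binomial_pmf(ps):
    """Mass function of a sum of independent Bernoulli(p_i) variables,
    built by convolving in one Bernoulli factor at a time."""
    f = [1.0]  # distribution of the empty sum
    for p in ps:
        g = [0.0] * (len(f) + 1)
        for k, fk in enumerate(f):
            g[k] += (1.0 - p) * fk      # current Bernoulli equals 0
            g[k + 1] += p * fk          # current Bernoulli equals 1
        f = g
    return f

def entropy(ps):
    """Shannon entropy H(p_1, ..., p_n) of the Poisson-binomial law."""
    return -sum(x * math.log(x) for x in poisson_binomial_pmf(ps) if x > 0.0)

# Spot-check of the monotonicity property: with all p_i <= 1/2,
# increasing any single parameter should not decrease the entropy.
random.seed(0)
for _ in range(200):
    ps = [random.uniform(0.0, 0.5) for _ in range(4)]
    for i in range(4):
        bumped = list(ps)
        bumped[i] = min(0.5, bumped[i] + 1e-4)
        assert entropy(bumped) >= entropy(ps) - 1e-12
```

Such checks of course prove nothing in themselves; the point of the paper is the general argument.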
The main contribution of the present paper is to prove that this conjecture is correct, and to give conditions for equality (see Theorem 3.6). The heart of our argument will be a representation of one probability distribution in terms of another, via so-called mixing coefficients (see Definition 2.1) previously introduced in [7] in a form motivated by optimal transport. The key property of these coefficients is that their differences are decreasing functions (see Proposition 2.8). We analyse the resulting expressions using elementary tools; specifically the interplay between the product rule (15) for discrete differentiation and an integration by parts formula, Lemma 3.3. It may be surprising that only such simple tools are required to resolve this conjecture; however, we remark that the interplay between product rules and integration by parts lies behind much of the power of the Bakry–Émery $\Gamma$-calculus for continuous random variables (as alluded to in the dedication of ). This may suggest that the Shepp–Olkin conjecture should be viewed in the context of an emerging discrete Bakry–Émery theory (see also [4, 9]).
Given the naturalness of Conjecture 1.1, and its simplicity of statement, it is perhaps surprising that it has remained open for over 40 years (although [16] was published as a book chapter in 1981, the conjecture was first formulated in the corresponding technical report in July 1978). Indeed, we are not even aware of any published work that resolves special cases.
In the same paper [16], Shepp and Olkin also conjectured that the entropy is concave in , which was proved in [7] and [8], using constructions based on optimal transport which we will describe below. Similarly to the monotonicity conjecture, little published work had previously addressed the entropy concavity conjecture, though limited progress in special cases had been made in  and .
In their original paper , Shepp and Olkin did prove some related results. In particular, they showed [16, Theorem 1] that the entropy is Schur concave in , and hence deduced a maximum entropy property for binomial random variables (see also contemporary work of Mateev , as well as later extensions by [5, 18]). Further, they showed [16, Theorem 2] that the entropy of is concave in a single argument (see (4) below) and [16, Theorem 4]
that the entropy of the binomial distribution is concave. The Schur concavity property of Shepp and Olkin was generalized to a wider range of functionals and parametric families including negative binomials by Karlin and Rinott.
However, none of these results appear to be particularly relevant to the monotonicity conjecture, Conjecture 1.1. We can reformulate the conjecture to say that if $p_i \le 1/2$ for each $i$ then $\partial H / \partial p_j \ge 0$ for each $j$. Clearly, by symmetry of the arguments, it is sufficient to verify this in the case $j = 1$, which in turn means that it is enough to check that $\partial H / \partial p_1 \ge 0$. Without loss of generality we will assume throughout that all $p_i$ are non-zero, so that the probability mass function is non-zero at every point of $\{0, 1, \ldots, n\}$.
In this case, following standard calculations in [16], we can write $f$ for the mass function $f_{p_1, \ldots, p_n}$ (omitting the subscripts for brevity) and write
$\frac{\partial H}{\partial p_1} = \sum_{k} \bigl( g(k) - g(k-1) \bigr) \log f(k),$ (3)
$\frac{\partial^2 H}{\partial p_1^2} = - \sum_{k} \frac{\bigl( g(k-1) - g(k) \bigr)^2}{f(k)},$ (4)
where $g$ is the probability mass function of the sum of the Bernoulli variables other than the first, which is supported on the set $\{0, 1, \ldots, n-1\}$ and does not depend on $p_1$. Here and throughout we take $f(k) = 0$ and $g(k) = 0$ outside these sets if necessary.
The negativity of each term in (4) tells us directly that (as proved in [16, Theorem 2]) the entropy is concave in $p_1$ (of course, we also know this from the full Shepp–Olkin theorem proved in [7, 8]), so it is sufficient to prove that the derivative is non-negative in the case $p_1 = 1/2$, since the derivative is therefore larger for any smaller values of $p_1$.
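The concavity of the entropy in a single argument can likewise be illustrated numerically. The sketch below (ours, not the paper's; the fixed values of the remaining parameters are arbitrary) checks the sign of a central second difference of $H$ in its first argument, consistent with [16, Theorem 2].

```python
import math

def pmf(ps):
    """Poisson-binomial mass function via iterated convolution."""
    f = [1.0]
    for p in ps:
        g = [0.0] * (len(f) + 1)
        for k, fk in enumerate(f):
            g[k] += (1.0 - p) * fk
            g[k + 1] += p * fk
        f = g
    return f

def H(ps):
    """Shannon entropy of the Poisson-binomial law."""
    return -sum(x * math.log(x) for x in pmf(ps) if x > 0.0)

# Central second difference of H in p_1, with the other parameters fixed;
# concavity in one argument means this should be negative.
rest = [0.1, 0.3, 0.45]   # arbitrary illustrative values of p_2, p_3, p_4
h = 1e-3
for p1 in [0.05, 0.2, 0.35, 0.5]:
    second = H([p1 + h] + rest) - 2.0 * H([p1] + rest) + H([p1 - h] + rest)
    assert second < 0.0
```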
However, at this stage, further progress is elusive. Considering convolution with the first Bernoulli variable means that we can express $f(k) = (1 - p_1) g(k) + p_1 g(k-1)$. However, substituting this in (3) does not suggest an obvious way forward in general, though it is possible to use the resulting formula to resolve certain special cases. For example, careful cancellation in the case where $p_2 = \cdots = p_n = 1/2$, and hence the random variable is binomial, allows us to deduce that, in this case, the entropy derivative (3) equals zero (see Example 2.7 below for an alternative view of this). However, this calculation does not give any particular insight into why the binomial case might be extreme in the sense of the conjecture.
Instead of expressing $f$ as a linear combination of values of $g$, our key observation is that we can express $g$ as a weighted linear combination of values of $f$, as described in the following section.
2 Entropy derivative and mixing coefficients
for certain ‘mixing coefficients’. The general construction of these coefficients in the case of Shepp–Olkin paths is given in [7, Proposition 5.1], but in the specific case where only $p_1$ varies, in the case $p_1 = 1/2$, we can simply define the following values:
For , define
In [7, Proposition 5.2] this result was stated with weak inequalities, but the strict inequalities will help us to resolve the case of equality in Conjecture 1.1. It will often be useful for us to observe that Definition 2.1 implies that for
and that for
Summing (8), we can directly calculate that
which will play an important role in our proof of Conjecture 1.1. Further, it is interesting to note by rearranging (6) that  if and only if , which by the unimodality of the mass function (see for example ) means that . This may suggest that the Shepp–Olkin conjecture can be understood as relating to the skewness of the random variables concerned. Direct calculation shows that the centred third moment of the sum is $\sum_{i=1}^{n} p_i (1 - p_i)(1 - 2 p_i)$, but it is not immediately clear how this positive skew will affect the entropy.
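The direct calculation mentioned here uses the fact that third central moments (third cumulants) add over independent summands, and that a Bernoulli$(p)$ variable has third central moment $p(1-p)(1-2p)$. A short numerical sketch (our own, not from the paper) confirms the closed form and its sign when all $p_i \le 1/2$:

```python
import random

def pmf(ps):
    """Poisson-binomial mass function via iterated convolution."""
    f = [1.0]
    for p in ps:
        g = [0.0] * (len(f) + 1)
        for k, fk in enumerate(f):
            g[k] += (1.0 - p) * fk
            g[k + 1] += p * fk
        f = g
    return f

def third_central_moment(ps):
    """Centred third moment of the Poisson-binomial law, from its pmf."""
    f = pmf(ps)
    mu = sum(k * fk for k, fk in enumerate(f))
    return sum((k - mu) ** 3 * fk for k, fk in enumerate(f))

# Compare against the closed form sum_i p_i (1 - p_i)(1 - 2 p_i),
# which is non-negative whenever all p_i <= 1/2 (positive skew).
random.seed(1)
for _ in range(100):
    ps = [random.uniform(0.0, 0.5) for _ in range(5)]
    closed_form = sum(p * (1.0 - p) * (1.0 - 2.0 * p) for p in ps)
    assert abs(third_central_moment(ps) - closed_form) < 1e-9
    assert closed_form >= 0.0
```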
In  we used these mixing coefficients to formulate a discrete analogue of the Benamou–Brenier formula from optimal transport theory, which gave an understanding of certain interpolation paths of discrete probability measures (including Shepp–Olkin paths) as geodesics in a metric space. We do not require this interpretation here, but simply study the properties of these coefficients in their own right.
We now define a function which will form the basis of our proof of Conjecture 1.1:
where we take to ensure that is continuous at and at .
Note that we can express
as a power series with only odd terms with all coefficients negative. Further, comparison with  shows that this series converges absolutely for all .
Hence the function is non-increasing and antisymmetric about  (with ) and with , and .
We can express the derivative of entropy in terms of these functions and the mixing coefficients, as follows:
since and so that .
We can exchange the order of summation in
because of Fubini’s theorem, since as mentioned above the power series for  converges absolutely. Hence if each odd centred moment is negative then the entropy derivative (3) is positive. ∎
In the case where , since the distribution is Binomial, we know that  for , so that
Using this, and the fact that , since the distribution is Binomial we know that  and so each odd centred moment satisfies
by relabelling, and hence equals zero.
We shall argue that the binomial example, Example 2.7, represents the extreme case using the following property, which will be key for us:
If all the then
See Appendix A. ∎
3 Proof of Shepp–Olkin monotonicity conjecture
We are now in a position to complete our proof of Conjecture 1.1.
First we introduce some further notation:
We define the family by and, for , . Here if , we take (this reflects that the product includes the term ).
We define the family by and, for , write . For other values of , is not defined.
The notation $\nabla$ stands for the left-derivative operator: $\nabla u(k) = u(k) - u(k-1)$. This operator satisfies a product rule of the form:
$\nabla (uv)(k) = u(k) \, \nabla v(k) + v(k-1) \, \nabla u(k).$ (15)
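In one standard form (the shift convention we assume here), the left derivative $\nabla u(k) = u(k) - u(k-1)$ satisfies $\nabla(uv)(k) = u(k)\nabla v(k) + v(k-1)\nabla u(k)$; this identity can be verified mechanically on arbitrary sequences (an illustration of ours, not the paper's):

```python
import random

def nabla(h, k):
    """Left derivative of a sequence: h(k) - h(k-1)."""
    return h[k] - h[k - 1]

# Check the discrete product rule
#   nabla(uv)(k) = u(k) * nabla(v)(k) + v(k-1) * nabla(u)(k)
# on random test sequences.
random.seed(2)
u = [random.random() for _ in range(10)]
v = [random.random() for _ in range(10)]
uv = [a * b for a, b in zip(u, v)]
for k in range(1, 10):
    lhs = nabla(uv, k)
    rhs = u[k] * nabla(v, k) + v[k - 1] * nabla(u, k)
    assert abs(lhs - rhs) < 1e-12
```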
For and we define the polynomial (symmetric in its inputs)
where the sum is taken over all the -tuples such that . We also set and for all . Clearly, is non-negative if are non-negative.
We now state three technical lemmas that we will require in the proof; each of these is proved in Appendix B. First, the fact that  is decreasing in :
Given , we have for .
We next give an integration by parts formula. Note that although we restrict the range of summation for technical reasons, the values and are zero outside the respective ranges:
For any function that is well-defined on and any we have
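The precise ranges of summation in Lemma 3.3 are specific to the paper, but the mechanism behind any such formula is discrete integration by parts (Abel summation), obtained by summing the product rule and telescoping; a generic sketch with hypothetical sequences (ours, not the paper's exact statement):

```python
import random

# Summing the product rule over a <= k <= b telescopes to the
# discrete integration by parts identity
#   sum_k u(k)(v(k)-v(k-1))
#     = u(b)v(b) - u(a-1)v(a-1) - sum_k v(k-1)(u(k)-u(k-1)).
random.seed(3)
u = [random.random() for _ in range(12)]
v = [random.random() for _ in range(12)]
a, b = 1, 11
lhs = sum(u[k] * (v[k] - v[k - 1]) for k in range(a, b + 1))
rhs = (u[b] * v[b] - u[a - 1] * v[a - 1]
       - sum(v[k - 1] * (u[k] - u[k - 1]) for k in range(a, b + 1)))
assert abs(lhs - rhs) < 1e-12
```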
Finally, a result concerning differences of the polynomials:
For and , we have
We now state and prove the key Proposition:
Fix . The sequence defined by
is increasing in . Here note that since .
(Note that by restricting to this range of summation the quantity is well-defined.)
where here and throughout the proof, $\nabla$ refers to a difference in the parameter. Now, using the product rule (15) with  and  yields:
Further Lemma 3.2 gives that for and , so the second sum is . We finally conclude that
as we have . ∎
We are now able to prove the following theorem, which confirms Conjecture 1.1:
If all $p_i \le 1/2$ then $H(p_1, \ldots, p_n)$ is a non-decreasing function of each $p_i$. Equality holds if and only if each $p_i$ equals $0$ or $1/2$.
As described in Proposition 2.6, it is sufficient for us to prove that for every we have
Using (8) we know that and , so that (subtracting these two expressions)
This means that, using a standard factorization of , since we can write
However, Proposition 3.5 gives
and we are done. Note that an examination of (3) allows us to deduce conditions under which equality holds for the cubic case (). In this case we can rewrite (3) using the integration by parts formula (17) as
Here  and  are positive for , and Proposition 2.8 tells us that the second bracket is negative, and so the centred third moment equals zero if and only if  is constant in , which means that . However, (10) tells us that this implies that
so that equality can hold if and only if . ∎
4 Monotonicity of Rényi and Tsallis entropies
As in [8, Section 4], where a similar discussion considered the question of concavity of entropies, we briefly discuss whether Theorem 3.6 may extend to prove that $q$-Rényi and $q$-Tsallis entropies are always increasing functions of each $p_i$ for . We make the following definitions, each of which reduces to the Shannon entropy (1) as $q \to 1$.
For as defined above, for define
Again, the second term is negative, and therefore will be increasing for all if it is increasing in the case . Clearly for (24) shows that the entropy is constant (indeed we know that in this case and ).
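For concreteness, taking the standard definitions (an assumption on our part, with parameter name $q$): the Tsallis entropy $\bigl(1 - \sum_k f(k)^q\bigr)/(q-1)$ and the Rényi entropy $\log\bigl(\sum_k f(k)^q\bigr)/(1-q)$, both of which recover the Shannon entropy as $q \to 1$. A sketch (ours) checking this limit numerically:

```python
import math

def pmf(ps):
    """Poisson-binomial mass function via iterated convolution."""
    f = [1.0]
    for p in ps:
        g = [0.0] * (len(f) + 1)
        for k, fk in enumerate(f):
            g[k] += (1.0 - p) * fk
            g[k + 1] += p * fk
        f = g
    return f

def shannon(f):
    return -sum(x * math.log(x) for x in f if x > 0.0)

def tsallis(f, q):
    return (1.0 - sum(x ** q for x in f)) / (q - 1.0)

def renyi(f, q):
    return math.log(sum(x ** q for x in f)) / (1.0 - q)

# Both entropies approach the Shannon entropy as q -> 1.
f = pmf([0.2, 0.35, 0.5])
for q in (0.999, 1.001):
    assert abs(tsallis(f, q) - shannon(f)) < 1e-2
    assert abs(renyi(f, q) - shannon(f)) < 1e-2
```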
Curiously, we can simplify (24) in the case of collision entropy ($q = 2$) by substituting for $f$ as a linear combination of values of $g$ (which is the argument that did not work for the Shannon entropy).
For , if
It may be natural to conjecture that Tsallis (and hence Rényi) entropy is increasing for all . However, the following example shows that this property in fact can fail for (note that Rényi entropy is not concave in the same range – see [8, Lemma 4.3]).
Consider with and . Direct substitution in (24) gives that the entropy derivative is exactly
and we note that for , so the leading coefficient is negative and so the derivative will be negative for sufficiently small.
However, we conjecture that these entropies are increasing for , since we know that the result holds for :
If all then Tsallis entropy and Rényi entropy are non-decreasing functions of for .
We use an argument similar to that which gave Proposition 2.6 to give a moment-based condition related to this conjecture.
Let us fix . If, for all ,
Oliver Johnson would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the workshop Beyond I.I.D. in Information Theory, where work on this paper was undertaken. This work was supported by EPSRC grant no. EP/K032208/1.
Appendix A Proof of Proposition 2.8
Using , we can express the difference
so the property is equivalent to
where we write .
We write for the mass function of , the sum of the first Bernoulli variables with the th one omitted, and write and .
The cases and of [7, Lemma A1] give that
and by direct substitution we deduce that