 # An Effective Bernstein-type Bound on Shannon Entropy over Countably Infinite Alphabets

We prove a Bernstein-type bound for the difference between the average of negative log-likelihoods of independent discrete random variables and the Shannon entropy, both defined on a countably infinite alphabet. The result holds for the class of discrete random variables with tails lighter than, or of the same order as, a discrete power-law distribution. Most commonly used discrete distributions, such as the Poisson distribution, the negative binomial distribution, and the power-law distribution itself, belong to this class. The bound is effective in the sense that we provide a method to compute the constants in it.

## 1 Introduction

Concentration inequalities provide powerful tools for many subjects including information theory [5], algorithm analysis [4] and statistics [9, 8]. The goal of the present paper is to prove an exponential decay bound with computable constants for the difference between the negative log-likelihood of discrete random variables and the Shannon entropy, both defined on a countably infinite alphabet.

Let $X$ be a discrete random variable on a countably infinite alphabet $\mathcal{X} = \{x_1, x_2, \ldots\}$. Let $p_k = P(X = x_k)$ be the probability mass at $x_k$. Assume, without loss of generality, that $p_k > 0$ for each $k$; otherwise, simply remove $x_k$ with $p_k = 0$ from $\mathcal{X}$. Let $P(X)$ be the probability mass function evaluated at $X$, which is a random variable with $P(X) = p_k$ if $X = x_k$, $k = 1, 2, \ldots$. Then $E[-\log P(X)] = -\sum_{k=1}^{\infty} p_k \log p_k$ is the Shannon entropy, which is a key concept in information theory [7, 2]. Throughout the paper, "log" denotes the natural logarithm. Note that neither $P(X)$ nor the entropy depends on the elements in $\mathcal{X}$. In fact, $\mathcal{X}$ is not necessarily a set of real numbers. The set can contain generic symbols such as letters, and is therefore called an alphabet.
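As a small illustration of the definition, the sketch below evaluates a truncated entropy sum for a geometric distribution $p_k = (1-q)q^{k-1}$. The choice of distribution, the function names, and the truncation point are assumptions made only for this example; for $q = 1/2$ the entropy has the closed form $2\log 2$, which gives something to compare against.

```python
import math

def geometric_pmf(k, q=0.5):
    """Illustrative pmf: p_k = (1 - q) * q**(k - 1) for k = 1, 2, ..."""
    return (1.0 - q) * q ** (k - 1)

def truncated_entropy(pmf, k_max):
    """Partial sum -sum_{k=1}^{k_max} p_k * log(p_k), natural logarithm."""
    total = 0.0
    for k in range(1, k_max + 1):
        p = pmf(k)
        if p > 0.0:  # terms with p_k = 0 contribute nothing
            total -= p * math.log(p)
    return total

# For q = 1/2 the exact entropy is 2 * log(2).
approx_entropy = truncated_entropy(geometric_pmf, 200)
exact_entropy = 2.0 * math.log(2.0)
```

The truncated sum is a lower bound on the entropy because every term is non-negative; for this exponentially decaying example the omitted tail is far below floating-point resolution.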

Entropy on countably infinite alphabets does not always have finite values. We give a simple sufficient condition ensuring its finiteness at the beginning of Section 2, which is also the key assumption for the main result of the paper. Readers are referred to [1] for a more thorough discussion of conditions for finiteness of entropy on countably infinite alphabets.

Let $X_1, \ldots, X_n$ be independently and identically distributed (i.i.d.) copies of $X$. Then $\sum_{i=1}^{n} \log P(X_i)$ is the joint log-likelihood of $X_1, \ldots, X_n$. By the weak law of large numbers, for every $\epsilon > 0$,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\log P(X_i) - E[\log P(X)]\right| \ge \epsilon\right) \to 0 \quad \text{as } n \to \infty,$$

provided that the entropy is finite. This result, particularly for the case of $\mathcal{X}$ being finite, is called the asymptotic equipartition property in the information theory literature, and it is the foundation of many important results in this field [2, 3].
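The convergence above can be observed in a short simulation. This is a sketch under assumptions not in the paper: the geometric distribution $p_k = 2^{-k}$ (entropy $2\log 2$), the sample sizes, the seed, and all function names are chosen only for illustration.

```python
import math
import random

LOG2 = math.log(2.0)
ENTROPY = 2.0 * LOG2  # closed form for p_k = 2**(-k), k = 1, 2, ...

def sample_geometric(rng):
    """Draw X with P(X = k) = 2**(-k) by inverting the CDF."""
    u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(0.5)) + 1

def average_log_likelihood(n, rng):
    """(1/n) * sum_i log P(X_i), using log P(X) = -X * log(2)."""
    return sum(-sample_geometric(rng) * LOG2 for _ in range(n)) / n

rng = random.Random(0)
# |average log-likelihood - E[log P(X)]| for increasing sample sizes.
deviations = {
    n: abs(average_log_likelihood(n, rng) + ENTROPY)
    for n in (100, 10_000, 200_000)
}
```

The deviation for the largest sample size should be small, since the standard deviation of the average is $\sqrt{2}\,\log 2/\sqrt{n}$ for this distribution.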

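In this paper, we strengthen the above result by proving a Bernstein-type bound for the case of countably infinite alphabets:

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\log P(X_i) - E[\log P(X)]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{c_1 + c_2\epsilon}\right), \tag{1}$$

where $c_1$ and $c_2$ are computable constants that depend on the distribution of $X$.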

Concentration inequalities for entropy have been studied recently. Zhao [10] proved a Bernstein-type inequality for entropy on finite alphabets, with a convergence rate expressed in terms of the sample size and the size of the alphabet. Zhao [11] proved an exponential decay bound that improves this rate and showed that the new rate is optimal. Both papers studied inequalities for finite alphabets, while we focus on countably infinite alphabets in this work. In Section 2, we prove (1) under a mild assumption. In Section 3, we show that this assumption holds if the tail of $\{p_k\}$ decays faster than, or at the same order as, a discrete power-law distribution; conversely, the assumption cannot be satisfied if the tail decays slower than any power-law distribution. Most commonly used discrete distributions, such as the Poisson distribution, the negative binomial distribution, and the power-law distribution itself, satisfy this assumption. Furthermore, we propose a method to compute the constants in the bound (1).

## 2 Main Result

Our result requires only one assumption on $\{p_k\}$:

Assumption 1. There exists $0 < r < 1$ such that

$$\sum_{k=1}^{\infty} p_k^{1-r} \le C_r < \infty.$$

Assumption 1 implies that the tail of $\{p_k\}$ cannot be too heavy. In Section 3 we elaborate on this assumption by showing that it holds if the tail of $\{p_k\}$ is lighter than, or of the same order as, a discrete power-law distribution; conversely, it cannot be satisfied if the tail is heavier than any power-law distribution.

First note that Assumption 1 ensures the finiteness of the entropy.

###### Proposition 1.

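Under Assumption 1, $E[-\log P(X)] < \infty$.

###### Proof.

$$E[-\log P(X)] = -\sum_{k=1}^{\infty} p_k\log p_k \le \sum_{k=1}^{\infty} p_k^{1-r}\left(-p_k^{r}\log p_k\right) \le \frac{1}{er}\sum_{k=1}^{\infty} p_k^{1-r}.$$

The last inequality holds because $-p^{r}\log p$ on $[0, 1]$ is maximized at $p = e^{-1/r}$. This result can be easily verified by comparing the function value at the stationary point in $(0, 1)$, which is unique for this function, with the values on the boundaries. Here we use the convention $p^{r}\log p = 0$ at $p = 0$, which makes the function continuous on $[0, 1]$ since $\lim_{p \to 0^+} p^{r}\log p = 0$. ∎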
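The maximization step can be written out explicitly. The following is a worked sketch; the symbol $f$ is introduced here only for illustration.

```latex
\begin{aligned}
f(p) &= -p^{r}\log p, \qquad 0 < p \le 1, \\
f'(p) &= -p^{r-1}\left(r\log p + 1\right) = 0 \iff p = e^{-1/r}, \\
f\left(e^{-1/r}\right) &= -e^{-1}\cdot\left(-\frac{1}{r}\right) = \frac{1}{er}, \\
f(1) &= 0, \qquad \lim_{p \to 0^+} f(p) = 0 .
\end{aligned}
```

Since the boundary values are both zero and the only interior stationary point gives $1/(er) > 0$, the maximum over $[0, 1]$ is $1/(er)$.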

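Let $Y_i = \log P(X_i) - E[\log P(X)]$, $i = 1, \ldots, n$. The key ingredient of the proof is to bound the moment generating function (MGF) of $Y_i$, which is defined as

$$E\left[e^{\lambda Y_i}\right] = \left(\sum_{k=1}^{\infty} p_k^{\lambda+1}\right)\exp\left(-\lambda\sum_{k=1}^{\infty} p_k\log p_k\right).$$

Denote the MGF of $Y_i$ by $M_{Y_i}(\lambda)$. Under Assumption 1, $M_{Y_i}(\lambda)$ is finite for $\lambda \ge -r$ because

$$\sum_{k=1}^{\infty} p_k^{\lambda+1} \le \sum_{k=1}^{\infty} p_k^{1-r} < \infty.$$

Conversely, if Assumption 1 does not hold then $M_{Y_i}(\lambda)$ diverges for all $\lambda < 0$, because if $\sum_{k=1}^{\infty} p_k^{\lambda+1}$ converges for a certain negative $\lambda$ then $\lambda$ must be in the interval $(-1, 0)$ and one can take $r = -\lambda$.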

We now give the main result.

###### Theorem 1 (Main result).

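Under Assumption 1, that is, if there exists $0 < r < 1$ such that

$$\sum_{k=1}^{\infty} p_k^{1-r} \le C_r < \infty,$$

then for $|\lambda| < r$,

$$M_{Y_i}(\lambda) \le \exp\left(\frac{C_r\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}}\right).$$

Furthermore, for all $\epsilon > 0$,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\log P(X_i) - E[\log P(X)]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2C_r/(\sqrt{\pi}r^2) + 2\epsilon/r}\right). \tag{2}$$

###### Proof.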

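For $|\lambda| < r$,

$$\begin{aligned}
\log M_{Y_i}(\lambda) &= \log\left(\sum_{k=1}^{\infty} p_k^{\lambda+1}\right) - \lambda\sum_{k=1}^{\infty} p_k\log p_k \\
&\le \sum_{k=1}^{\infty} p_k^{\lambda+1} - 1 - \lambda\sum_{k=1}^{\infty} p_k\log p_k \\
&= \sum_{k=1}^{\infty} p_k\exp(\lambda\log p_k) - 1 - \lambda\sum_{k=1}^{\infty} p_k\log p_k \\
&= \sum_{k=1}^{\infty}\left(p_k + \lambda p_k\log p_k + \sum_{m=2}^{\infty}\frac{1}{m!}\lambda^m p_k(\log p_k)^m\right) - 1 - \lambda\sum_{k=1}^{\infty} p_k\log p_k,
\end{aligned} \tag{3}$$

where the inequality follows from $\log x \le x - 1$ for $x > 0$.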

For $m \ge 2$, it is easy to check that the minimum of $p^{r}(\log p)^m$ on $[0, 1]$ when $m$ is an odd number, and the maximum when $m$ is an even number, are achieved at $p = e^{-m/r}$, by comparing the function value at the stationary point in $(0, 1)$, which is unique, with the values on the boundaries. Here we use the convention $p^{r}(\log p)^m = 0$ at $p = 0$ as before, which makes the function continuous on $[0, 1]$ since $\lim_{p \to 0^+} p^{r}(\log p)^m = 0$.

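Therefore, for $m \ge 2$,

$$\begin{aligned}
\left|\frac{1}{m!}\lambda^m p_k(\log p_k)^m\right| &\le p_k^{1-r}\frac{1}{m!}|\lambda|^m\left|p_k^{r}(\log p_k)^m\right| \\
&\le p_k^{1-r}\frac{1}{m!}|\lambda|^m e^{-m}\left(\frac{m}{r}\right)^m \\
&= p_k^{1-r}\frac{1}{m!}\left(\frac{|\lambda|}{r}\right)^m\left(\frac{m}{e}\right)^m \\
&\le p_k^{1-r}\frac{1}{m!}\left(\frac{|\lambda|}{r}\right)^m\frac{m!}{\sqrt{2\pi m}} \\
&\le p_k^{1-r}\left(\frac{|\lambda|}{r}\right)^m\frac{1}{2\sqrt{\pi}},
\end{aligned} \tag{4}$$

where the second inequality is obtained by replacing $\left|p_k^{r}(\log p_k)^m\right|$ with its maximum and the third inequality follows from Stirling's formula (see [6] for example):

$$m! \ge \sqrt{2\pi m}\left(\frac{m}{e}\right)^m, \quad \text{for } m \ge 1.$$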
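The last two inequalities in (4) amount to $(m/e)^m/m! \le 1/\sqrt{2\pi m} \le 1/(2\sqrt{\pi})$ for $m \ge 2$. A quick numerical sanity check of this chain over a finite range, with hypothetical names and an arbitrary range chosen for illustration, might look like this:

```python
import math

def stirling_ratio(m):
    """(m/e)**m / m!, which Stirling's lower bound caps at 1/sqrt(2*pi*m)."""
    return (m / math.e) ** m / math.factorial(m)

CAP = 1.0 / (2.0 * math.sqrt(math.pi))  # the constant 1/(2*sqrt(pi)) in (4)
TOL = 1e-15                             # slack for floating-point rounding

# Check both inequalities for m = 2, ..., 50 (equality in the second at m = 2).
chain_holds = all(
    stirling_ratio(m) <= 1.0 / math.sqrt(2.0 * math.pi * m) <= CAP + TOL
    for m in range(2, 51)
)
```

This is only a finite check and does not replace the proof; it confirms that the constant $1/(2\sqrt{\pi})$ is attained by the middle term at $m = 2$ and is not exceeded for larger $m$ in the tested range.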

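It follows that for $|\lambda| < r$,

$$\left|\sum_{m=2}^{\infty}\frac{1}{m!}\lambda^m p_k(\log p_k)^m\right| \le \sum_{m=2}^{\infty}\left|\frac{1}{m!}\lambda^m p_k(\log p_k)^m\right| \le p_k^{1-r}\sum_{m=2}^{\infty}\left(\frac{|\lambda|}{r}\right)^m\frac{1}{2\sqrt{\pi}} = p_k^{1-r}\frac{\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}},$$

and

$$\sum_{k=1}^{\infty}\left|\sum_{m=2}^{\infty}\frac{1}{m!}\lambda^m p_k(\log p_k)^m\right| \le \frac{C_r\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}}.$$

Since the three terms under $\sum_{k=1}^{\infty}$ in (3) all converge absolutely for $|\lambda| < r$, one can take the sum term by term. Therefore, for $|\lambda| < r$,

$$\log M_{Y_i}(\lambda) \le \sum_{k=1}^{\infty}\sum_{m=2}^{\infty}\frac{1}{m!}\lambda^m p_k(\log p_k)^m \le \frac{C_r\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}},$$

and

$$M_{Y_i}(\lambda) \le \exp\left(\frac{C_r\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}}\right). \tag{5}$$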

The second part of the theorem follows from a standard argument using the Chernoff bound, which can be found, for example, in Chapter 2 of [8] or [9]. We give the details for completeness. For $t > 0$ and $0 < \lambda < r$,

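$$P\left(\sum_{i=1}^{n} Y_i \ge t\right) = P\left(e^{\lambda\sum_{i=1}^{n} Y_i} \ge e^{\lambda t}\right) \le \frac{\prod_{i=1}^{n} M_{Y_i}(\lambda)}{e^{\lambda t}} \le \exp\left\{\frac{nC_r\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}} - \lambda t\right\},$$

where the first inequality is Markov's inequality and the second inequality follows from (5). By setting

$$\lambda = \frac{t}{nC_r/(\sqrt{\pi}r^2) + t/r} \in (0, r),$$

we obtain

$$P\left(\sum_{i=1}^{n} Y_i \ge t\right) \le \exp\left(-\frac{t^2}{2nC_r/(\sqrt{\pi}r^2) + 2t/r}\right).$$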
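The substitution step can be verified by direct algebra. In the sketch below, the shorthand $v$ and $b$ is introduced only for this calculation and is not notation from the paper.

```latex
\begin{aligned}
&\text{Write } v = \frac{nC_r}{\sqrt{\pi}\,r^2}, \quad b = \frac{1}{r}, \quad
\text{so that the exponent is } \frac{v\lambda^2}{2(1-b\lambda)} - \lambda t
\ \text{ for } 0 < \lambda < r. \\
&\text{At } \lambda = \frac{t}{v+bt}: \qquad 1 - b\lambda = \frac{v}{v+bt}, \\
&\frac{v\lambda^2}{2(1-b\lambda)} = \frac{v t^2}{(v+bt)^2}\cdot\frac{v+bt}{2v} = \frac{t^2}{2(v+bt)},
\qquad \lambda t = \frac{t^2}{v+bt}, \\
&\frac{v\lambda^2}{2(1-b\lambda)} - \lambda t = -\frac{t^2}{2(v+bt)}
= -\frac{t^2}{2nC_r/(\sqrt{\pi}r^2) + 2t/r}.
\end{aligned}
```

The chosen $\lambda$ lies in $(0, r)$ because $t/(v + t/r) < r$ is equivalent to $t < rv + t$, which holds since $rv > 0$.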

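The left tail bound can be obtained similarly by setting $\lambda = -\dfrac{t}{nC_r/(\sqrt{\pi}r^2) + t/r} \in (-r, 0)$. Therefore,

$$P\left(\left|\sum_{i=1}^{n} Y_i\right| \ge t\right) \le 2\exp\left(-\frac{t^2}{2nC_r/(\sqrt{\pi}r^2) + 2t/r}\right).$$

Finally, letting $t = n\epsilon$,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} Y_i\right| \ge \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2C_r/(\sqrt{\pi}r^2) + 2\epsilon/r}\right).$$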
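Once $r$ and $C_r$ are known, the right-hand side of (2) is straightforward to evaluate. The sketch below is one possible way to do so; the function name and the example distribution are assumptions for illustration. For the geometric distribution $p_k = 2^{-k}$ and $r = 1/2$, the sum $\sum_k p_k^{1/2}$ is a geometric series equal to $1 + \sqrt{2}$, so $C_r$ is available in closed form.

```python
import math

def entropy_bernstein_bound(n, eps, C_r, r):
    """Right-hand side of (2): 2*exp(-n*eps^2 / (2*C_r/(sqrt(pi)*r^2) + 2*eps/r))."""
    if n <= 0 or eps <= 0 or C_r <= 0 or not (0.0 < r < 1.0):
        raise ValueError("need n > 0, eps > 0, C_r > 0 and 0 < r < 1")
    denom = 2.0 * C_r / (math.sqrt(math.pi) * r ** 2) + 2.0 * eps / r
    return 2.0 * math.exp(-n * eps ** 2 / denom)

# Illustrative example: p_k = 2**(-k), r = 1/2, so
# sum_k p_k**(1/2) = sum_k (1/sqrt(2))**k = 1 + sqrt(2).
C_half = 1.0 + math.sqrt(2.0)
bound_n1000 = entropy_bernstein_bound(n=1000, eps=0.3, C_r=C_half, r=0.5)
bound_n2000 = entropy_bernstein_bound(n=2000, eps=0.3, C_r=C_half, r=0.5)
```

The function returns the raw bound, which can exceed 1 for small $n$ or $\epsilon$; in that regime the inequality is trivially true and carries no information.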

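Theorem 1 can be generalized to $X_1, \ldots, X_n$ with independent but non-identical distributions. Let $p_{ik}$ be the probability mass of $X_i$ at $x_k$ and $E[-\log P(X_i)] = -\sum_{k=1}^{\infty} p_{ik}\log p_{ik}$ be the entropy of $X_i$. Furthermore, redefine $Y_i = \log P(X_i) - E[\log P(X_i)]$ and $M_{Y_i}(\lambda)$ accordingly. We have the following result for non-identical distributions: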

###### Corollary 1.

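If there exists $0 < r < 1$ such that

$$\sum_{k=1}^{\infty} p_{ik}^{1-r} \le C_{r,i} < \infty, \quad i = 1, \ldots, n,$$

then for $|\lambda| < r$,

$$M_{Y_i}(\lambda) \le \exp\left(\frac{C_{r,i}\lambda^2}{r^2}\cdot\frac{1}{1-|\lambda|/r}\cdot\frac{1}{2\sqrt{\pi}}\right).$$

Furthermore, for all $\epsilon > 0$,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\log P(X_i) - E[\log P(X_i)]\right)\right| \ge \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2\sum_{i=1}^{n} C_{r,i}/(n\sqrt{\pi}r^2) + 2\epsilon/r}\right).$$

The proof is the same as that of Theorem 1.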

## 3 Determining the Constants in the Bound

The radius of convergence $r$ in (4) and the upper bound $C_r$ for $\sum_{k=1}^{\infty} p_k^{1-r}$ are the only constants to be determined if one wants to use (2) as an effective upper bound for a given distribution $\{p_k\}$.

We first determine the types of distributions and the range of $r$ that make $\sum_{k=1}^{\infty} p_k^{1-r}$ converge. Intuitively speaking, for distributions that satisfy Assumption 1, the tail of $\{p_k\}$ cannot be too heavy. We make this statement precise in the following theorem.

###### Theorem 2.

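The distribution $\{p_k\}$ satisfies Assumption 1 if the tail of $\{p_k\}$ is lighter than, or of the same order as, a discrete power-law distribution; conversely, Assumption 1 cannot be satisfied if the tail is heavier than any power-law distribution. Specifically,

(i) If

$$\lim_{k\to\infty}\frac{p_k}{k^{-\alpha}} = 0, \quad \text{for all } \alpha > 1,$$

then

$$\sum_{k=1}^{\infty} p_k^{1-r} < \infty, \quad \text{for all } 0 < r < 1.$$

(ii) If

$$0 < \lim_{k\to\infty}\frac{p_k}{k^{-\alpha}} < \infty, \quad \text{for some } \alpha > 1,$$

then

$$\sum_{k=1}^{\infty} p_k^{1-r} < \infty, \quad \text{if and only if } 0 < r < 1 - 1/\alpha.$$

(iii) If

$$\lim_{k\to\infty}\frac{p_k}{k^{-\alpha}} = \infty, \quad \text{for all } \alpha > 1,$$

then

$$\sum_{k=1}^{\infty} p_k^{1-r} = \infty, \quad \text{for all } 0 < r < 1.$$

###### Proof.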

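Recall that $\sum_{k=1}^{\infty} k^{-s}$ converges for $s > 1$, and diverges for $s \le 1$. Statement (i) is obvious by taking $\alpha > 1/(1-r)$, so that $\alpha(1-r) > 1$. Statement (ii) is also obvious by noticing that the assumption implies that there exist positive constants $c_1, c_2$ such that $c_1 k^{-\alpha} \le p_k \le c_2 k^{-\alpha}$ for sufficiently large $k$. We prove (iii) by contradiction. If there exists $0 < r < 1$ such that $\sum_{k=1}^{\infty} p_k^{1-r} < \infty$, then

$$\liminf_{k\to\infty}\frac{p_k^{1-r}}{k^{-1}} = 0.$$

It implies

$$\liminf_{k\to\infty}\frac{p_k}{k^{-1/(1-r)}} = 0,$$

which contradicts the assumption since $1/(1-r) > 1$. ∎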
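Statement (ii) can be illustrated numerically. The sketch below assumes a power-law distribution $p_k = k^{-2}/\zeta(2)$ with $\zeta(2) = \pi^2/6$ (so $\alpha = 2$ and the threshold is $1 - 1/\alpha = 1/2$) and compares how much the partial sums of $p_k^{1-r}$ grow between two truncation points for $r = 0.25$ and $r = 0.5$. The distribution, truncation points, and names are illustrative choices, not part of the paper.

```python
import math

ALPHA = 2.0
NORMALIZER = math.pi ** 2 / 6.0  # zeta(2), so that the p_k sum to 1

def power_law_pmf(k):
    """Illustrative pmf: p_k = k**(-ALPHA) / zeta(ALPHA) for k = 1, 2, ..."""
    return k ** (-ALPHA) / NORMALIZER

def partial_sum(r, k_max):
    """sum_{k=1}^{k_max} p_k**(1 - r)."""
    return sum(power_law_pmf(k) ** (1.0 - r) for k in range(1, k_max + 1))

# Growth of the partial sum between k_max = 1_000 and k_max = 100_000.
growth_below_threshold = partial_sum(0.25, 100_000) - partial_sum(0.25, 1_000)
growth_at_threshold = partial_sum(0.5, 100_000) - partial_sum(0.5, 1_000)
```

For $r = 0.25$ the summand behaves like $k^{-1.5}$ and the partial sums level off, whereas for $r = 0.5$ it behaves like $k^{-1}$ and the partial sums keep growing logarithmically. A finite computation of this kind can only suggest convergence or divergence; the theorem is what establishes it.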

Theorem 2 implies that there is a wide class of discrete distributions satisfying Assumption 1, including the most commonly used ones such as the Poisson distribution, the negative binomial distribution, and the power-law distribution itself. The class even contains certain discrete random variables that do not have finite expectations. In fact, if $X$ follows a discrete power-law distribution $p_k \propto k^{-\alpha}$ on the positive integers with $1 < \alpha \le 2$, then $E[X] = \infty$ since $\sum_{k=1}^{\infty} k^{-(\alpha-1)}$ diverges. But such distributions satisfy Assumption 1 by Theorem 2 (ii).

Remark. It may be surprising, at first glance, to get an exponential decay bound for a power-law distribution, which itself is heavy-tailed. But note that (2) is a concentration bound for $\log P(X)$, not for $X$. The log-likelihood $\log P(X)$ is typically better-behaved than $X$ itself, which takes values in the non-negative integers and follows a heavy-tailed distribution. For example, for a power-law distribution with $1 < \alpha \le 2$, $E[X] = \infty$; in contrast, the entropy $E[-\log P(X)]$ is finite by Proposition 1 and Theorem 2 (ii). This phenomenon can be explained by noticing that $-\log p_k$ grows much more slowly than $k$. Moreover, the MGF of $X$ is infinite if $X$ follows a power-law distribution, while the MGF of $\log P(X)$ is finite. The tail of $\log P(X)$ is not heavy in this sense, which makes (2) possible.

Finally, we discuss how to compute $C_r$ after $r$ is selected by Theorem 2. In practice, one can compute the partial sum of $\sum_{k=1}^{\infty} p_k^{1-r}$ until the increment is negligible. The value obtained in this way, however, is a lower bound for $\sum_{k=1}^{\infty} p_k^{1-r}$, and a generic truncation error bound does not exist for positive infinite series because, in principle, the tail behavior cannot be predicted from a finite number of terms. (This issue is minor in practice, especially when $p_k$ decays exponentially: the series usually converges very fast in this case, and there is nothing wrong with taking the partial sum until the increment is negligible. The method in Theorem 3 is useful to someone who needs a rigorous upper bound.)

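If the tail of $\{p_k\}$ is dominated by a power-law distribution, we propose a method that can compute an upper bound for $\sum_{k=1}^{\infty} p_k^{1-r}$ at any tolerance level. Specifically, the next theorem shows how to compute an upper bound $C_r$ for $\sum_{k=1}^{\infty} p_k^{1-r}$ with $C_r - \sum_{k=1}^{\infty} p_k^{1-r}$ smaller than a pre-specified tolerance level $\epsilon$ if we find $k_0$ such that $p_k \le c\,k^{-\alpha}$ for $k \ge k_0$. Note that such $k_0$ exists if $\{p_k\}$ satisfies the condition in (i) or (ii) in Theorem 2.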

###### Theorem 3.

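Suppose $k_0$ is a positive integer such that $p_k \le c\,k^{-\alpha}$ for certain $c > 0$ and all $k \ge k_0$, where $\alpha > 1$. Pick $r$ such that $0 < r < 1 - 1/\alpha$, and let $c_0 = c^{1-r}$. For all $\epsilon > 0$, let

$$k_1 = \max\left\{k_0, \left\lceil\left(\frac{\epsilon\left(\alpha(1-r)-1\right)}{c_0}\right)^{-\frac{1}{\alpha(1-r)-1}}\right\rceil\right\},$$

where $\lceil\cdot\rceil$ means rounding up to the next integer. Then

$$C_r = \sum_{k=1}^{k_1} p_k^{1-r} + \epsilon$$

satisfies

$$0 \le C_r - \sum_{k=1}^{\infty} p_k^{1-r} \le \epsilon.$$

###### Proof.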

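We only need to bound the tail sum $\sum_{k=k_1+1}^{\infty} p_k^{1-r}$.

$$\begin{aligned}
\sum_{k=k_1+1}^{\infty} p_k^{1-r} &\le c_0\sum_{k=k_1+1}^{\infty} k^{-\alpha(1-r)} \\
&= c_0\sum_{k=k_1}^{\infty}\int_{k}^{k+1}(k+1)^{-\alpha(1-r)}\,dx \\
&\le c_0\int_{k_1}^{\infty} x^{-\alpha(1-r)}\,dx \\
&= \frac{c_0}{\alpha(1-r)-1}\,k_1^{-(\alpha(1-r)-1)} \le \epsilon,
\end{aligned}$$

where the first inequality holds because $p_k^{1-r} \le c_0 k^{-\alpha(1-r)}$ for all $k \ge k_0$ and the last inequality holds because $k_1 \ge \left(\epsilon(\alpha(1-r)-1)/c_0\right)^{-1/(\alpha(1-r)-1)}$.

Therefore,

$$\sum_{k=1}^{\infty} p_k^{1-r} = \sum_{k=1}^{k_1} p_k^{1-r} + \sum_{k=k_1+1}^{\infty} p_k^{1-r} \le \sum_{k=1}^{k_1} p_k^{1-r} + \epsilon.$$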
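The construction in Theorem 3 translates directly into a short procedure. The following is a minimal sketch, assuming the caller supplies a pmf together with constants $c$, $\alpha$, $k_0$ such that $p_k \le c\,k^{-\alpha}$ for $k \ge k_0$; the function name and the example distribution ($p_k = k^{-2}/\zeta(2)$, for which the domination holds with equality and $k_0 = 1$) are illustrative assumptions.

```python
import math

def rigorous_upper_bound(pmf, r, alpha, c, k0, eps):
    """Return (k1, C_r) as in Theorem 3, assuming p_k <= c * k**(-alpha) for k >= k0."""
    if not (alpha > 1.0 and 0.0 < r < 1.0 - 1.0 / alpha):
        raise ValueError("need alpha > 1 and 0 < r < 1 - 1/alpha")
    s = alpha * (1.0 - r)   # exponent of the dominating series, s > 1
    c0 = c ** (1.0 - r)     # so that p_k**(1-r) <= c0 * k**(-s)
    k1 = max(k0, math.ceil((eps * (s - 1.0) / c0) ** (-1.0 / (s - 1.0))))
    partial = sum(pmf(k) ** (1.0 - r) for k in range(1, k1 + 1))
    return k1, partial + eps

# Illustrative example: p_k = k**(-2) / zeta(2), with zeta(2) = pi**2 / 6.
c_example = 6.0 / math.pi ** 2

def example_pmf(k):
    return c_example * k ** (-2.0)

k1_example, C_r_example = rigorous_upper_bound(
    example_pmf, r=0.25, alpha=2.0, c=c_example, k0=1, eps=0.01
)

# A longer partial sum is a lower bound for the infinite series.
lower_bound = sum(example_pmf(k) ** 0.75 for k in range(1, 200_001))
```

The returned `C_r_example` can be passed as $C_r$ to the bound (2). Note that the number of terms $k_1$ grows like $\epsilon^{-1/(\alpha(1-r)-1)}$, so a value of $r$ close to $1 - 1/\alpha$ makes a tight tolerance expensive.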

## Acknowledgements

This research was supported by the National Science Foundation grant DMS-1840203.

## References

• [1] Baccetti, V. and Visser, M. (2013). Infinite Shannon entropy. Journal of Statistical Mechanics: Theory and Experiment, 2013(04):P04010.
• [2] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.
• [3] Csiszár, I. and Körner, J. (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press.
• [4] Dubhashi, D. P. and Panconesi, A. (2009). Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press.
• [5] Raginsky, M. and Sason, I. (2012). Concentration of measure inequalities in information theory, communications and coding. arXiv preprint arXiv:1212.4663.
• [6] Robbins, H. (1955). A remark on Stirling's formula. The American Mathematical Monthly, 62(1):26–29.
• [7] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423.
• [8] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.
• [9] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
• [10] Zhao, Y. (2020a). A note on new Bernstein-type inequalities for the log-likelihood function of Bernoulli variables. Statistics & Probability Letters, page 108779.
• [11] Zhao, Y. (2020b). On optimal uniform concentration inequalities for discrete entropy in the high-dimensional setting. arXiv preprint arXiv:2007.04547.