1 Introduction
Concentration inequalities provide powerful tools for many subjects, including information theory [5], algorithm analysis [4], and statistics [9, 8]. The goal of the present paper is to prove an exponential decay bound with computable constants for the difference between the negative log-likelihood of discrete random variables and the Shannon entropy, both defined on a countably infinite alphabet.
Let $X$ be a discrete random variable on a countably infinite alphabet $\mathcal{X} = \{x_1, x_2, \dots\}$. Let $p_k = P(X = x_k)$ be the probability mass at $x_k$. Assume, without loss of generality, that $p_k > 0$ for each $k$; otherwise, simply remove every $x_k$ with $p_k = 0$ from $\mathcal{X}$. Let $p(X)$ be the probability mass function evaluated at $X$, which is a random variable with $p(X) = p_k$ if $X = x_k$, $k = 1, 2, \dots$. Then $H = E[-\log p(X)] = -\sum_{k=1}^{\infty} p_k \log p_k$ is the Shannon entropy, which is a key concept in information theory [7, 2]. (Throughout the paper, "log" denotes the natural logarithm.) Note that neither $p(X)$ nor the entropy depends on the elements in $\mathcal{X}$. In fact, $\mathcal{X}$ is not necessarily a set of real numbers. The set can contain generic symbols such as letters, and is therefore called an alphabet.

Entropy on countably infinite alphabets is not always finite. We give a simple sufficient condition ensuring its finiteness at the beginning of Section 2, which is also the key assumption for the main result of the paper. Readers are referred to [1] for a more thorough discussion of conditions for finiteness of entropy on countably infinite alphabets.
Let $X_1, \dots, X_n$ be independently and identically distributed (i.i.d.) copies of $X$. Then $\sum_{i=1}^{n} \log p(X_i)$ is the joint log-likelihood of $X_1, \dots, X_n$. By the weak law of large numbers,
$$-\frac{1}{n}\sum_{i=1}^{n} \log p(X_i) \to H \quad \text{in probability},$$
provided that the entropy is finite. This result, particularly for the case of $\mathcal{X}$ being finite, is called the asymptotic equipartition property in the information theory literature, and it is the foundation of many important results in this field [2, 3].
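The convergence can be illustrated numerically. The sketch below is not part of the paper: it assumes a geometric distribution $p_k = q(1-q)^{k-1}$ on $k = 1, 2, \dots$, chosen only because its entropy has a closed form, and all function names are hypothetical.

```python
import math
import random

def geometric_entropy(q):
    """Closed-form entropy (natural log) of p_k = q (1 - q)^(k - 1), k >= 1."""
    return (-(1 - q) * math.log(1 - q) - q * math.log(q)) / q

def sample_geometric(q, rng):
    """Draw one geometric variate on {1, 2, ...} by inverting the CDF."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log(1 - q)) + 1

def log_pmf(k, q):
    """log p_k for the geometric distribution."""
    return math.log(q) + (k - 1) * math.log(1 - q)

def avg_neg_loglik(n, q, seed=0):
    """Average negative log-likelihood of n i.i.d. geometric draws."""
    rng = random.Random(seed)
    total = sum(log_pmf(sample_geometric(q, rng), q) for _ in range(n))
    return -total / n

if __name__ == "__main__":
    q = 0.3
    print("entropy:", geometric_entropy(q))
    for n in (100, 10_000, 100_000):
        print(n, avg_neg_loglik(n, q))
```

As $n$ grows, the printed averages should settle near the entropy, which is the behavior the weak law predicts.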
In this paper, we strengthen the above result by proving a Bernstein-type bound for the case of countably infinite alphabets:
(1) 
where and are computable constants that depend on .
Concentration inequalities for entropy have been studied recently. Zhao [10] proved a Bernstein-type inequality for entropy on finite alphabets with a convergence rate that depends on the sample size and the size of the alphabet. Zhao [11] proved an exponential decay bound that improves this rate and showed that the new rate is optimal. Both papers studied inequalities for finite alphabets, while we focus on countably infinite alphabets in this work. In Section 2, we prove (1) under a mild assumption. In Section 3, we show that this assumption holds if the tail of the distribution decays faster than, or on the same order as, a discrete power-law distribution; conversely, the assumption cannot be satisfied if the tail decays slower than any power-law distribution. Most commonly used discrete distributions, such as the Poisson distribution, the negative binomial distribution, and the power-law distribution itself, satisfy this assumption. Furthermore, we propose a method to compute the constants in the bound (1).
2 Main Result
Our result requires only one assumption on the distribution of $X$:
Assumption 1. There exists $0 < r < 1$ such that
$$\sum_{k=1}^{\infty} p_k^{1-r} < \infty.$$
Assumption 1 implies that the tail of the distribution cannot be too heavy. In Section 3 we elaborate on this assumption by showing that it holds if the tail is lighter than, or of the same order as, that of a discrete power-law distribution; conversely, it cannot be satisfied if the tail is heavier than that of any power-law distribution.
First note that Assumption 1 ensures the finiteness of the entropy.
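Proposition 1.
Under Assumption 1, $H < \infty$.
Proof.
$$H = -\sum_{k=1}^{\infty} p_k \log p_k = \sum_{k=1}^{\infty} p_k^{1-r}\left(-p_k^{r}\log p_k\right) \le \frac{1}{er}\sum_{k=1}^{\infty} p_k^{1-r} < \infty.$$
The first inequality holds because $-p^{r}\log p$ on $[0,1]$ is maximized at $p = e^{-1/r}$, where it equals $1/(er)$. This result can be easily verified by comparing the function value at the stationary point in $(0,1)$, which is unique for this function, with the values on the boundaries. Here we use the convention $p^{r}\log p = 0$ at $p = 0$, which makes the function continuous on $[0,1]$ since $\lim_{p \to 0^{+}} p^{r}\log p = 0$. ∎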
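As a worked check of the maximization step, the following derivation assumes that the function in question is $f(p) = -p^{r}\log p$ with $0 < r < 1$; the symbol $f$ is introduced here only for illustration.

```latex
\begin{align*}
f(p) &= -p^{r}\log p, \qquad 0 < p \le 1,\\
f'(p) &= -r p^{r-1}\log p - p^{r-1} = -p^{r-1}\,(r\log p + 1),\\
f'(p) = 0 &\iff \log p = -\tfrac{1}{r} \iff p = e^{-1/r},\\
f\!\left(e^{-1/r}\right) &= -e^{-1}\cdot\left(-\tfrac{1}{r}\right) = \frac{1}{er},\\
f(1) &= 0, \qquad \lim_{p\to 0^{+}} f(p) = 0 .
\end{align*}
```

Since $f$ vanishes at both endpoints and is positive in between, the stationary value $1/(er)$ is the maximum on $[0,1]$.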
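Let $Y = \log p(X)$. The key ingredient of the proof is to bound the moment generating function (MGF) of $Y$, which is defined as
$$M(\lambda) = E\left[e^{\lambda Y}\right] = \sum_{k=1}^{\infty} p_k^{1+\lambda}.$$
Under Assumption 1, $M(\lambda)$ is finite for $\lambda \ge -r$ because
$$\sum_{k=1}^{\infty} p_k^{1+\lambda} \le \sum_{k=1}^{\infty} p_k^{1-r} < \infty.$$
Conversely, if Assumption 1 does not hold then $M(\lambda)$ diverges for all $\lambda < 0$, because if $M(\lambda)$ converges for a certain negative $\lambda$ then $\lambda$ must be in the interval $(-1, 0)$ and one can take $r = -\lambda$.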
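To make the MGF concrete, note that $E[e^{\lambda \log p(X)}] = E[p(X)^{\lambda}] = \sum_k p_k^{1+\lambda}$ for any distribution with masses $p_k$. The sketch below evaluates this sum for a geometric distribution $p_k = q(1-q)^{k-1}$, $k \ge 1$, which is an illustrative assumption and not an example taken from the paper, and compares a truncated sum with the geometric-series closed form. Function names are hypothetical.

```python
import math

def mgf_partial_sum(lam, q, n_terms=5000):
    """Truncated sum of p_k^(1 + lam) for p_k = q (1 - q)^(k - 1), k >= 1."""
    s = 1.0 + lam
    return sum((q * (1 - q) ** (k - 1)) ** s for k in range(1, n_terms + 1))

def mgf_closed_form(lam, q):
    """Geometric-series value q^s / (1 - (1 - q)^s) with s = 1 + lam > 0."""
    s = 1.0 + lam
    if s <= 0:
        raise ValueError("the sum diverges for lam <= -1")
    return q ** s / (1.0 - (1.0 - q) ** s)

if __name__ == "__main__":
    q = 0.3
    for lam in (-0.5, -0.25, 0.0, 0.5):
        print(lam, mgf_partial_sum(lam, q), mgf_closed_form(lam, q))
```

For this distribution the sum is finite for every $\lambda > -1$, so any exponent $1 - r$ with $0 < r < 1$ gives a convergent series.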
We now give the main result.
Theorem 1 (Main result).
Under Assumption 1, that is, if there exists $0 < r < 1$ such that
$$\sum_{k=1}^{\infty} p_k^{1-r} < \infty,$$
then for ,
Furthermore, for all ,
(2) 
Proof.
For ,
(3) 
where the inequality follows from for .
For any positive integer $m$, it is easy to check that the minimum of $p^{r}\log^{m} p$ on $[0,1]$ when $m$ is an odd number, and the maximum when $m$ is an even number, are achieved at $p = e^{-m/r}$, by comparing the function value at the stationary point in $(0,1)$, which is unique, with the values on the boundaries. Here we use the convention $p^{r}\log^{m} p = 0$ at $p = 0$ as before, which makes the function continuous on $[0,1]$ since $\lim_{p \to 0^{+}} p^{r}\log^{m} p = 0$. Therefore, for ,
(4) 
where the second inequality is obtained by replacing with its maximum and the third inequality follows from Stirling’s formula (see [6] for example):
It follows that for ,
and
Since the three terms under in (3) all converge absolutely for , one can take the sum term by term. Therefore, for ,
and
(5) 
The second part of the theorem follows from a standard argument using the Chernoff bound, which can be found in Chapter 2 of [9]. We give the details for completeness. For and ,
where the first inequality is Markov’s inequality and the second inequality follows from (5). By setting
we obtain
The left tail bound can be obtained similarly by setting . Therefore,
Finally, letting ,
∎
Theorem 1 can be generalized to with independent but nonidentical distributions. Let be the probability mass of at and be the entropy of . Furthermore, redefine and accordingly. We have the following result for nonidentical distributions:
Corollary 1.
If there exists such that
then for ,
Furthermore, for all ,
The proof is the same as that of Theorem 1.
3 Determining the Constants in the Bound
The radius of convergence $r$ in (4) and the upper bound for $\sum_{k} p_k^{1-r}$ are the only constants to be determined if one wants to use (2) as an effective upper bound for a given distribution.
We first determine the types of distributions and the range of $r$ for which $\sum_{k} p_k^{1-r}$ converges. Intuitively speaking, for distributions that satisfy Assumption 1, the tail cannot be too heavy. We make this statement precise in the following theorem.
Theorem 2.
The distribution of $X$ satisfies Assumption 1 if its tail is lighter than, or of the same order as, that of a discrete power-law distribution; conversely, Assumption 1 cannot be satisfied if the tail is heavier than that of any power-law distribution. Specifically,

(i) If
then

(ii) If
then

(iii) If
then
Proof.
Recall that converges for , and diverges for . Statement (i) is obvious by taking . Statement (ii) is also obvious by noticing that the assumption implies that there exist positive constants such that for sufficiently large . We prove (iii) by contradiction. If there exists such that , then
It implies
which contradicts the assumption since . ∎
Theorem 2 implies that a wide class of discrete distributions satisfies Assumption 1, including the most commonly used ones such as the Poisson distribution, the negative binomial distribution, and the power-law distribution itself. The class even contains certain discrete random variables that do not have finite expectations. In fact, if $X$ follows a discrete power-law distribution $P(X = k) \propto k^{-\alpha}$ with $1 < \alpha \le 2$, then $E[X] = \infty$ since $\sum_{k} k^{1-\alpha}$ diverges. But such distributions satisfy Assumption 1 by Theorem 2 (ii).
Remark. It may be surprising, at first glance, to get an exponential decay bound for a power-law distribution, which itself is heavy-tailed. But note that (2) is a concentration bound for $\log p(X)$, not for $X$. The log-likelihood $\log p(X)$ is typically better behaved than $X$ itself, which takes values on nonnegative integers and follows a heavy-tailed distribution. For example, for a power-law distribution with exponent $1 < \alpha \le 2$, $E[X] = \infty$; on the contrary, the entropy $E[-\log p(X)]$ is finite by Proposition 1 and Theorem 2 (ii). This phenomenon can be explained by noticing that $-\log p_k$ grows much slower than $k$. Moreover, the MGF of $X$ is infinite if $X$ follows a power-law distribution while the MGF of $\log p(X)$ is finite. The tail of $\log p(X)$ is not heavy in this sense, which makes (2) possible.
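The contrast in the remark can be seen numerically. The sketch below assumes the specific power law $p_k = k^{-2}/\zeta(2)$ on $k = 1, 2, \dots$, where $\zeta(2) = \pi^2/6$; the exponent 2 is picked only because the normalizing constant is known exactly, and the function names are hypothetical.

```python
import math

ZETA_2 = math.pi ** 2 / 6  # normalizing constant of p_k = k^(-2) / zeta(2)

def pmf(k):
    """Power-law mass at k >= 1 with exponent 2."""
    return 1.0 / (ZETA_2 * k * k)

def partial_mean(n_terms):
    """Partial sum of k * p_k; grows like a harmonic sum, so it diverges."""
    return sum(k * pmf(k) for k in range(1, n_terms + 1))

def partial_entropy(n_terms):
    """Partial sum of -p_k log p_k; converges as n_terms grows."""
    return -sum(pmf(k) * math.log(pmf(k)) for k in range(1, n_terms + 1))

if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, partial_mean(n), partial_entropy(n))
```

The partial means keep increasing roughly in proportion to $\log n$, while the partial entropies change only in the second or third decimal place.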
Finally, we discuss how to compute $\sum_{k} p_k^{1-r}$ after $r$ is selected by Theorem 2. In practice, one can compute the partial sum of $\sum_{k} p_k^{1-r}$ until the increment is negligible. The value obtained in this way, however, is a lower bound for $\sum_{k} p_k^{1-r}$, and a generic truncation error bound does not exist for positive infinite series because, in principle, the tail behavior cannot be predicted from a finite number of terms. (This issue is minor in practice, especially when $p_k$ drops exponentially: the series usually converges very fast in this case, and there is nothing wrong with taking the partial sum until the increment is negligible. The method in Theorem 3 is useful to someone who needs a rigorous upper bound.)
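The practical recipe of summing until the increment is negligible might look as follows. This is a minimal sketch under stated assumptions: the Poisson distribution, the value $r = 0.5$, the tolerance, and all function names are illustrative choices, not prescriptions from the paper.

```python
import math

def power_sum_until_negligible(pmf_terms, r, tol=1e-12, max_terms=10**6):
    """Add p_k^(1 - r) over an iterable of masses until one increment is below tol.

    The result is a lower bound on the full series, as the text points out.
    """
    total = 0.0
    for n, p in enumerate(pmf_terms, start=1):
        increment = p ** (1.0 - r)
        total += increment
        if increment < tol or n >= max_terms:
            break
    return total

def poisson_pmf_terms(mu):
    """Yield p_0, p_1, ... of a Poisson(mu) law via p_k = p_(k-1) * mu / k."""
    p, k = math.exp(-mu), 0
    while True:
        yield p
        k += 1
        p *= mu / k

if __name__ == "__main__":
    print(power_sum_until_negligible(poisson_pmf_terms(3.0), r=0.5))
```

Because the rule inspects only the latest increment, it can stop too early when the first terms are already tiny (for example, a Poisson law with a very large mean), which is one more reason to prefer a rigorous bound when one is available.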
If the tail of the distribution is dominated by a power-law distribution, we propose a method that can compute an upper bound for $\sum_{k} p_k^{1-r}$ at any tolerance level. Specifically, the next theorem shows how to compute an upper bound whose error is smaller than a prespecified tolerance level, provided that we can find a power-law bound on $p_k$ that holds for all sufficiently large $k$. Note that such a bound exists if the distribution satisfies the condition in (i) or (ii) in Theorem 2.
Theorem 3.
Suppose is a positive integer such that for certain and all , where . Pick such that . For all , let
where $\lceil \cdot \rceil$ means rounding up to the next integer. Then
satisfies
Proof.
We only need to bound the tail probability for .
where the first inequality holds because for all and the last inequality holds because .
Therefore,
∎
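The idea of bounding the tail by a power law can be sketched in code. The version below is an illustration and not necessarily the exact expression in Theorem 3: it assumes the alphabet is indexed by $k = 1, 2, \dots$, that $p_k \le C k^{-\alpha}$ for all $k \ge K$, and that $r$ satisfies $s = \alpha(1-r) > 1$. It then uses the integral-test bound $\sum_{k > N} k^{-s} \le N^{1-s}/(s-1)$. The function and parameter names are hypothetical.

```python
import math

def upper_bound_power_sum(pmf, C, alpha, r, K, tol):
    """Return (upper_bound, N) with upper_bound - sum_k pmf(k)^(1-r) <= tol.

    Assumes pmf(k) <= C * k**(-alpha) for all k >= K and alpha * (1 - r) > 1.
    """
    s = alpha * (1.0 - r)
    if s <= 1.0:
        raise ValueError("need alpha * (1 - r) > 1 for the tail to be summable")
    scale = C ** (1.0 - r) / (s - 1.0)
    # Smallest N >= K whose integral-test tail bound scale * N^(1 - s) is <= tol.
    N = max(K, math.ceil((scale / tol) ** (1.0 / (s - 1.0))))
    partial = sum(pmf(k) ** (1.0 - r) for k in range(1, N + 1))
    tail_bound = scale * N ** (1.0 - s)
    return partial + tail_bound, N

if __name__ == "__main__":
    C = 90.0 / math.pi ** 4  # 1 / zeta(4), so p_k = C / k^4 sums to one
    bound, N = upper_bound_power_sum(lambda k: C / k ** 4, C, 4.0, 0.5, 1, 1e-3)
    print(bound, N, math.sqrt(90.0) / 6.0)  # last value is the exact sum here
```

In this example the exact value is available because $\sum_k p_k^{1/2} = \sqrt{C}\,\zeta(2)$, which makes it easy to confirm that the returned number is an upper bound within the requested tolerance.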
Acknowledgements
This research was supported by the National Science Foundation grant DMS-1840203.
References
[1] Baccetti, V. and Visser, M. (2013). Infinite Shannon entropy. Journal of Statistical Mechanics: Theory and Experiment, 2013(04):P04010.
[2] Cover, T. M. and Thomas, J. A. (2006). Elements of information theory, 2nd edition (Wiley series in telecommunications and signal processing). Wiley-Interscience.
[3] Csiszár, I. and Körner, J. (2011). Information theory: coding theorems for discrete memoryless systems. Cambridge University Press.
[4] Dubhashi, D. P. and Panconesi, A. (2009). Concentration of measure for the analysis of randomized algorithms. Cambridge University Press.
[5] Raginsky, M. and Sason, I. (2012). Concentration of measure inequalities in information theory, communications and coding. arXiv preprint arXiv:1212.4663.
[6] Robbins, H. (1955). A remark on Stirling's formula. The American Mathematical Monthly, 62(1):26–29.
[7] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423.
[8] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press.
[9] Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
[10] Zhao, Y. (2020a). A note on new Bernstein-type inequalities for the log-likelihood function of Bernoulli variables. Statistics & Probability Letters, page 108779.
[11] Zhao, Y. (2020b). On optimal uniform concentration inequalities for discrete entropy in the high-dimensional setting. arXiv preprint arXiv:2007.04547.