1 Main result
As a powerful toolset in probability theory [3, 11], concentration inequalities have wide applications in statistics [18, 16], information theory, and algorithm analysis. The information entropy, or just entropy, is one of the central concepts in information theory. The goal of the present paper is to prove a concentration inequality with the optimal rate to bound the tail probability of the difference between the log-likelihood of discrete random variables and the negative entropy when the number of possible values of the variable grows. In this section, we first state the main result of the paper. We will explain the motivation of this work and review related work in Section 2.
Let be a discrete random variable with possible values and probability mass function . The negative entropy of is defined as . Note that the definition of entropy does not depend on the values of the variable but only depends on the probabilities of taking each value. (Throughout the paper, “log” denotes the natural logarithm.) One can therefore equivalently define entropy on a categorical variable. Let be a dummy coding of a categorical variable with categories, in which one and only one entry is 1 and the others are 0. Let with . The log-likelihood of is and the negative entropy of is .
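As a minimal sketch of these two definitions, the snippet below computes the negative entropy and the log-likelihood of a one-hot (dummy-coded) observation. The function names `neg_entropy` and `log_likelihood`, the example vector `p`, and the standard formulas (sum of p_k log p_k and sum of x_k log p_k) are assumptions introduced here for illustration, since the displayed formulas are not reproduced in the text above.

```python
import math

def neg_entropy(p):
    """Negative entropy sum_k p_k log p_k (natural log), with 0 log 0 taken as 0."""
    return sum(pk * math.log(pk) for pk in p if pk > 0)

def log_likelihood(x, p):
    """Log-likelihood sum_k x_k log p_k of a one-hot vector x under probabilities p.

    Raises ValueError if x selects a category with p_k = 0 (a probability-zero event).
    """
    return sum(xk * math.log(pk) for xk, pk in zip(x, p) if xk > 0)

# Hypothetical example: three categories, observation falls in the second one.
p = [0.5, 0.25, 0.25]
x = [0, 1, 0]

print(log_likelihood(x, p))  # log-likelihood of this single observation
print(neg_entropy(p))        # its expectation over x, i.e. the negative entropy
```

Averaging `log_likelihood` over the three possible one-hot vectors, weighted by `p`, gives `neg_entropy(p)`, which is the sense in which the negative entropy is the expectation of the log-likelihood.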
Given a sequence of independent and identically distributed random variables , a natural question is to derive a concentration bound for the difference between the mean of log-likelihoods of and its expectation, i.e., the negative entropy. We consider a slightly more general setting, in which the variables are assumed to be independent but not necessarily identical. Specifically, let follow a categorical distribution with parameters where . We assume are independent but can be different for each .
We are interested in deriving an exponential decay concentration bound, which holds uniformly over parameter values , for the tail probability of
Specifically, let . We aim to derive a bound for
where “sup” is understood as taking the supremum over all possible values of in , not the maximum of values.
We now give the main theorem.
Theorem 1 (Main result).
For sufficiently small positive and ,
Furthermore, if , for all ,
Most results in the paper in fact hold for . The corresponding log-likelihood is, however, non-random, so we exclude this trivial case.
We comment on the contributions of the main theorem before proceeding. Firstly, as mentioned above, inequality (2) is uniform over parameter values as the right hand side does not depend on . Note that we do not assume are bounded away from the boundaries. (We use the convention $0 \log 0 = 0$, which is consistent with the limit $\lim_{p \to 0^+} p \log p = 0$.) Removing this restriction is a challenge and a significant contribution of this paper. Secondly, inequality (2) implies that if , for all ,
The rate falls into the high-dimensional setting – that is, the number of parameters can grow much faster than the sample size. Thirdly, (3) and (4) imply that the rate is optimal in the asymptotic sense.
2 Motivation and related work
Consider the most classical case where is fixed and are independently and identically distributed with
. By the law of large numbers,
This result, called the asymptotic equipartition property (AEP), is one of the most classical results in information theory. The AEP has been generalized to stationary ergodic processes [12, 5] and is referred to as the Shannon-McMillan-Breiman theorem.
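The law-of-large-numbers statement behind the AEP can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the probability vector `p`, the sample sizes, and the helper names are hypothetical and not taken from the paper.

```python
import math
import random

def neg_entropy(p):
    """Negative entropy sum_k p_k log p_k, with 0 log 0 taken as 0."""
    return sum(pk * math.log(pk) for pk in p if pk > 0)

def mean_log_likelihood(p, n, rng):
    """Average of log p(X_i) over n i.i.d. draws X_i from the categorical distribution p."""
    draws = rng.choices(range(len(p)), weights=p, k=n)
    return sum(math.log(p[k]) for k in draws) / n

rng = random.Random(0)       # fixed seed so the run is reproducible
p = [0.5, 0.3, 0.2]          # hypothetical fixed distribution with K = 3

gaps = {}
for n in (100, 10_000, 100_000):
    gaps[n] = abs(mean_log_likelihood(p, n, rng) - neg_entropy(p))
    print(n, gaps[n])        # the gap typically shrinks as n grows
```

The simulation only shows convergence for one fixed distribution; it says nothing about uniformity over the parameter values, which is the harder question studied in the paper.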
We improve the AEP for independent variables from a different perspective. We aim to prove a non-asymptotic concentration inequality, more specifically, an exponential decay bound, for the tail probability. The study of exponential decay concentration inequalities for sums of binary variables dates back to at least the 1920s. The Chernoff-Hoeffding theorem gives the sharpest bound that can be derived by the Chernoff bound technique for sums of independent Bernoulli variables. Bernstein’s inequality and Hoeffding’s inequality give concentration bounds with more tractable forms and can be generalized to bounded variables and more general settings, such as sub-Gaussian variables and sub-exponential variables.
Most of these studies focused on tail bounds for sums of variables. There is a lack of research on concentration inequalities for log-likelihoods, i.e., sums weighted by logarithms of the parameters. Uniform bounds that are independent of parameter values are particularly under-explored, despite their applications in statistics [6, 13, 20].
We first discuss the difficulty in classical results when applied to log-likelihoods and then explain the motivation of the present research. For , and , . Then . Assume . By Bernstein’s inequality (see , Theorem 1.2), for all ,
The reader is referred to for an application of (5) to community detection in networks. A drawback of (5) is that the condition requires to be bounded away from 0 and 1. Otherwise, the bound can become trivial if grows with too fast. One may apply alternative forms of Bernstein’s inequality (for example, Theorem 2.8.2 in ) or other commonly used concentration inequalities to log-likelihoods, but a similar problem arises.
The essential problem is that should not be treated as an arbitrary set of coefficients because is also a part of the model that controls the probabilistic behavior of . To the best of our knowledge, Zhao first overcame this technical difficulty. The paper removed the constraint and proved a new Bernstein-type bound that does not depend on :
Theorem 2 (, Corollary 1).
For and ,
The above theorem implies that if , for all ,
Our goal is to improve the above rate. We approach this problem by first considering an elementary probability inequality – Chebyshev’s inequality:
By taking the first and the second derivatives of , one can easily prove
. We can therefore give a rough estimate of the right hand side of (6):
First note that the above bound is independent of , which is in line with the observation in . Moreover, the bound implies that if , for all ,
which clearly suggests that there is room for improvement in Theorem 2 when is large.
The above estimate of is rough because it ignores the constraint . In Section 4, we will show that the correct order of is .
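The rough estimate above can be checked numerically. The sketch below assumes that the function being differentiated is g(p) = p (log p)^2, whose derivative is log p (log p + 2), so that its maximum on [0, 1] is 4/e^2 at p = e^{-2}; this reading of the elided formula, and the names `g`, `second_moment`, and `K`, are assumptions made here. Summing the per-category maximum gives a bound that is linear in the number of categories, whereas at the uniform distribution the second moment equals (log K)^2 exactly.

```python
import math

def g(p):
    """g(p) = p * (log p)^2, extended by continuity to g(0) = 0."""
    return p * math.log(p) ** 2 if p > 0 else 0.0

# Grid search for the maximizer of g on [0, 1].
grid = [i / 100_000 for i in range(100_001)]
p_star = max(grid, key=g)
print(p_star, g(p_star))               # numerical maximizer and maximum
print(math.exp(-2), 4 / math.e ** 2)   # closed form from g'(p) = log p * (log p + 2)

def second_moment(p):
    """E[(log p(X))^2] = sum_k p_k (log p_k)^2 for a categorical distribution p."""
    return sum(g(pk) for pk in p)

K = 1000                               # hypothetical number of categories
uniform = [1 / K] * K
rough_bound = K * 4 / math.e ** 2      # ignores the constraint sum_k p_k = 1
print(second_moment(uniform), math.log(K) ** 2, rough_bound)
```

The gap between the last two printed quantities illustrates how much is lost by bounding each category separately without using the constraint that the probabilities sum to one.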
In Section 4, we elaborate on the main theorem and prove it through a series of lemmas and theorems. Our approach relies on bounding certain moment generating functions (MGFs). To bound the MGFs, we borrow the idea of primal and dual problems from the optimization literature. In Section 5, we extend the results to misspecified log-likelihoods for grouped random variables.
3 Inequalities for fixed parameters
We prove concentration inequalities for fixed in this section, where each is an interior point of . The proofs are not challenging, but, to the best of our knowledge, the results do not appear in the literature in the form we present below, so we include them for completeness. Moreover, the proofs shed light on the asymmetry of the two sides of the bound and the challenge in proving the uniform bound.
Let . Let be the MGF of . (Note that the MGF is well defined on the boundaries of for . Specifically, for , which is continuous at 0 because for .)
Theorem 3 (Right-tail bound for fixed parameters).
For and ,
For and ,
First note two elementary inequalities: for , and for . For ,
The rest of the proof follows from a standard Chernoff bound argument on sub-Gaussian variables (for example, see Chapter 2 of ). For , by Markov’s inequality,
We obtain the result by letting . ∎
Theorem 4 (Left-tail bound for fixed parameters).
Let . For and ,
For and ,
the second inequality follows from and for . Therefore, , , for .
The rest of the proof follows from a standard Chernoff bound argument on sub-exponential variables. For , by an argument similar to (9),
We obtain the result by letting for and for . ∎
A key difference between the two results is in the second inequality of (8) and (12), where in the two theorems have opposite signs. The inequality in (8) always holds since , but the inequality in (12) cannot be true for a large when because of the exponential growth. Therefore, it is not difficult to find a quantity that does not depend on to bound the right-tail probability according to the discussion in Section 2 (obtaining the optimal order is, however, nontrivial and will be shown in Section 4). Finding a uniform bound for the left-tail probability is, however, more challenging since a positive does not exist in Theorem 4 if for some . In Section 4, we develop a new technique to uniformly control when .
The asymmetry of the left and right tails can be understood by the following heuristic argument. Note that can only contribute to the positive part of when . The contribution is, however, negligible when is close to 0. On the other hand, contributes to the negative part of when , and the contribution blows up when is close to 0.
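The heuristic can be made concrete with a few numbers. The sketch below assumes that the quantity in question is the centered log-likelihood, whose term for a single category with indicator x and probability p is (x - p) log p; this form and the function name `contribution` are assumptions made here, since the formulas are not reproduced above.

```python
import math

def contribution(x, p):
    """Term (x - p) * log p of the centered log-likelihood for one category (assumed form)."""
    return (x - p) * math.log(p)

# For small p, the x = 0 term is positive but vanishing,
# while the x = 1 term is negative and unbounded.
for p in (1e-1, 1e-3, 1e-6):
    print(p, contribution(0, p), contribution(1, p))
```

Under this assumed form, the positive side is controlled by -p log p, which tends to 0 as p tends to 0, while the negative side behaves like log p, which diverges.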
4 Proof of the main result
We break up Theorem 1 into a number of intermediate results. Firstly, we prove the uniform convergence of (1) under the condition by establishing a polynomial decay bound for (1). Secondly, we prove that (1) does not converge to 0 if , which implies is the optimal rate. Finally, we prove the most difficult part, i.e., the exponential decay bound for (1).
Recall that , which gives the constraints each must satisfy. Let be another domain which excludes the constraint .
We begin with a lemma on the upper bound for , which may be of independent interest. Below we omit the index since the result is independent of .
The statement in the lemma is equivalent to the following optimization problem:
Consider the Lagrangian function :
Define . Since , for all and . Furthermore, by noting that for all and ,
we have for , which further implies .
The argument above shows that for all , is an upper bound for the original problem . Below we pick . Note that the optimization problem
is equivalent to separate problems: for ,
By taking the derivative with respect to , the local maximizer satisfies
There are two candidate solutions of the quadratic equation :
The corresponding solutions of are
Because and for , is the only local maximizer in . Furthermore, because and for , is the global maximizer.
Therefore, . The equality holds because . ∎
Lemma 1 can be viewed as a second-order version of a well-known inequality for entropies: , which can be proved by Jensen’s inequality (see Theorem 2.6.4 in ). But Lemma 1 is more difficult to prove because the function involved is neither convex nor concave.
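The first-order inequality referred to here is taken below to be the standard bound that the entropy of a distribution on K categories is at most log K, with equality at the uniform distribution; this reading of the elided formula is an assumption. A minimal numerical sanity check, with hypothetical helper names and a hypothetical choice of K, might look as follows.

```python
import math
import random

def entropy(p):
    """Entropy -sum_k p_k log p_k (natural log), with 0 log 0 taken as 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def random_simplex_point(K, rng):
    """Draw a point uniformly from the probability simplex by normalizing exponentials."""
    w = [rng.expovariate(1.0) for _ in range(K)]
    s = sum(w)
    return [wk / s for wk in w]

rng = random.Random(1)   # fixed seed for reproducibility
K = 5                    # hypothetical number of categories

# Largest entropy seen over many random distributions, compared with log K.
worst = max(entropy(random_simplex_point(K, rng)) for _ in range(10_000))
print(worst, math.log(K))
print(entropy([1 / K] * K))  # the uniform distribution attains log K
```

A random search of this kind is not a proof; it only illustrates the inequality that Jensen's inequality establishes. It does not check Lemma 1 itself, whose exact statement is not reproduced above.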
Theorem 5 (Uniform convergence).
If , for all ,
We now prove that is the optimal rate.
Theorem 6 (Rate optimality).
and for all ,
We only need to find a parameter setting such that generated under satisfy
Let for and let be the corresponding random variables generated under .
Let be i.i.d. variables where
Then it is easy to check that
Therefore, if .
It is useful to get a sense of the strongest possible concentration inequality by checking the convergence rate of the variance (see the introduction of for example). It is also worth mentioning that does not automatically imply for an arbitrary sequence . The result in (13) relies on the tail behavior of the sum of independent variables.
Theorem 7 (Uniform bound for the right tail).
For and ,
For and ,