# On Optimal Uniform Concentration Inequalities for Discrete Entropies in the High-dimensional Setting

We prove an exponential decay concentration inequality to bound the tail probability of the difference between the log-likelihood of discrete random variables and the negative entropy. The concentration bound we derive holds uniformly over all parameter values. The new result improves the convergence rate in an earlier result of Zhao (2020), from $(K^2\log K)/n=o(1)$ to $(\log K)^2/n=o(1)$, where $n$ is the sample size and $K$ is the number of possible values of the discrete variable. We further prove that the rate $(\log K)^2/n=o(1)$ is optimal. The results are extended to misspecified log-likelihoods for grouped random variables.

## 1 Main result

As a powerful toolset in probability theory [3, 11], concentration inequalities have wide applications in statistics [18, 16], information theory [14], and algorithm analysis [8]. The information entropy, or just entropy, is one of the central concepts in information theory [7]. The goal of the present paper is to prove a concentration inequality with the optimal rate to bound the tail probability of the difference between the log-likelihood of discrete random variables and the negative entropy when the number of possible values of the variable grows. In this section, we first state the main result of the paper. We will explain the motivation of this work and review related work in Section 2.

Let $X$ be a discrete random variable with $K$ possible values and probability mass function $(p_1,\dots,p_K)$. The negative entropy of $X$ is defined as $\sum_{k=1}^{K}p_k\log p_k$. (Throughout the paper, "log" denotes the natural logarithm.) Note that the definition of entropy does not depend on the values of the variable but only depends on the probabilities of taking each value. One can therefore equivalently define entropy on a categorical variable. Let $z=(z_1,\dots,z_K)$ be a dummy coding of a categorical variable with $K$ categories, in which one and only one entry is 1 and the others are 0. Let $P(z_k=1)=p_k$ with $\sum_{k=1}^{K}p_k=1$. The log-likelihood of $z$ is $\sum_{k=1}^{K}z_k\log p_k$ and the negative entropy of $z$ is $\sum_{k=1}^{K}p_k\log p_k$.

Given a sequence of independent and identically distributed random variables $z_1,\dots,z_n$, a natural question is to derive a concentration bound for the difference between the mean of the log-likelihoods of $z_1,\dots,z_n$ and its expectation, i.e., the negative entropy. We consider a slightly more general setting, in which the variables are assumed to be independent but not necessarily identical. Specifically, let $z_i=(z_{i1},\dots,z_{iK})$ follow a categorical distribution with parameters $p_i=(p_{i1},\dots,p_{iK})$, where $P(z_{ik}=1)=p_{ik}$. We assume $z_1,\dots,z_n$ are independent, but $p_i$ can be different for each $i$.

We are interested in deriving an exponential decay concentration bound, which holds uniformly over parameter values $p_1,\dots,p_n$, for the tail probability of

$$\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right).$$

Specifically, let $\mathcal{C}=\{(p_1,\dots,p_K): p_k\ge 0,\ k=1,\dots,K,\ \sum_{k=1}^{K}p_k=1\}$. We aim to derive a bound for

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right), \tag{1}$$

where "sup" is understood as taking the supremum over all possible values of $p_1,\dots,p_n$ in $\mathcal{C}$, not the maximum of $n$ values.

We now give the main theorem.

###### Theorem 1 (Main result).

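For sufficiently small positive $\epsilon$ and $K\ge 2$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\le 2\exp\left(-\frac{n\epsilon^2}{4(\max\{\log K,\log 5\})^2}\right). \tag{2}$$

Furthermore, if $(\log K)^2/n$ does not converge to 0, then for all $\epsilon>0$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\nrightarrow 0. \tag{3}$$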

Most results in the paper in fact hold for $K=1$. The corresponding log-likelihood is, however, non-random, so we exclude this trivial case.

We comment on the contributions of the main theorem before proceeding. First, as mentioned above, inequality (2) is uniform over parameter values, as the right-hand side does not depend on $p_1,\dots,p_n$. Note that we do not assume the $p_{ik}$ are bounded away from the boundaries. (We use the convention $0\log 0=0$, which is consistent with the limit $\lim_{x\to 0^+}x\log x=0$.) Removing this restriction is a challenge and a significant contribution of this paper. Secondly, inequality (2) implies that if $(\log K)^2/n=o(1)$, for all $\epsilon>0$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\to 0. \tag{4}$$

The rate $(\log K)^2/n=o(1)$ falls into the high-dimensional setting; that is, the number of parameters $K$ can grow much faster than the sample size $n$. Thirdly, (3) and (4) imply that the rate $(\log K)^2/n=o(1)$ is optimal in the asymptotic sense.

## 2 Motivation and related work

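Consider the most classical case where $K$ is fixed and $z_1,\dots,z_n$ are independently and identically distributed with $P(z_{ik}=1)=p_k$, $k=1,\dots,K$. By the law of large numbers,

$$\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}z_{ik}\log p_k\ \xrightarrow{\,p\,}\ \sum_{k=1}^{K}p_k\log p_k,\quad\text{as } n\to\infty.$$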

This result, called the asymptotic equipartition property (AEP), is one of the most classical results in information theory [15]. The AEP has been generalized to stationary ergodic processes [12, 5] and is referred to as the Shannon-McMillan-Breiman theorem.
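The convergence above can be illustrated numerically. The following is a minimal sketch, not part of the original paper: the helper name `mean_log_likelihood`, the probability vector, the sample size, and the seed are all illustrative assumptions.

```python
import math
import random


def mean_log_likelihood(p, n, seed=0):
    """Average of log p_k over n i.i.d. categorical draws with probabilities p."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(p)), weights=p, k=n)
    return sum(math.log(p[k]) for k in draws) / n


# Illustrative parameters (assumptions, not taken from the paper).
p = [0.5, 0.3, 0.2]
neg_entropy = sum(q * math.log(q) for q in p)

for n in (100, 10_000, 100_000):
    print(n, mean_log_likelihood(p, n), neg_entropy)
```

As $n$ grows, the sample average should settle near the negative entropy, which is what the law of large numbers predicts for fixed $K$.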

We improve the AEP for independent variables from a different perspective. We aim to prove a non-asymptotic concentration inequality, more specifically, an exponential decay bound, for the tail probability. The study of exponential decay concentration inequalities for sums of binary variables dates back to at least the 1920s [2]. The Chernoff-Hoeffding theorem [10] gives the sharpest bound that can be derived by the Chernoff bound technique for sums of independent Bernoulli variables. Bernstein’s inequality and Hoeffding’s inequality [10] give concentration bounds with more tractable forms and can be generalized to bounded variables and more general settings, such as sub-Gaussian variables and sub-exponential variables.

Most of these studies focused on tail bounds for sums of variables. There is a lack of research on concentration inequalities for log-likelihoods, i.e., sums weighted by logarithms of the parameters. Uniform bounds that are independent of parameter values are particularly under-explored, despite their applications in statistics [6, 13, 20].

We first discuss the difficulty in classical results when applied to log-likelihoods and then explain the motivation of the present research. For , and , . Then . Assume . By Bernstein’s inequality (see [8], Theorem 1.2), for all ,

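$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{2}z_{ik}\log p_{ik}-\sum_{k=1}^{2}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\le 2\exp\left\{-\frac{n^2\epsilon^2/2}{\sum_{i=1}^{n}\mathrm{Var}\left(z_{i1}\log\frac{p_{i1}}{1-p_{i1}}\right)+Mn\epsilon/3}\right\}. \tag{5}$$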

The reader is referred to [6] for an application of (5) to community detection in networks. A drawback of (5) is that the boundedness condition requires $p_{i1}$ to be bounded away from 0 and 1. Otherwise, the bound can become trivial if $M$ grows with $n$ too fast. One may apply alternative forms of Bernstein’s inequality (for example, Theorem 2.8.2 in [16]) or other commonly used concentration inequalities to log-likelihoods, but one then faces a similar problem.

The essential problem is that $\log p_{ik}$ should not be treated as an arbitrary set of coefficients because $p_{ik}$ is also a part of the model that controls the probabilistic behavior of $z_{ik}$. To the best of our knowledge, Zhao [19] first overcame this technical difficulty. The paper removed the constraint and proved a new Bernstein-type bound that does not depend on $p_1,\dots,p_n$:

###### Theorem 2 ([19], Corollary 1).

For and ,

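$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\le 2K\exp\left\{-\frac{n\epsilon^2}{2K(K+\epsilon)}\right\}.$$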

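The above theorem implies that if $(K^2\log K)/n=o(1)$, for all $\epsilon>0$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\to 0.$$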
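To get a feel for the gap between the two rates, the right-hand sides of Theorem 2 and of inequality (2) can be evaluated side by side. This is only a sketch: the function names and the values of $n$, $K$, and $\epsilon$ are illustrative assumptions, and (2) is stated for sufficiently small $\epsilon$, so the comparison is indicative rather than a verified application of either theorem.

```python
import math


def bound_theorem2(n: int, K: int, eps: float) -> float:
    """Right-hand side of Theorem 2: 2K * exp(-n eps^2 / (2K(K + eps)))."""
    return 2 * K * math.exp(-n * eps**2 / (2 * K * (K + eps)))


def bound_theorem1(n: int, K: int, eps: float) -> float:
    """Right-hand side of (2): 2 * exp(-n eps^2 / (4 max(log K, log 5)^2))."""
    scale = max(math.log(K), math.log(5))
    return 2 * math.exp(-n * eps**2 / (4 * scale**2))


# Illustrative values (assumptions): the sample size is fixed while K grows.
n, eps = 100_000, 0.1
for K in (10, 100, 1000):
    print(K, bound_theorem2(n, K, eps), bound_theorem1(n, K, eps))
```

For these values the Theorem 2 bound exceeds 1 (and is therefore uninformative) once $K$ reaches 100, while the right-hand side of (2) stays well below 1.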

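Our goal is to improve the above rate. We approach this problem by first considering an elementary probability inequality, Chebyshev’s inequality:

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\le\frac{\sum_{i=1}^{n}\mathrm{Var}(L(z_i))}{n^2\epsilon^2}, \tag{6}$$

where $L(z_i)=\sum_{k=1}^{K}z_{ik}\log p_{ik}$ is the log-likelihood of $z_i$ and

$$\mathrm{Var}(L(z_i))=\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2-\left(\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)^2\le\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2.$$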

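By taking the first and the second derivatives of $x(\log x)^2$, one can easily prove $\max_{0\le x\le 1}x(\log x)^2=4/e^2$, attained at $x=e^{-2}$. We can therefore give a rough estimate of the right-hand side of (6):

$$\frac{\sum_{i=1}^{n}\mathrm{Var}(L(z_i))}{n^2\epsilon^2}\le\frac{4K}{n\epsilon^2e^2}.$$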

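First note that the above bound is independent of $p_1,\dots,p_n$, which is in line with the observation in [19]. Moreover, the bound implies that if $K/n=o(1)$, for all $\epsilon>0$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\to 0,$$

which clearly suggests that there is room for improvement in Theorem 2 when $K$ is large.

The above estimate of $\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2$ is rough because it ignores the constraint $\sum_{k=1}^{K}p_{ik}=1$. In Section 4, we will show that the correct order of $\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2$ is $(\log K)^2$.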
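The two estimates can be checked numerically. The sketch below (helper names and grid resolution are illustrative assumptions) locates the maximizer of $x(\log x)^2$ on a grid and compares the coordinate-wise bound $4K/e^2$ with the value $(\log K)^2$ taken at the uniform distribution, which does respect the sum-to-one constraint.

```python
import math


def f(x: float) -> float:
    """x * (log x)^2, with the convention 0 * (log 0)^2 = 0."""
    return 0.0 if x == 0.0 else x * math.log(x) ** 2


def crude_bound(K: int) -> float:
    """Coordinate-wise bound: K times max_x x (log x)^2 = 4K / e^2."""
    return 4 * K / math.e**2


def uniform_value(K: int) -> float:
    """Value of sum_k p_k (log p_k)^2 at p = (1/K, ..., 1/K), i.e. (log K)^2."""
    return K * f(1 / K)


# Grid search for the maximizer of f on [0, 1].
grid = [i / 100_000 for i in range(100_001)]
x_star = max(grid, key=f)
print(x_star, math.exp(-2), f(x_star), 4 / math.e**2)

for K in (10, 100, 1000):
    print(K, crude_bound(K), uniform_value(K))
```

The crude bound grows linearly in $K$, while the value at the uniform distribution grows only like $(\log K)^2$, which is the gap the constraint is responsible for.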

The rest of the paper is organized as follows. In Section 3, we prove the concentration inequalities for fixed parameters by classical techniques. In Section 4, we elaborate the main theorem and prove it through a series of lemmas and theorems. Our approach relies on bounding certain moment generating functions (MGFs). To bound the MGFs, we borrow the idea of primal and dual problems from the optimization literature. In Section 5, we extend the results to misspecified log-likelihoods for grouped random variables.

## 3 Inequalities for fixed parameters

We prove concentration inequalities for fixed $p_1,\dots,p_n$ in this section, where each $p_i$ is an interior point of $\mathcal{C}$. The proofs are not challenging, but to the best of our knowledge the results do not exist in the literature in the form we present below, so we include them for completeness. Moreover, the proofs shed light on the asymmetry of the two sides of the bound and the challenge in proving the uniform bound.

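Let $Y_i=\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}$. Let $M_Y(\lambda,p_i)=E[e^{\lambda Y_i}]$ be the MGF of $Y_i$. (Note that the MGF is well defined on the boundaries of $\mathcal{C}$ for $\lambda>-1$. Specifically, $p_{ik}^{\lambda+1}=0$ for $p_{ik}=0$, which is continuous at 0 because $p^{\lambda+1}\to 0$ as $p\to 0^+$ for $\lambda>-1$.)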

###### Theorem 3 (Right-tail bound for fixed parameters).

For and ,

$$M_Y(\lambda,p_i)\le\exp\left(\lambda^2\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2\right),\quad i=1,\dots,n. \tag{7}$$

For and ,

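$$P\left(\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\ge\epsilon\right)\le\exp\left(-\frac{n^2\epsilon^2}{4\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2}\right).$$

###### Proof.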

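First note two elementary inequalities: $\log x\le x-1$ for $x>0$, and $e^x\le 1+x+x^2$ for $x\le 1$. For $\lambda\ge 0$,

$$\begin{aligned}
\log M_Y(\lambda,p_i) &= \log\left(\sum_{k=1}^{K}p_{ik}^{\lambda+1}\right)-\lambda\sum_{k=1}^{K}p_{ik}\log p_{ik}\\
&\le \sum_{k=1}^{K}p_{ik}^{\lambda+1}-1-\lambda\sum_{k=1}^{K}p_{ik}\log p_{ik}\\
&= \sum_{k=1}^{K}p_{ik}\exp(\lambda\log p_{ik})-1-\lambda\sum_{k=1}^{K}p_{ik}\log p_{ik}\\
&\le \sum_{k=1}^{K}p_{ik}\left(1+\lambda\log p_{ik}+\lambda^2(\log p_{ik})^2\right)-1-\lambda\sum_{k=1}^{K}p_{ik}\log p_{ik}
=\lambda^2\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2.
\end{aligned} \tag{8}$$

Therefore, $M_Y(\lambda,p_i)\le\exp\left(\lambda^2\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2\right)$, which is (7).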

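The rest of the proof follows from a standard Chernoff bound argument on sub-Gaussian variables (for example, see Chapter 2 of [18]). For $\lambda>0$, by Markov’s inequality,

$$P\left(\sum_{i=1}^{n}Y_i\ge n\epsilon\right)=P\left(e^{\lambda\sum_{i=1}^{n}Y_i}\ge e^{\lambda n\epsilon}\right)\le\frac{\prod_{i=1}^{n}E[e^{\lambda Y_i}]}{e^{\lambda n\epsilon}}\le\exp\left\{\lambda^2\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2-\lambda n\epsilon\right\}. \tag{9}$$

We obtain the result by letting $\lambda=n\epsilon\big/\left(2\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2\right)$, which minimizes the exponent in (9). ∎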
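The MGF bound (7) can be spot-checked numerically using the closed form $\log M_Y(\lambda,p_i)=\log\sum_k p_{ik}^{\lambda+1}-\lambda\sum_k p_{ik}\log p_{ik}$ from the first line of (8). The following is a sketch under stated assumptions: the helper names, the way interior points are sampled, and the range of $\lambda$ are all illustrative choices, and a random search is of course not a proof.

```python
import math
import random


def log_mgf(lam, p):
    """log E[exp(lam * Y)] for Y = sum_k z_k log p_k - sum_k p_k log p_k (all p_k > 0)."""
    neg_entropy = sum(q * math.log(q) for q in p)
    return math.log(sum(q ** (lam + 1) for q in p)) - lam * neg_entropy


def quadratic_bound(lam, p):
    """Exponent on the right-hand side of (7): lam^2 * sum_k p_k (log p_k)^2."""
    return lam**2 * sum(q * math.log(q) ** 2 for q in p)


def random_interior_point(K, rng):
    """Random point in the interior of the simplex (normalized exponentials)."""
    w = [rng.expovariate(1.0) for _ in range(K)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(0)
violations = 0
for _ in range(2000):
    p = random_interior_point(rng.randint(2, 20), rng)
    lam = rng.uniform(0.0, 10.0)  # right tail: non-negative lambda only
    if log_mgf(lam, p) > quadratic_bound(lam, p) + 1e-12:
        violations += 1
print("violations of (7):", violations)
```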

###### Theorem 4 (Left-tail bound for fixed parameters).

Let . For and ,

$$M_Y(\lambda,p_i)\le\exp\left(\lambda^2\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2\right),\quad i=1,\dots,n. \tag{10}$$

For and ,

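$$P\left(\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\le-\epsilon\right)\le
\begin{cases}
\exp\left(-\dfrac{n^2\epsilon^2}{4\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2}\right) & \text{for } 0\le\epsilon\le\dfrac{2\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2}{nb},\\[2ex]
\exp\left(-\dfrac{n\epsilon}{2b}\right) & \text{for } \epsilon>\dfrac{2\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2}{nb}.
\end{cases} \tag{11}$$

###### Proof.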

For ,

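$$\begin{aligned}
\log M_Y(\lambda,p_i) &\le \sum_{k=1}^{K}p_{ik}\exp(\lambda\log p_{ik})-1-\lambda\sum_{k=1}^{K}p_{ik}\log p_{ik}\\
&\le \sum_{k=1}^{K}p_{ik}\left(1+\lambda\log p_{ik}+\lambda^2(\log p_{ik})^2\right)-1-\lambda\sum_{k=1}^{K}p_{ik}\log p_{ik}
=\lambda^2\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2,
\end{aligned} \tag{12}$$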

the second inequality follows from and for . Therefore, , , for .

The rest of the proof follows from a standard Chernoff bound argument on sub-exponential variables. For , by a similar argument in (9),

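$$P\left(\sum_{i=1}^{n}Y_i\le-n\epsilon\right)=P\left(e^{\lambda\sum_{i=1}^{n}Y_i}\ge e^{-\lambda n\epsilon}\right)\le\frac{\prod_{i=1}^{n}E[e^{\lambda Y_i}]}{e^{-\lambda n\epsilon}}\le\exp\left\{\lambda^2\sum_{i=1}^{n}\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2+\lambda n\epsilon\right\}.$$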

We obtain the result by letting for and for . ∎

A key difference between the two results is in the second inequality of (8) and (12), where $\lambda$ in the two theorems has opposite signs. The inequality in (8) always holds since $\lambda\log p_{ik}\le 0$, but the inequality in (12) cannot be true for a large $|\lambda|$ when $\lambda<0$ because of the exponential growth. Therefore, it is not difficult to find a quantity that does not depend on $p_1,\dots,p_n$ to bound the right-tail probability according to the discussion in Section 2 (obtaining the optimal order is, however, nontrivial and will be shown in Section 4). Finding a uniform bound for the left-tail probability is more challenging, since the admissible range of $\lambda$ in Theorem 4 degenerates when some $p_{ik}$ approaches 0. We develop a new technique to uniformly control the MGF when $\lambda<0$ in Section 4.

The asymmetry of the left and right tails can be understood by the following heuristic argument. Note that

can only contribute to the positive part of when . The contribution is however negligible when is close to 0. On the other hand, contributes to the negative part of when , and the contribution blows up when is close to 0.
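The asymmetry can be made concrete with a small numerical example. The sketch below is illustrative only: the probability vector $(0.01, 0.99)$ and the values $\lambda=\pm 3$ are assumptions chosen to make the effect visible, and the helper names are hypothetical. It evaluates $\log M_Y(\lambda,p)=\log\sum_k p_k^{\lambda+1}-\lambda\sum_k p_k\log p_k$ and compares it with the quadratic exponent $\lambda^2\sum_k p_k(\log p_k)^2$ for a positive and a negative $\lambda$ of the same magnitude.

```python
import math


def log_mgf(lam, p):
    """log E[exp(lam * Y)] for Y = sum_k z_k log p_k - sum_k p_k log p_k (all p_k > 0)."""
    neg_entropy = sum(q * math.log(q) for q in p)
    return math.log(sum(q ** (lam + 1) for q in p)) - lam * neg_entropy


def quadratic_bound(lam, p):
    """lam^2 * sum_k p_k (log p_k)^2, the exponent in (7) and (10)."""
    return lam**2 * sum(q * math.log(q) ** 2 for q in p)


# Illustrative choice (an assumption): one category has a small probability.
p = [0.01, 0.99]

right_holds = log_mgf(3.0, p) <= quadratic_bound(3.0, p)    # lambda > 0
left_holds = log_mgf(-3.0, p) <= quadratic_bound(-3.0, p)   # lambda < 0, |lambda| large

print("lambda = +3:", log_mgf(3.0, p), quadratic_bound(3.0, p), right_holds)
print("lambda = -3:", log_mgf(-3.0, p), quadratic_bound(-3.0, p), left_holds)
```

For the negative value, the term $p_k^{\lambda+1}$ with small $p_k$ dominates the sum, so the quadratic bound fails unless $|\lambda|$ is restricted.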

## 4 Proof of the main result

We break up Theorem 1 into a number of intermediate results. First, we prove the uniform convergence of (1) under the condition $(\log K)^2/n=o(1)$ by establishing a polynomial decay bound for (1). Secondly, we prove that (1) does not converge to 0 if $(\log K)^2/n$ does not converge to 0, which implies $(\log K)^2/n=o(1)$ is the optimal rate. Finally, we prove the most difficult part, i.e., the exponential decay bound for (1).

Recall that $\mathcal{C}=\{(p_1,\dots,p_K): p_k\ge 0,\ k=1,\dots,K,\ \sum_{k=1}^{K}p_k=1\}$, which gives the constraints each $p_i$ must satisfy. Let $\mathcal{D}=\{(p_1,\dots,p_K): 0\le p_k\le 1,\ k=1,\dots,K\}$ be another domain which excludes the constraint $\sum_{k=1}^{K}p_k=1$.

We begin with a lemma on the upper bound for $\sum_{k=1}^{K}p_{ik}(\log p_{ik})^2$, which may be of independent interest. Below we omit the index $i$ since the result is independent of $i$.

###### Lemma 1.

For $K\ge 5$,

$$\max_{p\in\mathcal{C}}\sum_{k=1}^{K}p_k(\log p_k)^2=(\log K)^2.$$

###### Proof.

The statement in the lemma is equivalent to the following optimization problem:

$$\max_{p\in\mathcal{D}}\sum_{k=1}^{K}p_k(\log p_k)^2,\quad\text{subject to } \sum_{k=1}^{K}p_k=1.$$

Consider the Lagrangian function [4]:

$$L(p,\nu)=\sum_{k=1}^{K}p_k(\log p_k)^2+\nu\left(\sum_{k=1}^{K}p_k-1\right).$$

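Define $g(\nu)=\max_{p\in\mathcal{D}}L(p,\nu)$. Since $\mathcal{C}\subset\mathcal{D}$, $g(\nu)\ge L(\tilde p,\nu)$ for all $\nu$ and $\tilde p\in\mathcal{C}$. Furthermore, by noting that for all $\nu$ and $\tilde p\in\mathcal{C}$,

$$L(\tilde p,\nu)=\sum_{k=1}^{K}\tilde p_k(\log\tilde p_k)^2+\nu\left(\sum_{k=1}^{K}\tilde p_k-1\right)=\sum_{k=1}^{K}\tilde p_k(\log\tilde p_k)^2,$$

we have $g(\nu)\ge\sum_{k=1}^{K}\tilde p_k(\log\tilde p_k)^2$ for $\tilde p\in\mathcal{C}$, which further implies $g(\nu)\ge\max_{p\in\mathcal{C}}\sum_{k=1}^{K}p_k(\log p_k)^2$.

The argument above shows that for all $\nu$, $g(\nu)$ is an upper bound for the original problem $\max_{p\in\mathcal{C}}\sum_{k=1}^{K}p_k(\log p_k)^2$. Below we pick $\nu=1-(\log K-1)^2$. Note that the optimization problem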

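$$\max_{p\in\mathcal{D}}\ \sum_{k=1}^{K}p_k(\log p_k)^2+\nu\left(\sum_{k=1}^{K}p_k-1\right)$$

is equivalent to $K$ separate problems: for $k=1,\dots,K$,

$$\max_{0\le p_k\le 1}h(p_k):=p_k(\log p_k)^2+\nu p_k.$$

By taking the derivative with respect to $p_k$, the local maximizer satisfies

$$(\log p_k)^2+2\log p_k+\nu=0.$$

There are two candidate solutions of the quadratic equation $y^2+2y+\nu=0$:

$$y=-1-\sqrt{1-\nu},\quad y=-1+\sqrt{1-\nu},$$

which are

$$y=-1-(\log K-1),\quad y=-1+(\log K-1).$$

The corresponding solutions of $p_k$ are

$$p_k=1/K,\quad p_k=\exp(\log K-2).$$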

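Because $h''(1/K)<0$ and $h''(\exp(\log K-2))>0$ for $K\ge 5$, $p_k=1/K$ is the only local maximizer in $(0,1)$. Furthermore, because $h(1/K)>h(0)$ and $h(1/K)\ge h(1)$ for $K\ge 5$, $p_k=1/K$ is the global maximizer.

Therefore, $\max_{p\in\mathcal{C}}\sum_{k=1}^{K}p_k(\log p_k)^2\le(\log K)^2$. The equality holds because $(1/K,\dots,1/K)\in\mathcal{C}$. ∎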

Lemma 1 can be viewed as a second-order version of a well-known inequality for entropies: $-\sum_{k=1}^{K}p_k\log p_k\le\log K$, which can be proved by Jensen’s inequality (see Theorem 2.6.4 in [7]). But Lemma 1 is more difficult to prove because the function involved is neither convex nor concave.
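A quick numerical sanity check of Lemma 1 is sketched below. It is not a proof: the helper names, the random sampling scheme, and the seed are illustrative assumptions. The check samples points of the simplex for a few values of $K\ge 5$ and confirms that none exceeds $(\log K)^2$. It also evaluates the point $(e^{-2},1-e^{-2})$ for $K=2$, where the value exceeds $(\log 2)^2$, which suggests why a lower limit on $K$ (and the term $\log 5$ in (2)) is needed.

```python
import math
import random


def second_moment(p):
    """sum_k p_k (log p_k)^2, with the convention 0 * (log 0)^2 = 0."""
    return sum(q * math.log(q) ** 2 for q in p if q > 0)


def random_simplex_point(K, rng):
    """Random point of the simplex obtained by normalizing exponential draws."""
    w = [rng.expovariate(1.0) for _ in range(K)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(1)
# Smallest observed slack (log K)^2 - sum_k p_k (log p_k)^2 over random points.
worst_gap = min(
    math.log(K) ** 2 - second_moment(random_simplex_point(K, rng))
    for K in (5, 8, 50)
    for _ in range(5000)
)
print("smallest slack for K >= 5:", worst_gap)

# For K = 2, the uniform distribution is not the maximizer.
p2 = [math.exp(-2), 1 - math.exp(-2)]
print(second_moment(p2), math.log(2) ** 2)
```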

The next theorem immediately follows from Lemma 1 and (6).

###### Theorem 5 (Uniform convergence).

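If $(\log K)^2/n=o(1)$, for all $\epsilon>0$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\to 0.$$

We now prove that $(\log K)^2/n=o(1)$ is the optimal rate.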

###### Theorem 6 (Rate optimality).

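If $(\log K)^2/n$ does not converge to 0,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}z_{ik}\log p_{ik}\right)\nrightarrow 0,$$

and for all $\epsilon>0$,

$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\right|\ge\epsilon\right)\nrightarrow 0. \tag{13}$$

###### Proof.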

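We only need to find a parameter setting $p_1^*,\dots,p_n^*$ such that $z_1^*,\dots,z_n^*$ generated under $p_1^*,\dots,p_n^*$ satisfy

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}z^*_{ik}\log p^*_{ik}\right)\nrightarrow 0,$$

and

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z^*_{ik}\log p^*_{ik}-\sum_{k=1}^{K}p^*_{ik}\log p^*_{ik}\right)\right|\ge\epsilon\right)\nrightarrow 0.$$

Let $p_i^*=\left(\frac{1}{2},\frac{1}{2(K-1)},\dots,\frac{1}{2(K-1)}\right)$ for $i=1,\dots,n$ and let $z_1^*,\dots,z_n^*$ be the corresponding random variables generated under $p_1^*,\dots,p_n^*$.

Let $w_1,\dots,w_n$ be i.i.d. variables where

$$w_i=\begin{cases}1 & \text{w.p. } 1/2,\\ -1 & \text{w.p. } 1/2.\end{cases}$$

Then it is easy to check that

$$\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z^*_{ik}\log p^*_{ik}-\sum_{k=1}^{K}p^*_{ik}\log p^*_{ik}\right)\overset{d}{=}\frac{1}{2}\log(K-1)\sum_{i=1}^{n}w_i.$$

Therefore, $\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}z^*_{ik}\log p^*_{ik}\right)=\frac{(\log(K-1))^2}{4n}\nrightarrow 0$ if $(\log K)^2/n$ does not converge to 0.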

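By the Berry-Esseen theorem (see Theorem 3.4.17 in [9] for example), for any $x\in\mathbb{R}$,

$$\left|P\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}w_i\le x\right)-\Phi(x)\right|\le\frac{C}{\sqrt{n}},$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution and $C$ is an absolute constant. Therefore,

$$P\left(\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z^*_{ik}\log p^*_{ik}-\sum_{k=1}^{K}p^*_{ik}\log p^*_{ik}\right)\le-\epsilon\right)\ge\Phi\left(-\frac{2\sqrt{n}\,\epsilon}{\log(K-1)}\right)-\frac{C}{\sqrt{n}},$$

where $\Phi\left(-\frac{2\sqrt{n}\,\epsilon}{\log(K-1)}\right)\nrightarrow 0$ if $(\log K)^2/n$ does not converge to 0. ∎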
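The construction in the proof can be evaluated exactly, because the centered log-likelihood sum has the same distribution as $\frac{1}{2}\log(K-1)\sum_i w_i$ and $\sum_i w_i$ is a shifted binomial. The sketch below is illustrative: the function name, the choice $\log(K-1)=\sqrt{n}$ (so that $(\log K)^2/n$ stays bounded away from 0), and the value $\epsilon=0.33$ are assumptions, not values from the paper.

```python
import math


def two_sided_tail(n, log_km1, eps):
    """Exact P(|(log(K-1) / (2n)) * sum_i w_i| >= eps) for i.i.d. Rademacher w_i.

    log_km1 stands for log(K - 1); it is passed directly so that K itself
    never has to be represented.
    """
    total = 0
    for b in range(n + 1):          # b = number of w_i equal to +1
        s = 2 * b - n               # value of sum_i w_i
        if abs(log_km1 * s / (2 * n)) >= eps:
            total += math.comb(n, b)
    return total / 2**n


# Illustrative regime (an assumption): log(K - 1) = sqrt(n), so the variance
# (log(K - 1))^2 / (4n) equals 1/4 for every n.
probs = [two_sided_tail(n, math.sqrt(n), 0.33) for n in (100, 400, 1600)]
print(probs)
```

In this regime the tail probability does not shrink as $n$ grows, in line with (13).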

It is a useful idea to get a sense of the strongest possible concentration inequality by checking the convergence rate of the variance (see the introduction of [17] for example). It is also worth mentioning that $\mathrm{Var}(X_n)\nrightarrow 0$ does not automatically imply $P(|X_n-E[X_n]|\ge\epsilon)\nrightarrow 0$ for an arbitrary sequence $X_n$. The result in (13) relies on the tail behavior of the sum of independent variables.

We now prove the exponential decay bound (2). The right-tail bound is relatively easy to prove as pointed out in Section 3.

###### Theorem 7 (Uniform bound for the right tail).

For and ,

$$M_Y(\lambda,p_i)\le\exp\left(\lambda^2(\log K)^2\right),\quad i=1,\dots,n. \tag{14}$$

For and ,

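$$\sup_{p_1\in\mathcal{C},\dots,p_n\in\mathcal{C}}P\left(\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}z_{ik}\log p_{ik}-\sum_{k=1}^{K}p_{ik}\log p_{ik}\right)\ge\epsilon\right)\le\exp\left(-\frac{n\epsilon^2}{4(\log K)^2}\right).$$

###### Proof.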

The first conclusion immediately follows from (7) and Lemma 1. By the standard Chernoff bound argument on sub-Gaussian variables as in Theorem 3, for all ,

 P(1nn∑i=1(K∑