 # A Note on New Bernstein-type Inequalities for the Log-likelihood Function of Bernoulli Variables

We prove a new Bernstein-type inequality for the log-likelihood function of Bernoulli variables. In contrast to classical Bernstein's inequality and Hoeffding's inequality when applied to the log-likelihood, the new bound is independent of the parameters of the Bernoulli variables and therefore does not blow up as the parameters approach 0 or 1. The new inequality strengthens certain theoretical results on likelihood-based methods for community detection in networks and can be applied to other likelihood-based methods for binary data.

## 1 Introduction

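Let $X_1,\dots,X_n$ be independent Bernoulli random variables, where $X_i$ takes the value 1 with probability $p_i$, denoted by $X_i\sim\mathrm{Ber}(p_i)$. We are interested in deriving a concentration bound, which decays exponentially and is independent of the parameters $p_1,\dots,p_n$, for the joint log-likelihood function of $X_1,\dots,X_n$. That is,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\bigl(X_i\log p_i+(1-X_i)\log(1-p_i)\bigr)-\frac{1}{n}\sum_{i=1}^{n}\bigl(p_i\log p_i+(1-p_i)\log(1-p_i)\bigr)\right|\ge\epsilon\right)\le c_1e^{-c_2n}, \tag{1}$$

where $c_1$ and $c_2$ are constants that only depend on $\epsilon$.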

This research is motivated by theoretical studies of likelihood-based methods for binary data, in particular likelihood-based methods for community detection in networks. For example, Theorem 2 in Choi et al. (2012) relies on an inequality of this type, and so does Theorem 2 in Paul and Chen (2016).

We begin with classical results. By symmetry, we only consider

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}X_i\log p_i-\frac{1}{n}\sum_{i=1}^{n}p_i\log p_i\right|\ge\epsilon\right).$$

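Since $X_i\log p_i=p_i\log p_i$ almost surely when $p_i=1$ or 0 (using the convention $0\log 0=0$), the corresponding term can be dropped. Without loss of generality, assume $0<p_i<1$ for $i=1,\dots,n$. Noticing that $\log p_i\le X_i\log p_i\le 0$ for $i=1,\dots,n$, we have Hoeffding's inequality (Hoeffding, 1963): for all $\epsilon>0$,

$$P\left(\frac{1}{n}\left|\sum_{i=1}^{n}(X_i-p_i)\log p_i\right|\ge\epsilon\right)\le 2\exp\left\{-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}(\log p_i)^2}\right\}.$$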

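Let $p_{(1)}$ be the smallest value among $p_1,\dots,p_n$. Then $|(X_i-p_i)\log p_i|\le|\log p_{(1)}|$ for $i=1,\dots,n$. Bernstein's inequality (see Dubhashi and Panconesi (2009), Theorem 1.2) gives: for all $\epsilon>0$,

$$P\left(\frac{1}{n}\left|\sum_{i=1}^{n}(X_i-p_i)\log p_i\right|\ge\epsilon\right)\le 2\exp\left\{-\frac{n^2\epsilon^2/2}{\sum_{i=1}^{n}\mathrm{Var}(X_i\log p_i)+|\log p_{(1)}|\,n\epsilon/3}\right\}.$$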

Note that both inequalities depend on $p_1,\dots,p_n$. As a result, when $p_{(1)}$ goes to 0 fast enough as $n$ grows, the bounds can be trivial due to the divergence of $|\log p_{(1)}|$. When applying these inequalities, technical assumptions are therefore needed to control the rate at which the parameters approach the boundaries, for example, the condition imposed in Theorem 2 of Choi et al. (2012).
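To see how quickly the classical bounds degrade, the following sketch evaluates the right-hand sides of the Hoeffding and Bernstein bounds above in the special case where all parameters are equal, $p_i=p$. This equal-parameter setting and the function names are assumptions made only for illustration. With $n$ and $\epsilon$ fixed, both values exceed 1 once $p$ is small enough, at which point the bounds carry no information.

```python
import math


def hoeffding_bound(n, eps, p):
    """Right-hand side of the Hoeffding bound when every p_i equals p."""
    # sum_i (log p_i)^2 = n (log p)^2, so the exponent is -2 n eps^2 / (log p)^2
    return 2.0 * math.exp(-2.0 * n * eps**2 / math.log(p) ** 2)


def bernstein_bound(n, eps, p):
    """Right-hand side of the Bernstein bound when every p_i equals p."""
    log_p = math.log(p)
    variance_sum = n * p * (1.0 - p) * log_p**2  # sum_i Var(X_i log p_i)
    denom = variance_sum + abs(log_p) * n * eps / 3.0
    return 2.0 * math.exp(-(n**2 * eps**2 / 2.0) / denom)


if __name__ == "__main__":
    n, eps = 1000, 0.1
    for p in (0.5, 1e-2, 1e-10, 1e-100, 1e-300):
        print(
            f"p={p:g}  Hoeffding={hoeffding_bound(n, eps, p):.3g}  "
            f"Bernstein={bernstein_bound(n, eps, p):.3g}"
        )
```

Any value of 1 or more in the printed table means the corresponding inequality is vacuous for that choice of $p$.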

In this note, we prove a Bernstein-type inequality where the bound is independent of $p_1,\dots,p_n$. In other words, we show that $\sum_{i=1}^{n}(X_i-p_i)\log p_i$ is in fact well-behaved when the parameters are near the boundary. Results such as those in Choi et al. (2012) and Paul and Chen (2016) can therefore be strengthened by removing the technical assumptions. The new inequality is particularly useful in cases where those assumptions are inconvenient to impose.

## 2 Main Result

###### Theorem 1.

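Let $X_i$ be independent $\mathrm{Ber}(p_i)$ for $i=1,\dots,n$ where $p_i\in[0,1]$. For all $t>0$,

$$P\left(\left|\sum_{i=1}^{n}(X_i-p_i)\log p_i\right|\ge t\right)\le 2\exp\left\{-\frac{t^2}{2(n+t)}\right\}. \tag{2}$$
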
###### Proof.

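Let $Y_i=(X_i-p_i)\log p_i$. Let $G(p_i,\lambda)$ be the moment generating function of $Y_i$, which is

$$G(p_i,\lambda)=E[e^{\lambda Y_i}]=p_ie^{\lambda(1-p_i)\log p_i}+(1-p_i)e^{-\lambda p_i\log p_i}.$$

The key step is to derive an exponential upper bound for $G(p_i,\lambda)$ which is independent of $p_i$. First consider the case where $0<p_i<1$.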

We prove Bernstein's condition (see Wainwright (2019), p. 27 for an introduction) for the moments of $Y_i$. That is, we find constants $\sigma^2$ and $b$ such that

$$|E[Y_i^m]|\le\frac{1}{2}m!\,\sigma^2b^{m-2}\quad\text{for } m=3,4,\dots. \tag{3}$$

Different from Wainwright (2019), here we look for constants $\sigma^2$ and $b$ which are independent of $p_i$.

Consider

$$E[Y_i^m]=\underbrace{p_i(1-p_i)^m(\log p_i)^m}_{A_1}+\underbrace{(1-p_i)(-p_i\log p_i)^m}_{A_2}.$$

By taking the first and second derivatives of $p_i(\log p_i)^m$, one can easily check that its optimum is achieved at $p_i=e^{-m}$. Therefore,

$$|A_1|\le|p_i(\log p_i)^m|\le\left(\frac{m}{e}\right)^m\le\frac{m!}{\sqrt{2\pi m}},$$

where the last inequality follows from Stirling's formula (Robbins, 1955). Similarly, since $-p_i\log p_i\le e^{-1}$,

$$|A_2|\le(1-p_i)(-p_i\log p_i)^m\le e^{-m}.$$

It follows that

$$|E[Y_i^m]|\le\frac{m!}{\sqrt{2\pi m}}+\frac{1}{e^m}\le\frac{1}{2}m!\quad\text{for } m=3,4,\dots.$$

Therefore, Bernstein's condition (3) holds when $\sigma^2=1$ and $b=1$.
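The moment bound can also be checked numerically. The sketch below is only a sanity check, not part of the proof: it evaluates $E[Y_i^m]$ for $Y_i=(X_i-p_i)\log p_i$ on a log-spaced grid of $p_i\in(0,1)$ and compares the largest absolute value with $m!/2$. The grid range, grid size and function names are illustrative assumptions.

```python
import math


def moment(p, m):
    """E[Y^m] for Y = (X - p) log p with X ~ Ber(p) and 0 < p < 1."""
    log_p = math.log(p)
    return p * ((1.0 - p) * log_p) ** m + (1.0 - p) * (-p * log_p) ** m


def max_abs_moment(m, grid_size=20000):
    """Largest |E[Y^m]| over a log-spaced grid of p between e^{-60} and 1."""
    best = 0.0
    for k in range(1, grid_size):
        # Log spacing places many grid points near the maximiser p = e^{-m}.
        p = math.exp(-60.0 * k / grid_size)
        best = max(best, abs(moment(p, m)))
    return best


if __name__ == "__main__":
    for m in (3, 5, 10, 20):
        print(
            f"m={m:2d}  max|E[Y^m]|={max_abs_moment(m):.4g}  "
            f"(m/e)^m={(m / math.e) ** m:.4g}  m!/2={0.5 * math.factorial(m):.4g}"
        )
```

On such a grid the maximum should stay close to $(m/e)^m$ and below $m!/2$, in line with the bounds on $A_1$ and $A_2$.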

We now use Bernstein's condition to derive an upper bound for $G(p_i,\lambda)$. The argument is similar to Wainwright (2019), pp. 27-28. We give the details for completeness.

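By the power series expansion of the exponential function and Fubini's theorem (for exchanging the expectation and summation),

$$\begin{aligned}G(p_i,\lambda)=E[e^{\lambda Y_i}]&=1+\frac{\lambda^2\mathrm{Var}(Y_i)}{2}+\sum_{m=3}^{\infty}\frac{\lambda^mE[Y_i^m]}{m!}\\&\le 1+\frac{\lambda^2}{2}+\frac{\lambda^2}{2}\sum_{m=1}^{\infty}|\lambda|^m,\end{aligned}$$

where the first-order term vanishes because $E[Y_i]=0$, and the inequality follows from Bernstein's condition (3) and $\mathrm{Var}(Y_i)\le 1$. For any $|\lambda|<1$, the geometric series converges, and

$$G(p_i,\lambda)\le 1+\frac{\lambda^2}{2}\cdot\frac{1}{1-|\lambda|}\le\exp\left\{\frac{\lambda^2}{2(1-|\lambda|)}\right\}, \tag{4}$$

where the second inequality follows from $1+x\le e^x$. Notice that $G(p_i,\lambda)=1$ for $p_i=0$ or $1$, so the inequality holds for all $p_i\in[0,1]$.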

The rest of the proof follows from a standard argument using the Chernoff bound, which can be found in a standard textbook on concentration inequalities, for example, Dubhashi and Panconesi (2009), Chapter 1. We give the details for readers who are unfamiliar with this technique. For $-1<\lambda<0$,

$$P\left(\sum_{i=1}^{n}Y_i\le -t\right)=P\left(e^{\lambda\sum_{i=1}^{n}Y_i}\ge e^{-\lambda t}\right)\le\frac{\prod_{i=1}^{n}E[e^{\lambda Y_i}]}{e^{-\lambda t}}\le\exp\left\{\frac{n\lambda^2}{2(1-|\lambda|)}+\lambda t\right\},$$

where the first inequality is Markov's inequality and the second inequality follows from (4). By setting $\lambda=-t/(n+t)$, we obtain

$$P\left(\sum_{i=1}^{n}Y_i\le -t\right)\le\exp\left\{-\frac{t^2}{2(n+t)}\right\}.$$
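For readers who wish to verify the last step, the algebra can be spelled out as follows, taking $\lambda=-t/(n+t)$ with $t>0$, so that $|\lambda|=t/(n+t)<1$:

$$\begin{aligned}
1-|\lambda| &= \frac{n}{n+t},\\
\frac{n\lambda^2}{2(1-|\lambda|)} &= \frac{nt^2}{(n+t)^2}\cdot\frac{n+t}{2n}=\frac{t^2}{2(n+t)},\\
\lambda t &= -\frac{t^2}{n+t},\\
\frac{n\lambda^2}{2(1-|\lambda|)}+\lambda t &= \frac{t^2}{2(n+t)}-\frac{t^2}{n+t}=-\frac{t^2}{2(n+t)}.
\end{aligned}$$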

The bound for the right tail can be obtained similarly by setting $\lambda=t/(n+t)$. ∎
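As an informal illustration of Theorem 1, the following Monte Carlo sketch estimates the two-sided tail probability of $\sum_{i=1}^{n}(X_i-p_i)\log p_i$ and compares it with $2\exp\{-t^2/(2(n+t))\}$. The parameter values (including $p_i$ drawn as small as about $10^{-6}$), the seed and the function names are assumptions chosen for the example. A simulation of this kind cannot prove the inequality. It only shows the bound is not violated in one setting.

```python
import math
import random


def theorem1_bound(n, t):
    """Two-sided bound 2 exp{-t^2 / (2 (n + t))}."""
    return 2.0 * math.exp(-(t**2) / (2.0 * (n + t)))


def empirical_tail(ps, t, reps, rng):
    """Monte Carlo estimate of P(|sum_i (X_i - p_i) log p_i| >= t)."""
    logs = [math.log(p) for p in ps]
    hits = 0
    for _ in range(reps):
        total = 0.0
        for p, log_p in zip(ps, logs):
            x = 1.0 if rng.random() < p else 0.0  # X_i ~ Ber(p_i)
            total += (x - p) * log_p
        if abs(total) >= t:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    rng = random.Random(0)
    n, t = 200, 30.0
    # Parameters spread over several orders of magnitude, many close to 0.
    ps = [10.0 ** (-rng.uniform(0.0, 6.0)) for _ in range(n)]
    print("empirical tail:", empirical_tail(ps, t, reps=2000, rng=rng))
    print("Theorem 1 bound:", theorem1_bound(n, t))
```

The bound depends only on $n$ and $t$, so the comparison can be repeated with parameters pushed arbitrarily close to 0 without changing the right-hand side.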

###### Remark 1.

$|E[Y_i^m]|$ is dominated by the term $|p_i(\log p_i)^m|$, which has a bump near the boundary, that is, its value achieves the order of $(m/e)^m$ at $p_i=e^{-m}$. This value is, however, still bounded by $m!$, which implies the left-tail bound of $\sum_{i=1}^{n}Y_i$ is well-behaved when the parameters are near the boundary.

###### Remark 2.

The constant $\sigma^2=1$ in Bernstein's condition (3) is not the optimal value. We simply choose this value for obtaining a nice form in (2). On the contrary, $b=1$ is optimal because $(m/e)^m$ dominates $m!\,b^{m-2}$ for any $b<1$. This fact can also be seen from the following proposition:

###### Proposition 1.

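For $\lambda<-1$, $\lim_{p_i\to 0^+}G(p_i,\lambda)=\infty$, which implies $G(p_i,\lambda)$ cannot be bounded by any function of $\lambda$ that takes finite values. For $\lambda=-1$, $\lim_{p_i\to 0^+}G(p_i,\lambda)=2$.

###### Proof.

The result is obvious by noticing that $G(p_i,\lambda)=p_i^{1+\lambda(1-p_i)}+(1-p_i)p_i^{-\lambda p_i}$. ∎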
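The behaviour of the moment generating function near the boundary can be explored numerically. The sketch below evaluates $G(p,\lambda)=pe^{\lambda(1-p)\log p}+(1-p)e^{-\lambda p\log p}$ for small $p$ and a few values of $\lambda$ chosen for illustration. The function names are hypothetical. For $\lambda=-2$ the value grows like $1/p$, whereas for $|\lambda|<1$ it stays below the right-hand side of (4).

```python
import math


def mgf(p, lam):
    """G(p, lam) = p e^{lam (1-p) log p} + (1-p) e^{-lam p log p}, 0 < p < 1."""
    log_p = math.log(p)
    return p * math.exp(lam * (1.0 - p) * log_p) + (1.0 - p) * math.exp(
        -lam * p * log_p
    )


def mgf_bound(lam):
    """Right-hand side of (4), exp{lam^2 / (2 (1 - |lam|))}, for |lam| < 1."""
    return math.exp(lam**2 / (2.0 * (1.0 - abs(lam))))


if __name__ == "__main__":
    for lam in (-0.5, -1.0, -2.0):
        values = [mgf(10.0 ** (-k), lam) for k in (2, 4, 8)]
        print(f"lambda={lam:+.1f}  G at p=1e-2, 1e-4, 1e-8:", values)
    print("bound (4) at lambda=-0.5:", mgf_bound(-0.5))
```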

We now prove (1). We state a slightly more general result for multinoulli variables. Let $X_i=(X_{i1},\dots,X_{iK})$ be a multinoulli variable with $P(X_{ik}=1)=p_{ik}$ for $k=1,\dots,K$, and assume $X_1,\dots,X_n$ are independent.

###### Corollary 1.

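For $p_{ik}\in[0,1]$, $i=1,\dots,n$, $k=1,\dots,K$, and all $\epsilon>0$,

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}(X_{ik}-p_{ik})\log p_{ik}\right|\ge\epsilon\right)\le 2K\exp\left\{-\frac{n\epsilon^2}{2K(K+\epsilon)}\right\}.$$
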
###### Proof.

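The result is obvious by noticing that

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}(X_{ik}-p_{ik})\log p_{ik}\right|\ge\epsilon\right)\le\sum_{k=1}^{K}P\left(\left|\sum_{i=1}^{n}(X_{ik}-p_{ik})\log p_{ik}\right|\ge\frac{n\epsilon}{K}\right),$$

and setting $t=n\epsilon/K$ in (2). ∎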
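The substitution in this proof is easy to check numerically. The sketch below (function names are illustrative) computes the bound $2\exp\{-t^2/(2(n+t))\}$ at $t=n\epsilon/K$, multiplies it by $K$ for the union bound, and compares the result with the closed form $2K\exp\{-n\epsilon^2/(2K(K+\epsilon))\}$.

```python
import math


def theorem1_bound(n, t):
    """Two-sided bound 2 exp{-t^2 / (2 (n + t))}."""
    return 2.0 * math.exp(-(t**2) / (2.0 * (n + t)))


def corollary1_bound(n, eps, K):
    """Closed form 2 K exp{-n eps^2 / (2 K (K + eps))}."""
    return 2.0 * K * math.exp(-n * eps**2 / (2.0 * K * (K + eps)))


if __name__ == "__main__":
    n, eps, K = 500, 0.2, 3
    via_union = K * theorem1_bound(n, n * eps / K)  # K terms, each with t = n eps / K
    print("union bound:", via_union)
    print("closed form:", corollary1_bound(n, eps, K))
```

The two printed numbers agree up to floating-point rounding, which is the identity $t^2/(2(n+t))=n\epsilon^2/(2K(K+\epsilon))$ at $t=n\epsilon/K$.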

## 3 Extension to Grouped Observations

We now extend our result to a setup where the observations are grouped into different classes. In fact, this is the setup that can be directly applied to the community detection literature, for example, Theorem 2 in Choi et al. (2012) and Theorem 2 in Paul and Chen (2016). We will also apply the result in a working paper by the author and collaborators on the theory of hub models, a special latent class model for binary data proposed by Zhao and Weko (2019).

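Let $X^{(i)}_j$, $i=1,\dots,I$, $j=1,\dots,n_i$, be independent Bernoulli variables, where $p^{(i)}_j$ is the parameter for $X^{(i)}_j$. Let $n=\sum_{i=1}^{I}n_i$. And let $\bar p^{(i)}=\frac{1}{n_i}\sum_{j=1}^{n_i}p^{(i)}_j$ for $i=1,\dots,I$.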

###### Theorem 2.

For all $t>0$,

$$P\left(\left|\sum_{i=1}^{I}\sum_{j=1}^{n_i}(X^{(i)}_j-p^{(i)}_j)\log\bar p^{(i)}\right|\ge t\right)\le 2\exp\left\{-\frac{t^2}{2(n+t)}\right\}. \tag{5}$$

Note that here the model assumption on $X^{(i)}_j$ is identical to the setup in Section 2, where each Bernoulli variable has its own parameter. The function we consider in the inequality is, however, defined differently. Moreover, this theorem reduces to Theorem 1 when $n_i=1$ for $i=1,\dots,I$.

###### Proof.

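Let $Z^{(i)}=\sum_{j=1}^{n_i}(X^{(i)}_j-p^{(i)}_j)\log\bar p^{(i)}$. Consider the moment generating function for $Z^{(i)}$:

$$\begin{aligned}E[e^{\lambda Z^{(i)}}]&=\prod_{j=1}^{n_i}\left(p^{(i)}_je^{\lambda(1-\bar p^{(i)})\log\bar p^{(i)}}+(1-p^{(i)}_j)e^{-\lambda\bar p^{(i)}\log\bar p^{(i)}}\right)\\&\le\left(\bar p^{(i)}e^{\lambda(1-\bar p^{(i)})\log\bar p^{(i)}}+(1-\bar p^{(i)})e^{-\lambda\bar p^{(i)}\log\bar p^{(i)}}\right)^{n_i}=\left(G(\bar p^{(i)},\lambda)\right)^{n_i},\end{aligned}$$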

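where the inequality follows from the inequality of arithmetic and geometric means:

$$\prod_{j=1}^{n}x_j\le\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)^{n}$$

for non-negative $x_1,\dots,x_n$. From (4), $G(\bar p^{(i)},\lambda)\le\exp\{\lambda^2/(2(1-|\lambda|))\}$ for $|\lambda|<1$. It follows that $E[e^{\lambda Z^{(i)}}]\le\exp\{n_i\lambda^2/(2(1-|\lambda|))\}$. The inequality also holds for $\bar p^{(i)}=0$ or 1 as $Z^{(i)}=0$ almost surely. The rest of the proof follows from the standard argument using the Chernoff bound as shown in the proof of Theorem 1. ∎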
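The arithmetic-geometric mean step can be checked numerically for a single group. In the sketch below, which is an illustration under assumed parameter values with hypothetical function names, `grouped_mgf` computes the exact moment generating function of $Z=\sum_j(X_j-p_j)\log\bar p$ for independent $X_j\sim\mathrm{Ber}(p_j)$ with mean parameter $\bar p$, and compares it with $G(\bar p,\lambda)^{n_i}$.

```python
import math
import random


def mgf(p, lam):
    """G(p, lam) = p e^{lam (1-p) log p} + (1-p) e^{-lam p log p}, 0 < p < 1."""
    log_p = math.log(p)
    return p * math.exp(lam * (1.0 - p) * log_p) + (1.0 - p) * math.exp(
        -lam * p * log_p
    )


def grouped_mgf(ps, lam):
    """Exact E[exp(lam Z)] for Z = sum_j (X_j - p_j) log(p_bar)."""
    log_p_bar = math.log(sum(ps) / len(ps))
    out = 1.0
    for p in ps:
        # MGF of the single term (X_j - p_j) log(p_bar) with X_j ~ Ber(p_j)
        out *= p * math.exp(lam * (1.0 - p) * log_p_bar) + (1.0 - p) * math.exp(
            -lam * p * log_p_bar
        )
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    ps = [rng.uniform(1e-6, 1.0 - 1e-6) for _ in range(25)]
    p_bar = sum(ps) / len(ps)
    for lam in (-0.8, -0.3, 0.3, 0.8):
        print(f"lambda={lam:+.1f}  grouped={grouped_mgf(ps, lam):.6f}  "
              f"G(p_bar)^n={mgf(p_bar, lam) ** len(ps):.6f}")
```

When all $p_j$ are equal the two quantities coincide. The gap widens as the parameters within a group become more heterogeneous.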

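We conclude this note with a corollary that is easily proved by the same argument as for Corollary 1. Let $X^{(i)}_j$, $i=1,\dots,I$, $j=1,\dots,n_i$, be independent multinoulli variables, where each $X^{(i)}_j=(X^{(i)}_{j1},\dots,X^{(i)}_{jK})$, and $P(X^{(i)}_{jk}=1)=p^{(i)}_{jk}$ for $k=1,\dots,K$. As before, let $n=\sum_{i=1}^{I}n_i$. And let $\bar p^{(i)}_k=\frac{1}{n_i}\sum_{j=1}^{n_i}p^{(i)}_{jk}$ for $i=1,\dots,I$ and $k=1,\dots,K$.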

###### Corollary 2.

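For all $\epsilon>0$,

$$P\left(\frac{1}{n}\left|\sum_{i=1}^{I}\sum_{j=1}^{n_i}\sum_{k=1}^{K}(X^{(i)}_{jk}-p^{(i)}_{jk})\log\bar p^{(i)}_k\right|\ge\epsilon\right)\le 2K\exp\left\{-\frac{n\epsilon^2}{2K(K+\epsilon)}\right\}.$$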

## Acknowledgements

This research was supported by the National Science Foundation grant DMS-1840203.

## References

• Choi et al. (2012) Choi, D. S., Wolfe, P. J., and Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273–284.
• Dubhashi and Panconesi (2009) Dubhashi, D. P. and Panconesi, A. (2009). Concentration of measure for the analysis of randomized algorithms. Cambridge University Press.
• Hoeffding (1963) Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
• Paul and Chen (2016) Paul, S. and Chen, Y. (2016). Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel. Electronic Journal of Statistics, 10(2):3807–3870.
• Robbins (1955) Robbins, H. (1955). A remark on Stirling's formula. The American Mathematical Monthly, 62(1):26–29.
• Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
• Zhao and Weko (2019) Zhao, Y. and Weko, C. (2019). Network inference from grouped observations using hub models. Statistica Sinica, 29(1):225–244.