Let $X_1,\dots,X_n$ be independent Bernoulli random variables, where
$X_i$ takes the value 1 with probability $p_i$, denoted by $\mathrm{Ber}(p_i)$. We are interested in deriving a concentration bound for the joint log-likelihood function of $X_1,\dots,X_n$ that decays exponentially and is independent of the parameters $p_1,\dots,p_n$. That is,
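$$P\left(\left|\sum_{i=1}^n \bigl(X_i\log p_i + (1-X_i)\log(1-p_i)\bigr) - \sum_{i=1}^n \bigl(p_i\log p_i + (1-p_i)\log(1-p_i)\bigr)\right| \ge n\varepsilon\right) \le C_1 e^{-C_2 n}, \tag{1}$$
where $C_1$ and $C_2$ are constants that only depend on $\varepsilon$.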
This research is motivated by theoretical studies of likelihood-based methods for binary data, in particular likelihood-based methods for community detection in networks. For example, Theorem 2 in Choi et al. (2012) relies on an inequality of this type, as does Theorem 2 in Paul and Chen (2016).
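We begin with classical results. By symmetry, we only consider $\sum_{i=1}^n (X_i\log p_i - p_i\log p_i)$.
Since $X_i\log p_i = 0$ almost surely when $p_i = 1$ or $0$ (using the convention $0\log 0 = 0$), the corresponding term can be dropped. Without loss of generality, assume $p_i\in(0,1)$ for $i=1,\dots,n$. Noticing that $\log p_i \le X_i\log p_i \le 0$ for $i=1,\dots,n$, we have Hoeffding’s inequality (Hoeffding, 1963): for all $\varepsilon>0$,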
Let $p_{\min}$ be the smallest value among $p_1,\dots,p_n$. Then $|X_i\log p_i - p_i\log p_i| \le |\log p_{\min}|$ for $i=1,\dots,n$. Bernstein’s inequality (see Dubhashi and Panconesi (2009), Theorem 1.2) gives: for all $\varepsilon>0$,
Note that both inequalities depend on $p_1,\dots,p_n$. As a result, when $p_{\min}$ goes to 0 fast enough as $n$ grows, the bounds can become trivial due to the divergence of $|\log p_{\min}|$. When applying these inequalities, technical assumptions are therefore needed to control the rate at which the parameters approach the boundary, for example, the condition imposed in Theorem 2 of Choi et al. (2012).
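To make the boundary problem concrete, the following sketch evaluates a Hoeffding bound in the special case where all parameters share a common value $p$, an assumption made here only for illustration. Since each $X_i\log p$ lies in $[\log p, 0]$, Hoeffding's inequality gives the two-sided bound $2\exp(-2n\varepsilon^2/(\log p)^2)$ on the probability that the centered sum deviates by at least $n\varepsilon$. The function name `hoeffding_bound` and its parameterization by $\log p$ are hypothetical choices, not taken from the note.

```python
import math

def hoeffding_bound(n, log_p, eps):
    """Two-sided Hoeffding bound for sum_i (X_i - p) * log(p), X_i ~ Ber(p) i.i.d.

    Each summand X_i * log(p) lies in [log p, 0], so
    P(|sum| >= n * eps) <= 2 * exp(-2 * n * eps**2 / log(p)**2).
    The parameter is passed as log(p) to avoid underflow for tiny p.
    The result is capped at 1, since larger values carry no information.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / log_p ** 2))

eps = 0.1
for n in (100, 1000, 10000):
    fixed = hoeffding_bound(n, math.log(0.5), eps)   # p held fixed at 1/2
    shrinking = hoeffding_bound(n, -float(n), eps)   # p = exp(-n), shrinking with n
    print(n, fixed, shrinking)
```

With $p$ fixed the bound decays exponentially in $n$, whereas with $p = e^{-n}$ the exponent tends to 0 and the capped bound stays at 1, which is the trivial behaviour described above.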
In this note, we prove a Bernstein-type inequality whose bound is independent of $p_1,\dots,p_n$. In other words, we show that the log-likelihood is in fact well-behaved when the parameters are near the boundary. Results such as those in Choi et al. (2012) and Paul and Chen (2016) can therefore be strengthened by removing the technical assumptions. The new inequality is particularly useful in cases where those assumptions are inconvenient to make.
2 Main Result
Let $X_i$ be independent $\mathrm{Ber}(p_i)$ for $i=1,\dots,n$, where $p_i\in[0,1]$. For all $\varepsilon>0$,
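Let $Y_i = X_i\log p_i - p_i\log p_i$. Let $M_{Y_i}(\lambda)$ be the moment generating function of $Y_i$, which is
$$M_{Y_i}(\lambda) = \mathbb{E}\bigl[e^{\lambda Y_i}\bigr] = p_i\, e^{\lambda(1-p_i)\log p_i} + (1-p_i)\, e^{-\lambda p_i\log p_i}.$$
The key step is to derive an exponential upper bound for $M_{Y_i}(\lambda)$ that is independent of $p_i$. First consider the case where $p_i\in(0,1)$.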
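We prove Bernstein’s condition (see Wainwright (2019), p. 27 for an introduction) for the moments of $Y_i$. That is, we find constants $\sigma^2$ and $b$ such that
$$\bigl|\mathbb{E}[Y_i^k]\bigr| \le \tfrac{1}{2}\, k!\, \sigma^2 b^{k-2}, \qquad k = 2, 3, \dots \tag{3}$$
Unlike in Wainwright (2019), here we look for constants $\sigma^2$ and $b$ that are independent of $p_i$.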
By taking the first and the second derivatives of , one can easily check that its optimum is achieved at . Therefore,
where the last inequality follows from Stirling’s formula (Robbins, 1955). Similarly,
It follows that
Therefore, the Bernstein’s condition (3) holds when and .
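The calculus and Stirling steps can be sanity-checked numerically. The sketch below assumes that the function being maximized is $f_k(p) = p\,|\log p|^k$ on $(0,1)$, a form chosen for illustration that may differ from the exact expression used in the proof. For this $f_k$, the derivative $(-\log p)^{k-1}(-\log p - k)$ vanishes at $p = e^{-k}$, where $f_k$ equals $(k/e)^k$, and the Robbins (1955) form of Stirling's formula gives $(k/e)^k \le k!/\sqrt{2\pi k}$. All function names are hypothetical.

```python
import math

def f(p, k):
    """Illustrative moment term p * |log p|**k for p in (0, 1)."""
    return p * (-math.log(p)) ** k

def grid_argmax(k, grid_size=30000):
    """Locate the maximizer of f(., k) on a grid where -log(p) runs over (0, 3k)."""
    best_p, best_val = None, -1.0
    for j in range(1, grid_size):
        p = math.exp(-3.0 * k * j / grid_size)
        val = f(p, k)
        if val > best_val:
            best_p, best_val = p, val
    return best_p, best_val

for k in range(2, 8):
    p_star, max_val = grid_argmax(k)
    closed_form = (k / math.e) ** k                              # f(e^{-k}, k)
    stirling = math.factorial(k) / math.sqrt(2 * math.pi * k)    # Robbins upper bound
    print(k, -math.log(p_star), max_val, closed_form, stirling)
```

Each printed row shows the grid maximizer sitting at $-\log p \approx k$, with the maximum matching $(k/e)^k$ and staying below the Stirling-type bound.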
We now use Bernstein’s condition to derive an upper bound for $M_{Y_i}(\lambda)$. The argument is similar to that of Wainwright (2019), pp. 27-28; we give the details for completeness.
By the power series expansion of the exponential function and Fubini’s theorem (for exchanging the expectation and summation),
where the inequality follows from Bernstein’s condition (3) and $\mathbb{E}[Y_i] = 0$. For any $|\lambda| < 1/b$, the geometric series converges, and
where the second inequality follows from $1 + x \le e^x$. Notice that $M_{Y_i}(\lambda) = 1$ for $p_i = 0$ or $1$, so the inequality holds for all $p_i\in[0,1]$.
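For reference, the chain of inequalities described above can be written out in generic form, following the argument in Wainwright (2019). The symbols are placeholders assumed for this sketch: $Y$ is a centered random variable, and Bernstein's condition is taken in the form $|\mathbb{E}[Y^k]| \le \tfrac12 k!\,\sigma^2 b^{k-2}$ for $k \ge 2$. The specific constants obtained in this note are not reproduced here.

```latex
\begin{align*}
\mathbb{E}\bigl[e^{\lambda Y}\bigr]
  &= 1 + \lambda\,\mathbb{E}[Y] + \sum_{k=2}^{\infty} \frac{\lambda^k\,\mathbb{E}[Y^k]}{k!} \\
  &\le 1 + \frac{\lambda^2\sigma^2}{2} \sum_{k=2}^{\infty} \bigl(|\lambda|\, b\bigr)^{k-2}
     && \text{(Bernstein's condition and } \mathbb{E}[Y] = 0\text{)} \\
  &= 1 + \frac{\lambda^2\sigma^2/2}{1 - b|\lambda|}
     && \text{(geometric series, } |\lambda| < 1/b\text{)} \\
  &\le \exp\!\left(\frac{\lambda^2\sigma^2/2}{1 - b|\lambda|}\right)
     && \text{(since } 1 + x \le e^x\text{)}.
\end{align*}
```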
The rest of the proof follows from a standard argument using the Chernoff bound, which can be found in a standard textbook on concentration inequalities, for example, Dubhashi and Panconesi (2009), Chapter 1. We give the details for readers who are unfamiliar with this technique. For ,
where the first inequality is Markov’s inequality and the second inequality follows from (4). By setting , we obtain
The bound for the right tail can be obtained similarly by setting . ∎
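The optimization over $\lambda$ in a Chernoff argument of this kind can be worked out for generic Bernstein constants as follows. This is a sketch under stated assumptions: $Y_1,\dots,Y_n$ are independent centered variables whose moment generating functions satisfy $\mathbb{E}[e^{\lambda Y_i}] \le \exp\bigl(\lambda^2\sigma^2/(2(1-b|\lambda|))\bigr)$ for $|\lambda| < 1/b$, and $\sigma^2$, $b$ and the choice of $\lambda$ are generic placeholders rather than the specific values used in this note.

```latex
\begin{align*}
\mathbb{P}\Bigl(\sum_{i=1}^n Y_i \le -n\varepsilon\Bigr)
  &= \mathbb{P}\Bigl(e^{\lambda \sum_{i} Y_i} \ge e^{-\lambda n\varepsilon}\Bigr),
     \qquad -1/b < \lambda < 0, \\
  &\le e^{\lambda n\varepsilon} \prod_{i=1}^n \mathbb{E}\bigl[e^{\lambda Y_i}\bigr]
     && \text{(Markov's inequality, independence)} \\
  &\le \exp\!\left( n\lambda\varepsilon + \frac{n\lambda^2\sigma^2}{2(1 - b|\lambda|)} \right).
\end{align*}
% Choose \lambda = -\varepsilon / (b\varepsilon + \sigma^2), so that
% 1 - b|\lambda| = \sigma^2 / (b\varepsilon + \sigma^2). The exponent becomes
\[
  -\frac{n\varepsilon^2}{b\varepsilon + \sigma^2}
  + \frac{n\varepsilon^2}{2(b\varepsilon + \sigma^2)}
  = -\frac{n\varepsilon^2}{2(\sigma^2 + b\varepsilon)} .
\]
```

The right tail follows in the same way with $\lambda = \varepsilon/(b\varepsilon + \sigma^2) > 0$.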
is dominated by the term , which has a bump near the boundary, – that is, its value achieves the order of at . This value is, however, still bounded by , which implies the left-tail bound of is well-behaved when the parameters are near the boundary.
For , , which implies cannot be bounded by any function that takes finite values. For , .
The result is obvious by noticing that . ∎
We now prove (1). We state a slightly more general result for multinoulli variables. Let be a multinoulli variable with , and assume are independent.
For , , , and all ,
The result is obvious by noticing that
and setting in (2). ∎
3 Extension to Grouped Observations
We now extend our result to a setup where the observations are grouped into different classes. This setup applies directly to the community detection literature, for example, Theorem 2 in Choi et al. (2012) and Theorem 2 in Paul and Chen (2016). We will also apply the result in a working paper by the author and collaborators on the theory of hub models, a special latent class model for binary data proposed by Zhao and Weko (2019).
Let be independent Bernoulli variables, where is the parameter for . Let . And let for , where .
For all ,
Note that here the model assumption on is identical to the setup in Section 2, where each Bernoulli variable has its own parameter. The function we consider in the inequality is, however, defined differently. Moreover, this theorem reduces to Theorem 1 when for .
Let . Consider the moment generating function for .
where the inequality follows from the inequality of arithmetic and geometric means: for non-negative . From (4), . It follows that . The inequality also holds for or 1 as . The rest of the proof follows from the standard argument using the Chernoff bound, as shown in the proof of Theorem 1. ∎
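The arithmetic-geometric mean step can be checked numerically in a simplified setting. The sketch below assumes the product form of the inequality, $\prod_{i=1}^m x_i \le \bigl(\tfrac1m\sum_{i=1}^m x_i\bigr)^m$ for non-negative $x_i$, applied to the factors $x_i = 1 - p_i + p_i e^{t}$, which are the moment generating functions of $\mathrm{Ber}(p_i)$ variables. Because these factors are linear in $p_i$, their arithmetic mean is the same factor evaluated at the average parameter. The function names and the exact factors are illustrative assumptions, not the quantities used in the proof.

```python
import math
import random

def bernoulli_mgf(p, t):
    """MGF of a Ber(p) variable at t: E[exp(t * X)] = 1 - p + p * exp(t)."""
    return 1.0 - p + p * math.exp(t)

def product_of_mgfs(ps, t):
    """MGF of a sum of independent Ber(p_i) variables (product of the factors)."""
    return math.prod(bernoulli_mgf(p, t) for p in ps)

def mgf_at_average(ps, t):
    """AM-GM upper bound: the MGF of Ber(p_bar) raised to the group size."""
    p_bar = sum(ps) / len(ps)
    return bernoulli_mgf(p_bar, t) ** len(ps)

rng = random.Random(0)                      # fixed seed for reproducibility
ps = [rng.random() for _ in range(20)]      # heterogeneous parameters in one group
for t in (-2.0, -0.5, 0.5, 2.0):
    print(t, product_of_mgfs(ps, t), mgf_at_average(ps, t))
```

In each row the product of the individual factors does not exceed the factor at the averaged parameter raised to the group size, with equality when all $p_i$ coincide.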
We conclude this note with a corollary that is easily proved by the same argument for Corollary 1. Let be independent multinoulli variables, where each , and for . As before, let . And let for and , where .
For all ,
This research was supported by the National Science Foundation grant DMS-1840203.
- Choi, D. S., Wolfe, P. J., and Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273–284.
- Dubhashi, D. P. and Panconesi, A. (2009). Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press.
- Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
- Paul, S. and Chen, Y. (2016). Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel. Electronic Journal of Statistics, 10(2):3807–3870.
- Robbins, H. (1955). A remark on Stirling’s formula. The American Mathematical Monthly, 62(1):26–29.
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
- Zhao, Y. and Weko, C. (2019). Network inference from grouped observations using hub models. Statistica Sinica, 29(1):225–244.