 # Concentration Inequalities for Multinoulli Random Variables

We investigate concentration inequalities for Dirichlet and Multinomial random variables.


## 1 Problem Formulation

We analyse the concentration properties of the random variable defined as:

$$Z_n := \max_{v \in [0,D]^S} \left\{ (\hat{p}_n - p)^T v \right\} \qquad (1)$$

where $\hat{p}_n$ is a random vector (the empirical estimate of $p$ built from $n$ i.i.d. samples), $p \in \Delta^S$ is deterministic and $\Delta^S$ is the $S$-dimensional simplex. It is easy to show that the maximum in Eq. 1 is equivalent to computing the (scaled) $\ell_1$-norm of the vector $\hat{p}_n - p$:

$$Z_n = \frac{D}{2}\,\|\hat{p}_n - p\|_1 \qquad (2)$$

where we have used the fact that $\sum_{i=1}^S (\hat{p}_{n,i} - p_i) = 0$. As a consequence, $Z_n$ is a bounded random variable in $[0, D]$. While the following discussion applies to Dirichlet distributions as well, we focus on the multinomial case. The results previously available in the literature are summarized in the following.
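The equivalence between the maximum in Eq. 1 and the scaled $\ell_1$-norm in Eq. 2 can be checked numerically. A minimal sketch in NumPy (the distribution, $n$ and $D$ below are arbitrary choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
S, n, D = 5, 1000, 1.0

# True distribution p and empirical estimate hat_p built from n i.i.d. samples
p = rng.dirichlet(np.ones(S))
hat_p = rng.multinomial(n, p) / n

# Eq. 1: the maximizer sets v_i = D on the positive components of (hat_p - p)
# and v_i = 0 elsewhere
diff = hat_p - p
z_max = D * diff[diff > 0].sum()

# Eq. 2: the scaled l1-norm, using that the components of diff sum to 0
z_l1 = 0.5 * D * np.abs(diff).sum()

assert np.isclose(z_max, z_l1)
```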

The literature has analysed the concentration of the $\ell_1$-discrepancy between the true distribution and the empirical one in this setting.

###### Proposition 1.

(Weissman et al., 2003) Let $p \in \Delta^S$ and $\hat{p}_n$ be the empirical estimate built from $n$ i.i.d. samples. Then, for any $\delta \in (0,1)$ and $n \geq 1$:

$$P\left(\|\hat{p}_n - p\|_1 \geq \sqrt{\frac{2S\ln(2/\delta)}{n}}\right) \leq P\left(\|\hat{p}_n - p\|_1 \geq \sqrt{\frac{2\ln\!\left((2^S - 2)/\delta\right)}{n}}\right) \leq \delta \qquad (3)$$

This concentration inequality is at the core of the proof of UCRL, see (Jaksch et al., 2010, App. C.1). Another inequality is provided in (Devroye, 1983, Lem. 3).

###### Proposition 2.

(Devroye, 1983) Let $p \in \Delta^S$ and $\hat{p}_n$ be the empirical estimate built from $n$ i.i.d. samples. Then, for any $\delta \leq 3e^{-4S/5}$:

$$P\left(\|\hat{p}_n - p\|_1 \geq 5\sqrt{\frac{\ln(3/\delta)}{n}}\right) \leq \delta \qquad (4)$$

While Prop. 1 shows an explicit dependence on the dimension $S$ of the random variable, such dependence is hidden in Prop. 2 by the constraint on $\delta$. Note that for any $\delta \leq 3e^{-4S/5}$, $5\sqrt{\ln(3/\delta)/n} \geq \sqrt{20S/n}$. This shows that the $\ell_1$-deviation always scales proportionally to the dimension of the random variable, i.e., as $\sqrt{S/n}$.
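The $\sqrt{S/n}$ scaling of the $\ell_1$-deviation can be observed in simulation. A sketch with NumPy (sample sizes and dimensions are arbitrary choices): quadrupling $S$ should roughly double the average deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 2000, 500

def mean_l1_dev(S):
    """Average ||hat_p - p||_1 over repeated experiments, uniform p."""
    p = np.full(S, 1.0 / S)
    devs = [np.abs(rng.multinomial(n, p) / n - p).sum() for _ in range(trials)]
    return np.mean(devs)

d_small, d_large = mean_l1_dev(25), mean_l1_dev(100)
ratio = d_large / d_small  # expected close to sqrt(100/25) = 2
```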

A better inequality. The natural question is whether it is possible to derive a concentration inequality independent from the dimension of $\hat{p}_n$ by exploiting the correlation between $\hat{p}_n - p$ and the maximizer vector $v$. This question has been recently addressed in (Agrawal and Jia, 2017, Lem. C.2):

###### Lemma 3.

(Agrawal and Jia, 2017) Let $p \in \Delta^S$ and $\hat{p}_n$ be the empirical estimate built from $n$ i.i.d. samples. Then, for any $\delta \in (0,1)$:

$$P\left(\|\hat{p}_n - p\|_1 \geq \sqrt{\frac{2\ln(1/\delta)}{n}}\right) \leq \delta$$

Their result resembles the one in Prop. 2 but removes the constraint on $\delta$. As a consequence, the implicit or explicit dependence on the dimension is removed. In the following, we will show that Lem. 3 may not be correct.
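The claim of Lem. 3 can be probed numerically. The sketch below (uniform $p$; the values of $S$, $n$, and $\delta$ are assumed for illustration) estimates how often the dimension-free threshold is exceeded; for large $S$ the empirical violation frequency far exceeds $\delta$:

```python
import numpy as np

rng = np.random.default_rng(2)
S, n, trials, delta = 100, 1000, 400, 0.05

# Dimension-free threshold claimed by Lem. 3
threshold = np.sqrt(2 * np.log(1 / delta) / n)

p = np.full(S, 1.0 / S)
violations = 0
for _ in range(trials):
    hat_p = rng.multinomial(n, p) / n
    if np.abs(hat_p - p).sum() >= threshold:
        violations += 1

violation_rate = violations / trials  # far above delta when S is large
```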

## 2 Theoretical Analysis (the asymptotic case)

In this section, we provide a counter-argument to Lem. 3 in the asymptotic regime (i.e., $n \to +\infty$). The overall idea is to show that the expected value of $Z_n$ asymptotically grows as $\sqrt{S/n}$ and that $Z_n$ itself is well concentrated around its expectation. As a result, we can deduce that all quantiles of $Z_n$ grow as $\sqrt{S/n}$ as well.

We consider the true vector $p$ to be uniform, i.e., $p_i = 1/S$ for all $i \in [S]$, and $D = 1$.¹ The following lemma provides a characterization of the limit variable $Z_S := \lim_{n\to+\infty} \sqrt{n}\,Z_n$.

¹The analysis also holds in the Dirichlet case, see (Osband and Roy, 2017).

###### Lemma 4.

Consider $p_i = 1/S$ for all $i \in [S]$, $D = 1$, and let $\hat{p}_n$ be the empirical estimate built from $n$ i.i.d. samples drawn from $p$, i.e., the uniform distribution on $[S]$. Let $e_S$ be the vector of ones of dimension $S$. Define $Y \sim \mathcal{N}\!\left(0,\, I_S - \frac{1}{S-1}N\right)$, where $N$ is the matrix with $0$ in all the diagonal entries and $1$ elsewhere, and $Y^+ := \max\{Y, 0\}$ componentwise. Then:

$$Z_S := \lim_{n\to+\infty} \sqrt{n}\,Z_n \sim \|Y^+\|_1\, D\sqrt{\frac{S-1}{S^2}}.$$

Furthermore,

$$\mathbb{E}[Z_S] = \sqrt{\frac{S-1}{S^2}}\cdot \mathbb{E}\left[\sum_{i=1}^S Y_i^+\right] = \sqrt{S-1}\cdot \mathbb{E}[Y_1^+] = \sqrt{\frac{S-1}{2\pi}}.$$

While the previous lemma may already suggest that $Z_S$ should grow as $\sqrt{S}$ like its expectation, it is still possible that a large part of the distribution is concentrated around a value independent from $S$, with limited probability assigned to, e.g., values growing linearly with $S$, which could justify the $\sqrt{S}$ growth of the expectation. Thus, in order to conclude the analysis, we need to show that $Z_S$ is concentrated "enough" around its expectation.

Since the random variables $Y_i$ are correlated, it is complicated to directly analyze the deviation of $\|Y^+\|_1$ from its mean. Thus we first apply an orthogonal transformation to $Y$ to obtain independent r.v.s (recall that jointly normally distributed variables are independent if uncorrelated).

###### Lemma 5.

Consider the same setting of Lem. 4 and recall that $Y \sim \mathcal{N}\!\left(0,\, I_S - \frac{1}{S-1}N\right)$. There exists an orthogonal transformation $U \in \mathbb{R}^{S\times S}$ s.t.

$$W = \sqrt{\frac{S-1}{S}}\,U Y \sim \mathcal{N}\left(0, \begin{bmatrix} I_{S-1} & 0 \\ 0 & 0 \end{bmatrix}\right).$$

By exploiting the transformation $U$ we can write $Z_S \sim g(W)$ with $g(x) := \frac{1}{\sqrt{S}}\,e_S^T (U^T x)^+$. Since the first $S-1$ components $W_i$ are i.i.d. standard Gaussian random variables and $g$ is $1$-Lipschitz, we can finally characterize the mean and the deviations of $Z_S$ and derive the following anticoncentration inequality.
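The diagonalization in Lem. 5 can be verified numerically. A sketch (note that `eigh` orders eigenvalues ascending, so the zero eigenvalue appears in the first coordinate rather than the last as in the statement):

```python
import numpy as np

S = 10

# Covariance of Y: I_S - N/(S-1), with N the all-ones matrix minus the identity
N = np.ones((S, S)) - np.eye(S)
cov_Y = np.eye(S) - N / (S - 1)

# Eigenvalues: S/(S-1) with multiplicity S-1, plus a single zero eigenvalue
eigvals, U = np.linalg.eigh(cov_Y)
assert np.allclose(eigvals, [0.0] + [S / (S - 1)] * (S - 1))

# W = sqrt((S-1)/S) U^T Y has covariance diag(0, I_{S-1}) (zero coordinate first
# because eigh sorts ascending; eigh returns eigenvectors as columns of U)
cov_W = (S - 1) / S * U.T @ cov_Y @ U
expected = np.diag([0.0] + [1.0] * (S - 1))
assert np.allclose(cov_W, expected, atol=1e-10)
```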

###### Theorem 6.

Let $p_i = 1/S$ for all $i \in [S]$ and $D = 1$. Define $Z_S := \lim_{n\to+\infty}\sqrt{n}\,Z_n$ and recall that $\mathbb{E}[Z_S] = \sqrt{\frac{S-1}{2\pi}}$. Then, for any $\delta \in (0,1)$:

$$P\left(Z_S \geq \sqrt{\frac{S-1}{2\pi}} - \sqrt{2\ln(2/\delta)}\right) \geq 1 - \delta.$$

This result shows that every quantile of $Z_S$ depends on the dimension of the random variable, i.e., grows as $\sqrt{S}$. Similarly to Prop. 2, it is possible to lower bound the quantile by a dimension-free quantity at the price of having an exponential dependence on $S$ in $\delta$.

## Appendix A Proof for the asymptotic scenario

In this section we report the proofs of the lemmas and the theorem stated in Sec. 2.

### A.1 Proof of Lem. 4

Let $X_j \in \{0,1\}^S$ be the one-hot encoding of the $j$-th sample and $Y_{n,i} := \frac{1}{\sqrt{n}}\sqrt{\frac{S^2}{S-1}}\sum_{j=1}^n \left(X_{ji} - \frac{1}{S}\right)$. Then:

$$\sqrt{n}\,Z_n = \sqrt{n}\max_{v\in[0,D]^S}(\hat{p}_n - p)^T v = \sqrt{n}\max_{v\in[0,D]^S}\sum_{i=1}^S \frac{v_i}{n}\sum_{j=1}^n\left(X_{ji} - \frac{1}{S}\right) = \max_{v\in[0,D]^S}\sum_{i=1}^S Y_{n,i}\, v_i \sqrt{\frac{S-1}{S^2}} = D\sqrt{\frac{S-1}{S^2}}\cdot e_S^T Y_n^+,$$

where we used the fact that the maximizing $v$ takes the largest value $D$ for all positive components of $Y_n$ and is equal to $0$ otherwise. We recall that the covariance of the normalized multinoulli variable with probabilities $p_i = 1/S$ is $I_S - \frac{1}{S-1}N$. As a result, a direct application of the central limit theorem gives $Y_n \xrightarrow{d} Y \sim \mathcal{N}\!\left(0,\, I_S - \frac{1}{S-1}N\right)$. Then we can apply the continuous mapping theorem and obtain $\sqrt{n}\,Z_n \xrightarrow{d} D\sqrt{\frac{S-1}{S^2}}\,\|Y^+\|_1$, where $Y^+$ is the random vector obtained by truncating $Y$ from below at $0$. Since the marginal distribution of each random variable $Y_i$ is $\mathcal{N}(0,1)$, i.e., the $Y_i^+$ are identically distributed (see definition in Lem. 4), $Y_1^+$ has a distribution composed by a Dirac distribution in $0$ (with mass $1/2$) and a half-normal distribution, and its expected value is $\mathbb{E}[Y_1^+] = \frac{1}{\sqrt{2\pi}}$, while linearity of the expectation leads to the final statement on $\mathbb{E}[Z_S]$.

### A.2 Proof of Lem. 5

Denote by $\lambda(A)$ the set of eigenvalues of a square matrix $A$. Let $B = [\,0_{S\times(S-1)},\, e_S\,] \in \mathbb{R}^{S\times S}$, so that $BB^T = e_S e_S^T$, where $0_{S\times(S-1)}$ is a matrix full of zeros. Then, we can write the eigenvalues of the covariance matrix of $Y$ as

$$\lambda\left(I_S - \frac{1}{S-1}N\right) = \lambda\left(\frac{S}{S-1}I_S - \frac{1}{S-1}e_S e_S^T\right) = \lambda\left(\frac{S}{S-1}I_S - \frac{1}{S-1}BB^T\right) = \frac{S}{S-1}\,\lambda\left(I_S - \frac{1}{S}B^T B\right) = \frac{S}{S-1}\,\lambda\left(I_S - \begin{bmatrix} 0_{S-1} & 0 \\ 0 & 1 \end{bmatrix}\right),$$

where we use the fact that $N = e_S e_S^T - I_S$ and that $\lambda(BB^T) = \lambda(B^T B)$. As a result, the covariance of $Y$ has one eigenvalue at $0$ and eigenvalues equal to $\frac{S}{S-1}$ with multiplicity $S-1$. Hence, we can diagonalize it with an orthogonal matrix $U$ (obtained using the normalized eigenvectors) and obtain

$$U\left(I_S - \frac{1}{S-1}N\right)U^T = \begin{bmatrix} \frac{S}{S-1} I_{S-1} & 0 \\ 0 & 0 \end{bmatrix}.$$

Define $W = \sqrt{\frac{S-1}{S}}\,UY$, then:

$$\mathrm{Cov}(W, W) = \frac{S-1}{S}\,\mathrm{Cov}(UY, UY) = \frac{S-1}{S}\,U\,\mathrm{Cov}(Y, Y)\,U^T = \frac{S-1}{S}\,U\left(I_S - \frac{1}{S-1}N\right)U^T = \begin{bmatrix} I_{S-1} & 0 \\ 0 & 0 \end{bmatrix}.$$

Thus $W \sim \mathcal{N}\left(0, \begin{bmatrix} I_{S-1} & 0 \\ 0 & 0 \end{bmatrix}\right)$.

### A.3 Proof of Thm. 6

By exploiting Lem. 4 and Lem. 5 we can write:

$$Z_S \sim e_S^T Y^+ \cdot \sqrt{\frac{S-1}{S^2}} = e_S^T\left(\sqrt{\frac{S}{S-1}}\,U^T W\right)^+ \cdot \sqrt{\frac{S-1}{S^2}} = e_S^T\left(U^T W\right)^+ \cdot \frac{1}{\sqrt{S}}.$$

Let $g(x) := \frac{1}{\sqrt{S}}\,e_S^T (U^T x)^+$. Then $g$ is $1$-Lipschitz:

$$|g(x) - g(y)| \leq \mathrm{Lip}(e_S^T\,\cdot)\,\mathrm{Lip}(U^T\,\cdot)\,\mathrm{Lip}\!\left((\cdot)^+\right)\frac{1}{\sqrt{S}}\,\|x - y\|_2 = \sqrt{S}\cdot 1 \cdot 1 \cdot \frac{1}{\sqrt{S}}\,\|x - y\|_2 = \|x - y\|_2,$$

where $\mathrm{Lip}(\cdot)$ denotes the Lipschitz constant of a function and we exploit the fact that $U$ is an orthonormal matrix.

We can now study the concentration of the variable $Z_S = g(W)$. Given that $W$ is a vector of i.i.d. standard Gaussian variables (note that we can drop the last component of $W$ since it is deterministically zero) and $g$ is $1$-Lipschitz, we can use (Wainwright, 2017, Thm. 2.4) to prove that for all $t \geq 0$:

$$P\left(Z_S \geq \mathbb{E}[Z_S] - t\right) \geq 1 - P\left(|Z_S - \mathbb{E}[Z_S]| \geq t\right) \geq 1 - 2e^{-t^2/2}.$$

Substituting the value of $\mathbb{E}[Z_S]$ and inverting the bound gives the desired statement.
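The Gaussian concentration step for Lipschitz functions can be illustrated empirically. The sketch below uses a generic orthonormal $U$ (from a QR decomposition, an assumption for illustration) and a full $S$-dimensional standard Gaussian $W$; dropping the degenerate last component does not change the phenomenon:

```python
import numpy as np

rng = np.random.default_rng(4)
S, samples, t = 50, 20000, 1.5

# Orthonormal U and the 1-Lipschitz map g(w) = e^T (U^T w)^+ / sqrt(S)
U, _ = np.linalg.qr(rng.standard_normal((S, S)))
W = rng.standard_normal((samples, S))
Z = np.clip(W @ U, 0, None).sum(axis=1) / np.sqrt(S)  # rows of W @ U are U^T w

# Empirical two-sided tail vs. the sub-Gaussian bound 2 exp(-t^2 / 2)
tail = np.mean(np.abs(Z - Z.mean()) >= t)
bound = 2 * np.exp(-t**2 / 2)
```

In practice the empirical tail is far below the bound, which is loose but dimension-free in $t$.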

## References

• Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.
• Luc Devroye. The equivalence of weak, strong and complete convergence in $L_1$ for kernel density estimates. The Annals of Statistics, 11(3):896–904, 1983.
• Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
• Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.
• Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. 2017.
• Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the $\ell_1$ deviation of the empirical distribution. Technical Report HPL-2003-97R1, Hewlett-Packard Labs, 2003.