# Near-optimal Reinforcement Learning using Bayesian Quantiles

We study model-based reinforcement learning in finite communicating Markov Decision Processes (MDPs). Algorithms in this setting have been developed along two lines. The first, which typically provides frequentist performance guarantees, uses optimism in the face of uncertainty as the guiding algorithmic principle. The second is based on Bayesian reasoning, combined with posterior sampling and Bayesian guarantees. In this paper, we develop a conceptually simple algorithm, Bayes-UCRL, that combines the benefits of both approaches to achieve state-of-the-art performance for finite communicating MDPs. In particular, we use a Bayesian prior, as in posterior sampling. However, instead of sampling the MDP, we construct an optimistic MDP using the quantiles of the Bayesian posterior. We show that this technique enjoys a high-probability worst-case regret of order $\tilde{O}(\sqrt{DSAT})$. Experiments in a diverse set of environments show that our algorithm outperforms previous methods.

## 1 Introduction

The Markov Decision Process (MDP) is a framework of central importance in computer science. Indeed, MDPs generalize (stochastic) shortest path problems and can thus be used for routing problems (Psaraftis et al., 2016) and for scheduling and resource allocation problems (Gocgun et al., 2011). One of their most successful applications is in reinforcement learning, where they have been used to achieve human-level performance in a variety of games such as Go (Silver et al., 2017b) and chess (Silver et al., 2017a). MDPs also generalize online learning problems (such as multi-armed bandit problems) and as such have been used for online advertisement (Lu et al., 2009) and movie recommendations (Qin et al., 2014).

#### Problem Formulation

In this paper, we focus on the problem of online learning of a near-optimal policy for an unknown Markov Decision Process. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a$ at state $s$, one receives a real-valued reward $r$ and then transits to a next state $s'$. The reward is generated from a fixed reward distribution depending only on $(s,a)$ and, similarly, the next state is generated from a fixed transition distribution depending only on $(s,a)$. The objective is to maximize the accumulated (and undiscounted) rewards after $T$ interactions. An MDP is characterized by a quantity (called $D$) known as the diameter. It indicates an upper bound on the expected shortest path from any state to any other state. When this diameter (formally defined in Definition 1) is finite, the MDP is called communicating.

###### Definition 1 (Diameter of an MDP).

The diameter of an MDP is defined as the minimum number of rounds needed to go from one state and reach any other state while acting using some deterministic policy. Formally,

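$$D(M) = \max_{s \neq s',\; s, s' \in \mathcal{S}} \;\min_{\pi: \mathcal{S} \to \mathcal{A}} T(s' \mid s, \pi)$$

where $T(s' \mid s, \pi)$ is the expected number of rounds it takes to reach state $s'$ from $s$ using policy $\pi$.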
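As an illustration of Definition 1, the following Python sketch computes the diameter of a small, fully known MDP. It is not part of the paper's algorithm (where the MDP and its diameter are unknown). The function name `diameter`, the nested-list layout `P[s][a][next_state]`, and the use of plain value iteration on the hitting-time equations are all assumptions made for this example.

```python
def diameter(P, tol=1e-10, max_iter=100_000):
    """Diameter of a known MDP with at least two states.

    P[s][a][nxt] is the probability of moving from state s to state nxt
    under action a (hypothetical layout chosen for this sketch).
    """
    n_states = len(P)
    worst = 0.0
    for target in range(n_states):
        # h[s] approximates the minimal expected number of rounds to reach
        # `target` from s; it solves h(s) = 1 + min_a sum_nxt P * h(nxt).
        h = [0.0] * n_states
        for _ in range(max_iter):
            new_h = []
            for s in range(n_states):
                if s == target:
                    new_h.append(0.0)
                else:
                    best = min(
                        sum(prob * h[nxt] for nxt, prob in enumerate(dist))
                        for dist in P[s]
                    )
                    new_h.append(1.0 + best)
            change = max(abs(u - v) for u, v in zip(new_h, h))
            h = new_h
            if change < tol:
                break
        worst = max(worst, max(h[s] for s in range(n_states) if s != target))
    return worst


# Two states; action 0 stays put, action 1 moves to the other state w.p. 0.5.
P = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
print(diameter(P))
```

In this toy MDP the best policy always plays action 1, and reaching the other state then takes a geometric number of rounds with success probability 0.5, so the diameter is 2.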

In this paper, we consider the case where the reward distributions, the transitions, and the diameter $D$ are all unknown. Given that the rewards are undiscounted, a good measure of performance is the gain, i.e. the infinite-horizon average reward. The gain of a policy $\pi$ starting from state $s$ is defined by:

$$V(s \mid \pi) \triangleq \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} r(s_t, \pi(s_t)) \,\middle|\, s_1 = s\right].$$

Puterman (2014) shows that there is a policy $\pi^*$ whose gain $V^*$ is at least as large as that of any other policy. In addition, this gain is the same for all states in a communicating MDP. We can then characterize the performance of the agent by its regret, defined as:

$$\mathrm{Regret}(T) \triangleq \sum_{t=1}^{T} \left(V^* - r(s_t, a_t)\right).$$

Thus our goal is equivalent to obtaining a regret as low as possible.
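The regret definition translates directly into code. The sketch below assumes the optimal gain $V^*$ is known (which is only the case in simulations); the helper name `cumulative_regret` is hypothetical.

```python
from itertools import accumulate


def cumulative_regret(optimal_gain, rewards):
    """Return [Regret(1), ..., Regret(T)] for a sequence of observed rewards."""
    return list(accumulate(optimal_gain - r for r in rewards))


# Regret can decrease when a single reward exceeds the optimal average gain.
print(cumulative_regret(0.5, [0.0, 0.5, 1.0]))
```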

#### Related Work

It has been shown that any algorithm must incur a regret of $\Omega(\sqrt{DSAT})$ in the worst case (Jaksch et al., 2010). Since the establishment of this lower bound on the regret, there have been numerous algorithms for the problem. They can be classified in two ways: frequentist and Bayesian. The frequentist algorithms usually construct explicit confidence intervals, while the Bayesian algorithms start with a prior distribution and use the posterior derived from Bayes' theorem. Following a long line of algorithms, including KL-UCRL (Filippi et al., 2010), REGAL.C (Bartlett & Tewari, 2009), UCBVI (Azar et al., 2017) and SCAL (Fruit et al., 2018), Tossou et al. (2019) derived a frequentist algorithm that achieves the lower bound up to logarithmic factors.

In contrast, the situation is different for Bayesian algorithms. One of the first works to prove theoretical guarantees for posterior sampling is Osband et al. (2013), for their PSRL algorithm. However, they only consider reinforcement learning problems with a finite and known episode length (informally, it is known that the MDP resets to a starting state after a fixed number of steps) and prove an upper bound of $\tilde{O}(\tau S \sqrt{AT})$ on the expected Bayesian regret, where $\tau$ is the length of the episode. Ouyang et al. (2017) generalise the results of Osband et al. (2013) to weakly communicating MDPs and prove a bound of $\tilde{O}(H S \sqrt{AT})$ on the expected Bayesian regret, where $H$ is a bound on the span of the MDP. Other Bayesian algorithms have also been derived in the literature; however, none of them attains the lower bound for the general communicating MDPs considered in this paper. Moreover, many of the previous Bayesian algorithms only provide guarantees on the Bayesian regret (i.e., the regret under the assumption that the true MDP is sampled from the prior). It was thus an open question whether one can design Bayesian algorithms with optimal worst-case regret guarantees (Osband & Van Roy, 2017, 2016).

In this work, we provide guarantees for the worst-case (frequentist) regret. We solve the challenge by designing the first Bayesian algorithm with a provable upper bound on the regret that matches the lower bound up to logarithmic factors. Our algorithm starts with a prior on MDPs and computes the posterior, similarly to previous works. However, instead of sampling from the posterior, we compute a quantile of the posterior. We then use all the MDPs possible under the quantile as a set of statistically plausible MDPs and follow the same steps as the state-of-the-art UCRL-V (Tossou et al., 2019). The idea of using quantiles has already been explored in the Bayes-UCB algorithm (Kaufmann et al., 2012) for multi-armed bandits (a special case of MDP with a single state). Our work can also be considered as a generalization of Bayes-UCB.

#### Our Contributions.

Hereby, we summarise the contributions of this paper that we elaborate in the upcoming sections.

• We provide a conceptually simple Bayesian algorithm BUCRL for reinforcement learning that achieves near-optimal worst case regret. Rather than actually sampling from the posterior distribution, we simply construct upper confidence bounds through Bayesian quantiles.

• Based on our analysis, we explain why Bayesian approaches often perform better than ones based on concentration inequalities.

• We perform experiments in a variety of environments that validate the theoretical bounds and show that BUCRL outperforms the state-of-the-art algorithms (Section 3).

We conclude by summarising the techniques involved in this paper and discussing the possible future works they can lead to (Section 4).

## 2 Algorithms Description and Analysis

In this section, we describe our Bayesian algorithm BUCRL. We combine Bayesian priors and posteriors with optimism in the face of uncertainty to achieve a high-probability upper bound of $\tilde{O}(\sqrt{DSAT})$ on the worst-case regret in any finite communicating MDP, where $\tilde{O}$ hides logarithmic factors. Our algorithm can be summarized as follows:

1. Consider a prior distribution over MDPs and update the posterior after each observation.

2. Construct a set of statistically plausible MDPs, namely the set of all MDPs inside a quantile of the posterior distribution.

3. Compute a policy (called optimistic) whose gain is the maximum among all MDPs in the plausible set. We use a modified extended value iteration algorithm derived in (Tossou et al., 2019).

4. Play the computed optimistic policy for an artificial episode that lasts until the average number of times state-action pairs have been doubled reaches 1. This is known as the extended doubling trick (Tossou et al., 2019).

There are multiple ways to define quantiles for an MDP (since an unknown MDP can be viewed as a multivariate random variable). In this paper, we adopt a specific definition of quantiles for multivariate random variables called marginal quantiles. More precisely,

###### Definition 2 (Marginal Quantile Babu & Rao (1989)).

Let $X = (X_1, \dots, X_m)$ be a multivariate random vector with joint d.f. (distribution function) $F$ and $i$-th marginal d.f. $F_i$. We denote the $i$-th marginal quantile function by:

$$Q_i(F, q) = \inf\{x : F_i(x) \ge q\}, \quad 0 \le q \le 1.$$

Unless otherwise specified, we will refer to marginal quantile as simply quantile. For univariate distributions, the subscript can be omitted, as the quantile and the marginal quantile coincide.
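To make Definition 2 concrete, here is a small Python sketch that evaluates the marginal quantile of an empirical distribution, i.e. with $F_i$ replaced by the empirical d.f. of the $i$-th coordinate of a sample. The function name `marginal_quantile` and the restriction to $0 < q \le 1$ are choices made for this example (at $q = 0$ the infimum is $-\infty$).

```python
import math


def marginal_quantile(samples, i, q):
    """Empirical version of Q_i(F, q) = inf{x : F_i(x) >= q}, for 0 < q <= 1.

    `samples` is a list of equal-length tuples; only coordinate i is used,
    so the dependence between coordinates plays no role.
    """
    values = sorted(row[i] for row in samples)
    n = len(values)
    # The empirical d.f. at the k-th smallest value is at least k/n, so the
    # infimum is attained at the smallest k with k/n >= q.
    k = max(1, math.ceil(q * n))
    return values[k - 1]


samples = [(1, 10), (2, 30), (3, 20), (4, 40)]
print(marginal_quantile(samples, 0, 0.5), marginal_quantile(samples, 1, 0.75))
```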

Our analysis is based on the choice of a specific prior distribution for MDP with bounded rewards.

#### Prior Distribution

We consider two different prior distributions: one for computing lower bounds on rewards and transitions, that is, when computing the $\delta$-marginal quantile, and one for computing upper bounds on rewards and transitions, that is, when computing the $(1-\delta)$-marginal quantile.

For the lower bound, we use independent distributions for the rewards and the transitions. We also use an independent distribution for the rewards of each state-action pair $(s,a)$, and an independent distribution for the transition from any state-action pair $(s,a)$ to any subset of next states. The prior distribution for each of those components is a beta distribution with parameters $(0,1)$: $\mathrm{Beta}(0,1)$. (Technically, beta distributions are only defined for parameters strictly greater than 0. In this paper, when a parameter is 0, we compute the posterior and the quantiles by taking the limit as that parameter tends to 0.)

The situation is similar for the upper bound. However, here the prior distribution for each component is a beta distribution with parameters $(1,0)$: $\mathrm{Beta}(1,0)$, with the same convention for the zero parameter.

#### Posterior Distribution

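Let us start by assuming that the rewards come from a Bernoulli distribution. Using Bayes' rule, the posteriors at round $t_k$ take the following form.

For the rewards of any state-action pair $(s,a)$:

$$\mathrm{Beta}\Big(\alpha + \sum_{t \le t_k:\, (s_t,a_t)=(s,a)} r_t,\;\; \beta + N_{t_k}(s,a) - \sum_{t \le t_k:\, (s_t,a_t)=(s,a)} r_t\Big)$$

For the transitions from any $(s,a)$ to any subset of next states:

$$\mathrm{Beta}\Big(\alpha + \sum_{t \le t_k:\, (s_t,a_t)=(s,a)} p_t,\;\; \beta + N_{t_k}(s,a) - \sum_{t \le t_k:\, (s_t,a_t)=(s,a)} p_t\Big)$$

where $N_{t_k}(s,a)$ is the number of times $(s,a)$ has been played up to round $t_k$, and $p_t = 1$ if the next state $s_{t+1}$ belongs to the subset and $p_t = 0$ otherwise. We use $\alpha = 1, \beta = 0$ for the upper posteriors and $\alpha = 0, \beta = 1$ for the lower posteriors.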
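The posterior update only needs two counters per component. The sketch below assumes the prior parameters $(\alpha, \beta) = (1, 0)$ for the upper posterior and $(0, 1)$ for the lower one, which is what the Beta parameters appearing in Propositions 2 and 3 suggest; the function name `beta_posterior` is hypothetical.

```python
def beta_posterior(outcomes, upper):
    """Posterior Beta parameters after observing binary outcomes.

    `outcomes` holds the 0/1 values r_t (or p_t) observed for one state-action
    pair. The prior (1, 0) / (0, 1) is an assumption of this sketch.
    """
    alpha, beta = (1, 0) if upper else (0, 1)
    successes = sum(outcomes)
    return alpha + successes, beta + len(outcomes) - successes


print(beta_posterior([1, 0, 1, 1], upper=True))   # parameters of the upper posterior
print(beta_posterior([1, 0, 1, 1], upper=False))  # parameters of the lower posterior
```

A zero parameter is kept as-is here; it is handled as a limit when the quantile is computed.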

#### Dealing with non-Bernoulli rewards

We deal with non-Bernoulli rewards by performing a Bernoulli trial on the observed rewards. In other words, upon observing a reward $r_t \in [0,1]$, we use $\tilde{r}_t$, where $\tilde{r}_t$ is a sample from the Bernoulli distribution of parameter $r_t$. This technique is already used in Agrawal & Goyal (2012) and ensures that our prior remains valid.
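A minimal sketch of this binarization step, assuming rewards are bounded in $[0, 1]$; the helper name `binarize_reward` and the optional `rng` argument are illustrative.

```python
import random


def binarize_reward(reward, rng=random):
    """Replace a reward in [0, 1] by a Bernoulli(reward) sample.

    The expectation of the returned value equals `reward`, so the mean
    reward of each state-action pair is preserved.
    """
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    return 1 if rng.random() < reward else 0


rng = random.Random(0)
draws = [binarize_reward(0.3, rng) for _ in range(20_000)]
print(sum(draws) / len(draws))  # close to 0.3
```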

#### Quantiles

When $N_{t_k}(s,a) = 0$, the lower and upper quantiles are respectively $0$ and $1$. When the first parameter of the posterior is $0$, the lower quantile is $0$. When the second parameter of the posterior is $0$, the upper quantile is $1$. In all other cases, the quantile at a given level corresponds to the inverse cumulative distribution function of the posterior evaluated at that level. To achieve a high probability bound of $\tilde{O}(\sqrt{DSAT})$ on our regret, we used the following parameters respectively for the rewards and transitions , , where $\delta$ is the desired confidence level of the set of plausible MDPs.
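The quantile rules above can be sketched in pure Python. For integer parameters, the Beta c.d.f. can be written as a Binomial tail (this is Fact 1 in the appendix), and the quantile is then found by bisection. The function names, the bisection tolerance, and the use of a single level `delta` for both sides are assumptions of this sketch, not the paper's implementation or its parameter choices.

```python
import math


def beta_cdf(p, a, b):
    """C.d.f. of Beta(a, b) at p for positive integers a, b.

    Uses the identity P(Beta(a, b) <= p) = P(Binom(a + b - 1, p) >= a).
    """
    n = a + b - 1
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a, n + 1))


def beta_quantile(level, a, b, tol=1e-12):
    """Inverse c.d.f. of Beta(a, b), with the limit conventions for a zero parameter."""
    if a == 0:
        return 0.0  # limiting distribution is a point mass at 0
    if b == 0:
        return 1.0  # limiting distribution is a point mass at 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def quantile_interval(successes, visits, delta):
    """Lower and upper quantiles for one Bernoulli component."""
    lower = beta_quantile(delta, successes, visits - successes + 1)
    upper = beta_quantile(1 - delta, successes + 1, visits - successes)
    return lower, upper


print(quantile_interval(0, 0, 0.05))     # no data
print(quantile_interval(30, 100, 0.05))  # 30 successes out of 100 visits
```

With no data the interval is the whole of $[0, 1]$, and with zero successes out of $n$ visits the upper quantile has the closed form $1 - \delta^{1/n}$, which gives a quick sanity check of the bisection.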

###### Theorem 1 (Upper Bound on the Regret of BUCRL).

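With probability at least $1-\delta$ for any , any , the regret of BUCRL is bounded by:

$$R(T) \le 20 \cdot \sqrt{\min\{S, \log_2^2 2D\}\, DTSA \log T \ln\left(\frac{B}{\delta}\right)} + 9\, DSA \ln\left(\frac{B}{\delta}\right)$$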

for .

###### Proof.

Our proof is based on the generic proof provided in Tossou et al. (2019). To apply that generic proof, we need to show that, with high probability, the true rewards and transitions of any state-action pair lie between the lower and upper quantiles of the Bayesian posterior. In other words, we need to show that the Bayesian quantiles provide exact coverage probabilities. For that, we notice that our prior leads to the same confidence interval as the Clopper-Pearson interval (see Lemma 1). Furthermore, we need to provide upper and lower bounds for the maximum deviation of the Bayesian posterior quantiles from the empirical values. This is a direct consequence of Propositions 2 and 3. ∎

The following results were all useful in establishing our main result in Theorem 1. Our main contribution in Proposition 4 is the first of the two upper bounds on the KL-divergence of two Bernoulli random variables. The last term of the upper bound is a direct derivation from the upper bounds in (Dragomir et al., 2000). Our result in Proposition 4 shows a factor of $2$ improvement in the leading term of the upper bound. The KL-divergence of Bernoulli random variables is useful for many online learning problems, and we use it here to bound the quantiles of the Binomial distribution in terms of simple functions.

###### Proposition 4 (Bernoulli KL-Divergence).

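For any numbers $p$ and $x$ such that $0 < p < 1$ and $0 \le x \le q$, where $q = 1 - p$, we have:

$$\frac{x^2}{2\left(pq + x(q-p)/3\right)} \;\le\; D(p+x \,\|\, p) \;\le\; \frac{x^2}{2\left(pq - xp/2\right)} \;\le\; \frac{x^2}{pq}.$$

where $D(p+x \,\|\, p)$ is used to denote the KL-divergence between two Bernoulli random variables of parameters $p+x$ and $p$.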

###### Proof Sketch.

The main idea for proving the upper bound is to study the sign of the function of $x$ obtained by taking the difference between the KL-divergence and the upper bound. We use Sturm's theorem to show that this function is first decreasing and then, after some point, increasing on the remainder of its domain. Together with the observation that the function is non-positive at the end of its domain, this concludes the proof. Full details are available in the appendix. ∎
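The chain of inequalities in Proposition 4 is easy to spot-check numerically. The sketch below evaluates the three bounds next to the exact Bernoulli KL-divergence for a few pairs $(p, x)$ with $0 < x \le 1 - p$; this domain and the helper names are assumptions of the example, and a numerical check is of course no substitute for the proof.

```python
import math


def bernoulli_kl(a, b):
    """KL(Bernoulli(a) || Bernoulli(b)), with the convention 0 * ln(0) = 0."""

    def term(u, v):
        return 0.0 if u == 0 else u * math.log(u / v)

    return term(a, b) + term(1 - a, 1 - b)


def kl_bounds(p, x):
    """Lower bound, first upper bound and looser upper bound of Proposition 4."""
    q = 1 - p
    lower = x**2 / (2 * (p * q + x * (q - p) / 3))
    upper = x**2 / (2 * (p * q - x * p / 2))
    loose = x**2 / (p * q)
    return lower, upper, loose


for p, x in [(0.5, 0.25), (0.9, 0.05), (0.1, 0.5)]:
    lower, upper, loose = kl_bounds(p, x)
    print(p, x, lower, bernoulli_kl(p + x, p), upper, loose)
```

For small $x$ the first upper bound behaves like $x^2 / (2pq)$, which is half of the looser bound $x^2 / (pq)$.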

Proposition 1 provides tight lower and upper bounds for the quantile of the binomial distribution in the same simple form as Bernstein inequalities. Binomial distributions and their quantiles are useful in many applications, and we use them here to derive the bounds for the quantiles of a Beta distribution in Propositions 2 and 3.

###### Proposition 1 (Lower and Upper bound on the Binomial Quantile).

Let . For any such that , the quantile of obeys:

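$$\left\lfloor np + C_l\left(p, \Phi^{-1}(1-\delta)\right) \right\rfloor \;\le\; Q(\mathrm{Binom}(n,p), 1-\delta) \;\le\; \left\lceil np + C_u\left(p, \Phi^{-1}(1-\delta)\right) \right\rceil$$

where

$$C_u(x,y) = \min\left\{ n(1-x),\; \sqrt{y^2\left[nx(1-x) + \frac{(1-2x)^2 y^2}{36}\right]} + \frac{(1-2x)y^2}{6} \right\} \tag{1}$$

$$C_l(x,y) = \max\left\{0,\; \min\left\{ n(1-x) - 1,\; \sqrt{y^2\left[nx(1-x) + \frac{x^2 y^2}{16}\right]} - \frac{xy^2}{4} - 1 \right\}\right\} \tag{2}$$

with $\Phi^{-1}$ the quantile function of the standard normal distribution.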

###### Proof Sketch.

We use the tight bounds for the CDF of the Binomial distribution in Zubkov & Serov (2013). We invert those bounds and then use the upper and lower bounds for the KL-divergence in Proposition 4 to conclude. Full details are available in the appendix. ∎
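As a sanity check of Proposition 1, the sketch below computes the exact Binomial quantile by summing the probability mass function and compares it with the bounds built from $C_l$ and $C_u$ as written in equations (1) and (2). The function names are hypothetical, and the example assumes a small $\delta$ and $0 < p < 1$, since the exact conditions of the proposition are not restated here.

```python
import math
from statistics import NormalDist


def binom_quantile(n, p, level):
    """Smallest k with P(Binom(n, p) <= k) >= level."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= level:
            return k
    return n


def c_upper(n, x, y):
    root = math.sqrt(y**2 * (n * x * (1 - x) + (1 - 2 * x) ** 2 * y**2 / 36))
    return min(n * (1 - x), root + (1 - 2 * x) * y**2 / 6)


def c_lower(n, x, y):
    root = math.sqrt(y**2 * (n * x * (1 - x) + x**2 * y**2 / 16))
    return max(0.0, min(n * (1 - x) - 1, root - x * y**2 / 4 - 1))


def binom_quantile_bounds(n, p, delta):
    """Lower and upper bounds on the (1 - delta) quantile of Binom(n, p)."""
    y = NormalDist().inv_cdf(1 - delta)
    return (
        math.floor(n * p + c_lower(n, p, y)),
        math.ceil(n * p + c_upper(n, p, y)),
    )


n, p, delta = 100, 0.3, 0.05
print(binom_quantile_bounds(n, p, delta), binom_quantile(n, p, 1 - delta))
```

For this example the two bounds differ by only two integers, which illustrates how tight the Bernstein-like form is.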

Propositions 2 and 3 provide lower and upper bounds for the Beta quantiles in terms of simple functions similar to those of Bernstein inequalities. We use them to prove our main result in Theorem 1.

###### Proposition 2 (Upper bound on the Beta Quantile).

Let be for integers such that and . The th quantile of denoted by with satisfies:

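$$Q(\mathrm{Beta}(x+1, n-x), 1-\delta) \;\le\; \frac{x}{n} + \sqrt{\left(\frac{x}{n}\right)\left(1-\frac{x}{n}\right)\frac{y^2}{n}} + \frac{1}{n}\left(y^2\left(\frac{5}{6} + \sqrt{\frac{7}{12}}\right) + 2y + 2\right) \tag{3}$$

where $y = \Phi^{-1}(1-\delta)$ and $\Phi^{-1}$ is the quantile function of the standard normal distribution.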

###### Proof Sketch.

These bounds come directly from the relation between the Beta and Binomial CDFs. We apply Proposition 1, which gives a bound for the quantile $p$ in terms of $p(1-p)$. We then apply Proposition 1 again to bound $p(1-p)$ in terms of $x/n$. The full proof is available in the appendix. ∎

###### Proposition 3 (Lower bound on the Beta Quantile).

Let be for integers such that and . The th quantile of denoted by with satisfies:

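$$Q(\mathrm{Beta}(x, n-x+1), \delta) \;\ge\; \frac{x}{n} - \sqrt{\left(\frac{x}{n}\right)\left(1-\frac{x}{n}\right)\frac{y^2}{n}} - \frac{1}{n}\left(y^2\left(\frac{5}{6} + \sqrt{\frac{7}{12}}\right) + 2y + 2\right) \tag{4}$$

where $y = \Phi^{-1}(1-\delta)$ and $\Phi^{-1}$ is the quantile function of the standard normal distribution.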

###### Proof Sketch.

The proof comes almost exclusively by performing the same steps as in the proof of Proposition 2. ∎
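The bounds of Propositions 2 and 3 can be checked without inverting the Beta c.d.f.: since the c.d.f. is continuous and increasing, $Q(\cdot, 1-\delta) \le u$ holds exactly when the c.d.f. at $u$ is at least $1-\delta$, and $Q(\cdot, \delta) \ge l$ holds exactly when the c.d.f. at $l$ is at most $\delta$. The sketch below does this for integer counts with $1 \le x \le n-1$ (an assumption of the example, as are the function names), taking $y = \Phi^{-1}(1-\delta)$.

```python
import math
from statistics import NormalDist


def beta_cdf(p, a, b):
    """C.d.f. of Beta(a, b) for positive integers a, b, via a Binomial tail."""
    n = a + b - 1
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a, n + 1))


def deviation(x, n, delta):
    """Deviation term shared by the upper bound (3) and the lower bound (4)."""
    y = NormalDist().inv_cdf(1 - delta)
    mean = x / n
    leading = math.sqrt(mean * (1 - mean) * y**2 / n)
    return leading + (y**2 * (5 / 6 + math.sqrt(7 / 12)) + 2 * y + 2) / n


def check_beta_bounds(x, n, delta):
    """Return (upper bound holds, lower bound holds) for counts x out of n."""
    mean, dev = x / n, deviation(x, n, delta)
    upper, lower = mean + dev, mean - dev
    upper_ok = upper >= 1 or beta_cdf(upper, x + 1, n - x) >= 1 - delta
    lower_ok = lower <= 0 or beta_cdf(lower, x, n - x + 1) <= delta
    return upper_ok, lower_ok


print(check_beta_bounds(30, 100, 0.05))
```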

## 3 Experimental Analysis

We empirically evaluate the performance of BUCRL in comparison with that of UCRL-V (Tossou et al., 2019), KL-UCRL (Filippi et al., 2010) and UCRL2 (Jaksch et al., 2010). We also compare against TSDE (Ouyang et al., 2017), a variant of posterior sampling for reinforcement learning suited to infinite-horizon problems. We use the environments Bandits, Riverswim, GameOfSkill-v1 and GameOfSkill-v2 as described in Tossou et al. (2019). We also eliminate unintentional bias and variance in the exact way described in Tossou et al. (2019). Figure 1 illustrates the evolution of the average regret along with a confidence region (standard deviation). Figure 1 is a log-log plot where the ticks represent the actual values.

#### Experimental Setup.

The confidence hyper-parameter of UCRL-V, KL-UCRL, and UCRL2 is set to . TSDE is initialized with independent priors for each reward and a Dirichlet prior with parameters for the transition functions , where . We plot the average regret of each algorithm over rounds computed using independent trials.

#### Implementation Notes on BUCRL

We note here that the quantiles for any subset of next states can be computed efficiently, with a complexity linear in $S$ rather than the naive exponential complexity. This is because the posterior for any subset of next states only depends on the sum of the rewards of its constituents.

#### Results and Discussion.

We can see that BUCRL outperforms UCRL-V over all environments except in the Bandits one. This is in line with the theoretical regret whereby we can see that using the Bernstein bound is a factor times worse than the Bayesian quantile. Note that this is not an artifact of the proof. Indeed, pure optimism can be seen as using the proof inside the algorithm whereas the Bayesian version provides a general algorithm that has to be proven separately. Consequently, the actual performance of the Bayesian algorithm can often be much better than the bounds provided.

## 4 Conclusion

In conclusion, using Bayesian quantiles leads to an algorithm with strong performance while enjoying the best of both the frequentist and Bayesian views. It also provides a conceptually simple and very general algorithm for different scenarios. Although we were only able to prove its performance for bounded rewards in $[0,1]$ and a specific prior, we believe it should be possible to provide proofs for other reward distributions and priors, such as Gaussian ones. As future work, it would be interesting to explore how one can reuse the idea of BUCRL for non-tabular settings such as linear function approximation or deep learning.

## References

• Agrawal & Goyal (2012) Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012.
• Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
• Babu & Rao (1989) Babu, G. J. and Rao, C. R. Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population. In Multivariate Statistics and Probability, pp. 15–23. Elsevier, 1989.
• Bartlett & Tewari (2009) Bartlett, P. L. and Tewari, A. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pp. 35–42. AUAI Press, 2009.
• Chiani et al. (2003) Chiani, M., Dardari, D., and Simon, M. K. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2(4):840–845, 2003.
• Dragomir et al. (2000) Dragomir, S. S., Scholz, M., and Sunde, J. Some upper bounds for relative entropy and applications. Computers & Mathematics with Applications, 39(9-10):91–100, 2000.
• Filippi et al. (2010) Filippi, S., Cappé, O., and Garivier, A. Optimism in reinforcement learning and kullback-leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115–122. IEEE, 2010.
• Fruit et al. (2018) Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018.
• Gocgun et al. (2011) Gocgun, Y., Bresnahan, B. W., Ghate, A., and Gunn, M. L. A markov decision process approach to multi-category patient scheduling in a diagnostic facility. Artificial intelligence in medicine, 53(2):73–81, 2011.
• Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
• Janson (2016) Janson, S. Large deviation inequalities for sums of indicator variables. arXiv preprint arXiv:1609.00533, 2016.
• Kaas & Buhrman (1980) Kaas, R. and Buhrman, J. M. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980.
• Kaufmann et al. (2012) Kaufmann, E., Cappé, O., and Garivier, A. On bayesian upper confidence bounds for bandit problems. In AISTATS, 2012.
• Lu et al. (2009) Lu, T., Pál, D., and Pál, M. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009.
• Osband & Van Roy (2016) Osband, I. and Van Roy, B. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016.
• Osband & Van Roy (2017) Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2701–2710, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
• Osband et al. (2013) Osband, I., Russo, D., and Van Roy, B. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.
• Ouyang et al. (2017) Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017.
• Psaraftis et al. (2016) Psaraftis, H. N., Wen, M., and Kontovas, C. A. Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1):3–31, 2016.
• Puterman (2014) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
• Pébay et al. (2019) Pébay, P., Rojas, J., and C Thompson, D. Sturm’s theorem with endpoints. 07 2019.
• Qin et al. (2014) Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461–469. SIAM, 2014.
• Short (2013) Short, M. Improved inequalities for the poisson and binomial distribution and upper tail quantile functions. ISRN Probability and Statistics, 2013, 2013.
• Silver et al. (2017a) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.
• Silver et al. (2017b) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017b.
• Thulin et al. (2014) Thulin, M. et al. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817–840, 2014.
• Tossou et al. (2019) Tossou, A., Basu, D., and Dimitrakakis, C. Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities. arXiv e-prints, art. arXiv:1905.12425, May 2019.
• Yap (2000) Yap, C.-K. Fundamental problems of algorithmic algebra, volume 49. Oxford University Press Oxford, 2000.
• Zubkov & Serov (2013) Zubkov, A. M. and Serov, A. A. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013.

## Appendix A Proofs

### A.1 Proof of Theorem 1

Our proof is a direct application of the generic proof provided in Section B.2 of Tossou et al. (2019). To use that generic proof we need to show that with high probability the true rewards/transitions of any state-action is contained in the lower and upper interval of the Bayesian Posterior. This is a direct consequence of Lemma 1 and the fact that our posterior matches the Beta Distribution used in Lemma 1.

Furthermore, we need to provide lower and upper bounds for the maximum deviation of the Bayesian posteriors from their empirical values. This comes directly from using Proposition 2 and Proposition 3, and bounding using equation (15) in Chiani et al. (2003).

###### Lemma 1 (Coverage probability of Beta Quantile for Bernoulli random variable).

Let be independent Bernoulli random variable with common parameter such that and . Let denote the corresponding Binomial random variable. Let the (random) th quantile of the distribution and the th quantile of the distribution . If , we have:

$$\mathbb{P}\left[U_{\delta}(X) \le \mu \le U_{1-\delta}(X) \,\middle|\, \mu\right] \ge 1 - 2\delta.$$

###### Proof.

Since each is a Bernoulli random variable with parameter , then is a Binomial random variable with parameter . According to Thulin et al. (2014) equation (4) the quantile of the Beta distribution used in this lemma corresponds exactly to the upper one sided Clopper–Pearson interval (for Binomial distribution) whose coverage probability is at least by construction (Thulin et al., 2014). The same argument holds for the lower one sided Clopper–Pearson interval. Combining them concludes the proof. ∎
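The coverage claim of Lemma 1 can be verified exactly for a given $n$ and $\mu$ by enumerating all outcomes of the Binomial variable. The sketch below avoids computing Beta quantiles altogether: using the Beta/Binomial relation (Fact 1), $\mu \le U_{1-\delta}(x)$ is equivalent to $P(\mathrm{Binom}(n,\mu) \le x) \ge \delta$, and $U_{\delta}(x) \le \mu$ is equivalent to $P(\mathrm{Binom}(n,\mu) \ge x) \ge \delta$. It assumes that $U_{1-\delta}$ and $U_{\delta}$ are the quantiles of $\mathrm{Beta}(x+1, n-x)$ and $\mathrm{Beta}(x, n-x+1)$, as in Propositions 2 and 3; the function name `coverage` is hypothetical.

```python
import math


def coverage(n, mu, delta):
    """Exact probability that the two-sided quantile interval contains mu."""
    pmf = [math.comb(n, k) * mu**k * (1 - mu) ** (n - k) for k in range(n + 1)]
    total = 0.0
    for x in range(n + 1):
        left_tail = sum(pmf[: x + 1])  # P(Binom(n, mu) <= x)
        right_tail = sum(pmf[x:])      # P(Binom(n, mu) >= x)
        # mu lies between the lower and upper quantiles computed from x
        # exactly when both tails are at least delta.
        if left_tail >= delta and right_tail >= delta:
            total += pmf[x]
    return total


print(coverage(20, 0.3, 0.05))
```

The outcomes excluded on each side have total probability below $\delta$, which is why the coverage is at least $1 - 2\delta$.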

###### Proposition 1 (Lower and Upper bound on the Binomial Quantile).

Let . For any such that , the quantile of obeys:

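$$\left\lfloor np + C_l\left(p, \Phi^{-1}(1-\delta)\right) \right\rfloor \;\le\; Q(\mathrm{Binom}(n,p), 1-\delta) \;\le\; \left\lceil np + C_u\left(p, \Phi^{-1}(1-\delta)\right) \right\rceil$$

where

$$C_u(x,y) = \min\left\{ n(1-x),\; \sqrt{y^2\left[nx(1-x) + \frac{(1-2x)^2 y^2}{36}\right]} + \frac{(1-2x)y^2}{6} \right\} \tag{5}$$

$$C_l(x,y) = \max\left\{0,\; \min\left\{ n(1-x) - 1,\; \sqrt{y^2\left[nx(1-x) + \frac{x^2 y^2}{16}\right]} - \frac{xy^2}{4} - 1 \right\}\right\} \tag{6}$$

with $\Phi^{-1}$ the quantile function of the standard normal distribution.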

###### Proof.

Using basic computation, we can verify that the bounds hold trivially for , for and . Furthermore, it is known that any median of the binomial satisfies (Kaas & Buhrman, 1980). So, our bounds also holds for . As a result, we can focus the proof on the case where , and .

From equation (1) in Zubkov & Serov (2013) we have:

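$$\Phi\left(\operatorname{sgn}\left(\tfrac{k}{n} - p\right)\sqrt{2nD\left(\tfrac{k}{n} \,\|\, p\right)}\right) \;\le\; \mathbb{P}\{X_{n,p} \le k\} \;\le\; \Phi\left(\operatorname{sgn}\left(\tfrac{k+1}{n} - p\right)\sqrt{2nD\left(\tfrac{k+1}{n} \,\|\, p\right)}\right) \tag{7}$$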

for . Let’s also observe that when , the lower bound in (7) trivially holds since

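$$\mathbb{P}\{X_{n,p} \le k\} = 1 \ge \Phi\left(\operatorname{sgn}\left(\tfrac{k}{n} - p\right)\sqrt{2nD\left(\tfrac{k}{n} \,\|\, p\right)}\right).$$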

#### Proof of the upper bound

Our upper bound provides a correction to Theorem 5 in Short (2013).

Consider any () such that:

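$$\Phi\left(\operatorname{sgn}\left(\tfrac{k}{n} - p\right)\sqrt{2nD\left(\tfrac{k}{n} \,\|\, p\right)}\right) \ge 1 - \delta. \tag{8}$$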

Combining (8) with the left side of (7) we have that and as a result:

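$$Q(\mathrm{Binom}(n,p), 1-\delta) = \inf\{x : \mathbb{P}\{X_{n,p} \le x\} \ge 1-\delta\} \le k \tag{9}$$

So we just need to find a value $k$ satisfying (8). Remarking that $\Phi$ is the CDF of the standard normal distribution (since it is the inverse of the normal quantile function $\Phi^{-1}$), we can conclude that $\Phi^{-1}$ is continuous and increasing. Applying $\Phi^{-1}$ to (8), we have:

$$\operatorname{sgn}\left(\tfrac{k}{n} - p\right)\sqrt{2nD\left(\tfrac{k}{n} \,\|\, p\right)} \ge \Phi^{-1}(1-\delta). \tag{10}$$

##### The sign of $\frac{k}{n} - p$: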

Assume that . In that case, we can see that our upper bound trivially holds since . Then we can focus on the case where . Since the binomial distribution is discrete with domain the set of integers, implies that . As a result we have and .

Let $x$ be a number such that $\frac{k}{n} = p + x$. Using this in (10), we thus need to find an $x$ such that:

$$D(p+x \,\|\, p) \ge \frac{\Phi^{-1}(1-\delta)^2}{2n}$$

Consider a function such that . If we find an such that , then it would mean that . We will pick to be the lower bound on in Theorem 4. Now let’s observe that since , it means that . Also so that .

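So the conditions of Proposition 4 are satisfied and our goal becomes finding an $x$ such that:

$$\frac{x^2}{2\left(pq + x(q-p)/3\right)} \ge \frac{\Phi^{-1}(1-\delta)^2}{2n}$$

Solving this quadratic inequality in $x$ gives, with $y = \Phi^{-1}(1-\delta)$, the sufficient condition $x \ge \frac{(q-p)y^2}{6n} + \sqrt{\frac{pqy^2}{n} + \frac{(q-p)^2y^4}{36n^2}}$. Multiplying by $n$ and using $q - p = 1 - 2p$ recovers the second argument of the minimum in $C_u(p,y)$, which leads to the upper bound part of the proposition.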

#### Proof of the Lower bound

If , it is easy to verify that our lower bound trivially holds. So we can focus on the case where .

Consider any () such that:

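$$\Phi\left(\operatorname{sgn}\left(\tfrac{k+1}{n} - p\right)\sqrt{2nD\left(\tfrac{k+1}{n} \,\|\, p\right)}\right) \le 1 - \delta \tag{11}$$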

Combining (11) with the right side of (7) we have that and since the CDF of a Binomial is an increasing function, we have:

$$Q(\mathrm{Binom}(n,p), 1-\delta) \ge k \tag{12}$$

##### The sign of $\frac{k+1}{n} - p$:

Let’s note that the quantile function of the binomial distribution is increasing (since it is the inverse of the cdf and the cdf is increasing). So, we have: . As a result, there exists a number satisfying both (12) and:

$$Q\left(\mathrm{Binom}(n,p), \tfrac{1}{2}\right) \le k.$$

We will try to find this number. Let’s observe that is the (smallest) median of the binomial distribution and thus we have: (Kaas & Buhrman, 1980).

So,

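$$\begin{aligned} k &\ge Q(\mathrm{Binom}(n,p), 1-\delta) && (13)\\ &\ge Q\left(\mathrm{Binom}(n,p), \tfrac{1}{2}\right) && (14)\\ &\ge \lfloor np \rfloor && (15) \end{aligned}$$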

As a result, we have and .

Then our objective is to find a satisfying (11). Let a number such that .

Applying the inverse to (11) and replacing by , our objective becomes finding an such that:

$$D(p+x \,\|\, p) \le \frac{\Phi^{-1}(1-\delta)^2}{2n}$$

Our objective is equivalent to finding an such that for a function such that .

We can easily verify that and (). And as a result, we pick as the first upper bound on in Theorem 4.

Our objective is thus to find () such that:

$$\frac{x^2}{2\left(pq - xp/2\right)} \le \frac{\Phi^{-1}(1-\delta)^2}{2n}$$

Solving for this equation and picking a value for such that , leads to the first lower bound part of the Theorem.

###### Fact 1 (See Kaufmann et al. (2012)).

Let where and some integers such that a random variable from the Beta distribution. Then, for any :

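$$\begin{aligned} \mathbb{P}(Y_{a,b} \le p) &= \mathbb{P}(X_{a+b-1,\,1-p} \le b-1) && (16)\\ \mathbb{P}(Y_{a,b} \ge p) &= \mathbb{P}(X_{a+b-1,\,p} \le a-1) && (17) \end{aligned}$$

where $X_{n,p}$ is used to denote a random variable distributed according to the binomial distribution of parameters $(n, p)$.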

###### Proposition 2 (Upper bound on the Beta Quantile).

Let be for integers such that and . The th quantile of denoted by with satisfies:

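$$Q(\mathrm{Beta}(x+1, n-x), 1-\delta) \;\le\; \frac{x}{n} + \sqrt{\left(\frac{x}{n}\right)\left(1-\frac{x}{n}\right)\frac{y^2}{n}} + \frac{1}{n}\left(y^2\left(\frac{5}{6} + \sqrt{\frac{7}{12}}\right) + 2y + 2\right) \tag{18}$$

where $y = \Phi^{-1}(1-\delta)$ and $\Phi^{-1}$ is the quantile function of the standard normal distribution.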

###### Proof.

For simplicity, in this proof we used and . Using Equation (16), we have:

. Since the CDF of the beta distribution is continuous, we know that . So we have

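Using the upper bound for the Binomial quantile in Proposition 1, we have:

where $C_u$ is the function defined in (5).

$$p \le \frac{x}{n} + \frac{C_u(1-p, y) + 2}{n} = \frac{x}{n} + \sqrt{\frac{p(1-p)y^2}{n} + \frac{(2p-1)^2 y^4}{36 n^2}} + \frac{(2p-1)y^2}{6n} + \frac{2}{n} \tag{19}$$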

We would like to find an upper bound for in (19) that depends on .

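Using Equation (16) with the lower bound for the Binomial quantile in Proposition 1, we have

$$p \ge \frac{x}{n} + \frac{\max\left\{0, \min\left\{np - 1,\, C_l\left(1-p, \Phi^{-1}(1-\delta)\right)\right\}\right\}}{n} \ge \frac{x}{n} \tag{20}$$

Multiplying equations (20) and (19) together (more precisely, multiplying (19) by the inequality $1 - p \le 1 - \frac{x}{n}$ implied by (20), both sides being positive) leads to:

$$p(1-p) \le \left(1 - \frac{x}{n}\right)\left(\frac{x}{n} + \sqrt{\frac{p(1-p)y^2}{n} + \frac{(2p-1)^2 y^4}{36 n^2}} + \frac{(2p-1)y^2}{6n} + \frac{2}{n}\right) \tag{21}$$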

Using the fact that , and using for the terms not involving , we have:

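$$p(1-p) \le \left(\frac{x}{n}\right)\left(1 - \frac{x}{n}\right) + \sqrt{\frac{p(1-p)y^2}{n}} + \frac{1}{3} \cdot \frac{y^2 + 6}{n} \tag{22}$$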

Letting in (22) leads to an inequality involving a polynomial of degree in . Solving for this inequality and then using :

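$$\sqrt{p(1-p)} \le \sqrt{\left(\frac{x}{n}\right)\left(1 - \frac{x}{n}\right)} + \frac{1}{\sqrt{n}}\left(\sqrt{\frac{7y^2 + 24}{12}} + \sqrt{\frac{y^2}{4}}\right) \tag{23}$$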

Replacing (23) into (19) and using the fact that , we have the desired upper bound of the lemma ∎

###### Proposition 3 (Lower bound on the Beta Quantile).

Let be for integers such that and . The th quantile of denoted by with satisfies:

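$$Q(\mathrm{Beta}(x, n-x+1), \delta) \;\ge\; \frac{x}{n} - \sqrt{\left(\frac{x}{n}\right)\left(1-\frac{x}{n}\right)\frac{y^2}{n}} - \frac{1}{n}\left(y^2\left(\frac{5}{6} + \sqrt{\frac{7}{12}}\right) + 2y + 2\right) \tag{24}$$

where $y = \Phi^{-1}(1-\delta)$ and $\Phi^{-1}$ is the quantile function of the standard normal distribution.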

###### Proof.

Let’s denote . Using (17), we have that: .

Since the Beta distribution is continuous and also have a continuous cdf, then there exists a unique such that:

.

As a result, we have .

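Using the upper and lower bounds for the Binomial quantile in Proposition 1, we have respectively:

$$p \ge \frac{x}{n} - \frac{C_u(p, y) + 2}{n} = \frac{x}{n} - \sqrt{\frac{p(1-p)y^2}{n} + \frac{(1-2p)^2 y^4}{36 n^2}} - \frac{(1-2p)y^2}{6n} - \frac{2}{n} \tag{25}$$

$$p \le \frac{x}{n} - \frac{C_l\left(p, \Phi^{-1}(1-\delta)\right)}{n} \le \frac{x}{n} \tag{26}$$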

We would like to find a lower bound for in (25) that depends on .

(26) implies that:

$$1 - p \ge 1 - \frac{x}{n} \tag{27}$$

Note that we can multiply (27) by (25) to get a lower bound for even if the left hand side of (25) is negative since both and are always positive.

After this multiplication, we follow exactly the same steps as in the corresponding part of the proof of Proposition 2. We can do that because, even though we are looking for a lower bound, all the terms previously upper bounded in Proposition 2 are multiplied by $-1$.

This completes the proof of this proposition. ∎

### A.2 Useful Results

###### Proposition 4 (Lower and Upper Bound on Bernoulli KL-Divergence).

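For any numbers $p$ and $x$ such that $0 < p < 1$ and $0 \le x \le q$, where $q = 1 - p$, we have:

$$\frac{x^2}{2\left(pq + x(q-p)/3\right)} \;\le\; D(p+x \,\|\, p) \;\le\; \frac{x^2}{2\left(pq - xp/2\right)} \;\le\; \frac{x^2}{pq}.$$

where $D(p+x \,\|\, p)$ is used to denote the KL-divergence between two Bernoulli random variables of parameters $p+x$ and $p$.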

###### Proof.

The proof of the lower bound already appears in Janson (2016) (after equation (2.1)).

First, let’s observe that:

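$$D(p+x \,\|\, p) = p\left(1 + \frac{x}{p}\right)\ln\left(1 + \frac{x}{p}\right) + h(x)$$

with

$$h(x) = \begin{cases} q\left(1 - \frac{x}{q}\right)\ln\left(1 - \frac{x}{q}\right) & \text{if } x < q \\ 0 & \text{if } x = q \end{cases}$$

Note that this is a valid definition for the KL-divergence since for any ,

$$\lim_{x \to q^-} \left(1 - \frac{x}{q}\right)\ln\left(1 - \frac{x}{q}\right) = 0$$

Let $g$ be a parametric function defined by:

$$g(x) = p\left(1 + \frac{x}{p}\right)\ln\left(1 + \frac{x}{p}\right) + h(x) - \frac{x^2}{2\left(pq + x(q-p-a)/b\right)}$$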

Where and are constants (independent of but possibly depending on ) such that for all .

We can immediately see that is continuous and differentiable in its domain since it is the sum of continuous and differentiable functions. For any , the derivative of is:

$$g'(x) = \ln\left(1 + \frac{x}{p}\right) - \ln\left(1 - \frac{x}{q}\right) - \frac{4pqx + 2x^2(q-p-a)/b}{\left(2\left(pq + x(q-p-a)/b\right)\right)^2}$$

And .

We can see that is a continuous and differentiable in .

The second derivative for any is

 g′′(x) =1x+p+1x+q−p2q2(pq+x(q−p−a)/b)3 (28) =x3(q−p−a)3b