# Boltzmann Exploration Done Right

Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions in the classic setup of stochastic multi-armed bandits. One of our main results is showing that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\Delta$). More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\frac{K\log^2 T}{\Delta}$ and a distribution-independent bound of order $\sqrt{KT}\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.

## 1 Introduction

Exponential weighting strategies are fundamental tools in a variety of areas, including Machine Learning, Optimization, Theoretical Computer Science, and Decision Theory (Arora et al., 2012). Within Reinforcement Learning (Sutton and Barto, 1998; Szepesvári, 2010), exponential weighting schemes are broadly used for balancing exploration and exploitation, and are equivalently referred to as Boltzmann, Gibbs, or softmax exploration policies (Sutton, 1990; Kaelbling et al., 1996; Sutton et al., 1999; Perkins and Precup, 2003). In the most common version of Boltzmann exploration, the probability of choosing an arm is proportional to an exponential function of the empirical mean of the reward of that arm. Despite the popularity of this policy, very little is known about its theoretical performance, even in the simplest reinforcement learning setting of stochastic bandit problems.

The variant of Boltzmann exploration we focus on in this paper is defined by

$$p_{t,i} \propto e^{\eta_t \hat{\mu}_{t,i}}, \tag{1}$$

where $p_{t,i}$ is the probability of choosing arm $i$ in round $t$, $\hat{\mu}_{t,i}$ is the empirical average of the rewards obtained from arm $i$ up until round $t$, and $\eta_t > 0$ is the learning rate. This variant is broadly used in reinforcement learning (Sutton and Barto, 1998; Szepesvári, 2010; Kaelbling et al., 1996; Vermorel and Mohri, 2005; Kuleshov and Precup, 2014; Osband et al., 2016). In the multi-armed bandit literature, exponential-weights algorithms are also widespread, but they typically use importance-weighted estimators for the rewards; see, e.g., (Auer et al., 1995, 2002b) (for the nonstochastic setting), (Cesa-Bianchi and Fischer, 1998) (for the stochastic setting), and (Seldin and Slivkins, 2014) (for both stochastic and nonstochastic regimes). The theoretical behavior of these algorithms is generally well understood. For example, in the stochastic bandit setting Seldin and Slivkins (2014) show a regret bound of order $\frac{K\log^2 T}{\Delta}$, where $\Delta$ is the suboptimality gap (i.e., the smallest difference between the mean reward of the optimal arm and the mean reward of any other arm).
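As a concrete illustration of the rule in Equation (1), the following minimal sketch computes the exponential-weights distribution from a vector of empirical means. The function name `boltzmann_probs` and the example numbers are illustrative assumptions, not part of the paper; subtracting the maximum before exponentiating is a standard numerical-stability device that leaves the distribution unchanged.

```python
import numpy as np

def boltzmann_probs(mu_hat, eta):
    """Return p_i proportional to exp(eta * mu_hat_i), computed stably."""
    z = eta * np.asarray(mu_hat, dtype=float)
    z -= z.max()            # a common shift does not change the ratios
    w = np.exp(z)
    return w / w.sum()

# Hypothetical empirical means for three arms and a hypothetical learning rate.
probs = boltzmann_probs([0.5, 0.6, 0.4], eta=5.0)
```

A larger `eta` concentrates the distribution on the arm with the highest empirical mean, while `eta = 0` gives uniform sampling.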

In this paper, we aim to achieve a better theoretical understanding of the basic variant of the Boltzmann exploration policy that relies on the empirical mean rewards. We first show that any monotone learning-rate schedule will inevitably force the policy to either spend too much time drawing suboptimal arms or completely fail to identify the optimal arm. Then, we show that a specific non-monotone schedule of the learning rates can lead to a regret bound of order $\frac{K\log T}{\Delta^2}$. However, the learning schedule has to rely on full knowledge of the gap $\Delta$ and the number of rounds $T$. Moreover, our negative result helps us to identify a crucial shortcoming of the Boltzmann exploration policy: it does not reason about the uncertainty of the empirical reward estimates. To alleviate this issue, we propose a variant that takes this uncertainty into account by using separate learning rates for each arm, where the learning rates account for the uncertainty of each reward estimate. We show that the resulting algorithm guarantees a distribution-dependent regret bound of order $\frac{K\log^2 T}{\Delta}$, and a distribution-independent bound of order $\sqrt{KT}\log K$.

Our algorithm and analysis are based on the so-called Gumbel–softmax trick that connects the exponential-weights distribution with the maximum of independent random variables from the Gumbel distribution.
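The connection can be checked empirically. The sketch below, with hypothetical empirical means and learning rate, draws independent Gumbel perturbations, takes the argmax of the perturbed scores, and compares the resulting frequencies with the exponential-weights distribution. It uses NumPy's default Gumbel distribution; a common location shift applied to every arm would not change the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_hat = np.array([0.5, 0.6, 0.4])   # hypothetical empirical means
eta = 5.0                            # hypothetical learning rate

# Target: the exponential-weights (softmax) distribution.
w = np.exp(eta * (mu_hat - mu_hat.max()))
softmax = w / w.sum()

# Gumbel-max sampling: argmax of eta * mu_hat plus independent Gumbel noise.
n = 200_000
z = rng.gumbel(size=(n, mu_hat.size))
choices = np.argmax(eta * mu_hat + z, axis=1)
freq = np.bincount(choices, minlength=mu_hat.size) / n
max_abs_err = np.abs(freq - softmax).max()
```

With this many samples, the empirical frequencies should agree with the softmax probabilities up to sampling noise.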

## 2 The stochastic multi-armed bandit problem

Consider the setting of stochastic multi-armed bandits: each arm $i \in [K] = \{1, 2, \dots, K\}$ yields a reward with distribution $\nu_i$, mean $\mu_i$, with the optimal mean reward being $\mu^* = \max_i \mu_i$. Without loss of generality, we will assume that the optimal arm is unique and has index 1. The gap of arm $i$ is defined as $\Delta_i = \mu^* - \mu_i$. We consider a repeated game between the learner and the environment, where in each round $t = 1, 2, \dots, T$, the following steps are repeated:

1. The learner chooses an arm $I_t \in [K]$,

2. the environment draws a reward $X_{t,I_t} \sim \nu_{I_t}$ independently of the past,

3. the learner receives and observes the reward $X_{t,I_t}$.

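The performance of the learner is measured in terms of the pseudo-regret defined as

$$R_T = \mu^* T - \sum_{t=1}^{T} \mathbb{E}\left[X_{t,I_t}\right] = \mu^* T - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{I_t}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \Delta_{I_t}\right] = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}\left[N_{T,i}\right], \tag{2}$$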

where we defined $N_{t,i} = \sum_{s=1}^{t} \mathbb{I}\{I_s = i\}$, that is, the number of times that arm $i$ has been chosen until the end of round $t$. We aim at constructing algorithms that guarantee that the regret grows sublinearly.
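The interaction protocol and the last expression in Equation (2) can be sketched as follows. This is a minimal simulation under stated assumptions: Bernoulli rewards, a hypothetical `policy(t, counts, sums, rng)` callback interface, and a uniformly random baseline policy. None of these names come from the paper.

```python
import numpy as np

def run_bandit(means, policy, horizon, rng):
    """Play `horizon` rounds of a Bernoulli bandit; return the pull counts N_{T,i}."""
    k = len(means)
    counts = np.zeros(k, dtype=int)      # N_{t,i}
    sums = np.zeros(k)                   # running reward totals per arm
    for t in range(1, horizon + 1):
        arm = policy(t, counts, sums, rng)            # step 1: learner picks I_t
        reward = float(rng.random() < means[arm])     # step 2: environment draws the reward
        counts[arm] += 1                              # step 3: learner observes the reward
        sums[arm] += reward
    return counts

def pseudo_regret(means, counts):
    """sum_i Delta_i * N_{T,i}: the quantity whose expectation appears in Eq. (2)."""
    gaps = np.max(means) - np.asarray(means, dtype=float)
    return float(gaps @ counts)

def uniform_policy(t, counts, sums, rng):
    """Baseline that ignores all feedback and samples an arm uniformly."""
    return int(rng.integers(len(counts)))

means = [0.6, 0.5, 0.5]
counts = run_bandit(means, uniform_policy, horizon=3000, rng=np.random.default_rng(1))
regret = pseudo_regret(means, counts)
```

The uniform baseline pulls suboptimal arms a constant fraction of the time, so its regret grows linearly in the horizon; any useful policy plugged into the same loop should do better.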

We will consider the above problem under various assumptions on the distribution of the rewards. For most of our results, we will assume that each reward distribution $\nu_i$ is $\sigma$-subgaussian with a known parameter $\sigma > 0$, that is, that

$$\mathbb{E}\left[e^{y\left(X_{1,i} - \mathbb{E}[X_{1,i}]\right)}\right] \le e^{\sigma^2 y^2/2}$$

holds for all $y \in \mathbb{R}$ and $i \in \{1, \dots, K\}$. It is easy to see that any random variable bounded in an interval of length $B$ satisfies this condition with $\sigma^2 = B^2/4$. Under this assumption, it is well known that any algorithm will suffer a regret of at least $\Omega\left(\sum_{i>1} \frac{\sigma^2 \log T}{\Delta_i}\right)$, as shown in the classic paper of Lai and Robbins (1985). There exist several algorithms guaranteeing matching upper bounds, even for finite horizons (Auer et al., 2002a; Cappé et al., 2013; Kaufmann et al., 2012). We refer to the survey of Bubeck and Cesa-Bianchi (2012) for an exhaustive treatment of the topic.
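The claim about bounded random variables is Hoeffding's lemma. As a quick numerical sanity check, assuming the convention that the moment generating function of the centered reward is bounded by $e^{\sigma^2 y^2/2}$, the sketch below evaluates the exact moment generating function of a centered Bernoulli variable (an interval of length $B = 1$) against the bound with $\sigma^2 = 1/4$. The grid of `y` values and the choices of `p` are arbitrary.

```python
import numpy as np

def centered_bernoulli_mgf(p, y):
    """E[exp(y (X - p))] for X ~ Bernoulli(p)."""
    return p * np.exp(y * (1 - p)) + (1 - p) * np.exp(-y * p)

ys = np.linspace(-10, 10, 401)
sigma2 = 0.25  # variance proxy B^2 / 4 for an interval of length B = 1

# Largest ratio between the exact MGF and the subgaussian bound on the grid.
worst_ratio = max(
    np.max(centered_bernoulli_mgf(p, ys) / np.exp(sigma2 * ys**2 / 2))
    for p in (0.1, 0.3, 0.5, 0.9)
)
```

A ratio that never exceeds one (up to rounding) is consistent with the subgaussian bound; equality is attained at `y = 0`.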

## 3 Boltzmann exploration done wrong

We now formally describe the heuristic form of Boltzmann exploration that is commonly used in the reinforcement learning literature (Sutton and Barto, 1998; Szepesvári, 2010; Kaelbling et al., 1996). This strategy works by maintaining the empirical estimates of each $\mu_i$ defined as

$$\hat{\mu}_{t,i} = \frac{\sum_{s=1}^{t} X_{s,i}\, \mathbb{I}\{I_s = i\}}{N_{t,i}} \tag{3}$$

and computing the exponential-weights distribution (1) for an appropriately tuned sequence of learning rate parameters $\eta_t$ (which are often referred to as the inverse temperature). As noted on several occasions in the literature, finding the right schedule for $\eta_t$ can be very difficult in practice (Kaelbling et al., 1996; Vermorel and Mohri, 2005). Below, we quantify this difficulty by showing that natural learning-rate schedules may fail to achieve near-optimal regret guarantees. More precisely, they may draw suboptimal arms too much even after having estimated all the means correctly, or commit too early to a suboptimal arm and never recover afterwards. We partially circumvent this issue by proposing an admittedly artificial learning-rate schedule that actually guarantees near-optimal performance. However, a serious limitation of this schedule is that it relies on prior knowledge of the problem parameters $\Delta$ and $T$, which are typically unknown at the beginning of the learning procedure. These observations lead us to the conclusion that the Boltzmann exploration policy as described by Equations (1) and (3) is no more effective for regret minimization than the simplest alternative of $\varepsilon$-greedy exploration (Sutton and Barto, 1998; Auer et al., 2002a).

Before we present our own technical results, we mention that Singh et al. (2000) propose a learning-rate schedule for Boltzmann exploration that simultaneously guarantees that all arms will be drawn infinitely often as $T$ goes to infinity, and that the policy becomes greedy in the limit. This property is proven by choosing a learning-rate schedule adaptively to ensure that in each round $t$, each arm gets drawn with probability at least $\frac{1}{t}$, making it similar in spirit to $\varepsilon$-greedy exploration. While this strategy clearly leads to sublinear regret, it is easy to construct examples on which it suffers a regret of at least $\Omega\left(T^{1-\alpha}\right)$ for any small $\alpha > 0$. In this paper, we pursue a more ambitious goal: we aim to find out whether Boltzmann exploration can actually guarantee polylogarithmic regret. In the rest of this section, we present both negative and positive results concerning the standard variant of Boltzmann exploration, and then move on to providing an efficient generalization that achieves consistency in a more universal sense.

### 3.1 Boltzmann exploration with monotone learning rates is suboptimal

In this section, we study the most natural variant of Boltzmann exploration that uses a monotone learning-rate schedule. It is easy to see that in order to achieve sublinear regret, the learning rate $\eta_t$ needs to increase with $t$ so that the suboptimal arms are drawn with less and less probability as time progresses. For the sake of clarity, we study the simplest possible setting with two arms with a gap of $\Delta$ between their means. We first show that, asymptotically, the learning rate has to increase at least at a rate $\frac{\log(t\Delta^2)}{\Delta}$ even when the mean rewards are perfectly known. In other words, this is the minimal affordable learning rate.

###### Proposition 1.

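Let us assume that $\hat{\mu}_{t,i} = \mu_i$ for all $t$ and both $i \in \{1, 2\}$. If $\eta_t = o\left(\frac{\log(t\Delta^2)}{\Delta}\right)$, then the regret grows at least as fast as $R_T = \omega\left(\frac{\log T}{\Delta}\right)$.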

###### Proof.

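Let us define $\eta^*_t = \frac{\log(t\Delta^2)}{\Delta}$ for all $t$. The probability of pulling the suboptimal arm can be asymptotically bounded as

$$\mathbb{P}[I_t = 2] = \frac{1}{1 + e^{\eta_t \Delta}} \ge \frac{e^{-\eta_t \Delta}}{2} = \omega\left(\frac{e^{-\eta^*_t \Delta}}{2}\right) = \omega\left(\frac{1}{\Delta^2 t}\right).$$

Summing up for all $t$, we get that the regret is at least

$$R_T = \Delta \sum_{t=1}^{T} \mathbb{P}[I_t = 2] = \omega\left(\sum_{t=1}^{T} \frac{1}{\Delta t}\right) = \omega\left(\frac{\log T}{\Delta}\right),$$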

thus proving the statement. ∎
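The effect described by Proposition 1 can be illustrated numerically. The sketch below assumes two arms with exactly known means, so that the suboptimal arm is drawn with probability $1/(1+e^{\eta_t\Delta})$, and compares the threshold schedule $\log(t\Delta^2)/\Delta$ from the proof with a schedule growing at half that rate. The specific values of `delta` and `horizon`, and the choice of "half the rate", are illustrative assumptions.

```python
import numpy as np

def regret_known_means(eta_fn, delta, horizon):
    """Delta * sum_t P[I_t = 2] for two arms whose means are known exactly."""
    t = np.arange(1, horizon + 1)
    p_subopt = 1.0 / (1.0 + np.exp(eta_fn(t) * delta))
    return delta * p_subopt.sum()

delta, horizon = 0.1, 1_000_000
threshold = lambda t: np.log(t * delta**2) / delta        # the rate from the proof
slower = lambda t: 0.5 * np.log(t * delta**2) / delta     # half the threshold rate

r_threshold = regret_known_means(threshold, delta, horizon)
r_slower = regret_known_means(slower, delta, horizon)
log_scale = np.log(horizon) / delta
```

At the threshold rate the regret is of the same order as `log_scale`, whereas the slower schedule pays far more, even though no estimation error is involved.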

This simple proposition thus implies an asymptotic lower bound on the schedule of learning rates $\eta_t$. In contrast, Theorem 1 below shows that all learning rate sequences that grow faster than $2\log t$ yield a linear regret, provided this schedule is adopted from the beginning of the game. This should be contrasted with Theorem 2, which exhibits a schedule achieving logarithmic regret where $\eta_t$ grows faster than $2\log t$ only after the first $\tau$ rounds.

###### Theorem 1.

There exists a 2-armed stochastic bandit problem with rewards bounded in $[0,1]$ where Boltzmann exploration using any learning rate sequence $\eta_t$ such that $\eta_t > 2\log t$ for all $t \ge 1$ has regret $R_T = \Omega(T)$.

###### Proof.

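Consider the case where arm 2 gives a reward deterministically equal to $\frac{1}{2}$ whereas the optimal arm 1 has a Bernoulli distribution of parameter $p = \frac{1}{2} + \Delta$ for some $0 < \Delta < \frac{1}{2}$. Note that the regret of any algorithm satisfies $R_T \ge \Delta (T - t_0)\, \mathbb{P}[\forall t > t_0:\ I_t = 2]$. Without loss of generality, assume that $\hat{\mu}_{1,1} = 0$ and $\hat{\mu}_{1,2} = \frac{1}{2}$. Then for all $t$, independent of the algorithm, $\hat{\mu}_{t,2} = \frac{1}{2}$ and

$$p_{t,1} = \frac{e^{\eta_t \mathrm{Bin}(N_{t-1,1},\,p)}}{e^{\eta_t/2} + e^{\eta_t \mathrm{Bin}(N_{t-1,1},\,p)}} \quad\text{and}\quad p_{t,2} = \frac{e^{\eta_t/2}}{e^{\eta_t/2} + e^{\eta_t \mathrm{Bin}(N_{t-1,1},\,p)}}.$$

For $t_0 \ge 1$, let $E_{t_0}$ be the event that $\mathrm{Bin}(N_{t_0,1}, p) = 0$, that is, up to time $t_0$, arm 1 gives only zero reward whenever it is sampled. Then

$$\mathbb{P}[\forall t > t_0:\ I_t = 2] \ge \mathbb{P}[E_{t_0}] \left(1 - \mathbb{P}[\exists t > t_0:\ I_t = 1 \mid E_{t_0}]\right) \ge \left(\tfrac{1}{2} - \Delta\right)^{t_0} \left(1 - \mathbb{P}[\exists t > t_0:\ I_t = 1 \mid E_{t_0}]\right).$$

For $t > t_0$, consider the event that arm 1 is sampled at time $t$ but not at any of the times $t_0 + 1, t_0 + 2, \dots, t - 1$. Summing the probabilities of these disjoint events gives, for any $t_0 \ge 1$,

$$\mathbb{P}[\exists t > t_0:\ I_t = 1 \mid E_{t_0}] = \sum_{t > t_0} \frac{1}{1 + e^{\eta_t/2}} \prod_{s = t_0 + 1}^{t-1} \left(1 - \frac{1}{1 + e^{\eta_s/2}}\right) \le \sum_{t > t_0} e^{-\eta_t/2}.$$

Therefore

$$R_T \ge \Delta (T - t_0) \left(\tfrac{1}{2} - \Delta\right)^{t_0} \left(1 - \sum_{t > t_0} e^{-\eta_t/2}\right).$$

Assume $\eta_t \ge c\log t$ for some $c > 2$ and for all $t \ge t_0$. Then, writing $a = \frac{c}{2} - 1$,

$$\sum_{t > t_0} e^{-\eta_t/2} \le \sum_{t > t_0} t^{-c/2} \le \int_{t_0}^{\infty} x^{-c/2}\, dx = \frac{t_0^{-a}}{a} \le \frac{1}{2}$$

whenever $t_0 \ge (2/a)^{1/a}$. This implies $R_T = \Omega(T)$. ∎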
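The tail-sum step can be checked numerically. The sketch below assumes the schedule $\eta_t = c\log t$ with the hypothetical values $c = 4$, $\Delta = 0.1$ and $t_0 = 2$, and evaluates the lower bound $(1/2 - \Delta)^{t_0}\bigl(1 - \sum_{t>t_0} t^{-c/2}\bigr)$ on the probability of never returning to the optimal arm. The infinite sum is truncated at a finite horizon, which slightly underestimates the tail.

```python
import numpy as np

def lock_in_lower_bound(c, delta, t0, horizon=10**6):
    """(1/2 - delta)^t0 * (1 - sum_{t>t0} t^(-c/2)), with the sum truncated at `horizon`."""
    t = np.arange(t0 + 1, horizon + 1, dtype=float)
    tail = np.sum(t ** (-c / 2))
    return (0.5 - delta) ** t0 * (1.0 - tail), tail

c, delta, t0 = 4.0, 0.1, 2      # hypothetical values with c > 2
a = c / 2 - 1
bound, tail = lock_in_lower_bound(c, delta, t0)
integral = t0 ** (-a) / a       # closed form of the integral of x^(-c/2) over [t0, inf)
```

The lock-in probability is bounded below by a constant that does not depend on $T$, which is what makes the regret linear.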

### 3.2 A learning-rate schedule with near-optimal guarantees

The above negative result relies heavily on the assumption that $\eta_t > 2\log t$ holds from the beginning. If we instead start off from a constant learning rate which we keep for a logarithmic number of rounds, then a logarithmic regret bound can be shown. Arguably, this results in a rather simplistic exploration scheme, which can be essentially seen as an explore-then-commit strategy (e.g., Garivier et al. (2016)). Despite its simplicity, this strategy can be shown to achieve near-optimal performance if the parameters are tuned as a function of the suboptimality gap $\Delta$ (although its regret scales at the suboptimal rate of $1/\Delta^2$ with this parameter). The following theorem (proved in Appendix A.1) states this performance guarantee.

###### Theorem 2.

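Assume the rewards of each arm are in $[0,1]$ and let $\tau = \frac{16eK\log T}{\Delta^2}$. Then the regret of Boltzmann exploration with learning rate $\eta_t = \mathbb{I}\{t < \tau\} + \frac{\log(t\Delta^2)}{\Delta}\, \mathbb{I}\{t \ge \tau\}$ satisfies

$$R_T \le \frac{16eK\log T}{\Delta^2} + \frac{9K}{\Delta^2}.$$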

## 4 Boltzmann exploration done right

We now turn to give a variant of Boltzmann exploration that achieves near-optimal guarantees without prior knowledge of either $\Delta$ or $T$. Our approach is based on the observation that the distribution $p_{t,i} \propto \exp(\eta_t \hat{\mu}_{t,i})$ can be equivalently specified by the rule $I_t = \arg\max_j \{\eta_t \hat{\mu}_{t,j} + Z_{t,j}\}$ (or, equivalently, $\arg\max_j \{\hat{\mu}_{t,j} + Z_{t,j}/\eta_t\}$), where $Z_{t,j}$ is a standard Gumbel random variable drawn independently for each arm $j$ (see, e.g., Abernethy et al. (2014) and the references therein). (The cumulative distribution function of a standard Gumbel random variable is $F(x) = \exp\left(-e^{-(x+\gamma)}\right)$, where $\gamma$ is the Euler–Mascheroni constant.) As we saw in the previous section, this scheme fails to guarantee consistency in general, as it does not capture the uncertainty of the reward estimates. We now propose a variant that takes this uncertainty into account by choosing different scaling factors for each perturbation. In particular, we will use the simple choice $\beta_{t,i} = \sqrt{C^2/N_{t,i}}$ with some constant $C > 0$ that will be specified later. Our algorithm operates by independently drawing perturbations $Z_{t,i}$ from a standard Gumbel distribution for each arm $i$, then choosing action

$$I_{t+1} = \arg\max_i \left\{\hat{\mu}_{t,i} + \beta_{t,i} Z_{t,i}\right\}. \tag{4}$$

We refer to this algorithm as Boltzmann–Gumbel exploration, or, in short, BGE. Unfortunately, the probabilities $p_{t,i}$ no longer have a simple closed form; nevertheless, the algorithm is very straightforward to implement. Our main positive result is the following performance guarantee for the algorithm. (We use the notation $\log_+(\cdot) = \max\{0, \log(\cdot)\}$.)
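A minimal sketch of one way to implement the selection rule in Equation (4) follows. It assumes the per-arm scale $\beta_{t,i} = \sqrt{C^2/N_{t,i}}$, a mean-zero Gumbel perturbation (NumPy's default Gumbel shifted by the Euler–Mascheroni constant), and an initial round-robin pass so that every count is positive. The function name `bge_choose`, the initialization, and the Bernoulli test problem are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def bge_choose(counts, sums, C, rng):
    """One Boltzmann-Gumbel step: argmax_i of mu_hat_i + beta_i * Z_i."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):                 # assumption: pull each arm once first
        return int(np.argmin(counts))
    mu_hat = np.asarray(sums) / counts
    beta = np.sqrt(C**2 / counts)           # per-arm perturbation scale
    # NumPy's default Gumbel has mean equal to Euler's gamma; subtract it to center.
    z = rng.gumbel(size=counts.size) - np.euler_gamma
    return int(np.argmax(mu_hat + beta * z))

rng = np.random.default_rng(0)
means = np.array([0.7, 0.5, 0.5])           # hypothetical Bernoulli means
counts, sums = np.zeros(3), np.zeros(3)
for _ in range(5000):
    arm = bge_choose(counts, sums, C=0.5, rng=rng)
    counts[arm] += 1
    sums[arm] += float(rng.random() < means[arm])
```

Because the scale shrinks as an arm is pulled more often, well-explored arms receive small perturbations while rarely pulled arms keep a realistic chance of being revisited.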

###### Theorem 3.

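Assume that the rewards of each arm are $\sigma$-subgaussian and let $c > 0$ be arbitrary. Then, the regret of Boltzmann–Gumbel exploration satisfies

$$R_T \le \sum_{i=2}^{K} \frac{9C^2 \log_+^2\left(T\Delta_i^2/c^2\right)}{\Delta_i} + \sum_{i=2}^{K} \frac{c^2 e^{\gamma} + 18C^2 e^{\sigma^2/2C^2}\left(1 + e^{-\gamma}\right)}{\Delta_i} + \sum_{i=2}^{K} \Delta_i.$$

In particular, choosing $C = \sigma$ and $c = \sigma$ guarantees a regret bound of

$$R_T = O\left(\sum_{i=2}^{K} \frac{\sigma^2 \log^2\left(T\Delta_i^2/\sigma^2\right)}{\Delta_i}\right).$$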

Notice that, unlike any other algorithm that we are aware of, Boltzmann–Gumbel exploration still continues to guarantee meaningful regret bounds even if the subgaussianity constant $\sigma$ is underestimated, although such misspecification is penalized exponentially in the true $\sigma^2$. A downside of our bound is that it shows a suboptimal dependence on the number of rounds $T$: it grows asymptotically as $\sum_{i>1} \log^2(T\Delta_i^2)/\Delta_i$, in contrast to the standard regret bounds for the UCB algorithm of Auer et al. (2002a) that grow as $\sum_{i>1} (\log T)/\Delta_i$. However, our guarantee improves on the distribution-independent regret bounds of UCB that are of order $\sqrt{KT\log T}$. This is shown in the following corollary.

###### Corollary 1.

Assume that the rewards of each arm are $\sigma$-subgaussian. Then, the regret of Boltzmann–Gumbel exploration with $C = \sigma$ satisfies $R_T = O\left(\sigma\sqrt{KT}\log K\right)$.

Notably, this bound shows optimal dependence on the number of rounds $T$, but is suboptimal in terms of the number of arms. To complement this upper bound, we also show that these bounds are tight in the sense that the $\log K$ factor cannot be removed.

###### Theorem 4.

For any $K$ and $T$ such that $\sqrt{K/T}\log K \le 1$, there exists a bandit problem with rewards bounded in $[0,1]$ where the regret of Boltzmann–Gumbel exploration with $C = 1$ is at least $R_T \ge \frac{1}{2}\sqrt{KT}\log K$.

The proofs can be found in Appendices A.5 and A.6. Note that more sophisticated policies are known to have better distribution-free bounds. The algorithm MOSS (Audibert and Bubeck, 2009) achieves minimax-optimal $\sqrt{KT}$ distribution-free bounds, but distribution-dependent bounds of the form $(K/\Delta)\log(T\Delta^2)$, where $\Delta$ is the suboptimality gap. A variant of UCB using action elimination, due to Auer and Ortner (2010), has regret $\sum_{i>1} \log(T\Delta_i^2)/\Delta_i$, corresponding to a $\sqrt{KT\log K}$ distribution-free bound. The same bounds are achieved by the Gaussian Thompson sampling algorithm of Agrawal and Goyal (2013), given that the rewards are subgaussian.

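We finally provide a simple variant of our algorithm that can handle heavy-tailed rewards, by which we mean reward distributions that are not subgaussian. We propose to use a technique due to Catoni (2012) based on the influence function

$$\psi(x) = \begin{cases} \log\left(1 + x + x^2/2\right), & \text{for } x \ge 0, \\ -\log\left(1 - x + x^2/2\right), & \text{for } x \le 0. \end{cases}$$

Using this function, we define our estimates as

$$\hat{\mu}_{t,i} = \beta_{t,i} \sum_{s=1}^{t} \mathbb{I}\{I_s = i\}\, \psi\left(\frac{X_{s,i}}{\beta_{t,i} N_{t,i}}\right).$$

We prove the following result regarding Boltzmann–Gumbel exploration run with the above estimates.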

###### Theorem 5.

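Assume that the second moments of the rewards of each arm are bounded uniformly as $\mathbb{E}\left[X_i^2\right] \le V$ and let $c > 0$ be arbitrary. Then, the regret of Boltzmann–Gumbel exploration satisfies

$$R_T \le \sum_{i=2}^{K} \frac{9C^2 \log_+^2\left(T\Delta_i^2/c^2\right)}{\Delta_i} + \sum_{i=2}^{K} \frac{c^2 e^{\gamma} + 18C^2 e^{V/2C^2}\left(1 + e^{-\gamma}\right)}{\Delta_i} + \sum_{i=2}^{K} \Delta_i.$$

Notably, this bound coincides with that of Theorem 3, except that $\sigma^2$ is replaced by $V$. Thus, by following the proof of Corollary 1, we can show a distribution-independent regret bound of order $\sqrt{KT}\log K$.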
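The influence function is easy to implement directly. The sketch below also includes a Catoni-style mean estimate of the form $\beta \sum_s \psi\bigl(X_s/(\beta n)\bigr)$ with $\beta = C/\sqrt{n}$; this particular form is an assumption made for illustration, as are the function names, the Student-t test data, and the value of `C`.

```python
import numpy as np

def psi(x):
    """Catoni's influence function: log(1 + x + x^2/2) for x >= 0, odd extension for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + x**2 / 2)

def catoni_style_mean(rewards, C):
    """Assumed estimator form: beta * sum_s psi(X_s / (beta * n)) with beta = C / sqrt(n)."""
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    beta = C / np.sqrt(n)
    return beta * np.sum(psi(rewards / (beta * n)))

rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=2000) + 1.0   # heavy-tailed, finite variance, true mean 1
estimate = catoni_style_mean(sample, C=2.0)
```

Since $\psi$ grows only logarithmically, a single extreme reward moves the estimate far less than it would move the empirical mean.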

## 5 Analysis

Let us now present the proofs of our main results concerning Boltzmann–Gumbel exploration, Theorems 3 and 5. Our analysis builds on several ideas from Agrawal and Goyal (2013). We first provide generic tools that are independent of the reward estimator and then move on to providing specifics for both estimators.

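We start by introducing some notation. We define $\tilde{\mu}_{t,i} = \hat{\mu}_{t,i} + \beta_{t,i} Z_{t,i}$, so that the algorithm can be simply written as $I_t = \arg\max_i \tilde{\mu}_{t,i}$. Let $\mathcal{F}_{t-1}$ be the sigma-algebra generated by the actions taken by the learner and the realized rewards up to the beginning of round $t$. Let us fix thresholds $x_i, y_i$ satisfying $\mu_i \le x_i \le y_i \le \mu_1$ and define $q_{t,i} = \mathbb{P}\left[\tilde{\mu}_{t,1} > y_i \mid \mathcal{F}_{t-1}\right]$. Furthermore, we define the events $E^{\hat{\mu}}_{t,i} = \{\hat{\mu}_{t,i} \le x_i\}$ and $E^{\tilde{\mu}}_{t,i} = \{\tilde{\mu}_{t,i} \le y_i\}$. With this notation at hand, we can decompose the number of draws of any suboptimal arm $i$ as follows:

$$\mathbb{E}[N_{T,i}] = \sum_{t=1}^{T} \mathbb{P}\left[I_t = i, E^{\tilde{\mu}}_{t,i}, E^{\hat{\mu}}_{t,i}\right] + \sum_{t=1}^{T} \mathbb{P}\left[I_t = i, \overline{E^{\tilde{\mu}}_{t,i}}, E^{\hat{\mu}}_{t,i}\right] + \sum_{t=1}^{T} \mathbb{P}\left[I_t = i, \overline{E^{\hat{\mu}}_{t,i}}\right]. \tag{5}$$

It remains to choose the thresholds $x_i$ and $y_i$ in a meaningful way: we pick $x_i = \mu_i + \frac{\Delta_i}{3}$ and $y_i = \mu_1 - \frac{\Delta_i}{3}$. The rest of the proof is devoted to bounding each term in Eq. (5). Intuitively, the individual terms capture the following events: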

• The first term counts the number of times that, even though the estimated mean reward of arm $i$ is well-concentrated and the additional perturbation $Z_{t,i}$ is not too large, arm $i$ was drawn instead of the optimal arm 1. This happens when the optimal arm is poorly estimated or when the perturbation $Z_{t,1}$ is not large enough. Intuitively, this term measures the interaction between the perturbations $Z_{t,1}$ and the random fluctuations of the reward estimate $\hat{\mu}_{t,1}$ around its true mean, and will be small if the perturbations tend to be large enough and the tail of the reward estimates is light enough.

• The second term counts the number of times that the mean reward of arm $i$ is well-estimated, but it ends up being drawn due to a large perturbation. This term can be bounded independently of the properties of the mean estimator and is small when the tail of the perturbation distribution is not too heavy.

• The last term counts the number of times that the reward estimate of arm $i$ is poorly concentrated. This term is independent of the perturbations and only depends on the properties of the reward estimator.

As we will see, the first and the last terms can be bounded in terms of the moment generating function of the reward estimates, which makes subgaussian reward estimators particularly easy to treat. We begin with the most standard part of our analysis: bounding the third term on the right-hand side of (5) in terms of the moment-generating function.

###### Lemma 1.

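Let us fix any $i$ and define $\tau_k$ as the $k$'th time that arm $i$ was drawn. We have

$$\sum_{t=1}^{T} \mathbb{P}\left[I_t = i, \overline{E^{\hat{\mu}}_{t,i}}\right] \le 1 + \sum_{k=1}^{T-1} \mathbb{E}\left[\exp\left(\frac{\hat{\mu}_{\tau_k,i} - \mu_i}{\beta_{\tau_k,i}}\right)\right] \cdot e^{-\frac{\Delta_i\sqrt{k}}{3C}}.$$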

Interestingly, our next key result shows that the first term can be bounded by a nearly identical expression:

###### Lemma 2.

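Let us define $\tau_k$ as the $k$'th time that arm 1 was drawn. For any $i > 1$, we have

$$\sum_{t=1}^{T} \mathbb{P}\left[I_t = i, E^{\tilde{\mu}}_{t,i}, E^{\hat{\mu}}_{t,i}\right] \le \sum_{k=0}^{T-1} \mathbb{E}\left[\exp\left(\frac{\mu_1 - \hat{\mu}_{\tau_k,1}}{\beta_{\tau_k,1}}\right)\right] e^{-\gamma - \frac{\Delta_i\sqrt{k}}{3C}}.$$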

It remains to bound the second term in Equation (5), which we do in the following lemma:

###### Lemma 3.

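For any $i \ne 1$ and any constant $c > 0$, we have

$$\sum_{t=1}^{T} \mathbb{P}\left[I_t = i, \overline{E^{\tilde{\mu}}_{t,i}}, E^{\hat{\mu}}_{t,i}\right] \le \frac{9C^2 \log_+^2\left(T\Delta_i^2/c^2\right) + c^2 e^{\gamma}}{\Delta_i^2}.$$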

The proofs of these three lemmas are included in the supplementary material.

### 5.1 The proof of Theorem 3

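For this section, we assume that the rewards are $\sigma$-subgaussian and that $\hat{\mu}_{t,i}$ is the empirical-mean estimator. Building on the results of the previous section, observe that we are left with bounding the terms appearing in Lemmas 1 and 2. To this end, let us fix a $k$ and an $i$ and notice that by the subgaussianity assumption on the rewards, the empirical mean $\hat{\mu}_{\tau_k,i}$ is $\frac{\sigma}{\sqrt{k}}$-subgaussian (as $N_{\tau_k,i} = k$). In other words,

$$\mathbb{E}\left[e^{\alpha\left(\hat{\mu}_{\tau_k,i} - \mu_i\right)}\right] \le e^{\alpha^2\sigma^2/2k}$$

holds for any $\alpha$. In particular, using the above formula for $\alpha = 1/\beta_{\tau_k,i} = \sqrt{k/C^2}$, we obtain

$$\mathbb{E}\left[\exp\left(\frac{\hat{\mu}_{\tau_k,i} - \mu_i}{\beta_{\tau_k,i}}\right)\right] \le e^{\sigma^2/2C^2}.$$

Thus, the sum appearing in Lemma 1 can be bounded as

$$\sum_{k=1}^{T-1} \mathbb{E}\left[\exp\left(\frac{\hat{\mu}_{\tau_k,i} - \mu_i}{\beta_{\tau_k,i}}\right)\right] \cdot e^{-\frac{\Delta_i\sqrt{k}}{3C}} \le e^{\sigma^2/2C^2} \sum_{k=1}^{T-1} e^{-\frac{\Delta_i\sqrt{k}}{3C}} \le \frac{18C^2 e^{\sigma^2/2C^2}}{\Delta_i^2},$$

where the last step follows from the fact that $\sum_{k=1}^{\infty} e^{-c\sqrt{k}} \le \frac{2}{c^2}$ holds for all $c > 0$ (this can be easily seen by bounding the sum with an integral). The statement of Theorem 3 now follows from applying the same argument to the bound of Lemma 2, using Lemma 3, and the standard expression for the regret in Equation (2). ∎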
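The last step above bounds $\sum_k e^{-\Delta_i\sqrt{k}/(3C)}$ by $18C^2/\Delta_i^2$, which is the inequality $\sum_{k\ge 1} e^{-c\sqrt{k}} \le 2/c^2$ with $c = \Delta_i/(3C)$; the right-hand side equals $\int_0^\infty e^{-c\sqrt{x}}\,dx$. A quick numerical check over a few arbitrary values of $c$, using a long partial sum as a stand-in for the infinite series:

```python
import numpy as np

def exp_sqrt_sum(c, n_terms=2_000_000):
    """Partial sum of exp(-c * sqrt(k)) over k = 1, ..., n_terms."""
    k = np.arange(1, n_terms + 1, dtype=float)
    return np.exp(-c * np.sqrt(k)).sum()

# Map each c to (partial sum, integral bound 2 / c^2).
results = {c: (exp_sqrt_sum(c), 2.0 / c**2) for c in (0.05, 0.2, 1.0, 3.0)}
```

For small $c$ the sum is close to the integral bound, while for large $c$ the bound becomes loose because the first few terms already decay quickly.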

### 5.2 The proof of Theorem 5

We now drop the subgaussian assumption on the rewards and consider reward distributions that are possibly heavy-tailed, but have bounded variance. The proof of Theorem 5 follows directly from the arguments in the previous subsection and using Proposition 2.1 of Catoni (2012) (with $\theta = 0$) that guarantees the bound

$$\mathbb{E}\left[\exp\left(\pm\frac{\mu_i - \hat{\mu}_{t,i}}{\beta_{t,i}}\right) \,\middle|\, N_{t,i} = n\right] \le \exp\left(\frac{\mathbb{E}\left[X_i^2\right]}{2C^2}\right). \tag{6}$$

## 6 Experiments

This section concludes by illustrating our theoretical results through some experiments, highlighting the limitations of Boltzmann exploration and contrasting it with the performance of Boltzmann–Gumbel exploration. We consider a stochastic multi-armed bandit problem with arms each yielding Bernoulli rewards with mean for all suboptimal arms and for the optimal arm. We set the horizon to and the gap parameter to . We compare three variants of Boltzmann exploration with inverse learning rate parameters

• (BE-const),

• (BE-log), and

• (BE-sqrt)

for all , and compare it with Boltzmann–Gumbel exploration (BGE), and UCB with exploration bonus .

We study two different scenarios: (a) all rewards drawn i.i.d. from the Bernoulli distributions with the means given above and (b) the first rewards set to for arm . The latter scenario simulates the situation described in the proof of Theorem 1, and in particular exposes the weakness of Boltzmann exploration with increasing learning rate parameters. The results shown on Figure 1 (a) and (b) show that while some variants of Boltzmann exploration may perform reasonably well when initial rewards take typical values and the parameters are chosen luckily, all standard versions fail to identify the optimal arm when the initial draws are not representative of the true mean (which happens with a small constant probability). On the other hand, UCB and Boltzmann–Gumbel exploration continue to perform well even under this unlikely event, as predicted by their respective theoretical guarantees. Notably, Boltzmann–Gumbel exploration performs comparably to UCB in this example (even slightly outperforming its competitor here), and performs notably well for the recommended parameter setting of (noting that Bernoulli random variables are -subgaussian).

#### Acknowledgements

Gábor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P and FEDER, EU. Gergely Neu was supported by the UPFellows Fellowship (Marie Curie COFUND program n° 600387).

## References

• Abernethy et al. [2014] J. Abernethy, C. Lee, A. Sinha, and A. Tewari. Online linear optimization via smoothing. In M.-F. Balcan and Cs. Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of JMLR Proceedings, pages 807–823. JMLR.org, 2014.
• Agrawal and Goyal [2013] S. Agrawal and N. Goyal. Further optimal regret bounds for thompson sampling. In AISTATS, pages 99–107, 2013.
• Arora et al. [2012] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8:121–164, 2012.
• Audibert and Bubeck [2009] J.-Y. Audibert and S. Bubeck. Minimax policies for bandits games. In S. Dasgupta and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory. Omnipress, June 18–21 2009.
• Auer and Ortner [2010] P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61:55–65, 2010. ISSN 0031-5303.
• Auer et al. [1995] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE, 1995.
• Auer et al. [2002a] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, May 2002a. ISSN 0885-6125.
• Auer et al. [2002b] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002b. ISSN 0097-5397.
• Bubeck and Cesa-Bianchi [2012] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Now Publishers Inc, 2012.
• Cappé et al. [2013] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, G. Stoltz, et al. Kullback–leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
• Catoni [2012] O. Catoni. Challenging the empirical mean and empirical variance: A deviation study. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 11 2012.
• Cesa-Bianchi and Fischer [1998] N. Cesa-Bianchi and P. Fischer. Finite-time regret bounds for the multiarmed bandit problem. In ICML, pages 100–108, 1998.
• Garivier et al. [2016] A. Garivier, E. Kaufmann, and T. Lattimore. On explore-then-commit strategies. In NIPS, 2016.
• Kaelbling et al. [1996] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
• Kaufmann et al. [2012] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT’12, pages 199–213, 2012.
• Kuleshov and Precup [2014] V. Kuleshov and D. Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
• Lai and Robbins [1985] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
• Osband et al. [2016] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. 2016.
• Perkins and Precup [2003] T. Perkins and D. Precup. A convergent form of approximate policy iteration. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1595–1602, Cambridge, MA, USA, 2003. MIT Press.
• Seldin and Slivkins [2014] Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML 2014), pages 1287–1295, 2014.
• Singh et al. [2000] S. P. Singh, T. Jaakkola, M. L. Littman, and Cs. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.
• Sutton [1990] R. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224. San Mateo, CA, 1990.
• Sutton and Barto [1998] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
• Sutton et al. [1999] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press.
• Szepesvári [2010] Cs. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
• Vermorel and Mohri [2005] J. Vermorel and M. Mohri. Multi-armed bandit algorithms and empirical evaluation. In European conference on machine learning, pages 437–448. Springer, 2005.

## Appendix A Technical proofs

### A.1 The proof of Theorem 2

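For any round $t$ and action $i$,

$$\frac{e^{-\eta_t}}{K} \le \frac{e^{\eta_t \hat{\mu}_{t-1,i}}}{\sum_{j=1}^{K} e^{\eta_t \hat{\mu}_{t-1,j}}} \le e^{\eta_t\left(\hat{\mu}_{t-1,i} - \hat{\mu}_{t-1,1}\right)}. \tag{7}$$

Now, for any $i > 1$, we can write

$$\begin{aligned} \mathbb{I}\{I_t = i\} &= \mathbb{I}\left\{I_t = i,\ \hat{\mu}_{t-1,i} - \hat{\mu}_{t-1,1} < -\tfrac{\Delta_i}{2}\right\} + \mathbb{I}\left\{I_t = i,\ \hat{\mu}_{t-1,i} - \hat{\mu}_{t-1,1} \ge -\tfrac{\Delta_i}{2}\right\} \\ &\le \mathbb{I}\left\{I_t = i,\ \hat{\mu}_{t-1,i} - \hat{\mu}_{t-1,1} < -\tfrac{\Delta_i}{2}\right\} + \mathbb{I}\left\{\hat{\mu}_{t-1,1} \le \mu_1 - \tfrac{\Delta_i}{4}\right\} + \mathbb{I}\left\{\hat{\mu}_{t-1,i} \ge \mu_i + \tfrac{\Delta_i}{4}\right\}. \end{aligned}$$

We take the expectation of the three terms above and sum over $t = \tau + 1, \dots, T$. Because of (7), the first term is simply bounded as

$$\sum_{t=\tau+1}^{T} \mathbb{P}\left[I_t = i,\ \hat{\mu}_{t-1,i} - \hat{\mu}_{t-1,1} < -\tfrac{\Delta_i}{2}\right] \le \sum_{t=\tau+1}^{T} e^{-\eta_t \Delta_i/2} \le \sum_{t=\tau+1}^{T} \frac{1}{t\Delta^2} \le \frac{\log(T+1)}{\Delta^2}.$$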

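We control the second and third terms in the same way. For the second term we have that $\mathbb{P}\left[\hat{\mu}_{t-1,1} \le \mu_1 - \frac{\Delta_i}{4}\right] \le \mathbb{P}\left[N_{t-1,1} \le t_1\right] + \mathbb{P}\left[\hat{\mu}_{t-1,1} \le \mu_1 - \frac{\Delta_i}{4},\ N_{t-1,1} > t_1\right]$ holds for any fixed $t$ and for any $t_1 \le t - 1$. Hence

$$\sum_{t=\tau+1}^{T} \mathbb{P}\left[\hat{\mu}_{t-1,1} \le \mu_1 - \tfrac{\Delta_i}{4}\right] \le \sum_{t=\tau+1}^{T} \mathbb{P}\left[N_{t-1,1} \le t_1\right] + \sum_{t=\tau+1}^{T} \mathbb{P}\left[\hat{\mu}_{t-1,1} \le \mu_1 - \tfrac{\Delta_i}{4},\ N_{t-1,1} > t_1\right].$$

Now observe that, because of (7) applied to the initial $\tau$ rounds, $\mathbb{E}[N_{t-1,1}] \ge \frac{\tau}{eK}$ holds for all $t > \tau$. By setting $t_1 = \frac{\tau}{2eK}$, Chernoff bounds (in multiplicative form) give $\mathbb{P}[N_{t-1,1} \le t_1] \le e^{-\tau/(8eK)}$. Standard Chernoff bounds, instead, give

$$\mathbb{P}\left[\hat{\mu}_{t-1,1} \le \mu_1 - \tfrac{\Delta_i}{4},\ N_{t-1,1} > t_1\right] \le \sum_{s=t_1+1}^{t-1} e^{-s\Delta^2/8} \le \frac{8}{\Delta^2}\, e^{-t_1\Delta^2/8} \le \frac{8}{\Delta^2}\, e^{-\frac{\tau\Delta^2}{16eK}}.$$

Therefore, for the second term we can write

$$\sum_{t=\tau+1}^{T} \mathbb{P}\left[\hat{\mu}_{t-1,1} \le \mu_1 - \tfrac{\Delta_i}{4}\right] \le T\left(e^{-\frac{\tau}{8eK}} + \frac{8}{\Delta^2}\, e^{-\frac{\tau\Delta^2}{16eK}}\right) \le 1 + \frac{8}{\Delta^2}.$$

The third term can be bounded in exactly the same way. Putting everything together, we have thus obtained, for all actions $i > 1$,

$$\sum_{i>1} \mathbb{E}[N_{T,i}] \le \tau + K + \frac{8K}{\Delta^2} \le \frac{16eK\log T}{\Delta^2} + \frac{9K}{\Delta^2}.$$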

This concludes the proof. ∎

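### A.2 The proof of Lemma 1

Let $\tau_k$ denote the index of the round when arm $i$ is drawn for the $k$'th time. We let $\tau_0 = 0$ and $\tau_k = T$ for $k > N_{T,i}$. Then,

$$\begin{aligned} \sum_{t=1}^{T} \mathbb{P}\left[I_t = i, \overline{E^{\hat{\mu}}_{t,i}}\right] &\le \mathbb{E}\left[\sum_{k=0}^{T-1} \sum_{t=\tau_k+1}^{\tau_{k+1}} \mathbb{I}\{I_t = i\}\, \mathbb{I}\left\{\overline{E^{\hat{\mu}}_{t,i}}\right\}\right] \\ &= \mathbb{E}\left[\sum_{k=0}^{T-1} \mathbb{I}\left\{\overline{E^{\hat{\mu}}_{\tau_k,i}}\right\} \sum_{t=\tau_k+1}^{\tau_{k+1}} \mathbb{I}\{I_t = i\}\right] \\ &= \mathbb{E}\left[\sum_{k=0}^{T-1} \mathbb{I}\left\{\overline{E^{\hat{\mu}}_{\tau_k,i}}\right\}\right] \\ &\le 1 + \sum_{k=1}^{T-1} \mathbb{P}\left[\hat{\mu}_{\tau_k,i} \ge x_i\right] \\ &\le 1 + \sum_{k=1}^{T-1} \mathbb{P}\left[\hat{\mu}_{\tau_k,i} - \mu_i \ge \tfrac{\Delta_i}{3}\right]. \end{aligned}$$

Now, using the fact that $N_{\tau_k,i} = k$, we bound the last term by exploiting the subgaussianity of the rewards through Markov's inequality:

$$\begin{aligned} \mathbb{P}\left[\hat{\mu}_{\tau_k,i} - \mu_i \ge \tfrac{\Delta_i}{3}\right] &= \mathbb{P}\left[e^{\alpha\left(\hat{\mu}_{\tau_k,i} - \mu_i\right)} \ge e^{\alpha\Delta_i/3}\right] && \text{(for any } \alpha > 0\text{)} \\ &\le \mathbb{E}\left[e^{\alpha\left(\hat{\mu}_{\tau_k,i} - \mu_i\right)}\right] \cdot e^{-\alpha\Delta_i/3} && \text{(Markov's inequality)} \\ &\le e^{\alpha^2\sigma^2/2k} \cdot e^{-\alpha\Delta_i/3} && \text{(the subgaussian property)} \\ &\le e^{\sigma^2/2C^2} \cdot e^{-\frac{\Delta_i\sqrt{k}}{3C}} && \text{(choosing } \alpha = \sqrt{k/C^2}\text{)} \end{aligned}$$

Now, using the fact that $\sum_{k=1}^{\infty} e^{-c\sqrt{k}} \le \frac{2}{c^2}$ holds for all $c > 0$ (this can be easily seen by bounding the sum with an integral), the proof is concluded. ∎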

### A.3 The proof of Lemma 2

The proof of this lemma crucially builds on Lemma 1 of Agrawal and Goyal [2013], which we state and prove below.

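###### Lemma 4 (cf. Lemma 1 of Agrawal and Goyal [2013]).

$$\mathbb{P}\left[I_t = i, E^{\hat{\mu}}_{t,i}, E^{\tilde{\mu}}_{t,i} \,\middle|\, \mathcal{F}_{t-1}\right] \le \frac{1 - q_{t,i}}{q_{t,i}} \cdot \mathbb{P}\left[I_t = 1, E^{\hat{\mu}}_{t,i}, E^{\tilde{\mu}}_{t,i} \,\middle|\, \mathcal{F}_{t-1}\right]$$

###### Proof.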

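First, note that $E^{\hat{\mu}}_{t,i} \in \mathcal{F}_{t-1}$. We only have to care about the case when $E^{\hat{\mu}}_{t,i}$ holds, otherwise both sides of the inequality are zero and the statement trivially holds. Thus, we only have to prove

$$\mathbb{P}\left[I_t = i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \le \frac{1 - q_{t,i}}{q_{t,i}} \cdot \mathbb{P}\left[I_t = 1 \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right].$$

Now observe that $I_t = i$ under the event $E^{\tilde{\mu}}_{t,i}$ implies $\tilde{\mu}_{t,j} \le y_i$ for all $j$ (which follows from $\tilde{\mu}_{t,j} \le \tilde{\mu}_{t,i} \le y_i$). Thus, for any $i > 1$, we have

$$\begin{aligned} \mathbb{P}\left[I_t = i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] &\le \mathbb{P}\left[\forall j:\ \tilde{\mu}_{t,j} \le y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \\ &= \mathbb{P}\left[\tilde{\mu}_{t,1} \le y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \cdot \mathbb{P}\left[\forall j > 1:\ \tilde{\mu}_{t,j} \le y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \\ &= (1 - q_{t,i}) \cdot \mathbb{P}\left[\forall j > 1:\ \tilde{\mu}_{t,j} \le y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right], \end{aligned}$$

where the last equality holds because the event in question is independent of $E^{\tilde{\mu}}_{t,i}$. Similarly,

$$\begin{aligned} \mathbb{P}\left[I_t = 1 \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] &\ge \mathbb{P}\left[\forall j > 1:\ \tilde{\mu}_{t,1} > y_i \ge \tilde{\mu}_{t,j} \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \\ &= \mathbb{P}\left[\tilde{\mu}_{t,1} > y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \cdot \mathbb{P}\left[\forall j > 1:\ \tilde{\mu}_{t,j} \le y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right] \\ &= q_{t,i} \cdot \mathbb{P}\left[\forall j > 1:\ \tilde{\mu}_{t,j} \le y_i \,\middle|\, \mathcal{F}_{t-1}, E^{\tilde{\mu}}_{t,i}\right]. \end{aligned}$$

Combining the above two inequalities and multiplying both sides by $\mathbb{P}\left[E^{\tilde{\mu}}_{t,i} \,\middle|\, \mathcal{F}_{t-1}\right]$ gives the result. ∎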

We are now ready to prove Lemma 2.

###### Proof of Lemma 2.

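Following straightforward calculations and using Lemma 4,

$$\sum_{t=1}^{T} \mathbb{P}\left[I_t = i, E^{\tilde{\mu}}_{t,i}, E^{\hat{\mu}}_{t,i}\right] \le \sum_{k=0}^{T-1} \mathbb{E}\left[\frac{1 - q_{\tau_k,i}}{q_{\tau_k,i}}\right].$$

Thus, it remains to bound the summands on the right-hand side. To achieve this, we start by rewriting $q_{\tau_k,i}$ as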

 qτk,i =P[˜μτk,1>yi∣∣Fτk−1]=P⎡⎢⎣Zτk,1>μ1−ˆμτk,1−Δi3βτk,1∣∣ ∣∣Fτk−1⎤⎥⎦ =1−exp⎛⎜⎝−exp⎛⎜⎝