# A minimax and asymptotically optimal algorithm for stochastic bandits

We propose the kl-UCB ++ algorithm for regret minimization in stochastic bandit models with exponential families of distributions. We prove that it is simultaneously asymptotically optimal (in the sense of Lai and Robbins' lower bound) and minimax optimal. This is the first algorithm proved to enjoy these two properties at the same time. This work thus merges two different lines of research with simple and clear proofs.


## 1 Introduction

For regret minimization in stochastic bandit problems, two notions of time-optimality coexist. On the one hand, one may consider a fixed model: the famous lower bound by Lai and Robbins (1985) showed that the regret of any consistent strategy must grow at least as $\kappa \log(T)$ when the horizon $T$ goes to infinity, where $\kappa$ is a constant depending solely on the model. A strategy with a regret upper-bounded by $\kappa \log(T)\,(1+o(1))$ will be called in this paper asymptotically optimal. Lai and Robbins provided a first example of such a strategy in their seminal work. Later, Garivier and Cappé (2011) and Maillard et al. (2011) provided finite-time analyses for variants of the UCB algorithm (see Agrawal (1995); Burnetas and Katehakis (1996); Auer et al. (2002a)) which imply asymptotic optimality. Since then, other algorithms like Bayes-UCB (Kaufmann et al., 2012) and Thompson sampling (Korda et al., 2013) have also joined the family.

On the other hand, for a fixed horizon $T$ one may assess the quality of a strategy by the greatest regret suffered over all possible bandit models. If the regret of a bandit strategy is upper-bounded by $C\sqrt{KT}$ (the optimal rate: see Auer et al. (2002b) and Cesa-Bianchi and Lugosi (2006)) for some numeric constant $C$, this strategy is called minimax-optimal. The PolyINF and MOSS strategies by Audibert and Bubeck (2009) were the first proved to be minimax-optimal.

Hitherto, as far as we know, no algorithm had been proved to be simultaneously asymptotically- and minimax-optimal. Two limited exceptions may be mentioned: the case of two Gaussian arms is treated in Garivier et al. (2016a), and the OC-UCB algorithm of Lattimore (2015) is proved to be minimax-optimal and almost problem-dependent optimal for Gaussian multi-armed bandit problems. Notably, the OC-UCB algorithm satisfies another worthwhile property of finite-time instance near-optimality; see Section 2 of Lattimore (2015) for a detailed discussion.

### Contributions.

In this work, we put forward the kl-UCB++ algorithm, a slightly modified version of the kl-UCB+ algorithm, discussed in Garivier et al. (2016a) as an empirical improvement of UCB, and analyzed in Kaufmann (2016). This bandit strategy is designed for some exponential families of distributions, including for example Bernoulli and Gaussian laws. It borrows from the MOSS algorithm of Audibert and Bubeck (2009) the idea of dividing the horizon by the number of arms in order to reach minimax optimality. We prove that it is at the same time asymptotically- and minimax-optimal. This work thus merges the progress which has been made in different directions towards the understanding of the optimism principle, finally reconciling the two notions of time-optimality.

In this respect, our contribution answers a very simple and natural question. The need for simultaneous minimax and problem-dependent optimality could previously be addressed only in very limited settings, by means that could not be generalized to the framework adopted in our paper. Indeed, for a given horizon $T$, the worst problem depends on $T$: it involves arms separated by a gap of order $\sqrt{K/T}$. Treating these $T$-dependent problems correctly for all $T$ appears to be quite a different task from attaining the optimal, problem-dependent speed of convergence for every fixed bandit model. We show in this paper that the two goals can indeed be achieved simultaneously.

Combining the two notions of optimality requires a modified exploration rate. We stick as much as possible to existing algorithms and methods, introducing just what is necessary to obtain the desired results. Starting from the exploration rate of kl-UCB (so as to have a tight asymptotic analysis), one has to completely cancel the exploration bonus of the arms that have been drawn roughly $T/K$ times. The consequence is very slight and harmless when the best arm is much better than the others, but essential in order to minimize the regret in the worst case, where the best arm is barely distinguishable from the others. Indeed, when the best arm is separated from the suboptimal arms by a gap of order $\sqrt{K/T}$, we cannot afford to draw a suboptimal arm more than $T/K$ times if we are to get a regret of order $\sqrt{KT}$.

We present a general yet simple proof, combining the best elements of the above-cited sources which are simplified as much as possible and presented in a unified way. To this end, we develop new deviation inequalities, improving the analysis of the different terms contributing to the regret. This analysis is made in the framework which we believe is the best compromise between simplicity and generality (simple exponential families). This permits us to treat, among others, the Bernoulli and the Gaussian case at the same time. More fundamentally, this appears to us as the right, simple framework for the analysis, which emphasizes what is really required to have simple lower- and upper-bounds (the possibility to make adequate changes of measure, and Chernoff-type deviation bounds).

The paper is organized as follows. In Section 2, we introduce the setting and the assumptions required for the main results, Theorems 1 and 2, which are presented in Section 3. We give the entire proofs of these results in Sections 4 and 5, with only a few technical lemmas deferred to Appendix A. We conclude in Section 6 with some brief references to possible future prospects.

## 2 Notation and Setting

### Exponential families.

We consider a simple stochastic bandit problem with arms indexed by $a \in \{1,\dots,K\}$, with $K \geqslant 2$. Each arm is assumed to be a probability distribution $\nu_{\theta_a}$ of some canonical one-dimensional exponential family indexed by $\theta \in \Theta$. The probability law $\nu_\theta$ is assumed to be absolutely continuous with respect to a dominating measure $\rho$ on $\mathbb{R}$, with a density given by

$$\frac{\mathrm{d}\nu_\theta}{\mathrm{d}\rho}(x) = \exp\bigl(x\theta - b(\theta)\bigr), \quad \text{where } b(\theta) = \log \int_{\mathbb{R}} e^{x\theta}\,\mathrm{d}\rho(x) \;\text{ and }\; \Theta = \bigl\{\theta \in \mathbb{R} \,:\, b(\theta) < +\infty\bigr\}.$$

It is well known that $b$ is convex and twice differentiable on $\Theta$, and that $b'(\theta)$ and $b''(\theta)$ are respectively the mean and the variance of the distribution $\nu_\theta$. The family can thus be parametrized by the mean $\mu = b'(\theta)$, for $\mu \in b'(\Theta)$. The Kullback-Leibler divergence between two distributions of the family is

$$\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = b(\theta') - b(\theta) - b'(\theta)(\theta' - \theta).$$

This permits to define the following divergence on the set of arm expectations: for $\mu = b'(\theta)$ and $\mu' = b'(\theta')$, we write

$$\mathrm{kl}(\mu,\mu') := \mathrm{KL}(\nu_\theta, \nu_{\theta'}).$$

For a minimax analysis, we need to restrict the set of means to a bounded interval: we suppose that each arm satisfies $\mu_a \in I := [\mu^-, \mu^+]$ for two fixed real numbers $\mu^- < \mu^+$. Our analysis requires a Pinsker-like inequality; we therefore assume that the variance is bounded in the exponential family: there exists $V < +\infty$ such that

$$\sup_{\mu \in I} b''\bigl(b'^{-1}(\mu)\bigr) = \sup_{\mu \in I} \mathrm{Var}\bigl(\nu_{b'^{-1}(\mu)}\bigr) \leqslant V < +\infty.$$

This implies that for all $\mu, \mu' \in I$,

$$\mathrm{kl}(\mu,\mu') \geqslant \frac{1}{2V}(\mu - \mu')^2. \tag{1}$$

In the sequel, we denote by $\mathcal{F}$ the set of bandit problems satisfying these assumptions. By the usual Pinsker inequality, this setting includes in particular Bernoulli bandits with $I = [0,1]$ and $V = 1/4$ (with the usual convention $0 \log 0 = 0$). This also includes (bounded) Gaussian bandits with known variance $\sigma^2$, with the choice $I = [\mu^-, \mu^+]$ and $V = \sigma^2$.
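As a concrete illustration (our own sketch, not part of the paper), the divergence $\mathrm{kl}$ and the Pinsker-like inequality (1) can be made explicit in the two running examples: for Bernoulli laws, $\mathrm{kl}$ is the binary relative entropy and $V = 1/4$; for Gaussians with known variance $\sigma^2$, $\mathrm{kl}(\mu,\mu') = (\mu-\mu')^2/(2\sigma^2)$ and (1) holds with equality for $V = \sigma^2$. The clipping constant below is our own numerical safeguard.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Binary relative entropy kl(p, q), clipped away from {0, 1} to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_gaussian(mu1, mu2, sigma2=1.0):
    """kl between two Gaussian laws with the same known variance sigma2."""
    return (mu1 - mu2) ** 2 / (2 * sigma2)

# Pinsker-like inequality (1): kl(mu, mu') >= (mu - mu')^2 / (2 V)
mu, mu_p = 0.3, 0.6
assert kl_bernoulli(mu, mu_p) >= (mu - mu_p) ** 2 / (2 * 0.25)  # Bernoulli: V = 1/4
assert kl_gaussian(mu, mu_p) == (mu - mu_p) ** 2 / (2 * 1.0)    # Gaussian: equality, V = sigma^2
```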

### Regret.

The arms are denoted $\nu_1, \dots, \nu_K$, and the expectation of arm $a$ is denoted by $\mu_a$. At each round $t \geqslant 1$, the player pulls an arm $A_t$ and receives an independent draw $Y_t$ of the distribution $\nu_{A_t}$. This reward is the only piece of information available to the player. The best mean is $\mu^\star = \max_a \mu_a$. We denote by $N_a(t)$ the number of draws of arm $a$ up to and including time $t$. In this work, the goal is to minimize the expected regret

$$R_T = T\mu^\star - \mathbb{E}\Bigl[\sum_{t=1}^{T} Y_t\Bigr] = \mathbb{E}\Bigl[\sum_{t=1}^{T} \bigl(\mu^\star - \mu_{A_t}\bigr)\Bigr] = \sum_{a=1}^{K} \bigl(\mu^\star - \mu_a\bigr)\,\mathbb{E}\bigl[N_a(T)\bigr].$$
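The identity above, expressing the regret as a gap-weighted sum of expected pull counts, is easy to check numerically; here is a small sanity check with hypothetical means and counts (our illustration, not data from the paper):

```python
# Hypothetical arm means and expected pull counts E[N_a(T)] with T = 1000
mu = [0.9, 0.5, 0.4]
EN = [900.0, 60.0, 40.0]
T = sum(EN)
mu_star = max(mu)

# R_T = T mu* - E[total reward]  =  sum_a (mu* - mu_a) E[N_a(T)]
regret_from_rewards = T * mu_star - sum(m * n for m, n in zip(mu, EN))
regret_from_gaps = sum((mu_star - m) * n for m, n in zip(mu, EN))
assert abs(regret_from_rewards - regret_from_gaps) < 1e-9
```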

Lai and Robbins (1985) proved that if a strategy is uniformly efficient, that is, if under any bandit model of a sufficiently rich family (such as an exponential family as described above) it satisfies $R_T = o(T^\alpha)$ for every $\alpha > 0$, then it needs to draw any suboptimal arm $a$ at least as often as

$$\mathbb{E}\bigl[N_a(T)\bigr] \geqslant \frac{\log(T)}{\mathrm{kl}(\mu_a, \mu^\star)}\bigl(1 - o(1)\bigr).$$

In light of the previous equality, this directly implies an asymptotic lower bound on $R_T$.

On the other side, a straightforward adaptation of the proof of Theorem A.2 of Auer et al. (2002b) shows that there exists a constant $C'$, depending only on the considered family of distributions, such that

$$\sup_{\nu \in \mathcal{F}} R_T \geqslant C' \min\bigl(\sqrt{KT}, T\bigr),$$

where the supremum is taken over all bandit problems in $\mathcal{F}$. Note that the notion of minimax optimality is defined here only up to a multiplicative constant, in contrast to the definition of (problem-dependent) asymptotic optimality. For a discussion of the minimax and asymptotic lower bounds, we refer to Garivier et al. (2016b) and the references therein.

## 3 The kl-UCB++ Algorithm

We denote by $\hat\mu_{a,n}$ the empirical mean of the first $n$ rewards from arm $a$. The empirical mean of arm $a$ after $t$ rounds is

$$\hat\mu_a(t) = \hat\mu_{a, N_a(t)} = \frac{1}{N_a(t)} \sum_{s=1}^{t} Y_s\,\mathbb{I}\{A_s = a\}.$$

Parameters: the horizon $T$ and an exploration function $g$.
Initialization: pull each arm once.
For $t = K$ to $T-1$, do: compute for each arm $a$ the quantity

$$U_a(t) = \sup\Bigl\{\mu \in I \,:\, \mathrm{kl}\bigl(\hat\mu_a(t), \mu\bigr) \leqslant \frac{g(N_a(t))}{N_a(t)}\Bigr\}, \tag{2}$$

then play $A_{t+1} \in \arg\max_a U_a(t)$.

The kl-UCB++ algorithm is a slight modification of the kl-UCB algorithm of Garivier and Cappé (2011) and of the kl-UCB+ variant analyzed in Kaufmann (2016). It uses the exploration function $g$ given by

$$g(n) = \log_+\Bigl(\frac{T}{Kn}\Bigl(\log_+^2\Bigl(\frac{T}{Kn}\Bigr) + 1\Bigr)\Bigr), \tag{3}$$

where $\log_+(x) := \max\bigl(\log(x), 0\bigr)$. The exploration function borrows its general form, with the extra $\log\log$ exploration rate, from the kl-UCB algorithm, the division of the horizon by the number of draws from kl-UCB+, and the division by the number of arms $K$ from MOSS.
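For concreteness, here is a minimal Bernoulli-case sketch of the strategy (our own illustration, not the authors' code; the function names and the bisection tolerance are our choices): the exploration function (3), the upper-confidence index (2) computed by bisection, and the main loop.

```python
import math
import random

def log_plus(x):
    # log_+(x) = max(log(x), 0)
    return max(math.log(x), 0.0) if x > 0 else 0.0

def g(n, T, K):
    # Exploration function (3): g(n) = log_+( T/(K n) (log_+^2(T/(K n)) + 1) )
    x = T / (K * n)
    return log_plus(x * (log_plus(x) ** 2 + 1.0))

def kl(p, q, eps=1e-12):
    # Bernoulli kl divergence, clipped to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def index(mu_hat, n, T, K):
    # Index (2): U = sup{ mu in [0, 1] : kl(mu_hat, mu) <= g(n)/n }, by bisection
    level = g(n, T, K) / n
    lo, hi = mu_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if kl(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def klucbpp(means, T, seed=0):
    # Run the strategy on Bernoulli arms; returns the pull counts N_a(T)
    rng = random.Random(seed)
    K = len(means)
    counts = [1] * K                                  # initialization: pull each arm once
    sums = [float(rng.random() < m) for m in means]
    for _ in range(K, T):
        a = max(range(K), key=lambda i: index(sums[i] / counts[i], counts[i], T, K))
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
    return counts
```

Note that $g(n) = 0$ as soon as $n$ is of order $T/K$, so the index of a heavily sampled arm collapses to its empirical mean: this is exactly the cancellation of the exploration bonus discussed in the introduction.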

The following results state that the kl-UCB++ algorithm is simultaneously minimax- and asymptotically optimal.

Theorem 1 (Minimax optimality). For any family of distributions satisfying the assumptions detailed in Section 2, and for any bandit model $\nu \in \mathcal{F}$, the expected regret of the kl-UCB++ algorithm is upper-bounded as

$$R_T \leqslant 76\sqrt{VKT} + (\mu^+ - \mu^-)K. \tag{4}$$

Theorem 2 (Asymptotic optimality). For any bandit model $\nu \in \mathcal{F}$, for any suboptimal arm $a$ and any $\delta > 0$ such that $\mu_a + \delta < \mu^\star - \delta$,

$$\mathbb{E}\bigl[N_a(T)\bigr] \leqslant \frac{\log(T)}{\mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)} + O\Bigl(\frac{\log\log(T)}{\delta^2}\Bigr), \tag{5}$$

which implies asymptotic optimality (see the end of the proof in Section 5 for an explicit bound).

Theorems 1 and 2 are proved in Sections 4 and 5 respectively. The main differences between the two proofs are discussed at the beginning of Section 5. Note that the two regret bounds of Theorems 1 and 2 also apply to all $[0,1]$-valued bandit models, with the value $V = 1/4$, as the deviations of $[0,1]$-valued random variables are dominated by those of a Bernoulli distribution with the same mean (this is discussed for example in Cappé et al. (2013)). However, the kl-UCB++ algorithm is not asymptotically optimal in that case: the regret bound (5) is then no longer optimal. Asymptotic optimality would require tight distribution-dependent, non-parametric upper confidence bounds (for example based on the empirical-likelihood method, as in the above-cited paper). This is beyond the scope of this work (and would require much more space).

## 4 Proof of Theorem 1

This proof merges ideas presented in Bubeck and Liu (2013) for the analysis of the MOSS algorithm with ideas from the analysis of kl-UCB in Cappé et al. (2013) (see also Kaufmann (2016)). It is divided into the following steps:

### Decomposition of the regret.

Let $a^\star$ be the index of an optimal arm. Since, by definition of the strategy, $U_{a^\star}(t) \leqslant U_{A_{t+1}}(t)$ for all $t \geqslant K$, the regret can be decomposed as follows:

$$R_T \leqslant (\mu^+ - \mu^-)K + \underbrace{\mathbb{E}\Bigl[\sum_{t=K}^{T-1}\bigl(\mu^\star - U_{a^\star}(t)\bigr)\Bigr]}_{A} + \underbrace{\mathbb{E}\Bigl[\sum_{t=K}^{T-1}\bigl(U_{A_{t+1}}(t) - \mu_{A_{t+1}}\bigr)\Bigr]}_{B}. \tag{6}$$

We define $\delta_0 := \sqrt{2VK/T}$; since the bound (4) is otherwise trivial, we assume in the sequel that $\delta_0 \leqslant \mu^+ - \mu^-$. For the first term $A$, as in the analysis of the MOSS algorithm, we carefully upper-bound the probability that appears inside the integral thanks to a 'peeling trick'. The second term $B$ is easier to handle, since we can reduce the index to a UCB-like index thanks to the Pinsker-like inequality (1) and proceed as in Bubeck and Liu (2013).

### Step 1: Upper-bounding A.

Term $A$ is concerned with the optimal arm only. Two words of intuition: since $U_{a^\star}(t)$ is meant to be an upper confidence bound for $\mu^\star$, this term should not be too large, at least as long as the confidence level controlled by the function $g$ is large enough; but when the confidence level is low, the number of draws is large and deviations are unlikely.

Upper-bounding term $A$ boils down to controlling the probability that $\mu^\star$ is under-estimated at time $t$. Indeed,

$$\mathbb{E}\bigl[\mu^\star - U_{a^\star}(t)\bigr] \leqslant \mathbb{E}\bigl[(\mu^\star - U_{a^\star}(t))^+\bigr] \leqslant \int_0^{+\infty}\mathbb{P}\bigl(u < \mu^\star - U_{a^\star}(t)\bigr)\,\mathrm{d}u \leqslant \delta_0 + \int_{\delta_0}^{+\infty}\mathbb{P}\bigl(U_{a^\star}(t) \leqslant \mu^\star - u\bigr)\,\mathrm{d}u, \tag{7}$$

and we need to upper-bound the left deviations of the mean of arm $a^\star$. On the event $\{U_{a^\star}(t) \leqslant \mu^\star - u\}$, we have $\hat\mu_{a^\star}(t) \leqslant \mu^\star - u$, and by definition of $U_{a^\star}(t)$ it holds that

$$\mathrm{kl}\bigl(\hat\mu_{a^\star}(t), \mu^\star\bigr) \geqslant \frac{g(N_{a^\star}(t))}{N_{a^\star}(t)}.$$

Consequently,

$$\mathbb{P}\bigl(U_{a^\star}(t) \leqslant \mu^\star - u\bigr) \leqslant \mathbb{P}\Bigl(\hat\mu_{a^\star}(t) \leqslant \mu^\star - u \ \text{ and }\ \mathrm{kl}\bigl(\hat\mu_{a^\star}(t), \mu^\star\bigr) \geqslant g(N_{a^\star}(t))/N_{a^\star}(t)\Bigr) \leqslant \mathbb{P}\Bigl(\exists\, 1 \leqslant n \leqslant T,\ \hat\mu_{a^\star,n} \leqslant \mu^\star - u \ \text{ and }\ \mathrm{kl}\bigl(\hat\mu_{a^\star,n}, \mu^\star\bigr) \geqslant g(n)/n\Bigr). \tag{8}$$

For small values of $n$, the dominant term is given by the deviation event, whereas for large $n$ the event is quite unlikely. This is why we split the probability into two terms, proceeding as follows. Let $f$ be the function defined, for $u \geqslant \delta_0$, by

$$f(u) = \frac{2V}{u^2}\log\Bigl(\frac{Tu^2}{2VK}\Bigr).$$

Our choice of $\delta_0$ implies that $Tu^2/(2VK) \geqslant 1$ for all $u \geqslant \delta_0$, and thus

$$f(u) \leqslant \frac{T}{eK}. \tag{9}$$

In particular, for $n \leqslant f(u)$ it holds that

$$g(n) = \log\Bigl(\frac{T}{Kn}\Bigl(1 + \log^2\Bigl(\frac{T}{Kn}\Bigr)\Bigr)\Bigr).$$

It appears that $f(u)$ is the right place at which to split the probability in Equation (8): defining $\mathrm{kl}^+(p,q) := \mathrm{kl}(p,q)\,\mathbb{I}\{p \leqslant q\}$, we write

$$\mathbb{P}\bigl(\exists\,1 \leqslant n \leqslant T,\ \hat\mu_{a^\star,n} \leqslant \mu^\star - u \ \text{ and }\ \mathrm{kl}(\hat\mu_{a^\star,n},\mu^\star) \geqslant g(n)/n\bigr) \leqslant \underbrace{\mathbb{P}\bigl(\exists\,1 \leqslant n \leqslant f(u),\ \mathrm{kl}^+(\hat\mu_{a^\star,n},\mu^\star) \geqslant g(n)/n\bigr)}_{A_1} + \underbrace{\mathbb{P}\bigl(\exists\,f(u) \leqslant n \leqslant T,\ \hat\mu_{a^\star,n} \leqslant \mu^\star - u\bigr)}_{A_2}. \tag{10}$$

Controlling the terms $A_1$ and $A_2$ is a matter of deviation inequalities.

Step 1.1: Upper-bounding $A_1$. The term $A_1$, which involves self-normalized deviation probabilities, can be upper-bounded thanks to a 'peeling trick', as in the proof of Theorem 5 of Audibert and Bubeck (2009). We assume that $f(u) \geqslant 1$, for otherwise $A_1 = 0$. We use the geometric grid $f(u)/\beta^{\ell+1} \leqslant n \leqslant f(u)/\beta^{\ell}$, where the real $\beta > 1$ will be chosen later. We write

$$A_1 \leqslant \sum_{\ell=0}^{+\infty}\underbrace{\mathbb{P}\Bigl(\exists\,\frac{f(u)}{\beta^{\ell+1}} \leqslant n \leqslant \frac{f(u)}{\beta^{\ell}},\ \mathrm{kl}^+(\hat\mu_{a^\star,n},\mu^\star) \geqslant \gamma_\ell\Bigr)}_{A_1^\ell}, \tag{11}$$

where

$$\gamma_\ell = \log\Bigl(\frac{T\beta^{\ell}}{Kf(u)}\Bigl(1 + \log^2\Bigl(\frac{T}{Kf(u)}\Bigr)\Bigr)\Bigr)\Big/\bigl(f(u)/\beta^{\ell}\bigr).$$

Thanks to Doob’s maximal inequality (see Lemma A in Appendix A),

$$A_1^\ell \leqslant \exp\Bigl(-\frac{f(u)}{\beta^{\ell+1}}\,\gamma_\ell\Bigr) = e^{-\ell\log(\beta)/\beta - C/\beta},$$

where

$$C := \log\Bigl(\frac{T}{Kf(u)}\Bigl(1 + \log^2\Bigl(\frac{T}{Kf(u)}\Bigr)\Bigr)\Bigr). \tag{12}$$

Plugging this last inequality into (11), together with the numerical inequality of Lemma A (see Appendix A), we get

$$A_1 \leqslant \sum_{\ell=0}^{+\infty} e^{-\ell\log(\beta)/\beta - C/\beta} = \frac{e^{-C/\beta}}{1 - e^{-\log(\beta)/\beta}} \leqslant \frac{e}{e^{\log(\beta)/\beta} - 1}\,e^{-C/\beta} \leqslant 2e\max\Bigl(\beta, \frac{\beta}{\beta-1}\Bigr)e^{-C/\beta}.$$

But thanks to Equation (9),

$$C = \log\Bigl(\frac{T}{Kf(u)}\Bigl(1 + \log^2\Bigl(\frac{T}{Kf(u)}\Bigr)\Bigr)\Bigr) \geqslant \log\Bigl(\frac{T}{Kf(u)}\Bigr) \geqslant \frac{3}{2}.$$

It is now time to choose $\beta := C/(C-1)$, so that $C/\beta = C - 1$ and $\max\bigl(\beta, \beta/(\beta-1)\bigr) \leqslant 2C$. Together with the definition (12) of $C$, this choice yields

$$A_1 \leqslant 4e^2Ce^{-C} = 4e^2\,\frac{\log\Bigl(\frac{T}{Kf(u)}\Bigl(1 + \log^2\bigl(\frac{T}{Kf(u)}\bigr)\Bigr)\Bigr)}{1 + \log^2\bigl(\frac{T}{Kf(u)}\bigr)}\,\frac{Kf(u)}{T}, \tag{13}$$

and therefore

$$A_1 \leqslant 4e^2\,\frac{Kf(u)}{T} = \frac{16e^2VK}{Tu^2}\log\Bigl(\sqrt{\frac{T}{2VK}}\,u\Bigr) \tag{14}$$

as, for all $x \geqslant 1$,

$$\frac{\log\bigl(x(1 + \log^2(x))\bigr)}{1 + \log^2(x)} \leqslant 1.$$
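The numerical inequality invoked in this last step can be checked on a grid (our own quick verification, not a substitute for a proof):

```python
import math

def ratio(x):
    # log(x (1 + log^2 x)) / (1 + log^2 x), claimed to be <= 1 for all x >= 1
    s = 1.0 + math.log(x) ** 2
    return math.log(x * s) / s

assert ratio(1.0) == 0.0
xs = [1.0 + k * 0.01 for k in range(100000)]   # grid on [1, 1001]
assert max(ratio(x) for x in xs) <= 1.0        # maximum is about 0.86, near x = 3
```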

Step 1.2: Upper-bounding $A_2$. The term $A_2$ is simpler to handle, as it does not involve self-normalized deviations. Thanks to the maximal inequality (recalled in Equation (33) of Appendix A) and to the Pinsker-like inequality (1),

$$A_2 \leqslant e^{-u^2f(u)/(2V)} = \frac{2VK}{Tu^2}. \tag{15}$$

Putting Equations (7) to (15) together, we obtain

$$\mathbb{E}\bigl[\mu^\star - U_{a^\star}(t)\bigr] \leqslant \delta_0 + \int_{\delta_0}^{+\infty}\frac{16e^2VK}{Tu^2}\log\Bigl(\sqrt{\frac{T}{2VK}}\,u\Bigr) + \frac{2VK}{Tu^2}\,\mathrm{d}u. \tag{16}$$

It only remains to conclude with some calculus:

$$\int_{\delta_0}^{+\infty}\frac{16e^2VK}{Tu^2}\log\Bigl(\sqrt{\frac{T}{2VK}}\,u\Bigr)\mathrm{d}u = \Biggl[-\frac{16e^2VK}{Tu}\log\Bigl(e\sqrt{\frac{T}{2VK}}\,u\Bigr)\Biggr]_{\delta_0}^{+\infty} = 16e^2\sqrt{\frac{V}{2}}\sqrt{\frac{K}{T}},$$

since $\sqrt{T/(2VK)}\,\delta_0 = 1$.

Similarly,

$$\int_{\delta_0}^{+\infty}\frac{2VK}{Tu^2}\,\mathrm{d}u = \frac{2VK}{T\delta_0} = \sqrt{2V}\sqrt{\frac{K}{T}},$$

and replacing $\delta_0$ by its value we obtain from Equation (16) the following relation:

$$\mathbb{E}\bigl[\mu^\star - U_{a^\star}(t)\bigr] \leqslant \sqrt{V}\Bigl(\sqrt{2} + \frac{16e^2}{\sqrt{2}} + \sqrt{2}\Bigr)\sqrt{\frac{K}{T}}.$$

Summing over $t$ from $K$ to $T-1$, this yields:

$$A \leqslant \sqrt{V}\Bigl(\sqrt{2} + \frac{16e^2}{\sqrt{2}} + \sqrt{2}\Bigr)\sqrt{KT}. \tag{17}$$

### Step 2: Upper-bounding B.

Term $B$ is of a different nature since, typically, $U_{A_{t+1}}(t) \geqslant \mu_{A_{t+1}}$. However, as for the term $A$, we first reduce the problem to the upper-bounding of a probability:

$$B \leqslant \sum_{t=K}^{T-1}\Bigl(\delta_0 + \int_{\delta_0}^{+\infty}\mathbb{P}\bigl(U_{A_{t+1}}(t) - \mu_{A_{t+1}} \geqslant u\bigr)\,\mathrm{d}u\Bigr) \leqslant T\delta_0 + \int_{\delta_0}^{+\infty}\sum_{t=K}^{T-1}\mathbb{P}\bigl(U_{A_{t+1}}(t) - \mu_{A_{t+1}} \geqslant u\bigr)\,\mathrm{d}u. \tag{18}$$

The event $\{U_{A_{t+1}}(t) - \mu_{A_{t+1}} \geqslant u\}$ is typical if $N_{A_{t+1}}(t)$ is small, and corresponds to a deviation of the sample mean otherwise. In order to handle this correctly, we first get rid of the randomness of $N_{A_{t+1}}(t)$ by the pessimistic trajectorial upper bound of Bubeck and Liu (2013):

$$\sum_{t=K}^{T-1}\mathbb{I}\bigl\{U_{A_{t+1}}(t) - \mu_{A_{t+1}} \geqslant u\bigr\} \leqslant \sum_{n=1}^{T}\sum_{a=1}^{K}\mathbb{I}\bigl\{U_{a,n} - \mu_a \geqslant u\bigr\}.$$

In addition, we simplify the upper bound thanks to our assumption (1) that a Pinsker-type inequality is available:

$$U_{a,n} := \sup\Bigl\{\mu \in I \,:\, \mathrm{kl}(\hat\mu_{a,n},\mu) \leqslant \frac{g(n)}{n}\Bigr\} \leqslant B_{a,n} := \hat\mu_{a,n} + \sqrt{\frac{2Vg(n)}{n}}. \tag{19}$$

Hence, $B$ can be upper-bounded as

$$B \leqslant T\delta_0 + \sum_{a=1}^{K}\int_{\delta_0}^{+\infty}\sum_{n=1}^{T}\mathbb{P}\bigl(B_{a,n} - \mu_a \geqslant u\bigr)\,\mathrm{d}u. \tag{20}$$

Then we only need to upper-bound $\sum_{n=1}^{T}\mathbb{P}(B_{a,n} - \mu_a \geqslant u)$ for each arm $a$. We cut the sum at the critical sample size at which the event becomes atypical: for $u \geqslant \delta_0$, let $n(u)$ be the smallest integer such that

$$n(u) \geqslant \frac{8V}{u^2}\log_+\Bigl(\frac{Tu^2}{8VK}\Bigr).$$

For $n \geqslant n(u)$ it holds that

$$\sqrt{\frac{2Vg(n)}{n}} \leqslant \frac{u}{\sqrt{2}}. \tag{21}$$

Indeed, as $1 + \log_+^2(x) \leqslant x$ for all $x \geqslant 1$, we have

$$\frac{2Vg(n)}{n} \leqslant \frac{4V}{n}\log_+\Bigl(\frac{T}{Kn}\Bigr);$$

observe also that the right-hand side is non-increasing in $n$, and thus, for $n \geqslant n(u)$,

$$\frac{2Vg(n)}{n} \leqslant \frac{4V}{n(u)}\log_+\Bigl(\frac{T}{Kn(u)}\Bigr) \leqslant \frac{u^2}{2}.$$

Therefore, cutting the sum in (20) at $n(u)$, we obtain:

$$\sum_{n=1}^{T}\mathbb{P}\bigl(B_{a,n} - \mu_a \geqslant u\bigr) \leqslant n(u) - 1 + \sum_{n=n(u)}^{T}\mathbb{P}\Bigl(\hat\mu_{a,n} - \mu_a \geqslant u - \sqrt{2Vg(n)/n}\Bigr) \leqslant n(u) - 1 + \sum_{n=n(u)}^{T}\mathbb{P}\bigl(\hat\mu_{a,n} - \mu_a \geqslant u(1 - 1/\sqrt{2})\bigr) \leqslant \frac{8V}{u^2}\log\Bigl(\frac{Tu^2}{8VK}\Bigr) + \sum_{n=n(u)}^{T}\mathbb{P}\bigl(\hat\mu_{a,n} - \mu_a \geqslant cu\bigr), \tag{22}$$

where $c := 1 - 1/\sqrt{2}$. It remains to integrate Inequality (22) from $\delta_0$ to infinity. The first summand involves the same kind of integral as the one we already met in the upper bound of term $A$:

$$\int_{\delta_0}^{+\infty}\frac{8V}{u^2}\log\Bigl(\frac{Tu^2}{8VK}\Bigr)\mathrm{d}u = 16\sqrt{\frac{V}{2}}\,\log\Bigl(\frac{e}{2}\Bigr)\sqrt{\frac{T}{K}}.$$

For the remaining summand, Inequality (33) yields

$$\sum_{n=n(u)}^{T}\mathbb{P}\bigl(\hat\mu_{a,n} - \mu_a \geqslant cu\bigr) \leqslant \sum_{n=n(u)}^{T} e^{-\frac{u^2c^2n}{2V}} \leqslant \frac{1}{e^{\frac{u^2c^2}{2V}} - 1}.$$

Thus, as $e^x - 1 \geqslant x$ for all $x \geqslant 0$,

$$\int_{\delta_0}^{+\infty}\frac{\mathrm{d}u}{e^{\frac{u^2c^2}{2V}} - 1} \leqslant \int_{\delta_0}^{+\infty}\frac{2V}{u^2c^2}\,\mathrm{d}u = \frac{2}{c^2}\sqrt{\frac{V}{2}}\sqrt{\frac{T}{K}}.$$

Putting everything together starting from Inequality (22), we have proved that

$$\int_{\delta_0}^{+\infty}\sum_{n=1}^{T}\mathbb{P}\bigl(B_{a,n} - \mu_a \geqslant u\bigr)\,\mathrm{d}u \leqslant \sqrt{\frac{V}{2}}\Bigl(16\log\Bigl(\frac{e}{2}\Bigr) + \frac{2}{c^2}\Bigr)\sqrt{\frac{T}{K}}.$$

By Equation (20), replacing $\delta_0$ by its value finally yields

$$B \leqslant \sqrt{V}\Bigl(\sqrt{2} + \frac{16}{\sqrt{2}}\log\Bigl(\frac{e}{2}\Bigr) + \frac{2}{\sqrt{2}\,c^2}\Bigr)\sqrt{KT}. \tag{23}$$

### Conclusion of the proof.

It just remains to plug Inequalities (17) and (23) into Equation (6):

$$A + B \leqslant 76\sqrt{VKT},$$

which concludes the proof.

## 5 Proof of Theorem 2

The analysis of asymptotic optimality shares many elements with the minimax analysis, with some differences however. The decomposition into two terms $A$ and $B$ is similar, but localized on a fixed suboptimal arm $a$: we analyze the number of draws of $a$ rather than the regret directly (and we do not need to integrate the deviations at the end). We proceed roughly as in the proof of Theorem 1 for term $A$, which involves the deviations of an optimal arm. For term $B$, which accounts for the behavior of the suboptimal arm $a$, a different (but classical) argument is used, as one cannot simply use the Pinsker-like inequality (1) if one wants to obtain the correct constant (and thus asymptotic optimality).

### Decomposition of $\mathbb{E}[N_a(T)]$.

If arm $a$ is pulled at time $t+1$, then by definition of the strategy $U_a(t) \geqslant U_{a^\star}(t)$ for any index $a^\star$ of an optimal arm. Thus, for any fixed $\delta > 0$ to be chosen later,

$$\{A_{t+1} = a\} \subseteq \bigl\{\mu^\star - \delta \geqslant U_{a^\star}(t)\bigr\} \cup \bigl\{\mu^\star - \delta < U_a(t) \ \text{ and }\ A_{t+1} = a\bigr\}.$$

As a consequence,

$$\mathbb{E}\bigl[N_a(T)\bigr] \leqslant 1 + \underbrace{\sum_{t=K}^{T-1}\mathbb{P}\bigl(U_{a^\star}(t) \leqslant \mu^\star - \delta\bigr)}_{A} + \underbrace{\sum_{t=K}^{T-1}\mathbb{P}\bigl(\mu^\star - \delta < U_a(t) \ \text{ and }\ A_{t+1} = a\bigr)}_{B}, \tag{24}$$

and it remains to bound each of these terms.

### Step 1: Upper-bounding A.

As in the proof of Theorem 1, we write

$$\mathbb{P}\bigl(U_{a^\star}(t) \leqslant \mu^\star - \delta\bigr) \leqslant \underbrace{\mathbb{P}\bigl(\exists\,1 \leqslant n \leqslant f(\delta),\ \mathrm{kl}^+(\hat\mu_{a^\star,n},\mu^\star) \geqslant g(n)/n\bigr)}_{A_1} + \underbrace{\mathbb{P}\bigl(\exists\,f(\delta) \leqslant n \leqslant T,\ \hat\mu_{a^\star,n} \leqslant \mu^\star - \delta\bigr)}_{A_2}, \tag{25}$$

where we use the same function $f$ as before, now evaluated at $\delta$:

$$f(\delta) = \frac{2V}{\delta^2}\log\Bigl(\frac{T\delta^2}{2KV}\Bigr).$$

Thanks to Inequality (13), which we established in the proof of Theorem 1, we obtain

$$A_1 \leqslant 4e^2\,\frac{\log\bigl(x(1+\log^2(x))\bigr)}{\log(x)}\;\frac{f(\delta)}{\log(x)}\;\frac{K}{T} \leqslant 16e^2\,\frac{2VK}{T\delta^2}, \quad\text{where } x := \frac{T}{Kf(\delta)}.$$

Here, we used that, for all $x > 1$,

$$\frac{\log\bigl(x(1+\log^2(x))\bigr)}{\log(x)} \leqslant 2 \quad\text{and}\quad \frac{\log(x)}{\log\bigl(x/\log(x)\bigr)} \leqslant 2,$$

and that

$$\frac{f(\delta)}{\log\bigl(\frac{T}{Kf(\delta)}\bigr)} = \frac{2V}{\delta^2}\,\frac{\log\bigl(\frac{T\delta^2}{2VK}\bigr)}{\log\Bigl(\frac{T\delta^2}{2VK}\,\frac{1}{\log(T\delta^2/(2VK))}\Bigr)} \leqslant \frac{4V}{\delta^2}.$$
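The two numerical inequalities used in this step can again be checked on a grid (our own quick verification, not a substitute for the proof):

```python
import math

def r1(x):
    # log(x (1 + log^2 x)) / log(x), claimed <= 2 for x > 1
    return math.log(x * (1.0 + math.log(x) ** 2)) / math.log(x)

def r2(x):
    # log(x) / log(x / log(x)), claimed <= 2 for x > 1
    return math.log(x) / math.log(x / math.log(x))

xs = [1.01 + k * 0.01 for k in range(200000)]  # grid on (1, ~2001]
assert max(r1(x) for x in xs) <= 2.0           # supremum is about 1.81
assert max(r2(x) for x in xs) <= 2.0           # supremum is 1/(1 - 1/e), about 1.58
```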

Thanks to the maximal inequality recalled in Appendix A as Equation (33), it holds that

$$A_2 \leqslant e^{-\delta^2 f(\delta)/(2V)} = \frac{2VK}{T\delta^2}. \tag{26}$$

Putting Equations (25) and (26) together and summing over $t$ yields:

$$A \leqslant (16e^2 + 1)\,\frac{2VK}{\delta^2}. \tag{27}$$

### Step 2: Upper-bounding B.

Thanks to the definition (2) of $U_a(t)$, it holds that

$$\bigl\{\mu^\star - \delta < U_a(t)\bigr\} \subseteq \Bigl\{\mathrm{kl}\bigl(\hat\mu_a(t), \mu^\star - \delta\bigr) \leqslant \frac{g(N_a(t))}{N_a(t)}\Bigr\}.$$

Together with the following classical argument for regret analysis in bandit models, this yields:

$$B \leqslant \sum_{t=K}^{T-1}\mathbb{P}\Bigl(\mathrm{kl}\bigl(\hat\mu_a(t), \mu^\star - \delta\bigr) \leqslant g(N_a(t))/N_a(t) \ \text{ and }\ A_{t+1} = a\Bigr) \leqslant \sum_{n=1}^{T}\mathbb{P}\bigl(\mathrm{kl}(\hat\mu_{a,n}, \mu^\star - \delta) \leqslant g(n)/n\bigr), \tag{28}$$

as each value of $N_a(t)$ is taken at most once on the event $\{A_{t+1} = a\}$. Now, let $n(\delta)$ be the integer defined as

$$n(\delta) = \Biggl\lceil\frac{\log\Bigl(\frac{T}{K}\Bigl(1 + \log^2\bigl(\frac{T}{K}\bigr)\Bigr)\Bigr)}{\mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)}\Biggr\rceil.$$

Then, for $n \geqslant n(\delta)$,

$$\log\Bigl(\frac{T}{K}\Bigl(1 + \log^2\Bigl(\frac{T}{K}\Bigr)\Bigr)\Bigr)\Big/n \leqslant \mathrm{kl}(\mu_a + \delta, \mu^\star - \delta),$$

and thus $g(n)/n \leqslant \mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)$, since $g(n) \leqslant \log\bigl(\frac{T}{K}(1 + \log^2(\frac{T}{K}))\bigr)$.

We cut the sum in (28) at $n(\delta)$, so that

$$B \leqslant n(\delta) - 1 + \sum_{n=n(\delta)}^{T}\mathbb{P}\bigl(\mathrm{kl}(\hat\mu_{a,n}, \mu^\star - \delta) \leqslant \mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)\bigr). \tag{29}$$

Recall that by assumption $\mu_a + \delta < \mu^\star - \delta$; using the inclusion

$$\bigl\{\mathrm{kl}(\hat\mu_{a,n}, \mu^\star - \delta) \leqslant \mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)\bigr\} \subseteq \bigl\{\hat\mu_{a,n} \geqslant \mu_a + \delta\bigr\},$$

together with Inequality (33), we obtain that

$$\sum_{n=n(\delta)}^{T}\mathbb{P}\bigl(\hat\mu_{a,n} \geqslant \mu_a + \delta\bigr) \leqslant \sum_{n=n(\delta)}^{T} e^{-n\delta^2/(2V)} \leqslant \frac{2V}{\delta^2},$$

and Equation (29) yields

$$B \leqslant \frac{\log(T)}{\mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)} + \frac{\log\Bigl(\frac{1}{K}\bigl(1 + \log^2\bigl(\frac{T}{K}\bigr)\bigr)\Bigr)}{\mathrm{kl}(\mu_a + \delta, \mu^\star - \delta)} + \frac{2V}{\delta^2}.$$