Regret Analysis of the Anytime Optimally Confident UCB Algorithm

03/29/2016 · Tor Lattimore

I introduce and analyse an anytime version of the Optimally Confident UCB (OCUCB) algorithm designed for minimising the cumulative regret in finite-armed stochastic bandits with subgaussian noise. The new algorithm is simple, intuitive (in hindsight) and comes with the strongest finite-time regret guarantees for a horizon-free algorithm so far. I also show a finite-time lower bound that nearly matches the upper bound.




1 Introduction

The purpose of this article is to analyse an anytime version of the Optimally Confident UCB algorithm for finite-armed subgaussian bandits (Lattimore, 2015). For the sake of brevity I will give neither a detailed introduction nor an exhaustive survey of the literature. Readers looking for a gentle primer on multi-armed bandits might enjoy the monograph by Bubeck and Cesa-Bianchi (2012), from which I borrow notation. Let $K$ be the number of arms and $I_t$ be the arm chosen in round $t$. The reward is $X_t = \mu_{I_t} + \eta_t$, where $\mu \in \mathbb{R}^K$ is the unknown vector of means and the noise term $\eta_t$ is assumed to be $1$-subgaussian (therefore zero-mean). The $n$-step pseudo-regret of strategy $\pi$ given mean vector $\mu$ with maximum mean $\mu^* = \max_i \mu_i$ is

$$R^\pi_\mu(n) = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right],$$

where the expectation is taken with respect to uncertainty in both the rewards and actions. In all analysis I make the standard notational assumption that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_K$. The new algorithm is called OCUCB-$\eta$ and depends on two parameters $\eta > 1$ and $\rho \in [1/2, 1]$. The algorithm chooses $I_t = t$ in rounds $t \le K$ and subsequently $I_t = \operatorname{arg\,max}_i \gamma_i(t)$ with


where $T_i(t-1)$ is the number of times arm $i$ has been chosen after round $t-1$ and $\hat\mu_i(t-1)$ is its empirical estimate and
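As a hedged illustration only, the control flow of an index policy of this shape (choose each arm once, then maximise an anytime upper-confidence index) can be sketched as follows. The bonus $\sqrt{2\eta\log(t)/T_i}$ below is the generic anytime UCB width used as a stand-in for the OCUCB-$\eta$ confidence width, and the function name, parameter values and Gaussian reward model are illustrative assumptions, not the algorithm's exact index.

```python
import math
import random

def anytime_ucb_sketch(means, n, eta=2.0, seed=0):
    """Index-policy skeleton: choose I_t = t for t <= K, then maximise an
    upper-confidence index.  The bonus sqrt(2*eta*log(t)/T_i) is a generic
    stand-in for the OCUCB-eta confidence width."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # T_i(t-1): number of plays of arm i
    sums = [0.0] * K      # running reward sums for empirical means
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1     # initial round-robin over the arms
        else:
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * eta * math.log(t) / counts[j]))
        counts[i] += 1
        sums[i] += means[i] + rng.gauss(0.0, 1.0)  # Gaussian noise is 1-subgaussian
    return counts

counts = anytime_ucb_sketch([0.5, 0.0, 0.0], n=2000)
```

With a gap of $0.5$ and $n = 2000$, the optimal arm receives the overwhelming majority of plays while each suboptimal arm is played only $O(\log n)$ times.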

Besides the algorithm, the contribution of this article is a proof that OCUCB-$\eta$ satisfies a nearly optimal regret bound.

Theorem 1.

If $\eta > 1$ and $\rho \in [1/2, 1]$, then

where and and is a constant that depends only on . Furthermore, for all it holds that .

Asymptotically the upper bound matches the lower bound given by Lai and Robbins (1985) except for a factor of . In the non-asymptotic regime the additional terms inside the logarithm significantly improve on UCB. The bound in Theorem 1 corresponds to a worst-case regret that is suboptimal by a factor of just . Algorithms achieving the minimax rate are MOSS (Audibert and Bubeck, 2009) and OCUCB, but both require advance knowledge of the horizon. The quantity may be interpreted as the number of “effective” arms, with larger values leading to improved regret. A simple observation is that it is always non-increasing in , which makes the canonical choice. In the special case that all suboptimal arms have the same expected payoff, then for all . Interestingly, I could not find a regime in which the algorithm is empirically sensitive to . If , then except for additive terms the problem-dependent regret enjoyed by OCUCB-$\eta$ is equivalent to that of OCUCB. Finally, if , then the asymptotic result above applies, but the algorithm in that case essentially reduces to MOSS, which is known to suffer suboptimal finite-time regret in certain regimes (Lattimore, 2015).
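For reference, the asymptotic lower bound of Lai and Robbins (1985), specialised to unit-variance Gaussian noise where $\operatorname{KL}(\mathcal{N}(\mu_i,1), \mathcal{N}(\mu^*,1)) = \Delta_i^2/2$, reads:

```latex
\liminf_{n \to \infty} \frac{R^\pi_\mu(n)}{\log n}
  \;\ge\; \sum_{i : \Delta_i > 0}
    \frac{\Delta_i}{\operatorname{KL}\!\left(\mathcal{N}(\mu_i,1), \mathcal{N}(\mu^*,1)\right)}
  \;=\; \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i},
```

so matching this constant up to a factor of $\eta$ is the sense in which the upper bound is asymptotically near-optimal.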

Intuition for the regret bound. Let us fix a strategy, a mean vector, and a suboptimal arm. Suppose that for some . Now consider the alternative mean reward with for and , which means that is the optimal action for mean vector . Standard information-theoretic analysis shows that and are not statistically separable at confidence level and in particular, if is large enough, then . For mean we have and for any reasonable algorithm we would like

But this implies that should be chosen such that

which up to terms justifies the near-optimality of the regret guarantee given in Theorem 1 for close to . Of course is not known in advance, so no algorithm can choose this confidence level. The trick is to notice that arms with should be played about as often as arm and arms with should be played about as much as arm until . This means that as approaches the critical number of samples we can approximate

Then the index used by OCUCB- is justified by ignoring terms and the usual used by UCB and other algorithms. Theorem 1 is proven by making the above approximation rigorous. The argument for this choice of confidence level is made concrete in Appendix A where I present a lower bound that matches the upper bound except for additive terms.
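The separability claim above rests on the standard deviation bound for $1$-subgaussian noise; as a hedged back-of-the-envelope version of the argument:

```latex
\mathbb{P}\big(\hat\mu_n - \mu \ge \varepsilon\big) \le \exp\!\Big(-\frac{n\varepsilon^2}{2}\Big)
\qquad\Longrightarrow\qquad
n \;\approx\; \frac{2}{\varepsilon^2}\log\frac{1}{\delta}
```

samples are needed before two means $\varepsilon$ apart can be separated at confidence level $\delta$, which is the critical sample size appearing in the heuristic above (up to constants and $\log\log$ terms).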

2 Concentration

The regret guarantees rely on a number of concentration inequalities. For this section only let be i.i.d. 1-subgaussian and and . The first lemma below is well known and follows trivially from the maximal inequality and the fact that the rewards are 1-subgaussian.

Important remark. For brevity I use to indicate a constant that depends on but not other variables such as and . The dependence is never worse than polynomial in .

Lemma 2.

If , then .
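Lemma 2 is the standard maximal inequality for subgaussian sums: for partial sums $S_n$ of i.i.d. standard Gaussians, $\mathbb{P}(\exists\, n \le N : S_n \ge \varepsilon) \le \exp(-\varepsilon^2/(2N))$. A quick Monte Carlo sanity check (the parameter values here are illustrative, not from the text):

```python
import math
import random

def maximal_tail_freq(N=10, eps=8.0, trials=10_000, seed=0):
    """Empirical frequency of the event {exists n <= N : S_n >= eps},
    where S_n is a partial sum of i.i.d. N(0,1) variables."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = 0.0
        for _ in range(N):
            s += rng.gauss(0.0, 1.0)
            if s >= eps:   # the running maximum crossed the level
                hits += 1
                break
    return hits / trials

freq = maximal_tail_freq()
bound = math.exp(-8.0**2 / (2 * 10))  # exp(-eps^2 / (2N)), about 0.04
```

The empirical frequency should sit comfortably below the theoretical bound, since the maximal inequality is loose by roughly the usual Gaussian tail factor.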

The following lemma analyses the likelihood that ever exceeds where . By the law of the iterated logarithm a.s. and for small it has been shown by Garivier (2013) that

The case where seems not to have been analysed; the proof relies on the usual peeling trick, but without the union bound.

Lemma 3.

There exists a monotone non-decreasing function such that for all it holds that .

Lemma 4.

Let and and , then

The final concentration lemma is quite powerful and forms the linchpin of the following analysis.

Lemma 5.

Let and and and be constants. Furthermore, let

be the random variable given by

Finally let . Then

  1. If , then

  2. If , then

The proofs of Lemmas 5, 4 and 3 may be found in Appendices D, C and B.

3 Analysis of the KL-UCB+ Algorithm

Let us warm up by analysing a simpler algorithm, which chooses the arm that maximises the following index.


Strategies similar to this have been called KL-UCB+ and were suggested as a heuristic by Garivier and Cappé (2011) (this version is specialised to the subgaussian noise model). Recently Kaufmann (2016) established the asymptotic optimality of strategies of approximately this form, but a finite-time analysis has not been available until now. Bounding the regret will follow the standard path of bounding $\mathbb{E}[T_i(n)]$ for each suboptimal arm $i$. Let be the empirical estimate of the mean of the $i$th arm having observed samples. Define and by

If and , then by the definition of we have and by the definition of

which means that . Therefore may be bounded in terms of and as follows:

It remains to bound the expectations of and . By Lemma 5a with and and it follows that and by Lemma 4

Therefore the strategy in Eq. 2 satisfies:

Remark 6.

Without changing the algorithm and by optimising the constants in the proof it is possible to show that , which is just a factor of away from the asymptotic lower bound of Lai and Robbins (1985).
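A minimal sketch of a KL-UCB+-style index for the subgaussian model. As an assumption (the exact form of Eq. 2 is not reproduced above), the index is taken here to be the standard $\hat\mu_i(t-1) + \sqrt{(2\eta/T_i(t-1))\log(t/T_i(t-1))}$, where the shrinking $\log(t/T_i)$ confidence level, rather than $\log t$, is what distinguishes the “+” variant from plain UCB:

```python
import math

def klucb_plus_index(mu_hat, t_i, t, eta=2.0):
    """Hedged sketch of a KL-UCB+-style index for the subgaussian model:
    empirical mean plus sqrt((2*eta/T_i) * log(t/T_i)).  The max(., e)
    clamp is an illustrative guard keeping the bonus positive when an
    arm has been played a constant fraction of the time."""
    return mu_hat + math.sqrt(2 * eta * math.log(max(t / t_i, math.e)) / t_i)

# The bonus shrinks as an arm is played more, both through the 1/T_i
# factor and through the shrinking log(t/T_i) confidence level.
wide = klucb_plus_index(0.0, t_i=10, t=1000)
narrow = klucb_plus_index(0.0, t_i=500, t=1000)
```

A rarely played arm retains a much wider confidence bonus than a heavily played one at the same round.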

4 Proof of Theorem 1

The proof follows along similar lines as the warm-up, but each step becomes more challenging, especially controlling .

Step 1: Setup and preliminary lemmas

Define to be the random set of arms for which the empirical estimate never drops below the critical boundary given by the law of iterated logarithm.


where . By Lemma 3, . It will be important that only includes arms and that the events are independent for . From the definition of the index and for it holds that for all . The following lemma shows that the pull counts for optimistic arms “chase” those of other arms up to the point where they become clearly suboptimal.

Lemma 7.

There exists a constant depending only on such that if (a) and (b) and (c) , then .


First note that implies that . Comparing the indices:

On the other hand, by choosing small enough and by the definition of :

which implies that . ∎

Let be the optimistic arm with the largest return where if we define and . By Lemma 3,

with constant probability, which means that

is sub-exponentially distributed with rate dependent on

only. Define by


where is as chosen in Lemma 7. Since we will have with high probability (this will be made formal later). Let


The following lemma essentially follows from Lemma 4 and the fact that is sub-exponentially distributed. Care must be taken because and are not independent. The proof is found in Appendix E.

Lemma 8.


The last lemma in this section shows that if , then either is not chosen or the index of the th arm is not too large.

Lemma 9.

If , then or .


By the definition of we have and . By Lemma 7, if and , then . Now suppose that for all . Then

Therefore from the definition of we have that . ∎

Step 2: Regret decomposition

By Lemma 9, if , then or . Now we must show there exists a for which . This is true for arms with since by definition for all . For the remaining arms we follow the idea used in Section 3 and define a random time for each .


Then the regret is decomposed as follows


The next step is to show that the first sum is dominant in the above decomposition; the result then follows by using Lemma 8 to bound .

Step 3: Bounding

This step is broken into two quite technical parts as summarised in the following lemma. The proofs of both results are quite similar, but the second is more intricate and is given in Appendix G.

Lemma 10.

The following hold:

Proof of Lemma 10a.

Preparing to use Lemma 5, let be given by for with and otherwise. Now define random variable by

and . Then for and abbreviating we have

where the second last inequality follows since for arms with we have and for other arms by definition. The last inequality follows from the definition of . Therefore and so , which by Lemma 5b is bounded by


where the last line follows since and

The proof is completed by substituting into Eq. 8 and applying Lemma 8 to show that . ∎

Step 4: Putting it together

By substituting the bounds given in Lemma 10 into Eq. 7 and applying Lemma 8 we obtain

which completes the proof of the finite-time bound.

Asymptotic analysis. Lemma 5 makes this straightforward. Let and

Then by Lemma 5a with and we have . Then we modify the definition of by

which is chosen such that if , then . Therefore

Classical analysis shows that and , which implies the asymptotic claim in Theorem 1.
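The classical analysis referred to here is, as a hedged reconstruction, the standard UCB argument: an arm carrying a confidence bonus of width $\sqrt{2\eta\log(n)/T_i}$ effectively stops being played once the bonus drops below $\Delta_i$, i.e. once $T_i \approx (2\eta/\Delta_i^2)\log n$, so

```latex
\mathbb{E}[T_i(n)] \;\lesssim\; \frac{2\eta}{\Delta_i^2}\log n
\qquad\Longrightarrow\qquad
\limsup_{n\to\infty}\frac{R^\pi_\mu(n)}{\log n}
  \;\le\; \sum_{i:\Delta_i>0} \Delta_i\cdot\frac{2\eta}{\Delta_i^2}
  \;=\; \sum_{i:\Delta_i>0} \frac{2\eta}{\Delta_i},
```

exceeding the Lai and Robbins constant $\sum_{i:\Delta_i>0} 2/\Delta_i$ by exactly the factor of $\eta$ in Theorem 1.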

This naive calculation demonstrates a serious weakness of asymptotic results. The term in the regret will typically dominate the higher-order terms except when is outrageously large. A more careful argument (similar to the derivation of the finite-time bound) would lead to the same asymptotic bound via a nicer finite-time bound, but the details are omitted for readability. Interestingly the result is not dependent on and so applies also to the MOSS-type algorithm that is recovered by choosing .

5 Discussion

The UCB family has a new member. This one is tuned for subgaussian noise and roughly mimics the OCUCB algorithm, but without needing advance knowledge of the horizon. The introduction of is a minor refinement on previous measures of difficulty, with the main advantage being that it is very intuitive. The resulting algorithm is efficient and close to optimal theoretically. Of course there are open questions, some of which are detailed below.

Shrinking the confidence level. Empirically the algorithm improves significantly when the logarithmic terms in the definition of are dropped. There are several arguments that theoretically justify this decision. First of all, if , then it is possible to replace the term in the definition of with just and use part (a) of Lemma 5 instead of part (b). The price is that the regret guarantee explodes as tends to (an explosion that is likewise not observed in practice). The second improvement is to replace in the definition of with

which boosts empirical performance and rough sketches suggest minimax optimality is achieved. I leave details for a longer article.

Improving analysis and constants. Despite its simplicity relative to OCUCB, the current analysis is still significantly more involved than for other variants of UCB. A cleaner proof would obviously be desirable. In an ideal world we could choose or (slightly worse) allow it to converge to as grows, which is the technique used in the KL-UCB algorithm (Cappé et al., 2013, and others). I anticipate this would lead to an asymptotically optimal algorithm.

Informational confidence bounds. Speaking of KL-UCB, if the noise model is known more precisely (for example, if the rewards are bounded), then it is beneficial to use confidence bounds based on the KL divergence. Such bounds are available and could be substituted directly to improve performance without loss (Garivier, 2013, and others). Repeating the above analysis while exploiting the benefits of tighter confidence intervals would be an interesting (non-trivial) problem due to the need to handle the non-symmetric KL divergences. It is worth remarking that confidence bounds based on the KL divergence are also not tight. For example, for Gaussian random variables they lead to the right exponential rate, but with the wrong leading factor; performance can be improved in practice, as evidenced by the confidence bounds used by (near-)Bayesian algorithms that exactly exploit the noise model (e.g., Kaufmann et al. (2012); Lattimore (2016); Kaufmann (2016)). This is related to the “missing factor” in Hoeffding’s bound studied by Talagrand (1995).
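As an illustration of the kind of KL-based confidence bound discussed here, the following computes the standard Bernoulli KL-UCB upper confidence bound by bisection. This is a generic textbook construction, not one taken from this paper, and the exploration level $\log t$ is an illustrative choice:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_upper(mu_hat, count, t):
    """Largest q >= mu_hat with count * KL(mu_hat, q) <= log(t),
    found by bisection on the (increasing) map q -> KL(mu_hat, q)."""
    level = math.log(t)
    lo, hi = mu_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if count * kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

u = kl_ucb_upper(0.5, count=100, t=1000)
```

Because the KL divergence is non-symmetric, the resulting interval is tighter on the side where the Bernoulli variance shrinks, which is precisely the asymmetry the remark above says a full analysis would need to exploit.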

Precise lower bounds. Perhaps the most important remaining problem for the subgaussian noise model is the question of lower bounds. Besides the asymptotic results by Lai and Robbins (1985) and Burnetas and Katehakis (1997) there has been some recent progress on finite-time lower bounds, both in the OCUCB paper and a recent article by Garivier et al. (2016). Some further progress is made in Appendix A, but still there are regimes where the bounds are not very precise.


Appendix A Lower Bounds

I now prove a kind of lower bound showing that the form of the regret in Theorem 1 is approximately correct for close to . The result contains a lower-order term, which for large dominates the improvements, but it is meaningful in many regimes.

Theorem 11.

Assume a standard Gaussian noise model and let be any strategy and be such that for all . Then one of the following holds:

  1. .

  2. There exists an with such that

    where and for and and are defined as and but using .


On our way to a contradiction, assume that neither of the items holds. Let be a suboptimal arm and be as in the second item above. I write and for expectation when rewards are sampled from . Suppose


Then Lemma 2.6 in the book by Tsybakov (2008) and the same argument as used by Lattimore (2015) gives

By Markov’s inequality

Therefore , which implies that

which is a contradiction. Therefore Eq. 9 does not hold for all with , but this also leads immediately to a contradiction, since then