The purpose of this article is to analyse an anytime version of the Optimally Confident UCB algorithm for finite-armed subgaussian bandits (Lattimore, 2015). For the sake of brevity I will give neither a detailed introduction nor an exhaustive survey of the literature. Readers looking for a gentle primer on multi-armed bandits might enjoy the monograph by Bubeck and Cesa-Bianchi (2012), from which I borrow notation. Let $K$ be the number of arms and $I_t \in \{1,\ldots,K\}$ be the arm chosen in round $t$. The reward is $X_t = \mu_{I_t} + \eta_t$ where
$\mu \in \mathbb{R}^K$ is the unknown vector of means and the noise term $\eta_t$ is assumed to be $1$-subgaussian (therefore zero-mean). The $n$-step pseudo-regret of strategy $\pi$ given mean vector $\mu$ with maximum mean $\mu^* = \max_i \mu_i$ is
where the expectation is taken with respect to uncertainty in both the rewards and actions. In all analysis I make the standard notational assumption that $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_K$. The new algorithm is called OCUCB-n and depends on two parameters $\eta > 1$ and $\rho \in (1/2, 1]$. The algorithm chooses $I_t = t$ in rounds $t \leq K$ and subsequently $I_t = \operatorname{arg\,max}_i \gamma_i(t)$ with
where $T_i(t) = \sum_{s=1}^t \mathbb{1}\{I_s = i\}$ is the number of times arm $i$ has been chosen after round $t$ and
$\hat\mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^t \mathbb{1}\{I_s = i\} X_s$ is its empirical estimate and
Besides the algorithm, the contribution of this article is a proof that OCUCB-n satisfies a nearly optimal regret bound.
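To make the algorithmic skeleton concrete, here is a minimal Python sketch of an anytime OCUCB-style strategy for Gaussian rewards. Since the displayed equations are unavailable here, the exact form of the confidence level (the quantity $H_i$ and how $\rho$ enters it) is a reconstruction and should be treated as an assumption rather than a verbatim transcription:

```python
import math
import random

def ocucb_n_step(t, counts, sums, eta=2.0, rho=1.0):
    """Choose an arm for round t in an anytime OCUCB-style strategy (sketch).

    counts[i] = T_i(t-1), sums[i] = total reward collected from arm i.
    The index is an empirical mean plus a confidence width whose level
    depends on the pull counts of *all* arms rather than on a known
    horizon. NOTE: the form of H_i below is an assumption.
    """
    K = len(counts)
    best, best_index = 0, -float("inf")
    for i in range(K):
        Ti = counts[i]
        # Assumed form: H_i = sum_j min{T_i, T_i^(1-rho) * T_j^rho}
        Hi = sum(min(Ti, Ti ** (1 - rho) * counts[j] ** rho) for j in range(K))
        level = max(math.e, math.log(t), t / Hi)
        gamma = sums[i] / Ti + math.sqrt(2 * eta / Ti * math.log(level))
        if gamma > best_index:
            best, best_index = i, gamma
    return best

def run(mu, n, eta=2.0, rho=1.0, rng=None):
    """Play n rounds on a Gaussian bandit with mean vector mu."""
    rng = rng or random.Random(0)
    K = len(mu)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, n + 1):
        # First K rounds: play each arm once; afterwards use the index.
        i = t - 1 if t <= K else ocucb_n_step(t, counts, sums, eta, rho)
        counts[i] += 1
        sums[i] += mu[i] + rng.gauss(0.0, 1.0)  # 1-subgaussian noise
    return counts
```

On a simple three-armed instance with gaps of $0.5$, the strategy concentrates its pulls on the optimal arm while suboptimal arms receive only logarithmically many samples.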
If $\eta > 1$ and $\rho \in (1/2, 1]$, then
where $\log_+(x) = \max\{1, \log(x)\}$ and $\Delta_i = \mu^* - \mu_i$ and $C_\eta$ is a constant that depends only on $\eta$. Furthermore, for all $\mu$ it holds that $\limsup_{n\to\infty} R_\mu(n)/\log(n) \leq \sum_{i : \Delta_i > 0} 2\eta/\Delta_i$.
Asymptotically the upper bound matches the lower bound given by Lai and Robbins (1985) except for a factor of $\eta$. In the non-asymptotic regime the additional terms inside the logarithm significantly improve on UCB. The bound in Theorem 1 corresponds to a worst-case regret that is suboptimal by only a modest sub-logarithmic factor. Algorithms achieving the minimax rate are MOSS (Audibert and Bubeck, 2009) and OCUCB, but both require advance knowledge of the horizon. The quantity $k_i$ may be interpreted as the number of “effective” arms, with larger values leading to improved regret. A simple observation is that $k_i$ is always non-increasing in $\rho$, which makes $\rho$ close to $1/2$ the canonical choice. In the special case that all suboptimal arms have the same expected payoff, $k_i = K$ for all $i$. Interestingly, I could not find a regime for which the algorithm is empirically sensitive to $\rho$. If $\rho = 1$, then except for additive terms the problem dependent regret enjoyed by OCUCB-n is equivalent to that of OCUCB. Finally, if $\rho = 1/2$, then the asymptotic result above applies, but the algorithm in that case essentially reduces to MOSS, which is known to suffer suboptimal finite-time regret in certain regimes (Lattimore, 2015).
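To make the notion of “effective” arms concrete, the snippet below computes one candidate form, $k_i = \sum_j \min\{1, (\Delta_i/\Delta_j)^{2\rho}\}$, and checks the two properties just mentioned. The exact exponent is an assumption here, not a quote from the paper's display:

```python
def effective_arms(deltas, i, rho=1.0):
    """Candidate 'effective number of arms' for arm i (assumed form).

    deltas[j] is the gap of arm j (0 for an optimal arm). Arms at least
    as hard to distinguish as arm i each contribute 1, while arms with
    larger gaps (easier to discard) contribute less. NOTE: the exponent
    2*rho is a reconstruction, not taken verbatim from the paper.
    """
    di = deltas[i]
    total = 0.0
    for dj in deltas:
        # dj <= di covers optimal arms (dj = 0), avoiding division by zero.
        total += 1.0 if dj <= di else min(1.0, (di / dj) ** (2 * rho))
    return total
```

As claimed above, when all suboptimal arms share the same gap the quantity equals $K$, and it is non-increasing in $\rho$.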
Intuition for regret bound. Let us fix a strategy $\pi$, a mean vector $\mu$ and a suboptimal arm $i$. Suppose that $T_i(n) \approx 2\eta\Delta_i^{-2}\log(1/\delta)$ for some $\delta \in (0,1)$. Now consider the alternative mean vector $\mu'$ with $\mu'_j = \mu_j$ for $j \neq i$ and $\mu'_i = \mu_i + 2\Delta_i$, which means that $i$ is the optimal action for mean vector $\mu'$. Standard information-theoretic analysis shows that $\mu$ and $\mu'$ are not statistically separable at confidence level $\delta$ from this many samples and in particular, if $n$ is large enough, then $R_{\mu'}(n) = \Omega(\delta n \Delta_i)$. For mean $\mu$ we have $R_\mu(n) \geq T_i(n)\Delta_i$ and for any reasonable algorithm we would like
But this implies that $\delta$ should be chosen such that
which up to $\log\log$ terms justifies the near-optimality of the regret guarantee given in Theorem 1 for $\rho$ close to $1/2$. Of course $\Delta_i$ is not known in advance, so no algorithm can choose this confidence level directly. The trick is to notice that arms $j$ with $\Delta_j \leq \Delta_i$ should be played about as often as arm $i$, while arms with $\Delta_j > \Delta_i$ should be played about as much as arm $i$ until $T_j(t) \approx \Delta_j^{-2}$. This means that as $T_i(t)$ approaches the critical number of samples we can approximate
Then the index used by OCUCB-n is justified by ignoring $\log\log$ terms and replacing the usual $\log(t)$ confidence level used by UCB and other algorithms. Theorem 1 is proven by making the above approximation rigorous. The argument for this choice of confidence level is made concrete in Appendix A, where I present a lower bound that matches the upper bound except for additive terms.
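The balancing that determines the confidence level can be made slightly more explicit. The following is a heuristic reconstruction of the calculation, not the paper's own display:

```latex
% Cost under \mu of sampling arm i at confidence level \delta:
R_\mu(n) \gtrsim T_i(n)\,\Delta_i \approx \frac{2\eta}{\Delta_i}\log\frac{1}{\delta},
% cost under \mu' of failing to detect that arm i is optimal:
R_{\mu'}(n) \gtrsim \delta\, n\, \Delta_i .
% Equating the two costs gives
\frac{2\eta}{\Delta_i}\log\frac{1}{\delta} \approx \delta\, n\, \Delta_i
\quad\Longrightarrow\quad
\delta \approx \frac{1}{n\Delta_i^2}
\quad\text{(up to } \log\log \text{ terms)},
% so the width should be built from a confidence level of order 1/(n \Delta_i^2).
```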
2 Concentration
The regret guarantees rely on a number of concentration inequalities. For this section only, let $Z_1, Z_2, \ldots$ be i.i.d. $1$-subgaussian random variables and $S_t = \sum_{s=1}^t Z_s$ and $\hat\mu_t = S_t/t$. The first lemma below is well known and follows trivially from the maximal inequality and the fact that the increments are $1$-subgaussian.
Important remark. For brevity I use $C_\eta$ to indicate a constant that depends on $\eta$ but not on other variables such as $n$ and $K$. The dependence on $\eta$ is never worse than polynomial in $1/(\eta - 1)$.
If $\Delta > 0$, then $\mathbb{P}(\exists\, t \leq n : S_t \geq \Delta) \leq \exp(-\Delta^2/(2n))$.
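Since Gaussian variables are $1$-subgaussian, this maximal inequality is easy to sanity-check by Monte Carlo simulation. The bound $\exp(-\Delta^2/(2n))$ below matches the statement as reconstructed above:

```python
import math
import random

def exceed_prob(n, delta, sims=4000, rng=None):
    """Monte Carlo estimate of P(exists t <= n : S_t >= delta)
    for a random walk S_t with standard Gaussian increments."""
    rng = rng or random.Random(1)
    hits = 0
    for _ in range(sims):
        s = 0.0
        for _ in range(n):
            s += rng.gauss(0.0, 1.0)
            if s >= delta:  # the walk crossed the threshold at some t <= n
                hits += 1
                break
    return hits / sims

n, delta = 50, 10.0
bound = math.exp(-delta ** 2 / (2 * n))  # exp(-1), about 0.368
est = exceed_prob(n, delta)
```

The empirical crossing frequency sits comfortably below the bound, which is loose here roughly by the Gaussian tail factor.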
The following lemma analyses the likelihood that $S_t$ ever exceeds $\sqrt{2\eta t \log\log t}$, where $\eta > 1$. By the law of the iterated logarithm $\limsup_{t\to\infty} S_t/\sqrt{2t\log\log t} = 1$ a.s., and for $\eta$ close to $1$ it has been shown by Garivier (2013) that
The case of larger $\eta$ seems not to have been analysed. The proof relies on the usual peeling trick, but without the union bound.
There exists a monotone non-decreasing function such that for all it holds that .
Let and and , then
The final concentration lemma is quite powerful and forms the linchpin of the analysis that follows.
Let and and and be constants.
Furthermore, let be the random variable given by
be the random variable given by
Finally let . Then
If , then
If , then
3 Analysis of the KL-UCB+ Algorithm
Let us warm up by analysing a simpler algorithm, which chooses $I_t = \operatorname{arg\,max}_i \gamma_i(t)$ with index
$$\gamma_i(t) = \hat\mu_i(t-1) + \sqrt{\frac{2\eta}{T_i(t-1)} \log\frac{t}{T_i(t-1)}}. \tag{2}$$
Strategies similar to this have been called KL-UCB+ and were suggested as a heuristic by Garivier and Cappé (2011) (this version is specialised to the subgaussian noise model). Recently, Kaufmann (2016) established the asymptotic optimality of strategies of approximately this form, but a finite-time analysis has not been available until now. Bounding the regret will follow the standard path of bounding $\mathbb{E}[T_i(n)]$ for each suboptimal arm $i$. Let $\hat\mu_{i,s}$ be the empirical estimate of the mean of the $i$th arm having observed $s$ samples. Define $u_i$ and $v_i$ by
If and , then by the definition of we have and by the definition of
which means that $I_t \neq i$. Therefore $\mathbb{E}[T_i(n)]$ may be bounded in terms of these two quantities as follows:
Therefore the strategy in Eq. 2 satisfies:
Without changing the algorithm, and by optimising the constants in the proof, it is possible to show that $\limsup_{n\to\infty} R_\mu(n)/\log(n) \leq \sum_{i:\Delta_i>0} 2\eta/\Delta_i$, which is just a factor of $\eta$ away from the asymptotic lower bound of Lai and Robbins (1985).
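As a quick sanity check on the shape of the index in Eq. 2 (whose precise form is a reconstruction here), the snippet below compares its confidence width with the classical UCB width, which uses $\log(t)$ in place of $\log(t/T_i(t-1))$:

```python
import math

def ucb_width(t, ti, eta=2.0):
    """Classical UCB-style confidence width, built from log(t)."""
    return math.sqrt(2 * eta / ti * math.log(t))

def kl_ucb_plus_width(t, ti, eta=2.0):
    """Width with the smaller log(t / T_i) level of Eq. 2 (as reconstructed)."""
    return math.sqrt(2 * eta / ti * math.log(t / ti))

# The '+' variant is never wider, and the saving grows with the pull count:
t = 1000
widths = [(ti, ucb_width(t, ti), kl_ucb_plus_width(t, ti)) for ti in (1, 10, 100)]
```

The shrinking confidence level is what removes the extra $\log\log$-style slack from the per-arm regret, at the price of a more delicate analysis.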
4 Proof of Theorem 1
The proof follows along similar lines as the warm-up, but each step becomes more challenging, especially the concentration arguments.
Step 1: Setup and preliminary lemmas
Define to be the random set of arms for which the empirical estimate never drops below the critical boundary given by the law of iterated logarithm.
where . By Lemma 3, . It will be important that only includes arms and that the events are independent for . From the definition of the index and for it holds that for all . The following lemma shows that the pull counts for optimistic arms “chase” those of other arms up to the point that they become clearly suboptimal.
There exists a constant depending only on such that if (a) and (b) and (c) , then .
First note that implies that . Comparing the indices:
On the other hand, by choosing small enough and by the definition of :
which implies that . ∎
Let be the optimistic arm with the largest return where if we define and . By Lemma 3,
with constant probability, which means that
is sub-exponentially distributed with a rate that depends on $\eta$ only. Define by
where is as chosen in Lemma 7. Since we will have with high probability (this will be made formal later). Let
The last lemma in this section shows that if , then either arm $i$ is not chosen or the index of the $i$th arm is not too large.
If , then or .
By the definition of we have and . By Lemma 7, if and , then . Now suppose that for all . Then
Therefore from the definition of we have that . ∎
Step 2: Regret decomposition
By Lemma 9, if , then or . Now we must show there exists a for which . This is true for arms with since by definition for all . For the remaining arms we follow the idea used in Section 3 and define a random time for each .
Then the regret is decomposed as follows
The next step is to show that the first sum is dominant in the above decomposition; the result then follows, with Lemma 8 used to bound the second sum.
Step 3: Bounding
This step is broken into two quite technical parts as summarised in the following lemma. The proofs of both results are quite similar, but the second is more intricate and is given in Appendix G.
The following hold:
Proof of Lemma 10a.
Preparing to use Lemma 5, let be given by for with and otherwise. Now define random variable by
and . Then for and abbreviating we have
where the second last inequality follows since for arms with we have and for other arms by definition. The last inequality follows from the definition of . Therefore and so , which by Lemma 5b is bounded by
where the last line follows since and
Step 4: Putting it together
which completes the proof of the finite-time bound.
Then by Lemma 5a with and we have . Then we modify the definition of by
which is chosen such that if , then . Therefore
Classical analysis shows that and , which implies the asymptotic claim in Theorem 1.
This naive calculation demonstrates a serious weakness of asymptotic results. The additive terms in the regret will typically dominate the logarithmic term except when $n$ is outrageously large. A more careful argument (similar to the derivation of the finite-time bound) would lead to the same asymptotic bound via a nicer finite-time bound, but the details are omitted for readability. Interestingly, the result does not depend on $\rho$ and so applies also to the MOSS-type algorithm that is recovered by choosing $\rho = 1/2$.
5 Conclusions
The UCB family has a new member. This one is tuned for subgaussian noise and roughly mimics the OCUCB algorithm, but without needing advance knowledge of the horizon. The introduction of $k_i$ is a minor refinement on previous measures of difficulty, with the main advantage being that it is very intuitive. The resulting algorithm is efficient and close to optimal theoretically. Of course there are open questions, some of which are detailed below.
Shrinking the confidence level. Empirically the algorithm improves significantly when the logarithmic terms in the definition of the index are dropped. There are several arguments that theoretically justify this decision. First of all, if , then it is possible to replace the term in the definition of the index with just and use part (a) of Lemma 5 instead of part (b). The price is that the regret guarantee explodes as $\eta$ tends to $1$ (an effect not observed in practice). The second improvement is to replace the term in the definition of the index with
which boosts empirical performance and rough sketches suggest minimax optimality is achieved. I leave details for a longer article.
Improving analysis and constants. Despite its simplicity relative to OCUCB, the current analysis is still significantly more involved than for other variants of UCB. A cleaner proof would obviously be desirable. In an ideal world we could choose or (slightly worse) allow it to converge to as grows, which is the technique used in the KL-UCB algorithm (Cappé et al., 2013, and others). I anticipate this would lead to an asymptotically optimal algorithm.
Informational confidence bounds. Speaking of KL-UCB, if the noise model is known more precisely (for example, if it is bounded), then it is beneficial to use confidence bounds based on the KL divergence. Such bounds are available and could be substituted directly to improve performance without loss (Garivier, 2013, and others). Repeating the above analysis, but exploiting the benefits of tighter confidence intervals, would be an interesting (non-trivial) problem due to the need to exploit the non-symmetric KL divergences. It is worth remarking that confidence bounds based on the KL divergence are also not tight. For example, for Gaussian random variables they lead to the right exponential rate, but with the wrong leading factor; exploiting the noise model exactly can improve performance in practice, as evidenced by the confidence bounds used by (near-)Bayesian algorithms (e.g., Kaufmann et al. (2012); Lattimore (2016); Kaufmann (2016)). This is related to the “missing factor” in Hoeffding’s bound studied by Talagrand (1995).
Precise lower bounds. Perhaps the most important remaining problem for the subgaussian noise model is the question of lower bounds. Besides the asymptotic results by Lai and Robbins (1985) and Burnetas and Katehakis (1997) there has been some recent progress on finite-time lower bounds, both in the OCUCB paper and a recent article by Garivier et al. (2016). Some further progress is made in Appendix A, but still there are regimes where the bounds are not very precise.
References
- Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226, 2009.
- Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012. ISBN 9781601986269.
- Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
- Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
- Aurélien Garivier. Informational confidence bounds for self-normalized averages and applications. arXiv preprint arXiv:1309.3376, 2013.
- Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of Conference on Learning Theory (COLT), 2011.
- Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. arXiv preprint arXiv:1602.07182, 2016.
- Emilie Kaufmann. On Bayesian index policies for sequential resource allocation. arXiv preprint arXiv:1601.01190, 2016.
- Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592–600, 2012.
- Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical report, 2015. URL http://arxiv.org/abs/1507.07880.
- Tor Lattimore. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Proceedings of Conference on Learning Theory (COLT), 2016.
- Michel Talagrand. The missing factor in Hoeffding’s inequalities. In Annales de l’IHP Probabilités et statistiques, volume 31, pages 689–702, 1995.
- Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
Appendix A Lower Bounds
I now prove a kind of lower bound showing that the form of the regret in Theorem 1 is approximately correct for $\rho$ close to $1/2$. The result contains a lower-order term, which for large $n$ dominates the improvements, but it is meaningful in many regimes.
Assume a standard Gaussian noise model and let be any strategy and be such that for all . Then one of the following holds:
There exists an with such that
where and for and and are defined as and but using .
On our way to a contradiction, assume that neither of the items hold. Let $i$ be a suboptimal arm and $\mu'$ be as in the second item above. I write $\mathbb{P}'$ and $\mathbb{E}'$ for probability and expectation when the rewards are sampled from $\mu'$. Suppose
By Markov’s inequality
Therefore , which implies that
which is a contradiction. Therefore Eq. 9 does not hold for all with , but this also leads immediately to a contradiction, since then