Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

04/25/2012 ∙ by Sébastien Bubeck, et al. ∙ Princeton University ∙ Università degli Studi di Milano

Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the Thirties, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this survey, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.




1 Optimism in face of uncertainty

The difficulty of the stochastic multi-armed bandit problem lies in the exploration-exploitation dilemma that the forecaster is facing. Indeed, there is an intrinsic tradeoff between exploiting the current knowledge to focus on the arm that seems to yield the highest rewards, and exploring further the other arms to identify with better precision which arm is actually the best. As we shall see, the key to obtaining a good strategy for this problem is, in a certain sense, to perform exploration and exploitation simultaneously.

A simple heuristic principle for doing that is the so-called optimism in face of uncertainty. The idea is very general, and applies to many sequential decision making problems in uncertain environments. Assume that the forecaster has accumulated some data on the environment and must decide how to act next. First, a set of “plausible” environments which are “consistent” with the data (typically, through concentration inequalities) is constructed. Then, the most “favorable” environment is identified in this set. Based on that, the heuristic prescribes that the decision which is optimal in this most favorable and plausible environment should be made. As we see below, this principle gives simple and yet almost optimal algorithms for the stochastic multi-armed bandit problem. More complex algorithms for various extensions of the stochastic multi-armed bandit problem are also based on the same idea which, along with the exponential weighting scheme presented in Section 5, is an algorithmic cornerstone of regret analysis in bandits.

2 Upper Confidence Bound (UCB) strategies

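In this section we assume that the distribution of rewards $X$ satisfies the following moment conditions. There exists a convex function $\psi$ on the reals such that, for all $\lambda \ge 0$,

$$\ln \mathbb{E}\, e^{\lambda (X - \mathbb{E}[X])} \le \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E}\, e^{\lambda (\mathbb{E}[X] - X)} \le \psi(\lambda). \tag{6}$$

(Footnote 1: One can easily generalize the discussion to functions $\psi$ defined only on an interval $[0, b)$.)

For example, when $X \in [0,1]$ one can take $\psi(\lambda) = \lambda^2/8$. In this case (6) is known as Hoeffding’s lemma.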

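We attack the stochastic multi-armed bandit using the optimism in face of uncertainty principle. In order to do so, we use assumption (6) to construct an upper bound estimate on the mean of each arm at some fixed confidence level, and then choose the arm that looks best under this estimate. We need a standard notion from convex analysis: the Legendre-Fenchel transform of $\psi$, defined by

$$\psi^*(\varepsilon) = \sup_{\lambda \in \mathbb{R}} \big( \lambda \varepsilon - \psi(\lambda) \big).$$

For instance, if $\psi(x) = e^x$ then $\psi^*(x) = x \ln x - x$ for $x > 0$. If $\psi(x) = \frac{1}{p} |x|^p$ then $\psi^*(x) = \frac{1}{q} |x|^q$ for any pair $1 < p, q < \infty$ such that $\frac{1}{p} + \frac{1}{q} = 1$ (see also Section 15, where the same notion is used in a different bandit model).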
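As a worked instance of the definition, the following short derivation computes the Legendre-Fenchel transform for the Hoeffding choice $\psi(\lambda) = \lambda^2/8$ (the case of $[0,1]$-valued rewards), together with the inverse that appears in confidence bounds. It is only a sketch of the calculus involved.

```latex
\begin{align*}
\psi^*(\varepsilon)
  &= \sup_{\lambda \in \mathbb{R}} \Big( \lambda \varepsilon - \frac{\lambda^2}{8} \Big)
  && \text{concave in } \lambda,\ \text{maximized where } \varepsilon - \lambda/4 = 0 \\
  &= 4\varepsilon \cdot \varepsilon - \frac{(4\varepsilon)^2}{8}
  && \text{at } \lambda = 4\varepsilon \\
  &= 2\varepsilon^2 , \\
(\psi^*)^{-1}(x) &= \sqrt{x/2} \qquad \text{for } x \ge 0 .
\end{align*}
```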

Let $\hat{\mu}_{i,s}$ be the sample mean of rewards obtained by pulling arm $i$ for $s$ times. Note that since the rewards are i.i.d., we have that in distribution $\hat{\mu}_{i,s}$ is equal to $\frac{1}{s} \sum_{t=1}^{s} X_{i,t}$.

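Using Markov’s inequality, from (6) one obtains that

$$\mathbb{P}\big( \mu_i - \hat{\mu}_{i,s} > \varepsilon \big) \le e^{-s\, \psi^*(\varepsilon)}, \tag{7}$$

where $\mu_i$ is the mean reward of arm $i$. In other words, with probability at least $1 - \delta$,

$$\hat{\mu}_{i,s} + (\psi^*)^{-1}\!\left( \frac{1}{s} \ln \frac{1}{\delta} \right) > \mu_i .$$

We thus consider the following strategy, called $(\alpha, \psi)$-UCB, where $\alpha > 0$ is an input parameter: At time $t$, select

$$I_t \in \arg\max_{i = 1, \dots, K} \left[ \hat{\mu}_{i, T_i(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \right],$$

where $T_i(t-1)$ is the number of times arm $i$ was pulled during the first $t-1$ rounds.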
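The index rule can be sketched in a few lines of Python for the bounded case, where $\psi(\lambda) = \lambda^2/8$ and the confidence width $(\psi^*)^{-1}(\alpha \ln t / s)$ becomes $\sqrt{\alpha \ln t / (2s)}$. This is a minimal illustration, not a reference implementation: the function name `alpha_ucb`, the `pull` callback, the initial round-robin over the arms, and the two Bernoulli arms in the usage lines are all assumptions made for the example.

```python
import math
import random

def alpha_ucb(pull, n_arms, horizon, alpha=2.5):
    """Run alpha-UCB for `horizon` rounds; rewards are assumed to lie in [0, 1]."""
    counts = [0] * n_arms      # T_i(t-1): number of pulls of arm i so far
    sums = [0.0] * n_arms      # cumulative reward collected from arm i
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # assumption: pull each arm once so every index is defined
        else:
            # index = sample mean + sqrt(alpha * ln t / (2 * T_i)),
            # i.e. (psi*)^{-1}(alpha ln t / T_i) for psi(lambda) = lambda^2 / 8
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(alpha * math.log(t) / (2 * counts[i])),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical two-armed Bernoulli instance, used only to exercise the sketch.
rng = random.Random(0)
means = [0.9, 0.1]
counts = alpha_ucb(lambda i: float(rng.random() < means[i]), 2, 2000)
```

The returned list holds the number of pulls per arm; on an instance with a large gap, almost all pulls should go to the better arm.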

We can prove the following simple bound.

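Theorem .1 (Pseudo-regret of $(\alpha, \psi)$-UCB)

Assume that the reward distributions satisfy (6). Then $(\alpha, \psi)$-UCB with $\alpha > 2$ satisfies

$$\bar{R}_n \le \sum_{i : \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln n + \frac{\alpha}{\alpha - 2} \right).$$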

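In the case of $[0,1]$-valued random variables, taking $\psi(\lambda) = \lambda^2/8$ in (6), that is, Hoeffding’s lemma, gives $\psi^*(\varepsilon) = 2 \varepsilon^2$, which in turn gives the following pseudo-regret bound

$$\bar{R}_n \le \sum_{i : \Delta_i > 0} \left( \frac{2 \alpha}{\Delta_i} \ln n + \frac{\alpha}{\alpha - 2} \right). \tag{8}$$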


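In this important special case of bounded random variables we refer to $(\alpha, \psi)$-UCB simply as $\alpha$-UCB.

Proof. First note that if $I_t = i$, then at least one of the three following equations must be true:

$$\hat{\mu}_{i^*, T_{i^*}(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_{i^*}(t-1)} \right) \le \mu^* \tag{9}$$

$$\hat{\mu}_{i, T_i(t-1)} > \mu_i + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \tag{10}$$

$$T_i(t-1) < \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \tag{11}$$

Indeed, assume that the three equations are all false, then we have:

$$\hat{\mu}_{i^*, T_{i^*}(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_{i^*}(t-1)} \right) > \mu^* = \mu_i + \Delta_i \ge \mu_i + 2 (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \ge \hat{\mu}_{i, T_i(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right),$$

which implies, in particular, that $I_t \ne i$. In other words, letting

$$u = \left\lceil \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} \right\rceil ,$$

we just proved

$$\mathbb{E}\, T_i(n) = \mathbb{E} \sum_{t=1}^{n} 1_{I_t = i} \le u + \mathbb{E} \sum_{t=u+1}^{n} 1_{I_t = i \text{ and (11) is false}} \le u + \sum_{t=u+1}^{n} \Big( \mathbb{P}\big( \text{(9) is true} \big) + \mathbb{P}\big( \text{(10) is true} \big) \Big).$$

Thus it suffices to bound the probability of the events (9) and (10). Using a union bound and (7) one directly obtains:

$$\mathbb{P}\big( \text{(9) is true} \big) \le \mathbb{P}\left( \exists s \in \{1, \dots, t\} : \hat{\mu}_{i^*, s} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right) \le \mu^* \right) \le \sum_{s=1}^{t} \frac{1}{t^{\alpha}} = \frac{1}{t^{\alpha - 1}} .$$

The same upper bound holds for (10). Straightforward computations conclude the proof.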

3 Lower bound

We now show that the result of the previous section is essentially unimprovable when the reward distributions are Bernoulli. For $p, q \in [0,1]$ we denote by $\mathrm{kl}(p, q)$ the Kullback-Leibler divergence between a Bernoulli of parameter $p$ and a Bernoulli of parameter $q$, defined as

$$\mathrm{kl}(p, q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q} .$$

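Theorem .2 (Distribution-dependent lower bound)

Consider a strategy that satisfies $\mathbb{E}\, T_i(n) = o(n^a)$ for any set of Bernoulli reward distributions, any arm $i$ with $\Delta_i > 0$, and any $a > 0$. Then, for any set of Bernoulli reward distributions the following holds

$$\liminf_{n \to +\infty} \frac{\bar{R}_n}{\ln n} \ge \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)} .$$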

In order to compare this result with (8) we use the following standard inequalities (the left hand side follows from Pinsker’s inequality, and the right hand side simply uses $\ln x \le x - 1$),

$$2 (p - q)^2 \le \mathrm{kl}(p, q) \le \frac{(p - q)^2}{q (1 - q)} . \tag{12}$$
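To make the comparison concrete, the sketch below evaluates the Bernoulli divergence $\mathrm{kl}(p,q) = p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}$, checks the sandwich $2(p-q)^2 \le \mathrm{kl}(p,q) \le (p-q)^2 / (q(1-q))$ numerically, and compares the lower-bound constant $\Delta_i / \mathrm{kl}(\mu_i, \mu^*)$ with the leading constant $2\alpha/\Delta_i$ of the UCB bound. The function names and the two-armed instance ($\mu_i = 0.4$, $\mu^* = 0.5$) are hypothetical choices for illustration.

```python
import math

def bernoulli_kl(p, q):
    """kl(p, q) between Bernoulli(p) and Bernoulli(q), with 0 * ln 0 = 0; needs 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def check_kl_bounds(p, q):
    """True when 2 (p - q)^2 <= kl(p, q) <= (p - q)^2 / (q (1 - q))."""
    kl = bernoulli_kl(p, q)
    lower = 2 * (p - q) ** 2
    upper = (p - q) ** 2 / (q * (1 - q))
    return lower <= kl <= upper

def ucb_constant(alpha, gap):
    """Leading constant 2 * alpha / gap of the logarithmic term in the UCB bound."""
    return 2 * alpha / gap

# Hypothetical instance: one suboptimal arm with mean 0.4, optimal mean 0.5.
mu_i, mu_star = 0.4, 0.5
gap = mu_star - mu_i
lower_bound_constant = gap / bernoulli_kl(mu_i, mu_star)
```

On this instance the lower-bound constant is a bit below $1/(2\Delta_i) = 5$, while the UCB constant is $20\alpha$, which illustrates that the two bounds match in their $\ln n$ rate but not in the constant.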


The proof is organized in three steps. For simplicity, we only consider the case of two arms.

First step: Notations.

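Without loss of generality assume that arm $1$ is optimal and arm $2$ is suboptimal, that is $\mu_2 < \mu_1 < 1$. Let $\varepsilon > 0$. Since $x \mapsto \mathrm{kl}(\mu_2, x)$ is continuous one can find $\mu_2' \in (\mu_1, 1)$ such that

$$\mathrm{kl}(\mu_2, \mu_2') \le (1 + \varepsilon)\, \mathrm{kl}(\mu_2, \mu_1) . \tag{13}$$

We use the notation $\mathbb{E}'$, $\mathbb{P}'$ when we integrate with respect to the modified bandit where the parameter of arm $2$ is replaced by $\mu_2'$. We want to compare the behavior of the forecaster on the initial and modified bandits. In particular, we prove that with a big enough probability the forecaster cannot distinguish between the two problems. Then, using the fact that we have a good forecaster by hypothesis, we know that the algorithm does not make too many mistakes on the modified bandit where arm $2$ is optimal. In other words, we have a lower bound on the number of times the optimal arm is played. This reasoning implies a lower bound on the number of times arm $2$ is played in the initial problem.

We now slightly change the notation for rewards and denote by $X_{2,1}, \dots, X_{2,n}$ the sequence of random variables obtained when pulling arm $2$ for $n$ times (that is, $X_{2,s}$ is the reward obtained from the $s$-th pull). For $s \in \{1, \dots, n\}$, let

$$\widehat{\mathrm{kl}}_s = \sum_{t=1}^{s} \ln \frac{\mu_2 X_{2,t} + (1 - \mu_2)(1 - X_{2,t})}{\mu_2' X_{2,t} + (1 - \mu_2')(1 - X_{2,t})} .$$

Note that, with respect to the initial bandit, $\widehat{\mathrm{kl}}_{T_2(n)}$ is the (non re-normalized) empirical estimate of $\mathrm{kl}(\mu_2, \mu_2')$ at time $n$, since in that case the process $(X_{2,s})$ is i.i.d. from a Bernoulli of parameter $\mu_2$. Another important property is the following: for any event $A$ in the $\sigma$-algebra generated by $X_1, \dots, X_n$ the following change-of-measure identity holds:

$$\mathbb{P}'(A) = \mathbb{E}\left[ 1_A \exp\big( - \widehat{\mathrm{kl}}_{T_2(n)} \big) \right] . \tag{14}$$

In order to link the behavior of the forecaster on the initial and modified bandits we introduce the event

$$C_n = \left\{ T_2(n) < \frac{1 - \varepsilon}{\mathrm{kl}(\mu_2, \mu_2')} \ln n \ \text{ and } \ \widehat{\mathrm{kl}}_{T_2(n)} \le \left( 1 - \frac{\varepsilon}{2} \right) \ln n \right\} . \tag{15}$$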

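Second step: $\mathbb{P}(C_n) = o(1)$.

By (14) and (15) one has

$$\mathbb{P}'(C_n) = \mathbb{E}\left[ 1_{C_n} \exp\big( - \widehat{\mathrm{kl}}_{T_2(n)} \big) \right] \ge e^{-(1 - \varepsilon/2) \ln n}\, \mathbb{P}(C_n) .$$

Introduce the shorthand

$$f_n = \frac{1 - \varepsilon}{\mathrm{kl}(\mu_2, \mu_2')} \ln n .$$

Using again (15) and Markov’s inequality, the above implies

$$\mathbb{P}(C_n) \le n^{1 - \varepsilon/2}\, \mathbb{P}'(C_n) \le n^{1 - \varepsilon/2}\, \mathbb{P}'\big( T_2(n) < f_n \big) \le n^{1 - \varepsilon/2}\, \frac{\mathbb{E}'[ n - T_2(n) ]}{n - f_n} .$$

Now note that in the modified bandit arm $2$ is the unique optimal arm. Hence the assumption that for any bandit, any suboptimal arm $i$, and any $a > 0$, the strategy satisfies $\mathbb{E}\, T_i(n) = o(n^a)$, implies (taking $a < \varepsilon/2$) that

$$\mathbb{P}(C_n) \le n^{1 - \varepsilon/2}\, \frac{\mathbb{E}'[ n - T_2(n) ]}{n - f_n} = o(1) .$$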

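Third step: $\mathbb{P}\big( T_2(n) < f_n \big) = o(1)$.

Observe that

$$\mathbb{P}(C_n) \ge \mathbb{P}\left( T_2(n) < f_n \ \text{ and } \ \max_{s \le f_n} \widehat{\mathrm{kl}}_s \le \left( 1 - \frac{\varepsilon}{2} \right) \ln n \right) = \mathbb{P}\left( T_2(n) < f_n \ \text{ and } \ \frac{\mathrm{kl}(\mu_2, \mu_2')}{(1 - \varepsilon) \ln n} \max_{s \le f_n} \widehat{\mathrm{kl}}_s \le \frac{1 - \varepsilon/2}{1 - \varepsilon}\, \mathrm{kl}(\mu_2, \mu_2') \right) . \tag{16}$$

Now we use the maximal version of the strong law of large numbers: for any sequence $(X_t)$ of independent real random variables with positive mean $\mu > 0$,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} X_t = \mu \ \text{ a.s.} \quad \text{implies} \quad \lim_{n \to \infty} \frac{1}{n} \max_{s = 1, \dots, n} \sum_{t=1}^{s} X_t = \mu \ \text{ a.s.}$$

See, e.g., (Bubeck, 2010, Lemma 10.5).

Since $\mathrm{kl}(\mu_2, \mu_2') > 0$ and $\frac{1 - \varepsilon/2}{1 - \varepsilon} > 1$, we deduce that

$$\lim_{n \to \infty} \mathbb{P}\left( \frac{\mathrm{kl}(\mu_2, \mu_2')}{(1 - \varepsilon) \ln n} \max_{s \le f_n} \widehat{\mathrm{kl}}_s \le \frac{1 - \varepsilon/2}{1 - \varepsilon}\, \mathrm{kl}(\mu_2, \mu_2') \right) = 1 .$$

Thus, by the result of the second step and (16), we get

$$\mathbb{P}\big( T_2(n) < f_n \big) = o(1) .$$

Now recalling that $f_n = \frac{1 - \varepsilon}{\mathrm{kl}(\mu_2, \mu_2')} \ln n$, and using (13), we obtain

$$\mathbb{E}\, T_2(n) \ge (1 + o(1))\, \frac{1 - \varepsilon}{1 + \varepsilon}\, \frac{\ln n}{\mathrm{kl}(\mu_2, \mu_1)} ,$$

which concludes the proof.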

4 Refinements and bibliographic remarks

The UCB strategy presented in Section 2 was introduced by Auer et al. (2002a) for bounded random variables. Theorem .2 is extracted from Lai and Robbins (1985). Note that in this last paper the result is more general than ours, which is restricted to Bernoulli distributions. Although Burnetas and Katehakis (1997) prove an even more general lower bound, Theorem .2 and the UCB regret bound provide a reasonably complete solution to the problem. We now discuss some of the possible refinements. In the following, we restrict our attention to the case of bounded rewards (except in Section 4.7).

4.1 Improved constants

The regret bound proof for UCB can be improved in two ways. First, the union bound over the different time steps can be replaced by a “peeling” argument. This allows one to show a logarithmic regret for any $\alpha > 1$, whereas the proof of Section 2 requires $\alpha > 2$; see (Bubeck, 2010, Section 2.2) for more details. A second improvement, proposed by Garivier and Cappé (2011), is to use a more subtle set of conditions than (9)–(11). In fact, the authors take both improvements into account, and show that $\alpha$-UCB has a regret of order $\frac{\alpha}{2} \ln n$ for any $\alpha > 1$. In the limit when $\alpha$ tends to $1$, this constant is unimprovable in light of Theorem .2 and (12).

4.2 Second order bounds

Although $\alpha$-UCB is essentially optimal, the gap between (8) and Theorem .2 can be important if $\mathrm{kl}(\mu_i, \mu^*)$ is significantly larger than $\Delta_i^2$. Several improvements have been proposed towards closing this gap. In particular, the UCB-V algorithm of Audibert et al. (2009) takes into account the variance of the distributions and replaces Hoeffding’s inequality by Bernstein’s inequality in the derivation of UCB. A previous algorithm with similar ideas was developed by Auer et al. (2002a) without theoretical guarantees. A second type of approach replaces $L_2$-neighborhoods in $\alpha$-UCB by $\mathrm{kl}$-neighborhoods. This line of work started with Honda and Takemura (2010) where only asymptotic guarantees were provided. Later, Garivier and Cappé (2011) and Maillard et al. (2011) (see also Cappé et al. (2012)) independently proposed a similar algorithm, called KL-UCB, which is shown to attain the optimal rate in finite time. More precisely, Garivier and Cappé (2011) showed that KL-UCB attains a regret smaller than

$$\sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)}\, \alpha \ln n + O(1) ,$$

where $\alpha > 1$ is a parameter of the algorithm. Thus, KL-UCB is optimal for Bernoulli distributions, and strictly dominates $\alpha$-UCB for any bounded reward distributions.
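The idea of replacing a Hoeffding-style neighborhood by a kl-neighborhood can be illustrated with a short index computation. The sketch below assumes an exploration budget of $\alpha \ln t$ and defines the index of an arm with empirical mean $\hat\mu$ after $s$ pulls as the largest $q \in [\hat\mu, 1]$ with $s \cdot \mathrm{kl}(\hat\mu, q) \le \alpha \ln t$, found by bisection. The exact exploration function used in the cited papers may differ, and the names `kl_ucb_index` and `hoeffding_index` are hypothetical.

```python
import math

def bernoulli_kl(p, q):
    """Bernoulli kl divergence, with arguments clipped away from 0 and 1 to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, alpha=1.1, iters=50):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= alpha * ln t, by bisection."""
    budget = alpha * math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid     # mid is still inside the kl-neighborhood
        else:
            hi = mid     # mid is outside, shrink from above
    return lo

def hoeffding_index(mean, pulls, t, alpha=1.1):
    """Index of alpha-UCB for [0, 1]-valued rewards, shown for comparison."""
    return mean + math.sqrt(alpha * math.log(t) / (2 * pulls))
```

Because $\mathrm{kl}(p, q) \ge 2(p - q)^2$ by Pinsker’s inequality, the kl-based index is never larger than the Hoeffding-based one for the same budget, which is the source of the tighter confidence region.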

4.3 Distribution-free bounds

In the limit when $\Delta_i$ tends to $0$, the upper bound in (8) becomes vacuous. On the other hand, it is clear that the regret incurred from pulling arm $i$ cannot be larger than $n \Delta_i$. Using this idea, it is easy to show that the regret of $\alpha$-UCB is always smaller than $\sqrt{\alpha n K \ln n}$ (up to a numerical constant). However, as we shall see in the next chapter, one can show a minimax lower bound on the regret of order $\sqrt{nK}$. Audibert and Bubeck (2009) proposed a modification of $\alpha$-UCB that gets rid of the extraneous logarithmic term in the upper bound. More precisely, let $\Delta = \min_{i \ne i^*} \Delta_i$; Audibert and Bubeck (2010) show that MOSS (Minimax Optimal Strategy in the Stochastic case) attains a regret smaller than

$$\min\left\{ \sqrt{nK},\ \frac{K}{\Delta} \ln \frac{n \Delta^2}{K} \right\}$$

up to a numerical constant. The weakness of this result is that the second term in the above equation only depends on the smallest gap $\Delta$. In Auer and Ortner (2010) (see also Perchet and Rigollet (2011)) the authors designed a strategy, called improved UCB, with a regret of order

$$\sum_{i \ne i^*} \frac{1}{\Delta_i} \ln\big( n \Delta_i^2 \big) .$$

This latter regret bound can be better than the one for MOSS in some regimes, but it does not attain the minimax optimal rate of order $\sqrt{nK}$. It is an open problem to obtain a strategy with a regret always better than those of MOSS and improved UCB. A plausible conjecture is that a regret of order

$$\sum_{i \ne i^*} \frac{1}{\Delta_i} \ln \frac{n}{H} , \qquad H = \sum_{i \ne i^*} \frac{1}{\Delta_i^2} ,$$

is attainable. Note that the quantity $H$ appears in other variants of the stochastic multi-armed bandit problem, see Audibert et al. (2010).

4.4 High probability bounds

While bounds on the pseudo-regret $\bar{R}_n$ are important, one typically wants to control the quantity $\hat{R}_n = n \mu^* - \sum_{t=1}^{n} \mu_{I_t}$ with high probability. Showing that $\hat{R}_n$ is close to its expectation $\bar{R}_n$ is a challenging task, since naively one might expect the fluctuations of $\hat{R}_n$ to be of order $\sqrt{n}$, which would dominate the expectation $\bar{R}_n$, which is only of order $\ln n$. The concentration properties of $\hat{R}_n$ for UCB are analyzed in detail in Audibert et al. (2009), where it is shown that $\hat{R}_n$ concentrates around its expectation, but that there is also a polynomial (in $n$) probability that $\hat{R}_n$ is of order $n$. In fact the polynomial concentration of $\hat{R}_n$ around $\bar{R}_n$ can be directly derived from our proof of Theorem .1. In Salomon and Audibert (2011) it is proved that for anytime strategies (i.e., strategies that do not use the time horizon $n$) it is basically impossible to improve this polynomial concentration to a classical exponential concentration. In particular this shows that, as far as high probability bounds are concerned, anytime strategies are surprisingly weaker than strategies using the time horizon information (for which exponential concentration of $\hat{R}_n$ around $\bar{R}_n$ is possible, see Audibert et al. (2009)).

4.5 $\varepsilon$-greedy

A simple and popular heuristic for bandit problems is the $\varepsilon$-greedy strategy (see, e.g., Sutton and Barto (1998)). The idea is very simple. First, pick a parameter $0 < \varepsilon < 1$. Then, at each step greedily play the arm with highest empirical mean reward with probability $1 - \varepsilon$, and play a random arm with probability $\varepsilon$. Auer et al. (2002a) proved that, if $\varepsilon$ is allowed to be a certain function $\varepsilon_t$ of the current time step $t$, namely $\varepsilon_t = K / (d^2 t)$, then the regret grows logarithmically like $(K \ln n) / d^2$, provided $0 < d < \min_{i \ne i^*} \Delta_i$. While this bound has a suboptimal dependence on $d$, Auer et al. (2002a) show that this algorithm performs well in practice, but the performance degrades quickly if $d$ is not chosen as a tight lower bound of $\min_{i \ne i^*} \Delta_i$.
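A minimal sketch of the decreasing-exploration variant, with exploration probability $\varepsilon_t = K/(d^2 t)$, is given below. Capping $\varepsilon_t$ at $1$, treating unpulled arms as having empirical mean $0$, and the names `epsilon_greedy`, `pull` and `d` as call parameters are assumptions of this example rather than details taken from the cited analysis.

```python
import random

def epsilon_greedy(pull, n_arms, horizon, d, rng):
    """Decreasing epsilon-greedy with eps_t = min(1, K / (d^2 t)); returns pulls per arm."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        eps_t = min(1.0, n_arms / (d * d * t))   # capped at 1 so it is a probability
        if rng.random() < eps_t:
            arm = rng.randrange(n_arms)          # explore: uniformly random arm
        else:
            # exploit: arm with the highest empirical mean (unpulled arms count as 0)
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i] if counts[i] else 0.0,
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical Bernoulli instance with gap 0.8; d = 0.5 is a valid lower bound on the gap.
rng = random.Random(1)
means = [0.9, 0.1]
counts = epsilon_greedy(lambda i: float(rng.random() < means[i]), 2, 5000, d=0.5, rng=rng)
```

Choosing `d` much smaller than the true gap keeps the exploration probability high for a long time, which is one way to see the sensitivity to this parameter.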

4.6 Thompson sampling

In the very first paper on the multi-armed bandit problem, Thompson (1933), a simple strategy was proposed for the case of Bernoulli distributions. The so-called Thompson sampling algorithm proceeds as follows. Assume a uniform prior on the parameters $\mu_i \in [0,1]$, let $\pi_{i,t}$ be the posterior distribution for $\mu_i$ at the $t$-th round, and let $\theta_{i,t} \sim \pi_{i,t}$ (independently from the past given $\pi_{i,t}$). The strategy is then given by $I_t \in \arg\max_{i = 1, \dots, K} \theta_{i,t}$. Recently there has been a surge of interest in this simple policy, mainly because of its flexibility to incorporate prior knowledge on the arms, see for example Chapelle and Li (2011) and the references therein. While the theoretical behavior of Thompson sampling remained elusive for a long time, we now have a fairly good understanding of its theoretical properties: in Agrawal and Goyal (2012) the first logarithmic regret bound was proved, and in Kaufmann et al. (2012b) it was shown that in fact Thompson sampling attains essentially the same regret as in (8). Interestingly, while Thompson sampling comes from Bayesian reasoning, it is analyzed from a frequentist perspective. For more on the interplay between Bayesian strategy and frequentist regret analysis we refer the reader to Kaufmann et al. (2012a).
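For Bernoulli rewards with a uniform prior, the posterior of each arm after $S$ successes and $F$ failures is a Beta$(1+S, 1+F)$ distribution, so the sampling step can be sketched with the standard library alone. The function name `thompson_sampling`, the `pull` callback returning 0 or 1, and the test instance are hypothetical.

```python
import random

def thompson_sampling(pull, n_arms, horizon, rng):
    """Beta-Bernoulli Thompson sampling; returns the number of pulls per arm."""
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(horizon):
        # uniform prior = Beta(1, 1); posterior of arm i is Beta(1 + S_i, 1 + F_i)
        samples = [
            rng.betavariate(1 + successes[i], 1 + failures[i])
            for i in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda i: samples[i])   # play the largest sample
        if pull(arm) == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [successes[i] + failures[i] for i in range(n_arms)]

# Hypothetical two-armed Bernoulli instance.
rng = random.Random(2)
means = [0.9, 0.1]
pulls = thompson_sampling(lambda i: int(rng.random() < means[i]), 2, 2000, rng)
```

Prior knowledge about an arm could be encoded by starting its counters at values other than zero, which is the flexibility mentioned above.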

4.7 Heavy-tailed distributions

We showed in Section 2 how to obtain a UCB-type strategy through a bound on the moment generating function. Moreover one can see that the resulting bound in Theorem .1 deteriorates as the tails of the distributions become heavier. In particular, we did not provide any result for the case of distributions for which the moment generating function is not finite. Surprisingly, it was shown in Bubeck et al. (2012b) that in fact there exists a strategy with essentially the same regret as in (8), as soon as the variances of the distributions are finite. More precisely, using more refined robust estimators of the mean than the basic empirical mean, one can construct a UCB-type strategy such that for distributions with moment of order $1 + \varepsilon$ bounded by $1$ it satisfies

$$\bar{R}_n \le c \sum_{i : \Delta_i > 0} \left( \frac{\ln n}{\Delta_i^{1/\varepsilon}} + \Delta_i \right) .$$

We refer the interested reader to Bubeck et al. (2012b) for more details on these ’robust’ versions of UCB.

5 Pseudo-regret bounds

As we pointed out, in order to obtain non-trivial regret guarantees in the adversarial framework it is necessary to consider randomized forecasters. Below we describe the randomized forecaster Exp3, which is based on two fundamental ideas.

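Exp3 (Exponential weights for Exploration and Exploitation)

Parameter: a non-increasing sequence of real numbers $(\eta_t)_{t \in \mathbb{N}}$.

Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round $t = 1, 2, \dots, n$:

Draw an arm $I_t$ from the probability distribution $p_t$.

For each arm $i = 1, \dots, K$ compute the estimated loss $\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1_{I_t = i}$ and update the estimated cumulative loss $\tilde{L}_{i,t} = \tilde{L}_{i,t-1} + \tilde{\ell}_{i,t}$.

Compute the new probability distribution over arms $p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$, where

$$p_{i,t+1} = \frac{\exp\big( - \eta_t \tilde{L}_{i,t} \big)}{\sum_{k=1}^{K} \exp\big( - \eta_t \tilde{L}_{k,t} \big)} .$$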
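The three steps of the forecaster (draw an arm, form the importance-weighted loss estimate $\ell_{i,t}/p_{i,t}$ for the played arm and $0$ for the others, then reweight exponentially in the estimated cumulative losses) translate almost line by line into code. The sketch below is illustrative only: the `loss(t, i)` callback, the `eta(t)` callback, subtracting the minimum before exponentiating for numerical stability, and the fixed loss sequence in the usage lines are assumptions of this example.

```python
import math
import random

def exp3(loss, n_arms, horizon, eta, rng):
    """loss(t, i) must lie in [0, 1]; eta(t) is non-increasing. Returns the total loss suffered."""
    cum_est = [0.0] * n_arms                 # estimated cumulative losses
    probs = [1.0 / n_arms] * n_arms          # p_1 is uniform
    total = 0.0
    for t in range(1, horizon + 1):
        arm = rng.choices(range(n_arms), weights=probs)[0]
        observed = loss(t, arm)              # only the played arm's loss is seen
        total += observed
        cum_est[arm] += observed / probs[arm]    # unplayed arms get estimate 0
        # exponential weights; shifting by the minimum leaves the distribution unchanged
        base = min(cum_est)
        weights = [math.exp(-eta(t) * (L - base)) for L in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]
    return total

# Hypothetical loss sequence: arm 0 always costs 0.2, the other arms 0.8.
K, n = 3, 3000
rng = random.Random(3)
total = exp3(
    lambda t, i: 0.2 if i == 0 else 0.8,
    K, n,
    eta=lambda t: math.sqrt(math.log(K) / (t * K)),
    rng=rng,
)
regret = total - 0.2 * n
```

Here `regret` compares the forecaster with the best fixed arm on this particular sequence.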

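First, despite the fact that only the loss of the played arm is observed, with a simple trick it is still possible to build an unbiased estimator for the loss of any other arm. Namely, if the next arm $I_t$ to be played is drawn from a probability distribution $p_t = (p_{1,t}, \dots, p_{K,t})$, then

$$\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1_{I_t = i}$$

is an unbiased estimator (with respect to the draw of $I_t$) of $\ell_{i,t}$. Indeed, for each $i = 1, \dots, K$ we have

$$\mathbb{E}_{I_t \sim p_t}\big[ \tilde{\ell}_{i,t} \big] = \sum_{j=1}^{K} p_{j,t}\, \frac{\ell_{i,t}}{p_{i,t}} 1_{j = i} = \ell_{i,t} .$$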
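The unbiasedness of the estimate $\frac{\ell_{i,t}}{p_{i,t}} 1_{I_t = i}$ can be checked exactly on a small example by summing over the possible draws of $I_t$. The same enumeration shows that its second moment is $\ell_{i,t}^2 / p_{i,t}$, which grows as the sampling probability shrinks. The loss vector and the distribution below are hypothetical values chosen for illustration.

```python
def expected_estimate(losses, probs, i):
    """Exact expectation over I_t ~ probs of the estimate losses[i] / probs[i] * 1{I_t = i}."""
    return sum(
        probs[j] * (losses[i] / probs[i] if j == i else 0.0)
        for j in range(len(probs))
    )

def second_moment(losses, probs, i):
    """Exact second moment of the same estimate; equals losses[i]**2 / probs[i]."""
    return sum(
        probs[j] * ((losses[i] / probs[i]) ** 2 if j == i else 0.0)
        for j in range(len(probs))
    )

losses = [0.3, 0.9, 0.5]      # hypothetical losses at one round
probs = [0.2, 0.5, 0.3]       # hypothetical sampling distribution p_t
```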

The second idea is to use an exponential reweighting of the cumulative estimated losses to define the probability distribution from which the forecaster will select the arm . Exponential weighting schemes are a standard tool in the study of sequential prediction schemes under adversarial assumptions. The reader is referred to the monograph by Cesa-Bianchi and Lugosi (2006) for a general introduction to prediction of individual sequences, and to the recent survey by Arora et al. (2012b) focussed on computer science applications of exponential weighting.

We provide two different pseudo-regret bounds for this strategy. The bound (19) is obtained assuming that the forecaster does not know the number of rounds . This is the anytime version of the algorithm. The bound (18), instead, shows that a better constant can be achieved using the knowledge of the time horizon.

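Theorem .1 (Pseudo-regret of Exp3)

If Exp3 is run with $\eta_t = \eta = \sqrt{\frac{2 \ln K}{n K}}$, then

$$\bar{R}_n \le \sqrt{2 n K \ln K} . \tag{18}$$

Moreover, if Exp3 is run with $\eta_t = \sqrt{\frac{\ln K}{t K}}$, then

$$\bar{R}_n \le 2 \sqrt{n K \ln K} . \tag{19}$$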


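Proof. We prove that for any non-increasing sequence $(\eta_t)_{t \in \mathbb{N}}$ Exp3 satisfies

$$\bar{R}_n \le \frac{K}{2} \sum_{t=1}^{n} \eta_t + \frac{\ln K}{\eta_n} . \tag{20}$$

Inequality (18) then trivially follows from (20). For (19) we use (20) and $\sum_{t=1}^{n} \frac{1}{\sqrt{t}} \le \int_0^n \frac{1}{\sqrt{t}}\, dt = 2 \sqrt{n}$. The proof of (20) is divided in five steps.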
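As a quick numerical sanity check, the sketch below evaluates the general bound $\frac{K}{2} \sum_{t=1}^{n} \eta_t + \frac{\ln K}{\eta_n}$ for the two learning-rate choices $\eta_t = \sqrt{2 \ln K / (nK)}$ and $\eta_t = \sqrt{\ln K / (tK)}$ and compares it with $\sqrt{2 n K \ln K}$ and $2\sqrt{n K \ln K}$ respectively. The values of `K` and `n` are arbitrary, hypothetical choices.

```python
import math

def general_bound(etas, n_arms):
    """(K / 2) * sum_t eta_t + ln K / eta_n for a list of learning rates eta_1, ..., eta_n."""
    return n_arms / 2 * sum(etas) + math.log(n_arms) / etas[-1]

K, n = 10, 1000               # hypothetical problem size
fixed = [math.sqrt(2 * math.log(K) / (n * K))] * n
anytime = [math.sqrt(math.log(K) / (t * K)) for t in range(1, n + 1)]

bound_fixed = math.sqrt(2 * n * K * math.log(K))      # known-horizon bound
bound_anytime = 2 * math.sqrt(n * K * math.log(K))    # anytime bound
```

With the fixed rate the general bound is met with equality, while the anytime rate gives a value slightly below its bound because $\sum_{t \le n} t^{-1/2} < 2\sqrt{n}$.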

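First step: Useful equalities.

The following equalities can be easily verified:

$$\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t} = \ell_{I_t, t}, \qquad \mathbb{E}_{I_t \sim p_t} \tilde{\ell}_{i,t} = \ell_{i,t}, \qquad \mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t}^{\,2} = \frac{\ell_{I_t, t}^2}{p_{I_t, t}}, \qquad \mathbb{E}_{I_t \sim p_t} \frac{1}{p_{I_t, t}} = K . \tag{21}$$

In particular, they imply

$$\sum_{t=1}^{n} \ell_{I_t, t} - \sum_{t=1}^{n} \ell_{k,t} = \sum_{t=1}^{n} \mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t} - \sum_{t=1}^{n} \mathbb{E}_{I_t \sim p_t} \tilde{\ell}_{k,t} . \tag{22}$$

The key idea of the proof is to rewrite $\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t}$ as follows

$$\mathbb{E}_{i \sim p_t} \tilde{\ell}_{i,t} = \frac{1}{\eta_t} \ln \mathbb{E}_{i \sim p_t} \exp\Big( - \eta_t \big( \tilde{\ell}_{i,t} - \mathbb{E}_{k \sim p_t} \tilde{\ell}_{k,t} \big) \Big) - \frac{1}{\eta_t} \ln \mathbb{E}_{i \sim p_t} \exp\big( - \eta_t \tilde{\ell}_{i,t} \big) . \tag{23}$$

The reader may recognize $\ln \mathbb{E}_{i \sim p_t} \exp\big( - \eta_t \tilde{\ell}_{i,t} \big)$ as the cumulant-generating function (or the log of the moment-generating function) of the random variable $\tilde{\ell}_{I_t, t}$. This quantity naturally arises in the analysis of forecasters based on exponential weights. In the next two steps we study the two terms in the right-hand side of (23).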

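Second step: Study of the first term in (23).

We use the inequalities $\ln x \le x - 1$ and $\exp(-x) - 1 + x \le x^2 / 2$, for all $x \ge 0$, to obtain:

$$\ln \mathbb{E}_{i \sim p_t} \exp\Big( - \eta_t \big( \tilde{\ell}_{i,t} - \mathbb{E}_{k \sim p_t} \tilde{\ell}_{k,t} \big) \Big) = \ln \mathbb{E}_{i \sim p_t} \exp\big( - \eta_t \tilde{\ell}_{i,t} \big) + \eta_t\, \mathbb{E}_{k \sim p_t} \tilde{\ell}_{k,t} \le \mathbb{E}_{i \sim p_t} \Big( \exp\big( - \eta_t \tilde{\ell}_{i,t} \big) - 1 + \eta_t \tilde{\ell}_{i,t} \Big) \le \mathbb{E}_{i \sim p_t} \frac{\eta_t^2\, \tilde{\ell}_{i,t}^{\,2}}{2} \le \frac{\eta_t^2}{2\, p_{I_t, t}} , \tag{24}$$

where the last step comes from the third equality in (21).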

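Third step: Study of the second term in (23).

Let $\tilde{L}_{i,0} = 0$, $\Phi_0(\eta) = 0$ and $\Phi_t(\eta) = \frac{1}{\eta} \ln \frac{1}{K} \sum_{i=1}^{K} \exp\big( - \eta \tilde{L}_{i,t} \big)$. Then, by definition of $p_t$ we have

$$- \frac{1}{\eta_t} \ln \mathbb{E}_{i \sim p_t} \exp\big( - \eta_t \tilde{\ell}_{i,t} \big) = - \frac{1}{\eta_t} \ln \frac{\sum_{i=1}^{K} \exp\big( - \eta_t \tilde{L}_{i,t} \big)}{\sum_{i=1}^{K} \exp\big( - \eta_t \tilde{L}_{i,t-1} \big)} = \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) . \tag{25}$$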


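Fourth step: Summing.

Putting together (22), (23), (24) and (25) we obtain

$$\sum_{t=1}^{n} \ell_{I_t, t} - \sum_{t=1}^{n} \ell_{k,t} \le \sum_{t=1}^{n} \frac{\eta_t}{2\, p_{I_t, t}} + \sum_{t=1}^{n} \big( \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) \big) - \sum_{t=1}^{n} \mathbb{E}_{I_t \sim p_t} \tilde{\ell}_{k,t} .$$

The first term is easy to bound in expectation since, by the rule of conditional expectations and the last equality in (21) we have

$$\mathbb{E} \sum_{t=1}^{n} \frac{\eta_t}{2\, p_{I_t, t}} = \mathbb{E} \sum_{t=1}^{n} \mathbb{E}_{I_t \sim p_t} \frac{\eta_t}{2\, p_{I_t, t}} = \frac{K}{2} \sum_{t=1}^{n} \eta_t .$$

For the second term we start with an Abel transformation,

$$\sum_{t=1}^{n} \big( \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) \big) = \sum_{t=1}^{n-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big) - \Phi_n(\eta_n)$$

since $\Phi_0(\eta_1) = 0$. Note that

$$- \Phi_n(\eta_n) = \frac{\ln K}{\eta_n} - \frac{1}{\eta_n} \ln \left( \sum_{i=1}^{K} \exp\big( - \eta_n \tilde{L}_{i,n} \big) \right) \le \frac{\ln K}{\eta_n} - \frac{1}{\eta_n} \ln \exp\big( - \eta_n \tilde{L}_{k,n} \big) = \frac{\ln K}{\eta_n} + \sum_{t=1}^{n} \tilde{\ell}_{k,t}$$

and thus we have

$$\mathbb{E} \left[ \sum_{t=1}^{n} \ell_{I_t, t} - \sum_{t=1}^{n} \ell_{k,t} \right] \le \frac{K}{2} \sum_{t=1}^{n} \eta_t + \frac{\ln K}{\eta_n} + \mathbb{E} \sum_{t=1}^{n-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big) .$$

To conclude the proof, we show that $\Phi_t'(\eta) \ge 0$. Since $\eta_{t+1} \le \eta_t$, we then obtain $\Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \le 0$. Let

$$p_{i,t}^{\eta} = \frac{\exp\big( - \eta \tilde{L}_{i,t} \big)}{\sum_{k=1}^{K} \exp\big( - \eta \tilde{L}_{k,t} \big)} .$$

Then

$$\Phi_t'(\eta) = - \frac{1}{\eta^2} \ln \left( \frac{1}{K} \sum_{i=1}^{K} \exp\big( - \eta \tilde{L}_{i,t} \big) \right) - \frac{1}{\eta}\, \frac{\sum_{i=1}^{K} \tilde{L}_{i,t} \exp\big( - \eta \tilde{L}_{i,t} \big)}{\sum_{i=1}^{K} \exp\big( - \eta \tilde{L}_{i,t} \big)} = \frac{1}{\eta^2} \sum_{i=1}^{K} p_{i,t}^{\eta} \left( - \eta \tilde{L}_{i,t} - \ln \left( \frac{1}{K} \sum_{k=1}^{K} \exp\big( - \eta \tilde{L}_{k,t} \big) \right) \right) .$$

Simplifying, we get (since $p_1$ is the uniform distribution over $\{1, \dots, K\}$),

$$\Phi_t'(\eta) = \frac{1}{\eta^2} \sum_{i=1}^{K} p_{i,t}^{\eta} \ln\big( K p_{i,t}^{\eta} \big) = \frac{1}{\eta^2}\, \mathrm{KL}\big( p_t^{\eta}, p_1 \big) \ge 0 .$$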

6 High probability and expected regret bounds

In this section we prove a high probability bound on the regret. Unfortunately, the Exp3 strategy defined in the previous section is not adequate for this task. Indeed, the variance of the estimate $\tilde{\ell}_{i,t}$ is of order $1 / p_{i,t}$, which can be arbitrarily large. In order to ensure that the probabilities $p_{i,t}$ are bounded from below, the original version of Exp3 mixes the exponential weights with a uniform distribution over the arms. In order to avoid increasing the regret, the mixing coefficient associated with the uniform distribution cannot be larger than $n^{-1/2}$. Since this implies that the variance of the cumulative loss estimate $\tilde{L}_{i,n}$ can be of order $n^{3/2}$, very little can be said about the concentration of the regret also for this variant of Exp3.

This issue can be solved by combining the mixing idea with a different estimate for losses. In fact, the core idea is more transparent when expressed in terms of gains, and so we turn to the gain version of the problem. The trick is to introduce a bias in the gain estimate which allows one to derive a high probability statement on this estimate.

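Lemma .1

For $\beta \le 1$, let

$$\tilde{g}_{i,t} = \frac{g_{i,t} 1_{I_t = i} + \beta}{p_{i,t}} .$$

Then, with probability at least $1 - \delta$,

$$\sum_{t=1}^{n} g_{i,t} \le \sum_{t=1}^{n} \tilde{g}_{i,t} + \frac{\ln(\delta^{-1})}{\beta} .$$

Proof. Let $\mathbb{E}_t$ be the expectation conditioned on $I_1, \dots, I_{t-1}$. Since $\exp(x) \le 1 + x + x^2$ for $x \le 1$, for $\beta \le 1$ we have

$$\mathbb{E}_t \exp\left( \beta g_{i,t} - \beta\, \frac{g_{i,t} 1_{I_t = i} + \beta}{p_{i,t}} \right) \le \left\{ 1 + \mathbb{E}_t \left[ \beta g_{i,t} - \beta\, \frac{g_{i,t} 1_{I_t = i}}{p_{i,t}} \right] + \mathbb{E}_t \left[ \beta g_{i,t} - \beta\, \frac{g_{i,t} 1_{I_t = i}}{p_{i,t}} \right]^2 \right\} \exp\left( - \frac{\beta^2}{p_{i,t}} \right) \le \left\{ 1 + \beta^2\, \frac{g_{i,t}^2}{p_{i,t}} \right\} \exp\left( - \frac{\beta^2}{p_{i,t}} \right) \le 1 ,$$

where the last inequality uses $1 + u \le \exp(u)$. As a consequence, we have

$$\mathbb{E} \exp\left( \beta \sum_{t=1}^{n} g_{i,t} - \beta \sum_{t=1}^{n} \frac{g_{i,t} 1_{I_t = i} + \beta}{p_{i,t}} \right) \le 1 .$$

Moreover, Markov’s inequality implies $\mathbb{P}\big( X > \ln(\delta^{-1}) \big) \le \delta\, \mathbb{E}\, e^{X}$ and thus, with probability at least $1 - \delta$,

$$\beta \sum_{t=1}^{n} g_{i,t} - \beta \sum_{t=1}^{n} \frac{g_{i,t} 1_{I_t = i} + \beta}{p_{i,t}} \le \ln(\delta^{-1}) .$$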

Exp3.P Parameters: $\eta \in \mathbb{R}^+$ and $\gamma, \beta \in [0,1]$