Upper Confidence Bounds for Combining Stochastic Bandits

12/24/2020 ∙ by Ashok Cutkosky, et al. ∙ Google

We provide a simple method to combine stochastic bandit algorithms. Our approach is based on a "meta-UCB" procedure that treats each of N individual bandit algorithms as arms in a higher-level N-armed bandit problem that we solve with a variant of the classic UCB algorithm. Our final regret depends only on the regret of the base algorithm with the best regret in hindsight. This approach provides an easy and intuitive alternative strategy to the CORRAL algorithm for adversarial bandits, without requiring the stability conditions imposed by CORRAL on the base algorithms. Our results match lower bounds in several settings, and we provide empirical validation of our algorithm on misspecified linear bandit and model selection problems.




1 Introduction

This paper studies the classic contextual bandit problem in a stochastic setting [19, 9], which is a generalization of the even more classical multi-armed bandit problem [18]. In each of $T$ rounds indexed by $t$, we observe an i.i.d. random context $x_t$, which we use to select an action $a_t$ based on some policy $\pi_t$. Then, we receive a noisy reward $r_t$, whose expectation is a function of only $x_t$ and $a_t$: $\mathbb{E}[r_t] = f(x_t, a_t)$. The goal is to perform nearly as well as the best policy in hindsight by minimizing the regret:

$$\mathrm{Regret}_T = \sum_{t=1}^T f(x_t, \pi^\star(x_t)) - \sum_{t=1}^T f(x_t, a_t),$$

where $\pi^\star = \operatorname{argmax}_{\pi \in \Pi} \mathbb{E}\big[\sum_{t=1}^T f(x_t, \pi(x_t))\big]$, and $\Pi$ is some space of possible policies.

This problem and its variants have been extensively studied under diverse assumptions about the space of policies and the distributions of the rewards $r_t$ and contexts $x_t$ (e.g., see [8, 19, 9, 20, 25, 4]). Many of these algorithms behave differently in different environments (e.g., one algorithm might do much better if the reward is a linear function of the context, while another might do better if the reward is independent of the context). This plethora of prior algorithms necessitates a “meta-decision”: if the environment is not known in advance, which algorithm should be used for the task at hand? Even in hindsight, it may not be obvious which algorithm was best suited to the environment that was actually experienced, so this meta-decision can be quite difficult.

We model this meta-decision by assuming we have access to $N$ base bandit algorithms $B_1, \dots, B_N$. We will attempt to design a meta-algorithm whose regret is comparable to the best regret experienced by any base algorithm in hindsight for the current environment. Since we do not know in advance which base algorithm will be optimal for the current environment, we need to address this problem in an online fashion: on the $t$-th round, we choose some index $i_t$ and play the action suggested by the algorithm $B_{i_t}$. The primary difficulty is that some base algorithms might perform poorly at first and then begin to perform well later. A naive strategy might discard such algorithms based on their poor early performance, so some enhanced form of exploration is necessary for success. A pioneering prior work on this setting considered the adversarial rather than stochastic case [5], and utilizes a sampling scheme based on mirror descent with a clever mirror map. Somewhat simplifying these results, suppose each algorithm $B_i$ guarantees regret $R_i T^{\alpha_i}$ for some $\alpha_i$ and $R_i$ (in many common settings, $\alpha_i = 1/2$). Given a user-specified learning-rate parameter $\eta$, [5, 23] guarantee:


The value of $\eta$ is chosen a priori, but the values of $R_i$ and $\alpha_i$ need not be known. To gain a little intuition for this expression, suppose all the $\alpha_i$ equal $1/2$, and choose $\eta$ to balance the terms; the regret then scales as $\sqrt{T}$ up to factors depending on $N$ and the $R_i$.

In this paper, we leverage the stochastic environment to avoid requiring the technical stability conditions needed in [5]: our result is a true black-box meta-algorithm that does not require any modifications to the base algorithms. Moreover, our general technique is, in our view, both different and much simpler. Our regret bound also improves on (1) by virtue of being non-uniform over the base algorithms: given any target regret values $T_1, \dots, T_N$ satisfying mild conditions, we obtain


This recovers (1) when all the $T_i$ are equal. In general, one can think of the values $T_i$ as specifying a kind of prior over the base algorithms, which allows us to develop more delicate trade-offs between their performances. For example, consider again the setting in which all $\alpha_i = 1/2$. If we believe that $B_1$ is more likely to perform well, then by setting $T_1$ small and the remaining $T_i$ larger, we obtain a smaller regret whenever $B_1$ is indeed the best base algorithm, at the price of a somewhat worse bound otherwise.

This type of bound is in some sense a continuation of a general trend in the bandit community towards finding algorithms that adapt to various properties of the environment that are unknown in foresight (e.g., [11, 26, 7, 16]). However, instead of committing to some particular property (e.g., a large reward gap, or a small variance of the rewards), we design an algorithm that is in some sense “future-proof”: new algorithms can easily be incorporated as additional base algorithms $B_i$.

In a recent independent work, [23] extended the techniques of [5] to our same stochastic setting. They use a clever smoothing technique to also dispense with the stability condition required by [5], and achieve the regret bound (1). In addition to achieving the non-uniform bound (2), we improve upon their work in two other ways: our algorithm has a substantially smaller space requirement, and we allow our base algorithms to guarantee only in-expectation rather than high-probability bounds.

In the stochastic setting, it is frequently possible to obtain logarithmic regret subject to various forms of “gap” assumptions on the actions. However, the method of [5], even when considered in the stochastic setting as in [23], seems unable to obtain such bounds. Instead, [6] has recently provided a method based on UCB that can achieve such results. Their algorithm is similar to ours, but we devise a somewhat more intricate method that is able to not only obtain the results outlined previously, but also match the logarithmic regret bounds provided by [6].

In the stochastic setting, [1] introduces a new model selection technique called Regret Balancing. However, this approach requires knowledge of the exact regret bound of the optimal base algorithm. Our approach not only avoids this requirement, but also yields stronger regret guarantees than those in that paper.

We also consider an extension of our techniques to the setting of linear contextual bandits with adversarial features. In this case we require some modifications to our combiner algorithm and assume that the base algorithms are in fact instances of linUCB [13] or similar confidence-ellipsoid based algorithms. However, subject to these restrictions we are able to recover essentially similar results as in the non-adversarial setting, which we present in Section G.

The rest of this paper is organized as follows. In Section 2 we describe our formal setup and assumptions. In Section 3 we provide our algorithm and main regret bound in a high-probability setting. In Section 4 we give gap-dependent regret bounds. In Section 5 we extend our analysis to in-expectation regret bounds and provide an automatic tuning of some parameters in the main algorithm. In Section 6 we sketch some ways in which our algorithm can be applied, and show that it matches some prior lower-bound frontiers. Finally, in Section 7 we provide empirical validation of our algorithm.

2 Problem Setup

We let $\mathcal{A}$ be a space of actions, $\Pi$ a space of policies, and $\mathcal{X}$ a space of contexts. Each policy $\pi \in \Pi$ is a function $\pi : \mathcal{X} \to \mathcal{A}$ (random policies can be modeled by pushing the random bits into the context). In each round $t$ we choose a policy $\pi_t$, then see an i.i.d. random context $x_t \in \mathcal{X}$, and then receive a random reward $r_t$. Let $h_t$ denote the sequence $(\pi_1, x_1, r_1, \dots, \pi_t, x_t, r_t)$. There is an unknown function $f : \mathcal{X} \times \mathcal{A} \to [0, 1]$ such that $\mathbb{E}[r_t \mid x_t, a_t] = f(x_t, a_t)$ for any $t$; the distribution of $r_t$ is independent of all other values conditioned on $x_t$ and $a_t$. We will also write $a_t = \pi_t(x_t)$ and $\mu(\pi) = \mathbb{E}_x[f(x, \pi(x))]$. Let $\pi^\star = \operatorname{argmax}_{\pi \in \Pi} \mu(\pi)$. Define $\mu^\star = \mu(\pi^\star)$. Then we define the regret (often called “pseudo-regret” instead) as:

$$\mathrm{Regret}_T = T\mu^\star - \sum_{t=1}^T \mu(\pi_t).$$
Each base bandit algorithm $B_i$ can be viewed as a randomized procedure that takes any sequence $(\pi_1, x_1, r_1, \dots, \pi_k, x_k, r_k)$ and outputs some policy $\pi$. At the $t$-th round of the bandit game, our algorithm chooses some index $i_t$. Then we obtain a policy $\pi_t$ from $B_{i_t}$ and take action $a_t = \pi_t(x_t)$. The policy $\pi_t$ is the output of $B_{i_t}$ on the input sequence of policies, contexts, and rewards for all the prior rounds on which we chose this same index $i_t$. After receiving the reward $r_t$, we send this reward as feedback to $B_{i_t}$.
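Concretely, the interaction between the meta-algorithm and the base algorithms can be sketched as a small protocol; the class and function names below are illustrative, not from the paper:

```python
class BaseBandit:
    """Minimal interface a base algorithm must expose (illustrative names)."""

    def propose(self, context):
        """Return the next action suggestion (the output of the policy)."""
        raise NotImplementedError

    def update(self, context, action, reward):
        """Consume feedback for a round on which this algorithm was chosen."""
        raise NotImplementedError


class ConstantArm(BaseBandit):
    """A trivial base algorithm that always plays one fixed arm."""

    def __init__(self, arm):
        self.arm = arm

    def propose(self, context):
        return self.arm  # the "policy" is a constant function of the context

    def update(self, context, action, reward):
        pass  # nothing to learn


def meta_round(bases, choose_index, context, reward_fn):
    """One round of the meta-protocol: only the chosen base algorithm
    proposes an action and receives the resulting reward as feedback."""
    i = choose_index()
    action = bases[i].propose(context)
    reward = reward_fn(context, action)
    bases[i].update(context, action, reward)
    return i, action, reward


bases = [ConstantArm(0), ConstantArm(1)]
i, action, reward = meta_round(
    bases, choose_index=lambda: 1, context=None,
    reward_fn=lambda c, a: 1.0 if a == 1 else 0.0)
```

Note that only the chosen base algorithm sees the round's feedback; the others' internal state is untouched, which is what makes the per-algorithm bookkeeping of the next subsection clean.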

In order to formalize this analysis more cleanly, for all $i \in [N]$ and $k \in [T]$, we define independent random variables $x^i_k$, where each $x^i_k$ is distributed as a context. Further, we define random variables $\pi^i_k$ and $r^i_k$ such that $\pi^i_k$ is the output of $B_i$ on the input sequence $(\pi^i_1, x^i_1, r^i_1, \dots, \pi^i_{k-1}, x^i_{k-1}, r^i_{k-1})$ and $r^i_k$ is a reward obtained by choosing policy $\pi^i_k$ with context $x^i_k$. We also define the random variables $\mu^i_k = \mu(\pi^i_k)$.

Then we can rephrase the high-level action of our algorithm as follows: we choose some index $i_t$, and let $k_i(t)$ be the number of rounds $s \le t$ with $i_s = i$. Then we play policy $\pi^{i_t}_{k_{i_t}(t)}$, see context $x^{i_t}_{k_{i_t}(t)}$, and obtain reward $r^{i_t}_{k_{i_t}(t)}$. Note that the distribution of the observed reward as well as the expected reward is maintained by this description of the random variables.

2.1 Assumptions

We will consider two settings for the base algorithms in our analysis. First, a high-probability setting in which we wish to provide a regret bound that holds with probability at least $1 - \delta$ for some given $\delta$. Second, an in-expectation setting in which we simply wish to bound the expected value of the regret. In the first setting, we require high-probability bounds on the regret of the base algorithms, while in the second we do not. Our approach to the in-expectation setting will be a construction that uses our high-probability algorithm as a black box. Thus, the majority of our analysis takes place in the high-probability setting (in Section 3), for which we now describe the assumptions.

In the high-probability setting, we assume there are (known) numbers $R_i$ and $\alpha_i$, a $\delta \in (0, 1)$, and an (unknown) set $S \subseteq [N]$ such that with probability at least $1 - \delta$, there is some $i^\star \in S$ such that for all $t$,

$$t\mu^\star - \sum_{k=1}^{t} \mu^{i^\star}_k \le R_{i^\star} t^{\alpha_{i^\star}}. \tag{3}$$

Intuitively, this says that each algorithm $B_i$ comes with a putative regret bound $R_i t^{\alpha_i}$, and with high probability there is some $i^\star \in S$ for which its claimed regret bound is in fact correct. The assumption is stronger as $S$ becomes smaller, and our final results will depend on the size of $S$. In Section F, we provide some examples of algorithms that satisfy requirement (3). Generally, it turns out that most algorithms based on the optimism principle can be made to work in this setting for any desired $\delta$. We will refer to an index $i^\star$ that satisfies (3) as being “well-specified”.

To gain intuition about our results, we recommend that the reader suppose that $S = \{i^\star\}$ is a singleton for some unknown index $i^\star$ (note that $S$ may be much smaller than the set of indices for which (3) holds). As a concrete example, suppose the context is constant and there are $K$ arms, so that we are playing a classic $K$-armed bandit problem. Let $B_1$ be an instance of UCB restricted to the first $K/2$ arms, while $B_2$ is an instance of UCB restricted to the last $K/2$ arms. Then we can set $\alpha_1 = \alpha_2 = 1/2$ and $R_1, R_2 = \tilde{O}(\sqrt{K})$. Now, depending on which arm is optimal, we may have $S = \{1\}$ or $S = \{2\}$. Although $S = \{1, 2\}$ would also satisfy the assumptions, since our results improve when $S$ is smaller and $S$ is unknown to the algorithm, we are free to choose the smallest possible $S$.
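The split-UCB example can be simulated directly. The sketch below is illustrative: it uses the standard UCB1 confidence radius and a Bernoulli reward model, neither of which is prescribed by the paper, and the arm means are invented for the demonstration:

```python
import math
import random

class SubsetUCB:
    """UCB1 restricted to a fixed subset of arms (illustrative sketch)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.sums = {a: 0.0 for a in self.arms}
        self.t = 0

    def propose(self):
        self.t += 1
        for a in self.arms:          # play each arm in the subset once first
            if self.counts[a] == 0:
                return a
        def ucb(a):
            mean = self.sums[a] / self.counts[a]
            return mean + math.sqrt(2.0 * math.log(self.t) / self.counts[a])
        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


# Two base algorithms covering disjoint halves of K arms, as in the text.
K = 6
means = [0.1, 0.2, 0.3, 0.9, 0.2, 0.1]  # the best arm (index 3) is in the second half
A1 = SubsetUCB(range(K // 2))
A2 = SubsetUCB(range(K // 2, K))
rng = random.Random(0)
total = {1: 0.0, 2: 0.0}
for _ in range(2000):
    for name, alg in ((1, A1), (2, A2)):
        a = alg.propose()
        r = 1.0 if rng.random() < means[a] else 0.0  # Bernoulli reward
        alg.update(a, r)
        total[name] += r
# Only A2 can ever play the optimal arm, so only A2 is "well-specified" here.
```

Here $S = \{2\}$: `A2` quickly concentrates on the optimal arm, while `A1` is capped at the best reward available within its half.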

These assumptions may seem stronger than prior work at first glance: not only do we require high-probability rather than in-expectation regret bounds, we require at least one base algorithm to be well-specified, and we require knowledge of the putative regret bounds through the coefficients $R_i$ and $\alpha_i$. Prior work in this setting (e.g., [5, 23]) dispenses with the last two requirements, and [5] also dispenses with the high-probability requirement. However, it turns out that through a simple doubling construction, we can easily incorporate base algorithms with unknown regret bounds that hold only in expectation into our framework. The ability to handle unknown regret bounds also dispenses with the well-specified assumption. We describe this construction and the relevant assumptions in Section 5, and show that it weakens our analysis only by log factors.

3 UCB over Bandits

In this section, we describe our meta-learner for the high-probability setting. The intuition is based on upper confidence bounds: first, we observe that the unknown well-specified algorithm $B_{i^\star}$'s rewards behave very similarly to independent bounded random variables with mean $\mu^\star$, in that their average value concentrates about $\mu^\star$. From this, one might imagine that for any index $i$, the empirical average reward plus a confidence radius gives some kind of upper confidence bound for the final total reward of algorithm $B_i$. We could then feed these estimates into a UCB-style algorithm that views the $N$ base algorithms as arms. Unfortunately, such an approach is complicated by two issues. First, the putative regret bound for each $B_i$ may not actually hold, which could damage the validity of our confidence estimates. Second, the confidence bounds for different algorithms may have very imbalanced behavior due to the different values of $R_i$ and $\alpha_i$, and we would like our final regret bound to depend only on $R_{i^\star}$ and $\alpha_{i^\star}$.

We address the first issue by keeping track of how far each algorithm's cumulative reward falls below what its putative regret bound predicts. If at any time this deficit exceeds what the putative bound allows, then we can conclude that $i$ is not the well-specified index, and so we simply discard $B_i$. Moreover, it turns out that so long as an algorithm has not yet been discarded, its rewards are “well-behaved” enough that our meta-UCB algorithm can operate correctly.

We address the second issue by employing shifted confidence intervals, in a construction analogous to that employed by [21], who designed an $N$-armed bandit algorithm with a regret bound that depends on the identity of the best arm. Essentially, each algorithm $B_i$ is associated with a target regret bound $T_i$, and each confidence interval is decreased by an amount proportional to $T_i$. Assuming the $T_i$ satisfy some technical conditions, this guarantees that the regret is at most $\tilde{O}(T_i)$ for any $i$ such that $B_i$ is well-specified.

Formally, our algorithm is provided in Algorithm 1, and its analysis is given in Theorem 1, proved in Appendix B. Note that little effort has been made to optimize the constants or log factors.

  Input: Bandit algorithms $B_1, \dots, B_N$, numbers $R_1, \dots, R_N$, $\alpha_1, \dots, \alpha_N$, target regrets $T_1, \dots, T_N$, and failure probability $\delta$.
  Initialize the active set $S \leftarrow \{1, \dots, N\}$, and set $k_i \leftarrow 0$ and $\hat{\mu}_i \leftarrow 0$ for all $i$.
  for $t = 1, \dots, T$ do
     Set $\mathrm{UCB}_i \leftarrow \hat{\mu}_i + (\text{confidence radius of } B_i) - T_i/T$ for all $i \in S$.
     Set $i_t \leftarrow \operatorname{argmax}_{i \in S} \mathrm{UCB}_i$.
     Update $k_{i_t} \leftarrow k_{i_t} + 1$.
     Get the $k_{i_t}$-th policy from $B_{i_t}$. See context $x_t$ and play the suggested action.
     Receive reward $r_t$, and provide reward $r_t$ and context $x_t$ as feedback to $B_{i_t}$.
     Update the empirical mean $\hat{\mu}_{i_t}$.
     if the misspecification test fails for $i_t$ (its cumulative reward falls below what its putative bound allows) then
        Remove $i_t$ from the active set: $S \leftarrow S \setminus \{i_t\}$.
     end if
  end for
Algorithm 1 Bandit Combiner
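A minimal sketch of the combiner's control flow, under the assumptions above, might look as follows. The confidence radius, the shift $T_i/T$, and the misspecification threshold here are illustrative stand-ins for the paper's exact constants:

```python
import math

def bandit_combiner(N, R, alpha, targets, T, play_round):
    """Illustrative sketch of a UCB-style combiner (constants are not the
    paper's).  R[i], alpha[i] give base i's putative regret bound
    R[i] * t**alpha[i]; targets[i] is its target regret T_i;
    play_round(i) plays one round of base i and returns a reward in [0, 1]."""
    active = set(range(N))
    counts = [0] * N
    sums = [0.0] * N
    for t in range(1, T + 1):
        def index(i):
            # Shifted UCB: empirical mean + confidence radius - T_i / T.
            if counts[i] == 0:
                return float("inf")
            mean = sums[i] / counts[i]
            radius = math.sqrt(2.0 * math.log(T) / counts[i])
            return mean + radius - targets[i] / T
        i = max(active, key=index)
        r = play_round(i)
        counts[i] += 1
        sums[i] += r
        # Misspecification test: discard i if its cumulative reward falls
        # short of the best empirical mean by more than its putative bound
        # plus a martingale confidence term (threshold illustrative).
        best_mean = max(sums[j] / counts[j] for j in active if counts[j] > 0)
        slack = R[i] * counts[i] ** alpha[i] + math.sqrt(8.0 * counts[i] * math.log(T))
        if best_mean * counts[i] - sums[i] > slack and len(active) > 1:
            active.discard(i)
    return sums, counts

# Deterministic toy run: base 0 always earns 0.9, base 1 always earns 0.1.
sums, counts = bandit_combiner(
    N=2, R=[1.0, 1.0], alpha=[0.5, 0.5],
    targets=[500 ** 0.5, 500 ** 0.5], T=500,
    play_round=lambda i: 0.9 if i == 0 else 0.1)
```

As the shifted indices and the elimination test predict, the weaker base is sampled only long enough to shrink its confidence radius below the reward gap.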
Theorem 1.

Suppose there is a set $S$ such that with probability at least $1 - \delta$, there is some $i^\star \in S$ such that the putative regret bound (3) holds for all $t$. Further, suppose the $R_i$ and $\alpha_i$ are known, and the target regrets $T_i$ satisfy:

Let $\mu_t$ be the expected reward of Algorithm 1 at time $t$. Then, with probability at least $1 - \delta$, the regret satisfies:

Note that the algorithm does not know the set $S$.

The conditions on the $T_i$ in this theorem are somewhat opaque, so to unpack them a bit we provide the following corollary:

Corollary 2.

Suppose there is a set $S$ such that with probability at least $1 - \delta$, there is some $i^\star \in S$ such that (3) holds for all $t$. Further, suppose we are given positive real numbers $c_1, \dots, c_N$. Set the $T_i$ via:

Then, with probability at least $1 - \delta$, the regret of Algorithm 1 satisfies:

In most settings, $\alpha_i \ge 1/2$ and $R_i \ge 1$, so that the second term above is of lower order than the first.

Proof sketch of Theorem 1.

While the full proof of Theorem 1 is deferred to the appendix, we sketch the main ideas here. For simplicity, we consider the case that $\alpha_i = 1/2$ for all $i$, assume that $S = \{i^\star\}$ is a singleton set, and drop all log factors and constants. Then, by martingale concentration bounds combined with the high-probability regret bound on $B_{i^\star}$, the average reward of $B_{i^\star}$ concentrates about $\mu^\star$ for all $t$ with high probability. Furthermore, by martingale concentration again, the average reward of every other algorithm concentrates about its own expected value. Therefore, an algorithm is dropped from the active set only if its putative regret bound has been violated, which does not happen to algorithm $B_{i^\star}$ with probability at least $1 - \delta$. Further, by the definition of the misspecification test and another martingale bound, all surviving algorithms satisfy their putative bounds up to confidence terms with high probability. Let us consider the instantaneous regret on some round $t$ in which $i_t \ne i^\star$. In this case, we must have $\mathrm{UCB}_{i_t} \ge \mathrm{UCB}_{i^\star}$, so that we can write:

Summing over all timesteps for which algorithm $B_i$ is chosen, we have

Now, summing over all indices $i \ne i^\star$, we use the concentration bounds and the conditions on the $T_i$ to conclude that the regret over all rounds in which $B_{i^\star}$ is not chosen is at most $\tilde{O}(T_{i^\star})$. For the rounds in which $B_{i^\star}$ is chosen, we experience regret at most its own regret bound, which concludes the theorem. ∎

4 Gap-dependent regret bounds

In this section, we provide an analog of the standard “gap-dependent” bound for UCB. As a motivating example, consider the setting in which every $B_i$ for $i \ne i^\star$ never plays any policy $\pi$ with $\mu(\pi) > \mu^\star - \Delta$ for some gap $\Delta > 0$. In this case, we might hope to perform much better, in the same way that standard UCB obtains logarithmic regret when the suboptimal arms have a non-negligible gap between their rewards and the optimal reward. Specifically, we have the following result, whose proof is deferred to Section C:

Theorem 3.

Suppose that there is some $i^\star$ such that with probability at least $1 - \delta$, we have:

Also, for all $i \ne i^\star$, define $\Delta_i$ and $T_i$ by:

And let $G \subseteq [N]$ with $i^\star \notin G$ be the set of indices $i$ for which $\Delta_i$ is sufficiently large.

For $i \in G$, let $T_i$ be as defined above, and let $T_i$ for $i \notin G$ be arbitrary. Then with probability at least $1 - \delta$, the regret of Algorithm 1 satisfies:

Note that we have placed no conditions on $T_{i^\star}$ in this expression. In particular, consider the case in which each algorithm considers only a subset of the possible policies, and $B_{i^\star}$ is the only algorithm that is allowed to choose the optimal policy with reward $\mu^\star$. Then $\Delta_i$ is at least the gap between the reward of the best policy available to $B_i$ and $\mu^\star$. Thus, when these gaps are large enough, $G$ will contain all indices except $i^\star$, so that the overall regret provided in Theorem 3 is logarithmic in $T$.

As another example of this theorem in action, let us consider the setting studied by [6]. Specifically, each $B_i$ has a putative regret bound of $R_i t^{\alpha_i}$ for all $t$, and $B_{i^\star}$ in fact obtains its bound. However, for all $i \ne i^\star$, $B_i$ also suffers instantaneous regret at least $\Delta_i$ for all $t$ for some constant $\Delta_i > 0$. For example, this might occur if each such $B_i$ is restricted to some subset of actions that does not include the best action. Now, recall that we made no restrictions on $T_{i^\star}$ in Theorem 3, so we are free to set the $T_i$ appropriately for all $i$. Then we will have $G = [N] \setminus \{i^\star\}$ and obtain the following corollary:

Corollary 4.

Suppose the $B_i$ are such that for some $i^\star$, $B_{i^\star}$ guarantees its putative regret bound for all $t$. Further, suppose that for all $i \ne i^\star$, $B_i$ suffers instantaneous regret at least $\Delta_i$ for all $t$ for some constant $\Delta_i > 0$. Then, with the $T_i$ set as in Theorem 3, with probability at least $1 - \delta$, Algorithm 1 guarantees regret:

Notably, the first term is the actual regret of $B_{i^\star}$ rather than its regret bound $R_{i^\star} T^{\alpha_{i^\star}}$. Thus, if $B_{i^\star}$ outperforms this bound and obtains logarithmic regret, our combiner algorithm will also obtain logarithmic regret, which is not obviously possible using techniques based on the Corral algorithm [5]. Note that this result also appears to improve upon [6] (Theorem 4.2) by removing a factor, but this is because we have assumed knowledge of the time horizon $T$ in order to set the $T_i$.

5 Unknown and In-Expectation Bounds on Base Algorithms

In this section, we show how to remedy two surface-level issues with Algorithm 1. First, it requires knowledge of the values $R_i$ and $\alpha_i$. Second, it requires a high-probability regret bound for the well-specified base algorithm. Here, we show that a simple duplication- and doubling-based technique suffices to address both issues.

First, let us gain some intuition for how to convert an in-expectation bound into a high-probability bound suitable for use in Theorem 1. Suppose we are given an algorithm $B$ that maintains expected regret $R t^{\alpha}$. We duplicate this algorithm $O(\log(1/\delta))$ times. By Markov's inequality, each individual duplicate obtains regret at most $2 R t^{\alpha}$ with probability at least $1/2$. Therefore, with probability at least $1 - \delta$, at least one of the duplicates obtains regret at most $2 R t^{\alpha}$. Then, in the terminology of Theorem 1, we let $S$ be the set of duplicate algorithms, and so we satisfy the hypothesis of the theorem. This argument is slightly flawed as-is because we need an anytime regret bound for the base algorithms, but it turns out this is fixable by another application of Markov's inequality and a union bound. We then use a variant of the doubling trick to avoid requiring knowledge of $R$ and $\alpha$. The full construction is described below in Theorem 5, with proof in Appendix D.
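The duplication count can be worked out in a few lines. The helper names below are ours; the calculation simply instantiates the Markov-plus-independence argument above:

```python
import math

def num_duplicates(delta):
    """Copies m so that, if each copy independently satisfies its doubled
    expected-regret bound with probability >= 1/2 (Markov's inequality),
    then all m copies fail simultaneously with probability <= delta."""
    return math.ceil(math.log2(1.0 / delta))

def all_fail_probability(m, p_fail=0.5):
    """P(every one of m independent duplicates violates its bound)."""
    return p_fail ** m

delta = 0.01
m = num_duplicates(delta)  # 7 copies suffice for delta = 0.01
```

Note the logarithmic cost: halving $\delta$ adds only one more duplicate, which is why the construction weakens the analysis only by log factors.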

Theorem 5.

Suppose that for some $i^\star$, there are some unknown $R$ and $\alpha$ such that $B_{i^\star}$ ensures expected regret at most $R t^{\alpha}$ for all $t$. Further, suppose we are given positive numbers $c_1, \dots, c_N$, and let $\delta$ be some user-specified failure probability. We duplicate each $B_i$ a logarithmic number of times over a doubling grid of candidate coefficients, specifying each duplicate by a multi-index, and to each duplicate we associate the corresponding candidate values of $R$ and $\alpha$. Specify the target regrets as a function of these values as described in Corollary 2. Then, with probability at least $1 - \delta$, Algorithm 1 guarantees regret:

so that the expected regret is bounded by:

6 Examples and Optimality

In this section, we provide some illustrative examples of how our approach can be used. We will also highlight a few examples in which our construction matches lower bounds. The proofs are straightforward applications of Theorem 1 and Corollary 2, and are deferred to Appendix E.

6.1 K-Armed Bandits

For our first example, suppose that the action space is a finite set of $K$ arms and $\Pi$ consists of the constant functions mapping all contexts to a single arm. This setup describes the classic $K$-armed bandit problem. Let $N = K$ and suppose each $B_i$ is a naive algorithm that simply pulls arm $i$ on every round. We consider $B_i$ to be well-specified if the $i$-th arm is in fact the optimal arm, in which case it is clear we may take its putative regret bound to be zero. In the high-probability setting, we let $S$ be the singleton set containing only the unknown optimal index. In this case, the conditions on the $T_i$ in Theorem 1 correspond almost exactly (up to constants and log factors) with the Pareto frontier for regret bounds described in [21], showing that using our construction in this setting allows us to match this lower-bound frontier.
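In this example the combiner over constant-arm bases is itself just a $K$-armed bandit algorithm. The sketch below makes this concrete with a plain UCB index over the bases, an illustrative simplification of Algorithm 1 (no target-regret shifts, invented arm means):

```python
import math
import random

def combiner_over_constant_arms(means, T, seed=0):
    """Combiner over N = K constant-arm base algorithms; since each base
    just pulls one fixed arm, this reduces to plain UCB over the arms."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, T + 1):
        def ucb(i):
            if counts[i] == 0:
                return float("inf")
            return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])
        i = max(range(K), key=ucb)
        r = 1.0 if rng.random() < means[i] else 0.0  # Bernoulli reward
        counts[i] += 1
        sums[i] += r
    return counts

# The well-specified base (the one pulling the best arm) is chosen most often.
counts = combiner_over_constant_arms([0.2, 0.5, 0.8], T=3000)
```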

6.2 Misspecified Linear Bandit

For our second example, suppose that the action space is a finite set of $K$ arms, and that the context is constant and provides a feature vector $v_a \in \mathbb{R}^d$ for each arm $a$. The space of policies is again the set of constant functions. In this case, it is possible that the reward is a fixed linear function $f(a) = \langle \theta, v_a \rangle$ for some $\theta \in \mathbb{R}^d$, in which case the linUCB algorithm [13] can obtain regret $\tilde{O}(d\sqrt{T})$. On the other hand, in general the reward might be totally unrelated to the context, in which case one might wish to fall back on the UCB algorithm, which obtains regret $\tilde{O}(\sqrt{KT})$. By setting $B_1$ to be linUCB and $B_2$ to be ordinary UCB, we have $1 \in S$ if the rewards are indeed linear, and $2 \in S$ otherwise. Further, we set $\alpha_1 = \alpha_2 = 1/2$, $R_1 = \tilde{O}(d)$, and $R_2 = \tilde{O}(\sqrt{K})$. Now let $T_1$ and $T_2$ be any two numbers whose product is sufficiently large and each of which exceeds the corresponding base bound. Then an appropriate application of Corollary 2 yields regret $\tilde{O}(T_1)$ in the linear setting and $\tilde{O}(T_2)$ in general. This again matches the frontier of regret bounds for this scenario described in Theorem 24.4 of [20] (see also Lemma 6.1 of [23]). Formally, we have the following corollary:

Corollary 6.

Suppose the context is constant and the action set has $K$ arms. Let $B_1$ be an instance of linUCB and $B_2$ an instance of the ordinary UCB algorithm. We consider two cases: either the reward is a linear function of the features, or it is not. Then $B_1$ guarantees regret $\tilde{O}(d\sqrt{T})$ with high probability in the first case, while $B_2$ guarantees regret $\tilde{O}(\sqrt{KT})$ in the second case. Let $T_1$ and $T_2$ be any two numbers satisfying the conditions of Corollary 2. Then, using the construction of Corollary 2, with probability at least $1 - \delta$, we guarantee regret $\tilde{O}(T_1)$ with linear rewards, and $\tilde{O}(T_2)$ otherwise.

Note that we leverage our ability to use non-uniform values $T_i$ in this corollary. It is not obvious how to obtain this full frontier using the prior uniform bound (1), although it is of course conceivable that a more detailed analysis of prior algorithms might allow for this same result.
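For concreteness, here is a compact sketch of a finite-arm linUCB base algorithm of the kind used above. The diagonal approximation to the exploration bonus, the tiny Gauss-Jordan solver, and all constants are simplifications for illustration, not the paper's implementation:

```python
import math

class LinUCBFinite:
    """Sketch of linUCB for a finite arm set with fixed features.  The
    exploration bonus uses a diagonal surrogate for x^T A^{-1} x to keep
    the example short; constants are illustrative."""

    def __init__(self, features, reg=1.0, beta=1.0):
        self.X = [tuple(x) for x in features]
        self.d = len(self.X[0])
        self.beta = beta
        # Ridge-regression sufficient statistics: A = reg*I, b = 0.
        self.A = [[reg if i == j else 0.0 for j in range(self.d)]
                  for i in range(self.d)]
        self.b = [0.0] * self.d

    def _theta(self):
        # Solve A theta = b by Gauss-Jordan elimination (fine for small d).
        n = self.d
        M = [row[:] + [self.b[i]] for i, row in enumerate(self.A)]
        for c in range(n):
            p = max(range(c, n), key=lambda r: abs(M[r][c]))
            M[c], M[p] = M[p], M[c]
            for r in range(n):
                if r != c and M[r][c] != 0.0:
                    f = M[r][c] / M[c][c]
                    for k in range(c, n + 1):
                        M[r][k] -= f * M[c][k]
        return [M[i][n] / M[i][i] for i in range(n)]

    def propose(self):
        theta = self._theta()
        def score(a):
            x = self.X[a]
            est = sum(th * xi for th, xi in zip(theta, x))
            bonus = self.beta * math.sqrt(
                sum(xi * xi / self.A[i][i] for i, xi in enumerate(x)))
            return est + bonus
        return max(range(len(self.X)), key=score)

    def update(self, a, r):
        x = self.X[a]
        for i in range(self.d):
            self.b[i] += r * x[i]
            for j in range(self.d):
                self.A[i][j] += x[i] * x[j]

# Noiseless linear rewards: arm features and a hidden parameter vector.
feats = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
theta_true = (1.0, 0.0)
alg = LinUCBFinite(feats)
pulls = [0, 0, 0]
for _ in range(200):
    a = alg.propose()
    r = sum(th * x for th, x in zip(theta_true, feats[a]))
    alg.update(a, r)
    pulls[a] += 1
```

When the rewards really are linear, this base rapidly locks onto the best arm; when they are not, the combiner's misspecification test allows falling back to the ordinary UCB base.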

6.3 Linear Model Selection

For our third example, we consider model selection for linear bandits. In this setting, the context again specifies a feature vector $v_a \in \mathbb{R}^d$ for each arm $a$, and we are guaranteed that the reward is a linear function of the context. The question now is whether all $d$ dimensions are actually necessary. Specifically, if there is some $d^\star < d$ such that the reward is in fact a linear function of only the first $d^\star$ coordinates of the context, then we would like our regret to depend on $d^\star$ rather than $d$. This setting has been studied before in the context of a finite set of actions in [15, 12]. These prior works impose some additional technical conditions on the distribution of rewards and contexts provided by the environment; under their respective conditions, [12] and [15] obtain regret rates that depend on $d^\star$. In contrast, we require no extra conditions, and obtain a regret depending on $d^\star$ rather than $d$. The construction is detailed in the following corollary:

Corollary 7.

Suppose the reward is always a linear function of the context, and in fact is purely a linear function of the first $d^\star$ coordinates. Suppose the action set has finite cardinality $K$. For each $i$, let $B_i$ be an instance of linUCB [13] restricted to the first $2^i$ coordinates of the context, and set the $R_i$, $\alpha_i$, and $T_i$ accordingly. Then, using the instantiation of Algorithm 1 from Corollary 2, we obtain a regret depending on $d^\star$ rather than $d$. If instead the action set is infinite, let $B_i$ be an instance of the linUCB algorithm for infinite arms [2, 14] restricted to the first $2^i$ coordinates, with the parameters set analogously; we then obtain the corresponding regret bound.

7 Experimental Validation

(a) Linear Rewards
(b) Non-Linear Rewards
Figure 1: Misspecified Linear Bandit
Figure 2: Model Selection Experiments

We now demonstrate empirical validation of our results in two different application settings. For our first experiment, we consider the misspecified linear bandit setting. We use ordinary UCB and linUCB as the two base algorithms. For simplicity, we focus on the ordinary stochastic bandit framework (i.e., we assume the context remains fixed over time). Each arm $a$ is associated with a feature vector $v_a$ chosen from the uniform distribution on the unit sphere. Let $\theta$, also chosen from the unit sphere, be a fixed unknown parameter vector. Finally, for each arm $a$, we choose a quantity to be specified later. The reward for arm $a$ at any time step is set to be where