The contextual bandit paradigm involves sequential decision-making settings in which we repeatedly pick one out of $K$ actions (or “arms”) in the presence of contextual side information. Algorithms for this problem usually involve policies that map the contextual information to a chosen action, and the reward feedback is typically limited in the sense that it is only obtained for the action that was chosen. The goal is to maximize the total reward over several ($n$) rounds of decision-making, and the performance of an online algorithm is typically measured in terms of regret with respect to the best policy within some policy class that is fixed a priori. Applications of this paradigm include advertisement placement and web article recommendation [li2010contextual, agarwal2016making], as well as clinical trials and mobile health-care [woodroofe1979one, tewari2017ads].
The contextual bandit problem can be thought of as an online supervised learning problem (over policies mapping contexts to actions) with limited information feedback, and so the optimal regret bounds scale like $O(\sqrt{Kn \log |\Pi|})$, where $\log |\Pi|$ is a natural measure of the sample complexity of the policy class $\Pi$ [auer2002nonstochastic, mcmahan2009tighter, beygelzimer2011contextual]. These bounds are typically achieved by algorithms that are computationally inefficient (with running time linear in the size of the policy class). Much of the research in contextual bandits has tackled computational efficiency [langford2008epoch, agarwal2014taming, rakhlin2016bistro, syrgkanis2016efficient, syrgkanis2016improved, foster2018contextual]: do there exist computationally efficient algorithms that achieve the optimal regret guarantee? A question that has received relatively less attention involves the choice of policy class itself. Even for a fixed regret-minimizing algorithm, the choice of policy class is critical to maximizing the overall reward of the algorithm. As can be seen in applications of contextual bandit models to article recommendation [li2010contextual], the choice is often made in hindsight, and more complex policy classes are used if the algorithm is run for more rounds. A quantitative understanding of how to do this is still lacking, and intuitively, we should expect the optimal choice of policy class not to be static. Ideally, we could design adaptive contextual bandit algorithms that would initially use simple policies, and switch over to more complex ones as more data is obtained.
Theoretically, what this means is that the regret bounds derived for a contextual bandit algorithm are only meaningful for rewards that are generated by a policy within the policy class to which the algorithm is tailored. If the rewards are derived from a “more complex” policy outside the policy class, even the optimal policy in the class may neglect obvious patterns and obtain a very low reward. If the rewards are derived from a policy that is expressible by a much smaller class, the regret that is accumulated is unnecessary. Let us view this through the lens of the simplest possible example: the standard linear contextual bandit [chu2011contextual] paradigm, where we can choose one out of $K$ arms and rewards are generated according to the process
$$g_{a_t, t} = \mu_{a_t} + \langle \theta^*, x_{a_t, t} \rangle + \eta_t,$$
where $\mu_i$ represents a “bias” of arm $i$, $\theta^* \in \mathbb{R}^d$ represents the linear parameter of the model (which is shared across all arms\footnote{This is the model that was described in [chu2011contextual]. It is worth noting that more complex variants of this model with a separate $\theta^*_i$ for every arm $i$ have also been empirically evaluated [li2010contextual].}), $x_{a_t, t}$ represents the contextual information and $\eta_t$ represents noise in the reward observations.
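As a concrete illustration, the reward process above can be simulated in a few lines. This is a sketch under assumed values: the variable names (`mu`, `theta`, `sigma`) and all chosen constants are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, sigma = 5, 10, 0.1            # arms, context dimension, noise scale (assumed)

mu = rng.uniform(-1, 1, size=K)     # per-arm biases
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)      # shared linear parameter, normalized to unit norm

def draw_rewards(x):
    """Reward of arm i: bias + <theta, context of arm i> + sub-Gaussian noise."""
    # x: (K, d) matrix whose i-th row is the context vector of arm i this round
    return mu + x @ theta + sigma * rng.normal(size=K)

x = rng.normal(size=(K, d)) / np.sqrt(d)   # one round of per-arm contexts
r = draw_rewards(x)                        # vector of K realized rewards
```

Setting `theta` to the zero vector recovers the context-free (simple) model discussed next.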
It is well-known that variants of linear upper confidence bound algorithms like $\mathrm{LinUCB}$ [chu2011contextual] and $\mathrm{OFUL}$ [abbasi2011improved]\footnote{Guarantees for $\mathrm{OFUL}$ were established under slightly different constraints on $\theta^*$ and the context vectors, which led to a regret bound of $\widetilde{O}(d\sqrt{n})$. We show in Lemma 6 that a slight variant of $\mathrm{OFUL}$ has its regret bounded by $\widetilde{O}((K+d)\sqrt{n})$ in our setting.} suffer at most $\widetilde{O}((K+d)\sqrt{n})$\footnote{The $\widetilde{O}(\cdot)$ notation hides poly-logarithmic factors.} regret with respect to the optimal linear policy. However, setting $\theta^* = 0$ yields the important case of the reward distribution being independent of the contextual information. Here, a simple upper confidence bound algorithm like $\mathrm{UCB}$ [auer2002finite] would yield the optimal $\widetilde{O}(\sqrt{Kn})$ regret bound, which does not depend on the dimension $d$ of the contexts. Thus, we pay substantial extra regret by using the algorithm meant for linear contextual bandits on such instances with much simpler structure. On the other hand, upper confidence bounds that ignore the contextual information will not guarantee any control on the policy regret: it can even be linear in $n$. It is natural to desire a single approach that adapts to the inherent complexity of the reward-generating model and obtains the optimal regret bound as if this complexity were known in hindsight. Specifically, this paper seeks an answer to the following question:
Does there exist a single algorithm that simultaneously achieves the $\widetilde{O}(\sqrt{Kn})$ regret rate on simple multi-armed bandit instances and the $\widetilde{O}((K+d)\sqrt{n})$ regret rate on linear contextual bandit instances?
1.1 Our contributions
We answer the question of simultaneously optimal regret rates in the multi-armed (“simple”) bandit regime and the linear contextual (“complex”) bandit regime affirmatively, under the condition that the contexts are generated from a stochastic process that yields covariates that are not ill-conditioned. Our algorithm, $\mathrm{OSOM}$ (for Optimistic Selection of Models), essentially exploits the best policy (simply the best arm) that is learned under the assumption of the simple reward model, while conducting a sequential statistical test for the presence of additional complexity in the model, and in particular for whether ignoring this additional complexity would lead to substantial regret. This is a simple statistical principle that could conceivably be generalized to arbitrary nested policy classes: we will see that the algorithm critically exploits the nesting of the simple bandit model within the linear contextual model.
1.2 Related work
The contextual bandit paradigm was first considered by [woodroofe1979one] to model clinical trials. Since then, it has been studied intensely, both theoretically and empirically, in many different application areas and under many different names. We point the reader to [tewari2017ads] for an extensive survey of the contextual bandit history and literature.
Treating policies as experts [auer2002nonstochastic] with careful control on the exploration distribution led to the optimal regret bounds of $O(\sqrt{Kn \log |\Pi|})$ in a number of settings. From an efficiency point of view (where efficiency is defined with respect to an arg-max-oracle that is able to compute the best greedy policy in hindsight), the first approach conceived was the epoch-greedy approach [langford2008epoch], which suffers a sub-optimal dependence of $n^{2/3}$ in the regret. More recently, “randomized-UCB”-style approaches [agarwal2014taming] have been conceived that retain the optimal regret guarantee with $\widetilde{O}(\sqrt{Kn / \log |\Pi|})$ calls to the arg-max-oracle. This question of computational efficiency has generated a lot of research interest [rakhlin2016bistro, syrgkanis2016efficient, syrgkanis2016improved, foster2018contextual]. The problem of policy class selection itself has received less attention in the research community, and how this is done in practice in a statistically sound manner remains unclear. An application of linear contextual bandits was to personalized article recommendation using hand-crafted features of users [li2010contextual]: two classes of linear contextual bandit models with varying levels of complexity were compared to simple (multi-armed) bandit algorithms in terms of overall reward (which in this application represented the click-through rate of ads). A striking observation was that the more complex models won out when the algorithm was run for a longer period of time (e.g., one day as opposed to half a day). Surveys on contextual bandits as applied to mobile health-care [tewari2017ads] have expressed a desire for algorithms that adapt their choice of policy class according to the amount of information they have received (e.g., the number of rounds). At a high level, we seek a theoretically principled way of doing this.
Perhaps the most relevant work to online policy class selection involves significant attempts to corral a band of base bandit algorithms into a meta-bandit framework [agarwal2017corralling]. The idea is to bound the regret of the meta-algorithm in terms of the regret of the best base algorithm in hindsight. (This is clearly useful for the policy class selection problem that we study here: one can corral together an algorithm designed for the linear model and one designed for the simple multi-armed bandit model.) The Corral framework is very general and can be applied to any set of base algorithms, whether efficient or not. This generality is attractive, but Corral is not the optimal choice of computationally efficient algorithm for the multi-armed-vs-linear-contextual bandit problem, for a couple of reasons.
It is not clear what (if any) choice of base algorithms would lead to a computationally efficient algorithm that is also statistically optimal in a minimax sense simultaneously for both problems.
The meta-algorithm framework uses an experts algorithm (in particular, mirror descent with the log-barrier regularizer and importance weighting on the base algorithms) to choose which base algorithm to play in each round. Thus, it is impossible to expect the instance-optimal regret rate of $O\left(\sum_{i : \Delta_i > 0} \frac{\log n}{\Delta_i}\right)$ on simple bandit instances. More generally, the Corral framework will not yield instance-optimal rates on any policy class\footnote{On our much simpler instance of bandit-vs-linear-bandit, we do obtain instance-optimal rates for at least the simple bandit model.}.
The Corral framework highlights the principal difficulty in contextual bandit model selection, which can be thought of as an even finer exploration-exploitation tradeoff: algorithms (designed for particular model classes) that fall out of favor in initial rounds could be picked very rarely, and the information required to truly perform model selection may be absent even after many rounds of play. Corral tackles this difficulty using the log-barrier regularizer for the meta-algorithm as a natural form of heightened exploration [foster2016learning], together with clever learning rate schedules\footnote{An undesirable side effect of using the log-barrier regularizer is a polynomial, rather than logarithmic, dependence on the number of policy classes in the regret bound.}. Related recent work [krishnamurthy2019contextual] adapts to the unknown Lipschitz constant of the optimal policy (the function from context to recommended action) in the stochastic contextual bandit problem with an abstract policy class and continuous action space.
Our stylistic approach to the model selection problem is a little different, as we focus on the much more specific case of two models: the simple multi-armed bandit model and the linear contextual bandit model. We encounter a similar difficulty, and the simplicity of the models lets us see its extent with striking clarity. On the other hand, we observe that commonly encountered sequences of contexts can help us carefully navigate the finer exploration-exploitation tradeoff when the model classes are nested.
Our algorithm ($\mathrm{OSOM}$) utilizes a simple “best-of-both-worlds” principle: exploit the possible simple reward structure in the model until (unless) there is significant statistical evidence for the presence of complex reward structure that would incur substantial complex policy regret if not exploited. This algorithmic framework is inspired by the initial “best-of-both-worlds” results for stochastic and adversarial multi-armed bandits; in particular, the “Stochastic and Adversarial Optimal” ($\mathrm{SAO}$) algorithm [bubeck2012best] (although the details of the phases of the algorithm and the statistical test are very different). In that framework, instances that are not stochastic (and could be thought of as “adversarial”) are not always detected as such by the test. The test is designed in an elegant manner such that the regret is optimally bounded on instances that are not detected as adversarial, even if an algorithm meant for stochastic rewards is used. Our test to distinguish between simple and complex instances shares this flavor: in fact, not all theoretically complex instances ($\theta^* \neq 0$) are detected as such.
Also related are results on contextual bandits with similarity information on the contexts, which automatically encodes a potentially easier learning problem [slivkins2014contextual]. The main novelty in these results involves adapting to such similarity online.
Technically, our proofs leverage the most recent set of theoretical results on regret bounds for linear bandits [abbasi2011improved], which can easily be applied to the linear contextual bandit model, and sophisticated self-normalized concentration bounds for our estimates of both the bias terms $\{\mu_i\}_{i=1}^{K}$ and the parameter vector $\theta^*$. For the latter, we find that the matrix Freedman inequality [oliveira2009concentration, tropp2011freedman] is particularly useful.
1.3 Problem Statement
At the beginning of each round $t$, the learner is required to choose one of $K$ arms and receives a reward associated with that arm. To help make this choice, the learner is handed a context vector at every round (essentially a concatenation of $K$ vectors $x_{1,t}, \ldots, x_{K,t}$, each of dimension $d$). Let $g_{i,t}$ denote the reward of arm $i$ and let $a_t$ denote the choice of the learner in round $t$. The rewards could be arriving from one of the two models described below:
Simple Model: Under the simple multi-armed bandit model, the mean rewards of arms are fixed and are not a function of the contexts. That is, at each round $t$,
$$g_{i,t} = \mu_i + \eta_{i,t},$$
where the $\eta_{i,t}$ are independent, identically distributed, zero-mean, $\sigma$-sub-Gaussian noise variables (defined below). Let the arm with the highest mean reward have mean $\mu^* = \max_i \mu_i$ and be indexed by $a^*$. The benchmark that the algorithm hopes to compete against is the pseudo-regret (henceforth regret for brevity),
$$R_n = \sum_{t=1}^{n} \left( \mu^* - \mu_{a_t} \right).$$
Define the gap $\Delta_i$ as the difference between the mean reward of the best arm and the mean reward of the $i^{th}$ arm, that is, $\Delta_i = \mu^* - \mu_i$. Previous literature on multi-armed bandits [Lai:1985:AEA:2609660.2609757] tells us that the best one can hope to do in this setting is regret of order $\sum_{i : \Delta_i > 0} \frac{\log n}{\Delta_i}$. Several algorithms, like upper confidence bounds ($\mathrm{UCB}$) [auer2002finite] and minimax-optimal strategies in the stochastic case ($\mathrm{MOSS}$) [audibert2010regret, degenne2016anytime], achieve this lower bound up to logarithmic (and constant) factors.
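The context-free index policy discussed here can be sketched as follows. This is a generic UCB1-style rule under assumed constants (noise scale, horizon, arm means), not the exact variant analyzed later; note that the context dimension $d$ appears nowhere in it.

```python
import numpy as np

def ucb_pulls(means, n=2000, sigma=0.1, seed=1):
    """UCB1-style index policy: context-free, so its regret never depends on d."""
    rng = np.random.default_rng(seed)
    K = len(means)
    pulls, sums = np.zeros(K), np.zeros(K)
    for t in range(n):
        if t < K:
            a = t                                     # initialize: pull each arm once
        else:
            bonus = sigma * np.sqrt(2.0 * np.log(t + 1) / pulls)
            a = int(np.argmax(sums / pulls + bonus))  # optimism under uncertainty
        sums[a] += means[a] + sigma * rng.normal()    # observe a noisy reward
        pulls[a] += 1
    return pulls

pulls = ucb_pulls([0.9, 0.5, 0.4])
best_fraction = pulls[0] / pulls.sum()  # share of pulls allocated to the best arm
```

On this toy instance with gaps $\Delta_2 = 0.4$ and $\Delta_3 = 0.5$, almost all pulls concentrate on the best arm.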
Complex Model: In this model, the mean reward of each arm is a linear function of the contexts (linear contextual bandits). We work with the following stochastic assumptions on the context vectors. Each of the context vectors $x_{1,t}, \ldots, x_{K,t}$ is drawn, independent of the past, from a distribution such that $x_{i,t}$ is independent of $\{x_{j,t}\}_{j \neq i}$ and, for all $i$ and $t$,
$$\mathbb{E}\left[ x_{i,t} \mid \mathcal{F}_{t-1} \right] = 0.$$
That is, the conditional means of the context vectors are zero.
In this complex model, we assume there exist an underlying linear predictor $\theta^* \in \mathbb{R}^d$ and biases $\mu_1, \ldots, \mu_K$ of the arms, such that the mean rewards of the arms are affine functions of the contexts, i.e.,
$$g_{i,t} = \mu_i + \langle \theta^*, x_{i,t} \rangle + \eta_{i,t}.$$
We impose compactness constraints on the parameters: in particular, we have $\|\theta^*\|_2 \le 1$ and $|\mu_i| \le 1$ for every arm $i$. Further, the noise variables $\eta_{i,t}$ are independent, identically distributed, zero-mean, and $\sigma$-sub-Gaussian. Clearly, simple model instances (which are parameterized only by the biases $\mu_1, \ldots, \mu_K$) can be expressed as complex model instances by setting $\theta^* = 0$.
At each round $t$, define $a_t^* \in \arg\max_i \left( \mu_i + \langle \theta^*, x_{i,t} \rangle \right)$ to be the best arm at round $t$. Here, we define pseudo-regret with respect to the optimal policy under the generative linear model:
$$R_n = \sum_{t=1}^{n} \left( \mu_{a_t^*} + \langle \theta^*, x_{a_t^*, t} \rangle - \mu_{a_t} - \langle \theta^*, x_{a_t, t} \rangle \right).$$
As noted above, past literature on this problem yielded algorithms like $\mathrm{LinUCB}$ [chu2011contextual] and $\mathrm{OFUL}$ [abbasi2011improved] that suffer only the minimax regret of $\widetilde{O}((K+d)\sqrt{n})$. As we will see in the simulations, these algorithms incur the dependence on the dimension $d$ in the regret bound even for simple instances.
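A minimal sketch of one round of a LinUCB/OFUL-style rule (ridge estimate plus ellipsoidal exploration bonus) may help fix ideas. The function name, the toy contexts, and the choice `alpha = 1.0` are illustrative assumptions, not the tuned confidence radius from [abbasi2011improved].

```python
import numpy as np

def linucb_round(V, b, contexts, alpha=1.0):
    """One round: ridge estimate theta_hat = V^{-1} b, then optimistic scores."""
    theta_hat = np.linalg.solve(V, b)
    Vinv = np.linalg.inv(V)
    scores = contexts @ theta_hat                     # estimated mean rewards
    # per-arm bonus alpha * ||x_i||_{V^{-1}} via the quadratic form x_i' V^{-1} x_i
    bonus = alpha * np.sqrt(np.einsum("ij,jk,ik->i", contexts, Vinv, contexts))
    return int(np.argmax(scores + bonus))             # arm with best optimistic score

d, K = 4, 3
V = np.eye(d)             # Gram matrix, initialized to lambda * I with lambda = 1
b = np.zeros(d)           # accumulated context-weighted rewards
contexts = np.eye(K, d)   # toy per-arm context vectors (one per row)
a = linucb_round(V, b, contexts)

# After observing reward r for arm a, the statistics update as:
x, r = contexts[a], 1.0
V += np.outer(x, x)
b += r * x
```

The $V^{-1}$-weighted bonus shrinks in directions already explored, which is exactly the source of the dimension dependence discussed above.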
Notation and Definitions
Given a vector $v$, let $v_i$ denote its $i^{th}$ component. For a vector $v$ we let $\|v\|_p$, for $p \ge 1$, denote its $\ell_p$-norm. Given a matrix $A$, we denote its operator norm by $\|A\|_{\mathrm{op}}$ and its Frobenius norm by $\|A\|_F$. Given a symmetric matrix $A$, let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote its largest and smallest eigenvalues. Given a positive definite matrix $V$, we define the norm of a vector $x$ with respect to the matrix $V$ as $\|x\|_V := \sqrt{x^\top V x}$. Let $\{\mathcal{F}_t\}_{t \ge 0}$ be a filtration. A stochastic process $\{\eta_t\}_{t \ge 1}$, where $\eta_t$ is measurable with respect to $\mathcal{F}_t$, is defined to be conditionally $\sigma$-sub-Gaussian for some $\sigma > 0$ if, for all $\lambda \in \mathbb{R}$, we have
$$\mathbb{E}\left[ e^{\lambda \eta_t} \mid \mathcal{F}_{t-1} \right] \le \exp\left( \frac{\lambda^2 \sigma^2}{2} \right).$$
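The norms just defined are straightforward to compute numerically. As a small sanity check (the diagonal matrix below is an arbitrary example of ours):

```python
import numpy as np

A = np.diag([2.0, 0.5])          # an arbitrary symmetric positive definite matrix
x = np.array([1.0, 1.0])

op_norm = np.linalg.norm(A, 2)   # operator norm = largest singular value
fro_norm = np.linalg.norm(A, "fro")
eigs = np.linalg.eigvalsh(A)     # eigenvalues of a symmetric matrix, ascending
lam_min, lam_max = eigs[0], eigs[-1]
x_V = np.sqrt(x @ A @ x)         # matrix-weighted norm ||x||_A = sqrt(x' A x)
```

Here $\|x\|_A^2 = 2 + 0.5 = 2.5$, and for this positive definite $A$ the operator norm coincides with $\lambda_{\max}(A)$.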
2 Construction of Confidence Sets
In our algorithm, which is presented subsequently, at the end of round $t$ we build an upper confidence estimate for each arm. Let $T_i(t)$ be the number of times arm $i$ was pulled and $\bar{\mu}_{i,t}$ be the average reward of that arm at the end of round $t$. For each arm $i$ we define the upper confidence estimate as follows,
$$\mathrm{UCB}_{i,t} := \bar{\mu}_{i,t} + \sigma \sqrt{\frac{1 + T_i(t)}{T_i^2(t)} \left( 1 + 2 \log\left( \frac{K (1 + T_i(t))^{1/2}}{\delta} \right) \right)}.$$
Lemma 6 in [abbasi2011improved] (restated as Lemma 1 here) uses a refined self-normalized martingale concentration inequality to bound $|\bar{\mu}_{i,t} - \mu_i|$ across all arms and all rounds.
Under the simple model, with probability at least $1 - \delta$, we have, for all arms $i$ and all rounds $t$,
$$|\bar{\mu}_{i,t} - \mu_i| \le \sigma \sqrt{\frac{1 + T_i(t)}{T_i^2(t)} \left( 1 + 2 \log\left( \frac{K (1 + T_i(t))^{1/2}}{\delta} \right) \right)}.$$
For any round $t$, let $\hat{\theta}_t$ be the $\lambda$-regularized least-squares estimate of $\theta^*$, which we define explicitly below:
$$\hat{\theta}_t := \left( X_{1:t}^\top X_{1:t} + \lambda I \right)^{-1} X_{1:t}^\top Y_{1:t},$$
where $X_{1:t}$ is the matrix whose rows are the context vectors of the arms selected from round $1$ up until round $t$, $X_{1:t} := [x_{a_1, 1}, \ldots, x_{a_t, t}]^\top$, and $Y_{1:t}$ is the vector of the corresponding de-biased rewards. Here we are regressing on the rewards seen so far to estimate $\theta^*$, while using the bias estimates obtained from our upper confidence estimates defined in Eq. (2).
We present a proof of this lemma in Appendix A.
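The $\lambda$-regularized least-squares estimate described above can be sketched in a few lines. The synthetic data, the choice $\lambda = 1$, and all names below are assumptions for illustration; in the algorithm, the response vector would hold rewards with the running bias estimates subtracted.

```python
import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """(X'X + lam I)^{-1} X'y: the lambda-regularized least-squares estimate."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 500, 5
theta_star = np.ones(d) / np.sqrt(d)           # ground-truth parameter, unit norm
X = rng.normal(size=(n, d))                    # stand-in for the selected contexts
y = X @ theta_star + 0.1 * rng.normal(size=n)  # stand-in for de-biased rewards
theta_hat = ridge_estimate(X, y)
err = float(np.linalg.norm(theta_hat - theta_star))
```

With $n = 500$ well-conditioned covariates and noise scale $0.1$, the estimation error is small, consistent with the concentration results used later.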
3 Algorithm and Main Result
The intuition behind Algorithm 1 is straightforward. The algorithm starts off by using the simple model estimate of the recommended action, i.e., the arm maximizing the upper confidence estimate in Eq. (2), until it has reason to believe that there is a benefit from switching to the complex model estimates. If the rewards are truly coming from the simple model, or from a complex model that is well approximated by a simple multi-armed bandit model, then Condition 7 will not be violated and the regret will continue to be bounded under either model. However, if Condition 7 is violated, then the algorithm switches to the complex estimates for the remaining rounds. The condition is designed using a threshold function whose order corresponds to the additional regret incurred when we attempt to estimate the extra parameter $\theta^*$.
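The switching logic of this paragraph can be sketched at a high level. Everything below is a placeholder skeleton: `evidence` and `threshold` stand in for the paper's test statistic and Condition 7, and the toy functional forms are ours.

```python
import numpy as np

def osom_skeleton(n, simple_action, complex_action, evidence, threshold):
    """Play the simple-model arm until the evidence of unexploited complex
    structure crosses the threshold; then commit to the complex model forever."""
    use_complex = False
    actions = []
    for t in range(1, n + 1):
        if not use_complex and evidence(t) > threshold(t):
            use_complex = True          # the analogue of Condition 7 being violated
        actions.append(complex_action(t) if use_complex else simple_action(t))
    return actions

# Toy run: linearly growing evidence crosses a sqrt(t)-scale threshold exactly once.
acts = osom_skeleton(
    n=100,
    simple_action=lambda t: "simple",
    complex_action=lambda t: "complex",
    evidence=lambda t: 0.3 * t,
    threshold=lambda t: 2.0 * np.sqrt(t),
)
```

The one-way switch is the key design choice: once the test fires, the algorithm never returns to the simple model, so the regret analysis splits cleanly into a pre-switch and a post-switch phase.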
Our main theorem optimally bounds the regret of $\mathrm{OSOM}$ under either of the two reward-generating models.
Notice that Theorem 3 establishes regret bounds on the algorithm that are minimax optimal, up to logarithmic factors, under both the simple model and the complex model. In fact, under the simple model we are able to obtain problem-dependent regret rates. Note that the above regret bound holds with high probability and also implies a bound in expectation, by setting $\delta = 1/n$ and using Markov's inequality.
To prove Theorem 3, we need to show that the regret of $\mathrm{OSOM}$ is bounded under either underlying model. In Lemma 4 we demonstrate that, whenever the rewards are generated under the simple model, Condition 7 is not violated with high probability. This ensures that when the data is generated from the simple model, the Boolean variable that triggers the switch to the complex model remains unset throughout the run of the algorithm. Thus, the regret is automatically equal to the regret incurred by the $\mathrm{UCB}$ algorithm, which is meant for simple model instances.
On the other hand, when the data is generated according to the complex model, we demonstrate (in Lemma 5) that the regret remains appropriately bounded as long as Condition 7 is not violated. If the condition gets violated at a certain round, we switch to the estimates of the complex model. This corresponds to a variant of the $\mathrm{OFUL}$ algorithm [abbasi2011improved], which is meant for complex instances. Thus, the regret remains bounded in the subsequent rounds under this event as well (formally proved in Lemma 6).
We define below several functions which will be used throughout the proof. These arise naturally by applying the concentration inequalities on terms that appear while controlling the regret.
Given the definitions above, the relationships between these functions are straightforward to verify.
Additionally, we define several statistical events that will be useful in the proofs of the lemmas that follow.
Event $\mathcal{E}_1$ represents control on the fluctuations due to noise: applying Theorem 9 in the one-dimensional case, we get $\Pr[\mathcal{E}_1] \ge 1 - \delta$. Event $\mathcal{E}_2$ represents control on the fluctuations of the empirical estimates of the biases around their true values: by Lemma 1, we have $\Pr[\mathcal{E}_2] \ge 1 - \delta$. Finally, event $\mathcal{E}_3$ represents control on the fluctuations of the empirical estimate of the parameter vector around its true value: by Lemma 2, we have $\Pr[\mathcal{E}_3] \ge 1 - \delta$. We define the desired event as the intersection $\mathcal{E} := \mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3$ of these three events. The union bound gives us $\Pr[\mathcal{E}] \ge 1 - 3\delta$. For the rest of the proof, we condition on the event $\mathcal{E}$.
4.1 Regret under the Simple Model
The following lemma establishes that under the simple model, Condition 7 is not violated with high probability.
Assume that the rewards are generated under the simple model. Then, with probability at least $1 - 3\delta$, we have
Proof Under the simple model, the rewards are given by $g_{a_t, t} = \mu_{a_t} + \eta_{a_t, t}$. Therefore, we have
Notice that the difference neatly decomposes into four terms, $T_1$ through $T_4$, each of which we interpret below. The first term, $T_1$, is purely a sum of the noise in the problem, which concentrates under the event $\mathcal{E}_1$. The second term, $T_2$, corresponds to the difference between the true mean rewards and the simple estimates of the mean rewards, which is controlled under the event $\mathcal{E}_2$. The third term, $T_3$, is the difference between the mean rewards prescribed by the simple estimate and the complex estimate, respectively. Finally, the last term, $T_4$, is only a function of the estimated linear predictor $\hat{\theta}_t$ (and since the true predictor is $\theta^* = 0$, this term is controlled by event $\mathcal{E}_3$).
Step (i) (Bound on $T_1$): Under the event $\mathcal{E}_1$, we have
Step (ii) (Bound on $T_2$): By the definition of the empirical means $\bar{\mu}_{i,s}$, we have,
where $(a)$ follows under the event $\mathcal{E}_2$, $(b)$ follows as
and $(c)$ follows by Jensen's inequality and the fact that $\sum_{i=1}^{K} T_i(t) = t$.
Step (iii) (Bound on $T_3$): Eq. (5), which expresses the optimality of the chosen arm, tells us that $\mathrm{UCB}_{a_s, s} \ge \mathrm{UCB}_{i, s}$ for all arms $i$ and rounds $s \le t$. Therefore $T_3 \le 0$.
Step (iv) (Bound on $T_4$): By the Cauchy-Schwarz inequality, the compactness constraints on the parameters, and the triangle inequality, we get
where the last quantity is defined in Eq. (11).
Combining the bounds on $T_1$, $T_2$, $T_3$ and $T_4$ with the definitions above, we have
which completes the proof.
Proof [Proof of Part (a) of Theorem 3]
We have established, by the lemma above, that Condition 7 is not violated with probability at least $1 - 3\delta$ under the simple model. Conditioned on this event, $\mathrm{OSOM}$ plays according to the simple model estimate for all rounds. Invoking Theorem 7 in [abbasi2011improved] gives us that, with probability at least $1 - \delta$, the regret of this strategy is appropriately bounded. Applying the union bound over these two events gives the claimed regret bound with probability at least $1 - 4\delta$.
4.2 Regret under the Complex Model
The bound on the regret under the complex model follows by establishing two facts. First, when Condition 7 is not violated, we demonstrate in Lemma 5 that the regret is appropriately bounded. Second, if the condition does get violated, say at round $\tau$, our algorithm chooses arms according to the complex model estimates for all rounds $t > \tau$. In Lemma 6, we show that the regret remains bounded in this case as well.
We start with the first case by stating and proving Lemma 5.
Consider any round $t \in \{1, \ldots, n\}$. Let Condition 7 not be violated up until round $t$, i.e.
Then, we have
with probability at least $1 - 3\delta$.
Proof Since we have already conditioned on the event $\mathcal{E}$, we can assume that the events $\mathcal{E}_1$, $\mathcal{E}_2$ and $\mathcal{E}_3$ hold. Note that if Condition 7 is not violated up to round $t$, then the inequality above holds for all rounds up to $t$. Using the definition of the pseudo-regret, we get
where the first term is the maximum possible regret incurred in the initial rounds under the complex model, and it is bounded by definition. Next, let us control the remaining term. We have
where the non-positivity of the second term follows from the optimality of the chosen arm as expressed in Eq. (6). Hence, we have
where the last two inequalities follow from the Cauchy-Schwarz inequality and the compactness constraint on $\theta^*$, respectively. Under the event $\mathcal{E}_3$, the estimation error of $\hat{\theta}_t$ is controlled. Also, by the definition of the upper confidence estimates and under event $\mathcal{E}_2$, we have