1 Introduction
Contextual bandit (CB) learning is the repetition of the following steps, carried out by an agent and an environment:

the environment presents the agent with a context,

the agent chooses an action,

the environment presents the agent with the reward corresponding to the chosen action.
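The three steps above can be sketched as a simple interaction loop. This is a minimal illustration under stated assumptions: `ToyEnvironment`, `UniformAgent` and `run` are hypothetical names introduced here, and the two-action reward rule is invented purely to make the loop runnable.

```python
import random


class ToyEnvironment:
    """Hypothetical two-action environment (an assumption for illustration):
    action 1 is rewarded when the context is positive, action 0 otherwise."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def context(self):
        return self.rng.uniform(-1.0, 1.0)

    def reward(self, context, action):
        best = 1 if context > 0 else 0
        return 1.0 if action == best else 0.0


class UniformAgent:
    """Placeholder agent that ignores feedback and picks actions uniformly."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, context):
        return self.rng.randrange(self.n_actions)

    def update(self, context, action, reward):
        pass  # a real CB algorithm would update its policy here


def run(env, agent, n_rounds):
    """Repeat the three steps and return the cumulative reward."""
    total = 0.0
    for _ in range(n_rounds):
        x = env.context()      # the environment presents a context
        a = agent.act(x)       # the agent chooses an action
        y = env.reward(x, a)   # only the chosen action's reward is revealed
        agent.update(x, a, y)
        total += y
    return total
```

Any agent exposing `act` and `update` can be plugged into `run`; the bandit feedback structure shows up in the fact that `reward` is evaluated for the chosen action only.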
The goal of the agent is to accumulate the highest possible cumulative reward over a given number of rounds . The relative performance of existing CB algorithms depends on the environment : for instance, some algorithms (such as LinUCB) are best suited to settings where the reward structure is linear, but can be outperformed by greedy algorithms when the reward structure is more complex. It would therefore be desirable to have a procedure that identifies, in a data-driven fashion, which of a pool of base CB algorithms is best suited to the environment at hand. This task is referred to as model selection. In batch settings and online full-information settings, model selection is a mature field, with developments spanning several decades [Stone74, Lepski90, Lepski91, gyorfi2002, dudoit_vdL2005, massart2007, benkeser2018]. Cross-validation is now the standard approach used in practice, and it enjoys solid theoretical foundations [devroyelugosi2001, gyorfi2002, dudoit_vdL2005, benkeser2018].
Literature on model selection in online learning under bandit feedback is more recent and sparser. This owes to challenges specific to the bandit setting. Under bandit feedback, at any round only the loss (here the negative reward) corresponding to one action can be observed, so the loss can be observed only for a subset of the candidate learners (those that proposed the action eventually chosen). Any model selection procedure must therefore address the question of how to decide which base learner to follow at each round (the allocation challenge), and how to pass feedback to the base learners (the feedback challenge). A tempting approach to allocating rounds to the base learners is to use a standard multi-armed bandit (MAB) algorithm as a meta-learner, and treat the base learners as arms. This approach fails because, unlike in the usual MAB setting, the reward distribution of the arms changes with the number of times they get played: the more a base learner gets chosen, the more data it receives, and the better its proposed policy (and therefore expected reward) becomes. This exemplifies the comparability challenge: how should the candidate learners be compared based on the data available at any given time?
Existing approaches solve these challenges in different ways. We see essentially two types of solutions in the existing literature, represented on the one hand by the OSOM algorithm of [chatterji2019osom] and the ModCB algorithm of [foster2019], and on the other hand by the CORRAL algorithm of [agarwal17] and the stochastic CORRAL algorithm, an improved version thereof introduced by [pacchiano2020model].
In OSOM and ModCB, the base learners learn policies in policy classes that form a nested sequence, which can be ordered from least complex to most complex. Their solution to the allocation challenge is to start with the least complex algorithm, and to move irreversibly to the next one if a goodness-of-fit test indicates its superiority. The goodness-of-fit test uses all the data available to compute fits of the current and next policy, and compares them. This describes their solution to the feedback challenge and the comparability challenge.
CORRAL variants take another route. They use an Online Mirror Descent (OMD) based master algorithm that samples, at each round, which base learner to follow, and gradually phases out the suboptimal ones. In that sense, their allocation strategy resembles that of a MAB algorithm. The comparability issue arises naturally in the context of an OMD meta-learner, which can be understood easily with an example. Suppose that we have two base algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$, and that $\mathcal{A}_1$ has better asymptotic regret than $\mathcal{A}_2$. It can happen that either by chance ($\mathcal{A}_1$ plays unlucky rounds) or by design (e.g. $\mathcal{A}_1$ explores a lot in early rounds), $\mathcal{A}_1$ fares worse than $\mathcal{A}_2$ initially. As a result, the master would initially give a lesser weight to $\mathcal{A}_1$ than to $\mathcal{A}_2$, so that at some time $t$, the policy proposed by $\mathcal{A}_1$ is based on a much smaller internal sample size than the policy proposed by $\mathcal{A}_2$. Consequently, at $t$, even though $\mathcal{A}_1$ is asymptotically better than $\mathcal{A}_2$, the losses of $\mathcal{A}_1$ are worse than the losses of $\mathcal{A}_2$, which accentuates the data starvation of $\mathcal{A}_1$ and can lead to $\mathcal{A}_1$ never recovering from its early underperformance. The issue described here is that the losses used for the OMD weights update are not comparable across candidates, as they are based on policies informed by significantly different internal sample sizes. CORRAL can be viewed as the solution to the comparability challenge in the context of an OMD master: by using gentle weight updates (as opposed to the more aggressive weight updates of Exp3 for instance) and by regularly increasing the learning rate of base learners whose weight drops too low, CORRAL prevents the base algorithm data starvation phenomenon. The two CORRAL variants differ in their solution to the feedback challenge. The original CORRAL algorithm [agarwal17] passes, at each round, importance-weighted losses to the master and to all base learners. In contrast, the stochastic CORRAL of [pacchiano2020model] passes unweighted losses to each base algorithm, but only at the times it gets selected.
Guarantees in [chatterji19a] and [foster2019] rely on the so-called realizability assumption, which states that at least one of the candidate policy classes contains , the optimal measurable policy under the current environment. [chatterji19a] show that their approach achieves the minimax regret rate for the smallest policy class that contains . [foster2019] consider linear policy classes and show that their algorithm achieves regret no larger than and where is the dimension of the smallest policy class that contains . This is optimal if . In CORRAL variants, if one of the base algorithms has regret , the master achieves regret , with the initial learning rate of the master. The learning rate can be optimized so that this regret bound becomes , that is, up to log factors, the upper bound on the regret of that base algorithm. As pointed out by [agarwal17], and as can be seen from the regret bound restated here, CORRAL comes with an important caveat: the learning rate must be tuned to the rate of the base algorithm one wishes to compete with. This is not an issue when working with a collection of algorithms with the same regret upper bound, and in that case CORRAL offers protection against model misspecification. However, when base learners have different regret rates, CORRAL fails to adapt to the rate of the optimal algorithm.
In this article, we propose a master algorithm that works with general off-the-shelf (contextual) bandit algorithms, and achieves the same regret rate as the best of them. Our theoretical guarantees improve upon OSOM [chatterji2019osom] and ModCB [foster2019] in the sense that our algorithm works with a general collection of bandit algorithms, as opposed to a collection of algorithms based on a nested sequence of parametric reward models. It improves upon CORRAL variants in the sense that it is rate-adaptive. Our master algorithm can be described as follows: for a well-chosen sequence of exploration probabilities, at each time , the master either samples a base algorithm uniformly at random and follows its proposal (with probability ), or it picks the base algorithm that maximizes a certain criterion based on past performance (with an exploitation probability of ). Each algorithm receives feedback only if it gets played by the master. The crucial idea is to compare the performance of base algorithms at the same internal time. At global time , the algorithms are at internal times (with ). We compare them based on their first rounds, thus ensuring a fair comparison.
We organize the article as follows. In section 2, we formalize the setting consisting of a master algorithm allocating rounds to base algorithms. In section 3, we present our master algorithm, EnsBFC (Ensembling Bandits by Fair Comparison). We present its theoretical guarantees in section 4. We show in section 5 that many well-known existing bandit algorithms satisfy the assumption of our main theorem. We give experimental validation of our claims in section 6.
2 Problem setting
2.1 Master data and base algorithms internal data
A master algorithm has access to base contextual bandit algorithms . At any time , the master observes a context vector , selects the index of a base algorithm, and draws an action , following the policy of the selected base algorithm. The environment presents the reward corresponding to action . We distinguish two types of rounds for the master algorithm: exploration rounds and exploitation rounds, which we define in detail further down. We let be the indicator of the event that round is an exploration round. The data collected at time by the master algorithm is . We denote the subvector of corresponding to the triple (context, action, reward) at time . We denote , the filtration induced by the first observations. We suppose that contexts are independent and identically distributed (i.i.d.) and that the conditional distribution of rewards given actions and contexts is fixed across time points.

After each round , the master passes the triple to base algorithm , which increments the internal time of algorithm by 1, and leaves unchanged the internal time of the other algorithms. For any , , we denote the triple collected by base algorithm at its internal time . More formally, we define the internal time of at global time as , that is, the number of times has been selected by the master up to global time . We define the reciprocal of as , that is, the global time at which the internal time of was updated from to . We can then formally define as . We denote the filtration induced by the first observations of algorithm .
Let and be the numbers of exploration and exploitation rounds, respectively, in which was selected up to global time . Note that . Define , , and .
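The bookkeeping described in this subsection can be illustrated with a short sketch. The function and variable names below are hypothetical, chosen for this example only; global and internal times are counted from 1.

```python
def track_internal_times(selected, explore_flags, n_algos):
    """Replay a sequence of master decisions.

    selected[t-1] is the index of the base algorithm played at global time t,
    explore_flags[t-1] says whether round t was an exploration round.
    Returns, for each base algorithm: its internal time after the last round,
    the global times at which its internal time was incremented, and its
    numbers of exploration and exploitation rounds.
    """
    internal_time = [0] * n_algos
    reached_at = [[] for _ in range(n_algos)]  # reached_at[j][tau-1] = global time
    n_explore = [0] * n_algos
    n_exploit = [0] * n_algos
    for t, (j, is_explore) in enumerate(zip(selected, explore_flags), start=1):
        internal_time[j] += 1       # only the selected algorithm advances
        reached_at[j].append(t)     # "reciprocal" map: internal time -> global time
        if is_explore:
            n_explore[j] += 1
        else:
            n_exploit[j] += 1
    return internal_time, reached_at, n_explore, n_exploit
```

By construction, the exploration and exploitation counts of each algorithm add up to its internal time, and the internal times of all algorithms add up to the global time.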
2.2 Policies and base algorithm regret
A policy is a conditional distribution over actions given a context, or otherwise stated, a mapping from contexts to distributions over actions. To define the value and the risk of a policy, we introduce a reference triple such that has the same law as for any , and , where for every and . We introduce what we call the value loss , defined for any policy and triple as . We then define the risk of as . We will use that , where the latter quantity is the value of , that is, the expected reward per round one would get by carrying out under environment . We denote it .
We denote the policy proposed by at its internal time . For any , is an measurable distribution over . We suppose that each algorithm operates over a policy class . The regret of over its first rounds is defined as , with . We define the cumulative conditional regret as , with and , where the identity follows from the fact that . We define the pseudo regret as .
2.3 Master regret and rate adaptivity
We let , the optimal value across all policy classes , and similarly, we denote , the optimal risk across . We define the regret of the master as , and the conditional regret as .
The bandit literature gives upper bounds on either or where the dependence in is of the form , for some . (We denote if for some .) While is known, it is not the case for , the asymptotic value of (the policy proposed by) .
As a necessary requirement, a successful meta-learner should achieve asymptotic value . A second natural requirement is that it should have regret guarantees as good as those of the best algorithm in the subset of algorithms with optimal asymptotic value. We say that a master algorithm is rate-adaptive if it meets these two requirements.
Definition 1 (Rate adaptivity).
Suppose that base algorithms have known regret (or conditional regret, or pseudo regret) upper bounds . Let , the rate exponent corresponding to the fastest upper bound rate among algorithms with optimal limit value .
We say that the master is rate-adaptive in regret (or conditional regret, or pseudo regret), up to logarithmic factors, if it holds that (or , or ).
Remark 1.
A natural setting where several base algorithms converge to the same value is when several of the candidate policy classes contain the optimal measurable policy , that is when the realizability assumption is satisfied for several base policy classes.
Remark 2.
Suppose that rates are minimax optimal (up to logarithmic factors) for the policy classes , and that at least one class contains . Then, in this context, rate adaptivity means that the master achieves the best minimax rate among classes that contain . In this context, rate adaptivity coincides with the notion of minimax adaptivity from the statistical model selection literature (see e.g. [massart2007, gine_nickl_2015]).
Remark 3.
OSOM [chatterji2019osom] and ModCB [foster2019] are minimax adaptive (and thus rate-adaptive) under the condition that belongs to at least one of the policy classes (that is, under the realizability assumption). CORRAL and stochastic CORRAL are not rate-adaptive.
3 Algorithm description
Our master algorithm can be described as follows. At each global time , the master selects a base algorithm index based on past data, observes the context , draws an action conditional on following the policy proposed by at its current internal time, carries out action and collects reward . At the end of round , the master passes the triple to , which then increments its internal time and updates its policy proposal based on the new data point.
To fully characterize the master, it remains to describe the mechanism that produces . We distinguish exploration rounds and exploitation rounds. We determine whether round is an exploration round by drawing, independently of the past , the exploration round indicator from a Bernoulli law with probability , which we define further down. During an exploration round (if ), we draw independently of , from a uniform distribution over . During an exploitation round (if ), we draw based on a criterion depending on the past rewards of the base algorithms. Let us define this criterion.

Let , the mean of negative rewards collected by algorithm up to its internal time . For any , define the algorithm selector , with a tuning parameter. When there is no ambiguity, we will use the shorthand notation . The selector compares every base algorithm at the same internal time, and picks the one that minimizes the sum of the estimated risk at internal time plus the theoretical regret upper bound rate . If , we let , that is, we compare the base algorithms at a common internal time equal to the largest number of exploration rounds that every base algorithm has been allotted up to .

If any base algorithm has average risk converging to some , the regret of an exploration step is in expectation. If we want the regret of the master with respect to (w.r.t.) to be , we need the exploration probability to be . Because is unknown (it depends on hence on too), we make a conservative choice and we set , with (a quantity available to us), where is a tuning parameter.
We give the pseudocode of the master algorithm as algorithm 1 below.
(1) 
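The mechanism described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' reference implementation: the exploration schedule `explore_prob` and the rate term `rate` are left as user-supplied functions because their exact forms are not reproduced here, forcing exploration until every base algorithm has been explored at least once is a choice made for this sketch, and `ConstantPolicy` is a hypothetical stub base learner.

```python
import random


class ConstantPolicy:
    """Hypothetical stub base learner that always plays the same action."""

    def __init__(self, action):
        self.action = action

    def act(self, context):
        return self.action

    def update(self, context, action, reward):
        pass


def select_by_fair_comparison(losses, n_common, rate):
    """Index minimising the mean loss over the first n_common internal rounds
    plus the rate term rate(j, n_common)."""
    scores = [
        sum(l[:n_common]) / n_common + rate(j, n_common)
        for j, l in enumerate(losses)
    ]
    return min(range(len(scores)), key=scores.__getitem__)


def run_master(bases, env_context, env_reward, horizon, explore_prob, rate, seed=0):
    """Allocate rounds to base learners; returns the list of selected indices."""
    rng = random.Random(seed)
    n_bases = len(bases)
    losses = [[] for _ in range(n_bases)]  # losses[j][tau-1] = -reward at internal time tau
    n_explore = [0] * n_bases              # exploration rounds allotted to each base
    picks = []
    for t in range(1, horizon + 1):
        # Common internal time: smallest exploration count across bases.
        n_common = min(n_explore)
        # Assumption of this sketch: explore until every base was explored once.
        explore = n_common == 0 or rng.random() < explore_prob(t)
        if explore:
            j = rng.randrange(n_bases)     # uniform draw over base algorithms
            n_explore[j] += 1
        else:
            j = select_by_fair_comparison(losses, n_common, rate)
        x = env_context(rng)
        a = bases[j].act(x)
        y = env_reward(x, a, rng)
        bases[j].update(x, a, y)           # only the selected base receives feedback
        losses[j].append(-y)
        picks.append(j)
    return picks
```

For instance, one could call `run_master` with two `ConstantPolicy` learners, a decaying schedule such as `lambda t: min(1.0, t ** -0.5)` and a rate term such as `lambda j, n: n ** -0.5`; both functions are placeholders rather than the tuned choices of the paper. Since each base learner's internal time is at least its exploration count, the slice `l[:n_common]` always contains exactly `n_common` losses, so all learners are compared at the same internal time.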
4 Regret guarantees of the master algorithm
Our main result shows that the expected regret of the master satisfies the same theoretical upper bound with respect to as the best base algorithm. The main assumption is that each base algorithm satisfies its conditional regret bound with high probability. We state this requirement formally as an exponential deviation bound.
Assumption 1 (Concentration).
There exist , , , such that, for any , and ,
(2) 
and .
We also require that the rewards be conditionally sub-Gaussian given the past. Without loss of generality, we require that they be conditionally 1-sub-Gaussian.
Assumption 2.
For all , and every , .
We show in the next section that the high-probability regret bounds available in the literature for many well-known CB algorithms can be reformulated as an exponential deviation bound of the form (2). We can now state our main result.
Theorem 1 (Expected regret for the master).
Suppose that assumptions 1 and 2 hold, and recall the definition of from subsection 2.3. Then, EnsBFC is rate-adaptive in pseudo regret, that is,
for some depending only on the constants of the problem. If, in addition, the regret upper bounds satisfied by the base algorithms are minimax for their respective policy classes, then EnsBFC is minimax adaptive in pseudo regret.
Remark 4.
Assumption 1 is met for many well-known algorithms, as we show in the following section.
Remark 5.
The term in the criterion that minimizes across ensures that is, in expectation, lower bounded by . It may be the case that, among the base algorithms that have optimal limit value (that is, those in ), the one that performs best in a given environment is not the one that has the best regret rate upper bound . Enforcing this lower bound on the criterion ensures that the master picks an algorithm with optimal regret upper bound . We further discuss the need for such a lower bound in appendix D.
Remark 6.
The pseudo regret rate of EnsBFC is not impacted by the specific values of the tuning parameters and (as long as they are set to constants independent of ), but the finite-sample performance is. We found in our simulations that setting and works well. We leave to future work the task of designing a data-driven rule of thumb to select and .
In the next subsection, we take a step back to put our results in perspective with the broader model selection literature.
4.1 Comments on the nature of the result: minimax adaptivity vs. oracle equivalence
Results in the model selection literature are essentially of two types: minimax adaptivity guarantees and oracle inequalities.
Given a collection of statistical models, a model selection procedure is said to be minimax adaptive if it achieves the minimax risk of any model that contains the “truth”. In our setting, the statistical models are policy classes and the “truth” is the optimal measurable policy . A notable example of a minimax adaptive model selection procedure is Lepski’s method [Lepski90, Lepski91].
Consider a collection of estimators , and a data-generating distribution , and denote the risk of any estimator under . In our context, one should think of the estimators as the policies computed by the base algorithms, and of specifying as specifying . We say that an estimator satisfies an oracle inequality w.r.t. if , with and an error term. Moreover, we say that the estimator is oracle equivalent if . Being oracle equivalent means performing as well as the best instance-dependent (that is, dependent) estimator. Multi-fold cross-validation yields an oracle-equivalent estimator [devroyelugosi2001, gyorfi2002, dudoit_vdL2005].
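For orientation, one common way of writing these two notions is sketched below. The symbols ($\hat\theta$ for the selected estimator, $\hat\theta_k$ for the candidates, $R(\cdot, P)$ for the risk under $P$, $C$ for the leading constant, $\mathrm{Rem}_n$ for the error term) are notation assumed for this sketch and may differ from the exact statement intended above.

```latex
% Oracle inequality (one standard form, notation assumed):
R(\hat\theta, P) \;\le\; C \, \min_{k} R(\hat\theta_k, P) + \mathrm{Rem}_n,
\qquad C \ge 1 .

% Oracle equivalence (asymptotic ratio of risks equal to one):
\frac{R(\hat\theta, P)}{\min_{k} R(\hat\theta_k, P)} \;\longrightarrow\; 1
\quad \text{as } n \to \infty .
```

The minimum on the right-hand side is taken under the specific $P$ at hand, which is what makes the benchmark instance-dependent rather than worst-case.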
Our guarantees are closer to the notion of minimax adaptivity than to that of oracle equivalence, and, as we pointed out earlier, coincide with it if the base algorithms are minimax w.r.t. their policy classes. Minimax adaptivity is the property satisfied by OSOM [chatterji2019osom] and ModCB [foster2019]. Minimax adaptivity is a worst-case (over each base model) statement, which represents a step in the right direction. We nevertheless argue that what practitioners are looking for in a model selection procedure is the same performance as the base learner that performs best under the environment at hand, that is, oracle equivalence, like the guarantee offered by multi-fold cross-validation.
5 High probability regret bound for some existing CB algorithms
In this section, we recast regret guarantees for well-known CB algorithms in the form of the exponential bound (2) from our concentration assumption (assumption 1).
Recall the definitions of , and from section 2. Observe that our concentration assumption is a high-probability bound on , the average of the conditional instantaneous regret. Although some articles provide high-probability bounds directly on (e.g. [abbasiyadkori2011]), most works give high-probability bounds on . Fortunately, under the assumption that rewards are conditionally sub-Gaussian (assumption 2), we can recover a high-probability regret bound on from a high-probability regret bound on using the Azuma-Hoeffding inequality.
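As a reminder, one standard sub-Gaussian form of the Azuma-Hoeffding inequality is sketched below. The symbols $D_s$, $\mathcal{F}_s$, $\sigma$, $\tau$ and $\varepsilon$ are notation chosen for this sketch rather than taken from the text.

```latex
% Let (D_s)_{s \ge 1} be a martingale difference sequence adapted to a
% filtration (\mathcal{F}_s), conditionally \sigma-sub-Gaussian:
\mathbb{E}\!\left[ e^{\lambda D_s} \,\middle|\, \mathcal{F}_{s-1} \right]
  \;\le\; \exp\!\left( \frac{\lambda^2 \sigma^2}{2} \right)
  \quad \text{for all } \lambda \in \mathbb{R}.

% Then, for every \tau \ge 1 and every \varepsilon > 0,
\mathbb{P}\!\left( \frac{1}{\tau} \sum_{s=1}^{\tau} D_s \ge \varepsilon \right)
  \;\le\; \exp\!\left( - \frac{\tau \varepsilon^2}{2 \sigma^2} \right).
```

A bound of this type controls the gap between an average of realized quantities and the average of their conditional expectations, which is the kind of step needed to pass from one notion of regret to the other.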
(In the following paragraphs, we suppose, to keep notation consistent, that is a base learner of the type considered in the paragraph).
UCB.
[Lemma 4.9 in pacchiano2020model], itself a corollary of [theorem 7 in abbasiyadkori2011], states that if the rewards are conditionally sub-Gaussian, the regret of UCB over rounds is .
Corollary 1 (Exponential deviation bound for UCB).
Suppose that assumption 2 holds. Then, there exist such that, for all ,
$\epsilon$-greedy.
[bibaut2020] consider the $\epsilon$-greedy algorithm over a nonparametric policy class. The following result is a direct consequence of an intermediate claim in the proof of [theorem 4 in bibaut2020].
Lemma 1 (Exponential deviation bound for $\epsilon$-greedy).
Consider the $\epsilon$-greedy algorithm over a nonparametric policy class . Suppose that the metric entropy in norm of satisfies for some , and that the exploration rate at is . Then, there exist such that, for all , with .
LinUCB.
[Theorem 3 in abbasiyadkori2011] states that LinUCB satisfies with probability at least . We recast their bound as follows.
Corollary 2.
Under the conditions of [theorem 3 in abbasiyadkori2011], there exists such that, for all ,
ILOVETOCONBANDITS.
[Theorem 2 in agarwalb14] states that with probability at least . (The proof of their lemma actually states as an intermediate claim a probability bound on which can easily be shown to be as well). We recast their bound as follows.
Corollary 3 (Exponential deviation bound for ILOVETOCONBANDITS).
Suppose that assumption 2 holds. Then, there exist , such that, for any , .
6 Simulation study
We implemented EnsBFC using LinUCB and an $\epsilon$-greedy algorithm as base learners, and we evaluated it under two toy environments. We considered the setting . We chose environments and , and the specifications of the two base algorithms such that:

the $\epsilon$-greedy algorithm has regret w.r.t. the value of the optimal measurable policy under , while LinUCB has linear regret lower bound w.r.t. ,

LinUCB has regret w.r.t. while the $\epsilon$-greedy algorithm has linear regret lower bound w.r.t. .
We present the mean cumulative reward results in figure 1. We demonstrate the behavior of the algorithm on a single run in figure 2 in appendix E. We provide additional details about the experimental setting in appendix E.
Figure 1: Mean cumulative reward of the master and base algorithms over 100 runs, with (10%, 90%) quantile bands.
7 Discussion
We provided and analyzed a meta-learning algorithm that is the first provably rate-adaptive model selection algorithm for a general collection of contextual bandit algorithms. The general idea can be expressed in extremely simple terms: compare the performance of base learners at the same internal sample size, and explore uniformly at random with a well-chosen decaying rate. Simulations confirm the validity of the procedure.
We commented on the nature of the guarantees of our algorithm and of previous approaches, and argued that they are close to (or coincide, under certain conditions, with) minimax adaptivity guarantees. We believe that further efforts should aim to bring the guarantees of model selection procedures under bandit feedback on par with the guarantees of cross-validation in the full-information setting. This would entail proving asymptotic equivalence with an oracle, which is an instance-dependent form of optimality, as opposed to minimax adaptivity.
Broader Impact
Our work concerns the design of model selection / ensemble learning methods for contextual bandits. As it has the potential to improve the learning performance of any system relying on contextual bandits, it can impact essentially any setting where contextual bandits are used.
Contextual bandits are used or envisioned in settings as diverse as clinical trials, personalized medicine, ad placement and recommender systems. We therefore believe the broader impact of our work is positive inasmuch as these applications benefit society.
References
Appendix A Proof of theorem 1
We can without loss of generality assume that the tuning parameters and are set to 1. The proof of theorem 1 relies on the following lemmas.
Lemma 2.
For any , and , and are independent.
The following lemma tells us that the probability of selecting a candidate outside of the set of optimal candidates decreases exponentially with the common internal time of the candidates.
Lemma 3 (Probability of selecting a suboptimal candidate).
For all and all ,
(3) 
with depending only on the constants of the problem, and .
Proof.
Suppose that . Then
(4) 
which we can rewrite as
(5)  
(6) 
Using that , we must then have
(7)  
(8) 
We distinguish two cases.