
The Pareto Frontier of model selection for general Contextual Bandits

by Teodor V. Marinov, et al.

Recent progress in model selection raises the question of the fundamental limits of these techniques. Under particular scrutiny has been model selection for general contextual bandits with nested policy classes, resulting in a COLT2020 open problem. It asks whether it is possible to simultaneously obtain the optimal single-algorithm guarantees over all policies in a nested sequence of policy classes, or whether this is at least possible for some trade-off $\alpha \in [1/2, 1)$ between the complexity term and time: $\ln(|\Pi_m|)^{1-\alpha} T^{\alpha}$. We give a disappointing answer to this question: even in the purely stochastic regime, the desired results are unobtainable. We present a Pareto frontier of upper and lower bounds that match up to logarithmic factors, thereby proving that an increase in the complexity term $\ln(|\Pi_m|)$ independent of $T$ is unavoidable for general policy classes. As a side result, we also resolve a COLT2016 open problem concerning second-order bounds in full-information games.





1 Introduction

Contextual multi-armed bandits are a fundamental problem in online learning (Auer et al., 2002; Langford and Zhang, 2007; Chu et al., 2011; Abbasi-Yadkori et al., 2011). The contextual bandit problem proceeds as a repeated game between a learner and an adversary. At every round of the game, the adversary prepares a pair of a context and a loss over an action space; the learner observes the context, selects an action from the action space, and then observes only the loss of the selected action. The goal of the learner is to minimize their cumulative loss. The performance measure, known as regret, is the difference between the learner's cumulative loss and the smallest loss of a fixed policy (a mapping from contexts to actions) belonging to an a priori determined policy class. Given a single contextual bandit instance with a finite policy class $\Pi$ and $K$ actions, the well-known Exp4 algorithm (Auer et al., 2002) achieves the optimal regret bound of $O(\sqrt{KT\ln|\Pi|})$. Regret guarantees degrade with the complexity of the policy class; a learner might therefore want to leverage “guesses” about the optimal policy. Given nested policy classes $\Pi_1 \subseteq \Pi_2 \subseteq \dots \subseteq \Pi_M$, a learner would ideally suffer regret scaling only with the complexity of $\Pi_{m^*}$, the smallest policy class containing the optimal policy $\pi^*$. While these kinds of results are obtainable in full-information games, in which the learner gets to observe the loss for all actions, they are impossible for multi-armed bandits (Lattimore, 2015). In some aspects, contextual bandits are an intermediate setting between full-information and multi-armed bandits, and it was unknown whether model selection is possible. Foster et al. (2020b) stated model selection in contextual bandits as a relevant open problem at COLT2020. Any positive result for model selection in contextual bandits would imply a general way to treat multi-armed bandits with a switching baseline.
Furthermore, any negative result is conjectured to imply negative results on another unresolved open problem on second-order bounds for full-information games (Freund, 2016).

In this paper, we give a fairly complete answer to the questions above.

  1. We provide a Pareto frontier of upper bounds for model selection in contextual bandits with finite policy classes.

  2. We present matching lower bounds showing that our upper bounds are tight, thereby resolving the motivating open problem (Foster et al., 2020b).

  3. We present a novel impossibility result for adapting to the number of switch points under adaptive adversaries (Besbes et al., 2014).

  4. We negatively resolve an open problem on second-order bounds for full-information games (Freund, 2016).

Related work.

A problem closely related to contextual bandits with finite policy classes is linear contextual bandits. Model selection in linear contextual bandit problems has recently received significant attention; however, none of these results transfer to the finite-policy case. In the linear bandit problem, the $m$-th policy class is parameterized by a subset of $\mathbb{R}^{d_m}$ and the losses are linear: the expected loss of an action is the inner product of an unknown parameter vector with a feature embedding that maps context-action pairs into $\mathbb{R}^{d_m}$, observed under mean-zero sub-Gaussian noise with variance proxy equal to one.

Foster et al. (2019) assume the contexts are also drawn from an unknown distribution and propose an algorithm whose regret depends on the smallest eigenvalue of the covariance matrix of the feature embeddings. Pacchiano et al. (2020b) propose a different approach based on the corralling algorithm of Agarwal et al. (2017), which enjoys one regret bound for finite action sets and a different bound for arbitrary action sets. Later, Pacchiano et al. (2020a) design an algorithm which enjoys a gap-dependent guarantee under the assumption that all of the misspecified models have large regret; under such an assumption, the authors recover regret bounds for arbitrary action sets. Cutkosky et al. (2020) also manage to recover comparable bounds for the model selection problem through their corralling algorithm. Ghosh et al. (2021) propose an algorithm for the finite-arm setting whose guarantee depends on the smallest entry, in absolute value, of the unknown parameter vector; their algorithm also enjoys a similar guarantee for arbitrary action sets. Zhu and Nowak (2021) show that it is impossible to achieve the desired regret guarantees without additional assumptions, by showing a result similar to that of Lattimore (2015), which states that in the stochastic multi-armed bandit problem it is impossible to achieve small regret to a fixed arm without suffering large regret to a different arm.

Chatterji et al. (2020) study the problem of selecting between an algorithm for the linear contextual bandit problem and one for the simple stochastic multi-armed bandit problem; that is, they aim to simultaneously achieve a regret guarantee which is instance-dependent optimal for the stochastic multi-armed bandit problem and optimal for the finite-arm stochastic linear bandit problem. The proposed results only hold under additional assumptions. More generally, the study of the corralling problem, in which we are presented with multiple bandit algorithms and would like to perform as well as the best one, was initiated by Agarwal et al. (2017). Other works which fall into the corralling framework are that of Foster et al. (2020a), who study the misspecified linear contextual bandit problem, in which the observed losses are linear up to some unknown misspecification, and that of Arora et al. (2021), who study the corralling problem for stochastic multi-armed bandit algorithms.

Our work also shows an impossibility result for the stochastic bandit problem with non-stationary rewards. Auer (2002) first investigates the problem under the assumption that there are distributional changes throughout the game and gives an algorithm with a dynamic regret bound (in dynamic regret, the comparator is the best action for the current distribution), under the assumption that the number of switches is known. Auer et al. (2019) achieve similar regret guarantees without assuming that the number of switches (or changes) of the distribution is known. A different measurement of change is the total variation of the shifts in distribution. Multiple works give dynamic regret bounds in terms of this total variation (hiding dependence on the size of the policy class) when it is known, including for extensions of the multi-armed bandit problem such as contextual bandits and linear contextual bandits (Besbes et al., 2014; Luo et al., 2018; Besbes et al., 2015; Wei et al., 2017). Cheung et al. (2019) and Zhao et al. (2020) further show algorithms which enjoy a parameter-free regret bound (hiding dependence on dimensionality) for the linear bandit problem. The lower bound in Table 1 might seem to contradict such results; in Section 5.1 we carefully explain why this is not the case.

Finally, our lower bounds apply to the problem of devising an algorithm which simultaneously enjoys a second-order bound over any fraction of experts. Cesa-Bianchi et al. (2007) first investigate the problem of second-order bounds for the experts problem, in which the proposed algorithm maintains a distribution over the set of experts during every round of the game; the experts are assumed to have stochastic losses, and the work shows an algorithm with a second-order regret guarantee. Chaudhuri et al. (2009); Chernov and Vovk (2010); Luo and Schapire (2015); Koolen and Van Erven (2015) study a different experts problem in which the comparator class for the regret changes from the best expert in hindsight to the uniform distribution over the best $\epsilon$-fraction of experts for an arbitrary positive $\epsilon$. The above works propose algorithms which achieve such a regret bound for all $\epsilon$ simultaneously. Freund (2016) asks if there exists an algorithm which enjoys both guarantees at the same time; that is, does there exist an algorithm with a second-order regret bound which holds simultaneously for all positive $\epsilon$?

Table 1: Overview of our results, giving upper and lower bounds for general contextual bandits and the S-switch problem under adaptive and oblivious/stochastic adversaries. Our novel contributions are in bold; lower bounds hold only when the stated expressions do not exceed $T$. The stochastic/oblivious lower bounds hold only for proper algorithms.


For any $n \in \mathbb{N}$, $[n]$ denotes the set $\{1, \dots, n\}$. The $\tilde{O}(\cdot)$ notation hides poly-logarithmic factors in the horizon $T$ and the number of arms $K$, but not in the size of the policy classes $\Pi_m$.

2 Problem setting

We consider the contextual bandit problem with general policy classes of finite size. There are $K$ arms and nested policy classes $\Pi_1 \subseteq \Pi_2 \subseteq \dots \subseteq \Pi_M$, where a policy $\pi$ is a mapping from an arbitrary context space $\mathcal{X}$ into the set of arms $[K]$. The game is played for $T$ rounds; at any time $t$, the agent observes a context $x_t$, selects an arm $a_t$, and observes the loss $\ell_t(a_t)$ from an otherwise unobserved loss vector $\ell_t \in [0,1]^K$. We measure an algorithm's performance in terms of pseudo-regret, which is the expected cumulative loss of the player compared to following the best fixed policy in hindsight:

$$R_\Pi(T) = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(a_t)\Big] - \min_{\pi \in \Pi} \mathbb{E}\Big[\sum_{t=1}^T \ell_t(\pi(x_t))\Big].$$
We distinguish between stochastic environments and oblivious or adaptive adversaries. In stochastic environments, there is an unknown distribution $\mathcal{D}$ such that the pairs $(x_t, \ell_t)$ are i.i.d. samples from $\mathcal{D}$. In the adversarial regime, the distributions can change over time, i.e. $(x_t, \ell_t) \sim \mathcal{D}_t$. When the choices of $\mathcal{D}_t$ are fixed at the beginning of the game, the adversary is called oblivious, while an adaptive adversary can choose $\mathcal{D}_t$ based on all observations up to time $t-1$.

The stochastic-adversarial hybrid problem, with adversarially chosen contexts but stochastic losses, has often been studied. In our work, all upper bounds hold in the stronger notion where both the losses and the contexts are adaptive, while the lower bounds hold for the weaker notion where only the contexts are adaptive.
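The interaction protocol and the pseudo-regret above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's construction: the environment, policies, and learner below are hypothetical names chosen for the example.

```python
import random

def run_contextual_bandit(T, K, contexts, losses, learner, policies):
    """Play T rounds and return the learner's regret against the best
    fixed policy in hindsight (illustrative sketch)."""
    learner_loss = 0.0
    for t in range(T):
        x_t = contexts[t]               # learner observes the context,
        a_t = learner(x_t, t)           # selects an arm,
        learner_loss += losses[t][a_t]  # and observes only that arm's loss
    # comparator: smallest cumulative loss of any fixed policy
    best = min(sum(losses[t][pi(contexts[t])] for t in range(T))
               for pi in policies)
    return learner_loss - best

# toy stochastic instance: the context is a bit, and arm x is optimal on context x
random.seed(0)
T, K = 200, 2
contexts = [random.randint(0, 1) for _ in range(T)]
losses = [[0.0 if a == x else 1.0 for a in range(K)] for x in contexts]
policies = [lambda x: 0, lambda x: 1, lambda x: x]  # class contains the optimal policy
uniform = lambda x, t: random.randrange(K)          # a naive learner
regret = run_contextual_bandit(T, K, contexts, losses, uniform, policies)
```

Since the policy class contains the optimal mapping `x -> x`, the comparator loss is zero and the uniform learner pays for roughly half of its rounds.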

Open problem (Foster et al., 2020b).

The regret upper bounds for all regimes introduced above, for a fixed policy class $\Pi_m$ of finite size, are of the order $O(\sqrt{KT\ln|\Pi_m|})$ and can be achieved by the Exp4 algorithm (Auer et al., 2002). The question asked by Foster et al. (2020b) is: for a nested sequence of policy classes $\Pi_1 \subseteq \dots \subseteq \Pi_M$, is there a universal $\alpha \in [1/2, 1)$ such that a regret bound of

$$R_{\Pi_m}(T) \le \tilde{O}\big(\ln(|\Pi_m|)^{1-\alpha}\, T^{\alpha}\big) \tag{1}$$

is obtainable for all $m$ simultaneously?

W.l.o.g. we can assume that $|\Pi_{m+1}| \ge |\Pi_m|^2$. Otherwise we take a subset of the policy classes that includes $\Pi_M$ and in which consecutive policy classes at least square in size. Due to nestedness, any guarantee on this subset of models implies, up to constants, the same bound on the full set.


A motivating example for studying nested policy classes is the S-switch problem. The context is simply the time index, $x_t = t$, and the $S$-th policy class $\Pi_S$ is the set of policies that change their action not more than $S$ times over the $T$ rounds. Any positive result for contextual bandits with finite policy classes would provide algorithms that adapt to the number of switch points, since $\ln|\Pi_S| = O(S\ln(KT))$. To make clear which problem we are considering, we denote the regret in the switching problem by $R_S(T)$.
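The size of the $S$-switch policy class can be computed exactly, which makes the claim $\ln|\Pi_S| = O(S\ln(KT))$ concrete. A minimal self-contained check (the closed form counts sequences by their switch positions; all names are illustrative):

```python
from itertools import product
from math import comb, log

def num_switch_policies(T, K, S):
    """Count action sequences of length T over K arms with at most S changes:
    choose the s switch positions among the T-1 gaps, the first arm freely,
    and each new arm among the K-1 remaining ones."""
    return sum(comb(T - 1, s) * K * (K - 1) ** s for s in range(S + 1))

def brute_force(T, K, S):
    """Exhaustive count, feasible only for tiny T and K."""
    return sum(1 for seq in product(range(K), repeat=T)
               if sum(a != b for a, b in zip(seq, seq[1:])) <= S)

# the closed form matches exhaustive enumeration on a small instance
assert num_switch_policies(5, 3, 2) == brute_force(5, 3, 2)

# ln|Pi_S| grows roughly linearly in S for fixed T and K
log_sizes = [log(num_switch_policies(1000, 10, S)) for S in (1, 2, 4, 8)]
```

Doubling $S$ roughly doubles $\ln|\Pi_S|$, in line with the $O(S\ln(KT))$ scaling.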

Next, we define the class of proper algorithms, which choose their policy at every time step independently of the context $x_t$. Restricting our attention to such algorithms greatly reduces the technicalities of the lower bound proofs in the non-adaptive regimes. The lower bound for this class of algorithms is also at the core of the argument for adaptive (improper) algorithms in stochastic environments.

Definition 1.

We call an algorithm proper if, at any time $t$, it follows the recommendation of a policy $\pi_t \in \Pi_M$, and the choice of $\pi_t$ is independent of the context $x_t$.


EXP4 is proper.

The properness assumption intuitively allows us to reduce the model selection problem to a bandit-like problem in the space of all policies $\Pi_M$. We give more details in Section 4.2 and Appendix B.3.
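Definition 1 can be made concrete with a small sketch (the class and method names below are hypothetical, not the paper's): a proper algorithm commits to a policy before the context is revealed, and only then applies it.

```python
import random

class ProperAlgorithm:
    """Sketch of a proper algorithm (Definition 1): the policy for round t is
    sampled from a context-independent distribution, and the context is used
    only afterwards, to read off that policy's recommendation."""

    def __init__(self, policies, seed=0):
        self.policies = policies
        self.weights = [1.0] * len(policies)  # e.g. maintained by Exp4-style updates
        self.rng = random.Random(seed)

    def act(self, context):
        # Step 1 (context-independent): sample a policy from the weights.
        pi = self.rng.choices(self.policies, weights=self.weights)[0]
        # Step 2: only now is the context used, to follow the recommendation.
        return pi(context)

alg = ProperAlgorithm([lambda x: 0, lambda x: x % 3])
arm = alg.act(context=7)
```

An improper algorithm would be allowed to inspect `context` before sampling in step 1; properness forbids exactly that.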

3 Upper bounds

In this section, we generalize the Hedged-FTRL algorithm (Foster et al., 2020a) to obtain an upper bound for model selection over a large collection of base algorithms with regret guarantees.

Theorem 1.

For any $\alpha \in [1/2, 1)$, we can tune Hedged-FTRL over a selection of instances of EXP4 operating on the policy classes $\Pi_1, \dots, \Pi_M$, such that a regret bound trading off $T^{\alpha}$ against $\ln(|\Pi_m|)^{1-\alpha}$ holds uniformly over all $m \in [M]$; the precise bound follows from Theorem 2.

for t = 1, …, T do
       Sample a base algorithm i_t from the FTRL distribution p_t and let it play the next round
       Play the recommended action and observe its loss
       Update the top algorithm with the (possibly biased) loss
       Update base i_t with the importance-weighted loss
       if the update would violate Eq. 2 then
             Bias the losses to ensure Eq. 2
       end if
end for
Algorithm 1 Hedged FTRL


Hedged FTRL, introduced in Foster et al. (2020a), is a type of Follow the Regularized Leader (FTRL) algorithm used as a corralling algorithm (Agarwal et al., 2017). At every round $t$, the algorithm chooses to play one of $M$ base algorithms. Base algorithm $i$ is selected with probability $p_{t,i}$, where $p_t$ is a distribution over base algorithms determined by the FTRL rule over the cumulative (biased) loss vectors, using the potential induced by the Tsallis entropy, a step size determined by the problem parameters, and a special bias term which we now explain. The algorithm maintains a vector tracking the variance of the loss estimators, a vector of regret upper bounds for the base algorithms, and a threshold depending on the base algorithms. At any time $t$, after selecting base algorithm $i_t$ to play the current action, the top (corralling) algorithm observes its loss and gives as feedback an importance-weighted loss to the selected base, $\ell_t(a_t)/p_{t,i_t}$. Whenever the update for the base played at round $t$ would cross the threshold, the loss fed to the top algorithm is adjusted with a bias term, such that the cumulative biases track the required quantity. This has been shown to be always possible (Foster et al., 2020a). The condition for adjusting the biases reads


The condition in Equation 2 is motivated in a similar way to the stability condition in the work of Agarwal et al. (2017). Algorithm 1 constructs an unbiased estimator of the loss vector and updates each base algorithm accordingly. A similar update is present in the Corral algorithm of Agarwal et al. (2017), in which each of the base learners also receives an importance-weighted loss. The regret of the base learners is assumed to scale with the variance of the importance-weighted losses. This assumption is natural and in practice holds for all bandit and expert algorithms. The scaling of the regret, however, must be appropriately bounded, as Agarwal et al. (2017) show; otherwise no corralling or model selection guarantees are possible. Formally, the following stability property is required: if an algorithm enjoys a regret bound under an environment with a given loss sequence, then the algorithm is stable if it enjoys a regret bound of the same order, scaled with the maximum variance of the losses, under the environment of importance-weighted losses (the expectation being taken with respect to any randomness in the environment). Essentially all bandit and expert algorithms used in practice are stable in this sense; e.g., Exp4 is stable. The bias terms in Algorithm 1 intuitively cancel the additional variance introduced by the importance-weighted losses, and this is why we require the biases to satisfy Equation 2.

Theorem 2.

Given a collection of base algorithms which are stable in the sense above, and any $\alpha \in [1/2, 1)$, the regret of hedged Tsallis-Inf with suitably tuned step size and thresholds satisfies a simultaneous regret guarantee over all base algorithms.

The analysis follows closely the proof of Foster et al. (2020a) and is postponed to Appendix A.

Theorem 2 recovers the bounds of Pacchiano et al. (2020b) for model selection in linear bandits, but holds in more general settings, including adaptive adversaries in both contexts and losses. It requires neither nestedness of the policy classes nor that the policies operate on the same action or context space.

Proof of Theorem 1.

The EXP4 algorithm initialized with policy class $\Pi_m$ satisfies the condition of Theorem 2, as shown in Agarwal et al. (2017). Hence Theorem 1 is a direct corollary of Theorem 2. ∎
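For completeness, here is a minimal Exp4 sketch over a finite policy class, with the importance-weighted loss estimates that make it a suitable base algorithm for corralling. The environment interface and constants are illustrative, not the paper's:

```python
import math, random

def exp4(contexts, losses, policies, K, eta, seed=0):
    """Minimal Exp4 sketch: exponential weights over a finite policy class,
    updated with importance-weighted loss estimates (unbiased given p_t)."""
    rng = random.Random(seed)
    w = [0.0] * len(policies)          # log-weights over policies
    total_loss = 0.0
    for x, ell in zip(contexts, losses):
        mx = max(w)
        q = [math.exp(wi - mx) for wi in w]
        Z = sum(q)
        q = [qi / Z for qi in q]       # distribution over policies
        p = [0.0] * K                  # induced distribution over arms
        for qi, pi in zip(q, policies):
            p[pi(x)] += qi
        a = rng.choices(range(K), weights=p)[0]
        total_loss += ell[a]
        est = ell[a] / p[a]            # importance-weighted loss estimate
        for i, pi in enumerate(policies):
            if pi(x) == a:             # only policies that recommended a are charged
                w[i] -= eta * est
    return total_loss

# toy run: policy 0 is optimal (arm 0 always has loss 0)
T = 500
total = exp4([0] * T, [[0.0, 1.0]] * T, [lambda x: 0, lambda x: 1], K=2, eta=0.1)
```

The variance of `est` scales with $1/p_t(a_t)$, which is exactly the quantity the stability property of Theorem 2 controls.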

4 Lower bounds

We present lower bounds that match the upper bounds from Section 3 up to logarithmic factors, thereby proving a tight Pareto frontier of worst-case regret guarantees in model selection for contextual bandits.

In the first part of this section, we consider a special instance of the $S$-switch problem with an adaptive adversary. The proof technique, based on Pinsker's inequality, is folklore and leads to the following theorem.

Theorem 3.

For any $\alpha \in [1/2, 1)$, sufficiently large $T$, and any algorithm whose regret against the fixed-arm baseline is suitably bounded, there exists for any number of switches $S$ a stochastic bandit problem on which the regret against the $S$-switch baseline is large (the explicit bound is given in Appendix B). This bound holds even when the agent is informed about the number of switches up to time $t$.

Since this bound holds even when the agent is informed about when a switch occurs, we can restrict the policy class to policies that only switch arms whenever the agent is informed about a switch in the environment. This is a contextual bandit problem with the phase counter as context and a finite set of policies; hence Theorem 3 implies a corresponding lower bound for model selection in contextual bandits. In the second part of the section, we consider the stochastic regime. Our lower bound construction is non-standard and relies on bounding the total variation between problem instances directly, without the use of Pinsker's inequality.

Theorem 4.

There exist policy classes $\Pi_1 \subseteq \Pi_2$ (in open problem 2 of Foster et al. (2020b), the authors ask about model-based contextual bandits with realizability; our lower bound provides an instance of that) such that, if the regret of a proper algorithm is upper bounded in every environment as in Eq. 1,

then there exists an environment in which the regret is substantially larger.

These theorems directly provide negative answers to the open problem of Foster et al. (2020b).

Corollary 1.

There is no $\alpha \in [1/2, 1)$ that satisfies the regret guarantee of open problem (1), for any algorithm in the adaptive adversarial regime or any proper algorithm in the stochastic case.


By Theorems 3 and 4, for any $\alpha$ the assumed upper bound forces, via the corresponding lower bound, the existence of environments in which the regret exceeds the guarantee of Eq. 1.

Finally, we disprove the open problem in the stochastic case for any algorithm.

Theorem 5.

No algorithm (proper or improper) can satisfy the requirements of open problem (1) for all stochastic environments.

We present the high level proof ideas in the following subsections and the detailed proof in Appendix B.

4.1 Adaptive adversary: the S-switch problem

We present the adaptive environment in which model selection fails and the proof of Theorem 3.

The adversary switches the reward distribution up to $S$ times, thereby segmenting the time into phases. We denote by $i$ the counter of phases and assume the agent is given this information. For each phase $i$, the adversary selects an optimal arm uniformly at random among an initial subset of the arms, and the losses are i.i.d. Bernoulli random variables whose means give the optimal arm a gap over the others. In the final phase, all losses are constant until the end of the game. The adversary decides on the switching points based on an adaptive strategy: a switch from phase $i$ to $i+1$ occurs when the player has played an arm from a designated set sufficiently many times in phase $i$. We can see this problem either as a special case of the S-switch problem, or alternatively as a contextual bandit problem with the phase counter as context.

The lower bound proof for the $S$-switch problem relies on the following lemma, which is proven in Appendix B.

Lemma 1.

Let an agent interact with a multi-armed bandit problem with centered Bernoulli losses and a randomized best arm of fixed gap, for an adaptive number of time steps. If the probability of the stopping event is at least a constant, then the regret conditioned on that event is lower bounded accordingly (the explicit bound is given in Appendix B).

Informally, this lemma says that, conditioned on transitioning from phase $i$ to phase $i+1$, the agent has suffered substantial regret against the best arm during phase $i$.

Informal proof of Theorem 3.

The adversary's strategy is designed such that in each phase it allows the player to interact with the environment just long enough to discover the best action of that phase; then a new phase begins, preventing the player from exploiting this knowledge. By Lemma 1, this ensures that the player suffers substantial regret during each completed phase. If the agent succeeds in finding the best action of every phase, then the regret against the non-switching baseline is large. By the assumption on the maximum regret and an appropriate choice of the gap and phase lengths, we can ensure that the agent must fail to discover all best actions with constant probability, and thus incurs large regret against the optimal $S$-switch baseline. Tuning the parameters yields the desired theorem; the formal argument with explicit choices is found in Appendix B. ∎

4.2 Stochastic lower bound

We now present the stochastic environment used for the impossibility results in Theorems 5 and 4.

There are environments $\nu_0, \nu_1, \dots, \nu_N$ and policies $\pi_0, \pi_1, \dots, \pi_N$, with $\Pi_1 = \{\pi_0\}$ and $\Pi_2 = \{\pi_0, \pi_1, \dots, \pi_N\}$. In all environments, there are three actions; $\pi_0$ always chooses action 3, while $\pi_1, \dots, \pi_N$ each play an action from $\{1, 2\}$ uniformly at random. (In other words, the context encodes the uniformly random recommendations of $\pi_1, \dots, \pi_N$.)

In each environment, the losses of actions 1 and 2 at any time step are coupled: conditioned on the context, they are determined by independent Bernoulli random variables whose means encode the environment.

Action 3 gives a constant loss in all environments.

Let us unwrap these definitions. Playing either action 1 or action 2, which we call revealing actions, yields full information about all random variables at time $t$, due to the coupling of the two losses and the non-randomness of action 3's loss. On the other hand, playing action 3 only reveals its own constant loss, which has the same distribution in all environments; hence no information is gained at all.

We know from full-information lower bounds that, for an optimal tuning of the gap, one suffers regret of order $\sqrt{T\ln|\Pi_2|}$ in the policy class $\Pi_2$, due to the difficulty of identifying the optimal policy. For a smaller regret in policy class $\Pi_1$, one needs to confirm or reject the hypothesis that no policy is biased faster than it takes to identify the optimal policy. Existing techniques do not answer the question of whether this is possible, and our main contribution in this section is to show that the hardness of rejecting this hypothesis is of the same order as identifying the exact environment.

For the remainder of this section, it will be useful to consider a reparametrization of the random variables. Let $W_t$ be the vector of losses incurred by the policies $\pi_1, \dots, \pi_N$ at time $t$. We can easily see that $W_t$, together with the context, suffices to uniquely determine the losses of actions 1 and 2. Furthermore, $W_t$ is always a vector of independent Bernoulli random variables, which are independent of the current context (we emphasize that $W_t$ is only independent of the current context, not of the full context sequence). In environments $\nu_i$, the $i$-th component of $W_t$ is a biased Bernoulli, while all other components are unbiased. In $\nu_0$, no component is biased. As before, the loss of action 3 does not provide any information, since its distribution conditioned on the context is identical in all environments (see Lemma 4 in Appendix B for a formal proof).

Under this reparametrization, and ignoring non-informative bits of randomness, the problem of distinguishing $\nu_0$ from the other environments now looks as follows. For each time step, decide whether to play a revealing action and observe $W_t$ (potentially taking past observations into account); then use the observed $W_t$'s to distinguish between the environments. Proper algorithms simplify the problem even further, because selecting the policy independently of the context implies that the decision of observing $W_t$ is also independent of $W_t$ (any policy except $\pi_0$ allows one to observe $W_t$ under any context). Hence for proper algorithms, we can reason directly about how many samples are required to distinguish between environments. This problem bears similarity to property testing of dictator functions (Balcan et al., 2012) and sparse linear regression (Ingster et al., 2010) (the setting of Ingster et al. (2010) differs from ours in that they consider an asymptotic regime where both feature sparsity and the dimensionality of the problem go to infinity, while for us the sparsity is fixed to one); however, there is no clear way to apply such results to our setting.

The following lemma shows the difficulty of testing the hypothesis $\nu_0$.

Lemma 2.

Under the parameter scaling specified in Appendix B, if the algorithm chooses whether to reveal $W_t$ independently of $W_t$, and if the total number of times $W_t$ is revealed is almost surely bounded, then for any measurable event the difference in probability between the environments is small (the explicit bound is given in Appendix B).

The proof of Lemma 2 is deferred to Section B.3. The high-level idea is to directly bound the total variation between $\nu_0$ and the mixture of the alternative environments over the space of outcomes of the revealed $W_t$'s, by utilizing the Berry–Esseen inequality instead of going through Pinsker's inequality. This step is key to achieving the right dependence on the number of policies in the bound.

For readers familiar with lower bound proofs for bandits and full information, this lemma should not come as a huge surprise. For a $T$-round full-information game, it tells us that we can bias a single arm by up to order $\sqrt{\ln(N)/T}$ without this being detectable. This directly recovers the well-known $\Omega(\sqrt{T\ln N})$ lower bound for full information via the argument used for bandit lower bounds. However, this result goes beyond what is known in the literature: we show not only that one cannot reliably detect the biased arm, but that one cannot even reliably detect whether any biased arm is present at all. This property is the key to showing the lower bound of Theorem 4.
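The quantitative message of Lemma 2, that a bias of order $1/\sqrt{n}$ in $n$ full-information coin flips is statistically invisible, can be checked numerically: since only the number of ones matters, it suffices to compute the exact total variation distance between two binomial distributions. The constants below are illustrative, not the paper's:

```python
from math import exp, lgamma, log, sqrt

def log_binom_pmf(n, p, k):
    """Log of the Binomial(n, p) pmf at k, via log-gamma for numerical safety."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def tv_binomial(n, p, q):
    """Exact total variation distance between Binomial(n,p) and Binomial(n,q)."""
    return 0.5 * sum(abs(exp(log_binom_pmf(n, p, k)) - exp(log_binom_pmf(n, q, k)))
                     for k in range(n + 1))

for n in (100, 400, 1600):
    # a bias shrinking like 0.1/sqrt(n) stays bounded away from detectability:
    assert tv_binomial(n, 0.5, 0.5 + 0.1 / sqrt(n)) < 0.15
# whereas a constant bias becomes fully detectable as n grows:
assert tv_binomial(1600, 0.5, 0.6) > 0.9
```

The total variation under the $1/\sqrt{n}$ scaling stays roughly constant as $n$ grows, so no test on any event can separate the biased from the unbiased distribution with confidence.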

Informal proof of Theorem 4.

Under environment $\nu_0$, every revealing action is suboptimal, so observing $W_t$ repeatedly incurs regret proportional to the number of observations. Using the assumption on the regret and Markov's inequality, we obtain an upper bound on the number of observations which holds with constant probability. We can construct an algorithm that never exceeds this observation budget, by following the original algorithm until it has played that many revealing actions and then committing to policy $\pi_0$ (action 3). Since the algorithm is proper, we can define the revealed information as the observed $W_t$'s during the times when the modified algorithm plays a revealing action, and we tune the remaining parameters such that the conditions of Lemma 2 are satisfied. Let $E$ be the event that the modified algorithm exhausts its budget of revealing actions (i.e., the original algorithm plays at least that many revealing actions); by Lemma 2, the probability of $E$ is nearly the same under $\nu_0$ and under the other environments. Thus, there exists an environment under which the algorithm plays too few revealing actions with constant probability, which incurs large regret. The theorem follows from tuning the parameters, which is done formally in Appendix B. ∎

Improper algorithms.

Even though we are not able to extend the lower bound proof to improper algorithms uniformly over all parameter values, we can still show that no algorithm (proper or improper) can solve the open problem (1) for stochastic environments.

The key is the following generalization of Lemma 2, which is proven in the appendix.

Lemma 3.

Under the parameter scaling specified in Appendix B, if the total number of times $W_t$ is revealed is almost surely bounded, then for any measurable event the difference in probability between the environments is small (the explicit bound is given in Appendix B).

This holds even if the agent can take all contexts and previous observations into account when deciding whether to pick a revealing action at any time step.

Informal proof of Theorem 5.

The proof is analogous to that of Theorem 4; however, we use Lemma 3 to bound the difference in probability of the relevant event under $\nu_0$ and the alternative environments. The key is to find a tuning of the parameters such that the right-hand side of the bound in Lemma 3 vanishes for large $T$. Following the same arguments as in Theorem 4, there exists a sufficiently large $T$ beyond which there always exists an environment in which the regret is too large, thereby contradicting the open problem. The formal proof is deferred to Appendix B. ∎

5 Implications

The relevance of open problem Eq. 1 has been motivated by its potential implications for other problems, such as the S-switch bandit problem and an unresolved COLT2016 open problem on improved second-order bounds for full information. Our negative result for Eq. 1 indeed leads to the expected insights.

5.1 S-switch

Our lower bound in the adaptive regime shows that adapting to the number of switches is hopeless if the timing of the switch points is not independent of the player's actions. Any algorithm adaptive to the number of switches in the regime with an oblivious adversary must break in the adaptive case, which rules out bandit-over-bandit approaches based on importance sampling (Agarwal et al., 2017). The successful algorithm proposed in Cheung et al. (2019) uses a bandit-over-bandit approach without importance sampling; nonetheless, all its components have adaptive adversarial guarantees. The algorithm splits the time horizon into equal-length epochs. It initializes EXP3 with one arm per candidate learning rate in a grid. For each epoch, the EXP3 top algorithm samples an arm and starts a freshly initialized instance of EXP3.S using the learning rate corresponding to the selected arm; this instance is run over the full epoch. The accumulated loss of the base instance over the epoch is then fed back to the EXP3 top algorithm.

If all algorithms in the protocol enjoy guarantees against adaptive adversaries, why does bandit-over-bandit break against adaptive adversaries? Adaptive adversaries are assumed to pick the loss $\ell_t$ independently of the agent's arm choice at round $t$. In the bandit-over-bandit protocol, however, the loss of the arm chosen by the top algorithm depends on the losses that the selected base suffers over the whole epoch. An adaptive adversary can adapt the losses within the epoch based on the actions of the base algorithm, which means the top algorithm's loss is not chosen independently of its action. Hence the protocol is broken, and the adaptive adversarial regret bounds do not hold.
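The epoch protocol described above can be sketched as follows. This is a simplified skeleton with illustrative names; the base learner is abstracted into a callback so that the feedback structure, one aggregated loss per epoch for the top EXP3, is explicit:

```python
import math, random

def bandit_over_bandit(T, epoch_len, learning_rates, run_base, seed=0):
    """Skeleton of the bandit-over-bandit protocol: a top EXP3 treats each
    candidate learning rate as an arm. Each epoch it samples one, runs a fresh
    base instance with that rate via `run_base`, and receives only the epoch's
    aggregated (normalized) loss as bandit feedback for that arm."""
    rng = random.Random(seed)
    n = len(learning_rates)
    n_epochs = max(1, T // epoch_len)
    eta_top = math.sqrt(2.0 * math.log(n) / n_epochs)
    w = [0.0] * n                      # top EXP3 log-weights
    total = 0.0
    for start in range(0, T, epoch_len):
        mx = max(w)
        q = [math.exp(x - mx) for x in w]
        Z = sum(q)
        q = [x / Z for x in q]
        i = rng.choices(range(n), weights=q)[0]
        length = min(epoch_len, T - start)
        epoch_loss = run_base(learning_rates[i], start, length)  # fresh base each epoch
        total += epoch_loss
        # importance-weighted feedback: the top algorithm sees one loss per epoch
        w[i] -= eta_top * (epoch_loss / length) / q[i]
    return total

# toy check with a hypothetical base: one learning rate is clearly better
fake_base = lambda eta, start, length: length * (0.1 if eta == 0.5 else 0.9)
total = bandit_over_bandit(1000, 50, [0.1, 0.5, 1.0], fake_base)
```

The fragility discussed above is visible in the structure: `run_base` aggregates a whole epoch of play, so an adversary adapting to the base's actions within the epoch controls the top algorithm's feedback for its already-chosen arm.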

5.2 Second-order bounds for full information

In an unresolved COLT2016 open problem, Freund (2016) asks if it is possible to ensure a second-order regret bound (Eq. 3) against the best $\epsilon$-fraction of policies simultaneously for all $\epsilon > 0$. We go even a step further and show that the lower bound construction from Section 4.2 directly provides, for any $\alpha < 1$, a negative answer to the looser bound (Eq. 4).
Theorem 6.

An algorithm satisfying Eq. 4 for some $\alpha < 1$ implies the existence of a proper algorithm that violates the lower bound for the counterexample in the proof of Theorem 4.

Theorem 6 has the following interpretation. For any fixed $\alpha < 1$, there is no algorithm which enjoys a regret upper bound as in Equation 3 for all problem instances. This implies we cannot hope for a polynomial improvement, in terms of the time horizon, over existing bounds. The detailed proof is found in Appendix B. The high-level idea is to initialize the full-information algorithm satisfying Eq. 4 with a sufficient number of copies of the baseline policy and to feed importance-weighted losses of the experts (i.e. policies) to that algorithm.

As we mention in Section 1, the remaining case is obtainable. Our reduction relates the adaptation to variance to the model selection problem. As in Eq. 1, the exponent captures the trade-off between time and complexity. An algorithm satisfying Eq. 4 in this regime merely allows one to recover the trivial bound for model selection, and hence does not lead to a contradiction.

6 Conclusion

We derived the Pareto frontier of minimax regret for model selection in contextual bandits. Our results resolve several open problems (Foster et al., 2020b; Freund, 2016).

We would like to thank Haipeng Luo and Yoav Freund for discussions about our lower bound proofs. We thank Tor Lattimore for pointing us to the technicalities required for bounding the total variation of improper algorithms.



  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Our main regret upper bounds are found in Theorem 1 and Theorem 2. Our main lower bounds are found in Theorem 3 and Theorem 4 with Corollary 1 solving the COLT2020 open problem and Theorem 6 solving the COLT2016 open problem.

    2. Did you describe the limitations of your work? See paragraph regarding removing the properness requirement at the end of page 8.

    3. Did you discuss any potential negative societal impacts of your work? This paper is theoretical in nature and we do not foresee any immediate societal impacts.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results? For complete proofs we refer the reader to the appendix.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Upper bound proofs

Proof of Theorem 2.

Denote the arm that the algorithm would choose if it were selected during the round. We decompose the regret into

where the last line follows by the assumption of the theorem. The first term requires some basic properties from the FTRL analysis, see e.g. [Zimmert and Seldin, 2021]. Define the two auxiliary quantities used in the analysis. For Tsallis-INF with a constant learning rate (the proof can be adapted to time-dependent learning rates), we have the following properties

The standard FTRL proof (e.g. Zimmert and Seldin [2021]) shows that

The first term is

Let us consider the two remaining terms. First, using the definition of the conjugate function, we know that

Further, by Young’s inequality, it holds that

The above two displays imply

Thus we can bound
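For concreteness, the Tsallis-INF sampling distribution used in this proof can be computed numerically. This is a minimal sketch under the assumption of the 1/2-Tsallis regularizer Psi(w) = -2 * sum_i sqrt(w_i) of Zimmert and Seldin [2021], with the normalizing Lagrange multiplier found by bisection; the function name is our own.

```python
import numpy as np

def tsallis_inf_distribution(L, eta, iters=200):
    """FTRL distribution with the 1/2-Tsallis regularizer Psi(w) = -2*sum(sqrt(w)):
    the first-order condition gives w_i = 1 / (eta * (L_i + nu))**2, where the
    multiplier nu is chosen by bisection so that the weights sum to one."""
    L = np.asarray(L, dtype=float)
    K = len(L)
    lo = -L.min() + 1e-12               # w_i blows up as nu -> -min(L)
    hi = -L.min() + np.sqrt(K) / eta    # here every w_i <= 1/K, so sum(w) <= 1
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        s = np.sum(1.0 / (eta * (L + nu)) ** 2)
        if s > 1.0:   # total mass too large -> increase nu
            lo = nu
        else:
            hi = nu
    nu = 0.5 * (lo + hi)
    w = 1.0 / (eta * (L + nu)) ** 2
    return w / w.sum()  # tiny renormalization for numerical safety
```

With all cumulative loss estimates equal, the distribution is uniform, and arms with smaller cumulative loss receive larger probability, as expected from the analysis above.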