Open Problem: Model Selection for Contextual Bandits

In statistical learning, algorithms for model selection allow the learner to adapt to the complexity of the best hypothesis class in a sequence. We ask whether similar guarantees are possible for contextual bandit learning.



1 Introduction

Model selection is the fundamental statistical task of choosing a hypothesis class using data, with statistical guarantees dating back to Vapnik’s structural risk minimization principle. Despite decades of research on model selection for supervised learning and the ubiquity of model selection procedures such as cross-validation in practice, very little is known about model selection in interactive learning and reinforcement learning settings where exploration is required. Focusing on contextual bandits, a simple reinforcement learning setting, we ask:

Can model selection guarantees be achieved in contextual bandit learning, where a learner must balance exploration and exploitation to make decisions online?

2 Problem Formulation

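We consider the adversarial contextual bandit setting (Auer et al., 2002). The setting is defined by a context space $\mathcal{X}$ and a finite action space $\mathcal{A} := \{1, \ldots, K\}$. The learner interacts with nature for $T$ rounds, where in round $t$: (1) nature selects a context $x_t \in \mathcal{X}$ and loss $\ell_t \in [0,1]^{K}$, (2) the learner observes $x_t$ and chooses action $a_t \in \mathcal{A}$, and (3) the learner observes $\ell_t(a_t)$. We allow for an adaptive adversary, so that $x_t$ and $\ell_t$ may depend on $a_1, \ldots, a_{t-1}$. In the usual problem formulation, the learner is given a policy class $\Pi \subseteq (\mathcal{X} \to \mathcal{A})$, and the goal is to minimize regret to $\Pi$:

$$\mathrm{Reg}(\Pi) := \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(a_t)\Big] - \min_{\pi \in \Pi} \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(\pi(x_t))\Big].$$

When $\Pi$ is finite, the well-known Exp4 algorithm (Auer et al., 2002) achieves the optimal regret bound of $O(\sqrt{KT \log|\Pi|})$.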
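To make the protocol and the Exp4 update concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the function name `exp4`, its argument layout, and the uniform initial weights are choices made for this example, and losses are assumed to lie in $[0,1]$.

```python
import math
import random

def exp4(policies, contexts, losses, num_actions, eta, seed=0):
    """Run Exp4 over a finite policy class and return the total loss incurred.

    policies: list of functions mapping a context to an action in range(num_actions)
    contexts: list of contexts, one per round
    losses:   list of loss vectors; only the entry of the played action is read
    eta:      learning rate (hypothetical tuning is shown in the usage note)
    """
    rng = random.Random(seed)
    log_w = [0.0] * len(policies)  # log-weights, uniform prior over policies
    total_loss = 0.0
    for x, loss in zip(contexts, losses):
        # Distribution over policies (numerically stable softmax).
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        z = sum(w)
        recs = [pi(x) for pi in policies]
        # Induced distribution over actions.
        act_prob = [0.0] * num_actions
        for wi, a in zip(w, recs):
            act_prob[a] += wi / z
        a_t = rng.choices(range(num_actions), weights=act_prob)[0]
        total_loss += loss[a_t]
        # Importance-weighted loss estimate: nonzero only on the played action.
        est = loss[a_t] / act_prob[a_t]
        for i, a in enumerate(recs):
            if a == a_t:
                log_w[i] -= eta * est
    return total_loss
```

With losses in $[0,1]$, a common tuning is `eta = math.sqrt(2 * math.log(len(policies)) / (num_actions * T))`, which balances the two terms of the standard analysis.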

The Model Selection Problem.

In the contextual bandit model selection problem, we assume that the policy class under consideration decomposes as a nested sequence (it is also natural to consider infinite sequences of policy classes, but we restrict to finite sequences for simplicity):

$$\Pi_1 \subseteq \Pi_2 \subseteq \cdots \subseteq \Pi_M = \Pi.$$

The goal of the learner is to achieve low regret to all classes in the sequence simultaneously, with the regret to policy class $\Pi_m$ scaling only with $\log|\Pi_m|$. Intuitively, this provides a luckiness guarantee: if a good policy lies in a small policy class, the algorithm discovers this quickly.

To motivate the precise guarantee we ask for, let us recall what is known in the simpler full-information online learning setting, where the learner gets to see the entire loss vector $\ell_t$ at the end of each round. Here, the minimax rate is $\Theta(\sqrt{T \log|\Pi|})$, and it can be shown (Foster et al. (2015); see also Orabona and Pál (2016)) that a variant of the exponential weights algorithm guarantees

$$\mathrm{Reg}(\Pi_m) \le O\Big(\sqrt{T\,(\log|\Pi_m| + \log m)}\Big) \quad \text{for all } m \in \{1, \ldots, M\}.$$

In other words, by paying a modest additive overhead of $\log m$, we can compete with all policy classes simultaneously. The most basic variant of our open problem asks whether the natural analogue of this full-information guarantee can be attained for contextual bandits.

Open Problem 1a.

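Design a contextual bandit algorithm that for any sequence $\Pi_1 \subseteq \cdots \subseteq \Pi_M$ ensures

$$\mathrm{Reg}(\Pi_m) \le O\Big(\sqrt{KT\,(\log|\Pi_m| + \log m)}\Big) \quad \text{for all } m \in \{1, \ldots, M\}.$$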
We also welcome the following weaker guarantees.

Open Problem 1b.

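Design a contextual bandit algorithm that for any sequence $\Pi_1 \subseteq \cdots \subseteq \Pi_M$ ensures either:

  1. $\mathrm{Reg}(\Pi_m) \le \tilde{O}\big(\mathrm{poly}(K, \log M) \cdot \sqrt{T \log|\Pi_m|}\big)$ for all $m$.

  2. $\mathrm{Reg}(\Pi_m) \le \tilde{O}\big(\mathrm{poly}(K, \log M) \cdot T^{\alpha} \log^{1-\alpha}|\Pi_m|\big)$ for all $m$, where $\alpha \in [1/2, 1)$.

Alternatively, prove that no algorithm can achieve item 2 above for any value $\alpha \in [1/2, 1)$.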

The first item here differs from Open Problem 1a only in the dependence on $K$, $M$, and $\log T$ factors, which we do not believe represent the most challenging aspect of the problem. The second item is a further relaxation of the original guarantee. Here we simply ask that the model selection algorithm has regret sublinear in $T$ whenever $\log|\Pi_m|$ is sublinear. In other words, if policy class $\Pi_m$ is learnable on its own, the model selection algorithm should have sublinear regret to it. To attain this behavior it is essential that the exponents $\alpha$ and $1-\alpha$ sum to one. Indeed, it is relatively easy to design algorithms with exponents that do not sum to one (for example, we can attain regret $O(\sqrt{KT}\,(\log|\Pi_m| + \log m))$ for all $m$ by running Exp4 with a particular prior over policies), but we do not know of an algorithm satisfying item 2 above for any $\alpha < 1$. This stands in contrast to other problems involving adaptivity and data-dependence in contextual bandits (Agarwal et al., 2017a), where attaining adaptive guarantees with suboptimal dependence on $T$ is straightforward, and the primary challenge is to attain $\sqrt{T}$-type regret bounds. We also welcome a lower bound showing that this type of model selection guarantee is not possible for contextual bandits.
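The remark about running Exp4 with a prior can be illustrated numerically. The sketch below assumes one concrete prior that the text does not specify: class $m$ receives total mass proportional to $1/m^2$, split evenly among the policies that first appear in it. It then evaluates the standard Exp4 bound $\ln(1/q(\pi))/\eta + \eta K T/2$ with the class-independent choice $\eta = 1/\sqrt{KT}$. The function names are hypothetical.

```python
import math

def nested_prior(class_sizes):
    """Per-policy prior mass for policies that first appear in class m.

    class_sizes[m-1] = |Pi_m| for a nested sequence (nondecreasing sizes).
    Class m gets total mass (6 / pi^2) / m^2, shared by its new policies.
    """
    masses = []
    prev = 0
    for m, size in enumerate(class_sizes, start=1):
        new = size - prev
        class_mass = (6.0 / math.pi ** 2) / m ** 2
        masses.append(class_mass / new if new > 0 else 0.0)
        prev = size
    return masses

def exp4_prior_bound(mass, K, T):
    """Bound ln(1/mass)/eta + eta*K*T/2 evaluated at eta = 1/sqrt(K*T)."""
    eta = 1.0 / math.sqrt(K * T)
    return math.log(1.0 / mass) / eta + eta * K * T / 2.0
```

Because $\eta$ cannot depend on the unknown class index, the complexity term enters linearly: the bound equals $\sqrt{KT}\,(\ln(1/q(\pi)) + 1/2)$, and $\ln(1/q(\pi)) \le \ln|\Pi_m| + 2\ln m + \ln(\pi^2/6)$ for any policy in $\Pi_m$. The exponents on $T$ and on $\log|\Pi_m|$ are therefore $1/2$ and $1$, which sum to $3/2$ rather than one.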

Stochastic Setting.

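The model selection problem for contextual bandits has yet to be solved even for the stochastic setting, and even when the model is well-specified. Here, we assume: (1) the pairs $(x_t, \ell_t)$ are drawn i.i.d. from a fixed distribution $\mathcal{D}$; (2) each class $\Pi_m$ is induced by a class of regression functions $\mathcal{F}_m \subseteq (\mathcal{X} \times \mathcal{A} \to \mathbb{R})$, in the sense that $\Pi_m = \{\pi_f : f \in \mathcal{F}_m\}$, where $\pi_f(x) := \arg\min_{a \in \mathcal{A}} f(x, a)$; (3) the problem is realizable/well-specified in the sense that there exists an index $m^\star$ and a regression function $f^\star \in \mathcal{F}_{m^\star}$ such that $\mathbb{E}[\ell(a) \mid x] = f^\star(x, a)$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$.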

The final version of our open problem asks for a model selection guarantee when the problem is stochastic and well-specified. Note that these assumptions imply that the optimal unconstrained policy (in terms of expected loss) is $\pi_{f^\star}$. As such, here we only ask for a regret bound against class $\Pi_{m^\star}$. From an algorithmic perspective, this is the easiest version of the problem.
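As a small illustration of how a regression function induces a policy in the stochastic setting, the sketch below builds the greedy policy that plays the action with the smallest predicted loss. The helper name `induced_policy` and the toy regression function are assumptions made for this example.

```python
def induced_policy(f, actions):
    """Return the greedy policy x -> argmin over actions of f(x, a)."""
    def pi(x):
        # Ties are broken in favor of the earliest action in `actions`.
        return min(actions, key=lambda a: f(x, a))
    return pi

# Toy regression function: predicted loss grows with the distance |x - a|.
def toy_f(x, a):
    return (x - a) ** 2

toy_pi = induced_policy(toy_f, actions=[0, 1, 2, 3])
```

Under realizability, applying this construction to the true regression function yields the optimal unconstrained policy, which is why the stochastic version of the problem only asks for regret against the class containing it.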

Open Problem 2.

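For some value $\alpha \in [1/2, 1)$, design an algorithm for contextual bandits that for any sequence $\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_M$, whenever data is stochastic and realizable, ensures

$$\mathrm{Reg}(\Pi_{m^\star}) \le \tilde{O}\Big(\mathrm{poly}(K, \log M) \cdot T^{\alpha} \log^{1-\alpha}|\mathcal{F}_{m^\star}|\Big).$$

Alternatively, prove that no algorithm can achieve this guarantee for any value of $\alpha \in [1/2, 1)$.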

We offer $300 for the first solution to either Open Problem 1 or Open Problem 2.

3 Challenges and Partial Progress

Many natural algorithmic strategies for model selection fail under bandit feedback. These include (a) running Exp4 over all policies with a non-uniform prior adapted to the nested policy class structure, (b) the Corral aggregation strategy (Agarwal et al., 2017b), and (c) an adaptive version of the classical $\epsilon$-greedy strategy (Langford and Zhang, 2008). These strategies all require tuning parameters (e.g., the learning rate $\eta$) in terms of the class index $m$ of interest, and naive tuning gives guarantees of the form $\tilde{O}(T^{\alpha} \log^{\beta}|\Pi_m|)$ for $\alpha + \beta > 1$. Adaptive online learning algorithms like AdaNormalHedge (Luo and Schapire, 2015) and Squint (Koolen and Van Erven, 2015) also fail because they do not adequately handle bandit feedback (their regret bounds do not contain the usual “local norm” term used in the analysis of Exp4 and other bandit algorithms). We refer the reader to Foster et al. (2019) for more details on these strategies in the context of model selection. The main point here is that model selection for contextual bandits appears to require new algorithmic ideas, even when we are satisfied with weak $T^{\alpha} \log^{1-\alpha}|\Pi_m|$-type rates where $\alpha > 1/2$.

In a recent paper (Foster et al., 2019), we showed that a guarantee of the form requested in Open Problem 2 is achievable when each regression class $\mathcal{F}_m$ consists of linear functions in $d_m$ dimensions, under distributional assumptions on the data. Our strategy was inspired by the fact that if the optimal loss is known, one can test if a given class contains the optimal policy by running a standard contextual bandit algorithm and checking whether it substantially underperforms relative to the optimal loss. In our linear setup, we showed that one can estimate a surrogate for the optimal loss at a “sublinear” rate, which allowed us to run this testing strategy and achieve a guarantee akin to that of Open Problem 2 with no prior information. However, we do not know if this strategy can succeed beyond specialized settings where sublinear loss estimation is possible. Along these lines, Locatelli and Carpentier (2018) also observe that knowledge of the optimal loss can enable adaptive guarantees in Lipschitz bandits, where adaptivity is not possible in the absence of such information (such lower bounds do not appear to carry over to the contextual case).
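The testing idea described above, in the idealized case where the optimal loss is known, can be sketched as a simple threshold check. This is only an illustration and not the procedure from Foster et al. (2019): the function name, the `slack` constant, and the form of `regret_bound` are all hypothetical.

```python
import math

def exceeds_regret_budget(cumulative_loss, rounds, optimal_avg_loss,
                          regret_bound, slack=2.0):
    """Flag a class as likely misspecified.

    cumulative_loss:  total loss of a bandit algorithm run on the candidate class
    optimal_avg_loss: per-round expected loss of the optimal policy (assumed known)
    regret_bound:     function t -> regret the algorithm would guarantee after
                      t rounds if the class contained the optimal policy
    """
    excess = cumulative_loss - rounds * optimal_avg_loss
    return excess > slack * regret_bound(rounds)

# Hypothetical Exp4-style bound for a class with 16 policies and 3 actions.
def example_bound(t):
    return math.sqrt(3 * t * math.log(16))
```

If the class contains the optimal policy, the excess loss should stay within the algorithm's regret guarantee, so a large excess is evidence that a richer class is needed. A rigorous test would also have to account for random fluctuations in the observed losses, which this sketch ignores.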

For (non-contextual) multi-armed bandits, several lower bounds demonstrate that model selection is not possible. Lattimore (2015) shows that for multi-armed bandits, if we want to ensure $O(\sqrt{T})$ regret against a single fixed arm instead of the usual $O(\sqrt{KT})$ rate, we must incur $\Omega(K\sqrt{T})$ regret to one of the remaining arms in the worst case. This precludes a model selection guarantee of the form $\mathrm{Reg}(\mathcal{A}_m) \le \tilde{O}(\sqrt{|\mathcal{A}_m|\,T})$ for nested action sets $\mathcal{A}_1 \subseteq \cdots \subseteq \mathcal{A}_M$, which is a natural analogue of the guarantee in Open Problem 1a for bandits (this does not preclude the guarantee in Open Problem 1a itself, however, since there we pay for the maximum number of actions). Related lower bounds are also known for Lipschitz bandits (Locatelli and Carpentier, 2018; Krishnamurthy et al., 2019). On the positive side, Chatterji et al. (2019) show that, with distributional assumptions, it is possible to adapt between multi-armed bandits and linear contextual bandits.

4 Consequences and Connections to Other Problems

Switching Regret for Bandits.

In full-information online learning, algorithms for switching regret (Herbster and Warmuth, 1998) simultaneously ensure that for all sequences of actions $a^\star_1, \ldots, a^\star_T$, $\sum_{t=1}^{T} \ell_t(a_t) - \sum_{t=1}^{T} \ell_t(a^\star_t) \le \tilde{O}(\sqrt{ST})$, where $S$ denotes the number of switches in the sequence. In the (non-contextual) multi-armed bandit setting, with no prior knowledge of the number of switches $S$, the best guarantee we are aware of can be attained by combining the Bandits-over-Bandits strategy from Cheung et al. (2019) with Exp3 (Auer et al. (2002) achieves regret $\tilde{O}(\sqrt{SKT})$, but only when a bound on the switches is known a priori). A solution to Open Problem 1a would immediately yield a nearly-optimal switching regret bound of $\tilde{O}(\sqrt{SKT})$ for bandits by choosing the $m$th policy class to be the set of all sequences with at most $m$ switches (formally, this is accomplished by setting $x_t = t$ and letting $\Pi_m$ contain every $\pi : \{1, \ldots, T\} \to \mathcal{A}$ whose action sequence $\pi(1), \ldots, \pi(T)$ has at most $m$ switches). Solving Open Problem 1a would also lead to improvements in switching regret for contextual bandits.
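To see why the class of sequences with few switches is small enough for this reduction, one can count it directly. The sketch below (with a hypothetical function name) counts length-$T$ action sequences over $K$ actions with at most a given number of switches: pick the first action, pick which of the $T-1$ gaps contain a switch, and pick a different action at each switch.

```python
import math

def log_num_sequences(T, K, max_switches):
    """Natural log of the number of length-T sequences over K actions
    that change action at most max_switches times."""
    total = sum(
        math.comb(T - 1, s) * K * (K - 1) ** s
        for s in range(min(max_switches, T - 1) + 1)
    )
    return math.log(total)
```

Since $\sum_{s \le S} \binom{T-1}{s}(K-1)^s \le (KT)^S$, the log-cardinality is at most $(S+1)\ln(KT)$. Substituting this into a bound of order $\sqrt{KT \log|\Pi_S|}$ gives $\sqrt{SKT}$ up to logarithmic factors.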

Second-Order Regret Bounds for Online Learning.

Consider full-information online learning, and let $p_t$ denote the algorithm’s distribution over policies at time $t$. An unresolved COLT 2016 open problem of Freund (2016) asks whether there exists an algorithm for this setting with a second-order regret bound against the top $\epsilon$-quantile of policies for all $\epsilon$ simultaneously. A slight strengthening of Freund’s open problem asks for the analogous bound against every comparator distribution $\rho$ over policies, with the quantile term $\log(1/\epsilon)$ replaced by $\mathrm{KL}(\rho \,\|\, \mu)$ for a prior $\mu$ fixed in advance.

This strengthened bound implies the weaker quantile bound by choosing $\rho$ to be uniform over the top $\epsilon$-fraction of policies and $\mu$ to be the uniform distribution over all policies. While the $\log(1/\epsilon)$-type quantile bound does not seem to imply the strengthened bound directly, historically KL-based bounds have quickly followed quantile bounds (Chaudhuri et al., 2009; Luo and Schapire, 2015; Koolen and Van Erven, 2015).
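The step from a KL-based bound to a quantile bound rests on a one-line calculation: the KL divergence from the uniform distribution over a top $\epsilon$-fraction of policies to the uniform distribution over all policies equals $\log(1/\epsilon)$. A small numerical check, with illustrative sizes chosen for this example:

```python
import math

def kl_divergence(rho, mu):
    """KL(rho || mu) for discrete distributions given as equal-length lists."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0)

num_policies, num_top = 1000, 50                      # epsilon = 50 / 1000 = 0.05
prior = [1.0 / num_policies] * num_policies           # uniform over all policies
comparator = [1.0 / num_top] * num_top + [0.0] * (num_policies - num_top)
kl_value = kl_divergence(comparator, prior)           # equals log(1 / 0.05)
```

Each policy in the support contributes $(1/k)\log(N/k)$, and there are $k$ of them, so the sum is $\log(N/k) = \log(1/\epsilon)$.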

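The strengthened guarantee would immediately yield a positive resolution to Open Problem 1a via the following reduction: (1) choose $\mu(\pi) \propto 1/(m^2 |\Pi_m|)$ for all $\pi \in \Pi_m \setminus \Pi_{m-1}$; (2) to handle bandit feedback, draw $a_t \sim p_t(\cdot \mid x_t)$ and feed importance-weighted losses $\hat{\ell}_t(a) := \ell_t(a_t)\,\mathbb{1}\{a = a_t\} / p_t(a \mid x_t)$ into the full-information algorithm at each round, where $p_t(a \mid x_t) := \Pr_{\pi \sim p_t}[\pi(x_t) = a]$. Conversely, a lower bound showing that the guarantee in Open Problem 1a is not attainable would imply that no full-information algorithm can achieve the strengthened bound, which strongly suggests that the quantile bound in Freund’s open problem is also not attainable.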
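The importance-weighting step in this reduction can be checked exactly on a small example. The sketch below (function names are hypothetical, and all action probabilities are assumed strictly positive) computes the estimator for each possible played action and averages over the draw. It confirms that the estimate is unbiased and that its second moment under the sampling distribution is $\sum_a \ell(a)^2 \le K$, which is where the factor $K$ in bandit regret bounds comes from.

```python
def importance_weighted_loss(loss_of_played, played, probs):
    """Loss-vector estimate from bandit feedback: loss/prob on the played
    action, zero on every other action."""
    return [loss_of_played / probs[a] if a == played else 0.0
            for a in range(len(probs))]

def expected_estimate(loss, probs):
    """Exact expectation of the estimate over the draw of the played action."""
    K = len(probs)
    mean = [0.0] * K
    for played in range(K):
        est = importance_weighted_loss(loss[played], played, probs)
        for a in range(K):
            mean[a] += probs[played] * est[a]
    return mean

def expected_local_norm(loss, probs):
    """Exact expectation of sum_a probs[a] * estimate[a]**2."""
    K = len(probs)
    total = 0.0
    for played in range(K):
        est = importance_weighted_loss(loss[played], played, probs)
        total += probs[played] * sum(probs[a] * est[a] ** 2 for a in range(K))
    return total
```

For example, with `loss = [0.2, 0.9, 0.5]` and `probs = [0.5, 0.3, 0.2]`, `expected_estimate` recovers the true loss vector up to floating-point error, and `expected_local_norm` returns the sum of squared losses.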


References

  • Agarwal et al. (2017a) Alekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory, 2017a.
  • Agarwal et al. (2017b) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017b.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
  • Chatterji et al. (2019) Niladri S. Chatterji, Vidya Muthukumar, and Peter L. Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, 2019.
  • Chaudhuri et al. (2009) Kamalika Chaudhuri, Yoav Freund, and Daniel J. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems, 2009.
  • Cheung et al. (2019) Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In International Conference on Artificial Intelligence and Statistics, 2019.
  • Foster et al. (2015) Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, pages 3375–3383, 2015.
  • Foster et al. (2019) Dylan J. Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, 2019.
  • Freund (2016) Yoav Freund. Open problem: Second order regret bounds based on scaling time. In Conference on Learning Theory, 2016.
  • Herbster and Warmuth (1998) Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning, 1998.
  • Koolen and Van Erven (2015) Wouter M. Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, 2015.
  • Krishnamurthy et al. (2019) Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. In Conference on Learning Theory, 2019.
  • Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.
  • Lattimore (2015) Tor Lattimore. The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, 2015.
  • Locatelli and Carpentier (2018) Andrea Locatelli and Alexandra Carpentier. Adaptivity to smoothness in X-armed bandits. In Conference on Learning Theory, 2018.
  • Luo and Schapire (2015) Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, 2015.
  • Orabona and Pál (2016) Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, 2016.