1 Introduction
Model selection is the fundamental statistical task of choosing a hypothesis class using data, with statistical guarantees dating back to Vapnik’s structural risk minimization principle. Despite decades of research on model selection for supervised learning and the ubiquity of model selection procedures such as cross-validation in practice, very little is known about model selection in interactive learning and reinforcement learning settings where exploration is required. Focusing on contextual bandits, a simple reinforcement learning setting, we ask:
Can model selection guarantees be achieved in contextual bandit learning, where a learner must balance exploration and exploitation to make decisions online?
2 Problem Formulation
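We consider the adversarial contextual bandit setting (Auer et al., 2002). The setting is defined by a context space $\mathcal{X}$ and a finite action space $\mathcal{A} = \{1, \ldots, K\}$. The learner interacts with nature for $T$ rounds, where in round $t$: (1) nature selects a context $x_t \in \mathcal{X}$ and loss $\ell_t \in [0,1]^K$, (2) the learner observes $x_t$ and chooses action $a_t \in \mathcal{A}$, and (3) the learner observes $\ell_t(a_t)$. We allow for an adaptive adversary, so that $x_t$ and $\ell_t$ may depend on $a_1, \ldots, a_{t-1}$. In the usual problem formulation, the learner is given a policy class $\Pi \subseteq (\mathcal{X} \to \mathcal{A})$, and the goal is to minimize regret to $\Pi$:
$$\mathrm{Reg}(\Pi) := \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(a_t)\Big] - \min_{\pi \in \Pi} \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(\pi(x_t))\Big].$$
When $\Pi$ is finite, the well-known Exp4 algorithm (Auer et al., 2002) achieves the optimal regret bound of $O(\sqrt{KT \log|\Pi|})$.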
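To make the interaction protocol concrete, the following is a minimal sketch of Exp4 with a uniform prior on a toy instance. Everything about the instance is an assumption made for illustration: three contexts, three actions, a 0/1 loss that is zero only when the action equals the context, a policy class consisting of all 27 lookup tables, and the horizon and seed. The helper names `exp4` and `run_toy_instance` are likewise hypothetical. The learning rate is the standard tuning $\eta = \sqrt{2 \ln|\Pi| / (KT)}$.

```python
import itertools
import math
import random


def exp4(policies, contexts, losses, num_actions, eta, rng):
    """Exp4 with a uniform prior; returns the learner's cumulative loss."""
    log_w = [0.0] * len(policies)  # log-weights over policies
    total_loss = 0.0
    for x, loss in zip(contexts, losses):
        # Normalize the policy weights in a numerically stable way.
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        z = sum(w)
        recs = [pi(x) for pi in policies]  # each policy's recommended action
        # Action distribution induced by the distribution over policies.
        p = [0.0] * num_actions
        for wi, a in zip(w, recs):
            p[a] += wi / z
        a_t = rng.choices(range(num_actions), weights=p)[0]
        observed = loss[a_t]  # bandit feedback: only this entry is revealed
        total_loss += observed
        # Importance-weighted loss estimate, nonzero only for policies
        # that recommended the action actually played.
        estimate = observed / p[a_t]
        for i, a in enumerate(recs):
            if a == a_t:
                log_w[i] -= eta * estimate
    return total_loss


def run_toy_instance(T=2000, seed=0):
    """Hypothetical toy problem in which the policy pi(x) = x has zero loss."""
    rng = random.Random(seed)
    K = 3
    tables = list(itertools.product(range(K), repeat=3))
    policies = [lambda x, tbl=tbl: tbl[x] for tbl in tables]
    contexts = [rng.randrange(3) for _ in range(T)]
    losses = [[0.0 if a == x else 1.0 for a in range(K)] for x in contexts]
    eta = math.sqrt(2 * math.log(len(policies)) / (K * T))
    learner_loss = exp4(policies, contexts, losses, K, eta, rng)
    best_loss = min(
        sum(loss[pi(x)] for x, loss in zip(contexts, losses)) for pi in policies
    )
    return learner_loss - best_loss  # regret to the policy class


if __name__ == "__main__":
    print("regret:", run_toy_instance())
```

Note that the learning rate depends on $\log|\Pi|$, so the class must be fixed before the algorithm is run; this dependence is what makes adapting to an unknown class index nontrivial.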
The Model Selection Problem.
In the contextual bandit model selection problem, we assume that the policy class under consideration decomposes as a nested sequence $\Pi_1 \subset \Pi_2 \subset \cdots \subset \Pi_M = \Pi$.^{1}
^{1} It is also natural to consider infinite sequences of policy classes, but we restrict to finite sequences for simplicity.
The goal of the learner is to achieve low regret to all classes in the sequence simultaneously, with the regret to policy class $\Pi_m$ scaling only with $\log|\Pi_m|$. Intuitively, this provides a luckiness guarantee: if a good policy lies in a small policy class, the algorithm discovers this quickly.
To motivate the precise guarantee we ask for, let us recall what is known in the simpler full-information online learning setting, where the learner gets to see the entire loss vector $\ell_t$ at the end of each round. Here, the minimax rate is $\Theta(\sqrt{T \log|\Pi|})$, and it can be shown (Foster et al. (2015); see also Orabona and Pál (2016)) that a variant of the exponential weights algorithm guarantees
$$\mathrm{Reg}(\Pi_m) \le O\big(\sqrt{T \log(m\,|\Pi_m|)}\big) \quad \forall m \in \{1, \ldots, M\}. \tag{1}$$
In other words, by paying a modest additive overhead of $\log m$, we can compete with all policy classes simultaneously. The most basic variant of our open problem asks whether the natural analogue of (1) can be attained for contextual bandits.
Open Problem 1a.
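Design a contextual bandit algorithm that for any sequence $\Pi_1 \subset \cdots \subset \Pi_M$ ensures
$$\mathrm{Reg}(\Pi_m) \le O\big(\sqrt{KT \log(m\,|\Pi_m|)}\big) \quad \forall m \in \{1, \ldots, M\}. \tag{2}$$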
We also welcome the following weaker guarantees.
Open Problem 1b.
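Design a contextual bandit algorithm that for any sequence $\Pi_1 \subset \cdots \subset \Pi_M$ ensures either:
1. $\mathrm{Reg}(\Pi_m) \le \mathrm{poly}(K, M, \log T) \cdot \sqrt{T \log|\Pi_m|}$ for all $m$.
2. $\mathrm{Reg}(\Pi_m) \le \tilde{O}\big(T^{\alpha} \log^{1-\alpha}|\Pi_m|\big)$ for all $m$, where $\alpha \in [1/2, 1)$.
Alternatively, prove that no algorithm can achieve item 2 above for any value $\alpha \in [1/2, 1)$.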
The first item here differs from Open Problem 1a only in the dependence on $K$, $M$, and $\log T$ factors, which we do not believe represent the most challenging aspect of the problem. The second item is a further relaxation of the original guarantee. Here we simply ask that the model selection algorithm has regret sublinear in $T$ whenever $\log|\Pi_m|$ is sublinear. In other words, if policy class $\Pi_m$ is learnable on its own, the model selection algorithm should have sublinear regret to it. To attain this behavior it is essential that the exponents $\alpha$ and $1-\alpha$ sum to one. Indeed, it is relatively easy to design algorithms with exponents that do not sum to one,^{2} but we do not know of an algorithm satisfying item 2 above for any $\alpha < 1$. This stands in contrast to other problems involving adaptivity and data-dependence in contextual bandits (Agarwal et al., 2017a), where attaining adaptive guarantees with suboptimal dependence on $T$ is straightforward, and the primary challenge is to attain $\sqrt{T}$-type regret bounds. We also welcome a lower bound showing that this type of model selection guarantee is not possible for contextual bandits.
^{2} For example, we can attain regret $\tilde{O}(\sqrt{KT} \cdot \log|\Pi_m|)$ for all $m$ by running Exp4 with a particular prior over policies.
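The mismatch in exponents can be seen from the standard Exp4 analysis with a non-uniform prior. The following worked bound is a sketch under assumptions: $\mu$ is a prior over policies chosen here for illustration, $m(\pi)$ denotes the smallest index $m$ with $\pi \in \Pi_m$, and the learning rate $\eta$ is fixed in advance without knowledge of the target class.

```latex
% Exp4 with prior \mu and learning rate \eta satisfies, for every policy \pi,
\mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(a_t) - \sum_{t=1}^{T} \ell_t(\pi(x_t))\Big]
  \le \frac{\ln(1/\mu(\pi))}{\eta} + \frac{\eta K T}{2}.

% Illustrative prior: \mu(\pi) \ge \frac{1}{m(m+1)\,|\Pi_m|} with m = m(\pi).
% These weights sum to at most one, so normalizing only increases them.
% For any \pi \in \Pi_m this gives
\ln\frac{1}{\mu(\pi)} \le \ln|\Pi_m| + 2\ln(m+1).

% Class-independent tuning \eta = 1/\sqrt{KT} then yields
\mathrm{Reg}(\Pi_m) \le \sqrt{KT}\,\Big(\ln|\Pi_m| + 2\ln(m+1) + \tfrac{1}{2}\Big)
  \quad \text{for all } m.

% Class-dependent tuning \eta = \sqrt{2\ln(1/\mu(\pi))/(KT)} would instead give
% \sqrt{2KT\,(\ln|\Pi_m| + 2\ln(m+1))}, but it requires knowing m in advance.
```

Under these assumptions the exponents on $T$ and $\log|\Pi_m|$ are $1/2$ and $1$, which sum to $3/2$ rather than one.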
Stochastic Setting.
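The model selection problem for contextual bandits has yet to be solved even for the stochastic setting, and even when the model is well-specified. Here, we assume: (1) the pairs $(x_t, \ell_t)$ are drawn i.i.d. from a fixed distribution $\mathcal{D}$; (2) each class $\Pi_m$ is induced by a class of regression functions $\mathcal{F}_m \subseteq (\mathcal{X} \times \mathcal{A} \to [0,1])$, in the sense that $\Pi_m = \{\pi_f : f \in \mathcal{F}_m\}$, where $\pi_f(x) = \arg\min_{a \in \mathcal{A}} f(x, a)$; (3) the problem is realizable/well-specified in the sense that there exists an index $m^\star$ and a regression function $f^\star \in \mathcal{F}_{m^\star}$ such that $\mathbb{E}[\ell_t(a) \mid x_t = x] = f^\star(x, a)$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$.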
The final version of our open problem asks for a model selection guarantee when the problem is stochastic and well-specified. Note that these assumptions imply that the optimal unconstrained policy (in terms of expected loss) is $\pi_{f^\star}$. As such, here we only ask for a regret bound against class $\Pi_{m^\star}$. From an algorithmic perspective, this is the easiest version of the problem.
Open Problem 2.
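For some value $\alpha \in [1/2, 1)$, design an algorithm for contextual bandits that for any sequence $\mathcal{F}_1 \subset \cdots \subset \mathcal{F}_M$, whenever data is stochastic and realizable, ensures
$$\mathrm{Reg}(\Pi_{m^\star}) \le \tilde{O}\big(T^{\alpha} \log^{1-\alpha}|\mathcal{F}_{m^\star}|\big). \tag{3}$$
Alternatively, prove that no algorithm can achieve this guarantee for any value of $\alpha \in [1/2, 1)$.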
We offer $300 for the first solution to either Open Problem 1 or Open Problem 2.
3 Challenges and Partial Progress
Many natural algorithmic strategies for model selection fail under bandit feedback. These include (a) running Exp4 over all policies with a nonuniform prior adapted to the nested policy class structure, (b) the Corral aggregation strategy (Agarwal et al., 2017b), and (c) an adaptive version of the classical greedy strategy (Langford and Zhang, 2008). These strategies all require tuning parameters (e.g., the learning rate $\eta$) in terms of the class index of interest, and naive tuning gives guarantees of the form $\tilde{O}(T^{\alpha} \log^{\beta}|\Pi_m|)$ for $\alpha + \beta > 1$. Adaptive online learning algorithms like AdaNormalHedge (Luo and Schapire, 2015) and Squint (Koolen and Van Erven, 2015) also fail because they do not adequately handle bandit feedback.^{3} We refer the reader to Foster et al. (2019) for more details on these strategies in the context of model selection. The main point here is that model selection for contextual bandits appears to require new algorithmic ideas, even when we are satisfied with weak $T^{\alpha} \log^{1-\alpha}|\Pi_m|$-type rates where $\alpha > 1/2$.
^{3} Their regret bounds do not contain the usual “local norm” term used in the analysis of Exp4 and other bandit algorithms.
In a recent paper (Foster et al., 2019), we showed that a guarantee of the form (3) is achievable when $\mathcal{F}_m$ consists of linear functions in $d_m$ dimensions, under distributional assumptions on $\mathcal{D}$. Our strategy was inspired by the fact that if the optimal loss $L^\star$ is known, one can test if a given class contains the optimal policy by running a standard contextual bandit algorithm and checking whether it substantially underperforms relative to $L^\star$. In our linear setup, we showed that one can estimate a surrogate for the optimal loss $L^\star$ at a “sublinear” rate, which allowed us to run this testing strategy and achieve a guarantee akin to (3) with no prior information. However, we do not know if this strategy can succeed beyond specialized settings where sublinear loss estimation is possible. Along these lines, Locatelli and Carpentier (2018) also observe that knowledge of $L^\star$ can enable adaptive guarantees in Lipschitz bandits, where adaptivity is not possible in the absence of such information (such lower bounds do not appear to carry over to the contextual case).
For (non-contextual) multi-armed bandits, several lower bounds demonstrate that model selection is not possible. Lattimore (2015) shows that for multi-armed bandits, if we want to ensure $O(\sqrt{T})$ regret against a single fixed arm instead of the usual $O(\sqrt{KT})$ rate, we must incur $\Omega(K\sqrt{T})$ regret to one of the remaining arms in the worst case. This precludes a model selection guarantee of the form $\tilde{O}(\sqrt{|\mathcal{A}_m| T})$ for nested action sets $\mathcal{A}_1 \subset \mathcal{A}_2 \subset \cdots \subset \mathcal{A}_M$, which is a natural analogue of (2) for bandits.^{4} Related lower bounds are also known for Lipschitz bandits (Locatelli and Carpentier, 2018; Krishnamurthy et al., 2019). On the positive side, Chatterji et al. (2019) show that, with distributional assumptions, it is possible to adapt between multi-armed bandits and linear contextual bandits.
^{4} This does not preclude a guarantee of the form (2), however, since we pay for the maximum number of actions.
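The incompatibility for multi-armed bandits can be made explicit with a short calculation. The sketch below assumes the lower bound takes the form stated in Lattimore (2015), namely that worst-case regret at most $B$ against one arm forces worst-case regret $\Omega(TK/B)$ against some other arm; the constants $C$ and $c$ and the choice $\mathcal{A}_1 = \{1\}$ are illustrative.

```latex
% Suppose an algorithm guaranteed, for nested action sets,
%   Reg(\mathcal{A}_m) \le C \sqrt{|\mathcal{A}_m|\, T}   for all m.
% Take \mathcal{A}_1 = \{1\} (a single arm) and \mathcal{A}_M = \{1, \dots, K\}.
\mathrm{Reg}(\mathcal{A}_1) \le C\sqrt{T} =: B
\;\Longrightarrow\;
\max_{a \ne 1} \mathrm{Reg}(\{a\}) \ge c\,\frac{TK}{B} = \frac{c}{C}\,K\sqrt{T}.

% The same guarantee at m = M requires
\max_{a \ne 1} \mathrm{Reg}(\{a\}) \le \mathrm{Reg}(\mathcal{A}_M) \le C\sqrt{KT},

% and the two inequalities are incompatible once \sqrt{K} > C^2 / c.
```

A bound that pays $\sqrt{K}$ for every class, with $K$ the maximum number of actions, avoids this contradiction because it never promises $O(\sqrt{T})$ regret against a single arm.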
4 Consequences and Connections to Other Problems
Switching Regret for Bandits.
In full-information online learning, algorithms for switching regret (Herbster and Warmuth, 1998) simultaneously ensure that for all sequences of actions $a^\star_1, \ldots, a^\star_T$, $\sum_{t=1}^{T} \ell_t(a_t) - \sum_{t=1}^{T} \ell_t(a^\star_t) \le \tilde{O}(\sqrt{ST})$, where $S$ denotes the number of switches in the sequence. In the (non-contextual) multi-armed bandit setting, with no prior knowledge of the number of switches $S$, the best guarantee we are aware of is $\tilde{O}(\sqrt{SKT} + T^{3/4})$, which can be attained by combining the Bandits-over-Bandits strategy from Cheung et al. (2019) with Exp3.^{5} A solution to Open Problem 1a would immediately yield a nearly-optimal switching regret bound of $\tilde{O}(\sqrt{SKT})$ for bandits by choosing the $m$th policy class to be the set of all sequences with at most $m$ switches.^{6} Solving Open Problem 1a would also lead to improvements in switching regret for contextual bandits.
^{5} Auer et al. (2002) achieves regret $\tilde{O}(\sqrt{SKT})$, but only when a bound on the number of switches is known a priori.
^{6} Formally, this is accomplished by setting $x_t = t$ and $\Pi_m = \{\pi : \sum_{t=1}^{T-1} \mathbb{1}\{\pi(t) \ne \pi(t+1)\} \le m\}$.
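The size of these switching classes can be checked directly. The sketch below counts action sequences of length $T$ over $K$ actions with at most $m$ switches, compares the closed form against brute-force enumeration on a small instance, and checks the bound $\log|\Pi_m| \le (m+1)\log(KT)$. The function names and the small values of $T$, $K$, and $m$ are chosen here for illustration.

```python
from itertools import product
from math import comb, log


def count_closed_form(T, K, m):
    """Number of sequences in {0..K-1}^T with at most m switches."""
    # With exactly s switches: choose s of the T-1 gaps between rounds,
    # K choices for the first action, K-1 choices at each switch.
    return sum(
        comb(T - 1, s) * K * (K - 1) ** s for s in range(min(m, T - 1) + 1)
    )


def count_brute_force(T, K, m):
    """Same count by enumerating all K^T sequences (small T only)."""
    return sum(
        1
        for seq in product(range(K), repeat=T)
        if sum(a != b for a, b in zip(seq, seq[1:])) <= m
    )


def log_size_bound_holds(T, K, m):
    """Check log|Pi_m| <= (m + 1) * log(K * T)."""
    return log(count_closed_form(T, K, m)) <= (m + 1) * log(K * T)


if __name__ == "__main__":
    print(count_closed_form(5, 3, 2), count_brute_force(5, 3, 2))
```

If the regret to class $\Pi_m$ scales as $\sqrt{KT \log|\Pi_m|}$, this bound on $\log|\Pi_m|$ translates into $\tilde{O}(\sqrt{mKT})$ regret against sequences with at most $m$ switches.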
Second-Order Regret Bounds for Online Learning.
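Consider full-information online learning, and let $p_t$ denote the algorithm’s distribution over policies at time $t$. An unresolved COLT 2016 open problem of Freund (2016) asks whether there exists an algorithm for this setting with regret at most $O\big(\sqrt{\sum_{t=1}^{T} \mathbb{E}_{\pi \sim p_t}[\ell_t(\pi)^2] \cdot \log(1/\epsilon)}\big)$ against the top $\epsilon$-quantile of policies for all $\epsilon > 0$ simultaneously. A slight strengthening of Freund’s open problem asks for the following bound, for a prior $\mu$ over policies and all comparator distributions $\rho$ simultaneously:
$$\sum_{t=1}^{T} \mathbb{E}_{\pi \sim p_t}[\ell_t(\pi)] - \sum_{t=1}^{T} \mathbb{E}_{\pi \sim \rho}[\ell_t(\pi)] \le O\Big(\sqrt{\mathrm{KL}(\rho \,\|\, \mu) \cdot \sum_{t=1}^{T} \mathbb{E}_{\pi \sim p_t}[\ell_t(\pi)^2]}\Big). \tag{4}$$
The bound (4) implies the weaker quantile bound by choosing $\rho$ to be uniform over the top $\epsilon$ fraction of policies and $\mu$ to be the uniform distribution over all policies. While the $\log(1/\epsilon)$-type quantile bound does not seem to imply (4) directly, historically KL-based bounds have quickly followed quantile bounds (Chaudhuri et al., 2009; Luo and Schapire, 2015; Koolen and Van Erven, 2015).
The guarantee in (4) would immediately yield a positive resolution to Open Problem 1a via the following reduction: (1) Choose $\mu(\pi) \propto \frac{1}{m^2 |\Pi_m|}$ for all $\pi \in \Pi_m \setminus \Pi_{m-1}$; (2) To handle bandit feedback, draw $\pi_t \sim p_t$, play $a_t = \pi_t(x_t)$, and feed importance-weighted losses $\hat{\ell}_t$ into the full-information algorithm at each round, where $\hat{\ell}_t(\pi) = \ell_t(a_t)\,\mathbb{1}\{\pi(x_t) = a_t\} / P_t(a_t)$ and $P_t(a) = \sum_{\pi} p_t(\pi)\,\mathbb{1}\{\pi(x_t) = a\}$. Conversely, a lower bound showing that (2) is not attainable would imply that no full-information algorithm can achieve (4), which strongly suggests that the quantile bound in Freund’s open problem is also not attainable.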
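Step (2) of the reduction rests on two properties of importance weighting: the estimate is unbiased for each policy's loss, and its second moment under the algorithm's distribution is at most the number of actions. The sketch below verifies both exactly on a tiny hypothetical example using rational arithmetic. The function names and the numbers are illustrative, and the estimator shown is the standard one used by Exp4, which may differ in detail from what a solution to the open problem would use.

```python
from fractions import Fraction


def action_distribution(policy_probs, recs, num_actions):
    """Probability of each action when a policy is drawn from policy_probs."""
    p = [Fraction(0)] * num_actions
    for q, a in zip(policy_probs, recs):
        p[a] += q
    return p


def estimated_losses(recs, action_probs, chosen, observed_loss):
    """Importance-weighted loss estimate for every policy from one observation."""
    return [
        observed_loss / action_probs[chosen] if a == chosen else Fraction(0)
        for a in recs
    ]


def exact_moments(policy_probs, recs, loss, num_actions):
    """Exact mean of each estimate and the expected second-moment term."""
    p = action_distribution(policy_probs, recs, num_actions)
    mean = [Fraction(0)] * len(recs)
    second_moment = Fraction(0)
    for a in range(num_actions):
        if p[a] == 0:
            continue  # this action is never played
        est = estimated_losses(recs, p, a, loss[a])
        for i, e in enumerate(est):
            mean[i] += p[a] * e
        second_moment += p[a] * sum(q * e * e for q, e in zip(policy_probs, est))
    return mean, second_moment


if __name__ == "__main__":
    probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
    recs = [0, 1, 0]  # recommended action of each policy at the current context
    loss = [Fraction(1, 2), Fraction(1), Fraction(1, 4)]
    print(exact_moments(probs, recs, loss, num_actions=3))
```

The second-moment term equals the sum of squared losses over the actions that can be played, so it never exceeds $K$; summed over $T$ rounds this is the source of the $KT$ factor inside the square root.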
References
Agarwal et al. (2017a) Alekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E Schapire. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory, 2017a.
Agarwal et al. (2017b) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017b.
Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
Chatterji et al. (2019) Niladri S Chatterji, Vidya Muthukumar, and Peter L Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, 2019.
Chaudhuri et al. (2009) Kamalika Chaudhuri, Yoav Freund, and Daniel J Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems, 2009.
Cheung et al. (2019) Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In International Conference on Artificial Intelligence and Statistics, 2019.
Foster et al. (2015) Dylan J Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, pages 3375–3383, 2015.
Foster et al. (2019) Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, 2019.
Freund (2016) Yoav Freund. Open problem: Second order regret bounds based on scaling time. In Conference on Learning Theory, 2016.
Herbster and Warmuth (1998) Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine Learning, 1998.
Koolen and Van Erven (2015) Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, 2015.
Krishnamurthy et al. (2019) Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. In Conference on Learning Theory, 2019.
Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.
Lattimore (2015) Tor Lattimore. The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, 2015.
Locatelli and Carpentier (2018) Andrea Locatelli and Alexandra Carpentier. Adaptivity to smoothness in X-armed bandits. In Conference on Learning Theory, 2018.
Luo and Schapire (2015) Haipeng Luo and Robert E Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, 2015.
Orabona and Pál (2016) Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, 2016.