1 Introduction
Contextual multi-armed bandits are a fundamental problem in online learning (Auer et al., 2002; Langford and Zhang, 2007; Chu et al., 2011; Abbasi-Yadkori et al., 2011). The contextual bandit problem proceeds as a repeated game between a learner and an adversary. At every round of the game, the adversary prepares a pair of a context and a loss over an action space; the learner observes the context, selects an action from the action space, and then observes only the loss of the selected action. The goal of the learner is to minimize their cumulative loss. The performance measure, known as regret, is the difference between the learner's cumulative loss and the smallest loss of a fixed policy, belonging to an a priori determined policy class, mapping contexts to actions. Given a single contextual bandit instance with a finite-sized policy class $\Pi$, the well-known Exp4 algorithm (Auer et al., 2002) achieves the optimal regret bound of $O(\sqrt{KT\log|\Pi|})$, where $K$ is the number of actions and $T$ the time horizon. Regret guarantees degrade with the complexity of the policy class, so a learner might want to leverage "guesses" about the optimal policy. Given nested policy classes $\Pi_1 \subseteq \Pi_2 \subseteq \dots \subseteq \Pi_M$, a learner would ideally suffer regret scaling only with the complexity of $\Pi_{m^\star}$, the smallest policy class containing the optimal policy $\pi^\star$. While these kinds of results are obtainable in full-information games, in which the learner gets to observe the loss of all actions, they are impossible for multi-armed bandits (Lattimore, 2015). In some aspects, contextual bandits are an intermediate setting between full information and multi-armed bandits, and it was unknown whether model selection is possible. Foster et al. (2020b) stated model selection in contextual bandits as a relevant open problem at COLT 2020. Any positive result for model selection in contextual bandits would imply a general way to treat multi-armed bandits with a switching baseline.
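To make the protocol concrete, the following is a minimal sketch of an Exp4-style update (our own illustration: the policy representation, the deterministic loss table, and the step size are assumptions for the example, not taken from the paper):

```python
import math
import random


def exp4(policies, contexts, losses, eta, K):
    """Minimal Exp4-style sketch: `policies` map contexts to arms in range(K),
    `losses[t][a]` is the loss of arm a at round t (values in [0, 1])."""
    w = [1.0] * len(policies)  # exponential weights over policies
    total_loss = 0.0
    for t, x in enumerate(contexts):
        W = sum(w)
        # Mix the policy weights into an arm-sampling distribution.
        p = [sum(wi for wi, pi in zip(w, policies) if pi(x) == a) / W
             for a in range(K)]
        a_t = random.choices(range(K), weights=p)[0]
        total_loss += losses[t][a_t]
        # Importance-weighted loss estimate: nonzero only for the played arm.
        lhat = [losses[t][a_t] / p[a_t] if a == a_t else 0.0 for a in range(K)]
        # Each policy pays the estimated loss of the arm it recommended.
        w = [wi * math.exp(-eta * lhat[pi(x)]) for wi, pi in zip(w, policies)]
    return total_loss
```

With a policy that always plays a zero-loss arm in the class, the cumulative loss quickly concentrates on that policy.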
Furthermore, any negative result is conjectured to imply negative results for another unresolved open problem on second-order bounds for full-information games (Freund, 2016).
In this paper, we give a fairly complete answer to the questions above.


We provide a Pareto frontier of upper bounds for model selection in contextual bandits with finite-sized policy classes.

We present matching lower bounds showing that our upper bounds are tight, thereby resolving the motivating open problem (Foster et al., 2020b).

We present a novel impossibility result for adapting to the number of switch points under adaptive adversaries (Besbes et al., 2014).

We negatively resolve an open problem on second-order bounds for full-information games (Freund, 2016).
Related work.
A problem closely related to contextual bandits with finite policy classes is the linear contextual bandit problem. Model selection in linear contextual bandit problems has recently received significant attention; however, none of these results transfer to the finite policy case. In the linear bandit problem, the $m$-th policy class is a subset of $\mathbb{R}^{d_m}$ and the losses are linear, that is, $\ell_t(a) = \langle \phi(x_t, a), \theta^\star \rangle + \xi_t$. Here $\phi$ is a feature embedding mapping context–action pairs into Euclidean space, $\xi_t$ is mean-zero sub-Gaussian noise with variance proxy equal to one, and $\theta^\star$ is an unknown parameter. Foster et al. (2019) assume the contexts are also drawn from an unknown distribution and propose an algorithm whose regret depends on the smallest eigenvalue of the covariance matrix of the feature embeddings. Pacchiano et al. (2020b) propose a different approach based on the corralling algorithm of Agarwal et al. (2017), which enjoys regret bounds for both finite and arbitrary action sets. Later, Pacchiano et al. (2020a) design an algorithm which enjoys a gap-dependent guarantee under the assumption that all of the misspecified models have large regret; under such an assumption, the authors recover near-optimal regret bounds for arbitrary action sets. Cutkosky et al. (2020) also recover comparable model selection bounds through their corralling algorithm. Ghosh et al. (2021) propose an algorithm for the finite-arm setting whose guarantee depends on the smallest entry, in absolute value, of $\theta^\star$; their algorithm also enjoys a similar guarantee for arbitrary action sets. Zhu and Nowak (2021) show that it is impossible to achieve the desired regret guarantees without additional assumptions, by proving a result similar to that of Lattimore (2015). The work of Lattimore (2015) states that in the stochastic multi-armed bandit problem it is impossible to achieve small regret against a fixed arm without suffering large regret against a different arm. Chatterji et al. (2020) study the problem of selecting between an algorithm for the linear contextual bandit problem and the simple stochastic multi-armed bandit problem; that is, they aim to simultaneously achieve a regret guarantee which is instance-dependent optimal for the stochastic multi-armed bandit problem and optimal for the finite-arm stochastic linear bandit problem. The proposed results only hold under additional assumptions. More generally, the study of the corralling problem, in which we are presented with multiple bandit algorithms and would like to perform as well as the best one, was initiated by Agarwal et al. (2017). Other works which fall into the corralling framework are that of Foster et al. (2020a), who study the misspecified linear contextual bandit problem, in which the observed losses are linear up to some unknown misspecification, and the work of Arora et al. (2021), who study the corralling problem for multi-armed stochastic bandit algorithms.
Our work also shows an impossibility result for the stochastic bandit problem with non-stationary rewards. Auer (2002) first investigates the problem under the assumption that there are distributional changes throughout the game and gives an algorithm with a dynamic regret bound of order $\widetilde{O}(\sqrt{SKT})$, under the assumption that the number of changes $S$ is known. (In dynamic regret, the comparator is the best action for the current distribution.) Auer et al. (2019) achieve similar regret guarantees without assuming that the number of switches (or changes) of the distribution is known. A different measure of change is the total variation of the distributions, $V_T$. Multiple works give dynamic regret bounds of the order $\widetilde{O}(V_T^{1/3}T^{2/3})$ (hiding the dependence on the size of the policy class) when $V_T$ is known, including for extensions of the multi-armed bandit problem like contextual bandits and linear contextual bandits (Besbes et al., 2014; Luo et al., 2018; Besbes et al., 2015; Wei et al., 2017). Cheung et al. (2019); Zhao et al. (2020) further show algorithms which enjoy a parameter-free regret bound of the order $\widetilde{O}(V_T^{1/4}T^{3/4})$ (hiding the dependence on dimensionality) for the linear bandit problem. The lower bound in Table 1 might seem to contradict such results; in Section 5.1 we carefully explain why this is not the case.
Finally, our lower bounds apply to the problem of devising an algorithm which simultaneously enjoys a second-order bound over any fraction of experts. Cesa-Bianchi et al. (2007) first investigate the problem of second-order bounds for the experts problem, in which the proposed algorithm maintains a distribution over the set of experts during every round of the game; for stochastic losses, their algorithm enjoys a regret guarantee scaling with the variance of the losses. Chaudhuri et al. (2009); Chernov and Vovk (2010); Luo and Schapire (2015); Koolen and Van Erven (2015) study a different experts problem, in which the comparator class for the regret changes from the best expert in hindsight to the uniform distribution over the best $\varepsilon$-fraction of experts, for an arbitrary positive $\varepsilon$. The above works propose algorithms which achieve a regret bound of order $O(\sqrt{T\log(1/\varepsilon)})$ for all $\varepsilon$ simultaneously. Freund (2016) asks if there exists an algorithm which enjoys both guarantees at the same time; that is, does there exist an algorithm with a second-order regret bound which holds simultaneously for all positive $\varepsilon$?

Table 1: Upper and lower bounds for general contextual bandits (rows: adaptive adversary; oblivious adversary / stochastic) and for the S-switch problem (rows: adaptive adversary; oblivious adversary; stochastic).
Notation.
For any $n \in \mathbb{N}$, $[n]$ denotes the set $\{1, \dots, n\}$. The $\widetilde{O}$ notation hides polylogarithmic factors in the horizon $T$ and the number of arms $K$, but not in the size of the policy classes $(\Pi_m)_{m \in [M]}$.
2 Problem setting
We consider the contextual bandit problem with general policy classes of finite size. There are $K$ arms and $M$ nested policy classes $\Pi_1 \subseteq \Pi_2 \subseteq \dots \subseteq \Pi_M$, where a policy $\pi$ is a mapping from an arbitrary context space $\mathcal{X}$ into the set of arms $[K]$. The game is played for $T$ rounds, and at any time $t$, the agent observes a context $x_t$, selects an arm $a_t$, and observes the loss $\ell_t(a_t)$ from an otherwise unobserved loss vector $\ell_t \in [0,1]^K$. We measure an algorithm's performance in terms of pseudo-regret, which is the expected cumulative loss of the player relative to following the best fixed policy in hindsight:
$$R_T(\Pi_m) = \max_{\pi \in \Pi_m} \mathbb{E}\left[\sum_{t=1}^T \ell_t(a_t) - \sum_{t=1}^T \ell_t(\pi(x_t))\right].$$
Environments.
We distinguish between stochastic environments and oblivious or adaptive adversaries. In stochastic environments, there is an unknown distribution $\mathcal{D}$ such that the pairs $(x_t, \ell_t)$ are i.i.d. samples from $\mathcal{D}$. In the adversarial regime, the distributions can change over time, i.e. $(x_t, \ell_t) \sim \mathcal{D}_t$. When the choices $(\mathcal{D}_t)_{t=1}^T$ are fixed at the beginning of the game, the adversary is called oblivious, while an adaptive adversary can choose $\mathcal{D}_t$ based on all observations up to time $t-1$.
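The game protocol and the pseudo-regret definition can be illustrated with a small simulation (entirely our own toy setup: the context distribution and the loss values are arbitrary choices, and the losses are kept deterministic for simplicity):

```python
import random


def pseudo_regret(policy_classes, play, T=1000, K=3, seed=0):
    """Estimate the regret of the decision rule `play` against the best fixed
    policy in hindsight over the given (nested) policy classes."""
    rng = random.Random(seed)
    contexts = [rng.random() for _ in range(T)]
    # Toy losses: arm a has loss (a + 1) / (K + 1) every round, so arm 0 is best.
    losses = [[(a + 1) / (K + 1) for a in range(K)] for _ in range(T)]
    learner_loss = sum(losses[t][play(x)] for t, x in enumerate(contexts))
    # Best fixed policy over the union of all classes.
    best = min(
        sum(losses[t][pi(x)] for t, x in enumerate(contexts))
        for cls in policy_classes for pi in cls
    )
    return learner_loss - best
```

A learner that always plays arm 1 against classes containing the constant arm-0 policy suffers regret exactly the cumulative gap between the two arms.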
Often, the stochastic–adversarial hybrid problem has been studied with adversarially chosen contexts but stochastic losses. In our work, all upper bounds hold in the stronger notion where both the losses and the contexts are adaptive, while the lower bounds hold for the weaker notion where only the contexts are adaptive.
Open problem (Foster et al., 2020b).
The regret upper bounds for all regimes introduced above, for a fixed policy class $\Pi$ of finite size, are of the order $\widetilde{O}(\sqrt{KT\log|\Pi|})$ and can be achieved by the Exp4 algorithm (Auer et al., 2002). The question asked by Foster et al. (2020b) is: for a nested sequence of policy classes $\Pi_1 \subseteq \dots \subseteq \Pi_M$, is there a universal constant $\alpha < 1$ such that a regret bound of
$$R_T(\Pi_m) = \widetilde{O}\big(\mathrm{poly}(K)\, T^{\alpha} (\log|\Pi_m|)^{1-\alpha}\big) \qquad (1)$$
is obtainable for all $m \in [M]$ simultaneously?
W.l.o.g. we can assume that $\log|\Pi_{m+1}| \geq 2\log|\Pi_m|$ for all $m$. Otherwise, we take a subset of the policy classes that includes $\Pi_M$ and in which consecutive policy classes are at least squared in size. Due to nestedness, any guarantee on this subset of models implies, up to constants, the same bounds on the full set.
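This reduction can be sketched as a greedy pass over the class sizes (our own illustration; working from the largest class downwards guarantees that the largest class is kept, and that every dropped class is nested inside a kept class of at most its squared size, so its log-complexity is inflated by at most a factor of two):

```python
def square_subset(class_sizes):
    """Keep a subsequence of nested class sizes in which each kept class is at
    least the square of the previously kept one (log-sizes at least double).

    Iterating from the largest class down ensures Pi_M always survives."""
    kept = [class_sizes[-1]]
    for s in reversed(class_sizes[:-1]):
        if s ** 2 <= kept[-1]:
            kept.append(s)
    return list(reversed(kept))
```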
S-switch
A motivating example for studying nested policy classes is the S-switch problem. The context is simply the time, $x_t = t$, and the $S$-th policy class $\Pi_S$ is
the set of policies that change their action not more than $S$ times over the horizon. Any positive result for contextual bandits with finite-sized policy classes would provide algorithms that adapt to the number of switch points, since $\log|\Pi_S| = O(S\log(KT))$. To make clear which problem we are considering, we refer to the regret in the switching problem explicitly as the switching regret.
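A back-of-envelope count behind $\log|\Pi_S| = O(S\log(KT))$ (the exact counting convention here is our own assumption: choose at most $S$ switch points among the $T-1$ round boundaries, an initial arm, and a new arm after each switch):

```python
import math


def log_num_switch_policies(T, K, S):
    """log |Pi_S| for policies on horizon T over K arms with at most S changes:
    sum over s <= S of C(T-1, s) switch-point choices, K initial arms,
    and (K-1) choices of a different arm after each switch."""
    count = sum(math.comb(T - 1, s) * K * (K - 1) ** s for s in range(S + 1))
    return math.log(count)
```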
Next, we define the class of proper algorithms, which choose their policy at every time step independently of the context $x_t$. Restricting our attention to such algorithms greatly reduces the technicalities of the lower bound proofs in the non-adaptive regimes. The lower bound for this class of algorithms is also at the core of the argument for adaptive (improper) algorithms in stochastic environments.
Definition 1.
We call an algorithm proper if, at any time $t$, the algorithm follows the recommendation of a policy $\pi_t \in \Pi_M$, and the choice of $\pi_t$ is independent of the context $x_t$.
Example.
EXP4 is proper.
3 Upper bounds
In this section, we generalize the HedgedFTRL algorithm (Foster et al., 2020a) to obtain an upper bound for model selection over a collection of low-regret base algorithms.
Theorem 1.
For any $\alpha \in [1/2, 1)$, we can tune HedgedFTRL over a selection of $M$ instances of EXP4 operating on the policy classes $\Pi_1, \dots, \Pi_M$, such that the following regret bound holds uniformly over all $m \in [M]$.
HedgedFTRL.
HedgedFTRL, introduced in Foster et al. (2020a), is a type of Follow the Regularized Leader (FTRL) algorithm which is used as a corralling algorithm (Agarwal et al., 2017). At every round $t$, the algorithm chooses to play one of $M$ base algorithms. Base algorithm
$i$ is selected with probability
$p_{t,i}$, where $p_t$ is a distribution over base algorithms determined by the FTRL rule, in which the cumulative loss is the sum of the (bias-adjusted) loss vectors, the potential is induced by the Tsallis entropy, the step size is determined by the problem parameters, and a special bias term is added, which we now explain. The algorithm maintains a vector which tracks the variance of the loss estimators, a vector of regret upper bounds for the base algorithms, and a threshold depending on the base algorithms. At any time $t$, after selecting base algorithm $i_t$ to play the current action, the top (corralling) algorithm observes its loss and gives as feedback an importance-weighted loss to the selected base, $\hat\ell_{t,i} = \frac{\ell_t(a_t)}{p_{t,i}}\,\mathbb{1}[i = i_t]$. Whenever the base played at round $t$ would exceed its threshold, the loss fed to the top algorithm is adjusted with a bias, such that the cumulative biases track the variance of the loss estimators. This has been shown to be always possible (Foster et al., 2020a). The condition for adjusting the biases reads
(2)
The condition in Equation 2 is motivated in a similar way to the stability condition in the work of Agarwal et al. (2017). Algorithm 1 constructs an unbiased estimator for the loss vector and updates each base algorithm accordingly. A similar update is present in the Corral algorithm of Agarwal et al. (2017), in which each of the base learners also receives an importance-weighted loss. The regret of the base learners is assumed to scale with the variance of the importance-weighted losses. This assumption is natural and in practice holds for all bandit or expert algorithms. The scaling of the regret, however, must be appropriately bounded, as Agarwal et al. (2017) show; otherwise, no corralling or model selection guarantees are possible. Formally, the following stability property is required. If an algorithm enjoys a regret bound $R(T)$ under an environment with loss sequence $(\ell_t)_{t=1}^T$, then the algorithm is stable if it enjoys a regret bound of the order $\sqrt{\rho}\,R(T)$ under the environment of importance-weighted losses $(\hat\ell_t)_{t=1}^T$, where $\rho$ is the maximum variance of the losses and the expectation is taken with respect to any randomness in the environment. Essentially all bandit and expert algorithms used in practice are stable; e.g., Exp4 is stable. The bias terms in Algorithm 1 intuitively cancel the additional variance introduced by the importance-weighted losses, and this is why we require the biases to satisfy Equation 2.
Theorem 2.
Given a collection of stable base algorithms in the above sense, and any $\alpha \in [1/2, 1)$, the regret of HedgedFTRL (hedged Tsallis-INF) with appropriately tuned parameters satisfies a simultaneous regret bound over all base algorithms.
The analysis follows closely the proof of Foster et al. (2020a) and is postponed to Appendix A.
Theorem 2 recovers the bounds of Pacchiano et al. (2020b) for model selection in linear bandits, but holds in more general settings, including adaptive adversaries in both contexts and losses. It requires neither nestedness of the policy classes nor that the policies operate on the same action or context space.
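The importance-weighted feedback underlying these corralling guarantees can be checked numerically: the estimate fed to base $j$ is unbiased, but its variance scales with $1/p_j$ — which is exactly the extra variance the stability property and the bias terms must absorb. (A self-contained toy, not the paper's algorithm; the probabilities and losses are arbitrary choices.)

```python
import random


def importance_weighted_feedback(p, losses, T, seed=0):
    """Sample a base i_t ~ p each round and feed base j the importance-weighted
    loss losses[j] / p[j] * 1{j == i_t}.  Returns the per-base average fed
    loss, which concentrates around the true losses (unbiasedness)."""
    rng = random.Random(seed)
    fed = [0.0] * len(p)
    for _ in range(T):
        i = rng.choices(range(len(p)), weights=p)[0]
        fed[i] += losses[i] / p[i]  # only the sampled base gets nonzero loss
    return [f / T for f in fed]
```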
4 Lower bounds
We present lower bounds that match the upper bounds from Section 3 up to logarithmic factors, thereby proving a tight Pareto frontier of worstcase regret guarantees in model selection for contextual bandits.
In the first part of this section, we consider a special instance of the S-switch problem with an adaptive adversary. The proof technique, based on Pinsker's inequality, is folklore and leads to the following theorem.
Theorem 3.
For any $\alpha \in [1/2, 1)$, sufficiently large $T$, and any algorithm with regret guarantee
as above, there exists for any number of switches $S$ a stochastic bandit problem such that
This bound holds even when the agent is informed about the number of switches up to time $t$.
Since this bound holds even when the agent is informed about when a switch occurs, we can restrict the policy class to policies that only switch arms whenever the agent is informed about a switch in the environment. This is a contextual bandit problem whose context is the number of past switches, with nested policy classes. Hence Theorem 3 implies a lower bound for model selection in contextual bandits.

In the second part of the section, we consider the stochastic regime. Our lower bound construction is non-standard and relies on bounding the total variation between problem instances directly, without the use of Pinsker's inequality.
Theorem 4.
There exist policy classes (in open problem 2 of Foster et al. (2020b), the authors ask about model-based contextual bandits with realizability; our lower bound provides an instance of that) such that, if the regret of a proper algorithm is upper bounded in any environment by
then there exists an environment such that
These theorems directly provide negative answers to the open problem of Foster et al. (2020b).
Corollary 1.
There is no $\alpha < 1$ that satisfies the regret guarantee of open problem (1), for any algorithm in the adaptive adversarial regime or for any proper algorithm in the stochastic case.
Proof.
By Theorems 3 and 4, for any $\alpha < 1$ and any algorithm satisfying the assumed regret guarantee, there exist environments in which the regret exceeds the bound of open problem (1).
∎
Finally, we disprove the open problem in the stochastic case for any algorithm.
Theorem 5.
No algorithm (proper or improper) can satisfy the requirements of open problem (1) for all stochastic environments.
We present the high level proof ideas in the following subsections and the detailed proof in Appendix B.
4.1 Adaptive adversary: the S-switch problem
We present the adaptive environment in which model selection fails, together with the proof of Theorem 3.
The adversary switches the reward distribution up to $S$ many times, thereby segmenting time into phases. We denote by $s$ the counter of phases and assume the agent is given this information. For each phase $s$, the adversary selects an optimal arm $a_s^\star$ uniformly at random among the first $K-1$ arms. During phase $s$, the losses are i.i.d. Bernoulli random variables whose means favor $a_s^\star$ by a gap over all other arms. In the final phase, the losses are fixed until the end of the game. The adversary decides on the switching points based on an adaptive strategy: a switch from phase $s$ to $s+1$ occurs when the player has played the optimal arm sufficiently often in phase $s$. We can see this problem either as a special case of the S-switch problem, or alternatively as a contextual bandit problem with nested policy classes.
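A simulation sketch of this adaptive adversary (the precise switching trigger, the gap placement, and all parameter values are assumptions on our part; the formal construction is in Appendix B):

```python
import random


def adaptive_switch_env(agent, T, K, S, gap, trigger, seed=0):
    """Sketch of the adaptive switching adversary: in each phase a uniformly
    random best arm among the first K-1 arms has mean loss 1/2 - gap, all
    others 1/2.  The adversary starts a new phase as soon as the current best
    arm has been played `trigger` times (an assumed switching rule)."""
    rng = random.Random(seed)
    best, plays_of_best, phase = rng.randrange(K - 1), 0, 0
    total_loss = 0.0
    for t in range(T):
        a = agent(t, phase)  # the agent is told the phase counter
        mean = 0.5 - gap if a == best else 0.5
        total_loss += 1.0 if rng.random() < mean else 0.0
        if a == best:
            plays_of_best += 1
            if plays_of_best >= trigger and phase < S:
                # New phase: redraw the best arm, reset the play counter.
                phase, best, plays_of_best = phase + 1, rng.randrange(K - 1), 0
    return total_loss, phase
```

An agent that keeps exploiting the current best arm drives the adversary through all $S$ switches and forfeits the knowledge it gained each time.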
The lower bound proof for the S-switch problem relies on the following lemma, which is proven in Appendix B.
Lemma 1.
Let an agent interact with a $K$-armed bandit problem with centered Bernoulli losses and a uniformly random best arm of gap $\Delta$, for an adaptive number of time steps $\tau$. If the probability of the stopping event is at least a constant, then the regret after $\tau$ timesteps, conditioned on that event, is lower bounded by $\Omega(K/\Delta)$.
Informally, this lemma says that, conditioned on transitioning from phase $s$ to phase $s+1$, the agent has suffered regret $\Omega(K/\Delta)$ against arm $a_s^\star$ during phase $s$.
Informal proof of Theorem 3.
The adversary's strategy is designed such that in each phase it allows the player to interact with the environment just long enough to discover the best action $a_s^\star$; then a new phase begins, preventing the player from exploiting knowledge of $a_s^\star$. By Lemma 1, this ensures that the player suffers regret $\Omega(K/\Delta)$ during each completed phase. If the agent succeeds in finding $a_s^\star$ for all phases $s$, then the regret against the non-switching baseline is large. By the assumption on the maximum regret and an appropriate choice of $S$ and $\Delta$, we can ensure that the agent must fail to discover all $a_s^\star$ with constant probability, and thus incurs large regret against the optimal switching baseline. Tuning $S$ and $\Delta$ yields the desired theorem. The formal argument with the explicit choice of parameters is found in Appendix B. ∎
4.2 Stochastic lower bound
We now present the stochastic environment used for the impossibility results in Theorems 5 and 4.
There are $N+1$ environments $\nu_0, \nu_1, \dots, \nu_N$ with policies $\pi_0, \pi_1, \dots, \pi_N$ and three actions. In all environments, $\pi_0$ always chooses action 3, while $\pi_1, \dots, \pi_N$ play an action from $\{1, 2\}$ uniformly at random. (In other words, the context is sampled uniformly at random from $\{1, 2\}$, and each $\pi_i$ with $i \geq 1$ plays the context.)
In each environment, the losses of actions 1 and 2 at any time step are coupled: conditioned on independent Bernoulli random variables with suitable means, each determines the other. Action 3 gives a constant loss in all environments.
Let us unwrap these definitions. Playing either action 1 or action 2, which we call revealing actions, yields full information about all random variables at time $t$, due to the coupling of the losses of actions 1 and 2 and the non-randomness of the loss of action 3. On the other hand, playing action 3 only allows the agent to observe its own loss, which has the same distribution in all environments; hence there is no information gained at all.
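A sketch of the resulting information structure as we read it (the coupling details are our assumption; what matters is that a revealing action exposes a vector of independent Bernoullis with a single component biased by the gap in environment $i$, and none biased in $\nu_0$):

```python
import random


def sample_round(env, N, gap, rng):
    """One round of the stochastic construction, reduced to its information
    content: W is a vector of N independent Bernoullis; component `env` is
    biased by `gap` (env=-1 models nu_0, where no component is biased).
    Revealing actions observe all of W; the safe action observes nothing."""
    return [1 if rng.random() < 0.5 + (gap if i == env else 0.0) else 0
            for i in range(N)]
```

Averaging many revealed rounds makes the biased component stand out, which is exactly the statistical problem the lower bound must control.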
We know from full-information lower bounds that, for optimal tuning of the gap, one suffers regret $\Omega(\sqrt{T\log N})$ against the policy class $\{\pi_1, \dots, \pi_N\}$, due to the difficulty of identifying the optimal policy. For a smaller regret against the smaller policy class, one needs to confirm or reject the hypothesis $\nu = \nu_0$ faster than it takes to identify the optimal policy. Existing techniques do not answer whether this is possible, and our main contribution in this section is to show that the hardness of rejecting $\nu_0$ is of the same order as identifying the exact environment.
For the remainder of this section, it will be useful to consider a reparametrization of the random variables. Let $W_t$ be the vector of losses incurred by the policies $\pi_1, \dots, \pi_N$ at time $t$. We can easily see that $W_t$, together with the context, is sufficient to uniquely determine the losses of actions 1 and 2. Furthermore, $W_t$ is always a vector of independent Bernoulli random variables, which is independent of the context (we want to emphasize that it is only independent of the context at time $t$, not of the full history). In environments $\nu_i$, the $i$-th component of $W_t$ is a biased Bernoulli, while all other components have mean $\frac{1}{2}$. In $\nu_0$, no component is biased. As before, the loss of action 3 does not provide any information, since its distribution conditioned on the context is identical in all environments (see Lemma 4 in Appendix B for a formal proof).
Under this reparametrization, and ignoring non-informative bits of randomness, the problem of distinguishing $\nu_0$ from $\nu_1, \dots, \nu_N$ looks as follows. For time steps $t = 1, \dots, T$, decide whether to play a revealing action and observe $W_t$ (potentially taking past observations into account), and use the observed $W_t$'s to distinguish between the environments. Proper algorithms simplify the problem even further, because selecting the policy independently of the context implies that the decision of observing $W_t$ is also independent of the context (any policy except $\pi_0$ allows one to observe $W_t$ under any context). Hence, for proper algorithms, we can reason directly about how many samples are required to distinguish between environments. This problem bears similarity to the property testing of dictator functions (Balcan et al., 2012) and to sparse linear regression (Ingster et al., 2010) (the setting of Ingster et al. (2010) differs from ours, as they consider an asymptotic regime where both the feature sparsity and the dimensionality of the problem go to infinity, while for us the sparsity is fixed to one); however, there is no clear way to apply such results to our setting. The following lemma shows the difficulty of testing for the hypothesis $\nu = \nu_0$.
Lemma 2.
Let the bias, the number of environments, and the sample budget satisfy the conditions stated in Appendix B. If the algorithm chooses whether to reveal $W_t$ independently of $W_t$, and if the total number of times $W_t$ is revealed is bounded almost surely, then for any measurable event it holds that
The proof of Lemma 2 is deferred to Section B.3. The high-level idea is to directly bound the total variation between the environments over the space of outcomes of the revealed $W_t$'s, utilizing the Berry–Esseen inequality instead of going through Pinsker's inequality. This step is key to achieving the right dependence on $N$ in the bound.
For readers familiar with lower bound proofs for bandits and full information, this lemma should not come as a huge surprise. For a $T$-round full-information game, it tells us that we can bias a single arm by up to order $\sqrt{\log(N)/T}$ without this being detectable. This directly recovers the well-known lower bound of $\Omega(\sqrt{T\log N})$ for full information via the argument used for bandit lower bounds. However, this result goes beyond what is known in the literature. We not only show that one cannot reliably detect the biased arm, but that one cannot even reliably detect whether any biased arm is present at all. This property is the key to showing the lower bound of Theorem 4.
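The undetectability claim can be illustrated numerically: the total variation distance between $n$ fair coin flips and $n$ flips with a bias of order $1/\sqrt{n}$ stays bounded away from $1$, so no test can reliably tell them apart (a standalone computation for intuition, not part of the proof):

```python
import math


def tv_binomial(n, p, q):
    """Exact total variation distance between Binomial(n, p) and
    Binomial(n, q), computed from the probability mass functions."""
    def pmf(k, r):
        return math.comb(n, k) * r ** k * (1 - r) ** (n - k)
    return 0.5 * sum(abs(pmf(k, p) - pmf(k, q)) for k in range(n + 1))
```

With $n = 100$ and a bias of $0.05 = 0.5/\sqrt{n}$, the distance stays moderate, while a constant bias drives it toward 1.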
Informal proof of Theorem 4.
Under environment $\nu_0$, observing $W_t$ for many timesteps incurs large regret against $\pi_0$. Using the assumption on the regret and Markov's inequality, we obtain an upper bound on the expected number of observations, which holds with constant probability. We can construct an algorithm that never observes more than a fixed number of samples by following the original algorithm until it has played a revealing action that many times, and then committing to policy $\pi_0$ (action 3). Since the algorithm is proper, we can define the revealed information as the observed $W_t$'s during the times where the algorithm plays a revealing action. For this revealed information, we tune the remaining parameters such that the conditions of Lemma 2 are satisfied. Let $E$ be the event that the modified algorithm plays exactly the maximal number of revealing actions (i.e. the original algorithm plays at least that many revealing actions); then $E$ happens with constant probability under $\nu_0$. Thus, there exists an environment $\nu_i$ such that the algorithm plays fewer than the required number of revealing actions with constant probability, which incurs large regret. The theorem follows from tuning the gap and $N$, which is done formally in Appendix B. ∎
Improper algorithms.
Even though we are not able to extend the lower bound proof uniformly over all values of $\alpha$ to improper algorithms, we can still show that no algorithm (proper or improper) can solve open problem (1) for stochastic environments.
The key is the following generalization of Lemma 2, which is proven in the appendix.
Lemma 3.
Let the parameters be as in Lemma 2. If the total number of times $W_t$ is revealed is bounded almost surely, then for any measurable event it holds that
This holds even if the agent can take all contexts and previous observations into account when deciding whether to pick a revealing action at any timestep.
Informal proof of Theorem 5.
The proof is analogous to that of Theorem 4; however, we use Lemma 3 to bound the difference in probability of the event $E$ under $\nu_0$ and $\nu_i$. The key is to find a tuning such that the right-hand side of Lemma 3 does not exceed a constant. With the parameters set appropriately, it follows immediately that the right-hand side in Lemma 3 goes to $0$ as $T$ grows. Following the same arguments as in Theorem 4, there exists a sufficiently large $T$ beyond which there always exists an environment such that the regret is linear in the gap, thereby contradicting the open problem. The formal proof is deferred to Appendix B. ∎
5 Implications
The relevance of open problem Eq. 1 has been motivated by its potential implications for other problems, such as the S-switch bandit problem and an unresolved COLT 2016 open problem on improved second-order bounds for full information. Our negative result for Eq. 1 indeed leads to the expected insights.
5.1 S-switch
Our lower bound in the adaptive regime shows that adapting to the number of switches is hopeless if the timing of the switch points is not independent of the player's actions. Any algorithm adaptive to the number of switches in the regime with an oblivious adversary must break in the adaptive case, which rules out bandit-over-bandit approaches based on importance sampling (Agarwal et al., 2017). The successful algorithm proposed in Cheung et al. (2019) uses a bandit-over-bandit approach without importance sampling; nonetheless, all of its components have adaptive adversarial guarantees. The algorithm splits the time horizon into equal epochs of length $H$. It initializes EXP3 with one arm per candidate learning rate in a grid. For each epoch, the EXP3 top algorithm samples an arm and starts a freshly initialized instance of EXP3.S using the learning rate corresponding to the selected arm. This instance is run over the full epoch of length $H$, its losses are accumulated, and the accumulated loss is fed back to the EXP3 top algorithm.

If all algorithms in the protocol enjoy guarantees against adaptive adversaries, why does bandit-over-bandit break against adaptive adversaries? Adaptive adversaries are assumed to pick the loss $\ell_t$ independently of the agent's arm choice at round $t$. In the bandit-over-bandit protocol, the loss of the arm of the top algorithm depends on the losses that the selected base suffers during the epoch. An adaptive adversary can adapt the losses within the epoch based on the actions of the base algorithm, which means the loss is not chosen independently of the action. Hence the protocol is broken, and the adaptive adversarial regret bounds do not hold.
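The protocol just described can be sketched as follows (a simplified stand-in in the spirit of Cheung et al. (2019): the base here is a fixed-action stub rather than EXP3.S, the top algorithm is a standard EXP3 with its internal importance-weighted update, and the step-size tuning is our own choice):

```python
import math
import random


def bandit_over_bandit(make_base, grid, T, H, loss_fn, seed=0):
    """Bandit-over-bandit sketch: an EXP3 top algorithm picks a parameter from
    `grid` for each epoch of length H, runs a freshly initialized base with
    it, and updates itself with the epoch's normalized accumulated loss."""
    rng = random.Random(seed)
    w = [1.0] * len(grid)
    eta = math.sqrt(math.log(len(grid)) / max(1, T // H))
    total = 0.0
    for epoch in range(T // H):
        W = sum(w)
        p = [wi / W for wi in w]
        i = rng.choices(range(len(grid)), weights=p)[0]
        base = make_base(grid[i])  # fresh base instance per epoch
        epoch_loss = sum(loss_fn(epoch * H + t, base.act(rng)) for t in range(H))
        total += epoch_loss
        # EXP3's internal importance-weighted update on the epoch-level loss.
        w[i] *= math.exp(-eta * (epoch_loss / H) / p[i])
    return total
```

Note that the epoch-level loss fed to the top algorithm is a function of the base's in-epoch actions — exactly the dependence that an adaptive adversary exploits in the discussion above.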
5.2 Second order bounds for full information.
In an unresolved COLT 2016 open problem, Freund (2016) asks whether it is possible to ensure a regret bound of order
(3)
against the best $\varepsilon$-fraction of policies, simultaneously for all $\varepsilon$. We go even a step further and show that the lower bound construction from Section 4.2 directly provides a negative answer, for any $\alpha < 1$, to the looser bound
(4)
Theorem 6.
Theorem 6 has the following interpretation. For any fixed $\alpha < 1$, there is no algorithm which enjoys a regret upper bound as in Equation 3 for all problem instances satisfying the conditions of the theorem. This implies we cannot hope for a polynomial improvement, in terms of the time horizon, over the existing bound of $O(\sqrt{T\log(1/\varepsilon)})$. The detailed proof is found in Appendix B. The high-level idea is to initialize the full-information algorithm satisfying Eq. 4 with a sufficient number of copies of the baseline policy and to feed importance-weighted losses of the experts (i.e. policies) to that algorithm.
As we mention in Section 1, the case $\alpha = 1$ is obtainable. Our reduction relates adaptation to the variance to the model selection problem. As in Eq. 1, $\alpha$ is the tradeoff between time and complexity. An algorithm satisfying Eq. 4 with $\alpha = 1$ merely allows one to recover the trivial bound for model selection, and hence does not lead to a contradiction.
6 Conclusion
We derived the Pareto frontier of minimax regret for model selection in contextual bandits. Our results resolve several open problems (Foster et al., 2020b; Freund, 2016).

Acknowledgments. We would like to thank Haipeng Luo and Yoav Freund for discussions about our lower bound proofs. We thank Tor Lattimore for pointing us to the technicalities required for bounding the total variation for improper algorithms.
References
 Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pages 2312–2320, 2011.
 Agarwal et al. [2017] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory. PMLR, 2017.

 Arora et al. [2021] Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics, pages 2116–2124. PMLR, 2021.
 Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002.
 Auer et al. [2019] Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158. PMLR, 2019.
 Balcan et al. [2012] Maria-Florina Balcan, Eric Blais, Avrim Blum, and Liu Yang. Active property testing. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 21–30. IEEE, 2012.
 Besbes et al. [2014] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems, 27:199–207, 2014.
 Besbes et al. [2015] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Nonstationary stochastic optimization. Operations research, 63(5):1227–1244, 2015.
 Esseen [1942] Carl-Gustav Esseen. On the Liapounoff limit of error in the theory of probability. Arkiv för matematik, astronomi och fysik, A:1–19, 1942.
 Cesa-Bianchi et al. [2007] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2):321–352, 2007.
 Chatterji et al. [2020] Niladri Chatterji, Vidya Muthukumar, and Peter Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 1844–1854. PMLR, 2020.
 Chaudhuri et al. [2009] Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameterfree hedging algorithm. arXiv preprint arXiv:0903.2851, 2009.
 Chernov and Vovk [2010] Alexey Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. arXiv preprint arXiv:1006.0475, 2010.
 Cheung et al. [2019] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 2019.
 Chu et al. [2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011.
 Cutkosky et al. [2020] Ashok Cutkosky, Abhimanyu Das, and Manish Purohit. Upper confidence bounds for combining stochastic bandits. arXiv preprint arXiv:2012.13115, 2020.
 Foster et al. [2019] Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. arXiv preprint arXiv:1906.00531, 2019.
 Foster et al. [2020a] Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33, 2020a.
 Foster et al. [2020b] Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. Open problem: Model selection for contextual bandits. In Conference on Learning Theory, pages 3842–3846. PMLR, 2020b.
 Freund [2016] Yoav Freund. Open problem: Second order regret bounds based on scaling time. In Conference on Learning Theory, pages 1651–1654. PMLR, 2016.
 Ghosh et al. [2021] Avishek Ghosh, Abishek Sankararaman, and Ramchandran Kannan. Problemcomplexity adaptive model selection for stochastic linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 1396–1404. PMLR, 2021.
 Ingster et al. [2010] Yuri I Ingster, Alexandre B Tsybakov, and Nicolas Verzelen. Detection boundary in sparse regression. Electronic Journal of Statistics, 4:1476–1526, 2010.
 Koolen and Van Erven [2015] Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175. PMLR, 2015.
 Langford and Zhang [2007] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20, 2007.
 Lattimore [2015] Tor Lattimore. The Pareto regret frontier for bandits. arXiv preprint arXiv:1511.00048, 2015.
 Luo and Schapire [2015] Haipeng Luo and Robert E Schapire. Achieving all with no parameters: Adaptive NormalHedge. arXiv preprint arXiv:1502.05934, 2015.
 Luo et al. [2018] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, pages 1739–1776. PMLR, 2018.
 Pacchiano et al. [2020a] Aldo Pacchiano, Christoph Dann, Claudio Gentile, and Peter Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020a.
 Pacchiano et al. [2020b] Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704, 2020b.
 Tyurin [2009] Ilya Tyurin. New estimates of the convergence rate in the Lyapunov theorem. arXiv preprint arXiv:0912.0726, 2009.
 Wei et al. [2017] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Tracking the best expert in non-stationary stochastic environments. arXiv preprint arXiv:1712.00578, 2017.
 Zhao et al. [2020] Peng Zhao, Lijun Zhang, Yuan Jiang, and Zhi-Hua Zhou. A simple approach for non-stationary linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 746–755. PMLR, 2020.
 Zhu and Nowak [2021] Yinglun Zhu and Robert Nowak. Pareto optimal model selection in linear bandits. arXiv preprint arXiv:2102.06593, 2021.
 Zimmert and Seldin [2021] Julian Zimmert and Yevgeny Seldin. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28):1–49, 2021.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Our main regret upper bounds are found in Theorem 1 and Theorem 2. Our main lower bounds are found in Theorem 3 and Theorem 4, with Corollary 1 solving the COLT 2020 open problem and Theorem 6 solving the COLT 2016 open problem.

Did you describe the limitations of your work? See paragraph regarding removing the properness requirement at the end of page 8.

Did you discuss any potential negative societal impacts of your work? This paper is theoretical in nature and we do not foresee any immediate societal impacts.

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results?

Did you include complete proofs of all theoretical results? For complete proofs we refer the reader to the appendix.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Upper bound proofs
Proof of Theorem 2.
Denote by the arm that algorithm would choose if it were selected during round . We decompose the regret as
where the last line follows from the assumption of the theorem. Bounding the first term requires some basic properties from the FTRL analysis; see, e.g., Zimmert and Seldin [2021]. Define , . For Tsallis-INF with a constant learning rate (the proof can be adapted to time-dependent learning rates), we have the following properties
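For reference, and up to constants that may differ from the formulation in Zimmert and Seldin [2021], the Tsallis-INF iterate with constant learning rate \(\eta\) is the FTRL update with the \(1/2\)-Tsallis-entropy regularizer:
\[
x_t \;=\; \operatorname*{arg\,min}_{x \in \Delta_{K-1}} \Big\{ \big\langle x, \textstyle\sum_{s<t} \hat\ell_s \big\rangle \;-\; \frac{1}{\eta} \sum_{i=1}^{K} \sqrt{x_i} \Big\},
\]
where \(\hat\ell_s\) are the importance-weighted loss estimates and \(\Delta_{K-1}\) is the probability simplex over the \(K\) arms.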
The standard FTRL proof (e.g. Zimmert and Seldin [2021]) shows that
The first term is
Let us consider the terms and . First, using the definition of the conjugate function, we know that
Further by Young’s inequality it holds that
The above two displays imply
Thus we can bound
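As a side illustration (not part of the proof), the Tsallis-INF distribution above can be computed numerically: the first-order conditions of the FTRL update with regularizer \(-\frac{1}{\eta}\sum_i \sqrt{x_i}\) give weights of the form \(w_i = (2\eta(L_i - z))^{-2}\) for a normalizer \(z < \min_i L_i\), which can be found by Newton's method as in Zimmert and Seldin [2021]. The function name, initialization, and iteration count below are our own choices:

```python
import math

def tsallis_inf_distribution(losses, eta, iters=50):
    """FTRL distribution for the 1/2-Tsallis regularizer -(1/eta) * sum(sqrt(x_i)).

    The first-order conditions give w_i = 1 / (2*eta*(L_i - z))**2 for a
    normalizer z < min_i L_i; z is found by Newton's method so that the
    weights sum to one.
    """
    K = len(losses)
    # This initialization guarantees z < min(losses) and sum of weights <= 1.
    z = min(losses) - math.sqrt(K) / (2.0 * eta)
    for _ in range(iters):
        w = [1.0 / (2.0 * eta * (L - z)) ** 2 for L in losses]
        f = sum(w) - 1.0                             # normalization residual
        fp = sum(4.0 * eta * wi ** 1.5 for wi in w)  # f'(z) > 0
        z -= f / fp  # Newton step; by convexity of f, z never overshoots the root
    return [1.0 / (2.0 * eta * (L - z)) ** 2 for L in losses]
```

With equal losses the update returns the uniform distribution, and arms with smaller cumulative loss receive larger probability, as expected.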