1 Introduction
The contextual bandit (CB) problem [39, 26, 29] is a foundational paradigm for online decision-making. In this problem, the decision-maker takes one out of $K$ actions as a function of contextual information that is available in advance; this function is chosen from a fixed policy class and is typically learned from past outcomes. Most work on CB has centered on designing algorithms that minimize regret with respect to the best such function in hindsight; a particular non-triviality lies in doing this in a computationally efficient, scalable manner [2, 36, 16]. When the rewards are realizable under the chosen policy class, this is now an essentially solved problem [17, 35].
A complementary problem to regret minimization with respect to a fixed policy class is choosing the policy class that is best for the problem at hand. This constitutes a data-driven model selection problem, and its importance is paramount in CB, as selecting a policy class that either underfits or overfits the data leads to highly suboptimal accrued rewards. To see why, it is instructive to consider the simplest nontrivial instance of such model selection, which involves deciding whether to use the contexts (with a fixed policy class, say, of $d$-dimensional linear functions) or simply run a multi-armed bandit (MAB) algorithm. Making this choice a priori is suboptimal one way or another: if we choose a MAB algorithm, we obtain the optimal regret with respect to the best fixed action, but the latter may be highly suboptimal if the rewards depend on the context. On the other hand, if we choose an off-the-shelf linear CB algorithm, we incur regret scaling with the dimension $d$ even when the rewards do not depend on the contexts as a consequence of overfitting; this is highly suboptimal compared to the guarantee that we could have obtained had we known about the simpler structure. This example demonstrates that model selection must be data-driven. In essence, we wish to design a single algorithm that achieves the best of both worlds, retaining the linear CB regret guarantee while adapting on-the-fly to hidden simpler structure if it exists.
Focusing on the MAB-vs-linear setting, the strongest variant of the model selection objective asks:
Objective 1. Can we design a single algorithm that simultaneously achieves the respective minimax-optimal rates of $\widetilde{O}(\sqrt{KT})$ under simple MAB structure (when it exists) and $\widetilde{O}(\sqrt{dT})$ under $d$-dimensional linear CB structure?
A related but weaker objective was also proposed in a COLT 2020 open problem [14] (Foster et al. [14] ask a general form of this question in a statistical learning setup, which specializes to the MAB-vs-linear objective stated here):
Objective 2. Can we design a single algorithm that simultaneously achieves the rate $\widetilde{O}(K^{\alpha} T^{1-\alpha})$ under simple MAB structure and $\widetilde{O}(d^{\alpha} T^{1-\alpha})$ under linear CB structure, for some $\alpha > 0$?
The importance of the $T^{1-\alpha}$-type rate was highlighted by Foster et al. [14], as it verifies that model selection is possible whenever the underlying class is learnable, which is the case if and only if $d = o(T)$.
Even in the simplest instance of model selection between a multi-armed bandit and a linear contextual bandit, achieving either Objective 1 or Objective 2 under minimal assumptions remains open. Existing approaches that address either the stronger Objective 1 [9] or the weaker Objective 2 [15] make restrictive assumptions regarding the conditioning (what we will call diversity) of the contexts. (We note here that the assumptions made across the two works are of varying strength: in particular, while [9] assumes that the context corresponding to each action is well-conditioned, [15] only requires well-conditionedness of the distribution of the average of the contexts across actions. See Sections 1.2 and 2.2 for more details on the distinctions between the two approaches.) Other, more data-agnostic approaches [3, 32, 31, 28] achieve neither of the above objectives. This leads us to ask whether we can design a universal model selection approach that is data-agnostic (other than requiring a probability model on the contexts) and achieves either Objective 1 or 2.

Another important question is the adaptivity of approaches to situations in which model selection is especially tractable. At the heart of effective data-driven model selection is a meta-level exploration-vs-exploitation tradeoff: while we need to exploit the currently believed simpler model structure, we also need to explore sufficiently to discover potential complex model structure. Almost all approaches to model selection incorporate forced exploration of an $\epsilon$-greedy type to navigate this tradeoff; however, such exploration may not always be needed. Indeed, [9] use no forced exploration in their approach and thereby achieve the optimal guarantee of Objective 1; however, their approach only works under restrictive diversity assumptions. It is natural to ask whether we can design data-adaptive exploration schedules that employ forced exploration only when it is absolutely needed, thus recovering the strongest possible guarantee (Objective 1) in favorable situations and a weaker-but-still-desirable guarantee (Objective 2) otherwise.
1.1 Our contributions
From the above discussion, it is clear that algorithm design for model selection that satisfies the criteria posed in [14] involves two nontrivial components: a) designing an efficient statistical test to distinguish between simple and complex model structure that works under minimal assumptions, and b) designing an exploration schedule that ensures sufficiently expressive data is collected for a) to succeed. In this paper, we advance the state-of-the-art for both of these components in the following ways:

We design a new test based on eigenvalue thresholding that works for all stochastic sub-Gaussian contexts; in contrast to [9] and [15], it does not require any type of context diversity. We critically utilize the fact that “low-energy” directions can be thresholded and ignored when estimating the gap in error between model classes. See Theorem 1 for our new model selection guarantee, which only requires stochasticity of the contexts in order to meet Objective 2.
We also design a data-adaptive exploration schedule that performs forced exploration only when necessary. This approach meets Objective 2 under the action-averaged feature diversity assumption made in [15], but also meets the stronger Objective 1 under the stronger assumption that the features of each action are diverse (made in [9]). See Theorem 2 for a precise statement of our new adaptive guarantee on model selection.
Taken together, our results advance our understanding of model selection for contextual bandits, by demonstrating how statistical approaches can yield universal (i.e., nearly assumption-free) and adaptive guarantees.
1.2 Related work
While model selection is a central and classical topic in machine learning and statistics, most results primarily apply to supervised offline learning and full-information online learning. Only recently has attention turned to model selection in online partial-information settings, including contextual bandits. Here, we focus on this growing body of work, which we organize by overarching algorithmic principles.
Corralling approaches.
The first line of work constitutes a hierarchical learning scheme where a meta-algorithm uses bandit techniques to compete with many (contextual) bandit algorithms running as base learners. One of the first such approaches is the Corral algorithm of Agarwal et al. [3], which uses online mirror descent with the log-barrier regularizer as the meta-algorithm. Subsequent work focuses on adapting Corral to the stochastic setting [31, 28] and developing UCB-style meta-algorithms [12, 4]. These approaches are quite general and can often be used with abstract nonlinear function classes. However, they do not seem able to meet either of Objectives 1 or 2 in general. In our setting, these approaches yield regret under hidden MAB structure that still scales polynomially with the dimension $d$, which clearly cannot be expressed in the form $K^{\alpha} T^{1-\alpha}$ for any value of $\alpha > 0$. Consequently, these approaches still leave the problem of model selection as described in [14] open.
Statistical approaches.
The second line of algorithmic approaches involves constructing statistical tests for model misspecification. This approach was initially used in the context of model selection concurrently by Foster et al. [15] and Chatterji et al. [9], who focus on the linear setting. At a high level, these papers develop efficient misspecification tests under certain covariate assumptions and use these tests to obtain $T^{1-\alpha}$-style model selection guarantees. In particular, Foster et al. [15] use a “sublinear” square loss estimator under somewhat mild covariate assumptions to obtain $\widetilde{O}(d^{1/3} T^{2/3})$ regret, while Chatterji et al. [9] obtain $\widetilde{O}(\sqrt{dT})$ regret under stronger covariate assumptions. As these two works are the foundation for our results, we discuss these papers in detail in the sequel.
Several recent papers extend statistical testing approaches in several ways. Ghosh et al. [18] estimate the support of the parameter vector, which fundamentally incurs a dependence on the magnitude of the smallest nonzero coefficient. Beyond the linear setting, Cutkosky et al. [11] use the “putative” regret bound for each model class directly to test for misspecification, while Ghosh et al. [19] and Krishnamurthy and Athey [24] consider general function classes with realizability. While these latter approaches are more general than ours, they cannot be directly used to obtain our results. Indeed, central to our results (and those of Foster et al. [15]) is the fact that our statistical test provides a fast rate for detecting misspecification; this guarantee is quantitatively better than what is provided by the putative regret bound, requires carefully adjusting the exploration schedule, and is not available for general function classes.

Other related work.
We briefly mention two peripherally related lines of work. The first is on representation selection in bandits and reinforcement learning [33, 40], which involves identifying a feature mapping with favorable properties from a small class of candidate mappings. While this is somewhat reminiscent of the model selection problem, the main differences are that in representation selection all mappings are of the same dimensionality and realizable, and the goal is to achieve much faster regret rates by leveraging additional structural properties of the “good” representation.

The second line of work is on Pareto optimality in non-contextual bandits and related problems. Beginning with the result of Lattimore [27], these results show that certain non-uniform regret guarantees are not achievable in various bandit settings. For example, [27] shows that, in $K$-armed bandit problems, one cannot achieve regret much smaller than $\sqrt{T}$ to one specific arm while still guaranteeing $\sqrt{T}$-type regret to the rest. Such results have been extended to both linear and Lipschitz non-contextual bandits [41, 30, 23], as well as Lipschitz contextual bandits under margin assumptions [20], and they establish that model selection is not possible in these settings. However, these ideas have not been extended to standard contextual bandit settings, which is our focus.
2 Setup
Notation.
We use boldface to denote vectors and matrices (e.g. $\mathbf{x}$ to denote a vector, and $x$ to denote a scalar). For any value of $n$, $[n]$ denotes the finite set $\{1, \ldots, n\}$. We use $\|\cdot\|_2$ to denote the Euclidean norm of a vector, and $\|\cdot\|_{\mathrm{op}}$ to denote the operator norm of a matrix. We use $\langle \mathbf{u}, \mathbf{v} \rangle$ to denote the Euclidean inner product between vectors $\mathbf{u}$ and $\mathbf{v}$. We use big-Oh notation in the main text; $O_K(\cdot)$ hides dependences on the number of actions $K$, and $\widetilde{O}(\cdot)$ denotes a bound that hides logarithmic factors and holds with high probability.
2.1 The bandit-vs-contextual bandit problem
The simplest instance of model selection involves a $d$-dimensional linear contextual bandit problem with possibly hidden multi-armed bandit structure. This model was proposed as an initial point of study in [9]. Concretely, $K$ actions (which we henceforth call arms) are available to the decision-maker at every round, and $T$ denotes the total number of rounds. At round $t \in [T]$, the reward of each arm $a \in [K]$ is given by
(1)  $r_{a,t} = \mu_a + \langle x_{a,t}, \theta^* \rangle + \eta_{a,t},$
where $\mu_a$ denotes the bias of arm $a$, $x_{a,t}$ denotes the $d$-dimensional context corresponding to arm $a$ at round $t$, and $\eta_{a,t}$ denotes random noise. Finally, $\theta^* \in \mathbb{R}^d$ denotes an unknown parameter. We make the following standard assumptions on the problem parameters.

The biases $\{\mu_a\}_{a \in [K]}$ are assumed to be bounded between $-1$ and $1$.

The unknown parameter is assumed to be bounded, i.e. $\|\theta^*\|_2 \leq 1$.

The contexts corresponding to each arm are assumed to be i.i.d. across rounds $t \in [T]$, and sub-Gaussian. We denote by $\Sigma_a$ the covariance matrix of the context of arm $a$, and additionally note that its operator norm is bounded as a consequence of the sub-Gaussian assumption. Without loss of generality, we assume that for each arm the mean of the context is equal to the zero vector (this is without loss of generality, since any nonzero mean can be incorporated into the bias $\mu_a$).

The noise is i.i.d. across arms and rounds, centered, and sub-Gaussian.
We denote the achieved pseudo-regret with respect to the best fixed arm (the standard metric for a MAB problem) by $R^{\mathrm{MAB}}(T)$, and the achieved pseudo-regret with respect to the best policy under a $d$-dimensional linear model (the standard metric for a linear CB problem) by $R^{\mathrm{CB}}(T)$. Notice that in the special case when $\theta^* = 0$, this reduces to a standard multi-armed bandit (MAB) instance. The best possible regret rate is then given by $\widetilde{O}(\sqrt{KT})$ in the worst case, and we also have the instance-dependent rate $O\big(\sum_{a : \Delta_a > 0} \frac{\log T}{\Delta_a}\big)$, where $\Delta_a$ denotes the gap between the highest mean reward and the mean reward of arm $a$. Both of these are known to be information-theoretically optimal [25, 5]. On the other hand, the minimax-optimal rate for the linear contextual bandit (linear CB) problem is given by $\widetilde{O}(\sqrt{dT})$ [10, 1]. The following natural dichotomy in algorithm choice presents itself:

While the state-of-the-art for the linear CB problem achieves the minimax-optimal rate $\widetilde{O}(\sqrt{dT})$, it does not adapt automatically to the simpler MAB case. In particular, the regret will still scale with the dimension $d$ of the contexts, owing both to unnecessary exploration built into linear CB algorithms and to overfitting effects. This precludes achieving the minimax-optimal rate of $\widetilde{O}(\sqrt{KT})$ in the MAB setting, let alone the instance-dependent rate.

On the other hand, any state-of-the-art algorithm that is tailored to the MAB problem would not achieve any meaningful regret rate for the linear CB problem, simply because it does not incorporate contextual information into its decisions.
The simulations in [9] empirically illustrate this dichotomy and clearly motivate the model selection problem in its most ambitious form, i.e. Objective 1 as stated in Section 1. Objective 2 constitutes a weaker variant of the model selection problem that was proposed in [15, 14] and justified by the fact that it yields nontrivial model selection guarantees whenever the underlying class is learnable. While Objective 2 is in itself a desirable and nontrivial model selection guarantee, we note that it is strictly weaker than Objective 1. To see this, note that the objectives coincide for $\alpha = 1/2$, and since we require $d \leq T$ for sublinear regret in the first place, the rate $d^{\alpha} T^{1-\alpha}$ is a decreasing function of $\alpha$.
2.2 Meta-algorithm and prior instantiations
As mentioned in Section 1.2, the vast toolbox of corralling-type approaches does not achieve either Objective 1 or 2 for model selection. [9] and [15], which are concurrent to each other, are among the first approaches to tackle the model selection problem, and the only ones that achieve Objectives 1 and 2 respectively, albeit under additional strong assumptions. Both approaches use the same structure of a statistical test to distinguish between a simple (MAB) and a complex (CB) instance. This meta-approach is described in Algorithm 1. Here, $A_t$ denotes the arm that is pulled at round $t$ and, as is standard in the bandit literature, $\mathcal{F}_t$ denotes the relevant filtration at round $t$.
As our results also involve instantiating this meta-algorithm, we now discuss its main elements. The meta-algorithm begins by assuming that the problem is a simple (MAB) instance and primarily uses an optimal MAB algorithm for arm selection as its default choice in Algorithm 1. To address model selection, it uses both an exploration schedule and a misspecification test, both of which admit different instantiations. The exploration schedule governs the rate at which the algorithm chooses arms uniformly at random, which can be helpful for detecting misspecification. The misspecification test is simply a surrogate statistical test to check whether the instance is, in fact, a complex (CB) instance (i.e. $\theta^* \neq 0$). If the test detects misspecification, we immediately switch to an optimal linear CB algorithm for the remaining time steps.
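To make this structure concrete, here is a minimal sketch of the test-then-switch loop (not the paper's pseudocode: `mab_alg`, `lincb_alg`, `misspec_test`, and the exploration schedule `eps` are hypothetical interfaces standing in for the components just described):

```python
import random

def meta_algorithm(T, K, mab_alg, lincb_alg, misspec_test, eps, get_contexts, pull):
    """Sketch of the test-then-switch meta-structure (Algorithm 1 in spirit).

    Defaults to an optimal MAB base algorithm, forcibly explores at rate
    eps(t), and permanently switches to a linear CB base algorithm as soon
    as the misspecification test fires.
    """
    history = []        # (contexts, arm, reward) triples observed so far
    switched = False
    for t in range(1, T + 1):
        contexts = get_contexts(t)          # contexts observed before acting
        if switched:
            arm = lincb_alg.choose(contexts)
        elif random.random() < eps(t):
            arm = random.randrange(K)       # forced uniform exploration
        else:
            arm = mab_alg.choose()          # default: optimal MAB play
        reward = pull(t, arm)
        history.append((contexts, arm, reward))
        if switched:
            lincb_alg.update(contexts, arm, reward)
        else:
            mab_alg.update(arm, reward)
            if misspec_test(history):       # surrogate test for theta* != 0
                switched = True             # commit to linear CB henceforth
    return history
```

The key design point mirrored here is that the switch is one-way: once misspecification is detected, the algorithm commits to the linear CB base learner for all remaining rounds.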
Table 1: Instantiations of the meta-algorithm (Algorithm 1).
Algorithm | Estimator | Forced exploration parameter
OSOM [9] | Plug-in estimator | $\epsilon_t = 0$ (no extra exploration)
ModCB [15] | Fast estimator defined in [15] | $\epsilon$-greedy schedule
ModCB.U | Fast estimator defined in Algorithm 2 | $\epsilon$-greedy schedule
ModCB.A | Fast estimator defined in [15] | Data-adaptive (Algorithm 3)
While [9] and [15] both use the meta-algorithmic structure in Algorithm 1, they instantiate it with different choices of the misspecification test and the exploration schedule. The high-level details of where the approaches diverge are summarized in Table 1, and the results that they obtain are summarized in Table 2. We provide a brief description of the salient differences below.

Chatterji et al. [9] do not incorporate any forced exploration in their procedure, as evidenced by the choice of parameter $\epsilon_t = 0$ for all values of $t$ above. They also use the plug-in estimator of the linear model parameter to obtain an estimate of the gap in performance between the two model classes. The error rate of this plug-in estimator scales as $\sqrt{d/n}$ as a function of the number of samples $n$, and matches the putative regret bound for linear CB. Consequently, they achieve the optimal model selection rate of Objective 1, as well as the stronger instance-optimal rate in the case of MAB, but require a strong assumption of feature diversity for each arm; that is, they require $\lambda_{\min}(\Sigma_a) \geq \gamma > 0$ for all arms $a \in [K]$. Intuitively, feature diversity (previously used to show that the greedy algorithm can obtain performance competitive with LinUCB in contextual bandits, both for regular regret minimization and fairness objectives [7, 21, 34]) eliminates the need for forced exploration to successfully test for potential complex model structure.

Foster et al. [15] incorporate forced exploration of an $\epsilon$-greedy style by setting the forced exploration parameter $\epsilon_t$ to a positive, decaying value. This automatically precludes achieving the stronger Objective 1, but leaves the door open to achieving Objective 2 for some smaller choice of $\alpha$. To do this, they critically leverage fast estimators of the gap between the two model classes (the ideas for this fast estimation are rooted in approaches to quickly estimate the signal-to-noise ratio in high-dimensional linear models [37, 13, 22]), whose error rate can be verified to scale as $\sqrt{d}/n$ as a function of the number of samples $n$. In particular, this is significantly better in its dependence on $d$ than the standard plug-in estimator. Moreover, as a consequence of forced exploration, they do not require restrictive feature diversity assumptions on each arm; nevertheless, an arm-averaged feature diversity assumption is still required. Specifically, they assume that $\lambda_{\min}(\overline{\Sigma}) \geq \gamma > 0$, where $\overline{\Sigma} := \frac{1}{K} \sum_{a=1}^{K} \Sigma_a$, which is strictly weaker than the arm-specific condition of Chatterji et al. [9]. Essentially, $\overline{\Sigma}$ is exactly the covariance matrix of the mixed context obtained from uniform exploration, i.e. the context $x_{a,t}$ with the arm $a$ drawn uniformly at random from $[K]$.
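To illustrate the flavor of such fast estimation, the following sketch implements the classical U-statistic estimator of the signal strength $\theta^{\top} \Sigma\, \theta$ for a well-specified linear model with zero-mean covariates (in the spirit of the SNR-estimation literature cited above; the precise estimator of [15] differs in its details, e.g. in how the covariance is estimated):

```python
import numpy as np

def fast_signal_strength(X, y, Sigma_inv):
    """U-statistic estimate of theta^T Sigma theta for y = X @ theta + noise.

    For zero-mean covariates, E[y_i x_i] = Sigma theta, so the off-diagonal
    terms y_i y_j x_i^T Sigma^{-1} x_j have expectation theta^T Sigma theta.
    The error decays like sqrt(d)/n, better in its d-dependence than the
    d/n rate of the plug-in estimate theta_hat^T Sigma theta_hat.
    """
    n = len(y)
    s = X.T @ y                              # sum_i y_i x_i
    total = s @ Sigma_inv @ s                # includes the diagonal i == j terms
    # subtract sum_i y_i^2 x_i^T Sigma^{-1} x_i to keep only i != j pairs
    diag = np.einsum('ij,jk,ik,i->', X, Sigma_inv, X, y ** 2)
    return (total - diag) / (n * (n - 1))
```

The point of removing the diagonal terms is to avoid the additive noise bias that a naive quadratic plug-in would incur, which is what enables the faster rate.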
This discussion tells us that the initial attempts at model selection [9, 15] fall short both in their breadth of applicability and their ability to adapt to structure in the model selection problem. This naturally motivates the question of whether we can design new algorithms with two key properties:

Universality: Can we meet Objective 2 for some value of $\alpha > 0$ under stochastic contexts, but with no additional diversity assumptions?

Adaptivity: Can we meet Objective 1 under maximally favorable conditions (feature diversity for all arms), and Objective 2 otherwise?
3 Main results
We now introduce and analyze two new algorithms that provide a nearly complete answer to the questions of universality and adaptivity for the MAB-vs-linear CB problem.
3.1 Universal model selection under stochasticity
In this section, we present ModCB.U, a simple variant of ModCB [15] that achieves Objective 2 of model selection without requiring any feature diversity assumptions, arm-averaged or otherwise. Therefore, this constitutes a universal model selection algorithm between an MAB instance and a linear CB instance.
Our starting point is the approach to model selection in [15] described above in Section 2.2. Here, we recap the details of the fast estimator of the square-loss gap, which is given by $(\theta^*)^{\top}\, \overline{\Sigma}\, \theta^*$. The square-loss gap can be verified to be an upper bound on the expected gap in best-in-class performance between the CB and MAB models (see [15] for details on this upper bound), but it is also equal to $0$ if and only if $\theta^* = 0$ (provided $\overline{\Sigma}$ is full rank). Therefore, it is a suitable surrogate statistic to test for misspecification, as detailed in the meta-algorithmic structure of Algorithm 1. The estimator is described as a black-box procedure in Algorithm 2, with access to an estimate of the covariance matrix that is constructed from unlabeled samples. The estimate used by ModCB [15] is simply the sample covariance matrix at round $t$, defined by
(2)  $\widehat{\Sigma}_t := \frac{1}{tK} \sum_{s=1}^{t} \sum_{a=1}^{K} x_{a,s}\, x_{a,s}^{\top}.$
Note that such an estimator can be easily constructed, as we have access to all past contexts at any round $t$. This effective full-information access to contexts, in fact, forms the crux of both of our algorithmic ideas.
This approach is summarized in the subroutine Algorithm 2, which is instantiated in ModCB at any time step with the set of training examples given by the context–reward pairs on all designated exploration rounds. Above, the estimates of the arm means are constructed only from past exploration rounds (this is done to make sure that the estimates are unbiased). As a consequence of this choice, this instantiation of the subroutine uses, at any given time step, only the data collected on exploration rounds thus far.
A key bottleneck lies in the obtainable estimation error rate of this approach: while the leading dependence on the number of samples $n$ is given by $\sqrt{d}/n$ (which is at the heart of the rate that ModCB achieves), there is also an inverse dependence on the minimum eigenvalue of the arm-averaged covariance matrix $\overline{\Sigma}$, which we denote here by $\lambda_{\min}$. This dependence arises as a consequence of needing to estimate the inverse covariance matrix $\overline{\Sigma}^{-1}$ from unlabeled samples. In essence, this requires $\overline{\Sigma}$ to be well-conditioned, in the sense that we need $\lambda_{\min}$ to be a positive constant to ensure the desired model selection rate. This precludes nontrivial model selection rates from ModCB in cases where $\lambda_{\min}$ could itself decay with $d$, the dimension of the contexts, or $T$, the number of rounds. It also does not allow for cases in which $\overline{\Sigma}$ may be singular.
Our first main contribution is to adjust ModCB to successfully achieve Objective 2 in model selection with arbitrary stochastic, sub-Gaussian contexts. Because our algorithm achieves a universal model selection guarantee over all stochastic context distributions, we name it ModCB.U. The algorithmic procedure is identical to that of ModCB except for the choice of estimator for the inverse covariance matrix that is plugged into Algorithm 2. Our key observation is as follows: if certain directions are small in magnitude for the contexts corresponding to all arms (as will be the case when $\overline{\Sigma}$ has vanishingly small eigenvalues), then we may not actually want to try to estimate the square loss gap along them: ignoring them might be a better option. Our approach to ignoring low-value directions simply uses eigenvalue thresholding to construct an improved, biased estimate of the inverse covariance matrix. We formally define the eigenvalue thresholding operator below.
Definition 3.1.
Define the clipping operator $\mathrm{clip}_{\gamma}(x) := \max(x, \gamma)$. Then, for any symmetric matrix $M$ with diagonalization $M = U \Lambda U^{\top}$ and any value of $\gamma > 0$, we define the thresholding operator $\mathrm{threshold}_{\gamma}(M) := U\, \mathrm{clip}_{\gamma}(\Lambda)\, U^{\top}$, where the clipping is applied entrywise to the diagonal of $\Lambda$.
We use Definition 3.1 to specify our (biased) estimators of the covariance and inverse-covariance matrices. In particular, we let $\widehat{\Sigma}_t$ denote the sample covariance matrix constructed from unlabeled samples, given in Eq. (2). Then, our estimators are given by
(3)  $\widehat{\Sigma}_{t,\gamma} := \mathrm{threshold}_{\gamma}(\widehat{\Sigma}_t) \quad \text{and} \quad \widehat{\Sigma}_{t,\gamma}^{-1} := \big(\mathrm{threshold}_{\gamma}(\widehat{\Sigma}_t)\big)^{-1},$
and we simply plug the estimate $\widehat{\Sigma}_{t,\gamma}^{-1}$ into Algorithm 2. Note that $\mathrm{threshold}_{\gamma}(\widehat{\Sigma}_t)$ is always invertible for any $\gamma > 0$. In essence, this lets us set $\gamma$ as a tunable parameter to trade off the estimation error of a surrogate approximation to the square-loss gap (which will decrease in $\gamma$) and the approximation error that arises from ignoring all directions with eigenvalue less than $\gamma$ (which will increase in $\gamma$). As our first main result shows, we can set a value of $\gamma$ that scales with $d$ and $T$ and successfully achieve Objective 2 of model selection for any stochastic sub-Gaussian context distribution.
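A minimal numpy sketch of these estimators, under our reading of Definition 3.1 (eigenvalues below $\gamma$ are clipped up to $\gamma$, which guarantees invertibility; function and variable names are ours):

```python
import numpy as np

def threshold(M, gamma):
    """Eigenvalue thresholding of a symmetric matrix (Definition 3.1 in
    spirit): eigenvalues below gamma are clipped up to gamma, so the result
    is always invertible and its inverse has operator norm at most 1/gamma."""
    evals, evecs = np.linalg.eigh(M)
    return (evecs * np.maximum(evals, gamma)) @ evecs.T

def thresholded_inverse_covariance(contexts, gamma):
    """Sample covariance of (zero-mean) contexts, then thresholded inverse.

    contexts: array of shape (n, d) pooling the observed contexts across
    arms and rounds (they are fully observed, hence 'unlabeled' samples).
    Low-energy directions (eigenvalue < gamma) are effectively ignored:
    their weight in the inverse is capped at 1/gamma, trading a little
    approximation error for much better-controlled estimation error.
    """
    n, d = contexts.shape
    Sigma_hat = contexts.T @ contexts / n    # Eq. (2)-style sample covariance
    return np.linalg.inv(threshold(Sigma_hat, gamma))
```

Note that this remains well-defined even when the sample covariance is singular, which is exactly the regime where the unthresholded ModCB estimator breaks down.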
Table 2: Model selection guarantees and required context assumptions.
Algorithm | Obj. 1 (optimal rates) | Obj. 2 ($d^{\alpha} T^{1-\alpha}$ rates) | Context assumption
OSOM [9] | Yes | Yes | Arm-specific feature diversity
ModCB [15] | No | Yes | Arm-averaged feature diversity
ModCB.U | No | Yes | i.i.d. contexts only
Corral-style | No | No | i.i.d. contexts only
Theorem 1.
ModCB.U with an appropriate choice of the threshold $\gamma$ achieves model selection rates
(4) 
with probability at least .
Equation (4) clearly demonstrates model selection rates of the form required by Objective 2, and shows that Objective 2 can be met for some value of $\alpha > 0$ with the sole requirement of stochasticity of the contexts. Table 2 allows us to compare the achievable rate to both OSOM [9] and ModCB [15]; corralling approaches, which are assumption-free but meet neither Objective 1 nor 2, are also included as a benchmark. In particular, it is clear from a quick read of the table that as we go from OSOM to ModCB to our approach, the assumptions required on the context distributions weaken, as do the obtainable rates (recall that because the rate $d^{\alpha} T^{1-\alpha}$ decreases in $\alpha$, a guarantee with a larger value of $\alpha$ implies one with a smaller value of $\alpha$).
The proof of Theorem 1 is provided in Appendix A. In Appendix C, we describe how this procedure and result extend to the more complex case of linear contextual bandits under an additional assumption of block-diagonal structure on the covariance, where the blocks designate the dimensions of the model classes.
3.2 Data-adaptive algorithms for model selection
In this section, we introduce a new data-adaptive exploration schedule and show that it provably achieves Objective 1 under the strongest assumption of feature diversity for each arm (as in [9]), but also achieves Objective 2 under the weaker assumption of arm-averaged feature diversity (as in [15]). At a high level, our key insight is as follows: the arm-specific feature diversity condition that is critically used by [9] is itself testable from past contextual information; therefore, it can be tested before we decide on an arm and receive a reward.
To describe this idea more formally, we introduce some more notation. At each time step, we maintain the exploration set that we have built up thus far. Now, we use an inductive principle. Suppose that the contexts present in the exploration set are already sufficiently “diverse” in a certain quantitative sense (which we will specify shortly). Then, we can easily check whether the context of the arm that we would ideally pull when the true model is simple (ostensibly, the “greedy” arm) continues to preserve this property of diversity. Importantly, because we are able to observe the contexts before making a decision, we can check this condition before deciding which arm to pull.
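Concretely, the per-round check described above can be sketched as follows (a hedged sketch of our reading of the procedure: `diversity_ok` stands in for the quantitative diversity condition specified later, the Bernoulli coin is the ModCB-style forced-exploration schedule, and all names are ours):

```python
import random

def choose_arm(greedy_arm, contexts, explored_contexts, eps_t, K, diversity_ok):
    """One round of ModCB.A-style arm selection.

    Z: does appending the greedy arm's context keep the exploration
       set well-conditioned (the testable diversity condition)?
    B: the forced-exploration coin, with mean eps_t.
    The greedy arm is played if Z holds or the coin comes up zero; forced
    uniform exploration happens only when Z fails AND the coin fires.
    """
    Z = diversity_ok(explored_contexts, contexts[greedy_arm])
    B = random.random() < eps_t
    if Z or not B:
        return greedy_arm, Z, B          # exploit
    return random.randrange(K), Z, B     # forced uniform exploration
```

Under arm-specific feature diversity, `Z` holds essentially always, so the coin is never consulted and no forced exploration is paid for.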
This new subroutine for data-adaptive exploration, which we call ModCB.A, is described in Algorithm 3. We elaborate on the algorithm description along three critical verticals: a) the decision to forcibly explore, b) the choice of estimator, and c) the designated “exploration rounds” that are used for the estimator.
When to forcibly explore:
At each time step, ModCB.A uses two random variables to decide whether to stick with the “greedy” arm or to forcibly explore: the diversity indicator, which records whether the diversity condition continues to be met by the context of the greedy arm, and the forced-exploration coin of ModCB, a Bernoulli draw with mean $\epsilon_t$. If the diversity indicator equals one, we pick the greedy arm. On the other hand, if the diversity condition is not met, we revert to the forced exploration schedule used by ModCB: we pick an arm uniformly at random if the coin comes up one, and the greedy arm otherwise. In summary, we end up picking the greedy arm if the diversity condition holds or the coin comes up zero, while ModCB would have picked the greedy arm only if the coin comes up zero. As a result, our data-adaptive procedure, which we call ModCB.A, allows us to adapt on-the-fly to friendly feature diversity structure (and explore much less) while preserving more general guarantees.

The choice of estimator:
First, we specify the choice of estimator of the square loss gap from samples in the designated exploration set. (We will specify the procedure for constructing this exploration set shortly.) For convenience, we index the elements of the exploration set in ascending order. We also recall that $\widehat{\Sigma}_t$ denotes the sample covariance matrix as defined in Equation (2) (for technical reasons related to a random covariate shift induced by data-adaptive exploration, eigenvalue thresholding does not work quite as well here). Armed with this exploration set, we define our estimator in accordance with the subroutine in Algorithm 2, with the examples drawn from the exploration set. In particular, we estimate an adjusted square loss gap, given by
(5) 
Note that because the exploration set is random, the adjusted square loss gap is also random; nevertheless, it turns out that it is almost surely a good proxy for the true square loss gap. We estimate this adjusted square loss gap with the estimator given by
(6) 
where the remaining quantities are defined just as in Section 3.1.
How to build the exploration set:
To complete our description of ModCB.A, we specify the data-adaptive exploration set at each round that is used for the estimation subroutine. Notice from the pseudocode in Algorithm 3 that we do not include in the exploration set the rounds on which the diversity indicator and the forced-exploration coin take the same value. Interestingly, the two cases for which this happens are undesirable for two distinct reasons, as detailed below.

Rounds on which the diversity condition fails and the coin comes up zero constitute rounds on which there was no forced exploration and the context corresponding to the chosen (greedy) arm need not be well-conditioned: therefore, we do not want to include these samples for estimation.

Rounds on which the diversity condition holds and the coin comes up one are rounds on which the greedy arm is picked as a sole consequence of well-conditioning of its context. When this happens, the chosen context does induce good conditioning; however, its distribution is affected by the filtering process, inducing a bias that complicates estimating the square loss gap. To avoid these complexities, we filter out these rounds. Note that there is no bias when the diversity condition holds and the coin comes up zero, because the choice of the greedy arm can then be attributed to the coin and not to the context feature inducing adequate conditioning. (Interestingly, the proof of Theorem 2 highlights that the filtered-out rounds would have made a minimal difference to the overall sample complexity of estimation and the ensuing model selection rates.)
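Under our reading of the pseudocode, the two filtering rules combine into a simple criterion: a round enters the exploration set exactly when the diversity indicator and the forced-exploration coin disagree. A sketch (names are ours):

```python
def include_in_exploration_set(Z, B):
    """Whether round t's (context, reward) pair enters the exploration set.

    Z: diversity indicator for the greedy arm's context at round t.
    B: forced-exploration coin at round t.
    Included rounds:
      Z=0, B=1 -> forced uniform exploration (well-behaved by design);
      Z=1, B=0 -> greedy pull attributable to the coin, hence unbiased.
    Excluded rounds:
      Z=0, B=0 -> greedy pull with a possibly ill-conditioned context;
      Z=1, B=1 -> greedy pull caused by the diversity check itself,
                  whose filtering induces a covariate-shift bias.
    """
    return Z != B
```

This criterion is what lets ModCB.A populate its exploration set almost entirely from greedy rounds when arm-specific diversity holds, while still falling back on forced-exploration rounds otherwise.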
This completes the description of our adaptive algorithm, ModCB.A. Our second main result is that ModCB.A achieves the following data-adaptive model selection guarantee.
Theorem 2.
ModCB.A with an appropriate parameter choice achieves the following model selection rates, each with high probability:

If feature diversity holds for every arm with parameter $\gamma$, then
(7) 
If arm-averaged feature diversity is satisfied with parameter $\gamma$, then
(8)
The proof of Theorem 2 is provided in Appendix B. Observe that Equation (7) is identical to the OSOM rate, and Equation (8) is identical to the ModCB rate. Consequently, our data-adaptive exploration subroutine results in a single algorithm that achieves both rates under the requisite conditions. As summarized in Table 3, OSOM will not work even under arm-averaged feature diversity if arm-specific diversity does not hold. On the other hand, ModCB can be verified not to improve under the stronger condition of arm-specific feature diversity. In conclusion, we can think of ModCB.A as achieving the “best-of-both-worlds” model selection guarantee between the two approaches, by meeting Objective 1 under arm-specific feature diversity and Objective 2 otherwise.
Table 3: Rates achieved under each diversity condition.
Algorithm | Arm-specific diversity | Arm-averaged diversity
OSOM [9] | Objective 1 rates | None
ModCB [15] | Objective 2 rates | Objective 2 rates
ModCB.A | Objective 1 rates (Eq. (7)) | Objective 2 rates (Eq. (8))
4 Discussion and future work
In this paper, we introduced improved statistical estimation routines and exploration schedules that plug and play with model selection algorithms. These improvements advance the state of the art for model selection along the axes of universality and adaptivity (as defined at the end of Section 2). Our results are most complete for the MAB-vs-linear CB model selection problem. Appendix C presents some extensions to the more general problem of model selection among linear contextual bandits. Our work leaves several questions open, listed below:

Extending our eigenvalue-thresholding approach to the more general linear contextual bandit problem is challenging due to cross-correlation terms. Appendix C shows that the ideas can be extended when the covariance matrix of arm-averaged contexts has a certain block-diagonal structure; doing this more generally remains open.

The data-adaptive approach that we present here opens the door to studying instance-optimal model selection. While Theorem 2 represents an advance in this direction, we do not believe that the rate is yet instance-optimal. This remains an interesting open problem, both in terms of improved algorithms and of characterizing fundamental limits.

All approaches that meet either Objective 1 or Objective 2 remain heavily tailored to the linear setting: whether model selection is possible in general settings, as posed by [14], remains open.
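On the first open question above, the reason block-diagonal structure helps admits a quick numerical sanity check: eigenvalue thresholding is a spectral function, so when applied to a block-diagonal covariance it acts on each block independently and introduces no cross-block (cross-correlation) terms. A minimal sketch, assuming for illustration a hard-thresholding rule that zeroes eigenvalues below a level `gamma` (not necessarily the exact rule used in Appendix C):

```python
import numpy as np

def eig_threshold(m: np.ndarray, gamma: float) -> np.ndarray:
    """Zero out the eigenvalues of a symmetric PSD matrix that fall below gamma."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.where(vals >= gamma, vals, 0.0)
    return vecs @ np.diag(vals) @ vecs.T

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A @ A.T   # random PSD block
B = rng.standard_normal((2, 2)); B = B @ B.T   # random PSD block
M = np.block([[A, np.zeros((3, 2))],
              [np.zeros((2, 3)), B]])          # block-diagonal covariance

blockwise = np.block([[eig_threshold(A, 1.0), np.zeros((3, 2))],
                      [np.zeros((2, 3)), eig_threshold(B, 1.0)]])
# eig_threshold(M, 1.0) coincides with blockwise (up to floating-point error):
# thresholding the whole matrix never mixes the two blocks.
```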
Acknowledgements
VM acknowledges helpful initial discussions with Weihao Kong, and support from a Simons-Berkeley Research Fellowship for the program “Theory of Reinforcement Learning”. This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing.
References
[1] (2011) Improved algorithms for linear stochastic bandits. In Proceedings of the Advances in Neural Information Processing Systems.
[2] (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning.
[3] (2017) Corralling a band of bandit algorithms. In Proceedings of the Conference on Learning Theory.
[4] (2021) Corralling stochastic bandit algorithms. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
[5] (2009) Minimax policies for adversarial and stochastic bandits. In COLT, Vol. 7, pp. 1–122.
[6] (2002) Finite-time analysis of the multi-armed bandit problem. Machine Learning.
[7] (2021) Mostly exploration-free algorithms for contextual bandits. Management Science.
[8] (2004) Convex Optimization. Cambridge University Press.
[9] (2020) OSOM: a simultaneously optimal algorithm for multi-armed and linear contextual bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
[10] (2011) Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
[11] (2021) Dynamic balancing for model selection in bandits and RL. In Proceedings of the International Conference on Machine Learning.
[12] (2020) Upper confidence bounds for combining stochastic bandits. arXiv:2012.13115.
[13] (2014) Variance estimation in high-dimensional linear models. Biometrika.
[14] (2020) Open problem: model selection for contextual bandits. In Proceedings of the Conference on Learning Theory.
[15] (2019) Model selection for contextual bandits. In Proceedings of the Advances in Neural Information Processing Systems.
[16] (2018) Contextual bandits with surrogate losses: margin bounds and efficient algorithms. In Proceedings of the Advances in Neural Information Processing Systems.
[17] (2020) Beyond UCB: optimal and efficient contextual bandits with regression oracles. In Proceedings of the International Conference on Machine Learning.
[18] (2021) Problem-complexity adaptive model selection for stochastic linear bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
[19] (2021) Model selection for generic contextual bandits. arXiv:2107.03455.
[20] (2021) Smoothness-adaptive contextual bandits. Available at SSRN.
[21] (2018) A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Proceedings of the Advances in Neural Information Processing Systems.
[22] (2018) Estimating learnability in the sublinear data regime. In Proceedings of the Advances in Neural Information Processing Systems.
[23] (2019) Contextual bandits with continuous actions: smoothing, zooming, and adapting. In Proceedings of the Conference on Learning Theory.
[24] (2021) Optimal model selection in contextual bandits with many classes via offline oracles. arXiv:2106.06483.
[25] (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.
[26] (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Proceedings of the Advances in Neural Information Processing Systems.
[27] (2015) The Pareto regret frontier for bandits. In Proceedings of the Advances in Neural Information Processing Systems.
[28] (2021) Online model selection for reinforcement learning with function approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
[29] (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web.
[30] (2018) Adaptivity to smoothness in X-armed bandits. In Proceedings of the Conference on Learning Theory.
[31] (2020) Regret bound balancing and elimination for model selection in bandits and RL. arXiv:2012.13045.
[32] (2020) Model selection in contextual stochastic bandit problems. arXiv:2003.01704.
[33] (2021) Leveraging good representations in linear contextual bandits. arXiv:2104.03781.
[34] (2018) The externalities of exploration and how data diversity helps exploitation. In Proceedings of the Conference on Learning Theory.
[35] (2020) Bypassing the monster: a faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN.
[36] (2016) Efficient algorithms for adversarial contextual learning. In Proceedings of the International Conference on Machine Learning.
[37] (2018) Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli.
[38] (2019) High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
[39] (1979) A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association.
[40] (2021) Provably efficient representation learning in low-rank Markov decision processes. arXiv:2106.11935.
[41] (2021) Pareto optimal model selection in linear bandits. arXiv:2102.06593.
Appendix A Proof of Theorem 1
The following lemma (intended to replace Theorem 2 in Foster et al. [15]) characterizes how the estimation error of our thresholded estimator will depend on the choice of .
Lemma 1.
Suppose that we have labeled samples and unlabeled samples. Provided that , the estimator of Algorithm 2 guarantees that
(9)  
with probability at least .
This is similar to the bound in Foster et al. [15] (with slightly improved inverse dependences on the threshold, owing to the relative simplicity of the MAB-vs-linear CB setting), except that we do not assume any spectral conditions on and we incur an extra additive term of in the estimation error, arising from the bias induced by the thresholding operator. For comparison, the bound provided in Foster et al. [15] is for the choice
(10) 
but only holds if we have .
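For intuition on how accurate estimation is possible with few labeled samples, the labeled-sample component of such gap estimators is related to the signal-strength estimators of [22]: under an identity context covariance (an assumption made here purely for illustration), the U-statistic below is an unbiased estimate of the squared signal norm even when the number of labeled samples is smaller than the dimension. This is a minimal sketch, not the estimator of Algorithm 2.

```python
import numpy as np

def estimate_norm_sq(X: np.ndarray, y: np.ndarray) -> float:
    """U-statistic estimate of ||theta||^2 for y = X @ theta + noise,
    assuming isotropic (identity-covariance) contexts."""
    n = len(y)
    G = np.outer(y, y) * (X @ X.T)            # entries y_i * y_j * <x_i, x_j>
    # Average over distinct pairs i != j; the diagonal carries a noise bias.
    return (G.sum() - np.trace(G)) / (n * (n - 1))

rng = np.random.default_rng(0)
d, n = 50, 30                                 # fewer labeled samples than dimensions
theta = np.ones(d) / np.sqrt(d)               # ||theta||^2 = 1
X = rng.standard_normal((n, d))
y = X @ theta + 0.1 * rng.standard_normal(n)
est = estimate_norm_sq(X, y)                  # unbiased for 1, though high-variance
```

The off-diagonal restriction is the key design choice: it removes the additive noise-variance bias that a naive plug-in estimate would incur, which is what makes the sub-linear (n < d) regime workable.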
Before proving Lemma 1, we sketch how it leads to the statement provided in Theorem 1. We follow the outline that is given in Appendix C.2.4 of Foster et al. [15]. An examination of that proof, specialized to the case of model classes (in our case, MAB and linear CB), demonstrates that the dominant terms in the overall regret under the complex model (see, e.g. Eqs. (19), (20), (21) and (22) in Appendix C.2.4 of [15]) are given by
where denotes the set of designated exploration rounds. We set to be the forced exploration parameter (as defined in Algorithm 1), and specify a choice of subsequently. Just as in [15], we then have with probability at least . Plugging and into Lemma 1 then gives us
with probability at least . Note that the extra term comes from the estimation error due to the misspecification (bias) that we now incur. We now need to select the truncation amount and the exploration factor to minimize the above expression. One way to do this is to equate the third and fourth terms (ignoring universal constants and log factors). This gives us . Substituting this into the above gives us
and further substituting gives us
which clearly satisfies the form for the case .
Proof.
Before beginning the proof, we define a term called the truncated square-loss gap as below:
(11) 
We also recall that we defined and , where is the truncated second-moment estimate. The proof is carried out in three distinct steps:

Upper-bounding , the “variance” estimation error arising from samples.

Upper-bounding , the bias term with respect to the truncated square-loss gap.

Upper-bounding , the bias arising from truncation.
1. Upper-bounding . We note that . We consider the random vector
and show that it is sub-exponential with parameter . This follows because , where the second-to-last inequality follows from the definition of the truncation operator.
Thus, using the sub-exponential tail bound just as in [15, Lemma 17], we get
Now, we note that . Therefore, we apply the AM-GM inequality to deduce that
(12) 
2. Upper-bounding . We denote as shorthand. It is then easy to verify that and . Then, following an identical sequence of steps to [15], we get
We now state and prove the following lemma on operator norm control.
Lemma 2.
We have
(13) 
where we denote as shorthand.
Note that substituting Lemma 2 above directly gives us
(14) 
We will prove Lemma 2 at the end of this proof.
3. Upper-bounding . Observe that
This directly implies that
where the second inequality follows because we have assumed bounded signal, i.e. . It remains to control the operator norm terms above. We denote , and note that . Thus, we get
Putting these together gives us
(15) 
Thus, putting together Equations (12), (14) and (15) completes the proof. It only remains to prove Lemma 2, which we now do. ∎
Proof of Lemma 2.
First, recall that , so we wish to upper-bound the quantity . It is well known (see, e.g., [8]) that for any positive semidefinite matrix , the operator is a proximal operator with respect to the convex nuclear-norm functional. The nonexpansiveness of proximal operators then gives us
with probability at least . Here, the last step follows by standard arguments on the concentration of the empirical covariance matrix. Recall that we defined as shorthand.
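The nonexpansiveness invoked in this step can be checked numerically. The proximal operator of a scaled nuclear norm is singular-value soft-thresholding, which is 1-Lipschitz in the Frobenius norm; the sketch below illustrates this generic fact and is not the paper's exact operator.

```python
import numpy as np

def svt(m: np.ndarray, tau: float) -> np.ndarray:
    """Singular-value soft-thresholding: the prox of tau * (nuclear norm)."""
    u, s, vt = np.linalg.svd(m)
    return u @ np.diag(np.maximum(s - tau, 0.0)) @ vt

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
# Nonexpansiveness: ||prox(A) - prox(B)||_F <= ||A - B||_F.
gap_out = np.linalg.norm(svt(A, 0.5) - svt(B, 0.5), "fro")
gap_in = np.linalg.norm(A - B, "fro")
```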
Appendix B Proof of Theorem 2
Proof.
Meta-analysis:
We begin the analysis by providing a common lemma for both cases that will characterize a high-probability regret bound as a functional of two random quantities: a) , the number of designated exploration rounds that we use for fast estimation, and b) , the total number of forced-exploration rounds. Here, we define
It is easy to verify that by definition, we have . Indeed, recall from the pseudocode in Algorithm 3 that we defined
We first state our guarantee on estimation error. For any , we define
(16) 
Lemma 3.
For every , we have
(17) 
with probability at least , where is the adjusted square-loss gap given by
and was defined in Equation (5).
Proof.
This proof essentially constitutes a martingale adaptation of the proof of fast estimation in [15]. Let denote the random stopping time at which exploration samples have been collected. Moreover, let denote the (again random) times at which exploration samples were collected, and denote the corresponding actions that were taken. Then, we define a time-averaged covariance matrix as
for every value of . We state the following technical lemma, which is proved in Appendix D and critically uses the fact that the rounds on which is picked as a sole consequence of well-conditioning of the context (i.e., if and ) are filtered out of the considered exploration set