The contextual bandit (CB) problem [39, 26, 29] is a foundational paradigm for online decision-making. In this problem, the decision-maker chooses one of a finite set of actions as a function of contextual information that is available in advance, where this function is chosen from a fixed policy class and is typically learned from past outcomes. Most work on CB has centered around designing algorithms that minimize regret with respect to the best such function in hindsight; a particular non-triviality lies in being able to do this in a computationally efficient, scalable manner [2, 36, 16]. When the rewards are realizable under the chosen policy class, this is now an essentially solved problem [17, 35].
A complementary problem to regret minimization with respect to a fixed policy class is choosing the policy class that is best for the problem at hand. This constitutes a data-driven model selection problem, and its importance is paramount in CB, as selecting a policy class that either underfits or overfits the data leads to highly suboptimal accrued rewards. To see why, it is instructive to consider the simplest non-trivial instance of such model selection, which involves deciding whether to use the contexts (with a fixed policy class, say, of -dimensional linear functions) or simply run a multi-armed bandit (MAB) algorithm. Making this choice a priori is suboptimal one way or another: if we choose a MAB algorithm, we obtain the optimal regret to the best fixed action, but the latter may be highly suboptimal if the rewards depend on the context. On the other hand, if we choose an off-the-shelf linear CB algorithm, we incur regret even when the rewards do not depend on the contexts as a consequence of overfitting — this is highly suboptimal compared to the guarantee that we could have obtained had we known about the simpler structure. This example demonstrates that model selection must be data-driven. In essence, we wish to design a single algorithm that achieves the best-of-both-worlds, by retaining the linear CB regret guarantee while adapting on-the-fly to hidden simpler structure if it exists.
Focusing on the MAB-vs-linear setting, the strongest variant of the model selection objective asks:
Objective 1. Can we design a single algorithm that simultaneously achieves the respective minimax-optimal rates of under simple MAB structure (when it exists) and under -dimensional linear CB structure?
A related but weaker objective is also proposed in a COLT 2020 open problem (Foster et al. ask a general form of this question in a statistical learning setup, which specializes to the MAB-vs-linear objective stated here):
Objective 2. Can we design a single algorithm that simultaneously achieves the rate under simple MAB structure and under linear CB structure, for some ?
The importance of the -rate was highlighted by Foster et al. , as it verifies that model selection is possible whenever the underlying class is learnable, which is the case if and only if .
Even in the simplest instance of model selection between a multi-armed bandit and a linear contextual bandit, achieving either Objective 1 or Objective 2 under minimal assumptions remains open. Existing approaches that address either the stronger Objective 1 or the weaker Objective 2 make restrictive assumptions regarding the conditioning (what we will call diversity) of the contexts. These assumptions are of varying strength: the approach targeting Objective 1 assumes that the context corresponding to each action is well-conditioned, while the approach targeting Objective 2 only requires well-conditioning of the distribution of the average of the contexts across actions; see Sections 1.2 and 2.2 for more details on the distinctions between the two approaches. Other, more data-agnostic approaches [3, 32, 31, 28] achieve neither of the above objectives. This leads us to ask whether we can design a universal model selection approach that is data-agnostic (other than requiring a probability model on the contexts) and achieves either Objective 1 or 2.
Another important question is the adaptivity of approaches to situations in which model selection is especially tractable. At the heart of effective data-driven model selection is a meta-exploration-vs-exploitation tradeoff: while we need to exploit the currently believed simpler model structure, we also need to explore sufficiently to discover potential complex model structure. Almost all approaches to model selection incorporate forced exploration of an ε-greedy type to navigate this tradeoff; however, such exploration may not always be needed. Indeed, Chatterji et al. use no forced exploration in their approach and thereby achieve the optimal guarantee of Objective 1; however, their approach only works under restrictive diversity assumptions. It is natural to ask whether we can design data-adaptive exploration schedules that employ forced exploration only when it is absolutely needed, thus recovering the strongest possible guarantee (Objective 1) under favorable situations and a weaker-but-still-desirable guarantee (Objective 2) otherwise.
1.1 Our contributions
From the above discussion, it is clear that algorithm design for model selection that satisfies the criteria posed in  involves two non-trivial components: a) designing an efficient statistical test to distinguish between simple and complex model structure that works under minimal assumptions, and b) designing an exploration schedule to ensure that sufficiently expressive data is collected for a) to succeed. In this paper, we advance the state-of-the-art for both of these components in the following ways:
We design a new test based on eigenvalue thresholding that works for all stochastic sub-Gaussian contexts; in contrast to prior test-based approaches, it does not require any type of context diversity. We critically utilize the fact that “low-energy” directions can be thresholded and ignored to estimate the gap in error between model classes. See Theorem 1 for our new model selection guarantee, which only requires stochasticity on the contexts in order to meet Objective 2.
We also design a data-adaptive exploration schedule that performs forced exploration only when necessary. This approach meets Objective 2 under the action-averaged feature diversity assumption made by Foster et al., and also meets the stronger Objective 1 under the stronger assumption that the feature of each action is diverse (made by Chatterji et al.). See Theorem 2 for a precise statement of our new adaptive guarantee on model selection.
Taken together, our results advance our understanding of model selection for contextual bandits, by demonstrating how statistical approaches can yield universal (i.e., nearly assumption-free) and adaptive guarantees.
1.2 Related work
While model selection is a central and classical topic in machine learning and statistics, most results primarily apply to supervised offline and full-information online learning. Only recently has attention turned to model selection in online partial information settings including contextual bandits. Here, we focus on this growing body of work, which we organize based on overarching algorithmic principles.
The first line of work constitutes a hierarchical learning scheme where a meta-algorithm uses bandit techniques to compete with many (contextual) bandit algorithms running as base learners. One of the first such approaches is the Corral algorithm of Agarwal et al. , which uses online mirror descent with the log-barrier regularizer as the meta-algorithm. Subsequent work focuses on adapting Corral to the stochastic setting [31, 28] and developing UCB-style meta-algorithms [12, 4]. These approaches are quite general and can often be used with abstract non-linear function classes. However, they do not seem able to meet either of Objectives 1 or 2 in general. In our setting, these approaches yield the tuple of rates , which clearly cannot be expressed in the form for any value of . Consequently, these approaches still leave the problem of model selection as described in  open.
The second line of algorithmic approaches involves constructing statistical tests for model misspecification. This approach was initially used in the context of model selection concurrently by Foster et al.  and Chatterji et al. , who focus on the linear setting. At a high level, these papers develop efficient misspecification tests under certain covariate assumptions and use these tests to obtain -style model selection guarantees. In particular, Foster et al.  use a “sublinear” square loss estimator under somewhat mild covariate assumptions to obtain regret, while Chatterji et al.  obtain regret under stronger covariate assumptions. As these two works are the foundation for our results, we discuss these papers in detail in the sequel.
Several recent papers extend the statistical testing approach in various ways. Ghosh et al.  estimate the support of the parameter vector, which fundamentally incurs a dependence on the magnitude of the smallest non-zero coefficient. Beyond the linear setting, Cutkosky et al.  use the “putative” regret bound for each model class directly to test for misspecification, while Ghosh et al. , Krishnamurthy and Athey  consider general function classes with realizability. While these latter approaches are more general than ours, they cannot be directly used to obtain our results. Indeed, central to our results (and those of Foster et al. ) is the fact that our statistical test provides a fast rate for detecting misspecification; this guarantee is quantitatively better than what is provided by the putative regret bound, requires carefully adjusting the exploration schedule, and is not available for general function classes.
Other related work.
We briefly mention two peripherally related lines of work. The first is on representation selection in bandits and reinforcement learning [33, 40], which involves identifying a feature mapping with favorable properties from a small class of candidate mappings. While this is somewhat reminiscent of the model selection problem, the main differences are that in representation selection all mappings are of the same dimensionality and realizable, and the goal is to achieve much faster regret rates by leveraging additional structural properties of the “good” representation.
The second line of work is on Pareto optimality in non-contextual bandits and related problems. Beginning with the result of Lattimore , these results show that certain non-uniform regret guarantees are not achievable in various bandit settings. For example,  shows that, in -armed bandit problems, one cannot simultaneously achieve regret to one specific arm, while guaranteeing regret to the rest. Such results have been extended to both linear and Lipschitz non-contextual bandits [41, 30, 23] as well as Lipschitz contextual bandits under margin assumptions , and they establish that model selection is not possible in these settings. However, these ideas have not been extended to standard contextual bandit settings, which is our focus.
We use boldface to denote vectors and matrices (e.g. to denote a vector, and to denote a scalar). For any value of , denotes the finite set . We use to denote the -norm of a vector, and to denote the operator norm of a matrix. We use to denote the Euclidean inner product between vectors and . We use big-Oh notation in the main text; hides dependences on the number of actions , and denotes a bound that hides a factor and holds with probability at least .
2.1 The bandit-vs-contextual bandit problem
The simplest instance of model selection involves a -dimensional linear contextual bandit problem with possibly hidden multi-armed bandit structure. This model was proposed as an initial point of study in . Concretely, actions (which we henceforth call arms) are available to the decision-maker at every round, and denotes the total number of rounds. At round , the reward of each arm is given by
where denotes the bias of arm , denotes the -dimensional context corresponding to arm at round , and denotes random noise. Finally, denotes an unknown parameter. We make the following standard assumptions on the problem parameters.
The biases are assumed to be bounded between and .
The unknown parameter is assumed to be bounded, i.e. .
The contexts corresponding to each arm are assumed to be iid across rounds , and -sub-Gaussian. We denote by the covariance matrix of the context , and additionally note that as a consequence of the -sub-Gaussian assumption. Without loss of generality, we assume that for each arm the mean of the context is equal to the zero vector (any non-zero mean can be absorbed into the bias of the arm).
The noise is iid across arms and rounds , centered, and -sub-Gaussian.
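To make the reward model concrete, the following Python sketch simulates a single round under the assumptions listed above. It is an illustration only: the function and argument names (`simulate_round`, `biases`, `theta`, `context_cov`, `noise_std`) are hypothetical, and Gaussian contexts and noise are used as one convenient sub-Gaussian choice rather than anything the setup requires. Setting `theta` to the zero vector recovers a plain multi-armed bandit instance.

```python
import numpy as np

def simulate_round(biases, theta, context_cov, noise_std, rng):
    """Draw one round of contexts and rewards for all arms.

    Assumed form: reward_i = bias_i + <context_i, theta> + noise_i,
    with zero-mean Gaussian contexts and noise (one sub-Gaussian choice).
    """
    num_arms, dim = len(biases), len(theta)
    # One zero-mean context per arm, drawn iid with the given covariance.
    contexts = rng.multivariate_normal(np.zeros(dim), context_cov, size=num_arms)
    # Independent centered noise for every arm.
    noise = rng.normal(0.0, noise_std, size=num_arms)
    rewards = biases + contexts @ theta + noise
    return contexts, rewards
```

With `theta = 0` and no noise, the rewards coincide with the arm biases, which is exactly the hidden MAB structure that model selection tries to detect.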
We denote the achieved pseudo-regret with respect to the best fixed arm (the standard metric for a MAB problem) by , and the achieved pseudo-regret with respect to the best policy under a -dimensional model (the standard metric for a linear CB problem) by . Notice that in the special case when , this reduces to a standard multi-armed bandit (MAB) instance. The best possible regret rate is then given by in the worst case, and we also have the instance-dependent rate , where . Both of these are known to be information-theoretically optimal [25, 5]. On the other hand, the minimax-optimal rate for the linear contextual bandit (linear CB) problem is given by [10, 1]. The following natural dichotomy in algorithm choice presents itself:
While the state-of-the-art for the linear CB problem achieves the minimax-optimal rate , it does not adapt automatically to the simpler MAB case. In particular, the regret will still scale with the dimension of the contexts owing both to unnecessary exploration built into linear CB algorithms and overfitting effects. This precludes achieving the minimax-optimal rate of in the MAB setting, let alone the instance-dependent rate.
On the other hand, any state-of-the-art algorithm that is tailored to the MAB problem would not achieve any meaningful regret rate for the linear CB problem, simply because it does not incorporate contextual information into its decisions.
The simulations in  empirically illustrate this dichotomy and clearly motivate the model selection problem in its most ambitious form, i.e. Objective 1 as stated in Section 1. Objective 2 constitutes a weaker variant of the model selection problem that was proposed in [15, 14] and justified by the fact that it yields non-trivial model selection guarantees whenever the underlying class is learnable. While Objective 2 is in itself a desirable and non-trivial model selection guarantee, we note that it is strictly weaker than Objective 1. To see this, note that the objectives coincide for , and since we require for sublinear regret in the first place, the rate is a decreasing function in .
2.2 Meta-algorithm and prior instantiations
As mentioned in Section 1.2, the vast toolbox of corralling-type approaches does not achieve either Objective 1 or 2 for model selection. The approaches of Chatterji et al. and Foster et al., which are concurrent to each other, are among the first to tackle the model selection problem and the only ones that achieve Objectives 1 and 2 respectively, but under additional strong assumptions. Both approaches use the same structure of a statistical test to distinguish between a simple (MAB) and complex (CB) instance. This meta-approach is described in Algorithm 1. Here, denotes the arm that is pulled at round , and as is standard in bandit literature, is the relevant filtration at round .
As our results also involve instantiating this meta-algorithm, we now discuss its main elements. The meta-algorithm begins by assuming that the problem is a simple (MAB) instance and primarily uses an optimal MAB algorithm for arm selection: this default choice is denoted by in Algorithm 1. To address model selection, it uses both an exploration schedule and a misspecification test, both of which admit different instantiations. The exploration schedule governs a rate at which the algorithm should choose arms uniformly at random, which can be helpful for detecting misspecification. The misspecification test is simply a surrogate statistical test to check if the instance is, in fact, a complex (CB) instance (i.e. ). If the test detects misspecification, we immediately switch to an optimal linear CB algorithm for the remaining time steps.
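The control flow described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: every callable (`mab_choose`, `cb_choose`, `explore_prob`, `misspecified`, `observe`) is a hypothetical placeholder, and the updates of the base learners with observed rewards are omitted for brevity.

```python
import numpy as np

def run_meta_algorithm(num_rounds, num_arms, mab_choose, cb_choose,
                       explore_prob, misspecified, observe, rng):
    """Sketch of the test-then-switch loop (all callables are hypothetical).

    mab_choose(t) / cb_choose(t, contexts): arm picked by the simple / complex learner
    explore_prob(t): forced-exploration probability at round t
    misspecified(explore_data): True if the test flags contextual structure
    observe(t): returns (contexts, reward_fn) for round t
    """
    use_cb = False
    explore_data = []   # (context, reward) pairs from forced-exploration rounds
    arms = []
    for t in range(1, num_rounds + 1):
        contexts, reward_fn = observe(t)
        if use_cb:
            arm = cb_choose(t, contexts)
        elif rng.random() < explore_prob(t):
            arm = int(rng.integers(num_arms))   # uniform forced exploration
            explore_data.append((contexts[arm], reward_fn(arm)))
        else:
            arm = mab_choose(t)                 # default: trust the MAB learner
        arms.append(arm)
        # Once misspecification is detected, switch to the CB learner for good.
        if not use_cb and misspecified(explore_data):
            use_cb = True
    return arms, use_cb
```

The switch is one-directional by design: the loop starts from the simple model and only ever moves to the complex one, mirroring the description of the meta-algorithm.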
| Algorithm | Estimator | Forced exploration parameter |
|---|---|---|
| OSOM | Plug-in estimator | (no extra exploration) |
| ModCB | Fast estimator defined in | |
| ModCB.U | Fast estimator defined in Algorithm 2 | |
| ModCB.A | Fast estimator defined in | Algorithm 3 |
While Chatterji et al. and Foster et al. both use the meta-algorithmic structure in Algorithm 1, they instantiate it with different choices of misspecification test and exploration schedule. The high-level details of where the approaches diverge are summarized in Table 1, and the results that they obtain are summarized in Table 2. We provide a brief description of the salient differences below.
Chatterji et al.  do not incorporate any forced exploration in their procedure, as evidenced by the choice of parameter above for all values of . They also use the plug-in estimator of the linear model parameter to obtain an estimate of the gap in performance between the two model classes. The error rate of this plug-in estimator scales as as a function of the number of samples , and matches the putative regret bound for linear CB. Consequently, they achieve the optimal model selection rate of Objective 1, as well as the stronger instance-optimal rate in the case of MAB, but require a strong assumption of feature diversity for each arm; that is, they require for all . Intuitively, feature diversity eliminates the need for forced exploration to successfully test for potential complex model structure. (Such feature diversity was previously used to show that the greedy algorithm can obtain competitive performance with LinUCB in contextual bandits, both for regular regret minimization and fairness objectives [7, 21, 34].)
Foster et al.  incorporate forced exploration of an ε-greedy style by setting the forced exploration parameter . This automatically precludes achieving the stronger Objective 1, but leaves the door open to achieving Objective 2 for some smaller choice of . To do this, they critically leverage fast estimators of the gap between the two model classes, whose error rate can be verified to scale as as a function of the number of samples ; the ideas for this fast estimation are rooted in approaches to quickly estimate the signal-to-noise ratio in high-dimensional linear models [37, 13, 22]. In particular, this is significantly better in its dependence on than the standard plug-in estimator. Moreover, as a consequence of forced exploration, they do not require restrictive feature diversity assumptions on each arm; nevertheless, an arm-averaged feature diversity assumption is still required. Specifically, they assume that where , which is strictly weaker than the arm-specific condition of Chatterji et al. . Essentially, is exactly the covariance matrix of the mixed context obtained from uniform exploration, which we denote by where .
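A small numerical example helps show why the arm-averaged condition is weaker than the arm-specific one. The two-arm, two-dimensional instance below is hypothetical and chosen purely for illustration: each arm's contexts vary along a single coordinate, so every per-arm covariance is singular, yet the covariance of the context of a uniformly chosen arm (the average of the per-arm covariances, since the contexts are zero-mean) is well-conditioned.

```python
import numpy as np

# Hypothetical two-arm, two-dimensional example: each arm's contexts vary
# along a single coordinate, so each per-arm covariance is singular.
cov_arm_1 = np.diag([1.0, 0.0])
cov_arm_2 = np.diag([0.0, 1.0])

# For zero-mean contexts, a uniformly chosen arm has covariance equal to
# the average of the per-arm covariances.
cov_averaged = (cov_arm_1 + cov_arm_2) / 2

# Minimum eigenvalues: zero for each individual arm, 0.5 for the average.
min_eig_per_arm = [np.linalg.eigvalsh(c).min() for c in (cov_arm_1, cov_arm_2)]
min_eig_averaged = np.linalg.eigvalsh(cov_averaged).min()
```

In this instance the arm-specific diversity condition fails for both arms, while the arm-averaged condition holds with minimum eigenvalue 0.5.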
This discussion tells us that the initial attempts at model selection [9, 15] fall short both in their breadth of applicability and their ability to adapt to structure in the model selection problem. This naturally motivates the question of whether we can design new algorithms with two key properties:
Universality: Can we meet Objective 2 for some value of under stochastic contexts but with no additional diversity assumptions?
Adaptivity: Can we meet Objective 1 under maximally favorable conditions (feature diversity for all arms), and Objective 2 otherwise?
3 Main results
We now introduce and analyze two new algorithms that provide a nearly complete answer to the problems of universality and adaptivity for the MAB-vs-linear CB problem.
3.1 Universal model selection under stochasticity
In this section, we present ModCB.U, a simple variant of ModCB  that achieves Objective 2 of model selection without requiring any feature diversity assumptions, arm-averaged or otherwise. Therefore, this constitutes a universal model selection algorithm between an MAB instance and a linear CB instance.
Our starting point is the approach to model selection in  described above in Section 2.2. Here, we recap the details of the fast estimator of the square-loss-gap, which is given by . The square-loss-gap can be verified to be an upper bound on the expected gap of the best-in-class performance between the CB and MAB models (see  for details on this upper bound), but is also equal to iff (and is full rank). Therefore, it is a suitable surrogate statistic to test for misspecification, as detailed in the meta-algorithmic structure of Algorithm 1. The estimator is denoted by , and is described as a black-box procedure in Algorithm 2 with access to an estimator of the covariance matrix that is constructed from unlabeled samples. The estimator that is used by ModCB  is simply the sample covariance matrix at round , defined by
Note that such an estimator can be easily constructed as we have access to all past contexts at any round . This effective full-information access to contexts, in fact, forms the crux of both of our algorithmic ideas.
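Because the contexts of every arm are observed in every round regardless of which arm is pulled, a covariance estimate can be formed from all of them. The sketch below assumes the estimate pools the outer products of the contexts of all arms over all past rounds and divides by their count; the exact normalization used in Eq. (2) is not reproduced here, and the function name `pooled_second_moment` is hypothetical.

```python
import numpy as np

def pooled_second_moment(context_history):
    """Average outer product over all arms and rounds seen so far.

    context_history: array of shape (rounds, arms, dim), assumed zero-mean,
    containing the contexts of every arm (pulled or not).
    """
    rounds, arms, dim = context_history.shape
    # Stack all observed contexts into one (rounds * arms, dim) matrix.
    flat = context_history.reshape(rounds * arms, dim)
    # Average of x x^T over all stacked contexts.
    return flat.T @ flat / (rounds * arms)
```

No rewards enter this computation, which is why the estimate is said to be built from unlabeled samples.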
This approach is summarized in the sub-routine Algorithm 2, which is instantiated in ModCB for any time step with
That is, the set of training examples used is the set of context-reward pairs on all designated exploration rounds. Above, constitutes the estimate of the sample means constructed only from past exploration rounds (this is done to make sure that the estimates are unbiased). As a consequence of this choice, we note that for this instantiation of the sub-routine Algorithm 2, we have and at any given time step .
A key bottleneck lies in the obtainable estimation error rate of : while the leading dependence is given by (which is at the heart of the rate that ModCB achieves), there is also an inverse dependence on the minimum eigenvalue of the arm-averaged covariance matrix , which we denote here by . This dependence arises as a consequence of needing to estimate the inverse covariance matrix from unlabeled samples. In essence, this requires to be well-conditioned, in the sense that we need to be a positive constant to ensure the model selection rate of . This precludes non-trivial model selection rates from ModCB for cases where could itself decay with , the dimension of the contexts, or , the number of rounds. It also does not allow for cases in which may be singular.
Our first main contribution is to adjust ModCB to successfully achieve Objective 2 in model selection with arbitrary stochastic, sub-Gaussian contexts. Because our algorithm achieves a universal model selection guarantee over all stochastic context distributions, we name it ModCB.U. The algorithmic procedure is identical to that of ModCB except for the choice of estimator for the inverse covariance matrix, , that is plugged into Algorithm 2. Our key observation is as follows: if certain directions are small in magnitude for the contexts corresponding to all arms (as will be the case when has vanishingly small eigenvalues), then we may not actually want to estimate the square loss gap along them, and ignoring them might be a better option. Our approach to ignoring low-value directions simply uses eigenvalue thresholding to construct an improved biased estimate of the inverse covariance matrix . We formally define the eigenvalue thresholding operator below.
Definition 3.1 (Eigenvalue thresholding). Define the clipping operator . Then, for any matrix with diagonalization and any value of , we define the thresholding operator
We use Definition 3.1 to specify our (biased) estimators of the covariance and inverse-covariance matrices . In particular, we let denote the sample covariance matrix of from unlabeled samples, given in Eq. (2). Then, our estimators are given by
and we simply plug the estimate into Algorithm 2. Note that is always invertible for any . In essence, this lets us set as a tunable parameter to trade off the estimation error of a surrogate approximation to the square-loss gap (which will decrease in ) against the approximation error that arises from ignoring all directions with value less than (which will increase in ). As our first main result shows, we can set a value of that scales with and and successfully achieve Objective 2 of model selection for any stochastic sub-Gaussian context distribution.
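The thresholding step can be sketched in a few lines of NumPy. This sketch assumes that the clipping operator raises every eigenvalue below the threshold up to the threshold, i.e. it maps x to max(x, gamma), which is consistent with the remark that the thresholded matrix is always invertible; the displayed definition is not reproduced here, and the function names and the symbol `gamma` are hypothetical.

```python
import numpy as np

def threshold_eigenvalues(matrix, gamma):
    """Raise every eigenvalue of a symmetric matrix to at least gamma (> 0).

    Assumption: clipping is x -> max(x, gamma), applied to the spectrum.
    """
    eigvals, eigvecs = np.linalg.eigh(matrix)
    clipped = np.maximum(eigvals, gamma)
    return eigvecs @ np.diag(clipped) @ eigvecs.T

def thresholded_inverse(sample_cov, gamma):
    """Inverse of the thresholded matrix; its eigenvalues are at most 1/gamma."""
    eigvals, eigvecs = np.linalg.eigh(sample_cov)
    clipped = np.maximum(eigvals, gamma)
    return eigvecs @ np.diag(1.0 / clipped) @ eigvecs.T
```

For instance, applying `thresholded_inverse` to the singular matrix `np.diag([1.0, 0.0])` with `gamma = 0.1` yields a finite matrix whose largest eigenvalue is 10, whereas the ordinary inverse does not exist. A larger `gamma` gives a better-conditioned but more biased estimate, which is the tradeoff described above.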
and second moment matrix estimate (which can be constructed from unlabeled samples).
| Algorithm | Obj. 1 (optimal rates) | Obj. 2 ( rates) | Context assumption |
|---|---|---|---|
| OSOM | Yes | Yes | |
| ModCB.U | No | Yes | iid contexts only |
| Corral-style | No | No | iid contexts only |
Theorem 1. ModCB.U with achieves model selection rates
with probability at least .
Equation (4) clearly demonstrates model selection rates of the form required from Objective 2, and shows that Objective 2 can be met for some value of with the sole requirement of stochasticity on the contexts. Table 2 allows us to compare the achievable rate to both OSOM  and ModCB ; corralling approaches, which are assumption-free but meet neither Objective 1 nor Objective 2, are also included as a benchmark. In particular, the table shows that as we go from OSOM to ModCB to our approach, the assumptions required on the context distributions weaken, as do the obtainable rates (recall that because the rate decreases in , a guarantee with a larger value of implies one with a smaller value of ).
The proof of Theorem 1 is provided in Appendix A. In Appendix C, we describe how this procedure and result extends to the more complex case of linear contextual bandits under an additional assumption of block-diagonal structure on the covariance, where the blocks designate the dimensions of the model classes.
3.2 Data-adaptive algorithms for model selection
In this section, we introduce a new data-adaptive exploration schedule and show that it provably achieves Objective 1 under the strongest assumption of feature diversity for each arm (as in Chatterji et al.), but also achieves Objective 2 under the weaker assumption of arm-averaged feature diversity (as in Foster et al.). At a high level, our key insight is as follows: the arm-specific feature diversity condition that is critically used by Chatterji et al. is itself testable from past contextual information; therefore, it can be tested for before we decide on an arm and receive a reward.
To describe this idea more formally, we introduce some more notation. At time step , we denote the exploration set that we have built up thus far by . Now, we use an inductive principle. Suppose that the contexts that are present in the exploration set are already sufficiently “diverse” in a certain quantitative sense (that we will specify shortly). Then, we can easily check whether the arm that we would ideally pull when the true model is simple, i.e. (ostensibly, the “greedy” arm), continues to preserve this property of diversity. Importantly, because we are able to observe the contexts before making a decision, we can check this condition before deciding on the value of .
This new sub-routine for data-adaptive exploration, which we call ModCB.A, is described in Algorithm 3. We elaborate on the algorithm description along three critical verticals: a) the decision to forcibly explore, b) the choice of estimator, and c) the designated “exploration rounds” that are used for the estimator.
When to forcibly explore:
At time step , ModCB.A uses the random variables and to decide whether to stick with the “greedy” arm , or to forcibly explore, i.e. . For time step , the random variable denotes the indicator that the diversity condition continues to be met by context . This means that if , we will pick . On the other hand, if the diversity condition is not met (i.e. ), we revert to the forced exploration schedule used by ModCB. This schedule sets a variable , and selects if and otherwise. In summary, we end up picking if or , while ModCB would have picked only if . As a result, our data-adaptive procedure allows us to adapt on-the-fly to friendly feature diversity structure (and explore much less) while preserving more general guarantees.
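The decision rule summarized above can be sketched as follows. This is one reading of the text rather than a reproduction of Algorithm 3: the greedy arm is kept whenever the diversity check passes or the forced-exploration coin comes up tails, and otherwise an arm is drawn uniformly at random. The diversity check itself is not reproduced, so `diversity_ok` is a hypothetical boolean input, and the function and argument names are likewise illustrative.

```python
import numpy as np

def choose_arm(greedy_arm, diversity_ok, explore_prob, num_arms, rng):
    """Return (arm, forced) for one round of the data-adaptive schedule.

    diversity_ok: hypothetical boolean result of the diversity check on the
    greedy arm's context (the check itself is not reproduced here).
    """
    coin = rng.random() < explore_prob        # forced-exploration coin flip
    if diversity_ok or not coin:
        return greedy_arm, False              # keep the greedy arm
    return int(rng.integers(num_arms)), True  # explore uniformly at random
```

Compared with a schedule that explores whenever the coin comes up heads, this rule explores on a subset of those rounds, which is the sense in which the procedure explores less when the contexts are already diverse.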
The choice of estimator :
First, we specify the choice of estimator of the square loss gap, from samples in the designated exploration set . (We will specify the procedure for construction of this exploration set shortly.) For convenience, we index the elements of the exploration set in ascending order by . We also recall that denotes the sample covariance matrix as defined in Equation (2). (For technical reasons related to a random covariate shift induced by data-adaptive exploration, eigenvalue thresholding does not work quite as well here.) Armed with this exploration set, we define our estimator of in accordance with the sub-routine in Algorithm 2 with the examples from the exploration set, i.e. . In particular, we estimate an adjusted square loss gap, given by
Note that because is random, the adjusted square loss gap is also random; nevertheless, it turns out that it is almost surely a good proxy for the true square loss gap . We estimate this adjusted squared loss gap with the estimator that is given by
where are defined just as in Section 3.1.
How to build the exploration set :
To complete our description of ModCB.A, we specify the data-adaptive exploration set at round that is used for the estimation subroutine. Notice from the pseudocode in Algorithm 3 that we did not include the rounds for which in the exploration set. Interestingly, the two cases for which this happens are undesirable for two distinct reasons, as detailed below.
Rounds on which and constitute rounds on which there was no forced exploration and the context corresponding to arm need not be well-conditioned: therefore, we do not want to include these samples for estimation.
Rounds on which and are picked as a sole consequence of well-conditioning on the context . When this condition holds, induces good conditioning; however, its distribution is affected by the filtering process, inducing bias that complicates estimating the square loss gap. To avoid these complexities, we filter out these rounds. Note that there is no bias when both , because the choice can be attributed to and not to the context feature inducing adequate conditioning. (Interestingly, the proof of Theorem 2 highlights that these rounds would make a minimal difference to the overall sample complexity of estimation and ensuing model selection rates.)
This completes our description of our adaptive algorithm, ModCB.A. Our second main result is that ModCB.A achieves the following data-adaptive model selection guarantee.
Theorem 2. ModCB.A with parameter choice achieves the following model selection rates, each with probability at least :
If feature-diversity holds for every arm with parameter , then
If arm-averaged feature diversity is satisfied with parameter , then
The proof of Theorem 2 is provided in Appendix B. Observe that Equation (7) is identical to the OSOM rate, and Equation (8) is identical to the ModCB rate. Consequently, our data-adaptive exploration subroutine results in a single algorithm that achieves both rates under the requisite conditions. As summarized in Table 3, OSOM will not work even under arm-averaged feature diversity if arm-specific diversity does not hold. On the other hand, ModCB can be verified not to improve under the stronger condition of arm-specific feature diversity. In conclusion, we can think of ModCB.A as achieving the “best-of-both-worlds” model selection guarantee between the two approaches, by meeting Objective 1 under arm-specific feature diversity and Objective 2 otherwise.
|Algorithm||Arm-specific diversity||Arm-averaged diversity|
4 Discussion and future work
In this paper, we introduced improved statistical estimation routines and exploration schedules to plug-and-play with model selection algorithms. The result of these improvements is that we advance the state-of-the-art for model selection along the axes of universality and adaptivity (as defined at the end of Section 2). Our results are most complete for the model selection problem of MAB-vs-linear CB. Appendix C presents some extensions to the more general problem of model selection among linear contextual bandits. Our work does leave several remaining questions open, as listed below:
Extending our eigenvalue-thresholding approach to the more general linear contextual bandit problem is challenging due to cross-correlation terms. Appendix C shows that the ideas can be extended when the covariance matrix of arm-averaged contexts has a certain block-diagonal structure; however, doing this more generally remains open.
The data-adaptive approach that we present here cracks the door open to potentially studying instance-optimal model selection. While Theorem 2 represents an advance in this direction, we do not believe that the rate is yet instance-optimal. This remains an interesting open problem both in terms of improved algorithms and characterizing fundamental limits.
All approaches that meet either Objective 1 or 2 continue to be heavily tailored to the linear setting: whether model selection is possible in general settings, as posed by [14], remains open.
VM acknowledges helpful initial discussions with Weihao Kong, and support from a Simons-Berkeley Research Fellowship for the program “Theory of Reinforcement Learning”. This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing.
-  (2011) Improved algorithms for linear stochastic bandits. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: §2.1.
-  (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning, Cited by: §1.
-  (2017) Corralling a band of bandit algorithms. In Proceedings of the Conference on Learning Theory, Cited by: §1.2, §1.
-  Corralling stochastic bandit algorithms. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §1.2.
-  (2009) Minimax policies for adversarial and stochastic bandits. In COLT, Vol. 7, pp. 1–122. Cited by: §2.1.
-  (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning. Cited by: 6.
-  (2021) Mostly exploration-free algorithms for contextual bandits. Management Science. Cited by: footnote 4.
-  (2004) Convex optimization. Cambridge University Press. Cited by: Appendix A, Appendix C.
-  (2020) OSOM: a simultaneously optimal algorithm for multi-armed and linear contextual bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: Appendix B, Appendix B, 1st item, 2nd item, §1.2, §1, §1, item 1, item 2, §2.1, §2.1, §2.2, §2.2, Table 1, §3.1, §3.2, Table 2, Table 3, footnote 2.
-  (2011) Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §2.1, 7.
-  (2021) Dynamic balancing for model selection in bandits and RL. In Proceedings of the International Conference on Machine Learning, Cited by: §1.2.
-  (2020) Upper confidence bounds for combining stochastic bandits. arXiv:2012.13115. Cited by: §1.2.
-  (2014) Variance estimation in high-dimensional linear models. Biometrika. Cited by: footnote 5.
-  (2020) Open problem: Model selection for contextual bandits. In Proceedings of the Conference on Learning Theory, Cited by: §1.1, §1.2, §1, §1, §2.1, 4th item, footnote 1.
-  (2019) Model selection for contextual bandits. Proceedings of the Advances in Neural Information Processing Systems. Cited by: Appendix A, Appendix A, Appendix A, Appendix A, Appendix A, Appendix B, Appendix B, Appendix B, Appendix B, Appendix C, Appendix C, Appendix C, Appendix C, Appendix C, Appendix C, 1st item, 2nd item, §1.2, §1.2, §1, item 2, §2.1, §2.2, §2.2, Table 1, §3.1, §3.1, §3.1, §3.2, Table 2, Table 3, footnote 2.
-  (2018) Contextual bandits with surrogate losses: margin bounds and efficient algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: §1.
-  (2020) Beyond UCB: optimal and efficient contextual bandits with regression oracles. In Proceedings of the International Conference on Machine Learning, Cited by: §1.
-  (2021) Problem-complexity adaptive model selection for stochastic linear bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §1.2.
-  (2021) Model selection for generic contextual bandits. arXiv:2107.03455. Cited by: §1.2.
-  (2021) Smoothness-adaptive contextual bandits. Available at SSRN. Cited by: §1.2.
-  (2018) A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: footnote 4.
-  (2018) Estimating learnability in the sublinear data regime. Proceedings of the Advances in Neural Information Processing Systems. Cited by: footnote 5.
-  (2019) Contextual bandits with continuous actions: smoothing, zooming, and adapting. In Proceedings of the Conference on Learning Theory, Cited by: §1.2.
-  (2021) Optimal model selection in contextual bandits with many classes via offline oracles. arXiv:2106.06483. Cited by: §1.2.
-  (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics. Cited by: §2.1.
-  The epoch-greedy algorithm for multi-armed bandits with side information. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: §1.
-  (2015) The Pareto regret frontier for bandits. In Proceedings of the Advances in Neural Information Processing Systems, Cited by: §1.2.
-  (2021) Online model selection for reinforcement learning with function approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Cited by: §1.2, §1.
-  (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web, Cited by: §1.
-  (2018) Adaptivity to smoothness in X-armed bandits. In Proceedings of the Conference on Learning Theory, Cited by: §1.2.
-  (2020) Regret bound balancing and elimination for model selection in bandits and RL. arXiv:2012.13045. Cited by: §1.2, §1.
-  (2020) Model selection in contextual stochastic bandit problems. arXiv:2003.01704. Cited by: §1.
-  (2021) Leveraging good representations in linear contextual bandits. arXiv:2104.03781. Cited by: §1.2.
-  (2018) The externalities of exploration and how data diversity helps exploitation. In Proceedings of the Conference on Learning Theory, Cited by: footnote 4.
-  (2020) Bypassing the monster: a faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN. Cited by: §1.
-  (2016) Efficient algorithms for adversarial contextual learning. In Proceedings of the International Conference on Machine Learning, Cited by: §1.
-  (2018) Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli. Cited by: footnote 5.
-  (2019) High-dimensional statistics: a non-asymptotic viewpoint. Cambridge University Press. Cited by: Appendix B.
-  (1979) A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association. Cited by: §1.
-  Provably efficient representation learning in low-rank Markov decision processes. arXiv:2106.11935. Cited by: §1.2.
-  (2021) Pareto optimal model selection in linear bandits. arXiv:2102.06593. Cited by: §1.2.
Appendix A Proof of Theorem 1
The following lemma (intended to replace Theorem 2 in Foster et al. [15]) characterizes how the estimation error of our thresholded estimator depends on the choice of .
Lemma 1. Suppose that we have labeled samples and unlabeled samples. Provided that , the estimator provided in Algorithm 2 guarantees that
with probability at least .
This is similar to the bound in Foster et al. [15] (with slightly improved inverse dependences on the threshold due to the relative simplicity of the MAB-vs-linear CB setting), except that we do not assume any spectral conditions on and we incur an extra additive term of in the estimation error arising from the bias induced by the thresholding operator. For comparison, the bound provided in Foster et al. [15] is for the choice
but only holds if we have .
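To make the role of the threshold concrete, the following Python sketch shows a generic plug-in estimator of this type. It is an illustration under stated assumptions, not a transcription of Algorithm 2: we assume that thresholding raises every eigenvalue of the empirical second moment matrix to at least `gamma`, that the target is the quadratic form E[xy]ᵀ Σ⁻¹ E[xy] in a centered linear model, and that pairs of labeled samples are combined as a U-statistic. The names `clipped_inverse` and `estimate_gap`, as well as all sample sizes, are hypothetical.

```python
import numpy as np


def clipped_inverse(second_moment, gamma):
    """Invert a symmetric PSD matrix after raising its eigenvalues to >= gamma.

    Assumption: this is one plausible reading of the thresholding step.
    """
    evals, evecs = np.linalg.eigh(second_moment)
    return (evecs / np.maximum(evals, gamma)) @ evecs.T


def estimate_gap(x_labeled, y_labeled, x_unlabeled, gamma):
    """U-statistic estimate of E[xy]^T Sigma^{-1} E[xy].

    Unlabeled samples estimate the second moment matrix; labeled samples
    enter only through pairs s != t, which removes the diagonal bias.
    """
    n = x_labeled.shape[0]
    sigma_hat = x_unlabeled.T @ x_unlabeled / x_unlabeled.shape[0]
    m_inv = clipped_inverse(sigma_hat, gamma)
    a = x_labeled * y_labeled[:, None]            # rows are y_s * x_s
    total = a.sum(axis=0)
    all_pairs = total @ m_inv @ total             # sum over all (s, t)
    diagonal = np.einsum("ij,jk,ik->", a, m_inv, a)  # terms with s == t
    return (all_pairs - diagonal) / (n * (n - 1))


# Hypothetical simulation: isotropic Gaussian features, unit-norm signal.
rng = np.random.default_rng(0)
d, n_labeled, n_unlabeled, gamma = 5, 2000, 5000, 0.1
beta = np.ones(d) / np.sqrt(d)
x_lab = rng.normal(size=(n_labeled, d))
y_lab = x_lab @ beta + 0.1 * rng.normal(size=n_labeled)
x_unl = rng.normal(size=(n_unlabeled, d))

true_gap = float(beta @ beta)  # equals beta^T Sigma beta with Sigma = I
estimate = float(estimate_gap(x_lab, y_lab, x_unl, gamma))
print(f"true gap = {true_gap:.3f}, estimate = {estimate:.3f}")
```

In this simulation the second moment matrix is well conditioned, so the threshold is inactive and the estimate should be close to the true value. When some eigenvalues fall below `gamma`, the clipped inverse remains bounded by `1 / gamma` in operator norm, at the price of a bias that grows with `gamma`.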
Before proving Lemma 1, we sketch how it leads to the statement provided in Theorem 1. We follow the outline that is given in Appendix C.2.4 of Foster et al. [15]. An examination of that proof, specialized to the case of model classes (in our case, MAB and linear CB), demonstrates that the dominant terms in the overall regret under the complex model (see, e.g., Eqs. (19), (20), (21) and (22) in Appendix C.2.4 of [15]) are given by
where denotes the set of designated exploration rounds. We set to be the forced exploration parameter (as defined in Algorithm 1), and specify a choice of subsequently. Just as in [15], we then have with probability at least . Plugging and into Lemma 1 then gives us
with probability at least . Note that the extra term comes from the estimation error due to misspecification (bias) that we now incur. We now need to select the truncation amount and the exploration factor to minimize the above expression. One way of doing this is given by equating the third and fourth terms (ignoring universal constants and log factors). This gives us . Substituting this into the above gives us
and further substituting gives us
which clearly satisfies the form for the case .
Before beginning the proof, we define a term called the truncated square-loss gap as below:
Recall also that we defined and , where is the truncated second moment estimate. The proof is carried out in three distinct steps:
Upper-bounding , the “variance” estimation error arising from samples.
Upper-bounding , the bias-term with respect to the truncated squared loss gap.
Upper-bounding , the bias arising from truncation.
1. Upper-bounding . We note that . We consider the random vector
and show that it is sub-exponential with parameter . This follows because , where the second-last inequality follows by the definition of the truncation operator.
Thus, using the sub-exponential tail bound just as in [15, Lemma 17], we get
Now, we note that . Therefore, we apply the AM-GM inequality to deduce that
2. Upper-bounding . We denote as shorthand. It is then easy to verify that and . Then, following an identical sequence of steps to , we get
We now state the following lemma on operator norm control.
where we denote as shorthand.
Note that substituting Lemma 2 above directly gives us
We will prove Lemma 2 at the end of this proof.
3. Upper-bounding . Observe that
This directly implies that
where the second inequality follows because we have assumed bounded signal, i.e. . It remains to control the operator norm terms above. We denote , and note that . Thus, we get
Putting these together gives us
Proof of Lemma 2.
First, recall that , and so we really want to upper bound the quantity . It is well known (see, e.g., [8]) that for any positive semidefinite matrix , the operator is a proximal operator with respect to the convex nuclear norm functional. The non-expansiveness of proximal operators then gives us
with probability at least . Here, the last step follows by standard arguments on the concentration of the empirical covariance matrix. Recall that we defined as shorthand.
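The non-expansiveness argument can be checked numerically. The Python sketch below assumes, as one reading consistent with the proximal-operator remark, that the truncation raises every eigenvalue of a positive semidefinite matrix to at least `gamma`; under that assumption the truncated matrix equals `gamma * I` plus the soft-thresholded matrix, and soft-thresholding of eigenvalues is the proximal operator of `gamma` times the nuclear norm on positive semidefinite inputs. The function names and the rank-deficient test data are hypothetical, and the non-expansiveness check is carried out in Frobenius norm.

```python
import numpy as np


def clip_eigenvalues(sigma, gamma):
    """Raise every eigenvalue of a symmetric PSD matrix to at least gamma."""
    evals, evecs = np.linalg.eigh(sigma)
    return (evecs * np.maximum(evals, gamma)) @ evecs.T


def soft_threshold(sigma, gamma):
    """Prox of gamma * nuclear norm at a PSD matrix: eig -> max(eig - gamma, 0)."""
    evals, evecs = np.linalg.eigh(sigma)
    return (evecs * np.maximum(evals - gamma, 0.0)) @ evecs.T


def second_moment(x):
    """Empirical second moment matrix of the rows of x."""
    return x.T @ x / x.shape[0]


# Hypothetical data: features confined to a 3-dimensional subspace of R^6,
# so the second moment matrices are singular (no spectral condition holds).
rng = np.random.default_rng(0)
d, gamma = 6, 0.3
basis = rng.normal(size=(3, d))
sigma_a = second_moment(rng.normal(size=(200, 3)) @ basis)
sigma_b = second_moment(rng.normal(size=(200, 3)) @ basis)

clipped_a = clip_eigenvalues(sigma_a, gamma)
clipped_b = clip_eigenvalues(sigma_b, gamma)

# Clipping from below equals a shift by gamma * I plus soft-thresholding.
identity_ok = np.allclose(
    clipped_a, gamma * np.eye(d) + soft_threshold(sigma_a, gamma)
)
# The inverse of the clipped matrix has operator norm at most 1 / gamma.
inv_norm = np.linalg.norm(np.linalg.inv(clipped_a), 2)
# The bias introduced by clipping has operator norm at most gamma.
bias_norm = np.linalg.norm(clipped_a - sigma_a, 2)
# Non-expansiveness in Frobenius norm.
lhs = np.linalg.norm(clipped_a - clipped_b)
rhs = np.linalg.norm(sigma_a - sigma_b)

print(identity_ok, inv_norm <= 1 / gamma + 1e-8, bias_norm <= gamma + 1e-8)
print(f"||T(A) - T(B)||_F = {lhs:.4f} <= ||A - B||_F = {rhs:.4f}")
```

The shift by `gamma * I` cancels in the difference of two truncated matrices, which is why controlling the truncated estimate reduces to controlling the deviation of the empirical second moment matrix itself.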
Appendix B Proof of Theorem 2
We begin the analysis by providing a common lemma for both cases that characterizes a high-probability regret bound as a functional of two random quantities: a) , the number of designated exploration rounds that we use for fast estimation, and b) , the total number of forced-exploration rounds. Here, we define
It is easy to verify that by definition, we have . Indeed, recall from the pseudocode in Algorithm 3 that we defined
We first state our guarantee on estimation error. For any , we define
For every , we have
with probability at least , where is the adjusted square loss gap given by
and was defined in Equation (5).
This proof essentially constitutes a martingale adaptation of the proof of fast estimation in . Let denote the random stopping time at which exploration samples have been collected. Moreover, let denote the (again random) times at which exploration samples were collected, and denote the corresponding actions that were taken. Then, we define a time-averaged covariance matrix as
for every value of . We state the following technical lemma, which is proved in Appendix D and critically uses the fact that the rounds on which is picked as a sole consequence of well-conditioning of the context (i.e. if and ) are filtered out of the considered exploration set