
Episodic Bandits with Stochastic Experts

We study a version of the contextual bandit problem where an agent is given soft control of a node in a graph-structured environment through a set of stochastic expert policies. The agent interacts with the environment over episodes, with each episode having a different context distribution; this results in the 'best expert' changing across episodes. Our goal is to develop an agent that tracks the best expert over episodes. We introduce the Empirical Divergence-based UCB (ED-UCB) algorithm for this setting, where the agent has no knowledge of the expert policies or of the changes in context distributions. Under mild assumptions, we show that bootstrapping from Õ(N log(NT²√E)) samples results in a regret of Õ(E(N+1) + N√E/T²). If the expert policies are known to the agent a priori, then the regret improves to Õ(EN) without requiring any bootstrapping. Our analysis also tightens pre-existing logarithmic regret bounds to a problem-dependent constant in the non-episodic setting when expert policies are known. Finally, we validate our findings empirically through simulations.



1 Introduction

Recommendation systems for suggesting items to users are commonplace in online services such as marketplaces, content delivery platforms and ad placement systems. Such systems, over time, learn from user feedback, and improve their recommendations. An important caveat, however, is that both the distribution of user types and their respective preferences change over time, thus inducing changes in the optimal recommendation and requiring the system to periodically “reset” its learning.

In this paper, we consider systems with known change-points (aka episodes) in the distribution of user-types and preferences. Examples include seasonality in product recommendations, where there are marked changes in interests based on time-of-year, or ad-placements based on time-of-day. While a baseline strategy would be to re-learn the recommendation algorithm in each episode, it is often advantageous to share some learning across episodes. Specifically, one often has access to a (potentially very) large number of pre-trained recommendation algorithms (aka experts), and the goal then is to quickly determine (in an online manner) which expert is best suited to a specific episode. Crucially, the relationship among these experts can be learned over time, meaning that given samples of (recommended action, reward) from the deployment of one expert, we can infer what the reward would have been if some other expert had been used. Such learned "transfer" across experts uses data from all the deployed experts over past episodes and extracts invariant relationships holding across episodes; the data collected in each episode, alongside this learned transfer, then permits one to quickly determine the episode-dependent best expert.

To motivate the above episodic setting, consider online advertising agencies: companies with proprietary ad-recommendation algorithms that place ads for other product companies on newspaper websites based on past campaigns. In each campaign, the agency places ads for a specific product of the client (e.g., a flagship car or a gaming console) in order to maximize the click-through rate of users on the newspaper website. At any given time, the agency signs contracts for new campaigns with new companies. The information about product features and user profiles forms the context, whose distribution changes across campaigns due to changes in user traffic and updated product line-ups; this can also cause shifts in user preferences. In practice, the agency already has a finite inventory of ad-recommendation models (aka experts; typically logistic models, for the microsecond-scale inference delays mandated by real-time user traffic) from past campaigns. On a new campaign, online ad agencies bid for slots in news media outlets depending on the profile of the user that visits the website, using these pre-learned experts (see [22, 31]). In this setting, agencies only re-learn which expert in their inventory works best (and possibly fine-tune it) for their new campaign. Our work models this episodic setup, albeit without fine-tuning of experts between campaigns.

1.1 Main contributions

We formulate this problem as an Episodic Bandit with Stochastic Experts. Here, an agent interacts with an environment through a set of N experts over E episodes. Each expert is characterized by a fixed and unknown conditional distribution (policy) over actions given the observed context. At the start of each episode, the context distribution as well as the distribution of rewards changes, and both remain fixed over the length of the episode. At each time step, the agent observes the context, chooses one of the N experts, and plays the recommended action to receive a reward. Note that the expert policies themselves remain invariant across all episodes.

The goal of the agent is to track the episode-dependent best expert in order to maximize the cumulative sum of rewards. Here, the best expert in a given episode is the one that generates the maximum mean reward, averaged over the randomness in contexts, recommendations and rewards. Due to the stochastic nature of the experts, we can use Importance Sampling (IS) estimators to share reward information across experts: a sample collected under one expert also carries information about the others.
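As a concrete, hedged illustration of this sharing (a toy instance of our own construction; `clipped_is_estimate` and all numbers below are illustrative, not from the paper), one expert's mean reward can be estimated from samples logged under another expert by re-weighting each reward with a clipped likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup for a single fixed context: 3 actions, two stochastic experts.
# pi_b is the behavior expert (whose samples we observe); pi_t is the target.
pi_b = np.array([0.5, 0.3, 0.2])
pi_t = np.array([0.2, 0.3, 0.5])
true_means = np.array([0.2, 0.5, 0.8])  # mean reward of each action

def clipped_is_estimate(actions, rewards, pi_t, pi_b, clip):
    """Estimate the target expert's mean reward from behavior samples.

    Each sample is re-weighted by the likelihood ratio pi_t/pi_b; ratios
    are clipped at `clip`, trading a small bias for reduced variance.
    """
    ratios = np.minimum(pi_t[actions] / pi_b[actions], clip)
    return float(np.mean(ratios * rewards))

# Log 20,000 rounds under the behavior expert, then estimate the target.
actions = rng.choice(3, size=20_000, p=pi_b)
rewards = rng.binomial(1, true_means[actions])
est = clipped_is_estimate(actions, rewards, pi_t, pi_b, clip=5.0)
target_mean = float(pi_t @ true_means)  # = 0.59
```

With the clip level above the largest true ratio (here 0.5/0.2 = 2.5), the estimate is unbiased; lowering the clip below that point would introduce bias but shrink the variance.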

Our main contributions are as follows:

1. Empirical Divergence-Based Upper Confidence Bound (ED-UCB) Algorithm:

We develop the ED-UCB algorithm (Algorithm 1) for the episodic bandit problem with stochastic experts. Similar in spirit to the D-UCB algorithm in [25], ED-UCB employs a clipped IS estimator to predict the reward of each expert based on the estimated expert policies, allowing samples collected under a particular expert to be used to estimate the behavior of the remainder by appropriate scaling and clipping. In the single-episode setting, we show that, with high probability, ED-UCB with approximate oracles for the expert policies provides constant average cumulative regret, where the constant does not scale with the duration of interaction.

Specifically, for N experts, if the policies are well approximated with high probability, then with the same probability ED-UCB incurs a regret bounded by a constant that does not scale with the duration of play T. Our analysis also improves the existing regret bound for D-UCB, which promises logarithmic regret in the full-information case: we show that this can be tightened to a problem-dependent constant that holds with probability 1.

2. Episodic behavior with bootstrapping:

We also specify the construction of the approximate experts used by ED-UCB in the case where the context and action spaces are finite. We show that if the agent is bootstrapped with Õ(N log(NT²√E)) samples per expert, then running ED-UCB over E episodes, each of length T, provides regret bounded as Õ(E(N+1) + N√E/T²), where the dominant term does not scale with T. In the full-information setting (as assumed by D-UCB), the regret improves to Õ(EN). Naive algorithms such as UCB in [2] (or KL-UCB in [7]) suffer a regret of order O(EN log T), over which we improve order-wise in terms of T, demonstrating the merits of sharing information among experts. We also describe how our methods can easily be extended to continuous context spaces.

3. Empirical evaluation:

We validate our findings empirically through simulations on the MovieLens 1M dataset [16]. We randomly split users into contexts, pick a selection of movies as actions, and generate random experts for recommendation. By varying the context distribution in each episode, we compare the performance of ED-UCB with naive optimistic algorithms, which it outperforms heavily, and with D-UCB, to which its performance is comparable.

1.2 Related work

Adapting to changing environments forms the basis of meta-learning [27, 5], where agents learn to perform well on new tasks that appear in phases but share underlying similarities with tasks seen in the past. Our approach can be viewed as an instance of meta-learning for bandits, where we are presented with a different environment in each episode, with similarities across episodes. Here, the objective is to act to achieve the maximum possible reward through bandit feedback, while also using past observations (including offline data, if present). This setting is studied in [4], where a finite hypothesis space maps actions to rewards, with each phase having its own true hypothesis. The authors propose a UCB-based algorithm that learns the hypothesis space across phases, while quickly identifying the true hypothesis in each phase using the current knowledge. Similarly, linear bandits whose instances share a common unknown but sparse support are studied in [29]. In [9, 17], meta-learning is viewed from a Bayesian perspective, where in each phase an instance is drawn from a common unknown meta-prior. In particular, [9] studies meta-linear bandits and provides regret guarantees for regularized ridge regression, whereas [17] uses Thompson sampling for general problems, with Bayesian regret bounds for K-armed bandits.

Learning in a fixed contextual environment with bandit feedback, where the rewards of arm-context pairs share a latent structure, is known as the contextual bandit problem ([3, 11, 6, 18, 13, 1, 26, 12], among several others), where actions are taken with respect to a context revealed in each round. In various works [1, 26, 15, 14], a hypothesis space is assumed to capture the mapping of arm-context pairs to rewards, either exactly (the realizable setting) or approximately (non-realizable), and bandit feedback is used to find the true hypothesis, which provides the greedy optimal action, while adding enough exploration to aid learning.

Importance Sampling (IS) is used to transfer knowledge about random quantities under a known target distribution using samples from a known behavior distribution, notably in the context of off-policy evaluation in reinforcement learning. Further, clipping is a common method to control the high variance of IS estimates by introducing a controlled amount of bias. In the case of best-arm identification, these methods were studied in [10, 19, 24]. Finally, bootstrapping has been used in [30] to leverage offline supervised data to accelerate the online learning process.

Meta-learning algorithms take a model-based approach, where the invariant structure (the hypothesis space in [4], or the meta-prior in [9, 17]) is first learnt in order to make optimal decisions, while most contextual bandit algorithms are policy-based, trying to learn the optimal mapping by imposing structure on the policy space. Our approach falls in the latter category of optimizing over policies (aka experts) from a given finite set. However, contrary to the commonly assumed deterministic policies, each policy in our setting is a fixed distribution over arms conditioned on the context, which is learnt by bootstrapping from offline data. Using the estimated experts, in each episode (where both the per-context arm rewards and the context distribution change), we quickly learn the average rewards of the experts by collectively using samples from all of them. In [25], a single episode of our setting is considered for the case where the policies and context distributions are known to the agent; it therefore does not capture episodic learning. Instead, we build on the importance sampling (IS) approach therein and propose empirical IS, learning the expert policies via bootstrapping from offline data and adapting to changing reward and context distributions online. Furthermore, we tighten the single-episode regret from logarithmic in the episode length to constant.

2 Problem setup

We follow the setting in [25], where an agent acts on a contextual environment with a finite set of contexts, a finite set of actions and rewards. The agent is provided with a set of N experts, where each expert is characterized by a conditional distribution, or policy, over actions for each context. At each time t, the agent receives a context and picks an expert. The action is then sampled from the chosen expert's policy for the observed context, after which the agent receives a reward. The agent can use the historical observations and the new context to inform its decision at time t. This setting can be viewed as a Directed Acyclic Graph over the context, action and reward nodes, where the agent is given soft control of the action node through the experts.

We assume that the experiment proceeds in E episodes. In each episode, both the distribution over the context set and the distribution of rewards may change. Further, we assume that the agent is not provided with knowledge of the expert policies, the context distributions or the reward distributions.

The goal of the agent is to remain competitive with the best expert over all the episodes. Specifically, in each episode, the mean reward of an expert is its expected reward under the joint distribution of contexts, the expert's recommendations and rewards for that episode. The best expert in an episode is the one with the largest such mean. Note that the best expert in each episode need not be the same, due to the episode-dependent context and reward distributions. The agent seeks to minimize the cumulative regret across the E episodes, each with T steps: writing μ*_e for the best mean in episode e and μ_{e,k} for the mean of expert k, the expected regret is Σ_{e=1}^{E} Σ_{t=1}^{T} (μ*_e − μ_{e,k_t}), where k_t is the expert played at step t of episode e.

Possible approaches and performance:

A baseline approach for this model is to apply the Upper Confidence Bound (UCB) algorithm of [2] (or equivalently, the KL-UCB algorithm of [7]) in each episode, treating the experts as the arms of a standard multi-armed bandit problem. This approach is valid here since the mean rewards are averaged over contexts, and it provides a regret upper bound of O(N log T) per episode. D-UCB in [25], which assumes access to the expert policies and context distributions, uses clipped IS and Median-of-Means estimates with specially constructed divergence metrics in order to share samples across experts. Under some assumptions, for a single episode of length T, D-UCB provides a logarithmic regret upper bound, and its worst-case upper bound matches that of the naive UCB algorithm.
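For concreteness, the per-episode baseline can be sketched as a standard UCB1 loop that treats the experts as independent arms (a hedged toy sketch; the arm means and the exploration constant below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def ucb1(means, T):
    """Run UCB1 on arms with the given Bernoulli means for T rounds."""
    n = len(means)
    counts = np.zeros(n)
    sums = np.zeros(n)
    for t in range(T):
        if t < n:
            k = t  # play each arm once to initialize
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            k = int(np.argmax(sums / counts + bonus))
        r = rng.binomial(1, means[k])
        counts[k] += 1
        sums[k] += r
    return counts

# Treating 3 experts as arms: UCB1 must relearn these means in every episode.
counts = ucb1(np.array([0.3, 0.5, 0.8]), T=5_000)
```

Since no information is shared between arms or across episodes, each episode pays the full O(N log T) exploration cost, which is exactly what ED-UCB's shared IS estimates avoid.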

The remainder of the paper is organized as follows. In Section 3, we show that access to approximate oracles leads to high-probability constant regret upper bounds in the single-episode setting. Our analysis can also be extended to tighten the worst-case bound of D-UCB to a problem-dependent constant that does not scale with T. In Section 4, for the episodic case, we show that these approximate oracles can be constructed using samples from the true experts. We characterize the regret both in the case where the agent is allowed to bootstrap from these samples and when the sampling is performed online. Since our exposition involves heavy notation, we consolidate all notation, with descriptions and definitions, in the Appendix for reference.

2.1 Assumptions

Before we develop our methods, we make the following assumptions:

Assumption 1.

The minimum probability of any context occurring in any episode is bounded below by a positive constant.

Assumption 2.

The minimum probability of any expert setting the value of the action node under any context is bounded below by a positive constant.

Assumption 3.

The minimum reward obtained by any expert in any episode is bounded below by a positive constant.

Remarks on Assumptions:
Assumption 1 ensures that the divergence metrics between arms can be computed reliably. Assumption 2 guarantees that arbitrary experts, under any context, are absolutely continuous with respect to each other. These assumptions are critical to the use of Importance Sampling for estimating the arm means. The latter assumption is also made implicitly in the analysis of D-UCB, which assumes bounded divergences; the former, however, is avoided there by assuming full knowledge of the context distribution.

Assumption 3 is standard and is also made for D-UCB. It controls the multiplicative constants in the overall regret bound.

3 The single episode setting

In this section, we develop the Empirical Divergence-based UCB (ED-UCB) algorithm for the single-episode case. To ease notation, we drop all episode-related subscripts in this section. Inspired by the clipped IS estimator of [25], ED-UCB first uses estimates of the true expert policies to build estimates of the expert means, which are biased due to the uncertainty in the environment. These estimators are then appropriately inflated to serve as overestimates of the true means, making ED-UCB an optimistic bandit algorithm in the spirit of UCB.

We only specify the necessary variables. Definitions of the remaining quantities can be found in the appendix.

Approximate Experts: The agent is provided access to estimates of the true expert policies for each expert and context. In particular, we assume access to ε-approximate experts, i.e., each approximate policy is within ε of the corresponding true policy.

These estimates can be formed by bootstrapping from prior data. In our advertising example, these estimates can be inferred from the expert’s behavior in previous episodes. We formalize this in the section to follow.
Divergence Estimates and ratio errors: We denote by importance-sampling ratios the elementwise quotients of one expert's policy by another's. A divergence metric between each pair of experts is computed from the ε-approximate experts (Equation (1)); it serves as a lower bound on the true divergence used by D-UCB.

Additionally, we use upper and lower confidence values for the importance-sampling ratios to form underestimates and overestimates of these ratios. These estimates are derived using the ratio concentrations from [8]. To ease exposition, we abuse notation and index these quantities by the expert picked at time t and the realized context and action.
Clipped Importance Sampling Estimator: We define our empirical clipped IS estimator for the mean of each arm at time t in Equation (2): every observed reward is re-weighted by the estimated IS ratio of the target expert to the played expert, the ratio is clipped at an arm-dependent level, and the weighted sum is suitably normalized.

To define the clipper levels, we use a function of the estimated divergences, scaled by a constant. Along with the clipper level, this quantity also controls the bias of the estimate; one can check that the clipper level increases as more samples are collected.
Upper Confidence Bound estimate: The UCB index of an expert at time t, given in Equation (3), adds an exploration bonus to the clipped IS estimate. The use of estimates for the divergences and the IS ratios causes the estimator to be asymptotically inconsistent; we account for this through its maximum deviation at time t.

Note: In the case of full information, the estimator used by D-UCB is identical to that in Equation (2), with all estimated quantities replaced by their true values. Further, the UCB index for D-UCB does not suffer from this additional estimation error.
Putting it all together: The ED-UCB algorithm is summarized in Algorithm 1. It is provided with the problem parameters and ε-oracles for the expert policies. Before interacting with the environment, the divergence estimates are computed. Then, at each time t, the agent is given the context, chooses the expert with the highest index, and observes the expert's recommendation and the resulting reward. Finally, the agent updates the indices of all experts.

1:  Inputs: ε-approximate experts, parameters.
2:  Initialization: compute the divergence estimates as in Equation (1).
3:  for t = 1, 2, … do
4:     Receive the context.
5:     Play the arm with the highest index and observe the recommendation and reward.
6:     Compute the estimate according to Equation (2) for all experts.
7:     Set the indices as in Equation (3) for all experts.
8:  end for
Algorithm 1 ED-UCB: Empirical Divergence-based UCB
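As a rough, hedged sketch of the loop above (a toy instance of our own: the divergence-based clipper levels and confidence bonuses of Equations (1)-(3) are replaced by a fixed clip level and a generic bonus), note that every observed sample updates the clipped-IS estimate of every expert:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy instance: 2 contexts, 3 actions, 3 experts.
n_ctx, n_act, n_exp, T = 2, 3, 3, 5_000
ctx_dist = np.array([0.6, 0.4])
policies = rng.dirichlet(10 * np.ones(n_act), size=(n_exp, n_ctx))
reward_mean = rng.uniform(0.2, 0.9, size=(n_ctx, n_act))

# Stand-in for the epsilon-approximate experts: a slightly perturbed copy.
approx = np.clip(policies + rng.normal(0, 0.005, policies.shape), 0.05, None)
approx /= approx.sum(axis=-1, keepdims=True)

# True mean of each expert, averaged over contexts, actions and rewards.
mu = np.einsum('x,kxa,xa->k', ctx_dist, policies, reward_mean)

clip = 10.0
sums = np.zeros(n_exp)        # running clipped-IS reward sums, one per expert
ucb = np.full(n_exp, np.inf)  # optimistic indices
pulls = np.zeros(n_exp)
for t in range(1, T + 1):
    x = rng.choice(n_ctx, p=ctx_dist)         # receive context
    k = int(np.argmax(ucb))                   # play the highest-index expert
    a = rng.choice(n_act, p=policies[k, x])   # its recommendation
    r = rng.binomial(1, reward_mean[x, a])    # observed reward
    pulls[k] += 1
    # One sample updates *all* experts via clipped importance ratios.
    ratios = np.minimum(approx[:, x, a] / approx[k, x, a], clip)
    sums += ratios * r
    ucb = sums / t + np.sqrt(2.0 * np.log(1 + t) / t)

est = sums / T  # shared estimates of all expert means
```

Because every expert's estimate is refreshed by every sample, suboptimal experts keep being estimated "for free" even when only the leader is played, which is the mechanism behind the constant-regret behavior discussed in Section 3.1.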

3.1 Regret analysis of ED-UCB

We now provide the high-probability constant average regret guarantee for the ED-UCB algorithm. Without loss of generality, for the remainder of this section, we assume that the experts are ordered in decreasing order of their means. We also define the suboptimality gap of an expert as the difference between the best mean and its own. This section is organized as a proof sketch leading to our main result in Theorem 2.

Step 1: Analyzing the estimator:

First, we show that the estimator for each arm is indeed concentrated in an interval around the arm's true mean.

Theorem 1.

The estimator in Equation (2) concentrates around the true arm mean, with the probability of a large deviation decaying exponentially in the number of collected samples; the precise constants are defined in the Appendix.

To prove this result, we first establish a similar bound for an estimator that uses a deterministic number of samples from a specific expert, through standard Chernoff bounds. This is then extended to the online case, with a random number of samples per expert, by constructing a suitable martingale sequence and applying the Azuma-Hoeffding inequality.

Step 2: Per expert concentrations:

Since samples are shared across all experts, once the suboptimal experts are well estimated and sufficiently separated from the best expert, they need not be played: their estimates continue to improve from the samples generated by playing the best expert. To this end, we define, for each expert, a time after which, with high probability, the best expert's index remains above its true mean while every suboptimal expert's index falls below it. Together, these events imply that after this time, each suboptimal expert is played at most a constant number of times on average, which is the key observation that leads to constant regret.

Note: The above inequality can also be shown to hold for D-UCB with full information, by redefining the times used in [25] appropriately. We spell out this consequence in the appendix; it leads to a constant regret bound for D-UCB that holds with probability 1.

Step 3: Main regret result:

We define the regret in a single episode as the per-episode analogue of the cumulative regret defined above. Using the ED-UCB algorithm in this episode, we have the following regret bound.

Theorem 2.

Suppose the provided ε-oracles are sufficiently accurate, in the sense that ε is small relative to the smallest suboptimality gap. Then, with high probability, the expected cumulative regret of ED-UCB in the episode is bounded by a problem-dependent constant determined by the vector of suboptimality gaps of the instance.

We reiterate that all the quantities above are constants defined by the problem parameters, so this high-probability regret bound is a problem-dependent constant. We also note that the condition on ε implicitly enforces a minimum suboptimality gap; smaller gaps can be accommodated by appropriately decreasing ε.

4 Extending to episodes

In this section, we study the episodic setting. Here, the agent acts on the environment for a total of E episodes, each with T time steps. In each episode, the distribution of contexts as well as the reward distribution may change. Changes in the former are handled by estimating worst-case divergence metrics, which are used to transfer information across experts; the latter is handled by re-instantiating the ED-UCB algorithm at the start of each episode. First, we consider the case where the agent is allowed to sample offline and warm-start the expert selection process using the collected estimates. Then, we describe how this process can be carried out online at the cost of some additional regret. All quantities introduced in Section 3 are additionally indexed by the episode number wherever appropriate.

4.1 Bootstrapping with offline sampling and online sampling

Here, the agent is given a priori access to a sampling oracle for the true experts, which can be used to form the ε-approximate experts offline. However, we do not assume that the agent can fix a particular context while sampling the oracle. Instead, the oracle acts under a fixed context distribution satisfying Assumption 1. This can be thought of as learning from data collected from these experts in previous campaigns in the advertising example. Bootstrapping and warm-starting methods are common in contextual bandit algorithms; for examples, see [30, 14] among others.

We would like to gather a prescribed minimum number of samples for each expert under each context. However, due to the randomness in the sampling process, this event can only be guaranteed to happen with a certain probability. We formalize this in the following lemma:

Lemma 3.

Let E₀ denote the event that, after sampling each expert a prescribed number of times, we acquire at least the required number of samples under each context. Then E₀ holds with high probability.
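The sampling requirement can be illustrated with a quick simulation (a hedged toy: the distribution `q`, the target count and the slack factor are our own choices, not the paper's constants). Contexts arrive i.i.d., so pulling each expert on the order of the target count divided by the smallest context probability, with some slack, covers every context with high probability:

```python
import numpy as np

rng = np.random.default_rng(3)

# Context distribution with minimum probability c0 = 0.2 (cf. Assumption 1).
q = np.array([0.5, 0.3, 0.2])
n_min = 50                          # samples wanted under every context
n_pulls = int(3 * n_min / q.min())  # heuristic slack factor of 3

# Estimate how often one expert's pulls cover every context n_min times.
trials = 200
success = sum(
    int(np.all(rng.multinomial(n_pulls, q) >= n_min)) for _ in range(trials)
)
rate = success / trials
```

With these numbers the rarest context receives 150 samples in expectation against a target of 50, so the coverage event fails only with negligible probability, mirroring the high-probability statement of the lemma.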

1:  Inputs: sampling oracles for the true expert policies, parameters.
2:  Bootstrapping: Play each expert the prescribed number of times to build the ε-approximate experts.
3:  Episodic Interaction:
4:  for each episode e = 1, …, E do
5:     Play a fresh instance of ED-UCB (Algorithm 1) with the ε-approximate experts as inputs for T steps.
6:  end for
Algorithm 2 Meta-Algorithm: ED-UCB for Episodic Bandits with Bootstrapping
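The structure of Algorithm 2 can be sketched as a thin meta-loop (our own schematic; `DummyBandit` is a stand-in for an ED-UCB instance): the expensive bootstrapping happens exactly once, while the bandit state is discarded at every episode boundary because the context and reward distributions may have changed.

```python
class DummyBandit:
    """Stand-in for an ED-UCB instance built from approximate experts."""

    def __init__(self, approx_experts):
        self.approx_experts = approx_experts

    def step(self):
        # A real instance would pick an expert and return the observed reward.
        return 1.0

def run_episodic(bandit_factory, bootstrap, n_episodes, horizon):
    approx_experts = bootstrap()  # offline sampling, done exactly once
    total_reward = 0.0
    for _ in range(n_episodes):
        bandit = bandit_factory(approx_experts)  # fresh instance per episode
        for _ in range(horizon):
            total_reward += bandit.step()
    return total_reward

total = run_episodic(DummyBandit, lambda: None, n_episodes=3, horizon=10)
```

Only the approximate experts survive across episode boundaries, which is why the bootstrapping cost appears once in the regret bound rather than once per episode.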

Under the high-probability event E₀, the agent builds ε-approximate experts and uses them to instantiate the ED-UCB algorithm in each episode. The confidence in these estimates uses the multinomial concentrations of [28]. Therefore, with high probability, the agent suffers constant regret in each of the E episodes. In the complementary low-probability event, we bound the agent's regret by its worst case, which is linear in the episode length. We assume that T is large enough. We also track the vector of suboptimality gaps in each episode. Our algorithm for multiple episodes is summarized in Algorithm 2.

Then, a straightforward application of the Law of Total Probability gives us the following result:

Theorem 4.

An agent that is bootstrapped with Õ(N log(NT²√E)) samples per expert from the true environment, and thereafter uses Algorithm 1 in each of the E episodes, suffers a regret of Õ(E(N+1) + N√E/T²).

This result extends naturally to the online setting. In this case, the agent spends an initial block of time steps collecting samples and building the empirical estimates of the experts; after this time, the agent continues as if it were bootstrapped. Since the expert policies do not change across episodes, the agent incurs this additional regret only once. This leads to the following guarantee:

Theorem 5.

The online estimation of the expert oracles adds to the regret of Theorem 4 an additional term corresponding to the initial sampling phase; the total regret of the online process is bounded accordingly.

Comparing the UCB and D-UCB:

The standard UCB algorithm of [2] provides a regret upper bound of order O(EN log T) by treating each expert as a separate arm of a multi-armed bandit in every episode. In the bootstrapping case, our methods improve the scaling in T order-wise (from logarithmic to constant, with a decaying error term in T). In the online case, we improve the scaling in E, since we carry out the initial estimation phase only once rather than paying an exploration cost in every episode. Our bootstrapped method matches the dominant term of D-UCB (which assumes access to the true expert policies and true context distributions), with an error term that trails off as T increases.

Figure 1: Experiments on the MovieLens 1M data set, with 5 episodes of 50,000 steps each. Plots are averaged over 100 independent runs; error bars indicate one standard deviation. (a) Regret comparison: ED-UCB easily outperforms the naive UCB and KL-UCB algorithms and performs comparably to D-UCB, which is given full knowledge; the effect of ED-UCB's regret not scaling with the episode length can be seen here. (b) The context distributions for each episode.
Remark 6 (Infinite context spaces).

In all of the discussion so far, the use of the context distribution is restricted to defining the divergence quantities and their upper bounds. In the case of continuous contexts, the results in Section 3.1 can be extended by changing Assumption 1 to assume knowledge of upper and lower bounds on the context density, and swapping the summation for an integral in Equation (1). This further extends to the episodic case by assuming access to approximate oracles in place of the bootstrapping and online estimation phases.

Remark 7 (Random episodes).

We present our methods assuming a fixed number of episodes E, each lasting T time steps. However, our methods readily extend to the case where the number of episodes and their lengths are chosen randomly, with E and T serving as deterministic upper bounds on these quantities, respectively.

5 Experiments

We now present numerical experiments to validate the results above. We use the MovieLens 1M data set [16] to construct a semi-synthetic bandit instance. To this end, we first pick movies that have a reasonable number of ratings and complete the remaining entries of the rating matrix using the SoftImpute algorithm of [21], via the fancyimpute package [23]. The 5 best movies are then chosen to act as actions. The experts are built randomly as distributions over these movies under each context, satisfying Assumption 2. In each episode, a new context distribution satisfying Assumption 1 is sampled randomly and fixed for the length of the episode. We set the reward lower bound of Assumption 3 to the mean of the worst expert across all episodes. Rewards are generated from a Bernoulli distribution with the appropriate mean for the given user context and recommended movie.
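Random experts with the required floor on every action probability (Assumption 2) can be generated by mixing random distributions with the uniform one; this is our own illustrative guess at such a construction (the floor value and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_experts(n_experts, n_contexts, n_actions, floor):
    """Random per-context action distributions with every entry >= floor."""
    raw = rng.dirichlet(np.ones(n_actions), size=(n_experts, n_contexts))
    mix = floor * n_actions  # total weight moved to the uniform component
    return (1.0 - mix) * raw + floor

experts = random_experts(n_experts=5, n_contexts=4, n_actions=3, floor=0.05)
```

The mixture keeps each row a valid distribution while guaranteeing the lower bound, so the importance-sampling ratios between any two experts stay bounded.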

We compare our ED-UCB algorithm with three algorithms: UCB [2], KL-UCB [7] and D-UCB [25]. The former two serve as naive baselines for our setting: each treats every expert as an arm of a multi-armed bandit problem, does not exploit the structure among the experts in any way, and incurs a regret of order O(N log T) per episode. D-UCB captures the full-information setting, where the expert policies and context distributions are revealed to the agent before interaction begins; we provide D-UCB with its required parameters. We bootstrap our ED-UCB algorithm (Algorithm 1) with samples as instructed by Theorem 4 and run the experiment over 5 episodes of 50,000 steps each, averaged over 100 independent runs. The results are summarized in Figure 1.

6 Conclusions

We study the episodic bandit problem with stochastic experts, where the context and reward distributions vary with episodes. By leveraging the expert structure to share information, we develop the ED-UCB algorithm and provide regret guarantees that do not scale with the length of the episodes. We also specify the number of bootstrap samples required, which grows logarithmically in the number and size of the episodes. Finally, we empirically compare our performance with that of the D-UCB algorithm in the full-information setting and observe comparable performance in spite of imperfect information.

Limitations: The assumptions we make in Section 2.1 are crucial to the theory developed in this work. Further, as our regret analysis compares to the best expert, it is implicitly assumed that the best expert provides mean rewards that are high enough to be suitable for use in practice. The problem of designing agents when the problem parameters of Section 2.1 are not known beforehand remains open.

Societal Impact: Our main contributions are mainly algorithmic and analytical. As we are motivated by real-world settings, it is important to judge the data that is being used to build as well as run the system in practice. In particular, our methods can be used to recommend articles and advertisements to user populations in an online fashion. In this setting, biases derived from the data in the bootstrapping phase of our methods can prevail in the online phase, leading to insensitive recommendations, thus requiring sufficient testing before deployment.


This research was partially supported by NSF Grants 1826320 and 2019844, ARO grant W911NF-17-1-0359, US DOD grant H98230-18-D-0007, and the Wireless Networking and Communications Group Industrial Affiliates Program.


  • [1] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. Cited by: §1.2.
  • [2] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2), pp. 235–256. Cited by: Appendix E, §1.1, §2, §4.1, §5.
  • [3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §1.2.
  • [4] M. G. Azar, A. Lazaric, and E. Brunskill (2013) Sequential transfer in multi-armed bandit with finite set of models. arXiv preprint arXiv:1307.6887. Cited by: §1.2, §1.2.
  • [5] J. Baxter (1998) Theoretical models of learning to learn. In Learning to learn, pp. 71–94. Cited by: §1.2.
  • [6] S. Bubeck and N. Cesa-Bianchi (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721. Cited by: §1.2.
  • [7] O. Cappé, A. Garivier, O. Maillard, R. Munos, G. Stoltz, et al. (2013) Kullback–leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics 41 (3), pp. 1516–1541. Cited by: Appendix E, §1.1, §2, §5.
  • [8] S. Cayci, A. Eryilmaz, and R. Srikant (2019) Learning to control renewal processes with bandit feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3 (2), pp. 1–32. Cited by: Appendix B, Appendix B, §3.
  • [9] L. Cella, A. Lazaric, and M. Pontil (2020) Meta-learning with stochastic linear bandits. In International Conference on Machine Learning, pp. 1360–1370. Cited by: §1.2, §1.2.
  • [10] D. Charles, M. Chickering, and P. Simard (2013) Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research 14. Cited by: §1.2.
  • [11] W. Chu, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. Cited by: §1.2.
  • [12] A. A. Deshmukh, U. Dogan, and C. Scott (2017) Multi-task learning for contextual bandits. arXiv preprint arXiv:1705.08618. Cited by: §1.2.
  • [13] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang (2011) Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369. Cited by: §1.2.
  • [14] D. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire (2018) Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 1539–1548. Cited by: §1.2, §4.1.
  • [15] D. Foster and A. Rakhlin (2020) Beyond ucb: optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199–3210. Cited by: §1.2.
  • [16] F. M. Harper and J. A. Konstan (2015) The movielens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 1–19. Cited by: Appendix E, §1.1, §5.
  • [17] B. Kveton, M. Konobeev, M. Zaheer, C. Hsu, M. Mladenov, C. Boutilier, and C. Szepesvari (2021) Meta-thompson sampling. arXiv preprint arXiv:2102.06129. Cited by: §1.2, §1.2.
  • [18] J. Langford and T. Zhang (2007) Epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems (NIPS 2007) 20, pp. 1. Cited by: §1.2.
  • [19] F. Lattimore, T. Lattimore, and M. D. Reid (2016) Causal bandits: learning good interventions via causal inference. arXiv preprint arXiv:1606.03203. Cited by: §1.2.
  • [20] A. R. Mahmood, H. Van Hasselt, and R. S. Sutton (2014) Weighted importance sampling for off-policy learning with linear function approximation.. In NIPS, pp. 3014–3022. Cited by: §1.2.
  • [21] R. Mazumder, T. Hastie, and R. Tibshirani (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11 (Aug), pp. 2287–2322. Cited by: Appendix E, §5.
  • [22] C. Perlich, B. Dalessandro, T. Raeder, O. Stitelman, and F. Provost (2014) Machine learning for targeted display advertising: transfer learning in action. Machine Learning 95 (1), pp. 103–127. Cited by: §1.
  • [23] A. Rubinsteyn, S. Feldman, T. O'Donnell, and B. Beaulieu-Jones. Fancyimpute 0.5.4. Note: DOI: 10.5281/zenodo.51773; License: Apache Software License. Cited by: §5.
  • [24] R. Sen, K. Shanmugam, A. G. Dimakis, and S. Shakkottai (2017) Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp. 3057–3066. Cited by: §C.1, §1.2.
  • [25] R. Sen, K. Shanmugam, and S. Shakkottai (2018) Contextual bandits with stochastic experts. In International Conference on Artificial Intelligence and Statistics, pp. 852–861. Cited by: §C.5, §C.5, Appendix E, §1.1, §1.2, §2, §2, §3.1, §3, §5, Lemma 16.
  • [26] D. Simchi-Levi and Y. Xu (2020) Bypassing the monster: a faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN. Cited by: §1.2.
  • [27] S. Thrun (1998) Lifelong learning algorithms. In Learning to learn, pp. 181–209. Cited by: §1.2.
  • [28] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger (2003) Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. Cited by: Appendix B, Appendix B, §4.1.
  • [29] J. Yang, W. Hu, J. D. Lee, and S. S. Du (2020) Provable benefits of representation learning in linear bandits. arXiv preprint arXiv:2010.06531. Cited by: §1.2.
  • [30] C. Zhang, A. Agarwal, H. Daumé III, J. Langford, and S. N. Negahban (2019) Warm-starting contextual bandits: robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301. Cited by: §1.2, §4.1.
  • [31] W. Zhang, S. Yuan, and J. Wang (2014) Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1077–1086. Cited by: §1.

Appendix A Notations

In this section, we provide exact definitions and descriptions of all the notation used in the main body of this paper.

Notations in Section 2:

  • : These represent the finite sets that the contexts and actions are picked from respectively.

  • : The set of experts provided to the agent. Expert is equivalently referred to as expert for any . Each expert is characterized by probability distributions over the set . Specifically, given a context , the expert is characterized by the conditional distribution for each .

  • : The reward variable. It is assumed that rewards lie in [0, 1], but this can be extended to any bounded interval with appropriate shifting and scaling.

  • : The number of episodes and the length of each episode respectively.

  • : The context and reward distributions at each time in episode . Specifically, in episode , for each the context , where is the action chosen by the agent.

  • : The mean of expert in episode . Mathematically, where the expectation is taken over the joint distribution .

  • : The mean of the best expert in episode , .

Notations in Section 2.1:

  • : The lower bound on the minimum probability with which any context is seen in any episode.

  • : The lower bound on the minimum probability with which any action is picked by any expert.

  • : The lower bound on the minimum reward achieved by any expert in any episode.


Notations in Section 3:

  • -approximate experts: These are estimates of the true experts which satisfy, for all and ,

  • : The Importance Sampling (IS) ratio between experts on the action under context . .

  • : Upper and lower confidence bounds on the estimated IS ratio . Specifically,

  • : Estimates of the -divergence between the experts and . Here, .

  • : The normalizing factor in the IS estimator. for each expert where is the expert picked at time .

  • : The bias of the IS estimator due to clipping. For this, we define . Then,

  • : The IS estimate of the mean of expert by time .

  • : The maximum error in the IS estimate due to the use of estimated quantities. Specifically,

  • : The Upper Confidence Bound estimate of the mean of expert , which includes the IS estimate and the total bias due to clipping and estimation.
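To make the structure of the estimator concrete, a minimal sketch of a clipped importance-sampling mean estimate is given below. The dictionary-based `expert_probs` interface, the free `clip` parameter, and all names are illustrative assumptions; in the paper, the clipping level is tied to the (estimated) divergence between experts:

```python
def clipped_is_estimate(rewards, contexts, actions, played, target,
                        expert_probs, clip):
    """Illustrative clipped importance-sampling estimate of the mean
    reward of expert `target`, built from samples gathered while
    playing (possibly different) experts.

    expert_probs[k][(x, a)]: probability that expert k picks action a
    under context x (known exactly or estimated, as in the text).
    """
    num, norm = 0.0, 0.0
    for r, x, a, k in zip(rewards, contexts, actions, played):
        ratio = expert_probs[target][(x, a)] / expert_probs[k][(x, a)]
        w = min(ratio, clip)  # clipping controls the variance of the IS ratio
        num += w * r          # clipped-IS-weighted reward
        norm += w             # normalizing factor (the Z term above)
    return num / norm if norm > 0 else 0.0
```

Clipping biases the estimate (the bias term above), but in exchange bounds the variance, which is what makes sharing samples across experts profitable.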

Notation in Section 3.1:

  • : The true divergence between experts given by

  • : The upper bound on the true maximum divergence between any two experts , . A trivial upper bound is used, given by . We also set in the definition of for each expert .

  • : The vector of suboptimality gaps, where . We assume here, without loss of generality, that experts are arranged in decreasing order of means: .

  • : The first time at which all clippers are active for all experts, i.e.,

  • : The first time at which one can conclude with high probability that the best arm is never underestimated. We define

    For any with probability at least ,

  • : The first time at which one can conclude with high probability that the -best expert is never overestimated. We define

    For any , with probability at least , .

  • : The regret suffered by the agent by time , .

Appendix B Some useful concentrations

We start with some useful concentration inequalities. First, we consider a modified result from [28]:

Lemma 8.

Let be a probability vector with points of support. Let be an empirical estimate of using i.i.d. draws. Then, for any and , it holds that

Proof: The result follows from that of [28] as for any vector .
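The flavor of Lemma 8 can be checked numerically. The sketch below computes the standard Weissman-style tail bound 2^a exp(-n eps^2 / 2) (the lemma in the text is a modified version, so its constants may differ), along with a Monte-Carlo estimate of the expected L1 deviation; all function names are our own:

```python
import math
import random

def l1_tail_bound(support_size, n, eps):
    """Standard Weissman-type upper bound on P(||p - p_hat||_1 >= eps) for
    an empirical distribution from n i.i.d. draws on `support_size` points."""
    return (2 ** support_size) * math.exp(-n * eps ** 2 / 2)

def mean_l1_deviation(p, n, trials=500, seed=0):
    """Monte-Carlo estimate of E||p - p_hat||_1, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counts = [0] * len(p)
        for _ in range(n):
            u, acc = rng.random(), 0.0
            for i, pi in enumerate(p):  # sample an index from p via its CDF
                acc += pi
                if u < acc:
                    counts[i] += 1
                    break
        total += sum(abs(pi - c / n) for pi, c in zip(p, counts))
    return total / trials
```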

The next lemma is adapted from Lemma 5.1 in [8]. It provides confidence bounds for ratios of random variables.

Lemma 9.

Suppose are probability vectors that share the same support with at least mass on each support point. Let be their respective empirical estimates such that . Call, for each , and . With probability at least , it holds that

Proof: The proof follows that of Lemma 5.1 in [8]. To ease notation, we fix an arbitrary and denote , for . Under the event that for , we have

Similarly, we have that

Since the choice of was arbitrary and the event holds with probability at least , the result follows.
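The interval produced by Lemma 9 can be sketched directly: if each true coordinate lies within eps of its empirical estimate, the true ratio is sandwiched by the worst-case combinations below. The exact constants in the lemma may differ; this is only the natural illustration:

```python
def ratio_confidence_bounds(p_hat, q_hat, eps):
    """Given coordinate-wise |p_i - p_hat_i| <= eps and |q_i - q_hat_i| <= eps,
    return (lower, upper) bounds on each true ratio p_i / q_i."""
    bounds = []
    for ph, qh in zip(p_hat, q_hat):
        lo = max(ph - eps, 0.0) / (qh + eps)    # smallest numerator / largest denominator
        hi = (ph + eps) / max(qh - eps, 1e-12)  # largest numerator / smallest denominator
        bounds.append((lo, hi))
    return bounds
```

Per-coordinate intervals of this kind are what feed the upper and lower confidence bounds on the IS ratios defined in Section 3.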

Appendix C Proofs of results in Section 3.1

To prove the concentration result in Theorem 1, we will first consider a simpler setting of two arms with deterministic samples. We refer the reader to Appendix A for definitions and descriptions of the quantities used here.

C.1 A simpler case of two arms

We begin by considering two arms, . For simplicity, we assume for this section (in the general case, all the statements hold with probability at least ). We are given access to samples from arm and seek to estimate the mean of arm using the approximate experts . The arguments in this section closely follow the analysis of the two-armed estimator in the full-information case in [24]. We start with the following assumption:

Assumption 4.

is such that

We define

Note that by Lemma 9, we have that

For an arbitrarily chosen , we write


Then, we have the following claims:

Claim 1.

With as above, for any