1 Introduction
Recommendation systems that suggest items to users are commonplace in online services such as marketplaces, content delivery platforms and ad placement systems. Over time, such systems learn from user feedback and improve their recommendations. An important caveat, however, is that both the distribution of user types and their respective preferences change over time, thus inducing changes in the optimal recommendation and requiring the system to periodically “reset” its learning.
In this paper, we consider systems with known changepoints (a.k.a. episodes) in the distribution of user types and preferences. Examples include seasonality in product recommendations, where interests change markedly with the time of year, or ad placements that depend on the time of day. While a baseline strategy would be to relearn the recommendation algorithm in each episode, it is often advantageous to share some learning across episodes. Specifically, one often has access to a (potentially very) large number of pretrained recommendation algorithms (a.k.a. experts), and the goal then is to quickly determine (in an online manner) which expert is best suited to a specific episode. Crucially, the relationship among these experts can be learned over time, meaning that given samples of (recommended action, reward) from the deployment of one expert, we can infer what the reward would have been had some other expert been used. Such learned “transfer” across experts uses data from all the deployed experts over past episodes and extracts invariant relationships that hold across episodes, while the data collected in each episode, alongside this learned transfer, permits one to quickly determine the episode-dependent best expert.
To motivate the above episodic setting, consider online advertising agencies: companies with proprietary ad-recommendation algorithms that place ads for other product companies on newspaper websites based on past campaigns. In each campaign, the agency places ads for a specific product of the client (e.g., a flagship car, gaming consoles, etc.) in order to maximize the click-through rate of users on the newspaper website. At any given time, the agency signs contracts for new campaigns with new companies. The information about product features and user profiles forms the context, whose distribution changes across campaigns due to changes in user traffic and updated product lineups. This can also cause shifts in user preferences. In practice, the agency already has a finite inventory of ad-recommendation models from past campaigns (a.k.a. experts; typically logistic models, owing to the microsecond-scale inference delays mandated by real-time user traffic). On a new campaign, online ad agencies bid for slots in news media outlets depending on the profile of the user that visits the website, using these pre-learned experts (see [22, 31]). In this setting, agencies only relearn which expert in their inventory works best (and possibly fine-tune it) for the new campaign. Our work models this episodic setup, albeit without fine-tuning of experts between campaigns.
1.1 Main contributions
We formulate this problem as an Episodic Bandit with Stochastic Experts. Here, an agent interacts with an environment through a set of experts over episodes. Each expert is characterized by a fixed and unknown conditional distribution over actions given the context. At the start of each episode, the context distribution as well as the reward distribution changes, and both remain fixed over the length of the episode. At each time, the agent observes the context, chooses one of the experts and plays the recommended action to receive a reward. Note here that the expert policies remain invariant across all episodes.
The goal of the agent is to track the episode-dependent best expert in order to maximize the cumulative sum of rewards. Here, the best expert in a given episode is the one that generates the maximum average reward, averaged over the randomness in contexts, recommendations and rewards. Due to the stochastic nature of experts, we can use Importance Sampling (IS) estimators to share reward information across experts, leveraging this information leakage.
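As an illustration of this information leakage, the following sketch estimates the mean reward of one expert using only samples logged under another. All distributions here are hypothetical stand-ins (and the paper's actual estimator additionally clips and uses learned policies); mixing with the uniform distribution mirrors the probability floor of Assumption 2 and keeps the importance weights bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 3, 4

def random_policy():
    # Mixing with uniform keeps every probability >= 0.05 (bounded weights).
    return 0.8 * rng.dirichlet(np.ones(n_actions), size=n_contexts) + 0.2 / n_actions

pi_behavior = random_policy()                   # the deployed expert
pi_target = random_policy()                     # the expert we wish to evaluate
mu = rng.uniform(size=(n_contexts, n_actions))  # mean reward of each (context, action)

# Deploy ONLY the behavior expert, logging (context, action, reward).
T = 200_000
xs = rng.integers(n_contexts, size=T)           # uniform context distribution
cdf = np.cumsum(pi_behavior, axis=1)
u = rng.random(T)
acts = np.minimum((u[:, None] > cdf[xs]).sum(axis=1), n_actions - 1)
rewards = mu[xs, acts]                          # noiseless rewards, for illustration

# Importance-sampling estimate of the TARGET expert's mean reward,
# using only the behavior expert's samples.
weights = pi_target[xs, acts] / pi_behavior[xs, acts]
is_estimate = np.mean(weights * rewards)

# Ground truth for comparison.
true_mean = np.mean([pi_target[x] @ mu[x] for x in range(n_contexts)])
```

With bounded weights, the estimate concentrates around the target expert's true mean even though that expert was never deployed.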
Our main contributions are as follows:
1. Empirical Divergence-Based Upper Confidence Bound (EDUCB) Algorithm:
We develop the EDUCB algorithm (Algorithm 1) for the episodic bandit problem with stochastic experts. Similar in spirit to the DUCB algorithm in [25], EDUCB employs a clipped IS estimator to predict the reward of each expert based on the estimated expert policies, allowing the samples collected under a particular expert to be used to estimate the behavior of the remainder through appropriate scaling and clipping. In the single-episode setting, we show that, with high probability, EDUCB with approximate oracles for the expert policies provides constant average cumulative regret, where the constant does not scale with the duration of interaction. Specifically, if the policies are well approximated with a given probability, then with the same probability EDUCB incurs a regret bounded by a constant that does not scale with the duration of play. Our analysis also improves the existing regret bound for DUCB, which promises logarithmic regret in the full-information case; we show that this can be tightened to a problem-dependent constant bound that holds with probability 1.
2. Episodic behavior with bootstrapping:
We also specify the construction of the approximate experts used by EDUCB in the case where the context space is finite. We show that if the agent is bootstrapped with sufficiently many samples per expert, the use of EDUCB over all episodes provides a regret bound whose dominant term does not scale with the episode length. DUCB in the full-information setting guarantees a comparable regret. Naive algorithms such as UCB [2] (or KL-UCB [7]) suffer a regret that is logarithmic in the episode length within every episode, over which we improve order-wise, demonstrating the merits of sharing information among experts. We also mention how our methods can easily be extended to continuous context spaces.
3. Empirical evaluation:
We validate our findings empirically through simulations on the Movielens 1M dataset
[16]. We split users into contexts randomly, pick a selection of movies and generate random experts for recommendation. By varying the context distribution in each episode, we compare the performance of EDUCB with naive optimistic algorithms, which it outperforms by a wide margin, and with DUCB, to which its performance is comparable.
1.2 Related work
Adapting to changing environments forms the basis of meta-learning [27, 5], where agents learn to perform well on new tasks that appear in phases but share underlying similarities with the tasks seen in the past. Our approach can be viewed as an instance of meta-learning for bandits, where we are presented with varying environments in each episode with similarities across episodes. Here, the objective is to act to achieve the maximum possible reward through bandit feedback, while also using past observations (including offline data if present). This setting is studied in [4], where a finite hypothesis space maps actions to rewards, with each phase having its own true hypothesis. The authors propose a UCB-based algorithm that learns the hypothesis space across phases, while quickly learning the true hypothesis in each phase with the current knowledge. Similarly, linear bandits whose instances have a common unknown but sparse support are studied in [29]. In [9, 17], meta-learning is viewed from a Bayesian perspective, where in each phase an instance is drawn from a common, unknown meta-prior. In particular, [9]
studies meta-linear bandits and provides regret guarantees for a regularized ridge regression, whereas
[17] uses Thompson sampling for general problems, with Bayesian regret bounds for K-armed bandits.
Collective learning in a fixed contextual environment with bandit feedback, where the rewards of various arm and context pairs share a latent structure, is known as Contextual Bandits ([3, 11, 6, 18, 13, 1, 26, 12] among several others), where actions are taken with respect to a context that is revealed in each round. In various works [1, 26, 15, 14], a hypothesis space is assumed to capture the mapping of arm and context pairs to rewards, either exactly (realizable setting) or approximately (non-realizable), and bandit feedback is used to find the true hypothesis, which provides the greedy optimal action, while adding enough exploration to aid learning.
Importance Sampling (IS) is used to transfer knowledge about random quantities under a known target distribution using samples from a known behavior distribution, in the context of off-policy evaluation in reinforcement learning [20]. Further, clipping is a common method to control the high variance of IS estimates by introducing a controlled amount of bias. In the case of best-arm identification, these methods were studied in [10, 19, 24]. Finally, bootstrapping has been used in [30] in order to use offline supervised data to accelerate the online learning process.
Meta-learning algorithms take a model-based approach, where the invariant structure (hypothesis space in [4] or meta-prior in [9, 17]) is first learned in order to make optimal decisions, while most contextual bandit algorithms are policy-based, trying to learn the optimal mapping by imposing structure on the policy space. Our approach falls in the latter category of optimizing over policies (a.k.a. experts) from a given finite set. However, contrary to the commonly assumed deterministic policies, each policy in our setting is given by a fixed distribution over arms conditioned on the context, which is learned by bootstrapping from offline data. Using the estimated experts, in each episode (where both the per-context arm rewards and the context distribution change), we quickly learn the average rewards of the experts by collectively using samples from all the experts. In [25], a single episode of our setting is considered for the case where the policy and context distributions are known to the agent; thus it does not capture episodic learning. Instead, we build on the importance sampling (IS) approach therein and propose empirical IS, learning the expert policies via bootstrapping from offline data and adapting to changing reward and context distributions online. Furthermore, we tighten the single-episode regret from logarithmic in the episode length to constant.
2 Problem setup
We follow the setting in [25], where an agent acts on a contextual environment with contexts, actions and rewards. The agent is provided with a set of experts, where each expert is characterized by conditional distributions, or policies, over actions for each context. At each time, the agent receives the context and picks an expert. The action is then sampled from that expert's policy given the context, after which the agent receives a reward. The agent can use the historical observations and the new context to instruct its decisions at each time. This setting can be viewed as a Directed Acyclic Graph over the context, action and reward nodes, where the agent is given soft control on the action node through its choice of expert.
We assume that the experiment proceeds in episodes. In each episode, the distribution over the context set, as well as the distribution of rewards, may change. Further, we assume that the agent is not provided with knowledge of the expert, context and reward distributions.
The goal of the agent is to remain competitive with the best expert in every episode. Specifically, in each episode, the mean reward of an expert is the expectation of the reward taken under the joint distribution of contexts, recommendations and rewards in that episode. The best expert in an episode is then the one with the largest mean. Note that the best expert in each episode need not be the same, due to the episode-dependent context and reward distributions. The agent seeks to minimize the cumulative regret across all episodes.
Possible approaches and performance:
A baseline approach for this model is to apply the Upper Confidence Bound (UCB) algorithm of [2] (or, equivalently, the KL-UCB algorithm of [7]) in each episode, treating the experts as arms of a standard multi-armed bandit problem. This approach is valid here since the mean rewards are averaged over the contexts, and it provides a per-episode regret upper bound that is logarithmic in the episode length. DUCB [25], which assumes access to the expert policies and context distributions, uses clipped IS and median-of-means estimates with specially constructed divergence metrics in order to share samples across experts. Under some assumptions, for a single episode, DUCB provides a logarithmic regret upper bound, and its worst-case upper bound matches that of the naive UCB algorithm.
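For concreteness, a minimal sketch of this baseline, treating each expert as an arm and running UCB1 [2] from scratch within an episode (the Bernoulli rewards and exploration constant are illustrative choices, not the paper's setup):

```python
import math
import random

def ucb_episode(expert_means, tau, rng):
    """Run UCB1 over the experts, treated as arms, for one episode of
    length tau; returns the pseudo-regret accumulated in the episode."""
    K = len(expert_means)
    counts, sums = [0] * K, [0.0] * K
    best = max(expert_means)
    regret = 0.0
    for t in range(1, tau + 1):
        if t <= K:
            k = t - 1  # initialization: play each expert once
        else:
            k = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < expert_means[k] else 0.0  # Bernoulli reward
        counts[k] += 1
        sums[k] += r
        regret += best - expert_means[k]
    return regret
```

Since no information is shared across experts or episodes, the per-episode regret of this baseline grows logarithmically in the episode length, which is exactly the scaling that sample sharing improves upon.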
The remainder of the paper is organized as follows. In Section 3, we show that access to approximate oracles leads to high-probability constant regret upper bounds in the single-episode setting. Our analysis can be extended to tighten the worst-case bound of DUCB to a problem-dependent constant that does not scale with the episode length. In Section 4, for the episodic case, we show that these approximate oracles can be constructed using samples from the true experts. We characterize the regret both in the case where the agent is allowed to bootstrap from these samples and when the sampling is performed online. Since our exposition involves heavy notation, we consolidate all notation, with descriptions and definitions, in the Appendix for reference.
2.1 Assumptions
Before we develop our methods, we make the following assumptions:
Assumption 1.
The probability of any context occurring in any episode is bounded below by a positive constant.
Assumption 2.
The probability of any expert setting the value of the action node under any context is bounded below by a positive constant.
Assumption 3.
The minimum reward obtained by any expert in any episode is bounded below by a positive constant.
Remarks on Assumptions:
Assumption 1 ensures that the divergence metrics between arms can be computed reliably. Assumption 2 guarantees that arbitrary experts under any context are absolutely continuous with respect to each other. These assumptions are critical to the use of Importance Sampling to estimate the arm means. The latter assumption is also made implicitly in the analysis of DUCB, by assuming bounded divergences; the former is avoided there by assuming full knowledge of the context distribution.
Assumption 3 is standard and is also made for DUCB. It controls the multiplicative constants in the overall regret bound.
3 The single-episode setting
In this section, we develop the Empirical Divergence-based UCB (EDUCB) algorithm for the single-episode case. To ease notation, we drop all subscripts that relate to episodes in this section. Inspired by the clipped IS estimator of [25], EDUCB first uses estimates of the true expert policies to build estimates of the expert means, which are biased due to the uncertainty in the environment. These estimators are then appropriately inflated to serve as overestimates of the true means, making EDUCB an optimistic bandit algorithm in the spirit of UCB.
We only specify the necessary variables. Definitions of the remaining quantities can be found in the appendix.
Approximate Experts: The agent is provided access to estimators of the true expert policies. In particular, we assume access to approximate experts for each expert and context, i.e., each approximate expert is uniformly close to the corresponding true policy.
These estimates can be formed by bootstrapping from prior data. In our advertising example, these estimates can be inferred from the expert’s behavior in previous episodes. We formalize this in the section to follow.
Divergence Estimates and ratio errors: We work with the importance-sampling ratios between pairs of expert policies. The following divergence metric is computed using the approximate experts:
(1) 
where the estimated divergence serves as a lower bound on the true divergence used by DUCB.
Additionally, we use upper and lower confidence values for the importance-sampling ratios to form the corresponding underestimates and overestimates. These estimates are derived using the ratio concentrations from [8]. To ease exposition, we abuse notation and index these quantities by the expert picked at each time and the realized context and action.
Clipped Importance Sampling Estimator: We define our empirical clipped IS estimator for the mean of arm at time as
(2) 
where is the normalizing factor.
To define the clipper levels, we use a logarithmic clipping function scaled by a constant. Along with the clipper level, this quantity also controls the bias of the estimate. It is easy to check that the clipper level is increasing in the number of samples.
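A schematic of the clipping operation is given below. This is illustrative only: the paper's estimator additionally normalizes and sets the clip level via the divergence estimates, whereas here the clip level is a free parameter.

```python
import numpy as np

def clipped_is_estimate(rewards, weights, clip_level):
    """Clipped importance-sampling mean: samples whose importance weight
    exceeds the clipper level are zeroed out, introducing a controlled
    downward bias in exchange for bounded variance."""
    w = np.where(weights <= clip_level, weights, 0.0)
    return float(np.mean(w * rewards))
```

Raising the clip level with the sample count, as the text describes, lets the bias vanish as more data accumulates while keeping the variance under control.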
Upper Confidence Bound estimate: The UCB index of expert at time is set to be
(3) 
The use of estimates for the divergence and the IS ratios causes the estimator to be asymptotically inconsistent. We denote the maximum deviation of the estimator at each time accordingly.
Note: In the case of full information, the estimator used by DUCB is identical to that in Equation 2, with all the estimated quantities replaced by their true values. Further, the UCB index for DUCB does not suffer from additional estimation error.
Putting it all together: The EDUCB algorithm is summarized in Algorithm 1. It is provided with the problem parameters and oracles for the expert policies. Before interacting with the environment, the divergence estimates are computed. Then, at each time, the agent is given the context, chooses an expert, and observes the expert's recommendation and the reward. Finally, the agent updates the indices of all experts.
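The loop just described can be sketched as follows. This is a simplified stand-in for Algorithm 1: the divergence-based indices and normalization are replaced by a plain clipped-IS average with a generic exploration bonus, and all names and parameters are illustrative.

```python
import math
import numpy as np

def is_index(j, history, policies_hat, clip, beta, t):
    """Clipped-IS mean estimate for expert j built from ALL logged samples,
    plus a generic optimism bonus (a stand-in for the paper's UCB index)."""
    num, den = 0.0, 0
    for (x, a, r, k) in history:
        w = policies_hat[j][x][a] / policies_hat[k][x][a]
        if w <= clip:          # clipped samples contribute zero
            num += w * r
        den += 1
    return num / den + beta * math.sqrt(math.log(t) / den)

def ed_ucb_episode(policies_hat, env_step, n_experts, tau, clip, beta, rng):
    """One episode: observe context, pick the highest-index expert, play its
    recommendation, and log the sample for every expert's future estimate."""
    n_contexts = len(policies_hat[0])
    history = []                               # (context, action, reward, expert)
    for t in range(1, tau + 1):
        x = int(rng.integers(n_contexts))      # observe context
        if t <= n_experts:
            k = t - 1                          # initialization: each expert once
        else:
            k = int(np.argmax([is_index(j, history, policies_hat, clip, beta, t)
                               for j in range(n_experts)]))
        a = int(rng.choice(len(policies_hat[k][x]), p=policies_hat[k][x]))
        history.append((x, a, env_step(x, a), k))
    return history
```

Because every logged sample feeds every expert's index, suboptimal experts separate from the best one quickly even when they are rarely played.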
3.1 Regret analysis of EDUCB
We now provide the high-probability constant average regret guarantee for the EDUCB algorithm. Without loss of generality, for the remainder of this section, we assume that the experts are ordered by their means. We also define the suboptimality gap of each expert as the difference between the best mean and its own. This section is organized as a proof sketch leading to our main result in Theorem 2.
Step 1: Analyzing the estimator:
First, we show that the estimator of each arm's mean indeed concentrates in an interval around the true mean.
Theorem 1.
In order to prove this result, we establish a similar bound for the estimator that uses a deterministic number of samples from a specific expert, through the use of standard Chernoff bounds. This is then extended to the online case, with a random number of samples per expert, by constructing a specific martingale sequence and using the Azuma-Hoeffding inequality.
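As a quick sanity check of the concentration tool (this is not the paper's martingale, which is built from the IS estimates), the simulation below verifies the Azuma-Hoeffding tail bound P(|S_n| >= t) <= 2 exp(-t^2 / (2 n c^2)) for a martingale with increments bounded by c:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, runs = 1000, 1.0, 5000

# A simple martingale with increments bounded by c: centered coin flips.
increments = rng.choice([-c, c], size=(runs, n))
S = increments.sum(axis=1)

t = 3.0 * np.sqrt(n)                              # deviation level
empirical = np.mean(np.abs(S) >= t)               # observed tail frequency
azuma = 2.0 * np.exp(-t**2 / (2.0 * n * c**2))    # Azuma-Hoeffding bound
```

The observed tail frequency sits well below the bound, as expected for this light-tailed example.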
Step 2: Per expert concentrations:
Since samples are shared across all experts, once the suboptimal experts are well-estimated and sufficiently separated from the best expert, they need not be played. Their estimates will continue to improve due to samples received by playing the best expert. To this end, we define the following times:
Then, the following hold with probability at least :
1. At time , ,
2. For any , for any , .
Together, these imply that, for any suboptimal expert,
after this time, the number of times a suboptimal expert is played is bounded by a constant on average, which is the key observation that leads to constant regret.
Note: The above inequality can also be shown to hold for DUCB with full information, by redefining the times used in [25] appropriately. We specify this consequence in the appendix. This leads to a constant regret bound that holds with probability 1.
Step 3: Main regret result:
We define the regret in a single episode as in Section 2. Using the EDUCB algorithm in this episode, we have the following regret bound.
Theorem 2.
Suppose the provided oracles are such that
Consider the quantities defined in the lemma above. Then, with the stated probability, the expected cumulative regret of EDUCB is bounded as
We reiterate that all the quantities above are constants defined by the problem parameters, and thus this high-probability regret bound is a problem-dependent constant. We note that the condition on the oracles implicitly enforces a minimum separation between the experts; smaller gaps can be accommodated by appropriately decreasing the oracle error.
4 Extending to episodes
In this section, we study the episodic setting. Here, the agent acts on the environment for a given number of episodes, each with a fixed number of time steps. In each episode, the distribution of contexts as well as the reward distribution may change. Changes in the former are handled by estimating worst-case divergence metrics to be used to transfer information across experts. The latter is handled by reinstantiating the EDUCB algorithm at the start of each episode. First, we consider the case where the agent is allowed to sample offline and warm-start the expert selection process using the collected estimates. Then, we describe how this process can be carried out naturally online at the cost of some additional regret. All the quantities introduced in Section 3 are additionally indexed by the episode number wherever appropriate.
4.1 Bootstrapping with offline sampling and online sampling
Here, the agent is given a priori access to a sampling oracle of the true experts, which can be used to form the approximate experts offline. However, we do not assume that the agent can ask for a particular context to be fixed while sampling the oracle. Instead, the oracle acts using a fixed context distribution satisfying Assumption 1. This can be thought of as learning from data collected from these experts in previous campaigns in the advertising example. Bootstrapping and warm-starting methods are common in contextual bandit algorithms; for example, see [30, 14] among others.
We would like to gather a prescribed number of samples for each expert under each context. However, due to the randomness in the sampling process, this event can only be guaranteed with a certain probability. We formalize this in the following lemma:
Lemma 3.
Define the good event as acquiring, after sampling each expert a prescribed number of times, at least the required number of samples under each context. Then, we have that
Under this high-probability event, the agent builds the approximate experts and uses them to instantiate the EDUCB algorithm in each episode. The confidence in these estimates uses the multinomial concentrations of [28]. Therefore, with high probability, the agent suffers constant regret in each of the episodes. In the case of the complementary low-probability event, we assume that the agent suffers a worst-case regret that is at most linear in the episode length. We assume that the number of offline samples is large enough, and we collect the suboptimality gaps of each episode into a vector. Our algorithm for multiple episodes is summarized in Algorithm 2.
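The construction of the approximate experts from offline samples amounts to empirical conditional distributions, sketched below. The uniform fallback for unseen contexts is an illustrative choice, not the paper's; under the sampling guarantee of Lemma 3, every context is observed with high probability.

```python
import numpy as np

def estimate_policies(samples, n_contexts, n_actions):
    """Build approximate experts as empirical conditionals P_hat(a | x)
    from offline (context, action) samples of each expert.

    `samples` maps an expert index to a list of (context, action) pairs."""
    policies = {}
    for k, pairs in samples.items():
        counts = np.zeros((n_contexts, n_actions))
        for x, a in pairs:
            counts[x, a] += 1.0
        row_sums = counts.sum(axis=1, keepdims=True)
        # Contexts never observed fall back to uniform (illustrative choice).
        policies[k] = np.where(row_sums > 0,
                               counts / np.maximum(row_sums, 1.0),
                               1.0 / n_actions)
    return policies
```

With enough samples per (expert, context) pair, these empirical conditionals satisfy the uniform-accuracy requirement of the approximate oracles.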
Then, a straightforward application of the Law of Total Probability gives us the following result:
Theorem 4.
An agent that is bootstrapped with the prescribed number of samples from the true environment for each expert, and uses Algorithm 1 in each of the episodes thereafter, suffers a regret
This result extends naturally to the online setting. In this case, the agent spends an initial phase of time steps collecting samples and building the empirical estimates of the experts. After this phase, the agent continues as if it were bootstrapped. Since the expert policies do not change with episodes, the agent incurs this additional regret only once. This leads to the following guarantee:
Theorem 5.
The online construction of the estimation oracles adds additional regret to that in Theorem 4. The total regret of the online process can be bounded as
Comparison with UCB and DUCB:
The standard UCB algorithm of [2] provides a regret upper bound that is logarithmic in the episode length within every episode, by treating each expert as a separate arm of a multi-armed bandit. In the bootstrapping case, our methods improve this scaling order-wise, from logarithmic to constant with a decaying error term. In the online case, we improve the scaling in the number of episodes, since we carry out the initial estimation phase only once, and thus do not incur a per-episode estimation cost. Our bootstrapped methods match the dominant term of DUCB (which assumes access to the true expert policies and true context distributions), with an error term that decays.
Figure 1: Experiments on the Movielens 1M data set, with 5 episodes of 50000 steps each. Plots are averaged over 100 independent runs; error bars indicate one standard deviation. EDUCB easily outperforms the naive UCB and KL-UCB algorithms and performs comparably to DUCB, which is given full knowledge. The effect of EDUCB's regret not scaling with the episode length can be seen here. The figure on the right displays the context distributions for each episode.
Remark 6 (Infinite context spaces).
In all the discussion so far, the use of the context distribution is restricted to defining the divergence quantities and the regret upper bound. In the case of continuous contexts, the results in Section 3.1 can be extended by changing Assumption 1 to assume knowledge of upper and lower bounds on the density function, and swapping out the summation for an integral in Equation (1). This can further be extended to the episodic case by assuming access to approximate oracles instead of the bootstrapping and online estimation phases.
Remark 7 (Random episodes).
We present our methods assuming a fixed number of episodes, each lasting a fixed number of time steps. However, our methods readily extend to the case where the number of episodes and their lengths are chosen randomly, with deterministic upper bounds on these quantities.
5 Experiments
We now present numerical experiments to validate the results above. We use the Movielens 1M data set [16] to construct a semi-synthetic bandit instance. To this end, we first pick movies that have a reasonable number of ratings and complete the remaining entries of the rating matrix using the SoftImpute algorithm of [21], via the fancyimpute package [23]. The 5 best movies are then sampled to act as actions. The experts are built randomly as distributions over these movies under each context, satisfying Assumption 2. In each episode, a new context distribution satisfying Assumption 1 is sampled randomly and is fixed for the length of the episode. We set the algorithm parameters using the mean of the worst expert across all episodes. Rewards are generated from a Bernoulli distribution with the appropriate means under the given user context and recommended movie.
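A sketch of how such random experts with a probability floor can be generated; the mixing construction below is our illustrative choice for enforcing Assumption 2, not necessarily the one used in the experiments.

```python
import numpy as np

def random_expert(n_contexts, n_actions, p_min, rng):
    """Sample a random expert policy, i.e., one action distribution per
    context, with every conditional probability at least p_min."""
    assert p_min * n_actions < 1.0
    raw = rng.dirichlet(np.ones(n_actions), size=n_contexts)
    eps = p_min * n_actions
    # Mixing with the uniform distribution floors each entry at p_min.
    return (1.0 - eps) * raw + eps / n_actions
```

Each row remains a valid distribution, and the floor guarantees the bounded importance weights that the IS estimator relies on.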
We compare our EDUCB algorithm with three algorithms, namely UCB [2], KL-UCB [7] and DUCB [25]. The former two serve as naive baselines for our setting: each treats every expert as an arm of a multi-armed bandit problem, does not exploit the structure in the experts in any way, and incurs regret logarithmic in the episode length. DUCB captures the full-information setting, where the expert policies and context distributions are revealed to the agent before interaction begins. We bootstrap our EDUCB algorithm (Algorithm 1) with samples as instructed by Theorem 4 and perform the experiment over 5 episodes of 50000 steps each, averaged over 100 independent runs. The results are summarized in Figure 1.
6 Conclusions
We study the episodic bandit problem with stochastic experts, where the context and reward distributions vary across episodes. By leveraging the expert structure to share information, we develop the EDUCB algorithm and provide regret guarantees that do not scale with the length of the episodes. We also specify the number of bootstrapping samples, which grows logarithmically in the number and size of the episodes. Finally, we empirically compare our performance with the DUCB algorithm in the full-information setting and observe comparable performance in spite of imperfect information.
Limitations: The assumptions we make in Section 2.1 are crucial to the theory developed in this work. Further, as our regret analysis compares to the best expert, it is implicitly assumed that the best expert provides mean rewards that are reasonably high, in order to be suitable for use in practice. The problem of designing agents when the number and lengths of the episodes are not known beforehand remains open.
Societal Impact: Our contributions are mainly algorithmic and analytical. As we are motivated by real-world settings, it is important to scrutinize the data used to build, as well as run, the system in practice. In particular, our methods can be used to recommend articles and advertisements to user populations in an online fashion. In this setting, biases in the data used in the bootstrapping phase of our methods can persist in the online phase, leading to insensitive recommendations, thus requiring sufficient testing before deployment.
Acknowledgements
This research was partially supported by NSF Grants 1826320 and 2019844, ARO grant W911NF1710359, US DOD grant H9823018D0007, and the Wireless Networking and Communications Group Industrial Affiliates Program.
References

[1] (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646.
[2] (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2), pp. 235–256.
[3] (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1), pp. 48–77.
[4] (2013) Sequential transfer in multi-armed bandit with finite set of models. arXiv preprint arXiv:1307.6887.
[5] (1998) Theoretical models of learning to learn. In Learning to Learn, pp. 71–94.
[6] (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.
[7] (2013) Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics 41 (3), pp. 1516–1541.
[8] (2019) Learning to control renewal processes with bandit feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3 (2), pp. 1–32.
[9] (2020) Meta-learning with stochastic linear bandits. In International Conference on Machine Learning, pp. 1360–1370.
[10] (2013) Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research 14.

[11] (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214.
[12] (2017) Multi-task learning for contextual bandits. arXiv preprint arXiv:1705.08618.
[13] (2011) Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
[14] (2018) Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 1539–1548.
[15] (2020) Beyond UCB: optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199–3210.
[16] (2015) The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 1–19.
[17] (2021) Meta-Thompson sampling. arXiv preprint arXiv:2102.06129.
[18] (2007) The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems (NIPS 2007) 20, pp. 1.
[19] (2016) Causal bandits: learning good interventions via causal inference. arXiv preprint arXiv:1606.03203.
[20] (2014) Weighted importance sampling for off-policy learning with linear function approximation. In NIPS, pp. 3014–3022.
[21] (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11 (Aug), pp. 2287–2322.

[22] (2014) Machine learning for targeted display advertising: transfer learning in action. Machine Learning 95 (1), pp. 103–127.
[23] Fancyimpute 0.5.4. Note: DOI: 10.5281/zenodo.51773; License: Apache Software License (http://www.apache.org/licenses/LICENSE-2.0.html).
[24] (2017) Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp. 3057–3066. Cited by: §C.1, §1.2.
[25] (2018) Contextual bandits with stochastic experts. In International Conference on Artificial Intelligence and Statistics, pp. 852–861. Cited by: §C.5, §C.5, Appendix E, §1.1, §1.2, §2, §2, §3.1, §3, §5, Lemma 16.
[26] (2020) Bypassing the monster: a faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN. Cited by: §1.2.
[27] (1998) Lifelong learning algorithms. In Learning to Learn, pp. 181–209. Cited by: §1.2.
[28] (2003) Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. Cited by: Appendix B, Appendix B, §4.1.
[29] (2020) Provable benefits of representation learning in linear bandits. arXiv preprint arXiv:2010.06531. Cited by: §1.2.
[30] (2019) Warm-starting contextual bandits: robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301. Cited by: §1.2, §4.1.
[31] (2014) Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1077–1086. Cited by: §1.
Appendix A Notations
In this section, we provide exact definitions and descriptions of all the notation used in the main body of this paper.
Notations in Section 2:

: These represent the finite sets from which the contexts and actions are drawn, respectively.

: The set of experts provided to the agent. Expert is equivalently referred to as expert for any . Each expert is characterized by probability distributions over the set . Specifically, given a context , the expert is characterized by the conditional distribution for each .

: The reward variable. It is assumed that but this can be extended to any bounded interval with appropriate shifting and scaling.

: The number of episodes and the length of each episode respectively.

: The context and reward distributions at each time in episode . Specifically, in episode , for each the context , where is the action recommended by the agent.

: The mean of expert in episode . Mathematically, where the expectation is over the joint distribution .

: The mean of the best expert in episode , .
Notations in Section 2.1:

: The lower bound on the minimum probability with which any context is seen in any episode.

: The lower bound on the minimum probability with which any action is picked by any expert.

: The lower bound on the minimum reward achieved by any expert in any episode.
Mathematically,
Notations in Section 3:

approximate experts: These are estimates of the true experts which satisfy for all and ,

: The Importance Sampling (IS) ratio between experts on the action under context . .

: Upper and lower confidence bounds on the estimated IS ratio . Specifically,

: Estimates of the divergence between the experts and . Here, .

: The normalizing factor in the IS estimator. for each expert where is the expert picked at time .

: The bias of the IS estimator due to clipping. For this, we define . Then,

: The IS estimate of the mean of expert by time .

: The maximum error in the IS estimate due to use of estimated quantities. Specifically,

: The Upper Confidence Bound estimate of the mean of expert that includes the IS estimate, the total bias due to clipping and estimation.
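The clipped importance-sampling estimator summarized by the entries above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact construction: the two expert distributions, the Bernoulli reward model, and the rule of zeroing out IS ratios above the clipping level are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical experts over 3 actions (conditional on a fixed context).
p_target = np.array([0.5, 0.3, 0.2])   # expert whose mean we want to estimate
p_logged = np.array([0.2, 0.3, 0.5])   # expert that was actually deployed

def clipped_is_estimate(actions, rewards, p_target, p_logged, clip):
    """Clipped IS estimate of the target expert's mean reward.

    Ratios above `clip` are zeroed out; this trades a small bias
    (the clipping-bias term above) for lower variance.
    """
    ratios = p_target[actions] / p_logged[actions]
    ratios = np.where(ratios <= clip, ratios, 0.0)  # the "clipper"
    return float(np.mean(ratios * rewards))

# Simulate logged data from the deployed expert with Bernoulli rewards.
mean_reward = np.array([0.9, 0.5, 0.1])            # per-action mean reward
actions = rng.choice(3, size=20000, p=p_logged)
rewards = rng.binomial(1, mean_reward[actions])

est = clipped_is_estimate(actions, rewards, p_target, p_logged, clip=3.0)
true_mean = float(p_target @ mean_reward)          # 0.45 + 0.15 + 0.02 = 0.62
```

Here every IS ratio (2.5, 1.0, 0.4) lies below the clipping level, so the estimator is unbiased and `est` concentrates around `true_mean`; lowering `clip` below 2.5 would introduce the clipping bias the notation above accounts for.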
Notations in Section 3.1:

: The true divergence between experts given by

: The upper bound on the true maximum divergence between any two experts , . A trivial upper bound is used, given by . We also set in the definition of for each expert .

: The vector of suboptimality gaps, where . We assume here that without loss of generality, experts are arranged in decreasing order of means: .

: The first time at which all clippers are active for all experts, i.e.,

: The first time at which one can conclude with high probability that the best expert is never underestimated. We define
For any with probability at least ,

: The first time at which one can conclude with high probability that the best expert is never overestimated. We define
For any , with probability at least , .

: The regret suffered by the agent by time , .
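As a concrete illustration of this regret bookkeeping (the numbers and names below are assumptions for the example, not the paper's notation): the regret by time t accumulates, over the rounds played so far, the gap between the best expert's mean and the mean of the expert actually picked.

```python
# Per-episode means of 3 hypothetical experts and a played sequence of picks.
means = [0.9, 0.7, 0.5]          # expert means in the current episode
best = max(means)                # mean of the best expert for this episode
picked = [2, 1, 1, 0, 0, 0]      # indices of experts chosen at t = 1..6

# Regret by time t: cumulative sum of the suboptimality gaps of the picks.
regret = [0.0]
for k in picked:
    regret.append(regret[-1] + (best - means[k]))

# After the six rounds above, total regret is 0.4 + 0.2 + 0.2 + 0 + 0 + 0 = 0.8.
```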
Appendix B Some useful concentrations
We start with some useful concentration inequalities. First, we consider a modified result from [28]:
Lemma 8.
Let be a probability vector with points of support. Let be an empirical estimate of using i.i.d. draws. Then, for any and , it holds that
Proof: The result follows from that of [28] as for any vector .
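A small simulation can sanity-check a bound of this type. We assume here the standard form of the L1 deviation bound from [28], Pr(||p̂ − p||₁ ≥ ε) ≤ (2^k − 2)e^{−nε²/2}, so that setting ε = √((2/n) log((2^k − 2)/δ)) makes the failure probability at most δ; all constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

k, n, delta = 4, 500, 0.1
p = np.full(k, 1.0 / k)                  # true distribution (uniform over k points)
# Inverting the assumed Weissman et al. tail bound at failure probability delta:
eps = np.sqrt((2.0 / n) * np.log((2**k - 2) / delta))

violations = 0
trials = 200
for _ in range(trials):
    counts = rng.multinomial(n, p)       # n i.i.d. draws from p
    p_hat = counts / n                   # empirical estimate
    # The lemma also controls the sup-norm, since ||.||_inf <= ||.||_1.
    if np.abs(p_hat - p).sum() >= eps:
        violations += 1

rate = violations / trials               # empirical failure rate, at most delta w.h.p.
```

In practice the bound is loose, so the observed violation rate sits far below δ.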
The next lemma is adapted from Lemma 5.1 in [8]. It provides confidence bounds for ratios of random variables.
Lemma 9.
Suppose are probability vectors that share the same support with at least mass on each support point. Let be their respective empirical estimates such that . Call, for each , and . With probability at least , it holds that
Proof: The proof follows from Lemma 5.1 in [8]. To ease notation, we fix an arbitrary and denote , for . Under the event that for , we have
Similarly, we have that
Since the choice of was arbitrary and the event holds with probability at least , the result follows.
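The ratio bounds of Lemma 9 can be illustrated numerically. The construction below assumes the bounds take the natural form (p̂ᵢ − ε)/(q̂ᵢ + ε) ≤ pᵢ/qᵢ ≤ (p̂ᵢ + ε)/(q̂ᵢ − ε), valid on the event that both empirical estimates are within ε of the truth in each coordinate; the distributions and the deviation level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.5, 0.3, 0.2])    # two distributions with shared support
q = np.array([0.4, 0.4, 0.2])    # and at least 0.2 mass on each point

n = 5000
p_hat = rng.multinomial(n, p) / n
q_hat = rng.multinomial(n, q) / n

eps = 0.05                       # assumed coordinate-wise deviation level
# Check we are on the "good event" where both estimates are eps-accurate.
assert np.all(np.abs(p_hat - p) <= eps) and np.all(np.abs(q_hat - q) <= eps)

# Interval for each ratio p_i / q_i, valid on the good event.
lower = (p_hat - eps) / (q_hat + eps)
upper = (p_hat + eps) / (q_hat - eps)
true_ratio = p / q
```

On the good event the interval is guaranteed to contain the true ratio, since the numerator is shrunk and the denominator inflated (and vice versa for the upper bound).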
Appendix C Proofs of results in Section 3.1
To prove the concentration result in Theorem 1, we first consider a simpler setting of two arms with deterministic samples. We refer the reader to Appendix A for definitions and descriptions of the quantities used here.
C.1 A simpler case of two arms
We begin by considering two arms, . For simplicity, we assume for this section (in the general case, all the statements hold with probability at least ). We are given access to samples from arm and seek to estimate the mean of arm using the approximate experts . The arguments in this section closely follow the analysis of the two-armed estimator in the full-information case in [24]. We start with the following assumption:
Assumption 4.
is such that
We define
Note that by Lemma 9, we have that
For an arbitrarily chosen , we write
(4) 
Then, we have the following claims:
Claim 1.
With as above, for any