Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability

March 28, 2020 · David Simchi-Levi, et al. (MIT)

We consider the general (stochastic) contextual bandit problem under the realizability assumption, i.e., the expected reward, as a function of contexts and actions, belongs to a general function class F. We design a fast and simple algorithm that achieves the statistically optimal regret with only O(log T) calls to an offline least-squares regression oracle across all T rounds (the number of oracle calls can be further reduced to O(loglog T) if T is known in advance). Our algorithm provides the first universal and optimal reduction from contextual bandits to offline regression, solving an important open problem for the realizable setting of contextual bandits. Our algorithm is also the first provably optimal contextual bandit algorithm with a logarithmic number of oracle calls.


1 Introduction

The contextual bandit problem is a fundamental framework for online decision making and interactive machine learning, with diverse applications ranging from healthcare to electronic commerce; see a NIPS 2013 tutorial (https://hunch.net/jl/interact.pdf) for the theoretical background and a recent ICML 2017 tutorial (https://hunch.net/rwil/) for further illustrations of its practical importance.

Broadly speaking, approaches to contextual bandits fall into two groups (see Foster et al. 2018): realizability-based approaches, which rely on weak or strong assumptions on the model representation, and agnostic approaches, which are completely model-free. While many different contextual bandit algorithms (realizability-based or agnostic) have been proposed over the past twenty years, most of them suffer from either theoretical or practical issues (see Bietti et al. 2018). Existing realizability-based algorithms built on upper confidence bounds (e.g., Filippi et al. 2010, Abbasi-Yadkori et al. 2011, Chu et al. 2011, Li et al. 2017) and Thompson sampling (e.g., Agrawal and Goyal 2013, Russo et al. 2018) rely on strong assumptions on the model representation and are only tractable for specific parametrized families of models such as generalized linear models. Meanwhile, agnostic algorithms that make no assumption on the model representation (e.g., Dudik et al. 2011, Agarwal et al. 2014) may lead to overly conservative exploration in practice (Bietti et al. 2018), and their reliance on an offline cost-sensitive classification oracle as a subroutine typically causes implementation difficulties, as the oracle itself is computationally intractable in general. At this moment, designing a provably optimal contextual bandit algorithm that is applicable to large-scale real-world deployments is still widely deemed a very challenging task (see Agarwal et al. 2016, Foster and Rakhlin 2020).

Recently, Foster et al. (2018) propose an approach to solving contextual bandits with general model representations (i.e., general function classes) using an offline regression oracle — an oracle that can typically be implemented efficiently and is widely available for numerous function classes due to its core role in modern machine learning. In particular, the (weighted) least-squares regression oracle assumed in the algorithm of Foster et al. (2018) is highly practical, as it has a strongly convex loss function and is amenable to gradient-based methods. As Foster et al. (2018) point out, designing offline-regression-oracle-based algorithms is a promising direction for making contextual bandits practical, as such algorithms seem to combine the advantages of both realizability-based and agnostic algorithms: they are general and flexible enough to work with any given function class, while using a more realistic and reasonable oracle than the computationally expensive classification oracle. Indeed, according to multiple experiments and extensive empirical evaluations conducted by Bietti et al. (2018) and Foster et al. (2018), the algorithm of Foster et al. (2018) "works the best overall" compared with almost all existing approaches.

Despite its empirical success, the algorithm of Foster et al. (2018) is, however, theoretically sub-optimal — it could incur $\Omega(T)$ regret in the worst case. Whether the optimal regret of contextual bandits can be attained via an offline-regression-oracle-based algorithm is listed as an open problem in Foster et al. (2018). In fact, this problem has been open to the bandit community since 2012 — it dates back to Agarwal et al. (2012), where the authors propose a computationally inefficient contextual bandit algorithm that achieves the optimal regret for a general finite function class $\mathcal{F}$, but leave designing computationally tractable algorithms as an open problem.

More recently, Foster and Rakhlin (2020) propose an algorithm that achieves the optimal regret for contextual bandits using an online regression oracle (which is not an offline optimization oracle and has to work with an adaptive adversary). Their finding that contextual bandits can be completely reduced to online regression is novel and important, and their result is also very general: it requires only the minimal realizability assumption, works with possibly nonparametric function classes, and holds true even when the contexts are chosen adversarially. However, the online regression oracle that they appeal to is still much stronger than an offline regression oracle, and to our knowledge computationally efficient algorithms for online regression are only known for specific function classes. Whether the optimal regret of contextual bandits can be attained via a reduction to an offline regression oracle is listed as an open problem again in Foster and Rakhlin (2020).

In this paper, we give an affirmative answer to the above open problem repeatedly mentioned in the literature (Agarwal et al. 2012, Foster et al. 2018, Foster and Rakhlin 2020). Specifically, we provide the first optimal black-box reduction from contextual bandits to offline regression, with only the minimal realizability assumption. The significance of this result is that it reduces contextual bandits, a prominent online decision-making problem, to offline regression, a very basic and common offline optimization task that serves as a building block of modern machine learning. A direct consequence of this result is that any advances in solving offline regression problems immediately translate to contextual bandits, statistically and computationally. Note that such an online-to-offline reduction is highly nontrivial (and impossible without specialized structures) for online learning problems in general (Hazan and Koren 2016).

Our reduction is accomplished by a surprisingly fast and simple algorithm that achieves the optimal regret for a general finite function class $\mathcal{F}$ with only $O(\log T)$ calls to an offline least-squares regression oracle over $T$ rounds (the number of oracle calls can be further reduced to $O(\log\log T)$ if $T$ is known) — notably, this can be understood as a "triply exponential" speedup over previous work: (1) compared with the previously known regret-optimal algorithm of Agarwal et al. (2012) for this setting, which requires enumerating over $\mathcal{F}$ at each round, our algorithm accesses the function class only through an offline regression oracle, thus typically avoiding an exponential cost at each round; (2) compared with the sub-optimal algorithm of Foster et al. (2018), which requires polynomially many (in $T$) oracle calls for non-convex $\mathcal{F}$, and the classification-oracle-based algorithm of Agarwal et al. (2014), which requires polynomially many calls to a computationally expensive classification oracle, our algorithm requires only $O(\log T)$ calls to a simple regression oracle, which implies an exponential speedup over all existing provably optimal oracle-efficient algorithms, even when we ignore the difference between regression and classification oracles; (3) when the number of rounds $T$ is known in advance, our algorithm can further reduce the number of oracle calls to $O(\log\log T)$, which is an exponential speedup by itself. Our algorithm is thus highly computationally efficient.

The statistical analysis of our algorithm is also quite interesting. Unlike existing analyses of other realizability-based algorithms in the literature, we do not directly analyze the decision outcomes of our algorithm — instead, we find a dual interpretation of our algorithm as sequentially maintaining a dense distribution over all (possibly improper) policies, where a policy is defined as a deterministic decision function mapping contexts to actions. We analyze how the realizability assumption enables us to establish uniform-convergence-type results for some implicit quantities in the universal policy space, regardless of the huge capacity of the universal policy space. Note that while the dual interpretation itself is not easy to compute in the universal policy space, it is only applied for the purpose of analysis and has nothing to do with our original algorithm's implementation. Through this lens, we find that our algorithm's dual interpretation satisfies a series of sufficient conditions for optimal contextual bandit learning. Our identified sufficient conditions for optimal contextual bandit learning in the universal policy space are built on the previous work of Dudik et al. (2011), Agarwal et al. (2012) and Agarwal et al. (2014) — the first one is colloquially referred to as the "monster paper" by its authors due to its complexity, and the third one is titled "taming the monster" by its authors due to its improved computational efficiency. Since our algorithm achieves all the conditions required for regret optimality in the universal policy space in a completely implicit way (which means that all the requirements are automatically satisfied without explicit computation), our algorithm comes with significantly reduced computational cost compared with previous work (thanks to the realizability assumption), and we thus title our paper "bypassing the monster". Overall, our algorithm is fast, simple, memory-efficient, and has the potential to be implemented at large scale. We will go over the details in the rest of this article.

1.1 Learning Setting

The stochastic contextual bandit problem can be stated as follows. Let $\mathcal{A}$ be a finite set of $K$ actions and $\mathcal{X}$ be an arbitrary space of contexts (e.g., a feature space). The interaction between the learner and nature happens over $T$ rounds, where $T$ is possibly unknown. At each round $t$, nature samples a context $x_t \in \mathcal{X}$ and a (context-dependent) reward vector $r_t \in [0,1]^{\mathcal{A}}$ i.i.d. according to a fixed but unknown distribution $\mathcal{D}$, with component $r_t(a)$ denoting the reward for action $a \in \mathcal{A}$; the learner observes $x_t$, picks an action $a_t \in \mathcal{A}$, and observes the reward $r_t(a_t)$ for her action. Depending on whether there is an assumption about nature's reward model, prior literature studies the contextual bandit problem in two different but closely related settings.

Agnostic setting. Let $\Pi$ be a class of policies (i.e., decision functions) that map contexts $x \in \mathcal{X}$ to actions $a \in \mathcal{A}$, and let $\pi^\star = \arg\max_{\pi\in\Pi}\mathbb{E}_{(x,r)\sim\mathcal{D}}[r(\pi(x))]$ be the optimal policy in $\Pi$ that maximizes the expected reward. The learner's goal is to compete with the (in-class) optimal policy $\pi^\star$ and minimize her (empirical cumulative) regret after $T$ rounds, which is defined as
$$\sum_{t=1}^{T}\big(r_t(\pi^\star(x_t)) - r_t(a_t)\big).$$
The above setting is called agnostic in the sense that it imposes no assumption on nature.

Realizable setting. Let $\mathcal{F}$ be a class of predictors (i.e., reward functions), where each predictor $f \in \mathcal{F}$ is a function $f: \mathcal{X}\times\mathcal{A} \to [0,1]$ describing a potential reward model. The standard realizability assumption is as follows.

[Realizability] There exists a predictor $f^\star \in \mathcal{F}$ such that $f^\star(x,a) = \mathbb{E}[r(a) \mid x]$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$.

Given a predictor $f \in \mathcal{F}$, the associated reward-maximizing policy $\pi_f$ always picks the action with the highest predicted reward, i.e., $\pi_f(x) = \arg\max_{a\in\mathcal{A}} f(x,a)$. The learner's goal is to compete with the "ground truth" optimal policy $\pi_{f^\star}$ and minimize her (empirical cumulative) regret after $T$ rounds, which is defined as
$$\sum_{t=1}^{T}\big(r_t(\pi_{f^\star}(x_t)) - r_t(a_t)\big).$$
The above setting is called realizable in the sense that it assumes that nature can be well specified by a predictor in $\mathcal{F}$.

We make some remarks on the above two settings from a pure modeling perspective. First, the agnostic setting does not require realizability and is more general than the realizable setting. Indeed, given any function class $\mathcal{F}$, one can construct an induced policy class $\Pi_{\mathcal{F}} = \{\pi_f : f \in \mathcal{F}\}$, thus any realizable contextual bandit problem can be reduced to an agnostic contextual bandit problem (an illustration is sketched below). Second, the realizable setting has its own merit, as the additional realizability assumption enables a stronger performance guarantee: once the realizability assumption holds, the learner's competing policy $\pi_{f^\star}$ is guaranteed to be the "ground truth" (i.e., no policy can be better than $\pi_{f^\star}$), thus small regret necessarily means large total reward. By contrast, in the no-realizability agnostic setting, the "optimal policy in $\Pi$" is not necessarily an effective policy if there are significantly more effective policies outside of $\Pi$. More comparisons between the two settings regarding theoretical tractability, computational efficiency and practical implementability will be provided in §1.2.
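The following sketch (in Python, with a toy, hypothetical predictor class) illustrates the induced-policy-class construction $\Pi_{\mathcal{F}} = \{\pi_f : f\in\mathcal{F}\}$: each predictor induces the policy that picks the action with the highest predicted reward.

```python
# Sketch: building the induced policy class Pi_F from a (hypothetical) finite predictor class F.
contexts = ["x1", "x2"]
actions = ["a1", "a2", "a3"]

f1 = {("x1", "a1"): 0.9, ("x1", "a2"): 0.1, ("x1", "a3"): 0.5,
      ("x2", "a1"): 0.2, ("x2", "a2"): 0.8, ("x2", "a3"): 0.3}
f2 = {("x1", "a1"): 0.4, ("x1", "a2"): 0.6, ("x1", "a3"): 0.2,
      ("x2", "a1"): 0.7, ("x2", "a2"): 0.1, ("x2", "a3"): 0.9}
F = [f1, f2]

def induced_policy(f):
    """pi_f(x) = argmax_a f(x, a): the reward-maximizing policy induced by predictor f."""
    return {x: max(actions, key=lambda a: f[(x, a)]) for x in contexts}

Pi_F = [induced_policy(f) for f in F]
print(Pi_F)  # [{'x1': 'a1', 'x2': 'a2'}, {'x1': 'a2', 'x2': 'a3'}]
```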

1.2 Related Work

Contextual bandits have been extensively studied for nearly twenty years, see Chapter 5 of Lattimore and Szepesvári (2018) and Chapter 8 of Slivkins (2019) for detailed surveys. Here we mention some important and closely related work.

1.2.1 Agnostic Approaches

Papers studying contextual bandits in the agnostic setting aim to design general-purpose and computationally tractable algorithms that are provably efficient for any given policy class $\Pi$, while avoiding the computational complexity of enumerating over $\Pi$ (as the size of $\Pi$ is usually extremely large). The primary focus of prior literature is on the case of a general finite $\Pi$, as this is the starting point for further studies of infinite (parametric or nonparametric) $\Pi$. For this case, the EXP4-family algorithms (Auer et al. 2002, McMahan and Streeter 2009, Beygelzimer et al. 2011) achieve the optimal regret but require $\Omega(|\Pi|)$ running time at each round, which makes them intractable for large $\Pi$. In order to circumvent this running time barrier, researchers (e.g., Langford and Zhang 2008, Dudik et al. 2011, Agarwal et al. 2014) restrict their attention to oracle-based algorithms that access the policy space only through an offline optimization oracle — specifically, an offline cost-sensitive classification oracle that solves

$$\arg\max_{\pi\in\Pi}\ \sum_{t=1}^{n} r_t(\pi(x_t)) \qquad (1)$$

for any given sequence of context and reward vectors $(x_1,r_1),\dots,(x_n,r_n)$. An oracle-efficient algorithm refers to an algorithm whose number of oracle calls is polynomial in $T$ over $T$ rounds.
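For intuition only, a brute-force rendering of the cost-sensitive classification oracle (1) over an explicitly enumerated finite policy class might look as follows (the function and variable names are ours); the point of oracle-based algorithms is precisely to treat this maximization as a black box rather than enumerate $\Pi$.

```python
# Sketch: cost-sensitive classification oracle (1), by brute force over a finite policy class.
# `policies` is a list of callables mapping context -> action; `data` is a list of
# (context, reward_vector) pairs, where reward_vector is a dict action -> reward.
def classification_oracle(policies, data):
    # argmax over pi in Pi of sum_t r_t(pi(x_t))
    return max(policies, key=lambda pi: sum(r[pi(x)] for x, r in data))

# Hypothetical usage with two constant policies over actions "a1" and "a2":
data = [("x1", {"a1": 0.2, "a2": 0.9}), ("x2", {"a1": 0.7, "a2": 0.1})]
policies = [lambda x: "a1", lambda x: "a2"]
best = classification_oracle(policies, data)
print(best("x1"))  # "a2": the second policy collects total reward 1.0 vs. 0.9
```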

The first provably optimal oracle-efficient algorithm is the Randomized UCB algorithm of Dudik et al. (2011), which achieves the optimal regret with a number of calls to the cost-sensitive classification oracle that is polynomial in $T$. A breakthrough is achieved by the ILOVETOCONBANDITS algorithm in the celebrated work of Agarwal et al. (2014), where the number of oracle calls is significantly reduced to $\widetilde{O}(\sqrt{KT/\log|\Pi|})$. The above results are fascinating in theory because they enable an "online-to-offline reduction" from contextual bandits to cost-sensitive classification, which is highly non-trivial for online learning problems in general (Hazan and Koren 2016). However, the practicality of the above algorithms is heavily restricted by their reliance on the cost-sensitive classification oracle (1), as this task is computationally intractable even for simple policy classes (Klivans and Sherstov 2009) and typically involves solving NP-hard problems. As a result, practical implementations of the above classification-oracle-based algorithms typically resort to heuristics (Agarwal et al. 2014, Foster et al. 2018, Bietti et al. 2018). Moreover, the above algorithms are memory hungry: since they must feed augmented versions of the dataset (rather than the original dataset) into the oracle, they have to repeatedly create auxiliary data and store them in memory. We refer to Foster et al. (2018) and Foster and Rakhlin (2020) for more detailed descriptions of the drawbacks of these approaches in terms of computational efficiency and practical implementability.

1.2.2 Realizibility-based Approaches

In contrast to the agnostic setting, where research primarily focuses on designing general-purpose algorithms that work for any given $\Pi$, a majority of research in the realizable setting tends to design specialized algorithms that work well for a particular parametrized family of $\mathcal{F}$. Two of the dominant strategies for the realizable setting are upper confidence bounds (e.g., Filippi et al. 2010, Abbasi-Yadkori et al. 2011, Chu et al. 2011, Li et al. 2017, 2019) and Thompson sampling (e.g., Agrawal and Goyal 2013, Russo et al. 2018). While these approaches have seen practical success in several scenarios (Li et al. 2010), their theoretical guarantees and computational tractability critically rely on strong assumptions on $\mathcal{F}$, which restrict their usage in other scenarios (Bietti et al. 2018).

To our knowledge, Agarwal et al. (2012) is the first paper studying contextual bandits with a general finite $\mathcal{F}$, under the minimal realizability assumption. They propose an elimination-based algorithm, namely Regressor Elimination, that achieves the optimal regret. However, their algorithm is computationally inefficient, as it enumerates over the whole function class and requires $\Omega(|\mathcal{F}|)$ computational cost at each round (note that the size of $\mathcal{F}$ is typically extremely large). The computational issues of Agarwal et al. (2012) are resolved by Foster et al. (2018), who propose an oracle-efficient contextual bandit algorithm, RegCB, which always accesses the function class through a weighted least-squares regression oracle that solves

$$\arg\min_{f\in\mathcal{F}}\ \sum_{t=1}^{n} w_t\,\big(f(x_t,a_t)-y_t\big)^2 \qquad (2)$$

for any given input sequence $(w_1,x_1,a_1,y_1),\dots,(w_n,x_n,a_n,y_n)$. As Foster et al. (2018) mention, the above oracle can often be solved efficiently and is very common in machine learning practice, far more reasonable than the cost-sensitive classification oracle (1). However, unlike Regressor Elimination, the RegCB algorithm is not minimax optimal — its worst-case regret could be as large as $\Omega(T)$. Whether the optimal regret is attainable by an offline-regression-oracle-based algorithm remains unknown in the literature.

More recently, Foster and Rakhlin (2020) propose an algorithm that achieves the optimal regret for contextual bandits using an online regression oracle. Their algorithm, namely SquareCB, is built on the A/BW algorithm of Abe and Long (1999) (see also the journal version Abe et al. 2003) originally developed for linear contextual bandits — specifically, SquareCB replaces the "Widrow-Hoff predictor" used in the A/BW algorithm by a general online regression predictor, then follows the same probabilistic action selection strategy as the A/BW algorithm. Foster and Rakhlin (2020) show that by using this simple strategy, contextual bandits can be (surprisingly) reduced to online regression in a black-box manner. While the implication that contextual bandits are no harder than online regression is important and insightful, online regression with a general function class is itself a challenging problem. Note that an online regression oracle is not an offline optimization oracle, which means that algorithms for implementing this oracle are not immediate and have to be designed on a case-by-case basis — while there is a beautiful theory characterizing the minimax regret rate of online regression with general function classes (Rakhlin and Sridharan 2014), to our knowledge computationally efficient algorithms are only known for specific function classes. For the case of a general finite $\mathcal{F}$, the generic algorithm of Rakhlin and Sridharan (2014) requires computational cost linear in $|\mathcal{F}|$ at each round. As a result, in parallel with the excellent work of Foster and Rakhlin (2020), a more thorough "online-to-offline reduction" from contextual bandits to offline regression is highly desirable.

1.2.3 Empirical Evaluation and Summary

Recently, Bietti et al. (2018) and Foster et al. (2018) conduct extensive empirical evaluations of different approaches to contextual bandits. The experimental results show that offline-regression-oracle-based algorithms like RegCB typically outperform other algorithms (including classification-oracle-based algorithms like ILOVETOCONBANDITS) across multiple datasets, statistically and computationally. Given the empirical success of RegCB, a huge gap between the theory and practice of contextual bandits is, however, that a provably optimal offline-regression-oracle-based algorithm is still unknown. This is the major motivation of our study, and we hope that our work can contribute to closing this gap.

1.3 Research Question

In this paper, we study the following open question which is repeatedly mentioned in the contextual bandit literature (Agarwal et al. 2012, Foster et al. 2018, Foster and Rakhlin 2020): Is there an offline-regression-oracle-based algorithm that achieves the optimal regret for contextual bandits?

Similar to Dudík et al. (2011), Agarwal et al. (2012, 2014), we mainly focus on the case of a general finite $\mathcal{F}$, as this is the starting point for further studies of infinite (parametric or nonparametric) $\mathcal{F}$. For this case, the gold standard is an algorithm that achieves the optimal $\widetilde{O}(\sqrt{KT\log|\mathcal{F}|})$ regret with the total number of oracle calls being polynomial (or sublinear) in $T$ (this is what is asked by Agarwal et al. 2012, Foster et al. 2018). As for the optimization oracle, we assume access to the following (unweighted) least-squares regression oracle that solves

$$\arg\min_{f\in\mathcal{F}}\ \sum_{t=1}^{n}\big(f(x_t,a_t)-y_t\big)^2 \qquad (3)$$

for any input sequence $(x_1,a_1,y_1),\dots,(x_n,a_n,y_n)$. Without loss of generality, we assume that the oracle (3) always returns the same solution for two input sequences that are completely the same.111 Note that the above least-squares regression oracle is even simpler than the weighted one (2) assumed in Foster et al. (2018), as it does not need to consider weights.

111 This is just for ease of presentation. If there is some (unknown) internal randomness inside the oracle when there are multiple optimal solutions to (3), then we can simply incorporate such randomness into the sigma-field generated by the history, and all our proofs still hold.
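As a minimal sketch (not the implementation used in practice), the unweighted least-squares oracle (3) over a finite class can be written by direct enumeration; for parametric classes one would instead run standard gradient-based regression.

```python
# Sketch: offline least-squares regression oracle (3) over a finite predictor class.
# Each f maps (context, action) -> predicted reward; `data` is a list of
# (context, action, observed_reward) triples.
def regression_oracle(F, data):
    # argmin over f in F of the empirical square loss sum_t (f(x_t, a_t) - y_t)^2
    return min(F, key=lambda f: sum((f(x, a) - y) ** 2 for x, a, y in data))

# Hypothetical usage with two predictors on a single (context, action) pair:
F = [lambda x, a: 0.3, lambda x, a: 0.8]
f_hat = regression_oracle(F, [("x1", "a1", 0.9), ("x1", "a1", 0.7)])
print(f_hat("x1", "a1"))  # 0.8, the better fit to the observed rewards
```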

1.4 Main Results

We give an affirmative answer to the above question by providing the first optimal black-box reduction from contextual bandits to offline regression, with only the minimal realizability assumption. As mentioned before, a direct consequence of this result is that (stochastic) contextual bandits become no harder than offline regression: any advances in solving offline regression problems immediately translate to contextual bandits, statistically and computationally.

Moreover (and quite surprisingly), we go far beyond the conventional "polynomial/sublinear oracle calls" criterion of computational efficiency: we propose an algorithm achieving the optimal regret using only $O(\log T)$ calls to the regression oracle (the number of oracle calls can be further reduced to $O(\log\log T)$ if $T$ is known). As mentioned before, this can be understood as a "triply exponential" speedup over existing algorithms. Overall, our algorithm is fast, simple, memory-efficient, and has the potential to be implemented at large scale. We compare our algorithm's properties with existing (general-purpose) contextual bandit algorithms in Table 1.222

222 While we focus on stochastic contextual bandits in the realizable setting, we would like to point out that Agarwal et al. (2014) and Foster and Rakhlin (2020) have their own merits outside of this setting. The algorithm of Agarwal et al. (2014) works when there is no realizability assumption. The algorithm of Foster and Rakhlin (2020) works when the contexts are chosen adversarially.

Algorithm | Statistical optimality | Computational complexity
Regressor Elimination (Agarwal et al. 2012) | optimal | intractable (enumeration over $\mathcal{F}$ at each round)
ILOVETOCONBANDITS (Agarwal et al. 2014) | optimal | $\widetilde{O}(\sqrt{KT/\log|\Pi|})$ calls to an offline classification oracle
RegCB (Foster et al. 2018) | suboptimal | poly$(T)$ calls to an offline regression oracle
SquareCB (Foster and Rakhlin 2020) | optimal | $O(T)$ calls to an online regression oracle
FALCON (this paper) | optimal | $O(\log T)$ or $O(\log\log T)$ calls to an offline regression oracle

Table 1: Algorithms' performance with a general finite $\mathcal{F}$. Advantages are marked in bold.

Our approach is closely related to (and reveals connections between) three lines of research on contextual bandits spanning twenty years: (1) a celebrated theory of optimal contextual bandit learning in the agnostic setting using a (seemingly unavoidable) classification oracle, represented by Dudik et al. (2011) (the "monster paper") and Agarwal et al. (2014) ("taming the monster"); (2) a simple probabilistic selection strategy mapping the predicted rewards of actions to the probabilities of actions, pioneered by Abe and Long (1999) (see also Abe et al. 2003) and followed up by Foster and Rakhlin (2020); and (3) some technical preliminaries developed in the early work of Agarwal et al. (2012). In particular, we rethink the philosophy behind Dudik et al. (2011) and Agarwal et al. (2014), reform it with our own understanding of the value of realizability, and arrive at a new idea of "bypassing" the classification oracle under realizability — our algorithm is essentially a direct consequence of this new idea; see the derivation of our algorithm in §3.6. Interestingly, our derived algorithm turns out to use a probabilistic selection strategy similar to (though different from) those of Abe and Long (1999) and Foster and Rakhlin (2020) — this is somewhat surprising, as the idea behind the derivation of our algorithm is very different from the ideas behind Abe and Long (1999) and Foster and Rakhlin (2020). This suggests that such simple probabilistic selection strategies might be more intriguing and more essential for bandits than commonly thought, and we believe that they are worth further attention from the bandit community.

As a final remark, we emphasize that, compared with each line of research mentioned above, our approach makes new contributions beyond them, and these contributions appear necessary for our arguments to hold. We will elaborate on such new contributions in §2 and §3.

1.5 Organization and Notations

The rest of the paper is organized as follows. In §2, we introduce our algorithm and state its properties as well as theoretical guarantees. In §3, we present our statistical analysis and explain the idea behind our algorithm. We conclude our paper in §4. All the proofs of our results are deferred to the appendix.

Throughout the paper, we use $O(\cdot)$ to hide constant factors and $\widetilde{O}(\cdot)$ to hide logarithmic factors. Given $\mathcal{D}$, let $\mathcal{D}_{\mathcal{X}}$ denote its marginal distribution over contexts $\mathcal{X}$. We use $\sigma(\cdot)$ to denote the $\sigma$-algebra generated by a random variable, and use $2^{S}$ to denote the power set of a discrete set $S$. We use $\mathbb{N}$ to denote the set of all positive integers, and $\mathbb{R}_{+}$ to denote the set of all non-negative real numbers. Without loss of generality, we assume that .

2 The Algorithm

We present our algorithm, "FAst Least-squares-regression-oracle CONtextual bandits" (FALCON), in Algorithm 1.

Input: epoch schedule $0 = \tau_0 < \tau_1 < \tau_2 < \cdots$, confidence parameter $\delta \in (0,1)$.

1:  for epoch $m = 1, 2, \dots$  do
2:     Let $\gamma_m = \sqrt{K\tau_{m-1}/\big(c\,\log(|\mathcal{F}|\tau_{m-1}/\delta)\big)}$ for a suitable absolute constant $c$ (for epoch 1, $\gamma_1 = 1$).
3:     Compute $\widehat{f}_m = \arg\min_{f\in\mathcal{F}} \sum_{t=1}^{\tau_{m-1}} \big(f(x_t,a_t) - r_t(a_t)\big)^2$ via the offline regression oracle.
4:     for round $t = \tau_{m-1}+1, \dots, \tau_m$  do
5:        Observe context $x_t$.
6:        Compute $\widehat{f}_m(x_t,a)$ for each action $a\in\mathcal{A}$. Let $\widehat{a}_t = \arg\max_{a\in\mathcal{A}} \widehat{f}_m(x_t,a)$. Define
              $p_t(a) = \dfrac{1}{K + \gamma_m\big(\widehat{f}_m(x_t,\widehat{a}_t) - \widehat{f}_m(x_t,a)\big)}$ for all $a \neq \widehat{a}_t$, and $p_t(\widehat{a}_t) = 1 - \sum_{a\neq\widehat{a}_t} p_t(a)$.
7:        Sample $a_t \sim p_t(\cdot)$ and observe reward $r_t(a_t)$.
8:     end for
9:  end for
Algorithm 1 FAst Least-squares-regression-oracle CONtextual bandits (FALCON)

Our algorithm runs on a doubling epoch schedule to reduce oracle calls, i.e., it only calls the oracle at certain pre-specified rounds $\tau_1 < \tau_2 < \tau_3 < \cdots$. For $m \in \mathbb{N}$, we refer to the rounds from $\tau_{m-1}+1$ to $\tau_m$ as epoch $m$. While all the results in our paper hold for general epoch schedules (under mild conditions on the growth of $\tau_m$), for simplicity we assume that $\tau_m = 2^m$ for all $m \in \mathbb{N}$ and $\tau_0 = 0$. With this schedule, for any (possibly unknown) $T$, our algorithm runs in $O(\log T)$ epochs.

At the start of each epoch $m$, our algorithm makes two updates. First, it updates an (epoch-varying) learning rate $\gamma_m$, which aims to strike a balance between exploration and exploitation. Second, it computes a "greedy" predictor $\widehat{f}_m$ from $\mathcal{F}$ that minimizes the empirical square loss $\sum_{t=1}^{\tau_{m-1}}\big(f(x_t,a_t)-r_t(a_t)\big)^2$. This predictor can be computed via a single call to the offline regression oracle — notably, this is almost the simplest way we can imagine for our oracle to be called, with no augmented data generated, no weights maintained, and no additional optimization problem constructed.

The decision rule in epoch $m$ is then completely determined by $\gamma_m$ and $\widehat{f}_m$. For each round $t$ in epoch $m$, given a context $x_t$, the algorithm uses $\widehat{f}_m$ to predict each action's reward and finds a greedy action $\widehat{a}_t$ that maximizes the predicted reward. Yet the algorithm does not directly select $\widehat{a}_t$ — instead, it randomizes over all actions according to a probabilistic selection strategy that picks each action other than $\widehat{a}_t$ with probability roughly inversely proportional to how much worse it is predicted to be compared with $\widehat{a}_t$, as well as roughly inversely proportional to the learning rate $\gamma_m$. The effects of this strategy are twofold. First, at each round, by assigning the greedy action the highest probability and each non-greedy action a probability roughly inverse to its predicted reward gap, we ensure that the better an action is predicted to be, the more likely it is to be selected. Second, across epochs, by keeping the probabilities of non-greedy actions roughly inverse to the gradually increasing learning rate $\gamma_m$, we ensure that the algorithm "explores more" in the beginning rounds where the learning rate is small, and gradually "exploits more" in later rounds where the learning rate becomes larger — this is why we view our learning rate as a sequential balancer between exploration and exploitation.
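A minimal sketch of this per-round selection rule is given below, using the inverse-gap-weighting formula from Algorithm 1 (each non-greedy action $a$ receives probability $1/\big(K+\gamma_m(\widehat{f}_m(x,\widehat{a}_t)-\widehat{f}_m(x,a))\big)$ and the greedy action absorbs the remaining mass); the function name and example values are ours, and the exact constants used by FALCON are omitted.

```python
import numpy as np

def falcon_action_distribution(predicted_rewards, gamma):
    """Inverse-gap-weighted distribution over actions (sketch of the selection rule above).

    predicted_rewards: array with f_hat_m(x, a) for every action a.
    gamma: the epoch-varying learning rate (larger gamma => greedier behavior).
    """
    K = len(predicted_rewards)
    greedy = int(np.argmax(predicted_rewards))
    gaps = predicted_rewards[greedy] - predicted_rewards  # gap is 0 for the greedy action
    p = 1.0 / (K + gamma * gaps)       # provisional probabilities for every action
    p[greedy] = 0.0
    p[greedy] = 1.0 - p.sum()          # greedy action gets all the remaining mass
    return p

# Example: three actions; a small gamma spreads probability (more exploration),
# a large gamma concentrates it on the greedy action (more exploitation).
print(falcon_action_distribution(np.array([0.9, 0.6, 0.2]), gamma=5.0))
print(falcon_action_distribution(np.array([0.9, 0.6, 0.2]), gamma=100.0))
```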

As mentioned before, the idea of mapping the predicted rewards of actions to the probabilities of actions via an "inversely proportional to the gap" rule is not new: a similar probabilistic selection strategy was first proposed by Abe and Long (1999) in their study of linear contextual bandits, and recently adopted by Foster and Rakhlin (2020) in their reduction from contextual bandits to online regression. The strategy that we use here goes beyond the previous strategies of Abe and Long (1999) and Foster and Rakhlin (2020) by incorporating some subtle yet non-trivial dynamics: while the above two papers adopt a constant learning rate that does not change during the running of their algorithms, we appeal to an epoch-varying (or time-varying) learning rate $\gamma_m$ that gradually increases as our algorithm proceeds. Seemingly a small component of the algorithm, this "rate changer" is a "game changer" and plays a fundamental role in our statistical analysis — while we cannot claim this is a must, the choice of an epoch-varying learning rate is necessary at least in our analysis approach, as the proof of our regret guarantee critically relies on an inductive argument which requires the learning rate to change carefully with respect to epochs and gradually increase over time; see §3.4.333

333 A more obvious (but less important) advantage of using a time-varying parameter (rather than a fixed parameter determined by $T$) is that the algorithm does not need to know $T$ in advance — since this is already well known in the literature (e.g., Langford and Zhang 2008), we do not emphasize it here.

Besides the epoch-varying probabilistic selection strategy, the way our algorithm generates predictions is also interesting and quite different from prior literature. Our algorithm makes predictions in a surprisingly simple and straightforward way: it always picks the greedy predictor $\widehat{f}_m$ and directly applies it to contexts without any modification — that is, in terms of making predictions, the algorithm is fully greedy. This is in sharp contrast to previous elimination-based algorithms (e.g., Dudík et al. 2011, Agarwal et al. 2012) and confidence-bound-based algorithms (e.g., Abbasi-Yadkori et al. 2011, Chu et al. 2011) ubiquitous in the bandit literature, which spend considerable effort and computational resources maintaining complex confidence intervals, version spaces, or distributions over predictors. Even when one compares our algorithm's prediction strategy with Abe and Long (1999) and Foster and Rakhlin (2020), which share some common features on how to select actions after predictions are made, one finds that neither of them trusts greedy predictors: Abe and Long (1999) appeal to the "Widrow-Hoff predictor" (an online linear predictor) and their analysis critically relies on the closed-form structure of this predictor; Foster and Rakhlin (2020) appeal to an online regression oracle and their analysis critically relies on the fact that this oracle can efficiently minimize regret against an adaptive adversary (as they mention, "all of the heavy lifting regarding generalization" should be taken care of by the online oracle). Seemingly counter-intuitive, we claim that making "naive" greedy predictions is sufficient for optimal contextual bandit learning. This suggests that a rigorous analysis of our algorithm should contain some new ideas beyond the existing bandit literature. Indeed, we will provide a quite interesting analysis of our algorithm in §3, which seems to be conceptually novel.

2.1 Statistical Optimality

Theorem 2.1. Consider an epoch schedule such that $\tau_m = 2^m$ for $m\in\mathbb{N}$ and $\tau_0 = 0$. For any $\delta \in (0,1)$, with probability at least $1-\delta$, the regret of the FALCON algorithm after $T$ rounds is at most $\widetilde{O}\big(\sqrt{KT\log(|\mathcal{F}|/\delta)}\big)$.

The proof is deferred to Appendix A. This upper bound matches the lower bound in Agarwal et al. (2012) up to logarithmic factors. The FALCON algorithm is thus statistically optimal. We will discuss more about the regret analysis of FALCON in §3.

2.2 Computational Efficiency

Consider the epoch schedule $\tau_m = 2^m$, $m \in \mathbb{N}$. For any possibly unknown $T$, our algorithm runs in $O(\log T)$ epochs, and in each epoch our algorithm calls the oracle only once. Therefore, our algorithm's computational complexity is $O(\log T)$ calls to a least-squares regression oracle, plus additional computational cost that is linear in the total number of rounds. This outperforms previous algorithms. Note that ILOVETOCONBANDITS requires $\widetilde{O}(\sqrt{KT/\log|\Pi|})$ calls to an offline cost-sensitive classification oracle, and SquareCB requires $O(T)$ calls to an online regression oracle — compared with our algorithm, both of them require exponentially more calls to a harder-to-implement oracle. Also, since a general finite $\mathcal{F}$ is not a convex function class, RegCB requires polynomially many (in $T$) calls to a weighted least-squares regression oracle in this setting — this is still exponentially slower than our algorithm.

When the total number of rounds $T$ is known to the learner, we can make the computational cost of FALCON even lower. For any known $T$, consider an epoch schedule used in Cesa-Bianchi et al. (2014): $\tau_m = \lceil T^{1-2^{-m}} \rceil$, $m \in \mathbb{N}$. Then FALCON runs in $O(\log\log T)$ epochs, calling the oracle only $O(\log\log T)$ times over $T$ rounds. In this case, we still have the same regret guarantee (up to logarithmic factors); see Corollary 2.2 below. The proof is at the end of Appendix A.

Corollary 2.2. For any known $T$ and any $\delta \in (0,1)$, consider the epoch schedule $\tau_m = \lceil T^{1-2^{-m}} \rceil$, $m \in \mathbb{N}$. Then with probability at least $1-\delta$, the regret of the FALCON algorithm after $T$ rounds is at most $\widetilde{O}\big(\sqrt{KT\log(|\mathcal{F}|/\delta)}\big)$.
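To make the oracle-call counts concrete, the sketch below compares the doubling schedule $\tau_m = 2^m$ (one oracle call per epoch, hence $O(\log T)$ calls) with a known-horizon schedule of the form $\tau_m = \lceil T^{1-2^{-m}}\rceil$ in the spirit of Cesa-Bianchi et al. (2014) ($O(\log\log T)$ calls); the exact schedule and constants in Corollary 2.2 may differ, so this is illustrative only.

```python
import math

def doubling_schedule(T):
    """Epoch end-points tau_m = 2^m (capped at T): one oracle call per epoch, O(log T) calls."""
    taus, m = [], 1
    while not taus or taus[-1] < T:
        taus.append(min(2 ** m, T))
        m += 1
    return taus

def known_horizon_schedule(T):
    """Epoch end-points tau_m = ceil(T^(1 - 2^-m)) for m = 1..ceil(log2 log2 T), then T.
    This yields O(log log T) oracle calls; constants are illustrative only."""
    M = math.ceil(math.log2(math.log2(T)))
    taus = [math.ceil(T ** (1.0 - 2.0 ** (-m))) for m in range(1, M + 1)]
    return taus + [T]

T = 10 ** 6
print(len(doubling_schedule(T)), "oracle calls with the doubling schedule")   # 20
print(len(known_horizon_schedule(T)), "oracle calls when T is known")         # 6
```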

3 Regret Analysis

In this section, we elaborate on how our simple algorithm achieves the optimal regret. We first analyze our algorithm (through an interesting dual interpretation) and provide in §3.1 to §3.4 a proof sketch of Theorem 2.1. Finally, in §3.5, we explain the key idea behind FALCON, and in §3.6, we show how this idea leads to FALCON.

Since some notations appearing in Algorithm 1 are shorthand and do not explicitly reveal the dependencies between different quantities (e.g., $\widehat{a}_t$ and $p_t(\cdot)$ should be written as a function and a conditional distribution explicitly depending on the random context $x_t$), we introduce some new notation that describes the decision generating process of Algorithm 1 in a more systematic way. For each epoch $m$, given the learning rate $\gamma_m$ and the greedy predictor $\widehat{f}_m$ (which are uniquely determined by the data from the first $m-1$ epochs), we can explicitly represent the algorithm's decision rule using $\gamma_m$ and $\widehat{f}_m$. In particular, for every context $x\in\mathcal{X}$ and action $a\in\mathcal{A}$, define

$$p_m(a \mid x) \;=\; \begin{cases} \dfrac{1}{K + \gamma_m\big(\widehat{f}_m(x,\widehat{a}) - \widehat{f}_m(x,a)\big)}, & a \neq \widehat{a},\\[2mm] 1 - \sum_{a'\neq \widehat{a}} p_m(a' \mid x), & a = \widehat{a}, \end{cases} \qquad \text{where } \widehat{a} = \arg\max_{a\in\mathcal{A}}\widehat{f}_m(x,a).$$

Then $p_m(\cdot \mid \cdot)$ is a well-defined probability kernel that completely characterizes the algorithm's decision rule in epoch $m$. Specifically, at each round $t$ in epoch $m$, the algorithm first observes a random context $x_t$, then samples its action $a_t$ according to the conditional distribution $p_m(\cdot \mid x_t)$. Therefore, we call $p_m$ the action selection kernel of epoch $m$. Note that $p_m$ depends on all the randomness up to round $\tau_{m-1}$ (including round $\tau_{m-1}$), which means that $p_m$ depends on $p_1,\dots,p_{m-1}$, and will affect $p_{m+1}, p_{m+2},\dots$ in later epochs.

3.1 A Tale of Two Processes

The conventional way of analyzing our algorithm's behavior at round $t$ in epoch $m$ is to study the following original process:

  1. Nature generates $x_t \sim \mathcal{D}_{\mathcal{X}}$.

  2. Algorithm samples $a_t \sim p_m(\cdot \mid x_t)$.

The above process is, however, difficult to analyze, because the algorithm's sampling procedure depends on the external randomness of nature. That is, the algorithm's probabilistic selection strategy among actions, as a conditional distribution $p_m(\cdot \mid x_t)$, depends on the random context $x_t$, and cannot be evaluated in advance before observing $x_t$.

A core idea of our analysis is to avoid thinking about the above process. Instead, we look at the following virtual process at round $t$ in epoch $m$:

  1. Algorithm samples $\pi \sim Q_m$, where $\pi$ is a policy, and $Q_m$ is a probability distribution over all policies in the universal policy space $\Psi$ (defined in §3.2).

  2. Nature generates $x_t \sim \mathcal{D}_{\mathcal{X}}$.

  3. Algorithm selects $a_t = \pi(x_t)$ deterministically.

The merit of the above process is that the algorithm's sampling procedure is independent of the external randomness of nature. While the algorithm still has to select an action based on the random context $x_t$ in step 3, this step is completely deterministic and easier to analyze. As a result, the algorithm's internal randomness all comes from a stationary distribution $Q_m$ which is already determined at the beginning of epoch $m$.

The second process is, however, a virtual process because it is not how our algorithm directly proceeds. An immediate question is whether we can always find a distribution over policies $Q_m$ such that our algorithm behaves exactly the same as the virtual process in epoch $m$.

Recall that the algorithm's decision rule in epoch $m$ is completely characterized by the action selection kernel $p_m(\cdot \mid \cdot)$. To answer the above question, we have to "translate" any possible probability kernel $p_m$ into an "equivalent" distribution over policies $Q_m$, such that we can study our algorithm's behavior through the virtual process. We complete this translation in §3.2.

3.2 From Kernel to Randomized Policy

We define the universal policy space as
$$\Psi \;=\; \mathcal{A}^{\mathcal{X}} \;=\; \{\pi \mid \pi: \mathcal{X} \to \mathcal{A}\},$$
which contains all possible policies. We consider a product probability measure $Q_m$ on $\Psi$ such that for all $\pi \in \Psi$, the coordinates $\{\pi(x)\}_{x\in\mathcal{X}}$ are independent and each $\pi(x)$ is distributed according to $p_m(\cdot \mid x)$. Of course, $\mathcal{X}$ can be an infinite set, and hence one may wonder whether such an infinite product of probability measures really exists. Fortunately, due to the structure of $\Psi$ and $p_m$, the existence of a unique product probability measure $Q_m$ is guaranteed by the Kolmogorov extension theorem. We give a proof in Lemma A.3 in Appendix A. The unique $Q_m$ that we find in Lemma A.3 satisfies that for every $x\in\mathcal{X}$ and $a\in\mathcal{A}$, we have

$$Q_m\big(\{\pi\in\Psi: \pi(x) = a\}\big) \;=\; p_m(a \mid x). \qquad (4)$$

That is, for any arbitrary context $x$, the algorithm's action generated by $p_m(\cdot\mid x)$ is probabilistically equivalent to the action generated by $Q_m$ through the virtual process in §3.1. Since $Q_m$ is a dense distribution over all deterministic policies in the universal policy space, we refer to $Q_m$ as the "equivalent randomized policy" induced by $p_m$. Through Lemma A.3 and equation (4), we establish a one-to-one mapping between any possible probability kernel $p_m$ and an equivalent randomized policy $Q_m$. Since $p_m$ is uniquely determined by $\gamma_m$ and $\widehat{f}_m$, we know that $Q_m$ is also uniquely determined by $\gamma_m$ and $\widehat{f}_m$.

We emphasize that our algorithm does not compute $Q_m$, but implicitly maintains $Q_m$ through $\gamma_m$ and $\widehat{f}_m$. This is important: even in the simple case of finite known $\mathcal{X}$, where $Q_m$ is directly a finite product of known probability measures, computing $Q_m$ explicitly requires $\Omega(K^{|\mathcal{X}|})$ computational cost, which is intractable for large $\mathcal{X}$. Remember that all of our arguments based on $Q_m$ are only applied for the purpose of statistical analysis and have nothing to do with the algorithm's original implementation.
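For a toy problem with finitely many contexts and actions, the equivalence (4) between the action selection kernel and the product measure over deterministic policies can be checked by direct enumeration; the kernel values below are hypothetical and the computation is purely illustrative of the marginalization argument (it is exactly the enumeration that the algorithm itself never performs).

```python
from itertools import product

contexts = ["x1", "x2"]
actions = ["a1", "a2", "a3"]

# Hypothetical action selection kernel p_m(a | x); each row sums to 1.
kernel = {"x1": {"a1": 0.6, "a2": 0.3, "a3": 0.1},
          "x2": {"a1": 0.2, "a2": 0.5, "a3": 0.3}}

def policy_weight(pi):
    """Q_m(pi) = prod_x p_m(pi(x) | x): the product measure over deterministic policies."""
    w = 1.0
    for x in contexts:
        w *= kernel[x][pi[x]]
    return w

# Enumerate the universal policy space (all |A|^|X| deterministic policies).
policies = [dict(zip(contexts, choice)) for choice in product(actions, repeat=len(contexts))]

# Marginalizing the product measure recovers the kernel, as in equation (4):
# Q_m({pi : pi(x) = a}) = p_m(a | x) for every context x and action a.
for x in contexts:
    for a in actions:
        q = sum(policy_weight(pi) for pi in policies if pi[x] == a)
        assert abs(q - kernel[x][a]) < 1e-12
print("marginals of the product measure match the action selection kernel")
```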

3.3 Dual Interpretation in the Universal Policy Space

Through the lens of the virtual process, we find a dual interpretation of our algorithm: it sequentially maintains a dense distribution $Q_m$ over all the policies in the universal policy space $\Psi$, for each epoch $m$. The analysis of the behavior of our algorithm thus could hopefully reduce to the analysis of the evolving sequence $Q_1, Q_2, \dots$ (which is still non-trivial because it depends on all the interactive data). All our analysis from now on will be based on the above dual interpretation.

As we start to explore how $Q_m$ evolves in the universal policy space, let us first define some implicit quantities in this world which are useful for our statistical analysis — they are called "implicit" because our algorithm does not compute or estimate them at all, yet they are all well-defined and implicitly exist as our algorithm proceeds.

Define the “implicit reward” of a policy as

and define the “implicit regret”444Note that this is an “instantaneous” quantity in , not a sum over multiple rounds. of a policy as

Given a predictor , define the “predicted implicit reward” of a policy as

and define the “predicted implicit regret” of a policy as555Note that in §1.1 we have defined as the reward-maximizing policy induced by a reward function , i.e., for all . Also note that not all policies in can be written as for some .

The idea of defining the above quantities is motivated by the celebrated work of Agarwal et al. (2014), which studies policy-based optimal contextual bandit learning in the agnostic setting (in which setting the above quantities are not implicit but play obvious roles and are directly estimated by their algorithm). There are some differences in the definitions, though. For example, Agarwal et al. (2014) define the above quantities for all policies in a given finite policy class $\Pi$, while we define the above quantities for all policies in the universal policy space $\Psi$ (which is strictly larger). Also, Agarwal et al. (2014) define their predicted rewards and regrets based on inverse propensity scoring estimates, while we define them based on a single predictor. We will revisit these differences later.

After defining the above quantities, we make a simple yet powerful observation, which is an immediate consequence of (4): for any epoch $m$ and any round $t$ in epoch $m$, we have
$$\mathbb{E}\big[r_t(\pi_{f^\star}(x_t)) - r_t(a_t) \,\big|\, \gamma_m, \widehat{f}_m\big] \;=\; \mathbb{E}_{\pi\sim Q_m}\big[\mathrm{Reg}(\pi)\big];$$
see Lemma A.3 in Appendix A. This means that (under any possible realization of $\gamma_m$ and $\widehat{f}_m$) the expected instantaneous regret incurred by our algorithm equals the "implicit regret" of the randomized policy $Q_m$ (as a weighted sum over the implicit regret of every deterministic policy $\pi\in\Psi$). Since $\mathrm{Reg}(\pi)$ is a fixed deterministic quantity for each $\pi$, the above equation indicates that to analyze our algorithm's expected regret in epoch $m$, we only need to analyze what the distribution $Q_m$ looks like. This property shows the advantage of our dual interpretation: compared with the original process in §3.1, where it is hard to evaluate our algorithm without observing $x_t$, we can now evaluate our algorithm's behavior regardless of $x_t$.

3.4 Optimal Contextual Bandit Learning in the Universal Policy Space

Once we realize that, in order to understand the behavior of our algorithm, we only need to understand the properties of $Q_m$, the analysis of our algorithm is not difficult anymore. We first state an immediate observation based on the equivalence relationship between $p_m$ and $Q_m$ in equation (4).

Observation 1

For any deterministic policy $\pi\in\Psi$, the quantity $\mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\big[1/p_m(\pi(x)\mid x)\big]$ is the expected inverse probability that the decision generated by the randomized policy $Q_m$ is the same as the decision generated by the deterministic policy $\pi$, over the randomization of the context $x\sim\mathcal{D}_{\mathcal{X}}$. This quantity can be intuitively understood as a measure of the "decisional divergence" between the randomized policy $Q_m$ and the deterministic policy $\pi$.

Now let us utilize the closed-form structure of $p_m$ in our algorithm and point out the most important property of $Q_m$, stated below (see Lemma A.3 and Lemma A.3 in Appendix A for details).

Observation 2

For any epoch $m$ and any round $t$ in epoch $m$, for any possible realization of $\gamma_m$ and $\widehat{f}_m$, the equivalent randomized policy $Q_m$ is a feasible solution to the following "Implicit Optimization Problem" (IOP):

$$\mathbb{E}_{\pi\sim Q_m}\big[\widehat{\mathrm{Reg}}_{\widehat{f}_m}(\pi)\big] \;\le\; \frac{K}{\gamma_m}, \qquad (5)$$
$$\mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\bigg[\frac{1}{p_m(\pi(x)\mid x)}\bigg] \;\le\; K + \gamma_m\,\widehat{\mathrm{Reg}}_{\widehat{f}_m}(\pi), \qquad \forall\,\pi\in\Psi. \qquad (6)$$

We now interpret the "Implicit Optimization Problem" (IOP) defined above. Constraint (5) says that $Q_m$ controls its predicted implicit regret (as a weighted sum over the predicted implicit regret of every policy $\pi\in\Psi$, based on the predictor $\widehat{f}_m$) within $K/\gamma_m$. This can be understood as an "exploitation constraint" because it requires $Q_m$ to put more mass on "good policies" with low predicted implicit regret (as judged by the current predictor $\widehat{f}_m$). Constraint (6) says that the decisional divergence between $Q_m$ and any policy $\pi$ is controlled by the predicted implicit regret of policy $\pi$ (times the learning rate $\gamma_m$, plus a constant $K$). This can be understood as an "adaptive exploration constraint", as it requires that $Q_m$ behave similarly to every policy at some level (which means that there should be sufficient exploration), while allowing $Q_m$ to be more similar to "good policies" with low predicted implicit regret and less similar to "bad policies" with high predicted implicit regret (which means that the exploration can be conducted adaptively based on the judgement of the predictor $\widehat{f}_m$). Combining (5) and (6), we conclude that $Q_m$ elegantly strikes a balance between exploration and exploitation — it is surprising that this is done completely implicitly, as the original algorithm does not explicitly consider these constraints at all.
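The two constraints can also be checked numerically for the inverse-gap-weighted distribution of Algorithm 1, here in their single-context form with the bounds $K/\gamma$ and $K + \gamma\cdot(\text{predicted gap})$ as written in our statement of (5)-(6); the paper's exact constants may differ. The sketch below verifies both inequalities on random instances.

```python
import numpy as np

def igw(predicted_rewards, gamma):
    """Inverse-gap-weighted action distribution, as in Algorithm 1 (constants as in our sketch)."""
    K = len(predicted_rewards)
    greedy = int(np.argmax(predicted_rewards))
    gaps = predicted_rewards[greedy] - predicted_rewards  # zero gap for the greedy action
    p = 1.0 / (K + gamma * gaps)
    p[greedy] += 1.0 - p.sum()  # greedy action absorbs the remaining probability mass
    return p, gaps

rng = np.random.default_rng(0)
for _ in range(1000):
    K = int(rng.integers(2, 10))
    rewards = rng.random(K)
    gamma = rng.uniform(1.0, 200.0)
    p, gaps = igw(rewards, gamma)
    # Constraint (5): expected predicted gap under p is at most K / gamma.
    assert (p * gaps).sum() <= K / gamma + 1e-9
    # Constraint (6): 1 / p(a) <= K + gamma * gap(a) for every action a.
    assert np.all(1.0 / p <= K + gamma * gaps + 1e-9)
print("inverse-gap weighting satisfies the exploitation and adaptive-exploration constraints")
```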

There are still a few important tasks to complete. The first task is to figure out what exactly the decisional divergence controls. We give an answer in Lemma A.4, which shows that, with high probability, for any epoch $m$ and any round in epoch $m$, and for all $\pi\in\Psi$, the prediction error of the implicit reward of $\pi$, namely $|\widehat{\mathcal{R}}_{\widehat{f}_m}(\pi) - \mathcal{R}(\pi)|$, can be bounded by the (maximum) decisional divergence between $\pi$ and all previously used randomized policies $Q_1,\dots,Q_{m-1}$, scaled by a factor inversely proportional to the learning rate $\gamma_m$. This is consistent with our intuition: the more similar a policy is to the previously used randomized policies, the more likely it is that this policy was implicitly explored in the past, and thus the more accurate our prediction for this policy should be. We emphasize that the above bound critically relies on our specification of the learning rate: we can absorb the prediction error into $1/\gamma_m$ because $\gamma_m$ is proportional to $\sqrt{\tau_{m-1}}$ and inversely proportional to $\sqrt{\log(|\mathcal{F}|/\delta)}$ — the first quantity is related to the length of the history, and the second quantity is related to the generalization ability of the function class $\mathcal{F}$. This is the first place where our proof requires an epoch-varying learning rate.

The second task is to further bound (the order of) the prediction error of the implicit regret of every policy $\pi$, as the implicit regret is the quantity that can be directly used to bound our algorithm's expected regret (see §3.3). We do this in Lemma A.5, where we show through an inductive argument that, with high probability, for any epoch $m$ and any round in epoch $m$, for all $\pi\in\Psi$, the prediction error of the implicit regret of $\pi$ can be bounded by a constant multiple of $\mathrm{Reg}(\pi)$ (or $\widehat{\mathrm{Reg}}_{\widehat{f}_m}(\pi)$) plus an additive term of order $K/\gamma_m$. While this is a uniform-convergence-type result, we would like to clarify that it does not mean that there is uniform convergence of $\widehat{\mathrm{Reg}}_{\widehat{f}_m}(\pi)$ to $\mathrm{Reg}(\pi)$ for all $\pi\in\Psi$, which would be too strong and is unlikely to be true. Instead, we use a careful design of the multiplicative and additive terms (the design is motivated by Lemma 13 in Agarwal et al. 2014), which enables us to capture the fact that the predicted implicit regret of "good policies" becomes more and more accurate, while the predicted implicit regret of "bad policies" need not be accurate (as their orders directly dominate the additive term). We emphasize that the above result critically relies on the fact that our learning rate gradually increases from epoch to epoch, as we use an inductive argument and, in order for the induction hypothesis to hold in the initial cases, we must let $\gamma_m$ be very small for small $m$. This is the second place where our proof requires an epoch-varying learning rate.

We have elaborated on how our algorithm implicitly strikes a balance between exploration and exploitation, and how our algorithm implicitly enables some nice uniform-convergence-type results to happen in the universal policy space. This is already enough to guarantee that the dual interpretation of our algorithm achieves optimal contextual bandit learning in the universal policy space. The rest of the proof is standard and can be found in Appendix A.

3.5 Key Idea: Bypassing the Monster

Readers who are familiar with the line of research on optimal contextual bandit learning in the agnostic setting using an offline cost-sensitive classification oracle (represented by Dudik et al. 2011, Agarwal et al. 2014) may find a surprising connection between the IOP (5)-(6) that we introduce in Observation 2 and the so-called "Optimization Problem" (OP) in Dudik et al. (2011) and Agarwal et al. (2014) — in particular, if one takes a look at the OP defined on page 5 of Agarwal et al. (2014), she will find that it is almost the same as our IOP (5)-(6), except for two fundamental differences:

  1. The OP of Dudik et al. (2011) and Agarwal et al. (2014) is defined over a given finite policy class $\Pi$, which may have an arbitrary shape. As a result, to obtain a solution to OP, the algorithm must explicitly solve a complicated (non-convex) optimization problem over a possibly complicated policy class — this requires a considerable number of calls to a cost-sensitive classification oracle, and is the major computational burden of Dudik et al. (2011) and Agarwal et al. (2014). Although Agarwal et al. (2014) "tame the monster" and reduce the computational cost by strategically maintaining only a sparse distribution over policies in $\Pi$, solving OP still requires a substantial number of calls to the classification oracle and is computationally expensive — the monster is still there.

    By contrast, our IOP is defined over the universal policy space $\Psi$, which is a nice product topological space. The IOP can thus be viewed as a very "slack" relaxation of OP which is extremely easy to solve. In particular, as §3 suggests, the solution to IOP can have a completely decomposed product form, which enables our algorithm to solve it in a completely implicit way. This means that our algorithm can implicitly and confidently maintain a dense distribution over all policies in $\Psi$, while solving IOP in closed form with no computational cost — there is no monster anymore, as we simply bypass it.

  2. In Dudik et al. (2011) and Agarwal et al. (2014), the policies' predicted rewards and predicted regrets are explicitly calculated based on model-free inverse propensity scoring estimates. As a result, their regret guarantees do not require the realizability assumption.

    By contrast, in our paper, the quantities $\widehat{\mathcal{R}}_{\widehat{f}_m}(\pi)$ and $\widehat{\mathrm{Reg}}_{\widehat{f}_m}(\pi)$ are implicitly defined based on a single greedy predictor — we can do this because we have the realizability assumption $f^\star\in\mathcal{F}$. As a result, we make a single call to a least-squares regression oracle here, and this is the main computational cost of our algorithm.

A possible question, then, is: given that the main computational burden of Dudik et al. (2011) and Agarwal et al. (2014) is solving OP, why can't they simply relax OP as we do in our IOP? The answer is that without the realizability assumption, they have to rely on capacity control of the policy space, i.e., the boundedness of $\log|\Pi|$, to obtain their statistical guarantees. Indeed, as their regret bound suggests, if one lets $\Pi$ be the universal policy space, then the regret could be as large as $\Omega(T)$. Specifically, their analysis requires the boundedness (or, more generally, the limited complexity) of $\Pi$ in two places: first, a generalization guarantee for inverse propensity scoring requires limited $|\Pi|$; second, since they have to explicitly compute the policies' predicted rewards and regrets without knowing the true context distribution $\mathcal{D}_{\mathcal{X}}$, they approximate it using historical data, which also requires limited $|\Pi|$ to enable statistical guarantees.

Our algorithm bypasses the above two requirements simultaneously: first, since we use model-based regression rather than model-free inverse propensity scoring to make predictions, we do not care about the complexity of our policy space in terms of prediction (i.e., the generalization guarantee of our algorithm comes from the boundedness of $\log|\mathcal{F}|$, not $\log|\Pi|$); second, since our algorithm does not require explicit computation of the policies' predicted rewards and regrets, we do not care what the policy space looks like. Essentially, all of these nice properties originate from the realizability assumption. This is how we understand the value of realizability: it does not only (statistically) give us better predictions, but also (computationally) enables us to remove the restrictions on the policy space, which helps us bypass the monster.

3.6 The Birth of FALCON

Seemingly intriguing and tricky, FALCON is actually an algorithm that can be derived from systematic analysis. The idea of "bypassing the monster", as explained in §3.5, is exactly what leads to the derivation of the FALCON algorithm. Before we close this section, we describe how FALCON is derived.

  1. We do a thought experiment, considering how ILOVETOCONBANDITS (Agarwal et al. 2014) would solve our problem without the realizability assumption, given the induced policy class $\Pi_{\mathcal{F}} = \{\pi_f : f\in\mathcal{F}\}$.

  2. ILOVETOCONBANDITS uses an inverse propensity scoring approach to compute the predicted rewards and predicted regrets of policies. This can be equivalently viewed as first computing an "inverse propensity predictor", then using our definitions in §3.3 to compute the predicted implicit reward and regret of each policy if $\mathcal{D}_{\mathcal{X}}$ is known.

  3. The computational burden in the above thought experiment is to solve OP over $\Pi_{\mathcal{F}}$, which requires repeated calls to a cost-sensitive classification oracle.

  4. When we have realizability and use a regression oracle to select predictors, we do not need the model-free inverse propensity scoring predictor, so we do not need our policy space to be bounded to ensure generalization ability.

  5. An early technical result, Lemma 4.3 in Agarwal et al. (2012), is very interesting. It shows that when one tries to solve contextual bandits using regression approaches, one should try to bound a quantity like "the expected inverse probability of choosing the same action" — note that a very similar quantity also appears in the OP of Agarwal et al. (2014). This suggests that an offline-regression-oracle-based algorithm should try to satisfy some requirements similar to OP. (Lemma 4.3 in Agarwal et al. (2012) also motivates our Lemma A.4. But our Lemma A.4 goes beyond Lemma 4.3 in Agarwal et al. (2012) by unbinding the relationship between a predictor and a policy.)

  6. Motivated by 3, 4, and 5, we relax the domain of OP from $\Pi_{\mathcal{F}}$ to the universal policy space $\Psi$, and we find a closed-form solution to it (in fact, in the few-arm, few-context case the problem has a clear geometric interpretation), which is $Q_m$, probabilistically equivalent to FALCON's decision generating process in epoch $m$.

4 Conclusion

In this paper, we propose the first provably optimal offline-regression-oracle-based algorithm for contextual bandits, solving an important open problem for realizable contextual bandits. Our algorithm is surprisingly fast and simple, and our analysis is clean as well. We hope that our findings can motivate future research on contextual bandits.

We have also studied the extension to infinite function classes (based on similar ideas), and we will add the obtained results to this paper soon.

Our next step is to conduct computational experiments to validate the efficiency of our algorithm, and compare our algorithm’s performance with other existing algorithms.

References

  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
  • N. Abe, A. Biermann, and P. Long (2003) Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37(4), pp. 263–293.
  • N. Abe and P. Long (1999) Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning.
  • A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, O. Ribas, et al. (2016) Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966.
  • A. Agarwal, M. Dudík, S. Kale, J. Langford, and R. Schapire (2012) Contextual bandit learning with predictable rewards. In International Conference on Artificial Intelligence and Statistics, pp. 19–26.
  • A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire (2014) Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646.
  • S. Agrawal and N. Goyal (2013) Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135.
  • P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1), pp. 48–77.
  • A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, pp. 19–26.
  • A. Bietti, A. Agarwal, and J. Langford (2018) A contextual bandit bake-off. arXiv preprint arXiv:1802.04064.
  • N. Cesa-Bianchi, C. Gentile, and Y. Mansour (2014) Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory 61(1), pp. 549–564.
  • W. Chu, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pp. 208–214.
  • M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang (2011) Efficient optimal learning for contextual bandits. In Conference on Uncertainty in Artificial Intelligence, pp. 169–178.
  • M. Dudík, J. Langford, and L. Li (2011) Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
  • S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári (2010) Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594.
  • D. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire (2018) Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 1539–1548.
  • D. Foster and A. Rakhlin (2020) Beyond UCB: optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002.04926.
  • E. Hazan and T. Koren (2016) The computational power of optimization in online learning. In Annual ACM Symposium on Theory of Computing (STOC), pp. 128–141.
  • A. R. Klivans and A. A. Sherstov (2009) Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences 75(1), pp. 2–12.
  • J. Langford and T. Zhang (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.
  • T. Lattimore and C. Szepesvári (2018) Bandit Algorithms. Preprint.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
  • L. Li, Y. Lu, and D. Zhou (2017) Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071–2080.
  • Y. Li, Y. Wang, and Y. Zhou (2019) Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory.
  • H. B. McMahan and M. Streeter (2009) Tighter bounds for multi-armed bandits with expert advice. In Conference on Learning Theory.
  • A. Rakhlin and K. Sridharan (2014) Online non-parametric regression. In Conference on Learning Theory, pp. 1232–1264.
  • D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. (2018) A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11(1), pp. 1–96.
  • A. Slivkins (2019) Introduction to multi-armed bandits. Foundations and Trends in Machine Learning 12(1-2), pp. 1–286.
  • T. Tao (2013) An Introduction to Measure Theory. Graduate Studies in Mathematics, American Mathematical Society.

Appendix A Proof of Theorem 2.1

A.1 Definitions

For notational convenience, we make some definitions.

A.2 Basic Lemmas

We start with some basic "generic" lemmas that hold true for any algorithm; see Lemma A.2 and Lemma A.2. Note that these lemmas do not rely on any specific property of an algorithm — in particular, while Lemma A.2 involves some of the definitions above, these quantities are well-defined for any algorithm, regardless of whether the algorithm uses them to make decisions.

[Lemma 4.2 in Agarwal et al. 2012] Fix a function . Suppose we sample from the data distribution , and an action from an arbitrary distribution such that and are conditionally independent given . Define the random variable

Then we have

[Adapted from Lemma 4.1 in Agarwal et al. 2012]For all , with probability at least , we have:

Therefore (by a union bound), the following event holds with probability at least :

The proofs of Lemma A.2 and Lemma A.2 can be found in Agarwal et al. (2012) and are omitted here. For notational simplicity, in the definition of , we may further relax to .

A.3 Per-Epoch Properties of the Algorithm

We now utilize the specific properties of our algorithm to prove our regret bound. We begin with some per-epoch properties that always hold for our algorithm regardless of its performance in other epochs.

As we mentioned in the main article, a starting point of our proof is to translate the action selection kernel $p_m$ into an "equivalent" distribution over policies $Q_m$. Lemma A.3 justifies this translation by showing the existence of a probabilistically equivalent $Q_m$ for every $p_m$.

Fix any epoch $m$. The action selection scheme $p_m(\cdot\mid\cdot)$ is a valid probability kernel. Moreover, there exists a probability measure $Q_m$ on the universal policy space $\Psi$ such that $Q_m\big(\{\pi\in\Psi: \pi(x)=a\}\big) = p_m(a\mid x)$ for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$.

Proof.

Proof of Lemma A.3. For each $x \in \mathcal{X}$, since $\mathcal{A}$ is discrete and finite, $\big(\mathcal{A}, 2^{\mathcal{A}}, p_m(\cdot\mid x)\big)$ is a probability space satisfying the requirements of Theorem 2.4.4 in Tao (2013) (the theorem is essentially a corollary of the Kolmogorov extension theorem). By Theorem 2.4.4 in Tao (2013), there exists a unique probability measure on