# Efficient Optimal Learning for Contextual Bandits

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time polylog(N), where N is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.


## 1 Introduction

The contextual bandit setting consists of the following loop repeated indefinitely:

1. The world presents context information as features $x$.

2. The learning algorithm chooses an action $a$ from $K$ possible actions.

3. The world presents a reward $r(a)$ for the action.

The key difference between the contextual bandit setting and standard supervised learning is that only the reward of the chosen action is revealed. For example, after always choosing the same action several times in a row, the feedback given provides almost no basis to prefer the chosen action over another action. In essence, the contextual bandit setting captures the difficulty of exploration while avoiding the difficulty of credit assignment as in more general reinforcement learning settings.
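The loop above can be sketched in a few lines. This is only an illustration: `run_bandit`, `toy_world`, and `always_zero` are hypothetical names, and the two-context world is an assumption made for the example. The point it shows is that the full reward vector exists but the learner only ever records the entry for the action it chose.

```python
import random

def run_bandit(policy, draw_context_and_rewards, num_rounds, seed=0):
    """Simulate the three-step loop; only the chosen action's reward is seen."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_rounds):
        # Step 1: the world draws a context (the reward vector stays hidden).
        context, reward_vector = draw_context_and_rewards(rng)
        # Step 2: the learner picks an action from the context alone.
        action = policy(context)
        # Step 3: bandit feedback, a single entry of the reward vector.
        observed = reward_vector[action]
        history.append((context, action, observed))
    return history

def toy_world(rng):
    # Hypothetical world: two contexts, two actions; matching the context pays 1.
    context = rng.randrange(2)
    return context, [1.0 if a == context else 0.0 for a in range(2)]

always_zero = lambda context: 0
log = run_bandit(always_zero, toy_world, num_rounds=100)
```

Because `always_zero` never tries action 1, the log contains no information about how action 1 would have performed, which is the exploration difficulty described above.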

The contextual bandit setting is a half-way point between standard supervised learning and full-scale reinforcement learning where it appears possible to construct algorithms with convergence rate guarantees similar to supervised learning. Many natural settings satisfy this half-way point, motivating the investigation of contextual bandit learning. For example, the problem of choosing interesting news articles or ads for users by internet companies can be naturally modeled as a contextual bandit setting. In the medical domain where discrete treatments are tested before approval, the process of deciding which patients are eligible for a treatment takes contexts into account. More generally, we can imagine that in a future with personalized medicine, new treatments are essentially equivalent to new actions in a contextual bandit setting.

In the i.i.d. setting, the world draws a pair $(x, \vec{r})$ consisting of a context and a reward vector from some unknown distribution $D$, revealing $x$ in Step 1, but only the reward $r(a)$ of the chosen action $a$ in Step 3. Given a set of policies $\Pi$, the goal is to create an algorithm for Step 2 which competes with the set of policies. We measure our success by comparing the algorithm's cumulative reward to the expected cumulative reward of the best policy in the set. The difference of the two is called regret.

All existing algorithms for this setting either achieve a suboptimal regret (Langford and Zhang, 2007) or require computation linear in the number of policies (Auer et al., 2002b; Beygelzimer et al., 2011). In unstructured policy spaces, this computational complexity is the best one can hope for. On the other hand, in the case where the rewards of all actions are revealed, the problem is equivalent to cost-sensitive classification, and we know of algorithms to efficiently search the space of policies (classification rules) such as cost-sensitive logistic regression and support vector machines. In these cases, the space of classification rules is exponential in the number of features, but these problems can be efficiently solved using convex optimization.

Our goal here is to efficiently solve the contextual bandit problem for similarly large policy spaces. We do this by reducing the contextual bandit problem to cost-sensitive classification. Given a supervised cost-sensitive learning algorithm as an oracle (Beygelzimer et al., 2009), our algorithm runs in time only $\mathrm{polylog}(N)$ while achieving regret $O(\sqrt{TK \ln N})$, where $N$ is the number of possible policies (classification rules), $K$ is the number of actions (classes), and $T$ is the number of time steps. This efficiency is achieved in a modular way, so any future improvement in cost-sensitive learning immediately applies here.

### 1.1 Previous Work and Motivation

All previous regret-optimal approaches are measure based: they work by updating a measure over policies, an operation which is linear in the number of policies. In contrast, regret guarantees scale only logarithmically in the number of policies. If not for the computational bottleneck, these regret guarantees imply that we could dramatically increase performance in contextual bandit settings using more expressive policies. We overcome the computational bottleneck using an algorithm which works by creating cost-sensitive classification instances and calling an oracle to choose optimal policies. Actions are chosen based on the policies returned by the oracle rather than according to a measure over all policies. This is reminiscent of AdaBoost (Freund and Schapire, 1997), which creates weighted binary classification instances and calls a “weak learner” oracle to obtain classification rules. These classification rules are then combined into a final classifier with boosted accuracy. Just as AdaBoost converts a weak learner into a strong learner, our approach converts a cost-sensitive classification learner into an algorithm that solves the contextual bandit problem.

In a more difficult version of contextual bandits, an adversary chooses $(x, \vec{r})$ given knowledge of the learning algorithm (but not any random numbers). All known regret-optimal solutions in the adversarial setting are variants of the EXP4 algorithm (Auer et al., 2002b). EXP4 achieves the same regret rate as our algorithm: $O(\sqrt{KT \ln N})$, where $T$ is the number of time steps, $K$ is the number of actions available in each time step, and $N$ is the number of policies.

Why not use EXP4 in the i.i.d. setting? For example, it is known that the algorithm can be modified to succeed with high probability (Beygelzimer et al., 2011), and also for VC classes when the adversary is constrained to i.i.d. sampling. There are two central benefits that we hope to realize by directly assuming i.i.d. contexts and reward vectors.

1. Computational Tractability. Even when the reward vector is fully known, adversarial regrets scale as $O(\sqrt{T \ln N})$ while computation scales as $O(N)$ in general. One attempt to get around this is the follow-the-perturbed-leader algorithm (Kalai and Vempala, 2005), which provides a computationally tractable solution in certain special-case structures. This algorithm has no mechanism for efficient application to arbitrary policy spaces, even given an efficient cost-sensitive classification oracle. An efficient cost-sensitive classification oracle has been shown effective in transductive settings (Kakade and Kalai, 2005). Aside from the drawback of requiring a transductive setting, the regret achieved there is substantially worse than for EXP4.

2. Improved Rates. When the world is not completely adversarial, it is possible to achieve substantially lower regrets than are possible with algorithms optimized for the adversarial setting. For example, in supervised learning, it is possible to obtain regrets scaling as $O(\log T)$ with a problem-dependent constant (Bartlett et al., 2007). When the feedback is delayed by $\tau$ rounds, lower bounds imply that the regret in the adversarial setting increases by a multiplicative $\sqrt{\tau}$, while in the i.i.d. setting it is possible to achieve an additive regret of $\tau$ (Langford et al., 2009).

In a direct i.i.d. setting, the previous-best approach using a cost-sensitive classification oracle was given by the $\epsilon$-greedy and epoch-greedy algorithms (Langford and Zhang, 2007), which have a regret scaling as $O(T^{2/3})$ in the worst case.

There have also been many special-case analyses. For example, the theory of the context-free setting is well understood (Lai and Robbins, 1985; Auer et al., 2002a; Even-Dar et al., 2006). Similarly, good algorithms exist when rewards are linear functions of features (Auer, 2002) or actions lie in a continuous space with the reward function sampled according to a Gaussian process (Srinivas et al., 2010).

### 1.2 What We Prove

In Section 3 we state the PolicyElimination algorithm, and prove the following regret bound for it.

###### Theorem 4.

For all distributions $D$ over $(x, \vec{r})$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1-\delta$, the regret of PolicyElimination (Algorithm 1) over $T$ rounds is at most

$$16\sqrt{2TK\ln\frac{4T^2N}{\delta}}.$$

This result can be extended to deal with VC classes, as well as other special cases. It forms the simplest method we have of exhibiting the new analysis.

The new key element of this algorithm is the identification of a distribution over actions which simultaneously achieves small expected regret and allows estimating the value of every policy with small variance. The existence of such a distribution is shown nonconstructively by a minimax argument.

PolicyElimination is computationally intractable and also requires exact knowledge of the context distribution (but not the reward distribution!). We show how to address these issues in Section 4 using an algorithm we call RandomizedUCB. Namely, we prove the following theorem.

###### Theorem 5.

For all distributions $D$ over $(x, \vec{r})$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1-\delta$, the regret of RandomizedUCB (Algorithm 2) over $T$ rounds is at most

$$O\left(\sqrt{TK\log(TN/\delta)} + K\log(NK/\delta)\right).$$

RandomizedUCB’s analysis is substantially more complex, with a key subroutine being an application of the ellipsoid algorithm with a cost-sensitive classification oracle (described in Section 5). RandomizedUCB does not assume knowledge of the context distribution, and instead works with the history of contexts it has observed. Modifying the proof for this empirical distribution requires a covering argument over the distributions over policies which uses the probabilistic method. The net result is an algorithm with a similar top-level analysis as PolicyElimination, but with the running time only poly-logarithmic in the number of policies given a cost-sensitive classification oracle.

###### Theorem 11.

In each time step , RandomizedUCB makes at most calls to cost-sensitive classification oracle, and requires additional processing time.

Apart from a tractable algorithm, our analysis can be used to derive tighter regrets than would be possible in the adversarial setting. For example, in Section 6, we consider a common setting where reward feedback is delayed by $\tau$ rounds. A straightforward modification of PolicyElimination yields a regret with an additive term proportional to $\tau$ compared with the delay-free setting. Namely, we prove the following.

###### Theorem 12.

For all distributions $D$ over $(x, \vec{r})$ with $K$ actions, for all sets of $N$ policies $\Pi$, and all delay intervals $\tau$, with probability at least $1-\delta$, the regret of DelayedPE (Algorithm 3) is at most

$$16\sqrt{2K\ln\frac{4T^2N}{\delta}}\,\left(\tau+\sqrt{T}\right).$$

We start next with precise settings and definitions.

## 2 Setting and Definitions

### 2.1 The Setting

Let $A$ be the set of $K$ actions, let $X$ be the domain of contexts $x$, and let $D$ be an arbitrary joint distribution on $(x, \vec{r})$. We denote the marginal distribution of $D$ over $X$ by $D_X$.

We denote by $\Pi$ a finite set of policies $\{\pi : X \to A\}$, where each policy $\pi$, given a context $x_t$ in round $t$, chooses the action $\pi(x_t)$. The cardinality of $\Pi$ is denoted by $N$. Let $\vec{r}_t \in [0,1]^K$ be the vector of rewards, where $r_t(a)$ is the reward of action $a$ on round $t$.

In the i.i.d. setting, on each round $t = 1 \ldots T$, the world chooses $(x_t, \vec{r}_t)$ i.i.d. according to $D$ and reveals $x_t$ to the learner. The learner, having access to $\Pi$, chooses action $a_t \in \{1, \ldots, K\}$. Then the world reveals reward $r_t(a_t)$ (which we call $r_t$ for short) to the learner, and the interaction proceeds to the next round.

We consider two modes of accessing the set of policies . The first option is through the enumeration of all policies. This is impractical in general, but suffices for the illustrative purpose of our first algorithm. The second option is an oracle access, through an argmax oracle, corresponding to a cost-sensitive learner:

###### Definition 1.

For a set of policies $\Pi$, an argmax oracle (AMO for short) is an algorithm which, for any sequence $\{(x_{t'}, \vec{r}_{t'})\}_{t'=1\ldots t}$ with $x_{t'} \in X$ and $\vec{r}_{t'} \in \mathbb{R}^K$, computes

$$\arg\max_{\pi\in\Pi} \sum_{t'=1}^{t} r_{t'}(\pi(x_{t'})).$$

The reason why the above can be viewed as a cost-sensitive classification oracle is that vectors of rewards can be interpreted as negative costs, and hence the policy returned by AMO is the optimal cost-sensitive classifier on the given data.
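To make the oracle's contract concrete, here is a minimal brute-force sketch. It enumerates the policy set, which is exactly what an efficient oracle is meant to avoid, so it should be read only as a specification of the input and output; the function name and the toy data are assumptions for illustration.

```python
def argmax_oracle(policies, contexts, reward_vectors):
    """Return the policy maximizing the summed reward of its chosen actions."""
    def total_reward(policy):
        return sum(r[policy(x)] for x, r in zip(contexts, reward_vectors))
    return max(policies, key=total_reward)

# Three hypothetical policies over integer contexts and two actions.
policies = [lambda x: 0, lambda x: 1, lambda x: x % 2]
contexts = [0, 1, 2, 3]
reward_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
best = argmax_oracle(policies, contexts, reward_vectors)
```

Here the parity policy collects reward on every example, so it is the one returned. A practical oracle would replace the enumeration with a cost-sensitive learner.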

### 2.2 Expected and Empirical Rewards

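Let the expected instantaneous reward of a policy $\pi \in \Pi$ be denoted by

$$\eta_D(\pi) \doteq E_{(x,\vec{r})\sim D}\left[r(\pi(x))\right].$$

The best policy $\pi_{\max} \in \Pi$ is that which maximizes $\eta_D(\pi)$. More formally,

$$\pi_{\max} \doteq \arg\max_{\pi\in\Pi} \eta_D(\pi).$$

We define $h_t$ to be the history at time $t$ that the learner has seen. Specifically

$$h_t = \bigcup_{t'=1\ldots t} (x_{t'}, a_{t'}, r_{t'}, p_{t'}),$$

where $p_{t'}$ is the probability of the algorithm choosing action $a_{t'}$ at time $t'$. Note that $a_{t'}$ and $p_{t'}$ are produced by the learner while $x_{t'}, r_{t'}$ are produced by nature. We write $x \sim h$ to denote choosing $x$ uniformly at random from the $x$'s in history $h$.

Using the history of past actions and probabilities with which they were taken, we can form an unbiased estimate of the policy value for any $\pi \in \Pi$:

$$\eta_t(\pi) \doteq \frac{1}{t} \sum_{(x,a,r,p)\in h_t} \frac{r\, I(\pi(x)=a)}{p}.$$

The unbiasedness follows because $E_{a_t\sim p_t}\left[\frac{r_t I(\pi(x_t)=a_t)}{p_t(a_t)}\right] = \sum_a p_t(a)\,\frac{r_t(a) I(\pi(x_t)=a)}{p_t(a)} = r_t(\pi(x_t))$. The empirically best policy at time $t$ is denoted

$$\pi_t \doteq \arg\max_{\pi\in\Pi} \eta_t(\pi).$$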
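The importance-weighted estimate and the empirically best policy can be sketched directly from their definitions. The helper names and the four-record history below are hypothetical; each record is (context, action taken, observed reward, probability with which that action was taken).

```python
def ips_value(policy, history):
    """Importance-weighted estimate: average of r * 1[policy(x) == a] / p."""
    total = sum(r / p for (x, a, r, p) in history if policy(x) == a)
    return total / len(history)

def empirically_best(policies, history):
    """Policy with the largest importance-weighted value estimate."""
    return max(policies, key=lambda pi: ips_value(pi, history))

# Toy log collected by choosing uniformly between two actions (p = 0.5).
history = [(0, 0, 1.0, 0.5), (1, 1, 1.0, 0.5), (0, 1, 0.0, 0.5), (1, 0, 0.0, 0.5)]
match_context = lambda x: x
always_zero = lambda x: 0
```

Records where the policy disagrees with the logged action contribute zero, and records where it agrees are scaled up by $1/p$ to compensate. This scaling is the source of the variance that the later sections control.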

### 2.3 Regret

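The goal of this work is to obtain a learner that has small regret relative to the expected performance of $\pi_{\max}$ over $T$ rounds, which is

$$\sum_{t=1}^{T} \left(\eta_D(\pi_{\max}) - r_t\right). \tag{2.1}$$

We say that the regret of the learner over $T$ rounds is bounded by $\epsilon$ with probability at least $1-\delta$, if

$$\Pr\left[\sum_{t=1}^{T} \left(\eta_D(\pi_{\max}) - r_t\right) \le \epsilon\right] \ge 1-\delta,$$

where the probability is taken with respect to the random pairs $(x_t, \vec{r}_t) \sim D$ for $t = 1 \ldots T$, as well as any internal randomness used by the learner.

We can also define notions of regret and empirical regret for policies $\pi$. For all $\pi \in \Pi$, let

$$\Delta_D(\pi) = \eta_D(\pi_{\max}) - \eta_D(\pi), \qquad \Delta_t(\pi) = \eta_t(\pi_t) - \eta_t(\pi).$$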

Our algorithms work by choosing distributions over policies, which in turn induce distributions over actions. For any distribution $P$ over policies $\Pi$, let $W_P$ denote the induced conditional distribution over actions $a$ given the context $x$:

$$W_P(x,a) \doteq \sum_{\pi\in\Pi:\ \pi(x)=a} P(\pi). \tag{2.2}$$

In general, we shall use $W$, $W'$ and $Z$ as conditional probability distributions over the actions $A$ given contexts $X$, i.e., $W : X \times A \to [0,1]$ such that $W(x,\cdot)$ is a probability distribution over $A$ (and similarly for $W'$ and $Z$). We shall think of $W'$ as a smoothed version of $W$ with a minimum action probability of $\mu$ (to be defined by the algorithm), such that

$$W'(x,a) = (1-K\mu)\,W(x,a) + \mu.$$

Conditional distributions such as $W$ (and $W'$, $Z$, etc.) correspond to randomized policies. We define notions of true and empirical value and regret for them as follows:

$$\begin{aligned}
\eta_D(W) &\doteq E_{(x,\vec{r})\sim D}\left[\vec{r}\cdot W(x)\right], &
\eta_t(W) &\doteq \frac{1}{t}\sum_{(x,a,r,p)\in h_t} \frac{r\,W(x,a)}{p}, \\
\Delta_D(W) &\doteq \eta_D(\pi_{\max}) - \eta_D(W), &
\Delta_t(W) &\doteq \eta_t(\pi_t) - \eta_t(W).
\end{aligned}$$
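As a small illustration of the projection and smoothing just described, the sketch below computes the action distribution induced by a weighting over policies and then mixes in a floor `mu` on every action. The function names and the three-policy example are assumptions, not part of the source.

```python
def induced_action_distribution(policy_weights, policies, context, num_actions):
    """Total weight of the policies choosing each action in this context."""
    dist = [0.0] * num_actions
    for weight, policy in zip(policy_weights, policies):
        dist[policy(context)] += weight
    return dist

def smooth(dist, mu):
    """Mix with the uniform floor: (1 - K*mu) * W(x, a) + mu for each action."""
    k = len(dist)
    return [(1 - k * mu) * w + mu for w in dist]

# Hypothetical example: three policies, three actions, context ignored.
policies = [lambda x: 0, lambda x: 1, lambda x: 0]
w = induced_action_distribution([0.5, 0.25, 0.25], policies, context=None, num_actions=3)
w_smooth = smooth(w, mu=0.1)
```

The smoothed vector still sums to one, and the action that no policy selects now has probability exactly `mu`, which keeps importance weights bounded by `1 / mu`.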

## 3 Policy Elimination

The basic ideas behind our approach are demonstrated in our first algorithm: PolicyElimination (Algorithm 1).

The key step is Step 1, which finds a distribution over policies which induces low variance in the estimate of the value of all policies. Below we use the minimax theorem to show that such a distribution always exists. How to find this distribution is not specified here, but in Section 5 we develop a method based on the ellipsoid algorithm. Step 2 then projects this distribution onto a distribution over actions and applies smoothing. Finally, Step 5 eliminates the policies that have been determined to be suboptimal (with high probability).

### Algorithm Analysis

We analyze PolicyElimination in several steps. First, we prove the existence of $P_t$ in Step 1, provided that $\Pi_{t-1}$ is non-empty. We recast the feasibility problem in Step 1 as a game between two players: Prover, who is trying to produce $P_t$, and Falsifier, who is trying to find $\pi$ violating the constraints. We give more power to Falsifier and allow him to choose a distribution over $\pi$ (i.e., a randomized policy) which would violate the constraints.

Note that any policy $\pi$ corresponds to a point in the space of randomized policies (viewed as functions $X \times A \to [0,1]$), with $\pi(x,a) \doteq I(\pi(x)=a)$. For any distribution $P$ over policies in $\Pi_{t-1}$, the induced randomized policy $W_P$ then corresponds to a point in the convex hull of $\Pi_{t-1}$. Denoting the convex hull of $\Pi_{t-1}$ by $C$, Prover's choice by $W$ and Falsifier's choice by $Z$, the feasibility of Step 1 follows by the following lemma:

###### Lemma 1.

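Let $C$ be a compact and convex set of randomized policies. Let $\mu \in (0, 1/K)$ and for any $W \in C$, $W'(x,a) \doteq (1-K\mu)W(x,a) + \mu$. Then for all distributions $D$,

$$\min_{W\in C}\ \max_{Z\in C}\ E_{x\sim D_X} E_{a\sim Z(x,\cdot)}\left[\frac{1}{W'(x,a)}\right] \le \frac{K}{1-K\mu}.$$

###### Proof.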

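Let $f(W,Z) \doteq E_{x\sim D_X} E_{a\sim Z(x,\cdot)}\left[1/W'(x,a)\right]$ denote the inner expression of the minimax problem. Note that $f(W,Z)$ is:

• everywhere defined: Since $W'(x,a) \ge \mu$, we obtain that $1/W'(x,a) \in (0, 1/\mu]$, hence the expectations are defined for all $W$ and $Z$.

• linear in $Z$: Linearity follows from rewriting $f(W,Z)$ as

$$f(W,Z) = E_{x\sim D_X} \sum_{a\in A} \left[\frac{Z(x,a)}{W'(x,a)}\right].$$

• convex in $W$: Note that $1/W'(x,a)$ is convex in $W(x,a)$ by convexity of $1/(c_1 w + c_2)$ in $w \ge 0$, for $c_1 \ge 0$, $c_2 > 0$. Convexity of $f(W,Z)$ in $W$ then follows by taking expectations over $x$ and $a$.

Hence, by Theorem 14 (in Appendix B), min and max can be reversed without affecting the value:

$$\min_{W\in C}\ \max_{Z\in C} f(W,Z) = \max_{Z\in C}\ \min_{W\in C} f(W,Z).$$

The right-hand side can be further upper-bounded by $\max_{Z\in C} f(Z,Z)$, which is upper-bounded by

$$f(Z,Z) = E_{x\sim D_X} \sum_{a\in A} \left[\frac{Z(x,a)}{Z'(x,a)}\right] \le E_{x\sim D_X} \sum_{a\in A:\ Z(x,a)>0} \left[\frac{Z(x,a)}{(1-K\mu)Z(x,a)}\right] \le \frac{K}{1-K\mu}. \qquad ∎$$

###### Corollary 2.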

The set of distributions satisfying constraints of Step 1 is non-empty.
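The final step of the lemma's proof, where the two players pick the same randomized policy, is easy to check numerically for a single context. The sketch below samples random action distributions and confirms that the diagonal value never exceeds $K/(1-K\mu)$. It is a sanity check under assumed values of `K` and `mu`, not a proof, and the helper names are hypothetical.

```python
import random

def diagonal_value(z, mu):
    """Sum over actions of Z(a) / Z'(a), with Z'(a) = (1 - K*mu) * Z(a) + mu."""
    k = len(z)
    return sum(p / ((1 - k * mu) * p + mu) for p in z)

def random_distribution(k, rng):
    """A random point of the probability simplex over k actions."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [v / total for v in raw]

rng = random.Random(0)
K, mu = 4, 0.05          # illustrative values with K * mu < 1
bound = K / (1 - K * mu)
worst = max(diagonal_value(random_distribution(K, rng), mu) for _ in range(1000))
```

Each summand is below $1/(1-K\mu)$ because the added floor `mu` only enlarges the denominator, so the sum over `K` actions stays below the bound.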

Given the existence of $P_t$, we will see below that the constraints in Step 1 ensure low variance of the policy value estimator $\eta_t(\pi)$ for all $\pi \in \Pi_{t-1}$. The small variance is used to ensure accuracy of policy elimination in Step 5 as quantified in the following lemma:

###### Lemma 3.

With probability at least , for all :

1. (i.e., is non-empty)

2. for all

###### Proof.

We will show that for any policy $\pi \in \Pi$, the probability that $\eta_t(\pi)$ deviates from $\eta_D(\pi)$ by more than $b_t$ is at most $2\delta_t$. Taking the union bound over all policies and all time steps, we find that with probability at least $1-\delta$,

$$|\eta_t(\pi) - \eta_D(\pi)| \le b_t \tag{3.1}$$

for all $\pi \in \Pi$ and all $t$. Then:

1. By the triangle inequality, in each time step, for all , yielding the first part of the lemma.

2. Also by the triangle inequality, if for , then . Hence the policy is eliminated in Step 5, yielding the second part of the lemma.

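It remains to show Eq. (3.1). We fix the policy $\pi \in \Pi$ and time $t$, and show that the deviation bound is violated with probability at most $2\delta_t$. Our argument rests on Freedman's inequality (see Theorem 13 in Appendix A). Let

$$y_t = \frac{r_t\, I(\pi(x_t)=a_t)}{W'_t(a_t)},$$

i.e., $\eta_t(\pi) = \frac{1}{t}\sum_{t'=1}^{t} y_{t'}$. Let $E_t$ denote the conditional expectation $E[\,\cdot \mid h_{t-1}]$. To use Freedman's inequality, we need to bound the range of $y_t$ and its conditional second moment $E_t[y_t^2]$.

Since $r_t \in [0,1]$ and $W'_t(a_t) \ge \mu_t$, we have the bound

$$0 \le y_t \le 1/\mu_t \doteq R_t.$$

Next,

$$\begin{aligned}
E_t[y_t^2] &= E_{(x_t,\vec{r}_t)\sim D}\, E_{a_t\sim W'_t}\left[y_t^2\right] \\
&= E_{(x_t,\vec{r}_t)\sim D}\, E_{a_t\sim W'_t}\left[\frac{r_t^2\, I(\pi(x_t)=a_t)}{W'_t(a_t)^2}\right] \\
&\le E_{(x_t,\vec{r}_t)\sim D}\left[\frac{W'_t(\pi(x_t))}{W'_t(\pi(x_t))^2}\right] && (3.2) \\
&= E_{x_t\sim D}\left[\frac{1}{W'_t(\pi(x_t))}\right] \le 2K, && (3.3)
\end{aligned}$$

where Eq. (3.2) follows by boundedness of $r_t$ and Eq. (3.3) follows from the constraints in Step 1. Hence,

$$\sum_{t'=1}^{t} E_{t'}\left[y_{t'}^2\right] \le 2Kt \doteq V_t.$$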

Since is decreasing for , we obtain that is non-increasing (by separately analyzing , , ). Let be the first such that . Note that , so for , we have and . Hence, the deviation bound holds for .

Let . For , by the monotonicity of

$$R_{t'} = 1/\mu_{t'} \le 1/\mu_t = \sqrt{\frac{2Kt}{\ln(1/\delta_t)}} = \sqrt{\frac{V_t}{\ln(1/\delta_t)}}.$$

Hence, the assumptions of Theorem 13 are satisfied, and

$$\Pr\left[\,|\eta_t(\pi) - \eta_D(\pi)| \ge b_t\,\right] \le 2\delta_t.$$

The union bound over $\pi$ and $t$ yields Eq. (3.1). ∎

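This immediately implies that the cumulative regret is bounded by

$$\sum_{t=1}^{T} \left(\eta_D(\pi_{\max}) - r_t\right) \le 8\sqrt{2K\ln\frac{4NT^2}{\delta}}\ \sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le 16\sqrt{2TK\ln\frac{4T^2N}{\delta}} \tag{3.4}$$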

and gives us the following theorem.

###### Theorem 4.

For all distributions $D$ over $(x, \vec{r})$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1-\delta$, the regret of PolicyElimination (Algorithm 1) over $T$ rounds is at most

$$16\sqrt{2TK\ln\frac{4T^2N}{\delta}}.$$
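The last inequality in (3.4) rests on an elementary bound on the partial sums of $1/\sqrt{t}$. The source does not spell it out; the following is a sketch of the standard argument, using only that $s^{-1/2}$ is decreasing:

$$\frac{1}{\sqrt{t}} \le \int_{t-1}^{t} \frac{ds}{\sqrt{s}} = 2\left(\sqrt{t} - \sqrt{t-1}\right) \quad\Longrightarrow\quad \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 2\sqrt{T}.$$

The sum telescopes, and multiplying $2\sqrt{T}$ by the prefactor $8\sqrt{2K\ln(4NT^2/\delta)}$ gives the constant 16 and moves $T$ under the square root, as in the theorem.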

## 4 The RandomizedUCB Algorithm

PolicyElimination is the simplest exhibition of the minimax argument, but it has some drawbacks:

1. The algorithm keeps explicit track of the space of good policies (like a version space), which is difficult to implement efficiently in general.

2. If the optimal policy is mistakenly eliminated by chance, the algorithm can never recover.

3. The algorithm requires perfect knowledge of the distribution over contexts.

These difficulties are addressed by RandomizedUCB (or RUCB for short), an algorithm which we present and analyze in this section. Our approach is reminiscent of the UCB algorithm (Auer et al., 2002a), developed for the context-free setting, which keeps an upper-confidence bound on the expected reward for each action. However, instead of choosing the highest upper confidence bound, we randomize over choices according to the value of their empirical performance. The algorithm has the following properties:

1. The optimization step required by the algorithm always considers the full set of policies (i.e., explicit tracking of the set of good policies is avoided), and thus it can be efficiently implemented using an argmax oracle. We discuss this further in Section 5.

2. Suboptimal policies are implicitly used with decreasing frequency by using a non-uniform variance constraint that depends on a policy’s estimated regret. A consequence of this is a bound on the value of the optimization, stated in Lemma 7 below.

3. Instead of $D_X$, the algorithm uses the history of previously seen contexts. The effect of this approximation is quantified in Theorem 6 below.

The regret of RandomizedUCB is the following:

###### Theorem 5.

For all distributions $D$ over $(x, \vec{r})$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1-\delta$, the regret of RandomizedUCB (Algorithm 2) over $T$ rounds is at most

$$O\left(\sqrt{TK\log(TN/\delta)} + K\log(NK/\delta)\right).$$

The proof is given in Appendix D.4. Here, we present an overview of the analysis.

### 4.1 Empirical Variance Estimates

A key technical prerequisite for the regret analysis is the accuracy of the empirical variance estimates. For a distribution $P$ over policies $\Pi$ and a particular policy $\pi \in \Pi$, define

$$\begin{aligned}
V_{P,\pi,t} &= E_{x\sim D_X}\left[\frac{1}{(1-K\mu_t)\,W_P(x,\pi(x)) + \mu_t}\right], \\
\hat{V}_{P,\pi,t} &= \frac{1}{t-1}\sum_{i=1}^{t-1} \frac{1}{(1-K\mu_t)\,W_P(x_i,\pi(x_i)) + \mu_t}.
\end{aligned}$$

The first quantity is (a bound on) the variance incurred by an importance-weighted estimate of reward in round $t$ using the action distribution induced by $P$, and the second quantity is an empirical estimate of $V_{P,\pi,t}$ using the finite sample $\{x_1, \ldots, x_{t-1}\}$ drawn from $D_X$. We show that for all distributions $P$ and all $\pi \in \Pi$, $\hat{V}_{P,\pi,t}$ is close to $V_{P,\pi,t}$ with high probability.
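The empirical quantity is a plain average over the contexts seen so far, which the following sketch makes explicit. The function name, argument layout, and the two-policy example are assumptions for illustration only.

```python
def empirical_variance_bound(policy_weights, policies, target_policy,
                             contexts, mu, num_actions):
    """Average over seen contexts of 1 / ((1 - K*mu) * W_P(x, target(x)) + mu)."""
    total = 0.0
    for x in contexts:
        chosen = target_policy(x)
        # Weight that the distribution over policies puts on the target's action.
        w = sum(wt for wt, pi in zip(policy_weights, policies) if pi(x) == chosen)
        total += 1.0 / ((1 - num_actions * mu) * w + mu)
    return total / len(contexts)

# Hypothetical example: two constant policies with equal weight, K = 2.
always_zero = lambda x: 0
always_one = lambda x: 1
v_hat = empirical_variance_bound(
    [0.5, 0.5], [always_zero, always_one], always_zero,
    contexts=[0, 1, 2], mu=0.1, num_actions=2,
)
```

With equal weight on both actions the smoothed probability of the target's action is 0.5 in every context, so the estimate is 2. Concentrating the weight away from the target's action would push the value toward its maximum of `1 / mu`.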

###### Theorem 6.

For any , with probability at least ,

$$V_{P,\pi,t} \le (1+\epsilon)\cdot\hat{V}_{P,\pi,t} + \frac{7500}{\epsilon^3}\cdot K$$

for all distributions over , all , and all .

The proof appears in Appendix C.

### 4.2 Regret Analysis

Central to the analysis is the following lemma that bounds the value of the optimization in each round. It is a direct corollary of Lemma 24 in Appendix D.4.

###### Lemma 7.

If $\mathrm{OPT}_t$ is the value of the optimization problem (4.1) in round $t$, then

$$\mathrm{OPT}_t \le O\left(\sqrt{\frac{K C_{t-1}}{t-1}}\right) = O\left(\sqrt{\frac{K\log(Nt/\delta)}{t}}\right).$$

This lemma implies that the algorithm is always able to select a distribution over the policies that focuses mostly on the policies with low estimated regret. Moreover, the variance constraints ensure that good policies never appear too bad, and that only bad policies are allowed to incur high variance in their reward estimates. Hence, minimizing the objective in (4.1) is an effective surrogate for minimizing regret.

The bulk of the analysis consists of analyzing the variance of the importance-weighted reward estimates $\eta_t(\pi)$, and showing how they relate to their actual expected rewards $\eta_D(\pi)$. The details are deferred to Appendix D.

## 5 Using an Argmax Oracle

In this section, we show how to solve the optimization problem (4.1) using the argmax oracle (AMO) for our set of policies. Namely, we describe an algorithm running in polynomial time independent of the number of policies (or rather, dependent only on $\log N$, the representation size of a policy), which makes queries to AMO to compute a distribution over policies suitable for the optimization step of Algorithm 2.

This algorithm relies on the ellipsoid method. The ellipsoid method is a general technique for solving convex programs equipped with a separation oracle. A separation oracle is defined as follows:

###### Definition 2.

Let $S$ be a convex set in $\mathbb{R}^n$. A separation oracle for $S$ is an algorithm that, given a point $x \in \mathbb{R}^n$, either declares correctly that $x \in S$, or produces a hyperplane $H$ such that $x$ and $S$ are on opposite sides of $H$.

We do not describe the ellipsoid algorithm here (since it is standard), but only spell out its key properties in the following lemma. For a point $x \in \mathbb{R}^n$ and $r \ge 0$, we use the notation $B(x,r)$ to denote the ball of radius $r$ centered at $x$.

###### Lemma 8.

Suppose we are required to decide whether a convex set $S \subseteq \mathbb{R}^n$ is empty or not. We are given a separation oracle for $S$ and two numbers $R$ and $r$, such that $S \subseteq B(0,R)$ and, if $S$ is non-empty, then there is a point $x^\star$ such that $S \supseteq B(x^\star, r)$. The ellipsoid algorithm decides correctly whether $S$ is empty or not by executing at most $O(n^2\log(R/r))$ iterations, each involving one call to the separation oracle and additional $O(n^2)$ processing time.

We now write a convex program whose solution is the required distribution, and show how to solve it using the ellipsoid method by giving a separation oracle for its feasible set using .

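Fix a time period $t$. Let $X_{t-1}$ be the set of all contexts seen so far, i.e. $X_{t-1} = \{x_1, \ldots, x_{t-1}\}$. We embed all policies $\pi \in \Pi$ in $\mathbb{R}^{(t-1)K}$, with coordinates identified with $(x,a) \in X_{t-1}\times A$. With abuse of notation, a policy $\pi$ is represented by the vector $\pi$ with coordinate $\pi(x,a) = 1$ if $\pi(x) = a$ and $0$ otherwise. Let $C$ be the convex hull of all policy vectors $\pi$. Recall that a distribution $P$ over policies corresponds to a point inside $C$, i.e., $W_P(x,a) = \sum_{\pi:\ \pi(x)=a} P(\pi)$, and that $W'(x,a) = (1-K\mu_t)W(x,a) + \mu_t$, where $\mu_t$ is as defined in Algorithm 2. Also define . In the following, we use the notation $x \sim h_{t-1}$ to denote a context drawn uniformly at random from $X_{t-1}$.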

Consider the following convex program:

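$$\begin{aligned}
\min\ & s \\
\text{s.t.}\quad & \Delta_{t-1}(W) \le s && (5.1) \\
& W \in C && (5.2) \\
& \forall Z \in C:\ E_{x\sim h_{t-1}}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \le \max\left\{4K,\ \beta_t\,\Delta_{t-1}(Z)^2\right\} && (5.3)
\end{aligned}$$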

We claim that this program is equivalent to the RUCB optimization problem (4.1), up to finding an explicit distribution over policies which corresponds to the optimal solution. This can be seen as follows. Since we require , it can be interpreted as being equal to for some distribution over policies . The constraints (5.3) are equivalent to (4.1) by substitution .

The above convex program can be solved by performing a binary search over $s$ and testing feasibility of the constraints. For a fixed value of $s$, we refer to the problem defined by (5.1)–(5.3) as the feasibility problem.
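The binary search itself is generic, and a minimal sketch is given below. It assumes feasibility is monotone in the objective value (relaxing the bound in (5.1) can only enlarge the feasible set). The predicate `is_feasible` is a hypothetical stand-in for the ellipsoid-based feasibility test developed in the rest of this section.

```python
def minimize_by_bisection(is_feasible, low, high, tolerance=1e-6):
    """Smallest s in [low, high], up to tolerance, with is_feasible(s) true.

    Assumes monotonicity: if s is feasible then every larger s is feasible.
    Returns None when even the largest candidate is infeasible.
    """
    if not is_feasible(high):
        return None
    while high - low > tolerance:
        mid = (low + high) / 2
        if is_feasible(mid):
            high = mid   # feasible: the optimum is at or below mid
        else:
            low = mid    # infeasible: the optimum is above mid
    return high

# Hypothetical stand-in for the feasibility test: feasible exactly when s >= 0.3.
s_star = minimize_by_bisection(lambda s: s >= 0.3, low=0.0, high=1.0)
```

Each halving costs one feasibility test, so reaching accuracy `tolerance` on an interval of length `high - low` takes about $\log_2((\text{high}-\text{low})/\text{tolerance})$ tests.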

We now give a sketch of how we construct a separation oracle for the feasible region of this feasibility problem. The details of the algorithm are somewhat complicated because we need to ensure that the feasible region, when non-empty, has a non-negligible volume (recall the requirements of Lemma 8). This necessitates allowing a small error in satisfying the constraints of the program. We leave the details to Appendix E. Modulo these details, the construction of the separation oracle essentially implies that we can solve the feasibility problem.

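Before giving the construction of the separation oracle, we first show that AMO allows us to do linear optimization over $C$ efficiently:

###### Lemma 9.

Given a vector $w \in \mathbb{R}^{(t-1)K}$, we can compute $\arg\max_{Z\in C} w\cdot Z$ using one invocation of AMO.

###### Proof.

The sequence for AMO consists of $x_{t'} \in X_{t-1}$ and $\vec{r}_{t'}(a) = w(x_{t'}, a)$. The lemma now follows since $w\cdot\pi = \sum_{x\in X_{t-1}} w(x, \pi(x))$, and a linear function over the convex hull $C$ attains its maximum at one of the policy vectors. ∎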
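The reduction in this lemma amounts to reading the linear objective as a table of rewards, one row per seen context, and handing that table to the oracle. The sketch below does this with a hypothetical exhaustive oracle over all four deterministic policies on two contexts; all names are assumptions made for the example.

```python
from itertools import product

def linear_argmax_via_oracle(w, contexts, num_actions, oracle):
    """Maximize a linear objective over policies by passing rewards r_x(a) = w[(x, a)]."""
    reward_vectors = [[w[(x, a)] for a in range(num_actions)] for x in contexts]
    return oracle(contexts, reward_vectors)

def make_enumerating_oracle(policies):
    # Hypothetical oracle: exhaustive search, only viable for tiny policy sets.
    def oracle(contexts, reward_vectors):
        return max(
            policies,
            key=lambda pi: sum(r[pi(x)] for x, r in zip(contexts, reward_vectors)),
        )
    return oracle

contexts = [0, 1]
# All deterministic maps from {0, 1} to {0, 1}, encoded as lookup tuples.
tables = list(product(range(2), repeat=2))
policies = [lambda x, t=t: t[x] for t in tables]
w = {(0, 0): 0.2, (0, 1): 0.9, (1, 0): 0.7, (1, 1): 0.1}
best = linear_argmax_via_oracle(w, contexts, 2, make_enumerating_oracle(policies))
```

The returned policy picks action 1 on context 0 and action 0 on context 1, the coordinate-wise maximizers of `w`. A single oracle call suffices because the objective is linear, so no mixture of policies can beat the best single policy.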

We need another simple technical lemma which explains how to get a separating hyperplane for violations of convex constraints:

###### Lemma 10.

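For $x \in \mathbb{R}^n$, let $f(x)$ be a convex function of $x$, and consider the convex set $K$ defined by $K = \{x : f(x) \le 0\}$. Suppose we have a point $y$ such that $f(y) > 0$. Let $\nabla f(y)$ be a subgradient of $f$ at $y$. Then the hyperplane $f(y) + \nabla f(y)\cdot(x-y) = 0$ separates $y$ from $K$.

###### Proof.

Let $g(x) = f(y) + \nabla f(y)\cdot(x-y)$. By the convexity of $f$, we have $f(x) \ge g(x)$ for all $x$. Thus, for any $x \in K$, we have $g(x) \le f(x) \le 0$. Since $g(y) = f(y) > 0$, we conclude that $g(x) = 0$ separates $y$ from $K$. ∎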
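The construction in this lemma is the first-order linearization of the convex function at the violating point. A minimal sketch, using the unit disc as an illustrative convex set (an assumption for the example, unrelated to the policy polytope), is:

```python
def separating_hyperplane(f, subgradient, y):
    """Linearize f at y: returns (normal, offset) for g(x) = normal . x + offset.

    By convexity g(x) <= f(x) everywhere, and g(y) = f(y), so g is positive at a
    violating point y and non-positive on the set where f(x) <= 0.
    """
    grad = subgradient(y)
    offset = f(y) - sum(g * yi for g, yi in zip(grad, y))
    return grad, offset

def evaluate(hyperplane, x):
    """Value of the affine function g at the point x."""
    normal, offset = hyperplane
    return sum(n * xi for n, xi in zip(normal, x)) + offset

# Illustrative convex function: the unit disc is the set where f(x) <= 0.
f = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0
grad_f = lambda x: [2 * x[0], 2 * x[1]]
plane = separating_hyperplane(f, grad_f, [2.0, 0.0])
```

For the point $(2, 0)$ the linearization is $g(x) = 4x_0 - 5$, which is positive at $(2,0)$ and at most $-1$ on the disc, so the line $g(x) = 0$ separates the two.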

Now given a candidate point $W$, a separation oracle can be constructed as follows. We check whether $W$ satisfies the constraints of the feasibility problem. If any constraint is violated, then we find a hyperplane separating $W$ from all points satisfying the constraint.

1. First, for constraint (5.1), note that $\eta_{t-1}(W)$ is linear in $W$, and so we can compute $\max_{\pi} \eta_{t-1}(\pi)$ via AMO as in Lemma 9. We can then compute $\Delta_{t-1}(W)$ and check if the constraint is satisfied. If not, then the constraint, being linear, automatically yields a separating hyperplane.

2. Next, we consider constraint (5.2). To check if $W \in C$, we use the perceptron algorithm. We shift the origin to $W$, and run the perceptron algorithm with all points $\pi \in \Pi$ being positive examples. The perceptron algorithm aims to find a hyperplane putting all policies $\pi \in \Pi$ on one side. In each iteration of the perceptron algorithm, we have a candidate hyperplane (specified by its normal vector), and then if there is a policy $\pi$ that is on the wrong side of the hyperplane, we can find it by running a linear optimization over $C$ in the negative normal vector direction as in Lemma 9.

If $W \notin C$, then in a bounded number of iterations (depending on the distance of $W$ from $C$, and the maximum magnitude $\|\pi\|_2$) we obtain a separating hyperplane. In passing we also note that if $W \in C$, the same technique allows us to explicitly compute an approximate convex combination of policies in $\Pi$ that yields $W$. This is done by running the perceptron algorithm as before and stopping after the bound on the number of iterations has been reached. Then we collect all the policies we have found in the run of the perceptron algorithm, and we are guaranteed that $W$ is close in distance to their convex hull. We can then find the closest point in the convex hull of these policies by solving a simple quadratic program.

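3. Finally, we consider constraint (5.3). We rewrite $\eta_{t-1}(W)$ as $\eta_{t-1}(W) = w\cdot W$, where $w(x,a) = \frac{1}{t-1}\sum_{t':\ x_{t'}=x,\ a_{t'}=a} r_{t'}/p_{t'}$. Thus, $\Delta_{t-1}(Z) = v - w\cdot Z$, where $v = \max_{\pi'} \eta_{t-1}(\pi') = \max_{\pi'} w\cdot\pi'$, which can be computed by using AMO once.

Next, using the candidate point $W$, compute the vector $u$ defined as $u(x,a) = \frac{n_x/(t-1)}{W'(x,a)}$, where $n_x$ is the number of times $x$ appears in $h_{t-1}$, so that $E_{x\sim h_{t-1}}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] = u\cdot Z$. Now, the problem reduces to finding a point $Z \in C$ which violates the constraint

$$u\cdot Z \le \max\left\{4K,\ \beta_t\,(w\cdot Z - v)^2\right\}.$$

Define $f(Z) = \max\{4K,\ \beta_t(w\cdot Z - v)^2\} - u\cdot Z$. Note that $f$ is a convex function of $Z$. Finding a point $Z$ that violates the above constraint is equivalent to solving the following (convex) program:

$$\begin{aligned}
f(Z) &\le 0 && (5.4) \\
Z &\in C && (5.5)
\end{aligned}$$

To do this, we again apply the ellipsoid method. For this, we need a separation oracle for the program. A separation oracle for the constraints (5.5) can be constructed as in Step 2 above. For the constraints (5.4), if the candidate solution $Z$ has $f(Z) > 0$, then we can construct a separating hyperplane as in Lemma 10.

Suppose that after solving the program, we get a point $Z \in C$ such that $f(Z) \le 0$, i.e. $W$ violates the constraint (5.3) for $Z$. Then since constraint (5.3) is convex in $W$, we can construct a separating hyperplane as in Lemma 10. This completes the description of the separation oracle.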

Working out the details carefully yields the following theorem, proved in Appendix E:

###### Theorem 11.

There is an iterative algorithm with iterations, each involving one call to and processing time, that either declares correctly that is infeasible or outputs a distribution over policies in such that satisfies

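$$\begin{aligned}
\forall Z \in C:\quad E_{x\sim h_{t-1}}\left[\sum_a \frac{Z(x,a)}{W'_P(x,a)}\right] &\le \max\left\{4K,\ \beta_t\,\Delta_{t-1}(Z)^2\right\} + 5\epsilon, \\
\Delta_{t-1}(W) &\le s + 2\gamma,
\end{aligned}$$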

where and .

## 6 Delayed Feedback

In a delayed feedback setting, we observe rewards with a $\tau$ step delay according to:

1. The world presents features $x_t$.

2. The learning algorithm chooses an action $a_t \in \{1, \ldots, K\}$.

3. The world presents a reward $r_{t-\tau}$ for the action $a_{t-\tau}$ given the features $x_{t-\tau}$.

We deal with delay by suitably modifying Algorithm 1 to incorporate the delay $\tau$, giving Algorithm 3.
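The delayed protocol can be simulated with a simple queue: each round's feedback is held back and released a fixed number of rounds later. The sketch below is only an illustration of that bookkeeping, with hypothetical callback names and a deterministic toy world. It is not the DelayedPE algorithm itself.

```python
from collections import deque

def run_with_delay(choose_action, update, draw_round, delay, num_rounds):
    """Feedback for round t is handed to `update` only at round t + delay."""
    pending = deque()
    for t in range(num_rounds):
        context, reward_vector = draw_round(t)
        action = choose_action(context)
        # The reward is generated now but stays invisible to the learner.
        pending.append((context, action, reward_vector[action]))
        if len(pending) > delay:
            update(*pending.popleft())  # release feedback from `delay` rounds ago

seen = []
run_with_delay(
    choose_action=lambda x: 0,
    update=lambda x, a, r: seen.append((x, a, r)),
    draw_round=lambda t: (t, [float(t), 0.0]),  # hypothetical deterministic world
    delay=3,
    num_rounds=10,
)
```

With a delay of 3 over 10 rounds, only the first 7 rounds ever report back. The learner acts without any feedback for the first 3 rounds, which is the intuition behind an additive cost in the delay.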

Now we can prove the following theorem, which shows the delay has an additive effect on regret.

###### Theorem 12.

For all distributions $D$ over $(x, \vec{r})$ with $K$ actions, for all sets of $N$ policies $\Pi$, and all delay intervals $\tau$, with probability at least $1-\delta$, the regret of DelayedPE (Algorithm 3) is at most

$$16\sqrt{2K\ln\frac{4T^2N}{\delta}}\,\left(\tau+\sqrt{T}\right).$$

###### Proof.

Essentially as Theorem 4. The variance bound is unchanged because it depends only on the context distribution. Thus, it suffices to replace