# An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives

We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm with near-optimal regret bounds for it. We give a computationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Agarwal et al. (2014) for the non-constrained version of the problem. The computational time of our algorithm scales logarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. (2014). We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors.

## 1 Introduction

Multi-armed bandits (e.g., Bubeck and Cesa-Bianchi (2012)) are a classic model for studying the exploration-exploitation tradeoff faced by a decision-making agent, which learns to maximize cumulative reward through sequential experimentation in an initially unknown environment. The contextual bandit problem (Langford and Zhang, 2008), also known as associative reinforcement learning (Barto and Anandan, 1985), generalizes multi-armed bandits by allowing the agent to take actions based on contextual information: in every round, the agent observes the current context, takes an action, and observes a reward that is a random variable with distribution conditioned on the context and the taken action. Despite many recent advances and successful applications of bandits, one of the major limitations of the standard setting is the lack of “global” constraints that are common in many important real-world applications. For example, actions taken by a robot arm may have different levels of power consumption, and the total power consumed by the arm is limited by the capacity of its battery. In online advertising, each advertiser has her own budget, so that her advertisement cannot be shown more than a certain number of times. In dynamic pricing, there are a certain number of objects for sale and the seller offers prices to a sequence of buyers with the goal of maximizing revenue, but the number of sales is limited by the supply.

Recently, a few papers started to address this limitation by considering very special cases such as a single resource with a budget constraint (Ding et al., 2013; Guha and Munagala, 2007; György et al., 2007; Madani et al., 2004; Tran-Thanh et al., 2010, 2012), and application-specific bandit problems such as the ones motivated by online advertising (Chakrabarti and Vee, 2012; Pandey and Olston, 2006), dynamic pricing (Babaioff et al., 2015; Besbes and Zeevi, 2009) and crowdsourcing (Badanidiyuru et al., 2012; Singla and Krause, 2013; Slivkins and Vaughan, 2013). Subsequently, Badanidiyuru et al. (2013) introduced a general problem capturing most previous formulations. In this problem, which they called Bandits with Knapsacks (BwK), there are d different resources, each with a pre-specified budget. Each action taken by the agent results in a d-dimensional resource consumption vector, in addition to the regular (scalar) reward. The goal of the agent is to maximize the total reward, while keeping the cumulative resource consumption below the budget. The BwK model was further generalized to the BwCR (Bandits with convex Constraints and concave Rewards) model by Agrawal and Devanur (2014), which allows for arbitrary concave objectives and convex constraints on the sum of the resource consumption vectors in all rounds. Both papers adapted the popular Upper Confidence Bound (UCB) technique to obtain near-optimal regret guarantees. However, the focus was on the non-contextual setting.

There has been significant recent progress (Agarwal et al., 2014; Dudík et al., 2011) in algorithms for general (instead of linear (Abbasi-yadkori et al., 2012; Chu et al., 2011)) contextual bandits where the context and reward can have arbitrary correlation, and the algorithm competes with some arbitrary set of context-dependent policies. Dudík et al. (2011) achieved the optimal regret bound for this remarkably general contextual bandits problem, assuming access to the policy set only through a linear optimization oracle, instead of explicit enumeration of all policies as in previous work (Auer et al., 2002; Beygelzimer et al., 2011). However, the algorithm presented in Dudík et al. (2011) was not tractable in practice, as it makes too many calls to the optimization oracle. Agarwal et al. (2014) presented a simpler and computationally efficient algorithm, with a running time that scales as the square-root of the logarithm of the policy space size, and achieves an optimal regret bound.

Combining contexts and resource constraints, Agrawal and Devanur (2014) also considered a static linear contextual version of BwCR, where the expected reward is linear in the context. (In particular, each arm is associated with a fixed vector, and the resulting outcomes for this arm have expected value linear in this vector.) Wu et al. (2015) considered the special case of random linear contextual bandits with a single budget constraint, and gave near-optimal regret guarantees for it. Badanidiyuru et al. (2014) extended the general contextual version of bandits with arbitrary policy sets to allow budget constraints, thus obtaining a contextual version of BwK, a problem they called Resourceful Contextual Bandits (RCB). We will refer to this problem as CBwK (Contextual Bandits with Knapsacks), to be consistent with the naming of related problems defined in the paper. They gave a computationally inefficient algorithm, based on Dudík et al. (2011), with a regret that was optimal in most regimes. Their algorithm was defined as a mapping from the history and the context to an action, but the computational issue of finding this mapping was not addressed. They posed an open question of achieving computational efficiency while maintaining a similar or even a sub-optimal regret.

#### Main Contributions.

In this paper, we present a simple and computationally efficient algorithm for CBwK/RCB, based on the algorithm of Agarwal et al. (2014). Similar to Agarwal et al. (2014), the running time of our algorithm scales as the square-root of the logarithm of the size of the policy set (access to the policy set is via an “arg max oracle”, as in Agarwal et al. (2014)), thus resolving the main open question posed by Badanidiyuru et al. (2014). Our algorithm even improves the regret bound of Badanidiyuru et al. (2014) by a factor of √d. Another improvement over Badanidiyuru et al. (2014) is that while they need to know the marginal distribution of contexts, our algorithm does not. A key feature of our techniques is that we need to modify the algorithm in Agarwal et al. (2014) in a very minimal way — in an almost blackbox fashion — thus retaining the structural simplicity of the algorithm while obtaining substantially more general results.

We extend our algorithm to a variant of the problem, which we call Contextual Bandits with concave Rewards (CBwR): in every round, the agent observes a context, takes one of K actions and then observes a d-dimensional outcome vector, and the goal is to maximize an arbitrary Lipschitz concave function of the average of the outcome vectors; there are no constraints. This allows for many more interesting applications, some of which were discussed in Agrawal and Devanur (2014). This setting is also substantially more general than the contextual version considered in Agrawal and Devanur (2014), where the context was fixed and the dependence was assumed to be linear.

#### Organization.

In Section 2, we define the CBwK problem and state our regret bound as Theorem 1. The algorithm is detailed in Section 3, and an overview of the regret analysis is in Section 4. In Section 5, we present CBwR, the problem with concave rewards, state the guaranteed regret bounds, and outline the differences in the algorithm and the analysis. Complete proofs and other details are provided in the appendices.

## 2 Preliminaries and Main Results

#### CBwK.

The CBwK problem was introduced by Badanidiyuru et al. (2014), under the name of Resourceful Contextual Bandits (RCB). We now define this problem.

Let A be a finite set of K actions and X be a space of possible contexts (the analogue of a feature space in supervised learning). To begin with, the algorithm is given a budget B. We then proceed in rounds: in every round t, the algorithm observes context x_t ∈ X, chooses an action a_t ∈ A, and observes a reward r_t(a_t) ∈ [0, 1] and a d-dimensional consumption vector v_t(a_t) ∈ [0, 1]^d. The objective is to take actions that maximize the total reward, ∑_{t=1}^T r_t(a_t), while making sure that the consumption does not exceed the budget, i.e., ∑_{t=1}^T v_t(a_t) ≤ B·1. (More generally, different dimensions could have different budgets, but this formulation is without loss of generality: scale the units of all dimensions so that all the budgets are equal to the smallest one. This preserves the requirement that the consumption vectors are in [0, 1]^d.) The algorithm stops either after T rounds or when the budget is exceeded in one of the dimensions, whichever occurs first. We assume that one of the actions is a “no-op” action, i.e., it always gives a reward of 0 and a consumption vector of all 0s. Furthermore, we make a stochastic assumption: the context, the reward, and the consumption vectors for round t are drawn i.i.d. (independent and identically distributed) from a distribution D over X × [0, 1]^K × [0, 1]^{K×d}. The distribution D is unknown to the algorithm.

#### Policy Set.

Following previous work (Agarwal et al., 2014; Badanidiyuru et al., 2014; Dudík et al., 2011), our algorithms compete with an arbitrary set of policies. Let Π be a finite set of policies that map contexts x ∈ X to actions a ∈ A. (The policies may be randomized in general, but for our results, we may assume without loss of generality that they are deterministic. As observed by Badanidiyuru et al. (2014), we may replace randomized policies with deterministic policies by appending a random seed to the context. This blows up the size of the context space, which does not appear in our regret bounds.) We assume that the policy set Π contains a “no-op” policy that always selects the no-op action regardless of the context. With global constraints, distributions over policies in Π could be strictly more powerful than any policy in Π itself. (E.g., consider two policies that both give reward 1, but each consume 1 unit of a different resource. The optimum solution is to mix uniformly between the two, which does twice as well as using any single policy.) Our algorithms compete with this more powerful set, which is a stronger guarantee than simply competing with fixed policies in Π. For this purpose, define C(Π) as the set of all convex combinations of policies in Π. For a context x, choosing actions with P ∈ C(Π) is equivalent to following a randomized policy that selects action π(x) with probability P(π); we therefore also refer to P as a (mixed) policy. Similarly, define C₀(Π) as the set of all non-negative weights over policies in Π which sum to at most 1. Clearly, C(Π) ⊆ C₀(Π).

#### Benchmark and Regret.

The benchmark for this problem is an optimal static mixed policy, where the budgets are required to be satisfied in expectation only. Let R(P) and V(P) denote respectively the expected reward and the expected consumption vector for a mixed policy P ∈ C(Π). We call a policy P a feasible policy if T·V(P) ≤ B·1. Note that there always exists a feasible policy in C(Π), because of the no-op policy. Define an optimal policy P* as a feasible policy that maximizes the expected reward:

    P^* = \arg\max_{P \in C(\Pi)}\; T\,R(P) \quad \text{s.t.}\quad T\,V(P) \le B\,\mathbf{1}. \qquad (1)

The reward of this optimal policy is denoted by OPT := T·R(P*). We are interested in minimizing the regret, defined as

    \text{regret}(T) := \text{OPT} - \textstyle\sum_{t=1}^{T} r_t(a_t). \qquad (2)

#### AMO.

Since the policy set Π is extremely large in most interesting applications, accessing it by explicit enumeration is impractical. For the purpose of efficient implementation, we instead only access Π via a maximization oracle. Employing such an oracle is common when considering contextual bandits with an arbitrary set of policies (Agarwal et al., 2014; Dudík et al., 2011; Langford and Zhang, 2008). Following previous work, we call this oracle an “arg max oracle”, or AMO.

###### Definition 1.

For a set of policies Π, the arg max oracle (AMO) is an algorithm which, for any sequence of contexts and reward vectors, (x₁, r₁), …, (x_t, r_t), returns

    \arg\max_{\pi \in \Pi} \textstyle\sum_{\tau=1}^{t} r_\tau(\pi(x_\tau)). \qquad (3)
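For intuition, the oracle of Definition 1 can be sketched by brute-force enumeration over a tiny policy class. This is purely illustrative: practical implementations instead reduce the AMO to cost-sensitive classification, and the names below are ours, not the paper's.

```python
def amo(policies, contexts, rewards):
    """Arg-max oracle (Definition 1) by enumeration: return the policy
    maximizing sum_t rewards[t][policy(contexts[t])].
    `policies` is a list of functions mapping a context to an arm index;
    feasible only for tiny classes, but matches the oracle's contract."""
    def total(pi):
        return sum(r[pi(x)] for x, r in zip(contexts, rewards))
    return max(policies, key=total)
```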

#### Main Results.

Our main result is a computationally efficient low-regret algorithm for CBwK. Furthermore, we improve the regret bound of Badanidiyuru et al. (2014) by a factor of √d; they present a detailed discussion on the optimality of the dependence on the various parameters in this bound.

###### Theorem 1.

For the CBwK problem, there is a polynomial-time algorithm that accesses the policy set only through polynomially many calls to the AMO, and with probability at least 1 − δ has regret

    \text{regret}(T) = O\!\left(\left(\tfrac{\text{OPT}}{B} + 1\right)\sqrt{KT\ln(dT|\Pi|/\delta)}\right).

Note that the above regret bound is meaningful only when B is of order √(KT ln(dT|Π|/δ)) or larger; therefore, in the rest of the paper we assume that B ≥ c√(KT ln(dT|Π|/δ)) for some large enough constant c. We also extend our results to a version with a concave reward function, as outlined in Section 5. For the rest of the paper, we treat the confidence parameter δ as fixed, and define quantities that depend on it.

## 3 Algorithm for the CBwK problem

From previous work on multi-armed bandits, we know that the key challenges in finding the “right” policy are that (1) it should concentrate fast enough on the empirically best policy (based on data observed so far), (2) the probability of choosing an action must be large enough to enable sufficient exploration, and (3) it should be efficiently computable. Agarwal et al. (2014) show that all of these can be addressed by solving a properly defined optimization problem, with the help of an AMO. We have the additional technical challenge of dealing with global constraints. As mentioned earlier, one complication that arises right away is that, due to the knapsack constraints, the algorithm has to compete against the best mixed policy in C(Π), rather than the best pure policy. In the following, we highlight the main technical difficulties we encounter, and our solution to these difficulties.

Some definitions are in order before we describe the algorithm. Let H_t denote the history of chosen actions and observations before time t, consisting of records of the form (x_τ, a_τ, r_τ(a_τ), v_τ(a_τ), p_τ(a_τ)), where x_τ, a_τ, r_τ(a_τ), v_τ(a_τ) denote, respectively, the context, action taken, reward, and consumption vector observed at time τ, and p_τ(a_τ) denotes the probability with which action a_τ was taken. (Recall that our algorithm selects actions in a randomized way using a mixed policy.) Although H_t contains observation vectors only for chosen actions, it can be “completed” using the trick of importance sampling: for every action a ∈ A, define the fictitious observation vectors by:

    \hat{r}_\tau(a) := \frac{r_\tau(a_\tau)}{p_\tau(a_\tau)}\,\mathbb{I}\{a_\tau = a\}, \qquad \hat{v}_\tau(a) := \frac{v_\tau(a_\tau)}{p_\tau(a_\tau)}\,\mathbb{I}\{a_\tau = a\}.

Clearly, r̂_τ(a) and v̂_τ(a) are unbiased estimators of r_τ(a) and v_τ(a): for every a ∈ A, E[r̂_τ(a)] = r_τ(a) and E[v̂_τ(a)] = v_τ(a), where the expectations are over the randomization in selecting a_τ.
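The per-round “completion” step can be sketched directly from the definitions above; the function name is ours, and the example merely checks unbiasedness over a two-arm toy instance.

```python
def complete_round(a_taken, p_taken, reward, consumption, K):
    """Importance-sampling completion of one round's observations:
    hat_r[a] = reward / p_taken if a == a_taken else 0, and likewise,
    coordinate-wise, for the consumption vector."""
    d = len(consumption)
    hat_r = [0.0] * K
    hat_v = [[0.0] * d for _ in range(K)]
    hat_r[a_taken] = reward / p_taken
    hat_v[a_taken] = [vj / p_taken for vj in consumption]
    return hat_r, hat_v
```

Averaging the fictitious vectors over the action-selection distribution recovers the true per-arm rewards, which is exactly the unbiasedness property used in the analysis.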

With the “completed” history, it is straightforward to obtain unbiased estimates of the expected reward R(P) and the expected consumption vector V(P) for every policy P ∈ C(Π):

    \hat{R}_t(P) := \mathbb{E}_{\tau \sim [t],\, \pi \sim P}\!\left[\hat{r}_\tau(\pi(x_\tau))\right], \qquad \hat{V}_t(P) := \mathbb{E}_{\tau \sim [t],\, \pi \sim P}\!\left[\hat{v}_\tau(\pi(x_\tau))\right].

The convenient notation τ ∼ [t] above, indicating that τ is drawn uniformly at random from the set of integers {1, …, t}, simply means averaging over time up to step t. It is easy to verify that E[R̂_t(P)] = R(P) and E[V̂_t(P)] = V(P).

Given these estimates, we construct an optimization problem (OP) which aims to find a mixed policy that has a small “empirical regret”, and at the same time provides sufficient exploration over “good” policies. The optimization problem uses a quantity, “the empirical regret of policy P”, to characterize good policies. Agarwal et al. (2014) define it as simply the difference between the empirical reward estimate of a policy and that of the policy with the highest empirical reward. Thus, good policies were characterized as those with high reward. For our problem, however, a policy could have a high reward while its consumption violates the knapsack constraints by a large margin. Such a policy should not be considered a good policy. A key challenge in this problem is therefore to define a single quantity that captures the “goodness” of a policy by appropriately combining rewards and consumption vectors.

We define the quantity Reg(P) (and the corresponding empirical estimate up to round t, denoted ˆReg_t(P)) by combining the regret in reward and the constraint violation using a multiplier Z. The multiplier captures the sensitivity of the problem to violations of the knapsack constraints. It is easy to observe from (1) that increasing the knapsack size from B to B + γ can increase the optimal value OPT by at most (OPT/B)γ. It follows that if a policy violates any knapsack constraint by γ, it can achieve at most (OPT/B)γ more reward than OPT. More precisely,

###### Lemma 2.

For any b > 0, let OPT(b) denote the value of an optimal solution of (1) when the budget is set as b. Then, for any b > 0 and γ ≥ 0,

    \text{OPT}(b + \gamma) \le \text{OPT}(b) + \frac{\text{OPT}(b)}{b}\,\gamma. \qquad (4)

We use this observation to set Z as an estimate of OPT/B. We do this by using the outcomes of the first

    T_0 := \frac{12KT}{B}\ln\frac{d|\Pi|}{\delta}

rounds, during which we do pure exploration (i.e., play an action in A uniformly at random). For notational convenience, in our algorithm description we will index these initial exploration rounds as t = −(T_0 − 1), …, 0, so that the major component of the algorithm can be started from t = 1 and run until t = T − T_0. The following lemma provides a bound on the Z that we estimate. Its proof appears in Appendix B.

###### Lemma 3.

Using the first T_0 rounds of pure exploration, one can compute a quantity Z such that with probability at least 1 − δ,

    \max\left\{\frac{4\,\text{OPT}}{B},\; 1\right\} \le Z \le \frac{24\,\text{OPT}}{B} + 8.

Now, to define Reg(P) and ˆReg_t(P), we combine the regret in reward and the constraint violation using the constant Z as computed above. In these definitions, we use a smaller budget amount

    B' := B - T_0 - c\sqrt{KT\ln(T|\Pi|/\delta)},

for a large enough constant c to be specified later. Here, the budget needed to be decreased by T_0 to account for the budget consumed in the first T_0 exploration rounds. We use a further smaller budget amount to ensure that, with high probability, our algorithm will not abort before the end of the time horizon (T) due to budget violation. For any vector v ∈ [0, 1]^d, let φ(v, B′) denote the amount by which the (per-round average) vector v violates the budget B′, i.e.,

    \phi(v, B') := \max_{j = 1, \dots, d}\left(v_j - \frac{B'}{T}\right)_{\!+}.

Let P′ denote the optimal policy when the budget amount is B′, i.e.,

    P' := \arg\max_{P \in C(\Pi)}\; T\,R(P) \quad \text{s.t.}\quad T\,V(P) \le B'\,\mathbf{1}.

And, let P_t denote the empirically optimal policy for the combination of reward and budget violation, defined as:

    P_t := \arg\max_{P \in C(\Pi)}\; \hat{R}_t(P) - Z\,\phi(\hat{V}_t(P), B'). \qquad (5)

We define

    \text{Reg}(P) := \frac{1}{Z+1}\left(R(P') - R(P) + Z\,\phi(V(P), B')\right),
    \widehat{\text{Reg}}_t(P) := \frac{1}{Z+1}\left[\hat{R}_t(P_t) - Z\,\phi(\hat{V}_t(P_t), B') - \left(\hat{R}_t(P) - Z\,\phi(\hat{V}_t(P), B')\right)\right].

Note that Reg(P) ≥ 0 and ˆReg_t(P) ≥ 0 by definition.
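The violation measure φ and the penalized objective of (5) can be sketched in a few lines, using the per-round-average convention above (function names are illustrative, not from the paper):

```python
def phi(v, B_prime, T):
    """Budget violation phi(v, B'): how far the per-round-average vector v
    exceeds B'/T in the worst coordinate, clipped at zero."""
    return max(max(vj - B_prime / T, 0.0) for vj in v)

def penalized_score(R_hat, V_hat, Z, B_prime, T):
    """Objective maximized in (5): empirical reward minus Z times the
    empirical budget violation."""
    return R_hat - Z * phi(V_hat, B_prime, T)
```

A policy whose consumption stays within B′/T per round in every coordinate incurs zero penalty, so the score reduces to its empirical reward.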

We are now ready to describe the optimization problem, (OP). This is essentially the same as the optimization problem solved in Agarwal et al. (2014), except for the new definition of ˆReg_t, which was described above. It aims to find a mixed policy Q ∈ C₀(Π); recall that the weights of such a Q may sum to less than 1, with the remaining mass assigned to a default policy by the sampling process. Let Q^μ denote a smoothed projection of Q, assigning minimum probability μ to every action: Q^μ(a|x) := (1 − Kμ)Q(a|x) + μ. (OP) depends on the history up to some time t, and a parameter μ that will be set by the algorithm. In the rest of the paper, for convenience, we use a universal constant ψ that scales the constraints of (OP).

Optimization Problem (OP). Given: the history H_t, the constant Z, and μ ∈ [0, 1/K]. Find a Q ∈ C₀(Π) such that the following inequalities hold, where Q^μ denotes the smoothed projection of Q:

    \sum_{\pi \in \Pi} Q(\pi)\,\widehat{\text{Reg}}_t(\pi) \le 2\psi K\mu,
    \forall \pi \in \Pi: \quad \mathbb{E}_{\tau \sim [t]}\!\left[\frac{1}{Q^\mu(\pi(x_\tau)\,|\,x_\tau)}\right] \le 2K + \frac{\widehat{\text{Reg}}_t(\pi)}{\psi\mu}.

The first constraint in (OP) ensures that, under Q, the empirical regret ˆReg_t is “small”. In the second constraint, the left-hand side, as shown in the analysis, is an upper bound on the variance of the estimates R̂_t and V̂_t. These two constraints are critical for deriving the regret bound in Section 4. We give an algorithm that efficiently finds a feasible solution to (OP) (and also shows that a feasible solution always exists).

We are now ready to describe the full algorithm, which is summarized in Algorithm 1. The main body of the algorithm shares the same structure as the ILOVETOCONBANDITS algorithm for contextual bandits (Agarwal et al., 2014), with important changes necessary to deal with the knapsack constraints. We use the first T_0 rounds to do pure exploration and calculate Z as given by Lemma 3. (These time steps are indexed from −(T_0 − 1) to 0.) The algorithm then proceeds in epochs with pre-defined lengths; epoch m consists of time steps indexed from τ_{m−1} + 1 to τ_m, inclusively. The algorithm can work with any epoch schedule in which epoch lengths grow at most geometrically; our results hold for a doubling schedule. However, the algorithm can choose to solve (OP) more frequently than what we use here to get a lower regret (but still within constant factors), at the cost of higher computational time. At the end of an epoch m, it computes a mixed policy in C₀(Π) by solving an instance of (OP), which is then used for the entire next epoch. Additionally, at the end of every epoch m, the algorithm computes the empirically best policy P_{τ_m} as defined in Equation (5), which the algorithm uses as the default policy in the sampling process defined below. The initial default policy can be chosen arbitrarily, e.g., as the uniform policy.

The sampling process, in Step 8, samples an action from the computed mixed policy. It takes the following as input: the context, the mixed policy Q returned by the optimization problem (OP) for the current epoch, the default mixed policy, and a scalar μ for the minimum action-selection probability. Since Q may not be a proper distribution (as its weights may sum to a number less than 1), the sampling process first computes a completed distribution Q̃, by assigning any remaining mass (up to 1) to the default policy. Then, it picks an action from the smoothed projection of this distribution, defined as: Q̃^μ(a|x) := (1 − Kμ)Q̃(a|x) + μ for every action a.
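The two-step sampling process just described (complete with the default policy, then smooth) can be sketched as follows; the function and its signature are illustrative, not the paper's Algorithm 1.

```python
import random

def sample_action(x, Q, default_policy, mu, K):
    """One draw of the sampling process, sketched: Q maps policies
    (functions context -> arm) to non-negative weights summing to at most 1;
    leftover mass goes to default_policy, and the induced action distribution
    is smoothed so every arm has probability at least mu.
    Returns the sampled arm and the full action distribution."""
    p = [0.0] * K
    for policy, w in Q.items():
        p[policy(x)] += w
    p[default_policy(x)] += 1.0 - sum(Q.values())   # complete to a distribution
    p = [(1.0 - K * mu) * pa + mu for pa in p]      # smoothed projection
    return random.choices(range(K), weights=p)[0], p
```

The smoothing step is what guarantees the minimum selection probability needed for the importance-sampling estimates to have bounded variance.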

The algorithm aborts (in Step 10) if the budget is consumed for any resource.

### 3.1 Computational complexity: Solving (OP) using AMO

Algorithm 1 requires solving (OP) at the end of every epoch. Agarwal et al. (2014) gave an algorithm that solves their analogue of (OP) using access to the AMO. We use a similar algorithm, except that calls to the AMO are now replaced by calls to a knapsack-constrained optimization problem over the empirical distribution. This optimization problem is identical in structure to the optimization problem defining P_t in Equation (5), which we need to solve as well. We can solve both of these problems using the AMO, as outlined below.

We rewrite Equation (5) as a linear optimization problem where the domain is the intersection of two polytopes. We represent a point in this domain as (x, y, λ), where x and λ are scalars and y is a vector in d dimensions. Let

    K_1 := \left\{(x, y, \lambda) : x = \hat{R}_t(P),\; y = \hat{V}_t(P) \text{ for some } P \in C(\Pi),\; \lambda \in [0, 1]\right\},

be the set of all reward and consumption vectors achievable on the empirical outcomes up to time t through some policy in C(Π). Let

    K_2 := \left\{(x, y, \lambda) : y \le (B'/T + \lambda)\,\mathbf{1}\right\} \cap [0, 1]^{d+2},

be the constraint set, given by relaxing the knapsack constraints by λ. Now Equation (5) is equivalent to

    \max\; x - Z\lambda \quad \text{such that} \quad (x, y, \lambda) \in K_1 \cap K_2. \qquad (6)

Recently, Lee et al. (2015, Theorem 49) gave a fast algorithm to solve problems of the kind above, given access to oracles that solve linear optimization problems over K_1 and K_2. (Alternately, one could use the algorithms of Vaidya (1989a, b) to solve the same problem, with a slightly weaker polynomial running time.) The algorithm makes polynomially many calls to these oracles, and takes an additional polynomial running time; here, the running time hides terms of order ln(1/ε), where ε is the accuracy needed of the solution. A linear optimization problem over K_1 is equivalent to the AMO; the linear function defines the “rewards” that the AMO optimizes for. (These rewards may not lie in [0, 1], but an affine transformation of the rewards can bring them into [0, 1] without changing the solution.) A linear optimization problem over K_2 is trivial to solve. As an aside, a solution output by this algorithm has support equal to the set of policies output by the AMO during the run of the algorithm, and hence has small size.

Using this, (OP) and Equation (5) can each be solved with polynomially many calls to the AMO at the end of every epoch. The complete algorithm to solve (OP) is in Appendix C.

## 4 Regret Analysis

This section provides an outline of the proof of Theorem 1, which bounds the regret of Algorithm 1. (A complete proof is given in Appendix D.) The proof structure is similar to the proof of Agarwal et al. (2014, Theorem 2), with major differences coming from the changes necessary to deal with mixed policies and constraint violations. We defined the algorithm to minimize ˆReg_t (through the first constraint in the optimization problem (OP)), and the first step is to show that this implies a bound on Reg as well. The alternate definitions of Reg and ˆReg_t require a different analysis than that in Agarwal et al. (2014), and this difference is highlighted in the proof outline of Lemma 5 below. Once we have a bound on Reg, we show that this implies a bound on the actual reward ∑_t r_t(a_t), as well as on the probability of violating the knapsack constraints.

We start by proving that the empirical average reward R̂_t(P) and consumption vector V̂_t(P) for any mixed policy P are close to the true averages R(P) and V(P), respectively. We define an epoch index m_0 such that, for initial epochs m ≤ m_0, the minimum action-selection probability μ_m is relatively large. Recall that μ_m is the minimum probability of playing any action in epoch m, defined in Step 1 of Algorithm 1. Therefore, for these initial epochs the variance of the importance sampling estimates is small, and we can obtain a stronger bound on the estimation error. For subsequent epochs, μ_m decreases, and we get error bounds in terms of the maximum variance of the estimates for policy P across all epochs before time t, denoted V_t(P). In fact, the second constraint in the optimization problem (OP) seeks to bound this variance.

The precise definitions of above-mentioned quantities are provided in Appendix D.

###### Lemma 4.

With probability at least 1 − δ, for all policies P ∈ C(Π),

    \max\left\{\left|\hat{R}_t(P) - R(P)\right|,\; \left\|\hat{V}_t(P) - V(P)\right\|_\infty\right\} \le \begin{cases} \sqrt{8Kd_t/t}, & t \in \text{epoch } m \le m_0,\; t \ge t_0, \\[4pt] V_t(P)\,\mu_{m-1} + \dfrac{d_t}{t\,\mu_{m-1}}, & t \in \text{epoch } m,\; m > m_0. \end{cases}

Here, d_t is a logarithmic confidence term and t_0 an initial time threshold; both are defined precisely in Appendix D.

Now suppose the error bounds in the above lemma hold. A major step is to show that, for every policy P ∈ C(Π), the empirical regret ˆReg_t(P) and the actual regret Reg(P) are close in a particular sense.

###### Lemma 5.

Assume that the events in Lemma 4 hold. Then, for all epochs m > m_0, all rounds t in epoch m, and all policies P ∈ C(Π),

    \text{Reg}(P) \le 2\,\widehat{\text{Reg}}_t(P) + c_0 K\psi\mu_m, \qquad \text{and} \qquad \widehat{\text{Reg}}_t(P) \le 2\,\text{Reg}(P) + c_0 K\psi\mu_m,

for ψ as defined in Section 3, and c_0 a universal constant.

###### Proof Outline.

The proof of the above lemma is by induction, using the second constraint in (OP) to bound the variance V_t(P). Below, we prove the base case. This proof demonstrates the importance of appropriately choosing Z. Consider a round t in the first epoch after the pure-exploration phase. For all P ∈ C(Π),

    (Z+1)\left(\widehat{\text{Reg}}_t(P) - \text{Reg}(P)\right) = \hat{R}_t(P_t) - \hat{R}_t(P) - R(P') + R(P) - Z\left[\phi(\hat{V}_t(P_t), B') - \phi(\hat{V}_t(P), B') + \phi(V(P), B')\right].

We can assume that B ≥ c√(KT ln(dT|Π|/δ)) for a large enough constant c (otherwise the regret guarantee in Theorem 1 is meaningless). Then, we have that B − B′ ≤ B/2, implying B′ ≥ B/2. Also, observe that since B′ ≤ B, OPT(B′) ≤ OPT. Then, by Lemma 2 and the choice of Z as specified by Lemma 3, we have that for any γ ≥ 0,

    \text{OPT}(B' + \gamma) \le \text{OPT}(B') + \frac{Z}{2}\,\gamma. \qquad (8)

Now, since P′ is defined as the optimal policy for budget B′, we obtain that T·R(P′) = OPT(B′). Also, by the definition of φ, we have that T·V(P_t) ≤ (B′ + T·φ(V(P_t), B′))·1, and therefore, using (8),

    R(P') \ge R(P_t) - \frac{Z}{2}\,\phi(V(P_t), B') \ge R(P_t) - Z\,\phi(V(P_t), B').

Substituting into the expression above, we can upper bound (Z+1)(ˆReg_t(P) − Reg(P)) by

    \hat{R}_t(P_t) - \hat{R}_t(P) - R(P_t) + Z\,\phi(V(P_t), B') + R(P) - Z\left[\phi(\hat{V}_t(P_t), B') - \phi(\hat{V}_t(P), B') + \phi(V(P), B')\right]
    \le \left|\hat{R}_t(P_t) - R(P_t)\right| + \left|\hat{R}_t(P) - R(P)\right| + Z\left\|\hat{V}_t(P_t) - V(P_t)\right\|_\infty + Z\left\|\hat{V}_t(P) - V(P)\right\|_\infty.

For the other side, by the definition of P_t, we have that R̂_t(P_t) − Zφ(V̂_t(P_t), B′) ≥ R̂_t(P′) − Zφ(V̂_t(P′), B′). Substituting as above, and using that φ(V(P′), B′) = 0, we get a similar upper bound on (Z+1)(Reg(P) − ˆReg_t(P)). Now, substituting the bounds from Lemma 4, we obtain

    \left|\widehat{\text{Reg}}_t(P) - \text{Reg}(P)\right| \le 4\sqrt{8Kd_t/t} \le c_0 K\psi\mu_m.

This completes the base case. The remaining proof is by induction, using the bounds provided by Lemma 4 for later epochs in terms of the variance V_t(P), and the bound on the variance provided by the second constraint in (OP). The second constraint in (OP) bounds the variance of any policy in any past epoch in terms of ˆReg_t for t in that epoch; the inductive hypothesis is used in the proof to convert those bounds into bounds in terms of Reg. ∎

Given the above lemma, the first constraint in (OP), which bounds the estimated regret for the chosen mixed policy, directly implies an upper bound on Reg for this mixed policy. Specifically, we get that for every epoch m, for the mixed policy Q_m that solves (OP),

    \text{Reg}(Q_m) \le (c_0 + 2)K\psi\mu_m.

Next, we bound the regret in epoch m using the above bound on Reg. For simplicity of discussion, here we outline the steps for bounding the regret for rewards sampled from policy Q_{m−1} in epoch m. Note that this is not precise in the following ways. First, Q_{m−1} may not be in C(Π) and therefore may not be a proper distribution (the actual sampling process puts the remaining probability on the default policy to obtain a proper distribution at each time t in epoch m). Second, the actual sampling process picks an action from the smoothed projection of this distribution. However, we ignore these technicalities here in order to convey the intuition behind the proof; they are dealt with rigorously in the complete proof provided in Appendix D.

The first step is to use the above bound on Reg to show that the expected reward in epoch m is close to the optimal reward. Since φ is always non-negative, by the definition of Reg, for any Q ∈ C(Π),

    (Z+1)\,\text{Reg}(Q) \ge R(P') - R(Q) \ge R(P^*) - R(Q) - \frac{\text{OPT}}{B}\cdot\frac{B - B'}{T},

where we used Lemma 2 to get the last inequality. If the algorithm were never aborted due to constraint violation in Step 10, the above observation would bound the regret of the algorithm by

    \sum_m \left(R(P^*) - R(Q_{m-1})\right)(\tau_m - \tau_{m-1}) \le \sum_m (Z+1)(c_0+2)K\psi\mu_{m-1}(\tau_m - \tau_{m-1}) + \frac{\text{OPT}}{B}\,(B - B').

Then, using that Z = O(OPT/B + 1), that B − B′ is of order T_0 + √(KT ln(T|Π|/δ)), and properly chosen epoch lengths and exploration probabilities μ_m, the above yields the desired bound of O((OPT/B + 1)√(KT ln(dT|Π|/δ))) on the expected regret. An application of the Azuma-Hoeffding inequality yields the high-probability regret bound stated in Theorem 1.

To complete the proof, we show that, in fact, with high probability, the algorithm is not aborted in Step 10 due to constraint violation. This involves showing that, with high probability, the algorithm's consumption (in steps t ≥ 1) is bounded above by B′ plus a deviation term of order √(KT ln(T|Π|/δ)); since B′ = B − T_0 − c√(KT ln(T|Π|/δ)) and the exploration rounds consume at most T_0, we obtain that the algorithm satisfies the knapsack constraint with high probability. This also explains why we started with a smaller budget. More precisely, we show that for every m,

    \phi(V(Q_m), B') \le 4(c_0 + 2)K\psi\mu_m. \qquad (9)

Recall that φ(v, B′) was defined as the maximum violation of the budget B′ by the vector v. To prove the above, we observe that, due to our choice of Z, φ(V(P), B′) is bounded in terms of Reg(P), as follows. By Equation (8), R(P′) ≥ R(P) − (Z/2)φ(V(P), B′) for all P ∈ C(Π), so that

    (Z+1)\,\text{Reg}(P) = R(P') - R(P) + Z\,\phi(V(P), B') \ge \frac{Z}{2}\,\phi(V(P), B').

Then, using the bound on the empirical regret of Q_m, we obtain the bound in Equation (9). Summing this bound over all epochs m, and using Jensen's inequality and the convexity of ϕ(·, B′), we obtain a bound on the maximum violation of the budget constraint by the algorithm's expected consumption vector. This is converted to a high-probability bound using the Azuma-Hoeffding inequality.
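To make the Jensen step concrete, the following sketch assumes a specific form for the violation map (ϕ(v, B) taken as the largest per-resource overshoot of budget B, for illustration) and checks that convexity bounds the violation of the average consumption vector by the average of the per-epoch violations:

```python
import numpy as np

# Assumed form: phi(v, B) is the largest per-resource overshoot of
# the budget B by the consumption vector v.
def phi(v, B):
    return float(np.max(np.maximum(v - B, 0.0)))

# phi(., B) is convex, so by Jensen's inequality the violation of the
# average consumption vector is at most the average of the violations.
vecs = np.array([[0.9, 0.2], [0.5, 0.8], [0.7, 0.4]])
B = 0.6
avg_violation = phi(vecs.mean(axis=0), B)                # violation of the average
mean_of_violations = np.mean([phi(v, B) for v in vecs])  # average violation
```

Here the average vector violates the budget by 0.1, while the per-vector violations average to 0.2, in line with Jensen's inequality.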

## 5 The CBwR problem

In this section, we consider a version of the problem with a concave objective function, and show how to obtain an efficient algorithm for it. The CBwR problem is identical to the CBwK problem, except for the following. The outcome in a round is simply the vector v_t(a_t) ∈ [0,1]^d, and the goal of the algorithm is to maximize f((1/T) ∑_{t=1}^T v_t(a_t)) for some concave function f defined on the domain [0,1]^d and given to the algorithm ahead of time. The optimum mixed policy is now defined as

 P∗ := argmax_{P∈C(Π)} f(V(P)). (10)

The optimum value is OPT := f(V(P∗)), and we bound the average regret, which is

 avg-regret := OPT − f((1/T) ∑_{t=1}^T v_t(a_t)).

The main result of this section is a regret bound of order 1/√T for this problem. Note that the regret scales as 1/√T rather than √T, since the problem is defined in terms of the average of the outcome vectors rather than their sum. We assume that f is represented in such a way that we can solve optimization problems of the following form in polynomial time (this problem has nothing to do with contexts and policies, and only depends on the function f): for any given a ∈ R^d,

 max f(x) + a⋅x : x ∈ [0,1]^d.
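As an illustration of this assumed oracle (a hypothetical instance, not the paper's implementation), the following sketch maximizes f(x) + a⋅x over the box [0,1]^d by projected gradient ascent for a simple concave quadratic f:

```python
import numpy as np

def box_oracle(f_grad, a, d, steps=2000, lr=0.05):
    """Maximize f(x) + a.x over the box [0,1]^d by projected gradient
    ascent, given the gradient of the concave function f."""
    x = np.full(d, 0.5)
    for _ in range(steps):
        x = np.clip(x + lr * (f_grad(x) + a), 0.0, 1.0)  # ascend, then project
    return x

# Example: f(x) = -||x - 0.3||^2 is concave; the coordinate-wise
# maximizer of f(x) + a.x is clip(0.3 + a/2, 0, 1).
f_grad = lambda x: -2.0 * (x - 0.3)
a = np.array([0.2, -0.4, 0.0])
x_star = box_oracle(f_grad, a, d=3)   # approximately [0.4, 0.1, 0.3]
```

Any method that exploits the actual representation of f (e.g. a closed form, as here) can serve in place of the generic ascent loop.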
###### Theorem 6.

For the CBwR problem, if f is L-Lipschitz w.r.t. a norm ∥·∥, then there is a polynomial-time algorithm that makes polynomially many calls to the AMO, and with probability at least 1 − δ has regret

 avg-regret(T) = O( (∥1_d∥ L/√T) ( √(K ln(T|Π|/δ)) + √(ln(d/δ)) ) ).
###### Remark.

A special case of this problem is when there are only constraints, in which case f could be defined as the negative of the distance from the constraint set. Further, one could handle both a concave objective function and convex constraints as follows. Suppose that we wish to maximize h(V(P)) subject to the constraint that V(P) ∈ S, for some L-Lipschitz concave function h and a convex set S. Further, suppose that we had a good estimate OPT′ of the optimum achieved by a static mixed policy, i.e.,

 OPT′ := max_{P∈C(Π)} h(V(P)) s.t. V(P) ∈ S. (11)

For a distance function d(v, S) measuring the distance of a point v from the set S, define

 f(v) := min{ h(v) − OPT′, −L⋅d(v, S) }.
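As a concrete (hypothetical) instance of this construction, take h(v) = min_j v_j, which is concave and 1-Lipschitz in the sup norm, S the box [0, 0.5]^d, and an assumed estimate OPT′ = 0.4:

```python
import numpy as np

L = 1.0       # Lipschitz constant of h (assumed)
OPT_ = 0.4    # assumed estimate of the constrained optimum OPT'

def h(v):
    return float(np.min(v))          # concave, 1-Lipschitz in sup norm

def dist_S(v):
    # sup-norm distance from the box S = [0, 0.5]^d
    return float(np.max(np.maximum(np.maximum(v - 0.5, -v), 0.0)))

def f(v):
    # f(v) = min{h(v) - OPT', -L * d(v, S)}: non-positive unless v both
    # meets the estimate OPT' and lies inside the feasible set S.
    return min(h(v) - OPT_, -L * dist_S(v))

feasible = f(np.array([0.4, 0.4]))     # meets OPT' and lies in S -> 0.0
infeasible = f(np.array([0.7, 0.4]))   # violates S by 0.2 -> -0.2
```

Maximizing this f thus drives the average outcome toward points that are both near-optimal for h and feasible for S.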

### 5.1 Algorithm

Since we don’t have any hard constraints and don’t need to estimate Z as in the case of CBwK, we can drop Steps 2–5 and Step 10 in Algorithm 1, and set Z = 0. The optimization problem (OP) is also the same, but with new definitions of Reg and ˆReg as below. Recall that P∗ is the optimal policy as given by Equation (10), and L is the Lipschitz factor for f with respect to the norm ∥·∥. We now define the regret of a policy P as

 Reg(P) := (1/(∥1_d∥L)) ( f(V(P∗)) − f(V(P)) ).

The best empirical policy is now given by

 P_t := argmax_{P∈C(Π)} f(ˆV_t(P)), (12)

and an estimate of the regret of policy P at time t is

 ˆReg_t(P) := (1/(∥1_d∥L)) ( f(ˆV_t(P_t)) − f(ˆV_t(P)) ).

Another difference is that we need to solve a convex optimization problem to find P_t (as defined in (12)) once every epoch. A similar convex optimization problem needs to be solved in every iteration of a coordinate descent algorithm for solving (OP) (details of this are in Appendix C.2). In both cases, the problems can be cast in the form

 min g(x) : x ∈ C,

where g is a convex function, C is a convex set, and we are given access to a linear optimization oracle that solves a problem of the form min c⋅x : x ∈ C. In (12), for instance, C is the set of all vectors ˆV_t(P) for all P ∈ C(Π). A linear optimization oracle over this set is just an AMO as in Definition 1. We show how to efficiently solve such a convex optimization problem using cutting-plane methods (Vaidya, 1989a; Lee et al., 2015), while making only polynomially many calls to the oracle. The details of this are in Appendix C.2.
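To illustrate oracle-based convex optimization of this form, here is a sketch using the Frank-Wolfe (conditional gradient) method, a simpler alternative to the cutting-plane methods used in the paper, with a toy linear oracle over the probability simplex standing in for the AMO:

```python
import numpy as np

def frank_wolfe(grad_g, lin_oracle, x0, steps=5000):
    """Minimize a convex g over a convex set C using only a linear
    optimization oracle lin_oracle(c) = argmin_{x in C} c.x."""
    x = x0.astype(float)
    for t in range(1, steps + 1):
        s = lin_oracle(grad_g(x))       # best point in the gradient direction
        x += (2.0 / (t + 2)) * (s - x)  # standard step size 2/(t+2)
    return x

# Toy stand-in for the AMO: linear optimization over the probability
# simplex in R^3 just picks the best vertex.
def simplex_oracle(c):
    e = np.zeros(3)
    e[np.argmin(c)] = 1.0
    return e

y = np.array([0.6, 0.3, 0.1])           # g(x) = ||x - y||^2, with y in the simplex
x_min = frank_wolfe(lambda x: 2.0 * (x - y), simplex_oracle,
                    np.array([1.0, 0.0, 0.0]))
```

The iterate converges toward the minimizer y while touching C only through the linear oracle, which is the same access pattern the AMO provides.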

### 5.2 Regret Analysis: Proof of Theorem 6

We prove that Algorithm 1 and (OP), with the above new definition of Reg, achieve the regret bounds of Theorem 6 for the CBwR problem. A complete proof of this theorem is given in Appendix E. Here, we sketch some key steps.

The first step of the proof is to use the constraints in (OP) to prove a lemma akin to Lemma 5, showing that the empirical regret and the actual regret are close for every policy P. Therefore, the first constraint in (OP), which bounds the empirical regret of the computed policy, implies a bound on its actual regret. Ignoring the technicalities of the sampling process (which are dealt with in the complete proof), and assuming that this computed policy is the policy used in each epoch, this provides a bound on the regret in every epoch. Regret across epochs can be combined using Jensen's inequality, which bounds the regret in expectation. Using the Azuma-Hoeffding inequality to bound the deviation of the expected reward vector from the actual reward vector, we obtain the high-probability regret bound stated in Theorem 6.
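The flavor of this last step can be sketched numerically (with an assumed 1-Lipschitz f; an illustration, not the proof): averaging T bounded outcome vectors leaves a per-coordinate deviation of order 1/√T, which the Lipschitz property of f converts into a comparable shift in the objective value:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 10_000, 3
v = rng.uniform(0.0, 1.0, size=(T, d))             # T bounded outcome vectors
dev = float(np.max(np.abs(v.mean(axis=0) - 0.5)))  # ~ O(1/sqrt(T)) per coordinate

f = lambda x: float(np.min(x))   # a concave f, 1-Lipschitz in the sup norm
obj_shift = abs(f(v.mean(axis=0)) - f(np.full(d, 0.5)))
# Lipschitz continuity guarantees obj_shift <= 1 * dev.
```

With T = 10,000 the deviation is on the order of a few thousandths, matching the 1/√T scaling in the bound of Theorem 6.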

## References

• Abbasi-yadkori et al. (2012) Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2012.
• Agarwal et al. (2014) Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In ICML 2014, June 2014. URL http://arxiv.org/abs/1402.0555. Full version on arXiv.
• Agrawal and Devanur (2014) Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC ’14, 2014.
• Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
• Babaioff et al. (2015) Moshe Babaioff, Shaddin Dughmi, Robert D. Kleinberg, and Aleksandrs Slivkins. Dynamic pricing with limited supply. ACM Trans. Economics and Comput., 3(1):4, 2015. doi: 10.1145/2559152.
• Badanidiyuru et al. (2012) Ashwinkumar Badanidiyuru, Robert Kleinberg, and Yaron Singer. Learning on a budget: posted price mechanisms for online procurement. In Proc. of the 13th ACM EC, pages 128–145. ACM, 2012.
• Badanidiyuru et al. (2013) Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In FOCS, pages 207–216, 2013.
• Badanidiyuru et al. (2014) Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Proceedings of The Twenty-Seventh Conference on Learning Theory (COLT-14), pages 1109–1134, 2014.
• Barto and Anandan (1985) Andrew G. Barto and P. Anandan. Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15(3):360–375, 1985.
• Besbes and Zeevi (2009) Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.
• Beygelzimer et al. (2011) Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proc. of the 14th AIStats, pages 19–26, 2011.
• Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
• Chakrabarti and Vee (2012) Deepayan Chakrabarti and Erik Vee. Traffic shaping to optimize ad delivery. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC ’12, 2012.
• Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual Bandits with Linear Payoff Functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 2011.
• Ding et al. (2013) Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In Proc. of the 27th AAAI, pages 232–238, 2013.
• Dudík et al. (2011) Miroslav Dudík, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proc. of the 27th UAI, pages 169–178, 2011.
• Grötschel et al. (1988) Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, New York, 1988.
• Guha and Munagala (2007) Sudipto Guha and Kamesh Munagala. Approximation algorithms for budgeted learning problems. In STOC, pages 104–113, 2007.
• György et al. (2007) András György, Levente Kocsis, Ivett Szabó, and Csaba Szepesvári. Continuous time associative bandit problems. In Proc. of the 20th IJCAI, pages 830–835, 2007.
• Langford and Zhang (2008) John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, pages 1096–1103, 2008.
• Lee et al. (2015) Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 1049–1065. IEEE, 2015. URL http://arxiv.org/abs/1508.04874v1. Full version on arXiv.
• Madani et al. (2004) Omid Madani, Daniel J Lizotte, and Russell Greiner. The budgeted multi-armed bandit problem. In Learning Theory, pages 643–645. Springer, 2004.
• Pandey and Olston (2006) Sandeep Pandey and Christopher Olston. Handling advertisements of unknown quality in search advertising. In Advances in Neural Information Processing Systems, pages 1065–1072, 2006.
• Rockafellar (2015) Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.
• Singla and Krause (2013) Adish Singla and Andreas Krause. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In Proc. of the 22nd WWW, pages 1167–1178, 2013.
• Slivkins and Vaughan (2013) Aleksandrs Slivkins and Jennifer Wortman Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges (position paper). CoRR, abs/1308.1746, 2013.
• Tran-Thanh et al. (2010) Long Tran-Thanh, Archie C. Chapman, Enrique Munoz de Cote, Alex Rogers, and Nicholas R. Jennings. Epsilon-first policies for budget-limited multi-armed bandits. In Proc. of the 24th AAAI, 2010.
• Tran-Thanh et al. (2012) Long Tran-Thanh, Archie C. Chapman, Alex Rogers, and Nicholas R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In AAAI, 2012.
• Vaidya (1989a) Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, pages 338–343. IEEE, 1989a.
• Vaidya (1989b) Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science, pages 332–337. IEEE, 1989b.
• Wu et al. (2015) Huasen Wu, R. Srikant, Xin Liu, and Chong Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. CoRR, abs/1504.06937, 2015. URL http://arxiv.org/abs/1504.06937.

## Appendix A Concentration Inequalities

###### Lemma 7.

(Freedman’s inequality for martingales [Beygelzimer et al., 2011]) Let be a sequence of real-valued random variables. Assume for all and . Define and