
# Non-monotonic Resource Utilization in the Bandits with Knapsacks Problem

Bandits with knapsacks (BwK) is an influential model of sequential decision-making under uncertainty that incorporates resource consumption constraints. In each round, the decision-maker observes an outcome consisting of a reward and a vector of nonnegative resource consumptions, and the budget of each resource is decremented by its consumption. In this paper we introduce a natural generalization of the stochastic BwK problem that allows non-monotonic resource utilization. In each round, the decision-maker observes an outcome consisting of a reward and a vector of resource drifts that can be positive, negative or zero, and the budget of each resource is incremented by its drift. Our main result is a Markov decision process (MDP) policy that has constant regret against a linear programming (LP) relaxation when the decision-maker knows the true outcome distributions. We build upon this to develop a learning algorithm that has logarithmic regret against the same LP relaxation when the decision-maker does not know the true outcome distributions. We also present a reduction from BwK to our model that shows our regret bound matches existing results.


## 1 Introduction

Multi-armed bandits are the quintessential model of sequential decision-making under uncertainty in which the decision-maker must trade off between exploration and exploitation. They have been studied extensively and have numerous applications, such as clinical trials, ad placement, and dynamic pricing. We refer the reader to Bubeck and Cesa-Bianchi (2012); Slivkins (2019); Lattimore and Szepesvári (2020) for an introduction to bandits. An important shortcoming of the basic stochastic bandits model is that it does not take into account resource consumption constraints that are present in many of the motivating applications. For example, in a dynamic pricing application the seller may be constrained by a limited inventory of items that can run out well before the end of the time horizon. The bandits with knapsacks (BwK) model (Tran-Thanh et al., 2010, 2012; Badanidiyuru et al., 2013, 2018) remedies this by endowing the decision-maker with some initial budget for each of several resources. In each round, the outcome is a reward and a vector of nonnegative resource consumptions, and the budget of each resource is decremented by its consumption. The process ends when the budget of any resource becomes nonpositive. However, even this formulation fails to model that in many applications resources can get replenished or renewed over time. For example, in a dynamic pricing application a seller may receive shipments that increase their inventory level.

#### Contributions

In this paper we introduce a natural generalization of BwK that allows non-monotonic resource utilization. The decision-maker starts with some initial budget for each resource. In each round, the outcome is a reward and a vector of resource drifts that can be positive, negative or zero, and the budget of each resource is incremented by its drift. A negative drift has the effect of decreasing the budget, akin to consumption in BwK, and a positive drift has the effect of increasing the budget. We consider two settings: (i) when the decision-maker knows the true outcome distributions and must design a Markov decision process (MDP) policy; and (ii) when the decision-maker does not know the true outcome distributions and must design a learning algorithm.

Our main contribution is an MDP policy, ControlBudget (CB), that has constant regret against a linear programming (LP) relaxation. Such a result was not known even for BwK. We build upon this to develop a learning algorithm, ExploreThenControlBudget (ETCB), that has logarithmic regret against the same LP relaxation. We also present a reduction from BwK to our model and show that our regret bound matches existing results.

Instead of merely sampling from the optimal probability distribution over arms, our policy samples from a perturbed distribution to ensure that the budget of each resource stays close to a decreasing sequence of thresholds. The sequence is chosen such that the expected leftover budget is a constant, and proving this is a key step in the regret analysis. Our work combines aspects of related work on logarithmic regret for BwK (Flajolet and Jaillet, 2015; Li et al., 2021).

#### Related Work

Multi-armed bandits have a rich history and logarithmic instance-dependent regret bounds have been known for a long time (Lai et al., 1985; Auer et al., 2002a). Since then, there have been numerous papers extending the stochastic bandits model in a variety of ways (Auer et al., 2002b; Slivkins, 2011; Kleinberg et al., 2019; Badanidiyuru et al., 2018; Immorlica et al., 2019; Agrawal and Devanur, 2014, 2016).

To the best of our knowledge, there are three papers on logarithmic regret bounds for BwK. Flajolet and Jaillet (2015) showed the first logarithmic regret bound for BwK. In each round, their algorithm finds the optimal basis for an optimistic version of the LP relaxation, and chooses arms from the resulting basis to ensure that the average resource consumption stays close to a pre-specified level. Even though their regret bound is logarithmic in the time horizon and inverse linear in the suboptimality gap, it is exponential in the number of resources. Li et al. (2021) showed an improved logarithmic regret bound that is polynomial in the number of resources, but it scales inverse quadratically with the suboptimality gap, and their definition of the gap differs from the one in Flajolet and Jaillet (2015). The main idea behind improving the dependence on the number of resources is to proceed in two phases: (i) identify the set of arms and binding resources in the optimal solution; (ii) in each round, solve an adaptive, optimistic version of the LP relaxation and sample an arm from the resulting probability distribution. Finally, Sankararaman and Slivkins (2021) show a logarithmic regret bound for BwK against a fixed-distribution benchmark. However, the regret of this benchmark itself against the optimal MDP policy can be large (Flajolet and Jaillet, 2015; Li et al., 2021).

## 2 Preliminaries

### 2.1 Model

Let T denote a finite time horizon, X a set of arms, J a set of resources, and B the initial budget of each resource j∈J. In each round t, if the budget of any resource is less than 1, then the algorithm must choose the null arm x0 (defined below). Otherwise, the algorithm chooses an arm xt∈X and observes an outcome consisting of a reward rt and a vector of drifts dt. The algorithm earns reward rt, and the budget of each resource j is incremented by the drift dt,j as Bt+1,j = Bt,j + dt,j.

Each arm x∈X has an outcome distribution over rewards and drifts, and the outcome in round t is drawn from the outcome distribution of the chosen arm xt. We use μx to denote the expected outcome vector of arm x, consisting of the expected reward μrx and the expected drifts μd,jx for each of the resources.¹ We also use μd,j to denote the vector of expected drifts for resource j across arms. We assume that x0∈X is a null arm with three important properties: (i) its reward is zero a.s.; (ii) its drift for each resource is nonnegative a.s.; and (iii) its expected drift for each resource is positive. The second and third properties of the null arm, plus the model's requirement that xt = x0 if there exists a resource j whose budget is less than 1, ensure that the budgets are nonnegative a.s. and can be safely increased from low levels.

¹In fact, our proofs remain valid even if the outcome distribution depends on the past history, provided the conditional expectation is independent of the past history and fixed for each arm. In this case μx denotes the conditional expectation of the outcome when arm x is pulled in round t. Since our proofs rely on the Azuma-Hoeffding inequality, we need this assumption on the conditional expectation to hold.
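The dynamics above can be sketched in a short simulation. This is a minimal illustration, not the paper's algorithm: the arm samplers, budgets, horizon, and policy below are hypothetical, and the rule of forcing the null arm whenever any budget drops below 1 follows the model's requirement.

```python
import random

def run_episode(arms, budgets, T, policy, seed=0):
    """Simulate the model's budget dynamics.  `arms` maps an arm name to a
    function rng -> (reward, drift_vector) with reward in [0, 1] and each
    drift in [-1, 1]; "null" must be the null arm (zero reward, nonnegative
    drift).  `policy` maps (round, budgets) -> arm name."""
    rng = random.Random(seed)
    total_reward = 0.0
    budgets = list(budgets)
    for t in range(T):
        # The model forces the null arm when any budget is below 1, so
        # budgets can never become negative (drifts lie in [-1, 1]).
        x = "null" if min(budgets) < 1 else policy(t, budgets)
        reward, drift = arms[x](rng)
        total_reward += reward
        budgets = [b + dj for b, dj in zip(budgets, drift)]
    return total_reward, budgets

# Hypothetical one-resource instance: the null arm replenishes the budget,
# while arm "a" earns reward but drains it.
arms = {
    "null": lambda rng: (0.0, [0.5]),
    "a": lambda rng: (1.0, [-0.25]),
}
reward, final = run_episode(arms, [1.5], T=20, policy=lambda t, b: "a")
# → reward 14.0, final budget [1.0]: the forced null arm kicks in
#   roughly every third round to keep the budget nonnegative.
```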

Our model is intended to capture applications featuring resource renewal, such as the following. In each round, each resource gets replenished by some random amount and the chosen arm consumes some random amount of each resource. If the consumption is less than the replenishment, the resource gets renewed. The drift then models the net replenishment minus consumption. The full model presented above is more general because it allows both the consumption and replenishment to depend on the arm pulled.

We consider two settings in this paper.

#### MDP setting

The decision-maker knows the true outcome distributions. In this setting the model implicitly defines an MDP, where the state is the budget vector, the actions are arms, and the transition probabilities are defined by the outcome distributions of the arms.

#### Learning setting

The decision-maker does not know the true outcome distributions.

The goal is to design an MDP policy for the first setting and a learning algorithm for the second, and to bound their regret against an LP relaxation, as defined in the next subsection.

### 2.2 Linear Programming Relaxation

Similar to Badanidiyuru et al. (2018, Lemma 3.1), we consider the following LP relaxation that provides an upper bound on the total expected reward of any algorithm:

 OPTLP=maxp{∑x∈Xpxμrx:∑x∈Xpxμd,jx≥−B/T ∀j∈J, ∑x∈Xpx=1, px≥0 ∀x∈X}. (1)
###### Lemma 2.1.

The total expected reward of any algorithm is at most T⋅OPTLP.

The proof of this lemma, similar to those in existing works (Agrawal and Devanur, 2014; Badanidiyuru et al., 2018), follows from the observations that (i) the variable px can be interpreted as the probability of choosing arm x in a round; and (ii) if we set px equal to the expected number of times arm x is chosen by an algorithm divided by T, then the result is a feasible solution for the LP.
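As a concrete reading of LP (1), the snippet below evaluates a candidate distribution over arms against the constraints; the two-arm instance is made up for illustration.

```python
def lp_value_and_feasible(p, mu_r, mu_d, B, T, tol=1e-9):
    """Evaluate a candidate distribution p over arms against LP (1): the
    value is sum_x p_x * mu_r[x]; feasibility requires the expected drift
    of every resource j to be at least -B/T, plus the simplex constraints
    on p."""
    is_dist = abs(sum(p) - 1.0) <= tol and all(px >= -tol for px in p)
    drift_ok = all(
        sum(px * mu_d[j][x] for x, px in enumerate(p)) >= -B / T - tol
        for j in range(len(mu_d))
    )
    value = sum(px * mu_r[x] for x, px in enumerate(p))
    return value, is_dist and drift_ok

# Two arms, one resource: the null arm (reward 0, drift +0.5) and a
# profitable arm with negative drift -0.5; B = 1, T = 100.  The mix below
# makes the drift constraint exactly tight.
value, feasible = lp_value_and_feasible(
    [0.49, 0.51], mu_r=[0.0, 1.0], mu_d=[[0.5, -0.5]], B=1, T=100)
```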

###### Definition 2.1 (Regret).

The regret of an algorithm ALG is defined as RT(ALG) = T⋅OPTLP − REW(ALG), where REW(ALG) denotes the total expected reward of ALG.

### 2.3 Assumptions

We assume that the initial budget of every resource is the same value B. This assumption is without loss of generality because otherwise we can scale the drifts by dividing them by the smallest budget. This results in a smaller support set for the drift distribution that is still contained in [−1, 1].

Our assumptions about the null arm are a major difference between our model and BwK. In BwK the budgets can only decrease, and the process ends when the budget of any resource reaches zero. However, in our model the budgets can increase or decrease, and the process ends only at the end of the time horizon. Our assumptions about the null arm allow us to increase the budget from low levels without making it negative.² A side-effect of this is that in our model we can even assume that B is a small constant, because we can always increase the budget by pulling the null arm, in contrast to the existing literature on BwK that assumes the initial budgets are large and often scale with the time horizon.

²In a model where, in each round, each resource gets replenished by some random amount and the chosen arm consumes some random amount of each resource, the null arm represents the option to remain idle and do nothing while waiting for resource replenishment. See Appendix A for more discussion on the assumptions about the null arm.

A standard assumption for achieving logarithmic regret in stochastic bandits is that the gap between the total expected reward of an optimal arm and that of a second-best arm is positive. There are a few different ways in which one could translate this to our model, where the optimal solution is a mixture over arms. We make the following choice. We assume that there exists a unique set of arms X∗ that forms the support of the LP solution and a unique set of resources J∗ that corresponds to binding constraints in the LP solution (Li et al., 2021). We define the gap of the problem instance in Definition 4.1, and our uniqueness assumption³ implies that the gap is strictly positive.

³This assumption is essentially without loss of generality because the set of problem instances with multiple optimal solutions is a set of measure zero.

We make a few separation assumptions parameterized by four positive constants that can be arbitrarily small. First, the smallest magnitude of the expected drifts, δdrift, is bounded away from zero. Second, the smallest singular value of the LP constraint matrix, denoted by σmin, is bounded away from zero. Third, the probabilities px in the LP solution are bounded away from zero for all x∈X∗. Fourth, the expected drift under the LP solution is bounded away from zero for all non-binding resources j∉J∗. The first assumption is necessary for logarithmic regret bounds because otherwise one can show that the optimal algorithm for the case of one resource and one zero-drift arm already incurs regret growing polynomially with T (Appendix B). The second and third assumptions are essentially the same as in the existing literature on logarithmic regret bounds for BwK (Flajolet and Jaillet, 2015; Li et al., 2021). The fourth assumption allows us to design algorithms that can increase the budgets of the non-binding resources away from low levels, thereby reducing the number of times the algorithm has to pull the null arm; otherwise, if they have zero drift, then, as stated above, the optimal algorithm already incurs polynomial regret (Appendix B).

## 3 MDP Policy with Constant Regret

In this section we design an MDP policy, ControlBudget (Algorithm 2), with constant regret in terms of T for the setting in which the decision-maker knows the true outcome distributions and our model implicitly defines an MDP (Section 2.1). At a high level, ControlBudget, which shares similarities with Flajolet and Jaillet (2015, Algorithm UCB-Simplex), plays arms to keep the budgets close to a decreasing sequence of thresholds. The choice of this sequence allows us to show that the expected leftover budgets and the expected number of null arm pulls are constants. This is a key step in proving the final regret bound. We start by considering the special case of one resource in Section 3.1 because it provides intuition for the general case of multiple resources in Section 3.2.

### 3.1 Special Case: One Resource

Since there is only one resource, we drop the resource index in this section. We say that an arm is a positive (resp. negative) drift arm if its expected drift is positive (resp. negative). The following lemma characterizes the possible solutions of the LP (Eq. 1).

###### Lemma 3.1.

The solution of the LP relaxation (Eq. 1) is supported on at most two arms. Furthermore, under the assumptions of Section 2.3, the solution belongs to one of three categories: (i) supported on a single positive drift arm; (ii) supported on the null arm and a negative drift arm; (iii) supported on a positive drift arm and a negative drift arm.
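The support characterization in Lemma 3.1 suggests a small exact solver for the one-resource LP: every vertex of the feasible region is either a single feasible arm or a two-arm mix that makes the drift constraint tight. The sketch below enumerates these vertices; the instance values are illustrative, not from the paper.

```python
def solve_one_resource_lp(mu_r, mu_d, B, T):
    """Solve LP (1) with one resource by vertex enumeration (cf. the
    support characterization of Lemma 3.1).  Returns (value, support),
    where support maps arm index -> probability."""
    b = -B / T
    best_val, best_p = float("-inf"), None
    # Single-arm vertices: arm x alone is feasible iff mu_d[x] >= -B/T.
    for x in range(len(mu_r)):
        if mu_d[x] >= b and mu_r[x] > best_val:
            best_val, best_p = mu_r[x], {x: 1.0}
    # Two-arm vertices: the drift constraint holds with equality.
    for x in range(len(mu_r)):
        for y in range(x + 1, len(mu_r)):
            if mu_d[x] == mu_d[y]:
                continue
            q = (b - mu_d[y]) / (mu_d[x] - mu_d[y])  # weight on arm x
            if 0 <= q <= 1:
                val = q * mu_r[x] + (1 - q) * mu_r[y]
                if val > best_val:
                    best_val, best_p = val, {x: q, y: 1 - q}
    return best_val, best_p

# Null arm (index 0) plus a negative-drift arm: category (ii) of the lemma.
val, support = solve_one_resource_lp(
    mu_r=[0.0, 1.0], mu_d=[0.5, -0.5], B=1, T=100)
```

Here the optimal mix puts weight 0.49 on the null arm and 0.51 on the negative-drift arm, making the drift constraint exactly tight.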

The proof of this lemma follows from properties of LPs and a case analysis of which constraints are tight. Our MDP policy, ControlBudget (Algorithm 1), deals with the three cases separately and satisfies the following regret bound.⁴

⁴In this theorem and the rest of the paper, we use ~C to denote a constant that depends on problem parameters, including the numbers of arms and resources and the various separation constants mentioned in Section 2.3, but does not depend on T. We use this notation because the main focus of this work is how the regret scales as a function of T.

###### Theorem 3.1.

Under the assumptions of Section 2.3, the MDP policy ControlBudget (Algorithm 1) satisfies

 RT(ControlBudget)≤~C, (2)

where ~C is a constant.

We defer all proofs in this section to Appendix C, but we include a proof sketch of most results in the main paper following the statement. The proof of Theorem 3.1 follows from the following sequence of lemmas.

###### Lemma 3.2.

If the LP solution is supported on a single positive drift arm x+, then

 RT(ControlBudget)≤~C, (3)

where ~C is a constant.

We can write the regret in terms of the difference between the expected number of times x+ is played by the LP solution and by ControlBudget. This difference equals the expected number of times the policy plays the null arm which, in turn, equals the expected number of rounds in which the budget is below 1. Since both x+ and the null arm x0 have positive drift, the budget is a transient random walk that drifts away from the region below 1. It is known that such a walk spends a constant number of rounds in any state in expectation.
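The transient-walk argument can be checked numerically. In this hypothetical Monte Carlo experiment (step size, drift probability, and starting budget are made up), the expected number of rounds the walk spends below 1 stays roughly constant even as the horizon grows tenfold.

```python
import random

def expected_rounds_below_one(T, p_up=0.75, step=0.5, start=2.0,
                              trials=1000, seed=1):
    """Monte Carlo estimate of how many rounds a random walk with positive
    expected drift (+step w.p. p_up, -step otherwise) spends below 1 over
    a horizon of T rounds."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        b = start
        for _ in range(T):
            if b < 1:
                total += 1
            b += step if rng.random() < p_up else -step
    return total / trials

short = expected_rounds_below_one(T=100)
longer = expected_rounds_below_one(T=1000)
# Both estimates are small constants, independent of the horizon.
```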

###### Lemma 3.3.

If the LP solution is supported on the null arm x0 and a negative drift arm x−, then

 RT(ControlBudget)≤~C⋅E[BT], (4)

where ~C is a constant.

We can write the regret in terms of the norm of the vector of differences between the expected number of times each supported arm is played by the LP solution and by ControlBudget. Since both constraints (resource and sum-to-one) are tight, the lemma follows by expressing this difference vector as the solution of a linear system involving the LP constraint matrix, with a right-hand side controlled by E[BT], and taking norms.

###### Lemma 3.4.

If the LP solution is supported on a positive drift arm x+ and a negative drift arm x−, then

 RT(ControlBudget)≤~C⋅(E[BT]+E[Nx0]), (5)

where E[Nx0] denotes the expected number of null arm pulls and ~C is a constant.

This lemma follows similarly to the previous one by writing the regret in terms of the same difference vector and accounting separately for the rounds in which the null arm is pulled.

Therefore, proving that the regret is a constant in T requires proving that both the expected leftover budget and the expected number of null arm pulls are constants. Intuitively, we could ensure the leftover budget is small by playing the negative drift arm whenever the budget is at least 1. However, there is then a constant probability of the budget decreasing below 1 in each round, and the expected number of null arm pulls becomes linear in T. ControlBudget resolves the tension between the two objectives by carefully choosing a decreasing sequence of thresholds. The threshold is initially far from 1 to ensure a low probability of premature resource depletion, but it decreases to 1 over time to ensure a small expected leftover budget, and it decreases at a rate that ensures the expected number of null arm pulls is a constant.
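The threshold-tracking idea can be sketched for one resource. The schedule below, beta_t = 1 + c·log(T − t + 1), is a hypothetical stand-in for the paper's exact sequence, and the arm rewards and drifts are made up; the point is only the qualitative behavior: start well above 1, decrease toward 1 by the horizon.

```python
import math
import random

def control_budget_sketch(T, B=2.0, c=0.5, seed=0):
    """One-resource sketch of the ControlBudget idea.  The threshold
    beta_t = 1 + c*log(T - t + 1) (a hypothetical stand-in) starts well
    above 1, guarding against premature depletion, and decreases toward 1
    by the horizon, keeping the leftover budget small."""
    rng = random.Random(seed)
    b, reward, null_pulls = B, 0.0, 0
    for t in range(T):
        beta = 1 + c * math.log(T - t + 1)
        if b < 1:                # forced null arm: replenish
            null_pulls += 1
            b += 0.5
        elif b < beta:           # positive-drift arm: climb back up
            reward += 0.2
            b += 0.5 if rng.random() < 0.75 else -0.5
        else:                    # negative-drift arm: earn full reward
            reward += 1.0
            b += 0.5 if rng.random() < 0.25 else -0.5
    return reward, b, null_pulls

reward, leftover, null_pulls = control_budget_sketch(T=500)
# The leftover budget ends near the final (small) threshold, and the
# null arm is pulled only rarely.
```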

###### Lemma 3.5.

If the LP solution is supported on a positive drift arm x+ and a negative drift arm x−, and the thresholds are chosen as in Algorithm 1, then

 E[Nx0]≤~C, (6)

where ~C is a constant.

If the budget is below the current threshold, then ControlBudget pulls x+ until the budget rises back above it. Since x+ has positive drift, the event that repeated pulls decrease the budget towards 1 is a low-probability event. Using this, our choice of the threshold sequence, and summing over all rounds shows that the expected number of rounds in which the budget is below 1 is a constant in T.

###### Lemma 3.6.

If the LP solution is supported on two arms and the thresholds are chosen as in Algorithm 1, then

 E[BT]≤~C, (7)

where ~C is a constant.

If the budget is above the current threshold, then ControlBudget pulls a negative drift arm x−. We can upper bound the expected leftover budget by conditioning on the number of consecutive pulls of x− at the end of the timeline. The main idea in completing the proof is that (i) if this number is large, then it corresponds to a low-probability event; and (ii) if it is small, then the budget a few rounds before the horizon was below the threshold at that time, which is a decreasing sequence, and there are few rounds left, so the budget cannot increase by too much.

### 3.2 General Case: Multiple Resources

Now we use the ideas from Section 3.1 to tackle the case of multiple resources, which is much more challenging. Generalizing Lemma 3.1, the solution of the LP relaxation (Eq. 1) is supported on at most |J|+1 arms. Informally, our MDP policy, ControlBudget (Algorithm 2), samples an arm from a probability distribution that ensures drifts bounded away from zero in the "correct directions": (i) a binding resource has positive drift if its budget is below its threshold and negative drift if it is above; and (ii) a non-binding resource has positive drift if its budget is below its threshold. This allows us to show that the expected leftover budget for each binding resource and the expected number of null arm pulls are constants in terms of T.
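The "correct directions" condition can be written as a small feasibility check. The thresholds, binding set, and margin below are illustrative placeholders for the quantities used by Algorithm 2.

```python
def drift_directions_ok(p, mu_d, budgets, thresholds, binding, delta):
    """Check that distribution p over arms drives each resource in the
    'correct direction' with margin delta: a binding resource j needs
    expected drift >= +delta when its budget is below its threshold and
    <= -delta otherwise; a non-binding resource only needs expected drift
    >= +delta when below its threshold."""
    for j, drifts in enumerate(mu_d):
        d = sum(px * drifts[x] for x, px in enumerate(p))
        below = budgets[j] < thresholds[j]
        if j in binding:
            ok = d >= delta if below else d <= -delta
        else:
            ok = d >= delta if below else True
        if not ok:
            return False
    return True

# Two arms, two resources; resource 0 is binding.
p = [0.5, 0.5]
mu_d = [[0.4, -0.2],   # resource 0: mixed drifts, expectation +0.1
        [0.3, 0.3]]    # resource 1: always replenished
```

For example, this p is acceptable when resource 0 sits below its threshold (it pushes the budget up), but not when resource 0 is above it (the drift should then be negative).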

###### Theorem 3.2.

Under the assumptions of Section 2.3, the regret of ControlBudget (Algorithm 2) satisfies

 RT(ControlBudget)≤~C, (9)

where γ∗ (defined in Lemma 3.9) and ~C are constants.

We defer all proofs in this section to Appendix D, but we include a proof sketch of most results in the main paper following the statement. The proof of Theorem 3.2 follows from the following sequence of lemmas. The next two lemmas are generalizations of Lemmas 3.4 and 3.3 with essentially the same proofs. Recall that J∗ denotes the unique set of resources that correspond to binding constraints in the LP solution (Section 2.3).

###### Lemma 3.7.

If the LP solution includes the null arm in its support, then

 RT(ControlBudget)≤~C⋅⎛⎝∑j∈J∗E[BT,j]⎞⎠, (10)

where ~C is a constant.

###### Lemma 3.8.

If the LP solution does not include the null arm in its support, then

 RT(ControlBudget)≤~C⋅⎛⎝∑j∈J∗E[BT,j]+E[Nx0]⎞⎠, (11)

where E[Nx0] denotes the expected number of null arm pulls and ~C is a constant.

Lemmas 3.11 and 3.10 are generalizations of Lemmas 3.6 and 3.5 with similar proofs after taking a union bound over resources. But we first need Lemma 3.9, which lets us conclude that there is drift of magnitude at least γ∗ in the "correct directions" stated earlier.

###### Lemma 3.9.

(Flajolet and Jaillet, 2015, Lemma 14) In each round t, there exists a probability distribution over arms whose expected drifts have magnitude at least a positive constant γ∗ in the "correct directions" stated earlier.

The proof of this lemma is identical to Flajolet and Jaillet (2015, Lemma 14) but we provide a proof in the appendix for completeness.

###### Lemma 3.10.

If the LP solution does not include the null arm in its support, then

 E[Nx0]≤~C, (12)

where ~C is a constant.

###### Lemma 3.11.

If the LP solution is supported on more than one arm, then for all j∈J∗,

 E[BT,j]≤~C, (13)

where ~C is a constant.

A subtle but important point is that the regret analysis does not require ControlBudget to know the true expected drifts in order to find the sampling distribution. It simply requires the algorithm to know X∗ and J∗, and to find any probability vector that ensures drifts bounded away from zero in the "correct directions" stated earlier. We use this property in our learning algorithm, ExploreThenControlBudget (Algorithm 3), in the next section.

## 4 Learning Algorithm with Logarithmic Regret

In this section we design a learning algorithm, ExploreThenControlBudget (Algorithm 3), with logarithmic regret in terms of T for the setting when the decision-maker does not know the true distributions. Our algorithm, which can be viewed as combining aspects of Li et al. (2021, Algorithm 1) and Flajolet and Jaillet (2015, Algorithm UCB-Simplex), proceeds in three phases. It uses phase one of Li et al. (2021, Algorithm 1) to identify the set of optimal arms X∗ and the set of binding resources J∗ by playing arms in a round-robin fashion and using confidence intervals and properties of LPs. This is reminiscent of successive elimination (Even-Dar et al., 2002), except that the algorithm tries to identify the optimal arms instead of eliminating suboptimal ones. In the second phase the algorithm continues playing the arms in X∗ in a round-robin fashion to shrink the confidence radius further. In the third phase the algorithm plays a variant of the MDP policy ControlBudget (Algorithm 2) with a slightly different optimization problem for the sampling distribution, because it only has empirical estimates of the drifts.

### 4.1 Additional Notation and Preliminaries

For all arms x and rounds t, define the upper confidence bound (UCB) of the expected outcome vector as the empirical mean outcome vector of x plus the confidence radius, where the radius shrinks with the number of times x has been played before round t. The lower confidence bound (LCB) is defined similarly, with the radius subtracted.
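A standard Hoeffding-style instantiation of these bounds looks as follows; the constant in the radius and the clipping range are illustrative assumptions, and the paper's exact radius may differ.

```python
import math

def confidence_bounds(emp_mean, n_pulls, T, c=2.0, lo=-1.0, hi=1.0):
    """UCB/LCB for one coordinate of an arm's expected outcome vector,
    using the Hoeffding-style radius sqrt(c * log(T) / n).  Rewards lie
    in [0, 1] and drifts in [-1, 1], so we clip to the given range."""
    radius = math.sqrt(c * math.log(T) / max(n_pulls, 1))
    return min(emp_mean + radius, hi), max(emp_mean - radius, lo)

ucb, lcb = confidence_bounds(emp_mean=0.3, n_pulls=25, T=100)
```

As the arm accumulates pulls the radius shrinks, so the UCB and LCB close in on the empirical mean.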

For all arms x, let OPT−x denote the value of the LP relaxation (Eq. 1) with the additional constraint that arm x is not played, and for all resources j, let OPT−j denote the LP value when the objective has an extra term (Li et al., 2021). Intuitively, these values capture how important it is to play arm x, or to make the constraint for resource j binding, in the optimal solution. Define the UCB of OPT−x to be the value of the LP when the expected outcomes are replaced by their UCBs, and denote this by UCBt(OPT−x). The LCB of OPT−x, and the UCBs and LCBs of OPTLP and OPT−j, are defined similarly.

###### Definition 4.1 (Gap (Li et al., 2021)).

The gap of the problem instance is defined as

 Δ=min{minx∈X∗{OPTLP−OPT−x},minj∉J∗{OPTLP−OPT−j}}. (14)

### 4.2 Learning Algorithm and Regret Analysis

###### Theorem 4.1.

Under the assumptions of Section 2.3, the regret of ExploreThenControlBudget (Algorithm 3) satisfies

 RT(ExploreThenControlBudget)≤~C⋅logT, (18)

where γ∗ (defined in Lemma 3.9) and ~C are constants, with k denoting the number of arms, m the number of resources, ~C′ the constant in Theorem 3.2, and

 ~C=O⎛⎝km2min{δ2drift,σ2min}Δ2+k(γ∗)−2+~C′⎞⎠. (19)

We defer all proofs in this section to Appendix E, but we include a proof sketch of most results in the main paper following the statement. We refer to the three while loops of ExploreThenControlBudget (Algorithm 3) as its three phases. The lemmas and corollaries below show that the first phase consists of at most a logarithmic number of rounds. It is easy to see that the second phase also consists of at most a logarithmic number of rounds. The third phase plays a variant of the MDP policy ControlBudget (Algorithm 2). There exists a feasible solution to the optimization problem in Algorithm 3 that ensures drifts bounded away from zero in the "correct directions" (Section E.3). Combining the analysis of the first two phases with Theorem 3.2 lets us conclude that ExploreThenControlBudget has logarithmic regret.

###### Definition 4.2 (Clean Event).

The clean event is the event that, for all rounds t and all arms x, (i) the empirical mean outcome vector of x is within the confidence radius of its true mean; and (ii) after sufficiently many initial pulls of the null arm, the accumulated drift for each resource is sufficiently large.

###### Lemma 4.1.

The clean event occurs with high probability.

The lemma follows from the Azuma-Hoeffding inequality. Since the complement of the clean event contributes only a negligible amount to the regret, it suffices to bound the regret conditioned on the clean event.

###### Lemma 4.2.

If the clean event occurs, then UCBt(OPTLP) and LCBt(OPTLP) are within a constant multiple of the confidence radius of OPTLP. A similar statement is true for OPT−x and OPT−j for all arms x and resources j.

The proof follows from a perturbation analysis of the LP and uses the confidence radius to bound the perturbations in the rewards and drifts.

###### Corollary 4.1.

If the clean event occurs and the confidence radii are sufficiently small, then

 UCBt(OPT−x)<LCBt(OPTLP) (20)

for all x∈X∗, and an analogous statement holds for OPT−j for all j∉J∗.

This follows from substituting the bound from Lemma 4.2 into the definition of the gap. In the worst case, each pull of an arm can cause a budget to drop below 1, but the clean event implies that the initial pulls of the null arm accumulate enough total drift to allow pulls of each non-null arm in phase 1. This allows us to upper bound the duration of phase 1 as follows.

###### Corollary 4.2.

If the clean event occurs, then phase 1 of ExploreThenControlBudget lasts at most

 ~C⋅logT (21)

rounds, where ~C is a constant.

### 4.3 Reduction from BwK

###### Theorem 4.2.

Suppose the assumptions of Section 2.3 hold. Consider a BwK instance such that (i) for each arm and resource, the expected consumption of that resource differs from the per-round budget B/T by a positive constant; and (ii) all the other assumptions required by Theorem 4.1 (Section 2.3) are also satisfied. Then there is an algorithm for BwK whose regret satisfies the same bound as in Theorem 4.1 with the same constant ~C.

Due to space constraints, we present the reduction in Section E.4.

## 5 Conclusion

In this paper we introduced a natural generalization of BwK that allows non-monotonic resource utilization. We first considered the setting when the decision-maker knows the true distributions and presented an MDP policy with constant regret against an LP relaxation. Then we considered the setting when the decision-maker does not know the true distributions and presented a learning algorithm with logarithmic regret against the same LP relaxation. Finally, we presented a reduction from BwK to our model and showed a regret bound that matches existing results (Li et al., 2021).

An important direction for future research is to obtain optimal regret bounds. The regret bound for our algorithm scales polynomially in the number of arms k and the number of resources m, and inverse quadratically in the suboptimality parameter Δ. A modification to our algorithm along the lines of Flajolet and Jaillet (2015, Algorithm UCB-Simplex) that considers each support set of the LP solution explicitly leads to a regret bound that is exponential in the number of resources but scales inverse linearly with Δ. It is an open question, even for BwK, to obtain a regret bound that is simultaneously polynomial in the number of resources and inverse linear in Δ, or to show that this trade-off between the dependence on the number of resources and the suboptimality parameter is unavoidable.

Another natural follow-up to our work is to develop further extensions, such as considering an infinite set of arms (Kleinberg et al., 2019), studying adversarial observations (Auer et al., 2002b; Immorlica et al., 2019), or incorporating contextual information (Slivkins, 2011; Agrawal and Devanur, 2016), as has been done elsewhere in the literature on bandits.

We thank Kate Donahue, Sloan Nietert, Ayush Sekhari, and Alex Slivkins for helpful discussions. This research was partially supported by the NSERC Postgraduate Scholarships-Doctoral Fellowship 545847-2020.

## References

• S. Agrawal and N. R. Devanur (2014) Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pp. 989–1006. Cited by: §1, §2.2.
• S. Agrawal and N. Devanur (2016) Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems 29. Cited by: §1, §5.
• P. Auer, N. Cesa-Bianchi, and P. Fischer (2002a) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2), pp. 235–256. Cited by: §1.
• P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002b) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §1, §5.
• A. Badanidiyuru, R. Kleinberg, and A. Slivkins (2013) Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 207–216. Cited by: §1.
• A. Badanidiyuru, R. Kleinberg, and A. Slivkins (2018) Bandits with knapsacks. Journal of the ACM (JACM) 65 (3), pp. 1–55. Cited by: Non-monotonic Resource Utilization in the Bandits with Knapsacks Problem, §1, §1, §2.2, §2.2.
• S. Bubeck and N. Cesa-Bianchi (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5 (1), pp. 1–122. Cited by: §1.
• E. Even-Dar, S. Mannor, and Y. Mansour (2002) PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pp. 255–270. Cited by: §4.
• A. Flajolet and P. Jaillet (2015) Logarithmic regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800v4. Cited by: §1, §1, §2.3, §3.2, Lemma 3.9, §3, §4, §5.
• N. Immorlica, K. A. Sankararaman, R. Schapire, and A. Slivkins (2019) Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 202–219. Cited by: §1, §5.
• R. Kleinberg, A. Slivkins, and E. Upfal (2019) Bandits and experts in metric spaces. Journal of the ACM (JACM) 66 (4), pp. 1–77. Cited by: §1, §5.
• T. L. Lai, H. Robbins, et al. (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1.
• T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §1.
• X. Li, C. Sun, and Y. Ye (2021) The symmetry between arms and knapsacks: a primal-dual approach for bandits with knapsacks. In International Conference on Machine Learning, pp. 6483–6492. Cited by: §E.4, Non-monotonic Resource Utilization in the Bandits with Knapsacks Problem, §1, §1, §2.3, §2.3, §4.1, Definition 4.1, §4, §5.
• K. A. Sankararaman and A. Slivkins (2021) Bandits with knapsacks beyond the worst case. Advances in Neural Information Processing Systems 34. Cited by: §1.
• A. Slivkins (2011) Contextual bandits with similarity information. In Proceedings of the 24th annual Conference On Learning Theory, pp. 679–702. Cited by: §1, §5.
• A. Slivkins (2019) Introduction to multi-armed bandits. Found. Trends Mach. Learn. 12 (1-2), pp. 1–286. Cited by: §1.
• L. Tran-Thanh, A. Chapman, E. M. De Cote, A. Rogers, and N. R. Jennings (2010) Epsilon–first policies for budget–limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence. Cited by: §1.
• L. Tran-Thanh, A. Chapman, A. Rogers, and N. Jennings (2012) Knapsack based optimal policies for budget–limited multi–armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 26, pp. 1134–1140. Cited by: §1.

## Checklist

1. For all authors…

1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

2. Did you describe the limitations of your work? We discuss potential improvements in Section 5.

3. Did you discuss any potential negative societal impacts of your work? Our paper studies a theoretical model and we do not believe there are any negative societal impacts of this paper.

4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

2. If you are including theoretical results…

1. Did you state the full set of assumptions of all theoretical results? We state our assumptions in Section 2.3.

2. Did you include complete proofs of all theoretical results? We include complete proofs in the appendix, and we include a proof sketch of most lemmas in the main paper following each lemma statement.

3. If you ran experiments…

1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

1. If your work uses existing assets, did you cite the creators?

2. Did you mention the license of the assets?

3. Did you include any new assets either in the supplemental material or as a URL?

4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

5. If you used crowdsourcing or conducted research with human subjects…

1. Did you include the full text of instructions given to participants and screenshots, if applicable?

2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

## Appendix A Assumptions about the Null Arm

In any model in which resources can be consumed and/or replenished over time, one must specify what happens when the budget of one (or more) resources reaches zero. The original bandits with knapsacks problem assumes that when this happens, the process of learning and gaining rewards ceases. The key distinction between that model and ours is that we instead assume the learner is allowed to remain idle until the supply of every resource becomes positive again, at which point the learning process recommences. The null arm in our paper is intended to represent this option to remain idle and wait for resource replenishment. In order for these idle periods to have finite length almost surely, a minimal assumption is that when the null arm is pulled, for each resource there is a positive probability that the supply of the resource increases. We make the stronger assumption that for each resource, the expected change in supply is positive when the null arm is pulled. In fact, our results for the MDP setting hold under the following more general assumption: there exists a probability distribution over arms such that, when a random arm is sampled from this distribution and pulled, the expected change in the supply of each resource is positive. In the following, we refer to this as Assumption PD (for "positive drift").

To see that our results for the MDP setting continue to hold under Assumption PD (i.e., even if one does not assume that the null arm itself is guaranteed to yield positive expected drift for each resource), simply modify Algorithms 1 and 2 so that whenever they would pull the null arm in a time step when the supply of each resource is at least 1, the modified algorithms instead pull a random arm sampled from the probability distribution over arms that guarantees positive expected drift for every resource. As long as the drift constant in the analysis is less than or equal to this positive expected drift, the modification to the algorithms does not change their analysis. We believe it is likely that our learning algorithm (Algorithm 3) could similarly be adapted to work under Assumption PD, but this would be less straightforward because the positive-drift distribution over arms would need to be learned.

When Assumption PD is violated, the problem becomes much more similar to the bandits with knapsacks problem. To see why, consider a two-player zero-sum game in which the row player chooses an arm, the column player chooses a resource, and the payoff is the expected drift of that resource when that arm is pulled. Assumption PD is equivalent to the assertion that the value of this game is positive, so the negation of Assumption PD means that the value of the game is non-positive. By the Minimax Theorem, this means there is a convex combination of resources (i.e., a mixed strategy for the column player) such that the weighted-average supply of these resources is guaranteed to experience non-positive expected drift, no matter which arm is pulled. Either the expected drift is zero, in which case we prove in Appendix B that $\Omega(\sqrt{T})$ regret is unavoidable, or the expected drift is strictly negative, in which case the weighted-average resource supply inevitably dwindles to zero no matter which arms the learner pulls. In either case, the behavior of the model is qualitatively different when Assumption PD does not hold.
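To make Assumption PD concrete, the following sketch checks whether a candidate mixed strategy over arms certifies positive expected drift for every resource. The drift matrix `D` (rows indexed by arms, columns by resources) and the function name are illustrative assumptions, not notation from the paper:

```python
def certifies_positive_drift(D, p, eps=0.0):
    """Return True if the mixed strategy p over arms gives every resource
    strictly positive expected drift (> eps) under drift matrix D, i.e.,
    sum_x p[x] * D[x][j] > eps for every resource j."""
    n_resources = len(D[0])
    expected_drift = [
        sum(p[x] * D[x][j] for x in range(len(D)))
        for j in range(n_resources)
    ]
    return all(mu > eps for mu in expected_drift)

# Hypothetical instance: arm 0 is the null arm (positive drift on both
# resources), arm 1 consumes resource 0, arm 2 consumes resource 1.
D = [[0.3, 0.2],
     [-0.5, 0.1],
     [0.4, -0.6]]

print(certifies_positive_drift(D, [1.0, 0.0, 0.0]))  # null arm alone suffices
print(certifies_positive_drift(D, [0.0, 0.5, 0.5]))  # resource 0 drifts down
```

By the minimax argument above, if no distribution `p` passes this check, then some convex combination of resources has non-positive expected drift under every arm.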

## Appendix B Regret Bounds for One Arm, One Resource, and Zero Drift

In this section we consider the case of a single resource and a single arm $x$ besides the null arm $x_0$, where $x$ has zero drift, i.e., $\mu_d^x = 0$. Since $x$ is the only arm besides the null arm, we assume without loss of generality that its reward is equal to $1$ deterministically. The optimal policy is to pull $x$ when $B_{t-1} \ge 1$ and $x_0$ otherwise. We will show that the regret of this policy is $\Theta(\sqrt{T})$.
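The dynamics of this one-arm instance can be simulated directly. The sketch below (the uniform drift distributions and parameter values are illustrative assumptions) counts the idle rounds in which the threshold policy pulls the null arm; this count is exactly the regret quantity bounded in Theorem B.1:

```python
import random

def idle_rounds(T, initial_budget, null_drift_mean, seed=0):
    """Simulate the threshold policy for the one-arm, one-resource,
    zero-drift instance: pull arm x (reward 1, zero-mean drift) when the
    budget is at least 1, otherwise idle on the null arm x0 (drift with
    positive mean null_drift_mean). The number of idle rounds equals
    the policy's regret against the LP value T."""
    rng = random.Random(seed)
    budget = initial_budget
    idle = 0
    for _ in range(T):
        if budget >= 1.0:
            budget += rng.uniform(-1.0, 1.0)  # arm x: zero expected drift
        else:
            idle += 1
            budget += null_drift_mean + rng.uniform(-0.5, 0.5)  # null arm x0
    return idle

# Regret should grow roughly like sqrt(T): quadrupling the horizon
# should roughly double the number of idle rounds.
for T in (1_000, 4_000, 16_000):
    print(T, idle_rounds(T, initial_budget=1.0, null_drift_mean=0.5))
```

The reflected-random-walk behavior near the threshold is what drives the $\sqrt{T}$ scaling proved below.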

###### Theorem B.1.

The regret of the MDP policy is $O(\sqrt{T})$.

###### Proof.

The optimal solution of the LP relaxation (Eq. 1) plays arm $x$ with probability $1$ and the null arm $x_0$ with probability $0$. Since $x$ and $x_0$ have reward equal to $1$ and $0$ deterministically, the LP value is $T$. Therefore, the regret of the MDP policy is equal to the expected number of rounds in which the budget is less than $1$. That is,

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[B_{t-1} < 1]\right].$$

Since $\mathbb{E}[d_t \mid B_{t-1}] = \mu_d^{x_0}$ when $B_{t-1} < 1$ and $\mathbb{E}[d_t \mid B_{t-1}] = 0$ when $B_{t-1} \ge 1$, we can write

$$\begin{aligned}
R_T &= \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[B_{t-1} < 1]\right]
= \frac{1}{\mu_d^{x_0}}\,\mu_d^{x_0}\,\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}[B_{t-1} < 1]\right] \\
&= \frac{1}{\mu_d^{x_0}}\,\mathbb{E}\left[\sum_{t=1}^{T} \mu_d^{x_0}\,\mathbb{1}[B_{t-1} < 1] + 0 \cdot \mathbb{1}[B_{t-1} \ge 1]\right] \\
&= \frac{1}{\mu_d^{x_0}}\,\mathbb{E}\left[\sum_{t=1}^{T} d_t\right]
= \frac{1}{\mu_d^{x_0}}\left(\mathbb{E}[B_T] - B\right).
\end{aligned}$$

Since $|d_t| \le 1$, the budget is updated as $B_t = B_{t-1} + d_t$, and $\mathbb{E}[d_t \mid B_{t-1}] = \mu_d^{x_0}\,\mathbb{1}[B_{t-1} < 1]$, we have

$$\begin{aligned}
\mathbb{E}[B_t^2 \mid B_{t-1}] &= \mathbb{E}[B_{t-1}^2 + 2B_{t-1}d_t + d_t^2 \mid B_{t-1}] \\
&= \mathbb{E}[B_{t-1}^2 \mid B_{t-1}] + \mathbb{E}[2B_{t-1}d_t \mid B_{t-1}] + \mathbb{E}[d_t^2 \mid B_{t-1}] \\
&\le B_{t-1}^2 + \mathbb{E}[2B_{t-1}d_t \mid B_{t-1}] + 1^2 \\
&= B_{t-1}^2 + 2B_{t-1}\,\mu_d^{x_0}\,\mathbb{1}[B_{t-1} < 1] + 1 \\
&\le B_{t-1}^2 + 2\mu_d^{x_0} + 1 \\
\Rightarrow\quad \mathbb{E}[B_T^2] &= O(T).
\end{aligned}$$

Using Jensen’s inequality, we have

$$\mathbb{E}[B_T] \le \sqrt{\mathbb{E}[B_T^2]} = O(\sqrt{T}).$$

This completes the proof. ∎

###### Theorem B.2.

If $\mathbb{E}[d_t^2 \mid B_{t-1}] \ge \sigma^2$ for some constant $\sigma > 0$, then the regret of the MDP policy is $\Omega(\sqrt{T})$.

###### Proof.

Using the proof of Theorem B.1, it suffices to provide a lower bound on $\mathbb{E}[B_T]$. Since the budget is updated as $B_t = B_{t-1} + d_t$, $\mathbb{E}[d_t \mid B_{t-1}] \ge 0$, and $\mathbb{E}[d_t^2 \mid B_{t-1}] \ge \sigma^2$, we have

$$\begin{aligned}
\mathbb{E}[B_t^2 \mid B_{t-1}] &= \mathbb{E}[B_{t-1}^2 + 2B_{t-1}d_t + d_t^2 \mid B_{t-1}] \\
&= \mathbb{E}[B_{t-1}^2 \mid B_{t-1}] + \mathbb{E}[2B_{t-1}d_t \mid B_{t-1}] + \mathbb{E}[d_t^2 \mid B_{t-1}] \\
&\ge B_{t-1}^2 + \mathbb{E}[2B_{t-1}d_t \mid B_{t-1}] + \sigma^2 \\
&= B_{t-1}^2 + 2B_{t-1}\,\mathbb{E}[d_t \mid B_{t-1}] + \sigma^2 \\
&\ge B_{t-1}^2 + \sigma^2 \\
\Rightarrow\quad \mathbb{E}[B_T^2] &\ge \Omega(T).
\end{aligned}$$

The Cauchy-Schwarz inequality yields that

$$\mathbb{E}\left[\left(B_T^{1/2}\right)^2\right]^{1/2}\,\mathbb{E}\left[\left(B_T^{3/2}\right)^2\right]^{1/2} \ge \mathbb{E}[B_T^2] \ge \Omega(T).$$

Squaring both sides yields that

$$\mathbb{E}[B_T]\,\mathbb{E}[B_T^3] \ge \mathbb{E}[B_T^2]^2 \ge \Omega(T^2).$$

It suffices to show that $\mathbb{E}[B_T^3] = O(T^{3/2})$ because this will imply that $\mathbb{E}[B_T] = \Omega(\sqrt{T})$. Since $B_t = B_{t-1} + d_t$ and $|d_t| \le 1$, we have

$$\begin{aligned}
\mathbb{E}[B_t^3 \mid B_{t-1}] &= \mathbb{E}[B_{t-1}^3 + 3B_{t-1}^2 d_t + 3B_{t-1}d_t^2 + d_t^3 \mid B_{t-1}] \\
&= B_{t-1}^3 + 3B_{t-1}^2\,\mathbb{E}[d_t \mid B_{t-1}] + 3B_{t-1}\,\mathbb{E}[d_t^2 \mid B_{t-1}] + \mathbb{E}[d_t^3 \mid B_{t-1}] \\
&= B_{t-1}^3 + 3B_{t-1}^2\,\mu_d^{x_0}\,\mathbb{1}[B_{t-1} < 1] + 3B_{t-1}\,\mathbb{E}[d_t^2 \mid B_{t-1}] + \mathbb{E}[d_t^3 \mid B_{t-1}] \\
&\le B_{t-1}^3 + 3\mu_d^{x_0} + 3B_{t-1} + 1.
\end{aligned}$$

Taking expectation on both sides yields

$$\begin{aligned}
\mathbb{E}[B_t^3] &\le \mathbb{E}[B_{t-1}^3] + 3\,\mathbb{E}[B_{t-1}] + O(1) \\
&\le \mathbb{E}[B_{t-1}^3] + O(\sqrt{t}),
\end{aligned}$$

where the last inequality follows from the proof of Theorem B.1, where we show that $\mathbb{E}[B_t] = O(\sqrt{t})$. Summing both sides over all rounds yields that

$$\mathbb{E}[B_T^3] = O(T^{3/2})$$

and this completes the proof. ∎

## Appendix C Proofs for Section 3.1

### c.1 Proof of Lemma 3.2

When the LP solution is supported on a positive-drift arm, the LP plays that arm with probability $1$. Therefore, the regret is equal to the expected number of times ControlBudget (Algorithm 1) pulls the null arm. This, in turn, is equal to the expected number of rounds in which the budget is less than $1$.

Define

$$b_0 = \frac{8}{\delta_{\mathrm{drift}}^2}\,\ln\!\left(\frac{2}{1 - \exp\!\left(-\delta_{\mathrm{drift}}^2/8\right)}\right). \tag{22}$$
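As a quick numerical sanity check on this choice of $b_0$, the sketch below evaluates the closed form of the geometric sum appearing in Eq. (24) and confirms it is at most $1/2$ once $b \ge b_0$ (the value of $\delta_{\mathrm{drift}}$ here is an arbitrary illustrative choice):

```python
import math

def b0(delta_drift):
    """Threshold b0 from Eq. (22)."""
    c = delta_drift ** 2 / 8.0
    return (1.0 / c) * math.log(2.0 / (1.0 - math.exp(-c)))

def geometric_tail(b, delta_drift):
    """Closed form of sum_{k >= b} exp(-delta_drift^2 * k / 8),
    i.e., the right-hand side of Eq. (24)."""
    c = delta_drift ** 2 / 8.0
    return math.exp(-c * b) / (1.0 - math.exp(-c))

delta = 0.2  # illustrative drift parameter
threshold = b0(delta)
print(threshold, geometric_tail(math.ceil(threshold), delta))
```

At $b = b_0$ exactly, the closed form evaluates to $1/2$, which is what the logarithm in Eq. (22) is engineered to achieve.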

Then, we have that for all $b \ge b_0$,

$$\begin{aligned}
\sum_{k=b}^{\infty} \Pr[B_{s+k} \in [0,1) \mid B_s = b] &\le \sum_{k=b}^{\infty} \exp\!\left(-\frac{\delta_{\mathrm{drift}}^2 k}{8}\right) && (23) \\
&= \exp\!\left(-\frac{\delta_{\mathrm{drift}}^2 b}{8}\right)\left(1 - \exp\!\left(-\frac{\delta_{\mathrm{drift}}^2}{8}\right)\right)^{-1}, && (24)
\end{aligned}$$

where the first inequality follows from the Azuma-Hoeffding inequality. By our choice of $b_0$, we have that

$$\sum_{k=b}^{\infty} \Pr[B_{s+k} \in [0,1) \mid B_s = b] \le \frac{1}{2}.$$