1 Introduction
The multiarmed bandit (MAB) problem characterizes a problem for which a limited amount of resource must be allocated between competing (alternative) choices in a way that maximizes the expected gain. The bandits with knapsacks (BwK) problem generalizes the multiarmed bandits problem to allow more general resource constraints structure on the decisions made over time, in addition to the customary limitation on the time horizon. Specifically, for the BwK problem, the decision maker/player chooses to play an arm at each time period; s/he will receive a reward and consume certain amount of resource from each of the multiple resource types. Accordingly, the objective is to maximize the cumulative reward over a finite time horizon and subject to an initial budget of multiple resource types. The BwK problem was first introduced by badanidiyuru2013bandits as a general framework to model a wide range of applications, including dynamic pricing and revenue management (besbes2012blind), Adwords problem (mehta2005adwords) and more.
The standard setting of the BwK problem is stochastic where the joint distribution of reward and resource consumption for each arm remains stationary (identical) over time. Under such setting, a linear program (LP), that takes the expected reward and resource consumption of each arm as input, both serves as the benchmark for regret analysis and drives the algorithm design
(badanidiyuru2013bandits; agrawal2014bandits). Notably, a static best distribution prescribed by the LP’s optimal solution is used for defining the regret benchmark. An alternative setting is the adversarial BwK problem where the reward and the consumption may no long follow a distribution and they can be chosen arbitrarily over time. Under the adversarial setting, a sublinear regret is not achievable in the worst case; immorlica2019adversarial derive a competitive ratio against the static best distribution benchmark which is aligned with the static optimal benchmark in the adversarial bandits problem (auer1995gambling). Another key of the BwK problem is the number of resource types . When , one optimal decision is to play the arm with largest (expected) reward to (expected) resource consumption ratio, where the algorithm design and analysis can be largely reduced to the MAB problem. When , the optimal decision in general requires to play a combination of arms (corresponding the optimal basis of the underlying LP). rangi2018unifying focus on the case of and propose an EXP3based algorithm that attains a regret of against the best fixed distribution benchmark. Their result thus bridges the gap between the stochastic BwK problem and the adversarial BwK problem for the case of . The difference between the cases of and is also exhibited in the derivation of problemdependent regret bounds for the stochastic BwK problem (flajolet2015logarithmic; li2021symmetry; sankararaman2021bandits).In this paper, we study the nonstationary BwK problem where the reward and the resource consumption at each time are sampled from a distribution as the stochastic BwK problem but the distribution may change over time. The setting relaxes the temporally i.i.d. assumption in the stochastic setting and it can be viewed as a soft measure of adversity. We aim to relate the nonstationarity (or adversity) of the distribution change with the bestachievable algorithm performance, and thus our result bridges the two extremes of BwK problem: stochastic BwK and adversarial BwK. We consider a dynamic benchmark to define the regret; while such a benchmark is aligned with the dynamic benchmark in other nonstationary learning problem (besbes2014stochastic; besbes2015non; cheung2019learning; faury2021regret), it is stronger than the static distribution benchmark in adversarial BwK (rangi2018unifying; immorlica2019adversarial). Importantly, we use simple examples and lower bound results to show that the traditional nonstationarity measures such as change points and variation budget are not suitable for the BwK problem due to the presence of the constraints. We introduce a new nonstationarity measure called global variation budget and employ both of this new measure and the original variation budget to capture the underlying nonstationarity of the BwK problem. We analyze the performance of a slidingwindow UCBbased BwK algorithm and derive a nearoptimal regret bound. Furthermore, we show that the new nonstationarity measure can also be applied to the problem of online convex optimization with constraints (OCOwC) and extend the analyses therein.
1.1 Related literature
The study of nonstationary bandits problem begins with the changepoint or piecewisestationary setting where the distribution of the rewards remains constant over epochs and changes at unknown time instants
(garivier2008upper; yu2009piecewise). The prototype of nonstationary algorithms such as discounted UCB and slidingwindow UCB are proposed and analyzed in (garivier2008upper) to robustify the standard UCB algorithm against the environment change. The prevalent variation budget measure (where and the norm bear different meaning under different context) is later proposed and widely studied under different contexts, such as nonstationary stochastic optimization (besbes2015non), nonstationary MAB (besbes2014stochastic), nonstationary linear bandits (cheung2019learning), and nonstationary generalized linear bandits (faury2021regret) problems. In general, these works derive lower bound of , and propose algorithms that achieve nearoptimal regret of cheung2019learning and faury2021regret require additional orthogonality assumption on the decision set to attain such upper bound; without this assumption, a regret bound of can be obtained (faury2021regret). With the soft measure of nonstationarity, the existing results manage to obtain sublinear regret bounds in against dynamic optimal benchmarks. In contrast, a linear regret in is generally inevitable against the dynamic benchmark when the underlying environment is adversarial. We remark that while all these existing works consider the unconstrained setting, our work complements this line of literature with a proper measure of nonstationarity in the constrained setting.Another related stream of literature is the problem of online convex optimization with constraints (OCOwC) which extends the OCO problem in a constrained setting. There are two types of constraints considered: the longterm constraint (jenatton2016adaptive; neely2017online) and the cumulative constraint (yuan2018online; yi2021regret). The former defines the constraint violation by whilst the latter defines it by where is the positivepart function. The existing works mainly study the setting where for all and is known a priori. neely2017online considers a setting where is i.i.d. generated from some distribution. In this paper, we show that our nonstationarity measure naturally extends to this problem and derives bounds for OCOwC when ’s are generated in a nonstationary manner.
A line of works in operations research and operations management literature also study nonstationary environment for online decision making problem under constraints (ma2020approximation; freund2019good; jiang2020online). The underlying problem along this line can be viewed as a fullinformation setting where at each time , the decision is made after the observation of the function/realized randomness/customer type, while the BwK and OCOwC can be viewed as a partialinformation setting where the decision is made prior to and may affect the observation. So for the setting along this line, there is generally no need for exploration in algorithm design, and the main challenge is to trade off the resource consumption with the reward earning.
2 Problem Setup
We first introduce the formulation of the BwK problem. The decisionmaker/learner is given a fixed finite set of arms (with ) called as action set. There are knapsack constraints with a known initial budget of for . Without loss of generality, we assume for all There is a finite time horizon , which is also known in advance. At each time , the learner must choose either to play an arm or to do nothing but wait. If the learner plays the arm at time , s/he will receive a reward and consume amount of each resource from the initial budget . As the convention, we introduce a null arm to model “doing nothing” which generates a reward of zero and consumes no resource at all. We assume is sampled from some distribution independently over time where and . In the stochastic BwK problem, the distribution remains unchanged over time, while in the adversarial BwK problem, is chosen adversarially. In our paper, we allow to be chosen adversarially, while we use some nonstationarity measure to control the extent of adversity in choosing ’s.
At each time , the learner needs to pick using the past observations until time but without observing the outcomes of time step . The resource constraints are assumed to be hard constraints, i.e., the learner must stop at the earliest time when at least one constraint is violated, i.e. , or the time horizon is exceeded. The objective is to maximize the expected cumulative reward until time , i.e. . To measure the performance of a learner, we define the regret of the algorithm/policy adopted by the learner as
Here denotes the expected cumulative reward of the optimal dynamic policy given all the knowledge of ’s in advance. Its definition is based on the dynamic optimal benchmark which allows the arm play decisions/distributions to change over time. As a result, it is stronger than the optimal fixed distribution benchmark used in the adversarial BwK setting (rangi2018unifying; immorlica2019adversarial).
2.1 A Motivating Example
The conventional variation budget is defined by
By twisting the definition of the metric , it captures many of the existing nonstationary measures for unconstrained learning problems. Now we use a simple example to illustrate why no longer fits for the constrained setting. Similar examples have been used to motivate algorithm design and lower bound analysis in (golrezaei2014real; cheung2019learning; jiang2020online), but have not been yet be exploited in a partialinformation setting such as bandits problems.
Consider a BwK problem instance that has two arms (one actual arm and one null arm), and a single resource constraint with initial capacity of . Without loss of generality, we assume is even. The null arm has zero reward and zero resource consumption throughout the horizon, and the actual arm always consumes 1 unit of resource (deterministically) for each play and outputs 1 unit of reward (deterministically) for the first half of the horizon, i.e., when For the second half of the horizon , the reward of the actual arm will change to either or , and the change happens adversarially. For this problem instance, the distribution only changes once, i.e., (varying up to constant due to the metric definition). But for this problem instance, a regret of is inevitable. To see this, if the player plays the actual arm no less than times, then the distributions of the second half can adversarially change to the reward , and this will result in a regret at least. The same for the case of playing the actual arm for the case of no more than times, and we defer the formal analysis to the proof of the lower bounds in Theorem 2.
This problem instance implies that a sublinear dependency on cannot be achieved with merely the variation budget to characterize the nonstationarity. Because with the presence of the constraint(s), the arm play decisions over time are all coupled together not only through the learning procedure, but also through the “global” resource constraint(s). For the unconstrained problems, the nonstationarity affects the effectiveness of the learning of the system; for the constrained problems, the nonstationarity further challenges the decision making process through the lens of the constraints.
2.2 Nonstationarity Measure and Linear Programs
We first follow the conventional variation budget and define the local nonstationarity budget: ^{1}^{1}1Throughout the paper, for a vector , we denote its norm and norm by For a matrix , we denote its norm and norm by
We refer to the measure as a local one in that they capture the local change of the distributions between time and time .
Next, we define the global nonstationarity budget:
where and . These measures capture the total deviations for all the ’s and from their global averages. By definition, and upper bound and (up to a constant), so they can be viewed as a more strict measure of nonstationarity than the local budget. In the definition of , the L norm is not essential and it aims to sharpen the regret bounds (by corresponding to the upper bound on the dual optimal solution in supremum norm to be defined shortly).
All the existing analyses of the BwK problem utilize the underlying linear program (LP) and establish the LP’s optimal objective value as an upper bound of the regret benchmark OPT. In a nonstationary environment, the underlying LP is given by
where and denotes the dimensional standard simplex. We know that
In the rest of our paper, we will use for the analysis of regret upper bound. We remark that in terms of this LP upper bound, the dynamic benchmark allows the to take different values, while the static benchmark will impose an additional constraint to require all the be the same.
For notation simplicity, we introduce the following linear growth assumption. All the results in this paper still hold without this condition.
Assumption 1 (Linear Growth).
We have the resource budget for some .
Define the singlestep LP by
where The singlestep LP’s optimal objective value can be interpreted as the singlestep optimal reward under a normalized resource budget .
Throughout the paper, we will use the dual program and the dual variables to relate the resource consumption with the reward, especially for the nonstationary environment. The dual of the benchmark LP is
s.t.  
where denotes an dimensional allone vector. Here we denotes one optimal solution as
The dual of the singlestep LP is
s.t.  
Here we denotes one optimal solution as We remark that these two dual LPs are always feasible by choosing and some large , so there always exists an optimal solution.
The dual optimal solutions and are also known as the dual price, and they quantify the cost efficiency of each arm play.
Define
The quantity captures the maximum amount of achievable reward by each unit of resource consumption. We will return with more discussion on this quantity after we present the regret bound.
Lemma 1.
We have the following upper bound on
Proposition 1.
We have
Proposition 1 relates the optimal value of the benchmark with the optimal values of the singlestep LPs. To interpret the bound, works as an upper bound of the OPT in defining the regret, and the summation of corresponds to the total reward obtained by evenly allocating the resource over all time periods. In a stationary environment, these two are the same as the optimal decision naturally corresponds to an even allocation of the resources. In a nonstationary environment, it can happen that the optimal allocation of the resource corresponds an uneven one for . For the problem instance in Section 2.1, the optimal allocation may be either to exhaust all the resource in first half of time periods or preserve all the resource for the second half. In such case, forcing an even allocation will reduce the total reward obtained. The proposition tells that the reduction can be bounded by where the nonstationarity in resource consumption is weighted by the dual price upper bound
3 SlidingWindow UCB for Nonstationary BwK
In this section, we adapt the standard slidingwindow UCB algorithm for the BwK problem (Algorithm 1) and derive a nearoptimal regret bound. The algorithm will terminate when any type of the resources is exhausted. At each time
, it constructs standard slidingwindow confidence bounds for the reward and the resource consumption. Specifically, we define the slidingwindow estimators by
where denotes the number of times that the th arm has been played in the last time periods. To be optimistic on the objective value, UCBs are computed for rewards and LCBs are computed for the resource consumption, respectively. With the confidence bounds, the algorithm solves a singlestep LP to prescribe a randomized rule for the arm play decision.
Our algorithm can be viewed as a combination of the standard slidingwindow UCB algorithm (garivier2008upper; besbes2015non) with the UCB for BwK algorithm (agrawal2014bandits). It makes a minor change compared to (agrawal2014bandits) which solves a singlestep LP with a shrinkage factor on the righthandside. The shrinkage factor therein ensures that the resources will not be exhausted until the end of the horizon, but it is not essential to solving the problem. For simplicity, we choose the more natural version of the algorithm which directly solves the singlestep LP. We remark that the knowledge of the initial resource budget and the time horizon will only be used for defining the righthandside of the constraints for this .
Now we begin to analyze the algorithm’s performance. For starters, the following lemma states a standard concentration result for the slidingwindow confidence bound.
Lemma 2.
The following inequalities hold for all
with probability at least
:where the UCB and LCB estimators are defined in Algorithm 1.
With Lemma 2, we can employ a concentration argument to relate the realized reward (or resource consumption) with the reward (or resource consumption) of the LP under its optimal solution. In Lemma 3, recall that is the termination time of the algorithm where some type of resources is exhausted, and is defined in Algorithm 1 as the optimal solution of the LP solved at time .
Lemma 3.
We note that the singlestep LP’s optimal solution is always subject to the resource constraints. So the second group of inequalities in Lemma 3 implies the following bound on the termination time . Recall that is the resource budget per time period; for a larger , the resource consumption process becomes more stable and the budget is accordingly less likely to be exhausted too early.
Corollary 1.
To summarize, Lemma 3 compares the realized reward with the cumulative reward of the singlestep LPs, and Corollary 1 bounds the termination time of the algorithm. Recall that Proposition 1 relates the cumulative reward of the singlestep LPs with the underlying LP – the regret benchmark. Putting together these results, we can optimize and by choosing
and then obtain the final regret upper bound as follows.
Theorem 1.
Theorem 1 provides a regret upper bound for Algorithm 1 that consists of several parts. The first part of the regret bound is on the order of and it captures the regret when the underlying environment is stationary. The remaining parts of the regret bound characterize the relation between the intensity of nonstationarity and the algorithm performance. The nonstationarity from both the reward and the resource consumption will contribute to the regret bound and that from the resource consumption will be weighted by a factor of or (See Lemma 1 for the relation between these two). For the local nonstationarity and , the algorithm requires a prior knowledge of them to decide the window length, aligned with the existing works on nonstationarity in unconstrained settings. For the global nonstationarity and , the algorithm does not require any prior knowledge and they will contribute additively to the regret bound. Together with the lower bound results in Theorem 2, we argue that the regret bound cannot be further improved even with the knowledge of and .
When the underlying environment degenerates from a nonstationary one to a stationary one, all the terms related to , and will disappear and then the upper bound in Theorem 1 matches the regret upper bound for the stochastic BwK setting. In Theorem 1, we choose to represent the upper bound in terms of and so as to reveal its dependency on and draw a better comparison with the literature on unconstrained bandits problem. We provide a second version of Theorem 1 in Appendix D that matches the existing high probability bounds using OPT (badanidiyuru2013bandits; agrawal2014bandits). In contrast to the competitiveness result in the adversarial BwK (immorlica2019adversarial), our result implies that with a property measure of the nonstationarity/adversity, the slidingwindow design provides an effective approach to robustify the algorithm performance when the underlying environment changes from stationary to nonstationary, and the according algorithm performance will not drastically deteriorate when the intensity of the nonstationarity is small.
When the resource constraints become nonbinding for the underlying LPs, the underlying environment degenerates from a constrained setting to an unconstrained setting. We separate the discussion for the two cases: (i) the benchmark LP and all the singlestep LPs have only nonbinding constraints; (ii) the benchmark LP have only nonbinding constraints but some singlestep LP have binding constraints. For case (i), the regret bound in Theorem 1 will match the nonstationary MAB bound (besbes2014stochastic). For case (ii), the match will not happen and this is inevitable. We elaborate the discussion in Section C.
Theorem 2 (Regret lower bounds).
The following lower bounds hold for any policy ,

.

.

.
Theorem 2 presents a few lower bounds for the problem. The first and the second lower bounds are adapted from the lower bound example in nonstationary MAB (besbes2014stochastic) and the third lower bound is adapted from the motivating example in 2.1. There are simple examples where each one of these three lower bounds dominates over the other two. In this sense, all the nonstationarityrelated terms in the upper bound of Theorem 1 are necessary including the parameters and . There is one gap between the lower bound and the upper bound with regard to the number of constraints in the term related to . We leave it as future work to reduce the factor to with some finer analysis. Furthermore, we provide a sharper definition of the global nonstationarity measure and in replacement of and in Appendix B. It makes no essential change to our analysis, and the two measures coincide with each other on the lower bound problem instance. We choose to use and for presentation simplicity, while and can capture the more detailed temporal structure of the nonstationarity. The discussion leaves an open question that whether the knowledge of some additional structure of the environment can further reduce the global nonstationarity.
4 Extension to Online Convex Optimization with Constraints
In this section, we show how our notion of nonstationarity measure can be extended to the problem of online convex optimization with constraints (OCOwC). Similar to BwK, OCOwC also models a sequential decision making problem under the presence of constraints. Specifically, at each time , the player chooses an action from some convex set . After the choice, a convex cost function and a concave resource consumption function are revealed. As in the standard setting of OCO, the functions is adversarially chosen and thus a static benchmark is consider and defined by
s.t. 
Denote its optimal solution as and its dual optimal solution as .
While the existing works consider the case when ’s are static or sample i.i.d. from some distribution We consider a nonstationary setting where may change adversarially over time. We define a global nonstationarity measure by
where and .
The OCOwC problem considers the following biobjective performance measure:
where denotes the positivepart function and denotes the policy/algorithm.
In analogous to the singlestep LPs, we consider an optimization problem with more restricted constraints as
s.t. 
Denote its optimal solution as , and its dual optimal solution as . The following proposition relates the two optimal objective values.
Assumption 2.
We assume that Slater’s condition holds for both the standard OCOwC program OPT and the restricted OCOwC program OPT. We assume that , and are uniformly bounded on and that itself is bounded. Moreover, we assume that their dual optimal solutions are uniformly bounded by , i.e.
The following proposition relates the two optimal objective values.
Proposition 2.
For OCOwC problem, under Assumption 2, we have
Utilizing the proposition, we can show that the gradientbased algorithm of (neely2017online) achieves the following regret for the setting of OCO with nonstationary constraints. Moreover, we further extend the results and discuss in Appendix E on an oblivious adversarial setting where is sampled from some distribution and the distribution may change over time.
Theorem 3.
Under Assumption 2, the Virtual Queue Algorithm of (neely2017online) for any OCOwC problem (denoted by ) produces a decision sequence such that
The theorem tells that the nonstationarity when measured properly will not drastically deteriorate the performance of the algorithm for the OCOwC problem as well. Moreover, the nonstationarity will not affect the constraint violation at all. Together with the results for the BwK problem, we argue that the new global nonstationarity measure serves as a proper one for the constrained online learning problems. Note that the upper and lower bounds match up to a logarithmic factor (in a worstcase sense) subject to the nonstationarity measures. The future direction can be to refine the bounds in a more instancedependent way and to identify useful prior knowledge on the nonstationarity for better algorithm design and analysis.
References
Appendix A Proofs of Section 2 and Section 3
a.1 Proof of Lemma 1
Proof.
We first inspect the null arm (say, the th arm) where and . The global DLP must satisfy that
i.e.
The same argument applies to the onestep LP such that for all .
Note that the reward is upper bounded by . Hence,
Therefore,
and
Combining above two inequalities together, we have
∎
a.2 Proof of Proposition 1
Proof.
The first inequality is straightforward from the fact that the feasible solutions of singlestep LP’s yield a feasible solution for the global LP.
For the second inequality, we study the dual problems. By the strong duality of LP, we have
Denote the dual optimal solution w.r.t. by . Then
implies that
which induces a feasible solution to the dual program DLP, i.e. where
Hence,
For the last inequality, similar duality arguments can be made with respect to . Taking a summation, we yield the final inequality as desired. ∎
a.3 Proofs of Lemma 2 and Lemma 3
Lemma 4 (AzumaHoeffding’s inequality).
Consider a random variable with distribution supported on
. Denote its expectation as . Let be the average of independent samples from this distribution. Then, , the following inequality holds with probability at least ,More generally, this result holds if are random variables, , and .
Next, we present a general bound for the normalized empirical mean of the slidingwindow estimator:
Lemma 5.
For any window size , define the normalized empirical average within window size of some with mean for each arm at time step as
where is the number of plays of arm before time step within steps. Then for small such that , the following inequality holds with probability at least ,