The multi-armed bandit (MAB) problem characterizes a setting in which a limited amount of resource must be allocated among competing (alternative) choices so as to maximize the expected gain. The bandits with knapsacks (BwK) problem generalizes the MAB problem by allowing a more general resource-constraint structure on the decisions made over time, in addition to the customary limitation on the time horizon. Specifically, in the BwK problem, the decision maker/player chooses an arm to play at each time period; s/he receives a reward and consumes a certain amount of each of the multiple resource types. Accordingly, the objective is to maximize the cumulative reward over a finite time horizon subject to an initial budget for each resource type. The BwK problem was first introduced by badanidiyuru2013bandits as a general framework to model a wide range of applications, including dynamic pricing and revenue management (besbes2012blind), the AdWords problem (mehta2005adwords), and more.
The standard setting of the BwK problem is stochastic: the joint distribution of reward and resource consumption for each arm remains stationary (identical) over time. Under this setting, a linear program (LP) that takes the expected reward and resource consumption of each arm as input both serves as the benchmark for regret analysis and drives the algorithm design (badanidiyuru2013bandits; agrawal2014bandits). Notably, a static best distribution prescribed by the LP’s optimal solution is used to define the regret benchmark. An alternative setting is the adversarial BwK problem, where the reward and the consumption may no longer follow a distribution and can be chosen arbitrarily over time. Under the adversarial setting, sublinear regret is not achievable in the worst case; immorlica2019adversarial derive a competitive ratio against the static best distribution benchmark, which is aligned with the static optimal benchmark in the adversarial bandits problem (auer1995gambling). Another key aspect of the BwK problem is the number of resource types. With a single resource type, one optimal decision is to play the arm with the largest ratio of (expected) reward to (expected) resource consumption, and the algorithm design and analysis largely reduce to those of the MAB problem. With multiple resource types, the optimal decision in general requires playing a combination of arms (corresponding to the optimal basis of the underlying LP). rangi2018unifying focus on the single-resource case and propose an EXP3-based algorithm that attains a regret of against the best fixed distribution benchmark. Their result thus bridges the gap between the stochastic BwK problem and the adversarial BwK problem in the single-resource case. The difference between the single-resource and multiple-resource cases is also exhibited in the derivation of problem-dependent regret bounds for the stochastic BwK problem (flajolet2015logarithmic; li2021symmetry; sankararaman2021bandits).
In this paper, we study the non-stationary BwK problem, where the reward and the resource consumption at each time are sampled from a distribution as in the stochastic BwK problem, but the distribution may change over time. The setting relaxes the temporally i.i.d. assumption of the stochastic setting and can be viewed as a soft measure of adversity. We aim to relate the non-stationarity (or adversity) of the distribution change to the best achievable algorithm performance, and thus our result bridges the two extremes of the BwK problem: stochastic BwK and adversarial BwK. We consider a dynamic benchmark to define the regret; while such a benchmark is aligned with the dynamic benchmark in other non-stationary learning problems (besbes2014stochastic; besbes2015non; cheung2019learning; faury2021regret), it is stronger than the static distribution benchmark in adversarial BwK (rangi2018unifying; immorlica2019adversarial). Importantly, we use simple examples and lower bound results to show that traditional non-stationarity measures such as change points and the variation budget are not suitable for the BwK problem due to the presence of the constraints. We introduce a new non-stationarity measure called the global variation budget and employ both this new measure and the original variation budget to capture the underlying non-stationarity of the BwK problem. We analyze the performance of a sliding-window UCB-based BwK algorithm and derive a near-optimal regret bound. Furthermore, we show that the new non-stationarity measure can also be applied to the problem of online convex optimization with constraints (OCOwC) and extend the analyses therein.
1.1 Related literature
The study of non-stationary bandits problems begins with the change-point or piecewise-stationary setting, where the distribution of the rewards remains constant over epochs and changes at unknown time instants (garivier2008upper; yu2009piecewise). Prototypical non-stationary algorithms such as discounted UCB and sliding-window UCB are proposed and analyzed in (garivier2008upper) to robustify the standard UCB algorithm against environment changes. The prevalent variation budget measure (where and the norm bear different meanings in different contexts) is later proposed and widely studied in different contexts, such as non-stationary stochastic optimization (besbes2015non), non-stationary MAB (besbes2014stochastic), non-stationary linear bandits (cheung2019learning), and non-stationary generalized linear bandits (faury2021regret) problems. In general, these works derive a lower bound of , and propose algorithms that achieve a near-optimal regret of . cheung2019learning and faury2021regret require an additional orthogonality assumption on the decision set to attain such an upper bound; without this assumption, a regret bound of can be obtained (faury2021regret). With the soft measure of non-stationarity, the existing results manage to obtain regret bounds that are sublinear in against dynamic optimal benchmarks. In contrast, a regret linear in is generally inevitable against the dynamic benchmark when the underlying environment is adversarial. We remark that while all these existing works consider the unconstrained setting, our work complements this line of literature with a proper measure of non-stationarity in the constrained setting.
Another related stream of literature is the problem of online convex optimization with constraints (OCOwC), which extends the OCO problem to a constrained setting. Two types of constraints are considered: the long-term constraint (jenatton2016adaptive; neely2017online) and the cumulative constraint (yuan2018online; yi2021regret). The former defines the constraint violation by whilst the latter defines it by where is the positive-part function. The existing works mainly study the setting where for all and is known a priori. neely2017online considers a setting where is generated i.i.d. from some distribution. In this paper, we show that our non-stationarity measure naturally extends to this problem, and we derive bounds for OCOwC when ’s are generated in a non-stationary manner.
A line of work in the operations research and operations management literature also studies non-stationary environments for online decision-making problems under constraints (ma2020approximation; freund2019good; jiang2020online). The underlying problem along this line can be viewed as a full-information setting where, at each time , the decision is made after the observation of the function/realized randomness/customer type, while BwK and OCOwC can be viewed as partial-information settings where the decision is made prior to, and may affect, the observation. For the setting along this line, there is therefore generally no need for exploration in algorithm design, and the main challenge is to trade off resource consumption against reward earning.
2 Problem Setup
We first introduce the formulation of the BwK problem. The decision-maker/learner is given a fixed finite set of arms (with ), called the action set. There are knapsack constraints with a known initial budget of for . Without loss of generality, we assume for all . There is a finite time horizon , which is also known in advance. At each time , the learner must choose either to play an arm or to do nothing but wait. If the learner plays the arm at time , s/he will receive a reward and consume amount of each resource from the initial budget . By convention, we introduce a null arm to model “doing nothing”, which generates a reward of zero and consumes no resource at all. We assume is sampled from some distribution independently over time where and . In the stochastic BwK problem, the distribution remains unchanged over time, while in the adversarial BwK problem, is chosen adversarially. In our paper, we allow to be chosen adversarially, while we use some non-stationarity measure to control the extent of adversity in choosing ’s.
At each time , the learner needs to pick using the past observations until time but without observing the outcomes of time step . The resource constraints are assumed to be hard constraints, i.e., the learner must stop at the earliest time when at least one constraint is violated, i.e. , or the time horizon is exceeded. The objective is to maximize the expected cumulative reward until time , i.e. . To measure the performance of a learner, we define the regret of the algorithm/policy adopted by the learner as
Here denotes the expected cumulative reward of the optimal dynamic policy given all the knowledge of ’s in advance. Its definition is based on the dynamic optimal benchmark which allows the arm play decisions/distributions to change over time. As a result, it is stronger than the optimal fixed distribution benchmark used in the adversarial BwK setting (rangi2018unifying; immorlica2019adversarial).
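For concreteness, the regret described above can be sketched as follows. The notation is assumed for illustration and is not taken verbatim from the source: $\pi$ is the policy, $\tau$ the stopping time, $i_t$ the arm played at time $t$, and $r_{t,i}$ the reward of arm $i$ at time $t$; whether the sum runs to $\tau$ or $\tau - 1$ depends on the stopping convention.

```latex
\mathrm{Reg}(\pi, T) \;=\; \mathrm{OPT}(T) \;-\; \mathbb{E}\Big[\sum_{t=1}^{\tau} r_{t,\, i_t}\Big]
```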
2.1 A Motivating Example
The conventional variation budget is defined by
By varying the definition of the metric , this budget captures many of the existing non-stationarity measures for unconstrained learning problems. Now we use a simple example to illustrate why no longer fits the constrained setting. Similar examples have been used to motivate algorithm design and lower bound analysis in (golrezaei2014real; cheung2019learning; jiang2020online), but have not yet been exploited in a partial-information setting such as bandits problems.
Consider a BwK problem instance that has two arms (one actual arm and one null arm) and a single resource constraint with initial capacity of . Without loss of generality, we assume is even. The null arm has zero reward and zero resource consumption throughout the horizon. The actual arm always consumes 1 unit of resource (deterministically) for each play and outputs 1 unit of reward (deterministically) for the first half of the horizon, i.e., when . For the second half of the horizon , the reward of the actual arm will change to either or , and the change happens adversarially. For this problem instance, the distribution changes only once, i.e., (varying up to a constant due to the metric definition). But for this problem instance, a regret of is inevitable. To see this, if the player plays the actual arm no less than times, then the distribution of the second half can adversarially change to the reward , and this will result in a regret of at least . The same holds when the player plays the actual arm no more than times, and we defer the formal analysis to the proof of the lower bounds in Theorem 2.
This problem instance implies that a sublinear dependency on cannot be achieved when the variation budget alone is used to characterize the non-stationarity. The reason is that, in the presence of the constraint(s), the arm play decisions over time are coupled together not only through the learning procedure but also through the “global” resource constraint(s). For unconstrained problems, the non-stationarity affects the effectiveness of learning the system; for constrained problems, the non-stationarity further challenges the decision-making process through the lens of the constraints.
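The argument in the motivating example can be checked numerically with a small sketch. The concrete values are assumptions made for illustration, since the source text does not preserve them: a budget of `T/2`, and a second-half reward of either `1 + eps` or `1 - eps`. The hypothetical function `worst_case_regret` computes the regret, against the dynamic benchmark, of a player who plays the actual arm `k` times in the first half and then spends all remaining budget in the second half.

```python
def worst_case_regret(T, eps, k):
    """Regret of a player who plays the actual arm k times in the first half.

    Assumed instance (illustrative values): budget B = T/2, unit consumption,
    reward 1 in the first half, reward 1 + eps or 1 - eps in the second half.
    """
    B = T // 2
    assert 0 <= k <= B
    remaining = B - k  # budget left for the second half
    regrets = []
    for second_half_reward in (1 + eps, 1 - eps):
        # Dynamic benchmark: spend the whole budget in the better half.
        opt = B * max(1.0, second_half_reward)
        # Player: k plays at reward 1, then spends what is left afterwards.
        earned = k * 1.0 + remaining * second_half_reward
        regrets.append(opt - earned)
    return max(regrets)  # the adversary picks the worse branch


T, eps = 100, 0.5
best_k = min(range(T // 2 + 1), key=lambda k: worst_case_regret(T, eps, k))
print(best_k, worst_case_regret(T, eps, best_k))  # prints: 25 12.5
```

Under these assumed values, even the best choice of `k` leaves a regret of `eps * T / 4`, which grows linearly in `T` although the distribution changes only once.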
2.2 Non-stationarity Measure and Linear Programs
We denote the expected reward vector as and the expected consumption matrix as , i.e.,
We first follow the conventional variation budget and define the local non-stationarity budget:
Footnote 1: Throughout the paper, for a vector , we denote its norm and norm by . For a matrix , we denote its norm and norm by .
We refer to these measures as local ones in that they capture the local change of the distributions between time and time .
Next, we define the global non-stationarity budget:
where and . These measures capture the total deviations of all the ’s and from their global averages. By definition, and upper bound and (up to a constant), so they can be viewed as a stricter measure of non-stationarity than the local budget. In the definition of , the L norm is not essential; it serves to sharpen the regret bounds (by corresponding to the upper bound on the dual optimal solution in supremum norm, to be defined shortly).
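To make the contrast between the local and global budgets concrete, here is a minimal sketch for the reward side only. It assumes the sup-norm for both measures, which may differ from the norms used in the definitions above; the function names `local_budget` and `global_budget` are hypothetical.

```python
def local_budget(mus):
    """Sum of sup-norm changes between consecutive mean reward vectors."""
    return sum(
        max(abs(a - b) for a, b in zip(mu, nxt))
        for mu, nxt in zip(mus, mus[1:])
    )


def global_budget(mus):
    """Sum of sup-norm deviations of each mean vector from the time average."""
    T, m = len(mus), len(mus[0])
    avg = [sum(mu[i] for mu in mus) / T for i in range(m)]
    return sum(max(abs(x - a) for x, a in zip(mu, avg)) for mu in mus)


# Motivating instance: a null arm (reward 0) and one actual arm whose mean
# reward jumps from 1 to 1 + eps at the midpoint (eps is an assumed value).
T, eps = 8, 0.5
mus = [[0.0, 1.0]] * (T // 2) + [[0.0, 1.0 + eps]] * (T // 2)
print(local_budget(mus), global_budget(mus))  # prints: 0.5 2.0
```

On this instance the local budget stays at `eps` regardless of `T`, while the global budget equals `T * eps / 2`. By the triangle inequality, the local budget is at most twice the global one under the same norm.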
All the existing analyses of the BwK problem utilize the underlying linear program (LP) and establish the LP’s optimal objective value as an upper bound of the regret benchmark OPT. In a non-stationary environment, the underlying LP is given by
where and denotes the -dimensional standard simplex. We know that
In the rest of our paper, we will use for the analysis of regret upper bound. We remark that in terms of this LP upper bound, the dynamic benchmark allows the to take different values, while the static benchmark will impose an additional constraint to require all the be the same.
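The benchmark LP described above can be sketched in the following form. The symbols are assumed for illustration: $\mu_t \in \mathbb{R}^m$ is the expected reward vector at time $t$, $C_t \in \mathbb{R}^{d \times m}$ the expected consumption matrix, $B \in \mathbb{R}^d$ the budget vector, and $\Delta_m$ the standard simplex over the $m$ arms.

```latex
\begin{aligned}
\mathrm{LP}\big(\{\mu_t\}, \{C_t\}, T\big) \;=\; \max_{x_1, \dots, x_T} \quad & \sum_{t=1}^{T} \mu_t^\top x_t \\
\text{s.t.} \quad & \sum_{t=1}^{T} C_t x_t \le B, \\
& x_t \in \Delta_m, \quad t = 1, \dots, T.
\end{aligned}
```

In this form, the static benchmark corresponds to adding the constraint $x_1 = x_2 = \dots = x_T$, and the upper-bound relation reads $\mathrm{OPT}(T) \le \mathrm{LP}(\{\mu_t\}, \{C_t\}, T)$.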
For notational simplicity, we introduce the following linear growth assumption. All the results in this paper still hold without this condition.
Assumption 1 (Linear Growth).
We have the resource budget for some .
Define the single-step LP by
where The single-step LP’s optimal objective value can be interpreted as the single-step optimal reward under a normalized resource budget .
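For the special case of a single resource, the single-step LP can be solved by direct enumeration, which the sketch below illustrates. It rests on the fact that a feasible region defined by the simplex plus one inequality has vertices with at most two nonzero entries. The function name `single_step_lp` and its arguments (`mu` for expected rewards, `c` for expected consumption, `b` for the per-period budget) are hypothetical, and the general multi-resource case would need a proper LP solver.

```python
def single_step_lp(mu, c, b):
    """Maximize mu.x subject to c.x <= b and x in the probability simplex.

    Single-resource case only: enumerate pure arms and two-arm mixtures.
    Returns (optimal value, optimal distribution x).
    """
    m = len(mu)
    best_val, best_x = float("-inf"), None
    for i in range(m):
        if c[i] <= b and mu[i] > best_val:  # pure play of arm i is feasible
            x = [0.0] * m
            x[i] = 1.0
            best_val, best_x = mu[i], x
        for j in range(m):
            if c[i] < b < c[j]:  # mix cheap arm i and costly arm j to spend b
                p = (b - c[i]) / (c[j] - c[i])
                val = (1 - p) * mu[i] + p * mu[j]
                if val > best_val:
                    x = [0.0] * m
                    x[i], x[j] = 1 - p, p
                    best_val, best_x = val, x
    return best_val, best_x


# Null arm (index 0) and one actual arm with unit reward and unit consumption.
value, x = single_step_lp(mu=[0.0, 1.0], c=[0.0, 1.0], b=0.5)
print(value, x)  # prints: 0.5 [0.5, 0.5]
```

With a normalized budget of one half, the optimal rule plays the actual arm half of the time, which matches the interpretation of the single-step LP as the per-period optimal reward under an even allocation of the resource.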
Throughout the paper, we will use the dual program and the dual variables to relate the resource consumption with the reward, especially for the non-stationary environment. The dual of the benchmark LP is
where denotes an -dimensional all-one vector. Here we denote one optimal solution as
The dual of the single-step LP is
Here we denote one optimal solution as . We remark that these two dual LPs are always feasible by choosing and some large , so there always exists an optimal solution.
The dual optimal solutions and are also known as the dual price, and they quantify the cost efficiency of each arm play.
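As a sketch of the two dual programs discussed above, in assumed notation: $q \in \mathbb{R}^d$ is the dual price on the resource constraints, $\alpha_t$ (or $\alpha$) is the multiplier on the simplex constraint, $\mathbf{1}$ is the all-one vector, and $b = B/T$ is the per-period budget. Dualizing $\max \sum_t \mu_t^\top x_t$ subject to $\sum_t C_t x_t \le B$ and $x_t \in \Delta_m$ gives

```latex
\begin{aligned}
\text{(benchmark dual)} \qquad \min_{q,\ \alpha_1, \dots, \alpha_T} \quad & B^\top q + \sum_{t=1}^{T} \alpha_t \\
\text{s.t.} \quad & C_t^\top q + \alpha_t \mathbf{1} \ge \mu_t, \quad t = 1, \dots, T, \\
& q \ge 0, \\[6pt]
\text{(single-step dual)} \qquad \min_{q,\ \alpha} \quad & b^\top q + \alpha \\
\text{s.t.} \quad & C^\top q + \alpha \mathbf{1} \ge \mu, \\
& q \ge 0.
\end{aligned}
```

Both programs are feasible with $q = 0$ and a sufficiently large $\alpha$, in line with the remark above.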
The quantity captures the maximum amount of achievable reward by each unit of resource consumption. We will return with more discussion on this quantity after we present the regret bound.
We have the following upper bound on
Proposition 1 relates the optimal value of the benchmark with the optimal values of the single-step LPs. To interpret the bound, works as an upper bound of the OPT in defining the regret, and the summation of corresponds to the total reward obtained by evenly allocating the resource over all time periods. In a stationary environment, these two are the same, as the optimal decision naturally corresponds to an even allocation of the resources. In a non-stationary environment, it can happen that the optimal allocation of the resource is an uneven one for . For the problem instance in Section 2.1, the optimal allocation may be either to exhaust all the resource in the first half of the time periods or to preserve all the resource for the second half. In such a case, forcing an even allocation will reduce the total reward obtained. The proposition states that the reduction can be bounded by , where the non-stationarity in resource consumption is weighted by the dual price upper bound .
3 Sliding-Window UCB for Non-stationary BwK
In this section, we adapt the standard sliding-window UCB algorithm for the BwK problem (Algorithm 1) and derive a near-optimal regret bound. The algorithm terminates when any type of resource is exhausted. At each time , it constructs standard sliding-window confidence bounds for the reward and the resource consumption. Specifically, we define the sliding-window estimators by
where denotes the number of times that the -th arm has been played in the last time periods. To be optimistic on the objective value, UCBs are computed for rewards and LCBs are computed for the resource consumption, respectively. With the confidence bounds, the algorithm solves a single-step LP to prescribe a randomized rule for the arm play decision.
Our algorithm can be viewed as a combination of the standard sliding-window UCB algorithm (garivier2008upper; besbes2015non) with the UCB for BwK algorithm (agrawal2014bandits). It makes a minor change compared to (agrawal2014bandits) which solves a single-step LP with a shrinkage factor on the right-hand-side. The shrinkage factor therein ensures that the resources will not be exhausted until the end of the horizon, but it is not essential to solving the problem. For simplicity, we choose the more natural version of the algorithm which directly solves the single-step LP. We remark that the knowledge of the initial resource budget and the time horizon will only be used for defining the right-hand-side of the constraints for this .
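The overall loop can be illustrated with a minimal single-resource sketch. This is not the paper's Algorithm 1: the confidence radius `sqrt(log(T) / n)`, the treatment of the null arm (index 0) as known, the stopping rule, and all names (`solve_single_step_lp`, `sw_ucb_bwk`, `env`) are assumptions chosen to keep the example short and runnable.

```python
import math
import random
from collections import deque


def solve_single_step_lp(ucb, lcb, b):
    """Single-resource LP: maximize ucb.x s.t. lcb.x <= b, x in the simplex.

    Enumerates pure arms and two-arm mixes; returns {arm: probability}.
    """
    best_val, best_x = float("-inf"), None
    m = len(ucb)
    for i in range(m):
        if lcb[i] <= b and ucb[i] > best_val:
            best_val, best_x = ucb[i], {i: 1.0}
        for j in range(m):
            if lcb[i] < b < lcb[j]:
                p = (b - lcb[i]) / (lcb[j] - lcb[i])
                val = (1 - p) * ucb[i] + p * ucb[j]
                if val > best_val:
                    best_val, best_x = val, {i: 1 - p, j: p}
    return best_x


def sw_ucb_bwk(env, m, T, B, window, rng):
    """env(t, arm) -> (reward, consumption), both in [0, 1]; arm 0 is null."""
    history = deque(maxlen=window)  # records of the last `window` rounds
    budget, total_reward = float(B), 0.0
    for t in range(T):
        if budget < 1.0:  # assumed rule: stop before a play could overdraw
            break
        counts, r_sum, c_sum = [0] * m, [0.0] * m, [0.0] * m
        for arm, r, c in history:
            counts[arm] += 1
            r_sum[arm] += r
            c_sum[arm] += c
        ucb, lcb = [0.0] * m, [0.0] * m  # null arm stays at (0, 0)
        for i in range(1, m):
            if counts[i] == 0:  # unseen in the window: fully optimistic
                ucb[i], lcb[i] = 1.0, 0.0
            else:
                rad = math.sqrt(math.log(T) / counts[i])  # assumed radius
                ucb[i] = min(1.0, r_sum[i] / counts[i] + rad)
                lcb[i] = max(0.0, c_sum[i] / counts[i] - rad)
        x = solve_single_step_lp(ucb, lcb, B / T)
        arms = list(x)
        arm = rng.choices(arms, weights=[x[a] for a in arms])[0]
        r, c = env(t, arm)
        total_reward += r
        budget -= c
        history.append((arm, r, c))
    return total_reward, B - budget


def env(t, arm):
    """Toy environment: the actual arm's reward drops at the midpoint."""
    if arm == 0:
        return 0.0, 0.0
    return (1.0 if t < 100 else 0.5), 1.0


reward, used = sw_ucb_bwk(env, m=2, T=200, B=100.0, window=50, rng=random.Random(0))
print(reward, used)
```

The sketch keeps the two ingredients named in the text: estimates restricted to a sliding window, and optimism in both directions (upper bounds on rewards, lower bounds on consumption) fed into a single-step LP whose solution is used as a randomized arm-selection rule.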
Now we begin to analyze the algorithm’s performance. For starters, the following lemma states a standard concentration result for the sliding-window confidence bound.
With Lemma 2, we can employ a concentration argument to relate the realized reward (or resource consumption) with the reward (or resource consumption) of the LP under its optimal solution. In Lemma 3, recall that is the termination time of the algorithm where some type of resources is exhausted, and is defined in Algorithm 1 as the optimal solution of the LP solved at time .
For Algorithm 1, the following inequalities hold for all ,
with probability at least
We note that the single-step LP’s optimal solution is always subject to the resource constraints. So the second group of inequalities in Lemma 3 implies the following bound on the termination time . Recall that is the resource budget per time period; for a larger , the resource consumption process becomes more stable and the budget is accordingly less likely to be exhausted too early.
If we choose in Algorithm 1, the following inequality holds
with probability at least
To summarize, Lemma 3 compares the realized reward with the cumulative reward of the single-step LPs, and Corollary 1 bounds the termination time of the algorithm. Recall that Proposition 1 relates the cumulative reward of the single-step LPs with the underlying LP – the regret benchmark. Putting together these results, we can optimize and by choosing
and then obtain the final regret upper bound as follows.
Theorem 1 provides a regret upper bound for Algorithm 1 that consists of several parts. The first part of the regret bound is on the order of and it captures the regret when the underlying environment is stationary. The remaining parts of the regret bound characterize the relation between the intensity of non-stationarity and the algorithm performance. The non-stationarity from both the reward and the resource consumption contributes to the regret bound, and that from the resource consumption is weighted by a factor of or (see Lemma 1 for the relation between these two). For the local non-stationarity and , the algorithm requires prior knowledge of them to decide the window length, in line with the existing works on non-stationarity in unconstrained settings. For the global non-stationarity and , the algorithm does not require any prior knowledge and they contribute additively to the regret bound. Together with the lower bound results in Theorem 2, we argue that the regret bound cannot be further improved even with the knowledge of and .
When the underlying environment degenerates from a non-stationary one to a stationary one, all the terms related to , and will disappear and then the upper bound in Theorem 1 matches the regret upper bound for the stochastic BwK setting. In Theorem 1, we choose to represent the upper bound in terms of and so as to reveal its dependency on and draw a better comparison with the literature on unconstrained bandits problems. We provide a second version of Theorem 1 in Appendix D that matches the existing high-probability bounds using OPT (badanidiyuru2013bandits; agrawal2014bandits). In contrast to the -competitiveness result in the adversarial BwK (immorlica2019adversarial), our result implies that with a proper measure of the non-stationarity/adversity, the sliding-window design provides an effective approach to robustify the algorithm performance when the underlying environment changes from stationary to non-stationary, and the resulting algorithm performance will not drastically deteriorate when the intensity of the non-stationarity is small.
When the resource constraints become non-binding for the underlying LPs, the underlying environment degenerates from a constrained setting to an unconstrained setting. We separate the discussion into two cases: (i) the benchmark LP and all the single-step LPs have only non-binding constraints; (ii) the benchmark LP has only non-binding constraints but some single-step LPs have binding constraints. For case (i), the regret bound in Theorem 1 will match the non-stationary MAB bound (besbes2014stochastic). For case (ii), the match will not happen and this is inevitable. We elaborate on the discussion in Section C.
Theorem 2 (Regret lower bounds).
The following lower bounds hold for any policy ,
Theorem 2 presents a few lower bounds for the problem. The first and the second lower bounds are adapted from the lower bound example in non-stationary MAB (besbes2014stochastic) and the third lower bound is adapted from the motivating example in Section 2.1. There are simple examples where each one of these three lower bounds dominates the other two. In this sense, all the non-stationarity-related terms in the upper bound of Theorem 1 are necessary, including the parameters and . There is one gap between the lower bound and the upper bound with regard to the number of constraints in the term related to . We leave it as future work to reduce the factor to with some finer analysis. Furthermore, we provide a sharper definition of the global non-stationarity measure and in replacement of and in Appendix B. It makes no essential change to our analysis, and the two measures coincide with each other on the lower bound problem instance. We choose to use and for simplicity of presentation, while and can capture the more detailed temporal structure of the non-stationarity. The discussion leaves open the question of whether knowledge of some additional structure of the environment can further reduce the global non-stationarity.
4 Extension to Online Convex Optimization with Constraints
In this section, we show how our notion of non-stationarity measure can be extended to the problem of online convex optimization with constraints (OCOwC). Similar to BwK, OCOwC also models a sequential decision-making problem in the presence of constraints. Specifically, at each time , the player chooses an action from some convex set . After the choice, a convex cost function and a concave resource consumption function are revealed. As in the standard setting of OCO, the functions are adversarially chosen, and thus a static benchmark is considered and defined by
Denote its optimal solution as and its dual optimal solution as .
While the existing works consider the case when ’s are static or sampled i.i.d. from some distribution , we consider a non-stationary setting where may change adversarially over time. We define a global non-stationarity measure by
where and .
The OCOwC problem considers the following bi-objective performance measure:
where denotes the positive-part function and denotes the policy/algorithm.
Analogous to the single-step LPs, we consider an optimization problem with more restricted constraints as
Denote its optimal solution as , and its dual optimal solution as . The following proposition relates the two optimal objective values.
We assume that Slater’s condition holds for both the standard OCOwC program OPT and the restricted OCOwC program OPT. We assume that , and are uniformly bounded on and that itself is bounded. Moreover, we assume that their dual optimal solutions are uniformly bounded by , i.e.
For OCOwC problem, under Assumption 2, we have
Utilizing the proposition, we can show that the gradient-based algorithm of (neely2017online) achieves the following regret for the setting of OCO with non-stationary constraints. Moreover, in Appendix E we further extend the results to an oblivious adversarial setting where is sampled from some distribution and the distribution may change over time.
Under Assumption 2, the Virtual Queue Algorithm of (neely2017online) for any OCOwC problem (denoted by ) produces a decision sequence such that
The theorem shows that the non-stationarity, when measured properly, will not drastically deteriorate the performance of the algorithm for the OCOwC problem either. Moreover, the non-stationarity will not affect the constraint violation at all. Together with the results for the BwK problem, we argue that the new global non-stationarity measure serves as a proper one for constrained online learning problems. Note that the upper and lower bounds match up to a logarithmic factor (in a worst-case sense) subject to the non-stationarity measures. Future directions include refining the bounds in a more instance-dependent way and identifying useful prior knowledge on the non-stationarity for better algorithm design and analysis.
A.1 Proof of Lemma 1
We first inspect the null arm (say, the -th arm) where and . The global DLP must satisfy that
The same argument applies to the one-step LP such that for all .
Note that the reward is upper bounded by . Hence,
Combining the above two inequalities, we have
A.2 Proof of Proposition 1
The first inequality is straightforward from the fact that the feasible solutions of single-step LP’s yield a feasible solution for the global LP.
For the second inequality, we study the dual problems. By the strong duality of LP, we have
Denote the dual optimal solution w.r.t. by . Then
which induces a feasible solution to the dual program DLP, i.e. where
For the last inequality, similar duality arguments can be made with respect to . Taking a summation, we obtain the final inequality as desired. ∎
Lemma 4 (Azuma-Hoeffding’s inequality).
Consider a random variable with distribution supported on. Denote its expectation as . Let be the average of independent samples from this distribution. Then, , the following inequality holds with probability at least ,
More generally, this result holds if are random variables, , and .
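As a sanity check on this kind of concentration statement, the sketch below uses the textbook two-sided Hoeffding radius `sqrt(log(2/delta) / (2n))` for samples supported on `[0, 1]`. This form and its constants are an assumption and may differ from the exact statement of Lemma 4; the function names are hypothetical.

```python
import math
import random


def hoeffding_radius(n, delta):
    """Two-sided Hoeffding radius for the mean of n samples in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))


def empirical_failure_rate(n, delta, trials, rng, p=0.5):
    """Fraction of trials in which a Bernoulli(p) sample mean leaves the radius."""
    rad = hoeffding_radius(n, delta)
    failures = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        failures += abs(mean - p) > rad
    return failures / trials


rng = random.Random(0)
rate = empirical_failure_rate(n=50, delta=0.1, trials=2000, rng=rng)
print(rate)  # expected to fall well below delta = 0.1
```

The observed failure rate is typically far below `delta` because Hoeffding's inequality is conservative for Bernoulli samples.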
Next, we present a general bound for the normalized empirical mean of the sliding-window estimator:
For any window size , define the normalized empirical average within window size of some with mean for each arm at time step as
where is the number of plays of arm before time step within steps. Then for small such that , the following inequality holds with probability at least ,