# Online Stochastic Optimization with Wasserstein Based Non-stationarity

We consider a general online stochastic optimization problem with multiple budget constraints over a horizon of finite time periods. At each time period, a reward function and multiple cost functions, where each cost function is involved in the consumption of one corresponding budget, are drawn from an unknown distribution, which is assumed to be non-stationary across time. Then, a decision maker needs to specify an action from a convex and compact action set to collect the reward, and the consumption of each budget is determined jointly by the cost functions and the taken action. The objective of the decision maker is to maximize the cumulative reward subject to the budget constraints. Our model captures a wide range of applications including online linear programming and network revenue management, among others. In this paper, we design near-optimal policies for the decision maker under the following two specific settings: a data-driven setting where the decision maker is given prior estimates of the distributions beforehand and a no information setting where the distributions are completely unknown to the decision maker. Under each setting, we propose a new Wasserstein-distance based measure of the non-stationarity of the distributions across time periods and show that this measure leads to a necessary and sufficient condition for the attainability of a sublinear regret. For the first setting, we propose a new algorithm which blends gradient descent steps with the prior estimates. We then adapt our algorithm for the second setting and propose another gradient descent based algorithm. We show that under both settings, our policies achieve a regret upper bound of optimal order. Moreover, our policies can be naturally combined with a re-solving procedure which further boosts the empirical performance in numerical experiments.

## 1 Introduction

In this paper, we study a general online stochastic optimization problem with budgets, each with an initial capacity, over a horizon of finite discrete time periods. At each time period , a reward function and a cost function are drawn independently from a distribution. Then the decision maker should specify a decision , where is assumed to be a convex and compact set. Accordingly, a reward is generated, and each budget is consumed by amount of budget, where . The decision maker’s objective is to maximize the total generated reward subject to the budget capacity constraints.

Our formulation generalizes several existing problems studied in the literature. When and are linear functions for each , our formulation reduces to the online linear programming (OLP) problem (Buchbinder and Naor, 2009). Our formulation can also be applied to the network revenue management (NRM) problem (Talluri and Van Ryzin, 2006), including the quantity-based model, the price-based model and the choice-based model (Talluri and Van Ryzin, 2004) (see detailed discussions in Section ). Note that in the OLP problem, the reward function and cost functions are assumed to be drawn from an unknown distribution which is stationary across time (Li and Ye, 2019), while in the NRM problem, the distribution is usually assumed to be known to the decision maker though it can be non-stationary across time (Talluri and Van Ryzin, 2006). In this paper, we assume an unknown non-stationary input, i.e., and are drawn from an unknown distribution which is non-stationary across time. More specifically, we consider the following two settings: a data-driven setting where there exists an available prior estimate to approximate the true distribution at each time period, and a no distribution information setting where the distribution at each time period is completely unknown to the decision maker. Note that the first setting reduces to the known non-stationary setting of the NRM problem when the prior estimates are identical to the true distributions, while the second setting reduces to the unknown stationary setting of the OLP problem when the distributions at all periods are identical.

Though we consider an unknown non-stationary input, it may be too pessimistic to consider an adversarial setting where the distribution at each time period could be arbitrarily chosen. Moreover, for our first setting where there exist prior estimates of the distributions, the estimates are usually “close” to the true distributions. Thus, we assume that for each setting, the true distributions fall into an uncertainty set, which controls the non-stationarity or the estimate ambiguity of the distributions. Our goal is to derive near-optimal policies for both settings, which perform well over the uncertainty set. We compare the performance of our policies to the so-called “offline” optimization problem, i.e., maximizing the objective function with full information/knowledge of all the ’s and ’s. Moreover, we use worst-case regret to measure the performance of our policies over the uncertainty set, which is defined as the maximal difference between the expected value of the “offline” problem and the expected reward collected by the policy, over all distributions in the uncertainty set. The formal definitions will be provided in the next section after introducing the notations and formulations.

### 1.1 Main Results and Contributions

For our first data-driven setting, we assume the availability of some prior estimate for ’s, where denotes the distribution of the arrival input at time period . We propose a new non-stationarity measure, which is defined as the cumulative Wasserstein distance of the prior estimate from the true distribution for each , and we name this new measure the Wasserstein-based non-stationarity budget with prior estimate (WBNB-P). Then, we introduce an uncertainty set based on WBNB-P driven by a parameter , which is called the variation budget, and the set covers all the arrival inputs ’s that have WBNB-P no greater than the variation budget. We illustrate the sharpness of our WBNB-P by showing that if the variation budget is linear in , sublinear regret cannot be achieved by any policy. Note that the Wasserstein distance has been widely used as a measure of ambiguity in the distributionally robust optimization literature (e.g., Esfahani and Kuhn (2018)) for its power to represent confidence sets and its strong performance, both theoretically and empirically. To the best of our knowledge, we are the first to propose its use in online optimization to measure estimate ambiguity (or non-stationarity in the second setting).
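To make the idea of a cumulative Wasserstein deviation concrete, here is a minimal Python sketch under simplifying assumptions that are not taken from the paper: parameters are real numbers compared with the absolute-value metric (rather than a metric on the parameter space), and each distribution is represented by an empirical sample of equal size. The names `wasserstein_1d` and `cumulative_deviation` are hypothetical.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two empirical distributions on the
    real line, each given by equally weighted samples of the same size."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty samples of equal size")
    # In one dimension an optimal coupling matches the sorted samples.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def cumulative_deviation(priors: Sequence[Sequence[float]],
                         truths: Sequence[Sequence[float]]) -> float:
    """Sum over time periods of the distance between the prior estimate
    and the true distribution (illustrative stand-in for WBNB-P)."""
    return sum(wasserstein_1d(p, q) for p, q in zip(priors, truths))


# Two periods: the prior is exact in period 1 and shifted by 1 in period 2.
priors = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
truths = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
total = cumulative_deviation(priors, truths)
```

An uncertainty set with a given variation budget would then contain exactly those sequences of true distributions whose cumulative deviation from the priors does not exceed that budget.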

We develop a new gradient-based algorithm that adjusts the gradient descent direction according to all the prior estimates. Our algorithm is motivated by the traditional online gradient descent (OGD) algorithm (Hazan, 2016), which applies a linear update rule according to the gradient of the functions at the current time period. Note that the OGD algorithm uses only historical information in every step and it has been shown to work well in a stationary setting, even when the distribution is unknown (Lu et al., 2020; Sun et al., 2020; Li et al., 2020). For the non-stationary setting, we have to make use of the prior estimates of the future time periods to guide the budget consumption. For that purpose, we develop a new gradient descent algorithm which combines the linear update at each time period with the offline convex relaxation obtained over the prior estimates. We show that our algorithm achieves a regret bound, which is of optimal order.

Note that even for a special case where the prior estimate is identical to the true distribution at each time period, i.e., a known non-stationary setting, our regret bound turns out to be new. A similar result for this setting is only known in Devanur et al. (2019) for a competitive ratio, where denotes the minimal capacity of the budget constraints and they assume the reward function and cost function are all linear functions. However, their result on competitive ratio does not translate to a result on regret. It is also not clear how to generalize their method to the setting where the true distributions are unknown and there is estimate ambiguity. Our algorithm and analysis are totally different from theirs. Specifically, their method is based on the concentration property of the arrival process and applies Chernoff-type inequalities to derive high probability bounds. In contrast, our approach is based on applying an adjusted gradient descent step to balance the budget consumption. We show that the budget consumption on every sample path can be represented by certain dual variables and our update rule ensures that these dual variables are bounded almost surely. In this way, we provide a new methodology to analyze the online optimization problem in a non-stationary environment.

For the second setting where no prior estimates of the distributions are available, we modify our WBNB-P by replacing the prior estimate of each distribution with their uniform mixture distribution in the Wasserstein distance. Then the Wasserstein distance measure can be regarded as a measure over the non-stationarity of the distributions and we formulate the uncertainty set accordingly with a variation budget. In this case, the offline convex relaxation admits a trivial solution (capacity should be allocated equally across time), and our adjusted gradient descent algorithm reduces to the classical gradient descent algorithm. We prove that our algorithm achieves a regret bound of optimal order, even when the distributions are chosen adversarially over the uncertainty set.

Note that there is a stream of literature that studies non-stationary online optimization without budget constraints (Besbes et al., 2015; Cheung et al., 2019), which also constructs the uncertainty set via a variation budget. However, their non-stationarity measure is concerned with the temporal change of the distributions over time. We provide an example in Section 4.1 showing that such a measure would fail in a budget constrained setting. This motivates us to propose our measure based on the global change of the distributions, i.e., comparing each with their uniform mixture distribution. An independent work (Balseiro et al., 2020) also considers using the global change of the distributions to derive a measure of non-stationarity. However, their measure is based on the total variation metric between distributions. By illustrating the advantage of using Wasserstein distance instead of total variation distance or KL-divergence through a simple example in Section 4.2, we show that our measure is sharper and we establish the suitability of the Wasserstein-based non-stationarity measure.
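The contrast between Wasserstein distance and total variation can be seen in a standard textbook illustration, which is not necessarily the example used in Section 4.2: two point masses that sit a small distance apart. The Python sketch below assumes discrete distributions on the real line stored as `{point: probability}` dictionaries; the function names are hypothetical.

```python
from typing import Dict


def total_variation(p: Dict[float, float], q: Dict[float, float]) -> float:
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


def wasserstein_1d_discrete(p: Dict[float, float], q: Dict[float, float]) -> float:
    """Wasserstein-1 distance on the real line, computed as the integral of
    the absolute difference between the two cumulative distribution functions."""
    support = sorted(set(p) | set(q))
    dist, cdf_gap = 0.0, 0.0
    for left, right in zip(support, support[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        dist += abs(cdf_gap) * (right - left)
    return dist


eps = 0.01
p = {0.0: 1.0}   # point mass at 0
q = {eps: 1.0}   # point mass at eps
tv = total_variation(p, q)
w1 = wasserstein_1d_discrete(p, q)
```

Total variation treats the two distributions as maximally different no matter how small `eps` is, whereas the Wasserstein distance shrinks with `eps`, which is the kind of sensitivity to the size of a perturbation that a sharper non-stationarity measure needs.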

Finally, to the best of our knowledge, our model is new compared to the existing literature. Our measure in both settings can be universally applied to various online linear programming and network revenue management formulations, and it thus fills the gap between the studies of these problems in the stochastic setting and the adversarial setting. Specifically, for the first setting, the prior knowledge could be obtained from the historical data and its presence is aligned with the settings in the network revenue management literature (Talluri and Van Ryzin, 2004; Gallego et al., 2019). However, the network revenue management literature always assumes precise knowledge of the true input, while our paper allows a deviation of the prior estimate from the true distribution. The deviation can be interpreted as an estimation or model misspecification error. Thus our results in the first setting generalize this line of literature.

For the second setting, the assumption of no available prior knowledge is consistent with the setting of the online linear programming problem (Molinaro and Ravi, 2013; Agrawal et al., 2014; Gupta and Molinaro, 2014) and the setting of blind network revenue management (Besbes and Zeevi, 2012; Jasin, 2015). For the online linear programming problem, the literature studies either the stochastic setting or the random permutation setting, and for blind network revenue management, the literature focuses only on the stochastic setting. Compared to these two streams of literature, our results in the second setting relax the stochastic assumption in a non-stationary (more adversarial but not fully adversarial) manner.

From a modeling perspective, our work contributes to the study of non-stationary environments for online learning/optimization problems. This line of literature has mainly been concerned with unconstrained settings such as the unconstrained online optimization problem (Besbes et al., 2015), the bandits problem (Garivier and Moulines, 2008; Besbes et al., 2014), and the reinforcement learning problem (Cheung et al., 2019; Lecarpentier and Rachelson, 2019). Our notion of Wasserstein-based non-stationarity adds to the current dictionary of non-stationarity definitions, and it is specialized to characterize the constrained setting.

### 1.2 Literature review

The formulation of online stochastic optimization studied in this paper is rooted in two major applications: the online linear programming (LP) problem and the network revenue management problem. We briefly review these two streams of literature as follows.

The online LP problem (Molinaro and Ravi, 2013; Agrawal et al., 2014; Gupta and Molinaro, 2014) covers a wide range of applications through different ways of specifying the underlying LP, including the secretary problem (Ferguson and others, 1989), the knapsack problem (Kellerer et al., 2003), the resource allocation problem (Vanderbei and others, 2015), the quantity-based network revenue management (NRM) problem (Jasin, 2015), the generalized assignment problem (Conforti et al., 2014), the network routing problem (Buchbinder and Naor, 2009), the matching problem (Mehta et al., 2005), etc. Notably, the problem has been studied under either the stochastic input model, where the coefficient in the objective function, together with the corresponding column in the constraint matrix, is drawn from an unknown distribution, or the random permutation model, where they arrive in a random order. As noted in Li et al. (2020), the random permutation model exhibits similar concentration behavior as the stochastic input model. The non-stationary setting of our paper relaxes the i.i.d. structure and can be viewed as a third paradigm for analyzing the online LP problem.

The network revenue management (NRM) problem has been extensively studied in the literature and a main focus is to propose near-optimal policies with strong theoretical guarantees. One popular way is to construct a linear program as an upper bound of the optimal revenue and use the optimal solution to derive heuristic policies. Specifically, Gallego and Van Ryzin (1994) propose a static bid-price policy based on the dual variable of the linear programming upper bound and prove that the revenue loss is when each period is repeated times and the capacities are scaled by . Subsequently, Reiman and Wang (2008) show that by re-solving the linear programming upper bound once, one could obtain an upper bound on the revenue loss. A follow-up work (Jasin and Kumar, 2012) shows that under a so-called “non-degeneracy” assumption, a policy which re-solves the linear programming upper bound at each time period would lead to an revenue loss, which is independent of the scaling factor . The relationship between the performance of the control policies and the number of times the linear programming upper bound is re-solved is further discussed in their later paper (Jasin and Kumar, 2013). Recently, Bumpensanti and Wang (2020) propose an infrequent re-solving policy and show that their policy obtains an upper bound of the revenue loss even without the “non-degeneracy” assumption. With a different approach, Vera and Banerjee (2020) prove the same upper bound for the NRM problem and their approach is generalized from their previous work (Vera et al., 2019) for other online decision making problems, including online stochastic knapsack, online probing, and dynamic pricing. Note that all the approaches mentioned above are mainly developed for the stochastic/stationary setting. When the arrival process of customers is non-stationary over time, Adelman (2007) develops a strong heuristic based on a novel approximate dynamic programming (DP) approach. This approach is further investigated under various settings in the literature (for example, Zhang and Adelman, 2009; Kunnumkal and Talluri, 2016). Remarkably, although the approximate DP heuristic is one of the strongest heuristics in practice, it does not come with a theoretical bound. Finally, by using non-linear basis functions to approximate the value of the DP, Ma et al. (2020) develop a novel approximate DP policy and derive a constant competitive ratio for their policy, which depends on the problem parameters.

## 2 Problem Formulation

Consider the following convex optimization problem

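$$
\begin{aligned}
\max\ & \sum_{t=1}^{T} f_t(x_t) \\
\text{s.t.}\ & \sum_{t=1}^{T} g_{it}(x_t) \le c_i, \quad i = 1, \dots, m, \\
& x_t \in X, \quad t = 1, \dots, T,
\end{aligned}
\tag{CP}
$$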

where the decision variables are for . Here is a compact convex set in . The function ’s are functions in the space of concave continuous functions and ’s are functions in the space of convex continuous functions, both of which are supported on

We define the vector-valued function . Throughout the paper, we use to index the constraint and (or sometimes ) to index the decision variables, and we use bold symbols to denote vectors/matrices and normal symbols for scalars.

In this paper, we study the online stochastic optimization problem where the functions in (CP) are revealed in an online fashion and one needs to determine the value of decision variables sequentially. Specifically, at each time the functions are revealed, and we need to decide the value of instantly. Different from the offline setting, at time , we do not have the information of the future part of the optimization problem. Given the history , the decision of can be expressed as a policy function of the history and the observation at the current time period. That is,

$$
x_t = \pi_t(f_t, g_t, H_{t-1}). \tag{1}
$$

The policy function can be time-dependent and we denote policy The decision variable must conform to the constraints in (CP) throughout the procedure, and the objective is aligned with the maximization objective for the offline problem (CP).
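The sequential protocol behind (1) can be sketched in Python as follows. This is only an illustration under assumptions that are not part of the paper: the functions are linear, $f(x;\theta) = r x$ and $g_i(x;\theta) = a_i x$ with $\theta = (r, a)$, the action set is $X = [0, 1]$, and an action that would overdraw a budget is scaled down. The names `run_online` and `accept_all` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Theta = Tuple[float, Tuple[float, ...]]   # theta_t = (reward coeff, cost coeffs)
History = List[Tuple[Theta, float]]       # H_{t-1}: past (theta_s, x_s) pairs
Policy = Callable[[Theta, History], float]


def run_online(policy: Policy, thetas: Sequence[Theta],
               budgets: Sequence[float]) -> Tuple[float, List[float]]:
    """Reveal theta_t, ask the policy for x_t, and enforce the budgets."""
    remaining = list(budgets)
    history: History = []
    actions: List[float] = []
    reward = 0.0
    for theta in thetas:
        r, a = theta
        # The policy sees only the current observation and the history.
        x = min(1.0, max(0.0, policy(theta, history)))
        # Assumption: scale the action down if it would overdraw a budget.
        for a_i, rem in zip(a, remaining):
            if a_i * x > rem:
                x = max(0.0, min(x, rem / a_i))
        reward += r * x
        remaining = [max(0.0, rem - a_i * x) for a_i, rem in zip(a, remaining)]
        history.append((theta, x))
        actions.append(x)
    return reward, actions


def accept_all(theta: Theta, history: History) -> float:
    """A naive policy that always requests the largest action."""
    return 1.0


thetas = [(1.0, (1.0,)), (2.0, (1.0,)), (3.0, (1.0,))]
total_reward, actions = run_online(accept_all, thetas, budgets=[2.0])
```

The naive policy spends the whole budget on the first two arrivals and has nothing left for the most valuable third one, which is the kind of loss a good online policy has to control.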

### 2.1 Parameterized Form, Probability Space, and Assumptions

Consider a parametric form of the underlying problem (CP) where the functions are parameterized by a parameter . Specifically,

$$
f_t(x_t) := f(x_t; \theta_t), \qquad g_{it}(x_t) := g_i(x_t; \theta_t)
$$

for each and . The function is concave in its first argument, while the function is convex in its first argument. We define the vector-value function . Then the problem (CP) can be rewritten as the following parameterized convex program

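$$
\begin{aligned}
\max\ & \sum_{t=1}^{T} f(x_t, \theta_t) \\
\text{s.t.}\ & \sum_{t=1}^{T} g_i(x_t, \theta_t) \le c_i, \quad i = 1, \dots, m, \\
& x_t \in X, \quad t = 1, \dots, T,
\end{aligned}
\tag{PCP}
$$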

where the decision variables are We note that this parametric form (PCP) is introduced mainly for presentation purpose, since it avoids the complication of defining probability measure in function space, and also it does not change the nature of the problem. We assume the knowledge of and a priori. Here and hereafter, we will use (PCP) as the underlying form of the online stochastic optimization problem.

The problem of online stochastic optimization, as its name refers, involves stochasticity on the functions for the underlying optimization problem. The parametric form (PCP) reduces the randomness from the function to the parameters ’s, and therefore the probability measure can be defined in the parameter space of . First, we consider the following distance function between two parameters ,

$$
\rho(\theta, \theta') := \sup_{x \in X} \left\| \big(f(x, \theta), g(x, \theta)\big) - \big(f(x, \theta'), g(x, \theta')\big) \right\|_{\infty} \tag{2}
$$

where is the L norm in Without loss of generality, let be a set of class representatives, that is, for any , In this way, the parameter space can be viewed as a metric space equipped with metric Also, note that we define the metric based on the vector-valued function , instead of a metric in the parameter space (or ). This is because the main focus is on the effect of different parameter on the function value rather than the original Euclidean difference in the parameter space. Let be the smallest -algebra in that contains all open subsets (under metric ) of We denote the distribution of as and can thus be viewed as a probability measure on
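As an illustration of the metric in (2), the Python sketch below evaluates it for linear functions, assuming (this is not from the paper) $f(x;\theta) = r x$, $g_i(x;\theta) = a_i x$, $\theta = (r, a)$ and $X = [0, 1]$. The supremum is approximated on a grid; the function name `rho_linear` and the grid size are hypothetical choices.

```python
from typing import Tuple

Theta = Tuple[float, Tuple[float, ...]]  # (reward coeff r, cost coeffs a)


def rho_linear(theta: Theta, theta_prime: Theta, grid_size: int = 101) -> float:
    """Grid approximation of sup over x in [0, 1] of the sup-norm gap
    between (f, g) evaluated at theta and at theta_prime."""
    r, a = theta
    r2, a2 = theta_prime
    best = 0.0
    for k in range(grid_size):
        x = k / (grid_size - 1)
        gaps = [abs(r * x - r2 * x)]
        gaps += [abs(a_i * x - b_i * x) for a_i, b_i in zip(a, a2)]
        best = max(best, max(gaps))
    return best


theta = (1.0, (0.5, 0.2))
theta_prime = (0.8, (0.5, 0.6))
distance = rho_linear(theta, theta_prime)
```

In this linear case the gap grows with $x$, so the supremum is attained at $x = 1$ and equals the largest coefficient difference. Two parameters that induce identical functions on $X$ have distance zero, which is why the text works with class representatives.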

Throughout the paper, we make the following assumptions. Assumption 1 (a) and (b) impose boundedness on function and ’s. Assumption 1 (c) states that the ratio between and is uniformly bounded by for all and . Intuitively, it says that for each unit consumption of resource, the maximum amount of revenue earned is upper bounded by . In Assumption 1 (d), we assume ’s are independent of each other but we do not assume the exact knowledge of them. However, there can be dependence between the components of the vector-valued functions.

###### Assumption 1 (Boundedness and Independence)

We assume

• for all .

• for all and In particular, for all

• There exists a positive constant such that for any and each , we have that holds for any as long as .

• and ’s are independent of each other. We do not assume the knowledge of ’s.

In the following, we illustrate the online formulation through two application contexts: online linear programming and online network revenue management. We choose the more general convex formulation (PCP) with the aim of uncovering the key mathematical structures for this online optimization problem, but we will occasionally return to these two examples to generate intuitions throughout the paper.

### 2.2 Examples

Online linear programming (LP): The online LP problem (Molinaro and Ravi, 2013; Agrawal et al., 2014; Gupta and Molinaro, 2014) can be viewed as an example of the online stochastic optimization formulation of (CP). Specifically, the decision variable , the functions and are linear functions, and the parameter where . Specifically, and At each time , the coefficient in the objective together with the corresponding column in the constraint matrix is revealed and one needs to determine the value of immediately.

Price-based network revenue management (NRM): In the price-based NRM problem (Gallego and Van Ryzin, 1994), a firm is selling a given stock of products over a finite time horizon by posting a price at each time. The demand is price-sensitive and the firm’s objective is to maximize the total collected revenue. This problem can be cast in the formulation (PCP). Specifically, the parameter refers to the type of the -th arriving customer, and the decision variable represents the price posted by the decision maker at time . Accordingly, denotes the resource consumption under the price and denotes the collected revenue.

Choice-based network revenue management: In the choice-based NRM problem (Talluri and Van Ryzin, 2004), the seller offers an assortment of the products to the customer arriving in each time period and the customer chooses a product from the assortment to purchase according to a given choice model. The formulation (PCP) can model the choice-based NRM problem as a special case by assuming that given each and , and are all random variables. Specifically, for each , refers to the assortment offered at time and denotes the customer type. Then denotes the revenue collected by offering assortment , and denotes the corresponding resource consumption, where and are both stochastic and their distribution follows the choice model of the customer with type . Note that although in the following sections we only analyze the case where for each and , and are deterministic, our analysis and results can be generalized directly to the case where and are random and follow known distributions.

### 2.3 Performance Measure

We denote the offline optimal solution of optimization problem (CP) as , and the offline (online) objective value as (). Specifically,

$$
\begin{aligned}
R_T^{*} &:= \sum_{t=1}^{T} f_t(x_t^{*}), \\
R_T(\pi) &:= \sum_{t=1}^{T} f_t(x_t),
\end{aligned}
$$

where the online objective value depends on the policy . In line with the general online learning/optimization literature, we focus on minimizing the gap between the online and offline objective values. Specifically, the optimality gap is defined as follows:

$$
\text{Reg}_T(H, \pi) := R_T^{*} - R_T(\pi)
$$

where the problem profile encapsulates a random realization of the parameters, i.e., Note that , , and are all dependent on the problem profile as well, but we omit in these terms for notation simplicity without any ambiguity. We define the performance measure of the online stochastic optimization problem formally as regret

$$
\text{Reg}_T(\pi) := \max_{P \in \Xi} \ \mathbb{E}_{H \sim P}\left[\text{Reg}_T(H, \pi)\right] \tag{3}
$$

where denotes the probability measure of all time periods and the expectation is taken with respect to the parameter ; compactly, the problem profile . We consider the worst-case regret over all the distributions in a certain set , where the set will be specified in later sections.
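The sample-path optimality gap $\text{Reg}_T(H, \pi)$ can be computed by hand in a small special case. The Python sketch below assumes (beyond what the paper states) a single budget, linear functions $f(x;\theta) = r x$ and $g(x;\theta) = a x$ with $a > 0$, and $X = [0, 1]$, so that the offline problem is a fractional knapsack that a ratio sort solves exactly. The greedy online policy and all function names are hypothetical; the worst-case regret in (3) would additionally take an expectation over profiles and a maximum over the uncertainty set.

```python
from typing import Sequence, Tuple

Theta = Tuple[float, float]  # (reward coeff r_t, cost coeff a_t > 0)


def offline_value(thetas: Sequence[Theta], budget: float) -> float:
    """Exact offline optimum for one budget and x_t in [0, 1]:
    fill the budget in decreasing order of reward-to-cost ratio."""
    value, remaining = 0.0, budget
    for r, a in sorted(thetas, key=lambda th: th[0] / th[1], reverse=True):
        if r <= 0 or remaining <= 0:
            break
        x = min(1.0, remaining / a)
        value += r * x
        remaining -= a * x
    return value


def greedy_online_value(thetas: Sequence[Theta], budget: float) -> float:
    """Hypothetical online policy: accept every arrival that still fits."""
    value, remaining = 0.0, budget
    for r, a in thetas:
        if r > 0 and a <= remaining:
            value += r
            remaining -= a
    return value


def sample_path_regret(thetas: Sequence[Theta], budget: float) -> float:
    """Reg_T(H, pi) = R_T^* - R_T(pi) on one realized problem profile."""
    return offline_value(thetas, budget) - greedy_online_value(thetas, budget)


profile = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
regret = sample_path_regret(profile, budget=2.0)
```

Here the offline solution keeps the two most valuable arrivals, while the greedy policy spends the budget on the first two, so the gap is the difference between the best and the worst reward in the profile.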

We conclude this section with a few comments on our formulation of the online stochastic optimization problem. Generally speaking, the problem of online learning/optimization with constraints falls into two categories: (i) first-observe-then-decide and (ii) first-decide-then-observe. Our formulation belongs to the first category in that at each time , the decision maker first observes the parameter and hence functions , and then determines the value of . In many application contexts of operations research and operations management, the observations represent customers/orders arriving sequentially to the system, and the decision variables capture the corresponding acceptance/rejection/pricing decisions. The problems discussed earlier, such as matching, resource allocation, and network revenue management, all fall into this category. For the second category, the representative problems are bandits with knapsacks (Badanidiyuru et al., 2013) and online convex optimization (Hazan, 2016), where the decision is made first and the observation arrives after the decision. For example, in the classic bandits problem, the decision of which arm to play will affect the observation, and in online convex optimization (or more generally the two-player game setting (Cesa-Bianchi and Lugosi, 2006)), “nature” may even choose the function adversarially in response to our decision. There is a line of literature on online convex optimization with constraints, namely, the OCOwC problem (Mahdavi et al., 2012; Yu et al., 2017; Yuan and Lamperski, 2018). While the same underlying optimization problem (CP) is used in our formulation and the OCOwC problem, a key distinction is whether the decision or the observation comes first. Our formulation allows the decision maker to observe the functions before making the decision, and it thus enables us to adopt a stronger benchmark (as the definition of ), that is, a dynamic oracle which permits different values over different time periods. In contrast, the OCOwC problem requires the decision to be made before the functions are observed, and thus it considers a weaker benchmark which requires the decision variables to take the same value over different time periods.

We have not yet discussed much about the conditions on the distributions except for independence. Importantly, this is one of the main themes of our paper. The canonical setting of the online stochastic learning problem refers to the case when all the distributions are the same across time periods. At the other extreme, the adversarial setting of the online learning problem refers to the case when ’s are adversarially chosen. Our work aims to bridge these two ends of the spectrum with a novel notion of non-stationarity, and we aim to relate the regret of the problem to structural properties of . In the same spirit, the work on non-stationary stochastic optimization (Besbes et al., 2015) proposes an elegant notion of non-stationarity called variation budget. Subsequent works consider similar notions in the settings of bandits (Besbes et al., 2014; Russac et al., 2019) and reinforcement learning (Cheung et al., 2019). To the best of our knowledge, all the previous works along this line consider the unconstrained setting, and thus our work contributes to this line of work by illustrating how the constraints interact with the non-stationarity. We will return to this point later in the paper.

## 3 Algorithm and Motivation

### 3.1 Benchmark Upper Bounds and Main Algorithm

In this section, we motivate and present the prototype of the main algorithm. To begin with, we first establish two useful upper bounds for the expected optimal reward . The derivation of the first upper bound is standard in online decision making problems and it is also known as the deterministic upper bound or the prophet benchmark (for example, see Jasin and Kumar (2012)). The motivation for such an upper bound is that the offline optimum obtained by solving (PCP) often has a complex structure, and thus is very hard to analyze. By comparison, the proposed upper bound is more tractable and provides a good starting point for algorithm design and analysis. For a function and a probability measure in the parameter space we introduce the following notation

$$
P u(x(\theta)) := \int_{\theta \in \Theta} u(x(\theta); \theta) \, dP(\theta)
$$

where is a measurable function. Thus can be viewed as a deterministic functional that maps function to a real value and it is obtained by taking expectation with respect to the parameter .

Consider the following optimization problem

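$$
\begin{aligned}
R_T^{UB} = \max\ & \sum_{t=1}^{T} P_t f(x_t(\theta)) \\
\text{s.t.}\ & \sum_{t=1}^{T} P_t g_i(x_t(\theta)) \le c_i, \quad i = 1, \dots, m, \\
& x_t(\theta): \Theta \to X \text{ is a measurable function for } t = 1, \dots, T.
\end{aligned}
\tag{4}
$$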

The optimization problem (4) can be viewed as a deterministic relaxation of (PCP) where the objective and constraints are all replaced with their expected counterparts, and the constraints are only required to be satisfied in expectation. In the following, Lemma 1 shows the optimal objective value is an upper bound for . Thus it formally establishes as a surrogate benchmark for when analyzing the regret.

###### Lemma 1

It holds that $\mathbb{E}[R_T^{*}] \le R_T^{UB}$.

Now we seek a second upper bound by considering the Lagrangian function of (4),

$$
L(p, x_{1:T}(\theta)) = \sum_{i=1}^{m} c_i p_i + \sum_{t=1}^{T} P_t \left( f(x_t(\theta)) - \sum_{i=1}^{m} p_i \cdot g_i(x_t(\theta)) \right) \tag{5}
$$

where encapsulates all the primal decision variables. The primal variables are expressed in a function form because for each different value of , we allow a different choice of the primal variables. At time , the parameter follows the distribution . The vector conveys a meaning of dual price for each type of resource where is the multiplier/dual variable associated with the -th constraint. It follows from weak duality that

$$
R_T^{\mathrm{UB}} \le \min_{p \ge 0} \max_{x_{1:T}(\theta)} L(p, x_{1:T}(\theta))
\tag{6}
$$

where the maximum is taken over all measurable functions that map $\Theta$ to $X$. In fact, the inner maximization with respect to $x_{1:T}(\theta)$ can be carried out point-wise by defining the following function for each $\theta \in \Theta$:

$$
h(p; \theta) \coloneqq \max_{x \in X} \left\{ f(x; \theta) - \sum_{i=1}^{m} p_i \cdot g_i(x; \theta) \right\}
$$

where $h(p;\theta)$ is a function of the dual variable $p$ that is also parameterized by $\theta$. This echoes the “first-observe-then-decide” setting where, at each time $t$, the decision maker first observes the parameter $\theta_t$ and then decides the value of $x_t$. Moreover, let

$$
L(p) \coloneqq c^\top p + \sum_{t=1}^{T} P_t h(p; \theta)
\tag{7}
$$

and it holds that . Thus,

$$
R_T^{\mathrm{UB}} \le \min_{p \ge 0} L(p),
$$

where the right-hand side serves as the second upper bound of the problem. The above discussion is summarized in Proposition 1. The advantage of the function $L(p)$ is that it involves only the dual variable $p$, which is not time-dependent.
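As a concrete illustration of $h(p;\theta)$ and the dual function (7), the following minimal sketch considers a hypothetical special case that is not taken from the text above: a single-resource online linear program with $\theta = (r, a)$, $X = [0, 1]$, $f(x;\theta) = r x$ and $g(x;\theta) = a x$. In that case the inner maximization has the closed form $h(p;\theta) = \max\{r - pa, 0\}$. The function names, the finitely supported distributions, and the grid search are all illustrative assumptions.

```python
def h(p, theta):
    """Closed form of max_{x in [0,1]} (r*x - p*a*x) for theta = (r, a)."""
    r, a = theta
    return max(r - p * a, 0.0)

def dual_objective(p, c, dists):
    """L(p) = c*p + sum_t E_{theta ~ P_t}[h(p; theta)].

    `dists` has one entry per period; each entry is a list of
    (probability, theta) pairs describing a finitely supported P_t.
    """
    return c * p + sum(prob * h(p, theta) for dist in dists for prob, theta in dist)

# Four periods, one resource with budget c = 2; each P_t is a point mass.
dists = [[(1.0, (1.0, 1.0))]] * 2 + [[(1.0, (1.5, 1.0))]] * 2
c = 2.0
grid = [i / 100 for i in range(301)]          # candidate dual prices in [0, 3]
best_p = min(grid, key=lambda p: dual_objective(p, c, dists))
upper_bound = dual_objective(best_p, c, dists)
```

In this toy instance the dual function is flat on $[1, 1.5]$, and its minimum value coincides with the offline optimum obtained by accepting the two orders with reward 1.5.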

###### Proposition 1

It holds that

$$
\min_{p \ge 0} \max_{x \in X} L(p, x) = \min_{p \ge 0} L(p)
\tag{8}
$$

Consequently, we have the following upper bound on $\mathbb{E}[R_T^*]$,

$$
\mathbb{E}[R_T^*] \le T \cdot \min_{p \ge 0} L(p).
\tag{9}
$$

Algorithm 1 describes a simple primal-dual gradient descent algorithm for solving the online stochastic optimization problem. Essentially, it performs online/stochastic gradient descent for minimizing $L(p)$. To see this, note that the expected dual gradient update (12) is in fact the gradient with respect to the $t$-th component of the function $L(p)$:

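$$
\begin{aligned}
\mathbb{E}\left[ g(\tilde{x}_t; \theta_t) - \frac{c}{T} \right] &= -\frac{c}{T} + P_t g(\tilde{x}_t; \theta) \\
&= -\frac{\partial}{\partial p} \left( \frac{1}{T} c^\top p + P_t h(p; \theta) \right).
\end{aligned}
$$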

The first line comes from taking the expectation with respect to $\theta_t$, and the second line comes from the definition of $\tilde{x}_t$ in Algorithm 1. Also, the right-hand side of the second line is the negative gradient of the $t$-th term in $L(p)$ (after absorbing $c^\top p$ into the summation in $L(p)$). In the algorithm, the value of the primal decision variable $x_t$ is then decided based on the value of $p_t$ and the observation $\theta_t$, as in the definition of the function $h$. Throughout the paper, we assume that the optimization problem defining $h$ can be solved efficiently. This implicit assumption is further discussed in Section A4.
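The primal and dual steps described above can be sketched as follows. Algorithm 1 and its update (12) are not reproduced in this section, so this is only a minimal sketch under several assumptions: a single-resource online linear program with $\theta_t = (r_t, a_t)$ and $X = [0,1]$, a projected step $p_{t+1} = \max\{p_t + \eta\,(g(\tilde{x}_t;\theta_t) - c/T), 0\}$, a step size $\eta = 1/\sqrt{T}$, and a rule that plays the action only when the remaining budget allows it. The function and variable names are hypothetical.

```python
import math
import random

def primal_dual_gradient(thetas, c, step=None):
    """Sketch of a dual-price-driven policy for a single-resource online LP.

    thetas: list of (reward r_t, consumption a_t), observed one at a time.
    c: total budget. Returns (collected reward, final dual price).
    """
    T = len(thetas)
    eta = step if step is not None else 1.0 / math.sqrt(T)  # assumed step size
    p, remaining, reward = 0.0, c, 0.0
    for r, a in thetas:
        # Primal step: maximize r*x - p*a*x over x in [0, 1].
        x_tilde = 1.0 if r > p * a else 0.0
        # Play the action only if the remaining budget allows it (assumption).
        x = x_tilde if a * x_tilde <= remaining else 0.0
        remaining -= a * x
        reward += r * x
        # Dual step: move along g(x_tilde; theta_t) - c/T, then project onto p >= 0.
        p = max(p + eta * (a * x_tilde - c / T), 0.0)
    return reward, p

random.seed(0)
T, c = 400, 100.0
orders = [(random.random(), 1.0) for _ in range(T)]      # i.i.d. rewards, unit consumption
reward, final_price = primal_dual_gradient(orders, c)
hindsight = sum(sorted(r for r, _ in orders)[-int(c):])  # best 100 orders in hindsight
```

Comparing `reward` with `hindsight` gives an empirical view of the regret in this stationary toy instance; the dual price rises whenever consumption exceeds the per-period share $c/T$ and falls otherwise.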

###### Proposition 2

Under Assumption 1, if we consider the set , then the regret of Algorithm 1 has the following upper bound

$$
\mathrm{Reg}_T(\pi_1) \le O(\sqrt{T})
$$

where $\pi_1$ stands for the policy specified by Algorithm 1.

Proposition 2 states that the regret of Algorithm 1 is $O(\sqrt{T})$ in a stationary (i.i.d.) setting where the distribution remains the same over time. We present this result mainly for benchmarking purposes, to help interpret the results in later sections. In fact, Algorithm 1 and Proposition 2 follow directly from several recent results on gradient-based algorithms for different online stochastic optimization problems. Lu et al. (2020) propose and analyze a dual mirror descent algorithm for the online resource allocation problem under the stationary (i.i.d.) setting. Li et al. (2020) analyze a special case of Algorithm 1 for the online linear programming problem under both the stationary (i.i.d.) setting and the random permutation setting. While both works achieve an regret under the setting where the underlying distribution is unknown, a recent work (Sun et al., 2020) considers the network revenue management problem and achieves an regret by exploiting the knowledge of the underlying distribution and the structure of the problem. Our paper generalizes the formulations in these three papers (Lu et al., 2020; Li et al., 2020; Sun et al., 2020). The contribution of this section lies mainly in illustrating the idea through the general formulation; the derivation of Algorithm 1 and the proof of Proposition 2 are not novel and follow a roadmap similar to the analyses therein.

## 4 Non-stationary Environment: Wasserstein-Based Distance and Analysis

In this section, we present the definition of Wasserstein-based non-stationarity and an analytical result on the performance of Algorithm 1 in a non-stationary environment. The aim of such a non-stationarity measure is to relate the best achievable algorithmic performance to the intensity of the non-stationarity of the environment (the distributions $P_t$). We will show that our notion of non-stationarity is necessitated by the presence of constraints and thus differs from the prevalent notion of variation budget in the unconstrained setting for online learning problems.

### 4.1 Wasserstein-Based Non-stationarity

The Wasserstein distance, also known as the Kantorovich–Rubinstein metric or optimal transport distance (Villani, 2008; Galichon, 2018), is a distance function defined between probability distributions on a metric space. The notion has a long history dating back a century and has gained increasing popularity in recent years, with a wide range of applications including generative modeling (Arjovsky et al., 2017), robust optimization (Esfahani and Kuhn, 2018), and statistical estimation (Blanchet et al., 2019). In our context, the Wasserstein distance between two probability measures $Q_1$ and $Q_2$ on the metric parameter space is defined as follows,

$$
W(Q_1, Q_2) \coloneqq \inf_{Q_{1,2} \in J(Q_1, Q_2)} \int \rho(\theta_1, \theta_2)\, dQ_{1,2}(\theta_1, \theta_2)
\tag{13}
$$

where $J(Q_1, Q_2)$ denotes all the joint distributions $Q_{1,2}$ for $(\theta_1, \theta_2)$ that have marginals $Q_1$ and $Q_2$. The distance function $\rho(\cdot, \cdot)$ is defined earlier in (2).
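To make definition (13) concrete, the sketch below computes the distance for two finitely supported distributions in a simplified setting. The distance $\rho$ from (2) is not reproduced in this section, so the sketch assumes a scalar parameter with $\rho(\theta_1, \theta_2) = |\theta_1 - \theta_2|$; under that assumption the optimal transport value equals the area between the two cumulative distribution functions. The function name and the example distributions are hypothetical.

```python
def wasserstein_1d(dist1, dist2):
    """W_1 between two finitely supported distributions on the real line.

    Each distribution is a list of (probability, value) pairs. For the cost
    |theta1 - theta2| the optimal-transport value equals the integral of
    |F1(x) - F2(x)| over x, where F1 and F2 are the two CDFs.
    """
    points = sorted({v for _, v in dist1} | {v for _, v in dist2})
    total, cdf1, cdf2 = 0.0, 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf1 += sum(w for w, v in dist1 if v == left)
        cdf2 += sum(w for w, v in dist2 if v == left)
        total += abs(cdf1 - cdf2) * (right - left)
    return total

q1 = [(1.0, 1.0)]                    # point mass at 1
q2 = [(0.5, 1.0), (0.5, 1.2)]        # half of the mass moved to 1.2
distance = wasserstein_1d(q1, q2)
```

Here half of the mass has to travel a distance of 0.2, so the computed value is approximately 0.1; moving the second atom closer to 1 shrinks the distance proportionally.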

Now, we define the Wasserstein-based non-stationarity budget (WBNB) as

$$
W_T(P) \coloneqq \sum_{t=1}^{T} W(P_t, \bar{P}_T)
\tag{14}
$$

where $P = (P_1, \dots, P_T)$ and $\bar{P}_T$ is defined to be the uniform mixture distribution of $\{P_t\}_{t=1}^{T}$, i.e.,

$$
\bar{P}_T \coloneqq \frac{1}{T} \sum_{t=1}^{T} P_t.
$$

The WBNB measures the total deviation of the $P_t$'s from the “centric” distribution $\bar{P}_T$. Next, we illustrate the difference between the WBNB and the prevalent notion of variation budget (Besbes et al., 2014, 2015; Cheung et al., 2019). Specifically, Besbes et al. (2015) define the variation budget for the stochastic optimization problem in a non-stationary setting, which can be viewed as an unconstrained version of our online stochastic optimization problem (there is no function $g_i$ in (PCP)). Thus, the variation budget can be defined as follows (in the language of our paper),

$$
V_T \coloneqq \sum_{t=1}^{T-1} \mathrm{TV}(P_t, P_{t+1})
$$

where $\mathrm{TV}(\cdot, \cdot)$ denotes the total variation distance between two distributions. If we temporarily set aside the different distance functions used (total variation versus Wasserstein), the variation budget measures the total amount of change throughout the evolution of the environment, and it concerns the local change between two consecutive distributions $P_t$ and $P_{t+1}$. In comparison, the WBNB is more of a “global” property that measures the distance between all the $P_t$'s and the centric distribution $\bar{P}_T$. This global property is in fact necessitated by the shift from an unconstrained setting to a constrained setting, as illustrated by the following example adapted from Golrezaei et al. (2014). Consider the following two linear programs as the underlying problem (PCP) for the online stochastic optimization problem,

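$$
\begin{aligned}
\max\ & x_1 + \dots + x_c + (1+\kappa) x_{c+1} + \dots + (1+\kappa) x_T \\
\text{s.t.}\ & x_1 + \dots + x_c + x_{c+1} + \dots + x_T \le c, \\
& 0 \le x_t \le 1 \quad \text{for } t = 1, \dots, T.
\end{aligned}
\tag{15}
$$

$$
\begin{aligned}
\max\ & x_1 + \dots + x_c + (1-\kappa) x_{c+1} + \dots + (1-\kappa) x_T \\
\text{s.t.}\ & x_1 + \dots + x_c + x_{c+1} + \dots + x_T \le c, \\
& 0 \le x_t \le 1 \quad \text{for } t = 1, \dots, T.
\end{aligned}
\tag{16}
$$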

where $\kappa > 0$ and $c = T/2$, and here we assume $T$ is an even number. In the first scenario (15), the optimal solution is to wait and accept the second half of the orders, while in the second scenario (16), the optimal solution is to accept the first half of the orders and deplete the resource at the halfway point. In both scenarios, the structural difference between the first half and the second half of the orders can be captured by the non-stationarity with which the orders are generated. The contrast between the two scenarios (whether the first half or the second half is more profitable) creates difficulty for online decision making. Without knowledge of the future orders, there is no way to obtain a sublinear regret in both scenarios, i.e., we will inevitably incur a loss that is a fixed proportion of the optimal value in at least one of the two scenarios. If we exhaust too much of the resource in the first half of the horizon, then in the first scenario (15) we do not have enough capacity to accept all the relatively profitable orders in the second half. Conversely, if too much of the resource remains at the halfway point, then in the second scenario (16) the orders that we missed in the first half are irrevocable. The intuition is summarized in Proposition 3.

Golrezaei et al. (2014) use the example to illustrate the importance of balancing resource usage in an online context; here we revisit it from a non-stationarity perspective. For these two examples, we can let the distribution $P_t$ in the general formulation (PCP) be a point mass distribution. Then there is only one change point throughout the whole procedure, so the variation budget for these two examples is $O(1)$, while the WBNB for these two examples grows linearly in $T$. For the purpose of using a non-stationarity measure to characterize the problem difficulty, the WBNB is more suitable, because the variation budget is $O(1)$ but a sublinear regret is still unachievable.

Intuitively, the presence of the constraint(s) limits our ability to rectify the decision in a non-stationary environment: for example, in (15), even if we learn that the second half of the orders are more profitable, we may not be able to accept them because of the shortage of the resource. Thus the global (and indeed more restrictive) notion of non-stationarity, the WBNB, is necessary in the online stochastic optimization problem in the presence of constraints. We will see in the rest of the paper that while the variation budget captures only the learnability of the non-stationary environment, the WBNB aims to characterize whether the non-stationary environment is learnable within the limits imposed by the resource constraints.
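The contrast between the two budgets in this example can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in the text: the parameter is taken to be the scalar reward coefficient, $\rho(\theta_1, \theta_2) = |\theta_1 - \theta_2|$, and the total variation distance is normalized so that two distinct point masses are at distance 1. All function names are hypothetical.

```python
def w1_from_point_mass(a, mixture):
    """W_1 between a point mass at `a` and a finitely supported distribution.

    The only coupling with a point-mass marginal is the product coupling,
    so the transport cost is the expected distance to `a`.
    """
    return sum(w * abs(a - v) for w, v in mixture)

def total_variation(a, b):
    """TV distance between point masses at a and b (0 if equal, else 1)."""
    return 0.0 if a == b else 1.0

def budgets(T, kappa):
    """Return (variation budget, WBNB) for the sequence behind (15)."""
    half = T // 2
    atoms = [1.0] * half + [1.0 + kappa] * (T - half)   # P_t is a point mass at atoms[t]
    centre = [(1.0 / T, a) for a in atoms]              # uniform mixture of the P_t
    v_T = sum(total_variation(atoms[t], atoms[t + 1]) for t in range(T - 1))
    wbnb = sum(w1_from_point_mass(a, centre) for a in atoms)
    return v_T, wbnb

v_small, w_small = budgets(T=100, kappa=0.1)
v_large, w_large = budgets(T=1000, kappa=0.1)
```

Under these assumptions the variation budget equals 1 for both horizons, whereas the WBNB equals $\kappa T / 2$ and therefore grows tenfold when $T$ does.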

###### Proposition 3

The worst-case regret of constrained online stochastic optimization in adversarial setting is .

### 4.2 Lower Bound: Why Wasserstein Distance

Based on the notion of WBNB, we define a set of distributions

 (17)

###### Theorem 1

Under Assumption 1, if we consider the set as in (17), there is no algorithm that can achieve a regret better than .

Theorem 1 states that the lower bound of the best achievable regret is . The part is due to Lemma 1 in (Arlotto and Gurvich, 2019). The part can be established from (15) and (16): for these two examples, we can view the coefficients in the objective function and the constraint as a point mass distribution . With , we can verify that both examples belong to the set Then we can follow a similar argument as Proposition 3 to show that any algorithm will incur at least optimality gap in one of the two scenarios.

The way the lower bound is established via (15) and (16) also explains why the Wasserstein distance, rather than the total variation distance or the KL-divergence, is used for our non-stationarity measure. If we revisit the examples (15) and (16), a smaller value of $\kappa$ should indicate a smaller variation/non-stationarity between the first half and the second half of the observations in both examples. However, the total variation distance fails to capture this: for any non-zero value of $\kappa$, the total variation distance between a first-half distribution and a second-half distribution is always 1 (since the two point masses have different supports). In other words, if we replace the Wasserstein distance with the total variation distance in our definition of WBNB (14), then the resulting quantity will be the same for all non-zero values of $\kappa$.

The KL-divergence may even be ill-defined when the two distributions have different supports. In this light, the Wasserstein distance is a smoother representation of the distance between two distributions than the total variation distance or the KL-divergence. Interestingly, this coincides with the intuition in the literature on generative adversarial networks (GANs), where Arjovsky et al. (2017) replace the KL-divergence with the Wasserstein distance in training GANs.

Simultaneously and independently, Balseiro et al. (2020) analyze the dual mirror descent algorithm under a setting similar to that of our results in this section. The key difference is that Balseiro et al. (2020) consider the total variation distance, which inherits the definition of the variation functional from Besbes et al. (2015). As argued above, the Wasserstein distance is smoother in measuring the difference between distributions, and thus it provides sharper regret upper bounds, as we will see in the rest of the paper.
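The comparison of the three distances can be made explicit with a small worked instance. It assumes, for illustration only, a scalar parameter with $\rho(\theta_1, \theta_2) = |\theta_1 - \theta_2|$, and writes $\delta_a$ for the point mass at $a$. For any $\kappa \neq 0$,

$$
\mathrm{TV}(\delta_1, \delta_{1+\kappa}) = 1, \qquad W(\delta_1, \delta_{1+\kappa}) = |\kappa|,
$$

while the KL-divergence between $\delta_1$ and $\delta_{1+\kappa}$ is infinite (or undefined, depending on the convention) because neither measure is absolutely continuous with respect to the other. Of the three, only the Wasserstein distance tends to zero as $\kappa \to 0$.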

### 4.3 Algorithm Analysis and Regret Upper Bound

Now we connect Algorithm 1 with our notion of WBNB and establish a regret upper bound for the algorithm under WBNB. For a probability measure $Q$ over the metric parameter space $\Theta$, we define

$$
L_Q(p) \coloneqq \frac{1}{T} c^\top p + Q h(p; \theta).
$$

Then the dual function can be expressed as

$$
L(p) = \sum_{t=1}^{T} L_{P_t}(p).
$$

Recall that at each time $t$, Algorithm 1 utilizes a stochastic gradient of the $t$-th term in $L(p)$, i.e., the function $L_{P_t}(p)$. Intuitively, when all the $P_t$'s are close to each other, the functions $L_{P_t}(p)$ should be close to each other. Consequently, though the stochastic gradient is taken with respect to a different function at each time $t$, as long as the difference is small, the stochastic gradient descent should be effective in minimizing $L(p)$. This intuition is aligned with the analysis in Besbes et al. (2015), given that $L(p)$ takes an unconstrained form (if we ignore the non-negativity constraint).

Lemma 2 states that the function $L_Q(p)$ has a certain “Lipschitz continuity” with respect to the underlying distribution $Q$. Specifically, the supremum-norm distance between two functions $L_{Q_1}$ and $L_{Q_2}$ is bounded by the Wasserstein distance between the two distributions $Q_1$ and $Q_2$, up to a constant.

###### Lemma 2

For two probability measures $Q_1$ and $Q_2$ over the metric parameter space $\Theta$, we have that

$$
\sup_{p \in \Omega_{\bar{p}}} \left| L_{Q_1}(p) - L_{Q_2}(p) \right| \le \max\{1, \bar{p}\} \cdot (m+1)\, W(Q_1, Q_2)
$$

where and is an arbitrary positive constant.

Note that the Lipschitz constant in Lemma 2 involves an upper bound on the function argument $p$. The following lemma provides such an upper bound for the dual prices $p_t$ in Algorithm 1. Its proof relies largely on part (c) of Assumption 1; conversely, the key role of part (c) of Assumption 1 throughout our analysis is to ensure an upper bound for the dual price.

###### Lemma 3

Under Assumption 1, for each , the dual price vector satisfies , where is specified by (12) in Algorithm 1 and the constant is defined in Assumption 1 (c).

The following theorem builds upon Lemma 2 and Lemma 3 and states that the regret of Algorithm 1 is upper bounded by $O(\max\{\sqrt{T}, W_T\})$. Its proof mimics the standard analysis of online/stochastic gradient descent (Hazan, 2016) and integrates the notion of non-stationarity in a manner similar to Besbes et al. (2015).

###### Theorem 2

Under Assumption 1, if we consider the set as in (17), then the regret of Algorithm 1 has the following upper bound

$$
\mathrm{Reg}_T(\pi_1) \le O(\max\{\sqrt{T}, W_T\})
$$

where $\pi_1$ stands for the policy specified by Algorithm 1.

Remarkably, the terms in $T$ and $W_T$ are additive in the regret upper bound of Algorithm 1. In comparison, the factor in $T$ and the variation budget $V_T$ are usually multiplicative in the regret upper bounds in the line of work that adopts the variation budget as the non-stationarity measure (Besbes et al., 2014, 2015; Cheung et al., 2019). The price of such an advantage for WBNB is that the WBNB is a more restrictive notion than the variation budget; for example, recall that in (15) and (16), the variation budget is $O(1)$, but the WBNB grows linearly in $T$. Another important feature of our result is that Algorithm 1 does not depend on or utilize knowledge of the quantity $W_T$. On the upside, this avoids the assumption of prior knowledge of the variation budget (Besbes et al., 2015). On the downside, there is nothing the algorithm can do even when it knows that $W_T$ is large. Technically, this means that for Algorithm 1 the WBNB contributes nothing to algorithm design and only influences the algorithm analysis. Specifically, it quantifies the extent to which the non-stationary environment degrades the performance of Algorithm 1, measured by the additional regret compared to the regret in the stationary environment (Proposition 2).

Theorem 1 and Theorem 2 seemingly conclude our discussion of the problem by validating the optimality of Algorithm 1: in terms of worst-case performance (regret), no algorithm can do better than the simple gradient-based algorithm. However, we emphasize that this optimality is contingent on the specific choice of the distribution set in (17). In the next section, we present a more general and realistic setting, and develop a new algorithm and further analysis under the WBNB.

## 5 Non-stationary Environment with Prior Estimate: Blend of Gradient Update and Offline Solution

In this section, we generalize our previous notion of WBNB and present a second algorithm in a more general context. The motivation for the generalization is two-fold:

• Availability of future information: In our previous setting, we consider a “blind” setting where no knowledge about the future distributions is assumed. However, the non-stationarity in practical applications may exhibit predictable patterns such as demand seasonality, the day-of-week effect, and demand surge due to pre-scheduled promotion or shopping festivals. Then, the questions are (i) how to revise the definition of non-stationarity for such a predictable environment, and (ii) how to utilize the predictability of future information for better algorithm design.

• Restrictiveness of our previous WBNB: In Section 4.1, we note that the WBNB is a global and more restrictive measure than the classic notion of variation budget. In particular, the examples (15) and (16) show that a single change point in the sequence of distributions may cause the WBNB to scale linearly with $T$. Thus the regret upper bound in Theorem 2 can be quite loose in this type of change-point setting.

### 5.1 Wasserstein-Based Non-stationarity with Prior Estimate

Suppose the decision maker has a prior estimate/prediction $\hat{P}_t$ for each distribution $P_t$, and all the predictions are made available at the very beginning of the procedure. We consider the following Wasserstein-based non-stationarity budget with prior estimate (WBNB-P):

$$
W_T^{\mathrm{P}}(P, \hat{P}) = \sum_{t=1}^{T} W(P_t, \hat{P}_t)
\tag{18}
$$

where $P = (P_1, \dots, P_T)$ denotes the true distributions and $\hat{P} = (\hat{P}_1, \dots, \hat{P}_T)$ denotes the prior estimates. By definition, the new WBNB-P measures the total deviation of the true distributions $P_t$ from their prior estimates $\hat{P}_t$, whereas our previous WBNB (14) considers the deviation from the centric distribution $\bar{P}_T$. We can also view the WBNB-P as a measure of the total estimation error.

Now we present an algorithm that utilizes the prior estimate. The starting point is the same as the derivation of Algorithm 1. Specifically, we define

$$
\hat{L}(p) = c^\top p + \sum_{t=1}^{T} \hat{P}_t h(p; \theta)
\tag{19}
$$

where the true distribution $P_t$ is replaced by its estimate $\hat{P}_t$ for each $t$ in the function $L(p)$ defined in (7). Thus $\hat{L}(p)$ can be viewed as an approximation of the true dual function $L(p)$ based on the prior estimate. Let $\hat{p}^*$ denote one minimizer of $\hat{L}(p)$ over $p \ge 0$,

$$
\hat{p}^* \in \operatorname*{argmin}_{p \ge 0} \hat{L}(p)
\tag{20}
$$

and for each $t = 1, \dots, T$, define

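$$
\gamma_t \coloneqq \hat{P}_t g(\hat{x}(\theta); \theta) \quad \text{where} \quad \hat{x}(\theta) = \operatorname*{argmax}_{x \in X} \left\{ f(x; \theta) - (\hat{p}^*)^\top \cdot g(x; \theta) \right\}.
\tag{21}
$$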

Here, $\gamma_t$ denotes the expected resource consumption in the $t$-th time period under the dual optimal solution $\hat{p}^*$. Accordingly, for each $t$, we define the following function $\hat{L}_t(\cdot)$,

$$
\hat{L}_t(p) \coloneqq \gamma_t^\top p + \hat{P}_t h(p; \theta).
\tag{22}
$$

We then have the following relation between $\hat{L}_t(\cdot)$ and $\hat{L}(\cdot)$.

###### Lemma 4

For each $t = 1, \dots, T$, it holds that

$$
\hat{p}^* \in \operatorname*{argmin}_{p \ge 0} \hat{L}_t(p)
\tag{23}
$$

where $\hat{p}^*$ is defined in (20) as the minimizer of the function $\hat{L}(\cdot)$. Moreover, it holds that

$$
\hat{L}(\hat{p}^*) = \sum_{t=1}^{T} \hat{L}_t(\hat{p}^*).
\tag{24}
$$

The definition of $\hat{L}_t(\cdot)$ and Lemma 4 provide a way to decompose the function $\hat{L}(\cdot)$ into a summation of $T$ functions. The new scheme absorbs the term $c^\top p$ in the function $\hat{L}(\cdot)$ into the summation differently from the scheme in the previous two sections, where each summand $L_{P_t}(p)$ receives the equal share $\frac{1}{T} c^\top p$. Inspired by this new scheme, Algorithm 2 replaces the term $c/T$ in Algorithm 1 with $\gamma_t$ for the update of the dual prices. From an algorithmic perspective, Algorithm 2 adjusts the gradient descent direction in Algorithm 1 based on the $\gamma_t$'s computed from an offline problem (19) specified by the prior estimate. The new update rule with the $\gamma_t$'s thus coincides with a stochastic gradient with respect to the function $\hat{L}_t(\cdot)$. From Lemma 4, we note that each function $\hat{L}_t(\cdot)$ shares the same optimal solution with the aggregated function $\hat{L}(\cdot)$. Intuitively, at each time $t$, though the stochastic gradient is computed from a different function $\hat{L}_t(\cdot)$, all the gradient descent directions point to the same optimal solution $\hat{p}^*$. This special property makes the gradient update in Algorithm 2 more effective; in essence, the algorithm performs one iteration of online stochastic gradient descent with respect to the function $\hat{L}_t(\cdot)$ at each time $t$.

A natural question is why we do not use the optimal solution $\hat{p}^*$ as a fixed dual price throughout the procedure, as in the well-known bid price policy (Talluri and Van Ryzin, 1998) for the network revenue management problem. We defer a detailed discussion of this question to Section B11.

Another way to interpret Algorithm 2 is from the resource consumption perspective. The sequence $\{\gamma_t\}_{t=1}^{T}$ represents the optimal way to allocate the resource over time according to the prior estimate. In Algorithm 1, the update rule of the dual price implies that if, at time period $t$, the resource consumption of constraint $i$ is larger (resp. smaller) than the per-period average $c_i/T$, then the dual price of constraint $i$ is pushed upward (resp. downward). In this sense, the dual price balances the process of resource consumption. However, when the prior estimates $\hat{P}_t$ are available, it may no longer be desirable to allocate the resource evenly over all time periods. Thus $\gamma_t$ reflects the adjustment of resource consumption suggested by the prior estimate. A larger (resp. smaller) value of $\gamma_t$ indicates that more (resp. less) resource should be allocated to time period $t$.
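The role of $\gamma_t$ can be illustrated with a small sketch. Algorithm 2 is not reproduced in this section, so the code below is not the paper's algorithm; it only mirrors definitions (19) to (21) in a hypothetical single-resource online linear program with $\theta = (r, a)$, $a > 0$, $X = [0,1]$, and finitely supported prior estimates. Two further assumptions are made: ties in the argmax of (21) are broken toward rejecting the order, and the dual step is a projected step that compares realized consumption with $\gamma_t$. All names are illustrative.

```python
def h(p, theta):
    """max over x in [0, 1] of (r - p*a)*x for theta = (r, a)."""
    r, a = theta
    return max(r - p * a, 0.0)

def prior_dual_price(c, priors):
    """Minimize L_hat(p) = c*p + sum_t E_{P_hat_t}[h(p; theta)] over p >= 0.

    L_hat is convex and piecewise linear (assuming a > 0), so a minimizer
    lies at p = 0 or at a breakpoint r/a of some support point of the priors.
    """
    def l_hat(p):
        return c * p + sum(w * h(p, th) for dist in priors for w, th in dist)
    candidates = sorted({0.0} | {r / a for dist in priors
                                 for _, (r, a) in dist if a > 0 and r >= 0})
    return min(candidates, key=l_hat)

def planned_consumption(p_star, priors):
    """gamma_t: expected consumption in period t under the price p_star.

    Ties (r == p_star * a) are broken toward rejecting; this is an assumption.
    """
    return [sum(w * a for w, (r, a) in dist if r > p_star * a) for dist in priors]

def dual_update(p, consumption, gamma_t, eta):
    """Projected step that compares realized consumption with gamma_t, not c/T."""
    return max(p + eta * (consumption - gamma_t), 0.0)

priors = [[(1.0, (1.0, 1.0))]] * 2 + [[(1.0, (1.5, 1.0))]] * 2
c = 2.0
p_hat = prior_dual_price(c, priors)
gammas = planned_consumption(p_hat, priors)
```

In this instance the planned consumption is concentrated entirely in the last two periods and sums to the budget, whereas the even split used by Algorithm 1 would target $c/T = 0.5$ in every period. With other tie-breaking rules or other prior estimates the $\gamma_t$'s computed this way need not sum to $c$, which is one reason the sketch should be read as an illustration rather than an implementation.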

### 5.2 Regret Analysis

Based on the notion of WBNB-P, we define a set of distributions

 ΞP(^P)\coloneqq{P:WPT(P,^P)≤WT,P=(