
Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

by   Yuval Emek, et al.

In this paper, a rather general online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. As the existing online learning techniques do not yield vanishing-regret mechanisms for this problem, we develop a novel online learning framework defined over deterministic Markov decision processes with dynamic state transition and reward functions. We then prove that if the Markov decision process is guaranteed to admit an oracle that can simulate any given policy from any initial state with bounded loss – a condition that is satisfied in the DRACC problem – then the online learning problem can be solved with vanishing regret. Our proof technique is based on a reduction to online learning with switching cost, in which an online decision maker incurs an extra cost every time she switches from one arm to another. We formally demonstrate this connection and further show how DRACC can be used in our proposed applications of stateful pricing.



1 Introduction

Price posting is a common selling mechanism across various corners of e-commerce. Its applications span from more traditional domains such as selling flight tickets on Delta’s website or selling products on Amazon, to more emerging domains such as selling cloud services on AWS or pricing ride-shares on Uber. The prevalence of price posting comes from its several important advantages: it is incentive compatible, simple to grasp, and easily fits into an online (or dynamic) environment where buyers arrive sequentially over time. Therefore, online posted pricing mechanisms, also known as dynamic pricing, have been studied quite extensively in computer science, operations research, and economics (for a comprehensive survey, see den Boer (2015)).

A very useful method for devising online posted prices is via vanishing-regret online learning algorithms in an adversarial environment (Bubeck et al., 2019, 2017; Feldman et al., 2016; Blum and Hartline, 2005; Blum et al., 2004; Kleinberg and Leighton, 2003). Here, a sequence of buyers arrives, each associated with her own valuation function that is assumed to be devised by a malicious adversary, and the goal is to post a sequence of price vectors that performs almost as well as the best fixed pricing policy in hindsight. Despite its success, a technical limitation of this method (shared by the aforementioned papers) forces the often less natural assumption of unlimited item supply to ensure that the selling platform is stateless. However, in many applications of online posted pricing, the platform is stateful; indeed, prices can depend on previous sales that determine the platform’s state. Examples of such stateful platforms include selling resources of limited supply, in which the state encodes the remaining inventories of different products, and selling resources in cloud computing to schedule online jobs, in which the state encodes the currently scheduled jobs.

The above mentioned limitation is in sharp contrast to the posted pricing literature, which considers stochastic settings where the buyers’ valuations are drawn independently and identically from unknown distributions (Badanidiyuru et al., 2018; Babaioff et al., 2015; Zhang et al., 2018), or independently from known distributions (Chawla et al., 2010; Feldman et al., 2015; Chawla et al., 2017a). By exploiting the randomness (and distributional knowledge) of the input and employing other algorithmic techniques, these papers cope with limited supply and, occasionally, with more complicated stateful pricing scenarios. However, the stochastic approach does not encompass the (realistic) scenarios in which the buyers’ valuations are correlated in various complex ways, scenarios that are typically handled using adversarial models. The only exception in this regard is the work of Chawla et al. (2017b), which takes a different approach: they consider the online job scheduling problem, and given access to a collection of (truthful) posted price scheduling mechanisms, they show how to design a (truthful) vanishing-regret online scheduling mechanism against this collection in an adversarial environment.

Motivated by the abundance of stateful posted pricing platforms, and inspired by Chawla et al. (2017b), we study the design of adversarial online learning algorithms with vanishing regret for a rather general online resource allocation framework. In this framework, termed dynamic resource allocation with capacity constraints (DRACC), dynamic resources of limited inventories arrive and depart over time, and an online mechanism sequentially posts price vectors to (myopically) strategic buyers with adversarially chosen combinatorial valuations (refer to Section 2 for the formal model). The goal is to post a sequence of price vectors with the objective of maximizing revenue, while respecting the inventory restrictions of dynamic resources for the periods of time in which they are active. We consider a full-information setting, in which the buyers’ valuations are elicited by the platform after posting prices in each round of the online execution.

Given a collection of pricing policies for the DRACC framework, we aim to construct a sequence of price vectors that is guaranteed to admit a vanishing regret with respect to the best fixed pricing policy in hindsight. Interestingly, our abstract framework is general enough to admit, as special cases, two important applications of stateful posted pricing, namely, online job-scheduling and matching over a dynamic bipartite graph; these applications, for which existing online learning techniques fail to obtain vanishing regret, are discussed in detail in Section 4.

Our contributions and techniques.

Our main result is a vanishing-regret posted price mechanism for the DRACC problem (refer to Section 3 for a formal exposition).

For any DRACC instance and for any collection of pricing policies, the regret of our proposed posted price mechanism (in terms of expected revenue) with respect to the in-hindsight best policy in the collection is sublinear in the number of users.

We prove this result by abstracting away the details of the pricing problem and considering a more general stateful decision making problem. To this end, we introduce a new framework, termed dynamic deterministic Markov decision process (Dd-MDP), which generalizes the classic deterministic MDP problem to a dynamic, adversarial online learning setting. In this framework, a decision maker picks a feasible action for the current state of the MDP, not knowing the state transitions and the rewards associated with each transition; the state transition function and rewards are then revealed. The goal of the decision maker is to pick a sequence of actions with the objective of maximizing her total reward. In particular, we look at vanishing-regret online learning, where the decision maker aims to minimize her regret, defined with respect to the in-hindsight best fixed policy (i.e., a mapping from states to actions) among the policies in a given collection.

Not surprisingly, vanishing-regret online learning is impossible for this general problem (see Proposition 3.1). To circumvent this difficulty, we introduce a structural condition on Dd-MDPs that enables online learning with vanishing regret. This structural condition ensures the existence of an ongoing chasing oracle that allows one to simulate a given fixed policy from any initial state, irrespective of the actual current state, while ensuring a small (vanishing) chasing regret. The crux of our technical contribution is cast in proving that the Dd-MDPs induced by DRACC instances satisfy this chasability condition.

Subject to the chasability condition, we establish a reduction from designing vanishing-regret online algorithms for Dd-MDPs to the extensively studied (classic stateless) setting of online learning with switching cost (Kalai and Vempala, 2005). At a high level, we have one arm for each policy in the given collection and employ the switching-cost online algorithm to determine the next policy to pick. Each time this algorithm suggests a switch to a new policy, we invoke the chasing oracle that attempts to simulate the new policy, starting from the current state of the algorithm, which may differ from the simulated policy’s current state. In summary, we obtain the following result (see Theorem 3.8 for a formal exposition).

For any Dd-MDP instance that satisfies the chasability condition and for any collection of policies, the regret of our online learning algorithm with respect to the in-hindsight best policy in the collection is sublinear (and optimal) in the number of rounds.

We further study the bandit version of the above problem, where the state transition function is revealed at the end of each round, but the learner only observes the current realized reward instead of the complete reward function. By adapting the chasability condition to this setting, we obtain near optimal regret bounds. See Theorem B.2 and Corollary B.3 in Appendix B for a formal statement.

Our abstract frameworks, both for stateful decision making and stateful pricing, are rather general and we believe that they will turn out to capture many natural problems as special cases (on top of the applications discussed in Section 4).

Additional related work and discussion.

In the DRACC problem, the class of feasible prices at each time is determined by the remaining inventories, which in turn depend on the prices picked at previous times. This kind of dependency cannot be handled by conventional online learning algorithms, such as follow-the-perturbed-leader Kalai and Vempala (2005) and EXP3 Auer et al. (2002). That is why we aim for the stateful model of online learning, which allows a certain degree of dependence on past actions.

Several attempts have been made to formalize and study stateful online learning models. The authors of Arora et al. (2012); Feldman et al. (2016) consider an online learning framework where the reward (or cost) at each time depends on a fixed-size window of the most recent actions. This framework can be viewed as one with a reward function that depends on the system’s state, which, in this case, encodes the actions in that window.

There is an extensive line of work on online learning models that address general multi-state systems, typically formalized by means of stochastic Even-Dar et al. (2004); Guan et al. (2014); Yu et al. (2009); Abbasi-Yadkori et al. (2013); Neu et al. (2014) or deterministic Dekel and Hazan (2013) MDPs. The disadvantage of these models from our perspective is that they all have at least one of the following two restrictions: (a) all actions are always feasible regardless of the current state Abbasi-Yadkori et al. (2013); Even-Dar et al. (2004); Guan et al. (2014); Yu et al. (2009); or (b) the state transition function is fixed (static) and known in advance Dekel and Hazan (2013); Even-Dar et al. (2004); Guan et al. (2014); Neu et al. (2014); Yu et al. (2009).

In the DRACC problem, however, not all actions (price vectors) are feasible for every state and the state transition function at time is revealed only after the decision maker has committed to its action. Moreover, the aforementioned MDP-based models require a certain type of state connectivity

in the sense that the Markov chain induced by each action should be irreducible

Abbasi-Yadkori et al. (2013); Even-Dar et al. (2004); Guan et al. (2014); Neu et al. (2014); Yu et al. (2009) or at least the union of all induced Markov chains should form a strongly connected graph Dekel and Hazan (2013). In contrast, in the DRACC problem, depending on the inventories of the resources, it may be the case that certain inventory vectors can never be reached (regardless of the decision maker’s actions).

On the algorithmic side, a common feature of all aforementioned online learning models is that for every instance, there exists some fixed number of time units, computable in a preprocessing stage (and independent of the instance length), such that the online learner can “catch” the state (or distribution over states) of any given sequence of actions in exactly that many time units. While this feature serves as a cornerstone for the existing online learning algorithms, it is not present in our model; hence our online learning algorithm has to employ different ideas.

In Devanur et al. (2019); Kesselheim et al. (2014); Agrawal and Devanur (2015), a family of online resource allocation problems is investigated under a different setting from ours. The resources in their problem models are static, which means that every resource is revealed at the beginning and remains active from the first user to the last one. In contrast to our adversarial model, these papers adopt various stochastic settings for the users, such as the random permutation setting, where a fixed set of users arrive in a random order Kesselheim et al. (2014); Agrawal and Devanur (2015), and the random generation setting, where the parameters of each user are drawn from some distribution Devanur et al. (2019); Agrawal and Devanur (2015). In these papers, the assignment of the resources to the requests is fully determined by a single decision maker, and the decision for each request depends on the revealed parameters of the current request and previous ones. By contrast, we study the scenario where each strategic user makes her own decision when choosing resources, and the price posted to each user must be specified independently of the valuation of the current user.

2 Model and Definitions

The DRACC problem.

Consider dynamic resources and strategic myopic users arriving sequentially over rounds , where round lasts over the time interval . Resource arrives at the beginning of round and departs at the end of round , where ; upon arrival, it includes units. We say that resource is active at time if and denote the set of resources active at time by . Let and be upper bounds on and , respectively.

The arriving user at time has a valuation function that determines her value for each subset of the resources active at that time. We assume that the valuations are normalized and that the users are quasi-linear, namely, if a subset of resources is allocated to a user and she pays a total payment in return, then her utility is her value for the subset minus the payment. A family of valuation functions that receives special attention in this paper is that of -demand valuation functions, where each user is associated with an integer parameter and with a value for each active resource, so that her value for a subset is the sum of the values of its top-valued resources, up to her integer parameter.
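Since some notation was lost in transcription, the following minimal sketch may help fix ideas: it computes the value of a bundle under a k-demand valuation, where the buyer enjoys only her k most-valued resources in the bundle (the parameter name `k` and the dictionary layout are our own illustrative choices, not the paper's notation).

```python
def k_demand_value(values, k, bundle):
    """Value of `bundle` under a k-demand valuation: the sum of the
    k largest per-resource values in the bundle (names illustrative)."""
    top = sorted((values.get(r, 0.0) for r in bundle), reverse=True)
    return sum(top[:k])

# A buyer with k = 2 who values resources a, b, c at 0.9, 0.5, 0.2
# derives value 0.9 + 0.5 from the full bundle:
vals = {"a": 0.9, "b": 0.5, "c": 0.2}
print(round(k_demand_value(vals, 2, {"a", "b", "c"}), 6))  # 1.4
```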

Stateful posted price mechanisms.

We restrict our attention to dynamic posted price mechanisms that work based on the following protocol. In each round , the mechanism first realizes which resources arrive at the beginning of round , together with their initial capacity , and which resources departed at the end of round , thus updating its knowledge of . It then posts a price vector that determines the price of each resource at time . Following that, the mechanism elicits the valuation function of the current user and allocates (or in other words sells) one unit of each resource in the demand set to user at a total price of , where


for any price vector, consistently breaking ties according to the lexicographic order. A virtue of posted price mechanisms is that if the choice of the posted prices does not depend on the current user's valuation, then it is a dominant strategy for the (myopic) user to report her valuation truthfully.

Let be the inventory vector that encodes the number of units remaining from each resource at time . Formally, if , then ; and if (a unit of) is allocated to user and is still active at time , then . We say that a price vector is feasible for the inventory vector if the price of every (active) resource exhausted by round is prohibitively high. To ensure that the resource inventory is not exceeded, we require that the posted price vector is feasible for the current inventory vector at every time; indeed, since the valuations are bounded, this requirement ensures that the utility of a user from any resource subset that includes an exhausted resource is negative, thus preventing such a subset from becoming the selected demand set, recalling that the utility obtained by a user from the empty set is 0.
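To make the allocation rule concrete, here is a brute-force sketch of the buyer's demand-set selection under quasi-linear utilities, with exhausted resources priced at infinity so that they can never be part of the chosen bundle (the function names and the tuple-based lexicographic tie-breaking convention are our own illustrative choices):

```python
from itertools import combinations

def demand_set(value_fn, prices, active):
    """Utility-maximizing bundle of a quasi-linear buyer; ties are broken
    toward the lexicographically smaller bundle (illustrative convention).
    The empty bundle, with utility 0, is always available."""
    resources = sorted(active)
    best, best_u = (), 0.0
    for size in range(1, len(resources) + 1):
        for S in combinations(resources, size):
            u = value_fn(set(S)) - sum(prices[r] for r in S)
            if u > best_u + 1e-12 or (abs(u - best_u) <= 1e-12 and S < best):
                best, best_u = S, u
    return set(best), best_u

# Resource "b" is exhausted, so it is priced at infinity and never sold:
vals = {"a": 0.9, "b": 0.5}
prices = {"a": 0.3, "b": float("inf")}
bundle, utility = demand_set(lambda S: sum(vals[r] for r in S), prices, {"a", "b"})
print(bundle)  # {'a'}
```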

In this paper, we aim for posted price mechanisms whose objective is to maximize the extracted revenue, defined as the total expected payment received from all users, where the expectation is over the mechanism’s internal randomness. (The techniques we use in this paper are applicable also to the objective of maximizing the social welfare.)

Adversarial online learning over pricing policies.

To measure the quality of the aforementioned posted price mechanisms, we consider an adversarial online learning framework, where at each time , the decision maker picks the price vector and an adaptive adversary simultaneously picks the valuation function . The resource arrival times , departure times , and initial capacities are also determined by the adversary. We consider the full information setting, where the valuation function of user is reported to the decision maker at the end of each round . It is also assumed that the decision maker knows the parameters and upfront and that these parameters are independent of the instance length .

A (feasible) pricing policy is a function that maps each inventory vector to a price vector, subject to the constraint that the resulting price vector is feasible for that inventory vector. (The seemingly more general setup, where the time is passed as an argument to the policy on top of the inventory vector, can be easily reduced to our setup, e.g., by introducing a dummy resource that is active in a single round.) The pricing policies are used as the benchmarks of our online learning framework: given a pricing policy, consider a decision maker that repeatedly plays according to it; namely, at each time she posts the price vector obtained by applying the policy to the inventory vector that results from applying the policy recursively in all previous rounds. The revenue of this decision maker is the total payment collected over all rounds.
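The recursive simulation of a fixed pricing policy can be sketched with a toy harness (the callback-style `policy` and `buyer` interfaces below are our own assumptions, not the paper's notation):

```python
def simulate_policy(policy, buyers, initial_inventory):
    """Revenue of repeatedly playing a fixed pricing policy: each round,
    the policy maps the current inventory to a price vector, the buyer
    responds with (bundle, payment), and one unit of each bought
    resource is removed from the inventory."""
    inv = dict(initial_inventory)
    revenue = 0.0
    for buyer in buyers:
        prices = policy(inv)
        bundle, payment = buyer(prices, inv)
        for r in bundle:
            inv[r] -= 1
        revenue += payment
    return revenue

# Single resource "a" with one unit; buyers purchase iff value >= price.
def unit_buyer(v):
    def respond(prices, inv):
        if inv.get("a", 0) > 0 and v >= prices["a"]:
            return {"a"}, prices["a"]
        return set(), 0.0
    return respond

fixed = lambda inv: {r: 0.5 for r in inv}
print(simulate_policy(fixed, [unit_buyer(0.9), unit_buyer(0.4), unit_buyer(0.7)],
                      {"a": 1}))  # 0.5
```

Note how the inventory update makes the policy's behavior depend on all previous sales, which is exactly the statefulness discussed above.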

Now, consider a collection of pricing policies. The quality of a posted price mechanism is measured by means of the decision maker’s regret that compares her own revenue to the revenue generated by the in-hindsight best pricing policy in . Formally, the regret (with respect to ) is defined to be

where the expectation is taken over the decision maker’s randomness. The mechanism is said to have vanishing regret if it is guaranteed that the decision maker’s regret is sublinear in the number of rounds, which means that the average regret per time unit vanishes as the number of rounds grows.

3 Dynamic Posted Pricing via Dd-MDP with Chasability

The online learning framework underlying the DRACC problem as defined in Section 2 is stateful with the inventory vector playing the role of the framework’s state. In the current section, we first introduce a generalization of this online learning framework in the form of a stateful online decision making, formalized by means of dynamic deterministic Markov decision processes (Dd-MDPs). Following that, we propose a structural condition called chasability and show that under this condition, the Dd-MDP problem is amenable to vanishing-regret online learning algorithms. This last result is obtained through a reduction to the extensively studied problem of “experts with switching cost”  Kalai and Vempala (2005). Finally, we prove that the Dd-MDP instances that correspond to the DRACC problem indeed satisfy the chasability condition.

3.1 Viewing DRACC as a Dd-MDP

A (static) deterministic Markov decision process (d-MDP) is defined over a set of states and a set of actions. Each state is associated with a subset of actions called the feasible actions of . A state transition function maps each state and action to a state . This induces a directed graph over , termed the state transition graph, where an edge labeled by leads from node to node if and only if . The d-MDP also includes a reward function that maps each state-action pair with and to a real value in .

Dynamic deterministic MDPs.

Notably, static d-MDPs are not rich enough to capture the dynamic aspects of the DRACC problem. We therefore introduce a more general object where the state transition and reward functions are allowed to develop in an (adversarial) dynamic fashion.

Consider a sequential game played between an online decision maker and an adversary. As in static d-MDPs, the game is defined over a set of states, a set of actions, and a feasible action set for each state. We further assume that the state and action sets are finite. The game is played in rounds as follows. The decision maker starts from an initial state. In each round, she plays a (randomized) action that is feasible for the state at the beginning of the round. Simultaneously, the adversary selects the round’s state transition function and reward function. The decision maker then moves to a new state (viewed as a movement along the corresponding edge in the state transition graph), obtains a reward, and finally observes the transition and reward functions as the current round’s (full information) feedback. (No (time-wise) connectivity assumptions are made for the dynamic transition graph; hence it may not be possible to devise a path between two given states, as is done in Dekel and Hazan (2013) for static d-MDPs.) The game then advances to the next round. The goal is to maximize the expected total reward.
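The round structure of this sequential game can be sketched as a short protocol loop (illustrative interfaces; `learner` and `adversary` below are our own stand-ins for the decision maker and the adversarially chosen transition/reward functions):

```python
def play_ddmdp(T, initial_state, learner, adversary):
    """Full-information Dd-MDP protocol: the decision maker commits to a
    feasible action first; only then are the round's transition and
    reward functions revealed and applied."""
    state, total = initial_state, 0.0
    for t in range(T):
        action = learner.act(state)
        transition, reward = adversary(t)      # chosen simultaneously
        total += reward(state, action)
        state = transition(state, action)      # move along the edge
        learner.observe(transition, reward)    # full-information feedback
    return total

# A trivial run: one state, and the constant action 0 always earns reward 1.
class Constant:
    def act(self, state): return 0
    def observe(self, transition, reward): pass

adversary = lambda t: (lambda s, a: s, lambda s, a: 1.0 if a == 0 else 0.0)
print(play_ddmdp(5, "s0", Constant(), adversary))  # 5.0
```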

Policies, simulation, & regret.

A (feasible) policy is a function that maps each state to an action . A simulation of policy over the round interval is given by the state sequence and the action sequence defined by setting


The cumulative reward obtained by this simulation of is given by .

Consider a decision maker that plays the sequential game by following the (randomized) state sequence and action sequence , where for every . For a (finite) set of policies, the decision maker’s regret with respect to is defined to be


Relation to the DRACC Problem

Dynamic posted pricing for the DRACC problem can be modeled as a Dd-MDP. To this end, we identify the state set with the set of possible inventory vectors , . If state is identified with inventory vector , then we identify with the set of price vectors feasible for . The reward function is defined by setting


where is defined as in Eq. (1), recalling that the valuation function , required for the computation of , is available to the decision maker at the end of round . As for the state transition function , the new state is the inventory vector obtained by posting the price vector to user given the inventory vector , namely,

Given the aforementioned definitions, the notion of (pricing) policies and their recursive simulations and the notion of regret translate directly from the DRACC setting to that of Dd-MDPs.

3.2 The Chasability Condition

As the Dd-MDP framework is very inclusive, it is not surprising that in general, it does not allow for vanishing regret.

Proposition 3.1.

For every online learning algorithm, there exists a Dd-MDP instance for which the algorithm’s regret is linear in the number of rounds.


Consider a simple scenario where there are only two states, one of which is the initial state, and two actions that are feasible for both states. In each round, let the reference action be the one that the decision maker’s algorithm chooses with probability at least half. Now, consider an adversary that works in the following manner: in every round, it assigns zero reward to the reference action and full reward to the other action, and it sets the state transitions accordingly. In such a case, the expected cumulative reward of the decision maker is at most half of the number of rounds, while the best fixed policy obtains the full cumulative reward. ∎

As a remedy to the impossibility result established in Proposition 3.1, we introduce a structural condition for Dd-MDPs that makes them amenable to online learning with vanishing regret.

Definition 3.2 (Chasability condition for Dd-MDPs).

A Dd-MDP instance is called chasable if it admits an ongoing chasing oracle that works as follows for any given target policy. The chasing oracle is invoked at the beginning of some round and provided with an initial state; this invocation is halted at the end of some later round. In each round in between, the chasing oracle generates a (random) action that is feasible for its current state; following that, the chasing oracle is provided with the Dd-MDP’s state transition function and reward function for that round. The main guarantee of the chasing oracle is that its chasing regret (CR) — the (expected) reward difference between simulating the target policy over these rounds and the oracle’s own play — satisfies the stated bound. We emphasize that the initial state provided to the chasing oracle may differ from the target policy’s own state at that round.

Relation to the DRACC Problem (continued)

Interestingly, the Dd-MDPs corresponding to DRACC instances are -chasable for , where the exact bound on depends on whether we consider general or -demand valuation functions. Before establishing these bounds, we show that the chasing oracle must be randomized.

Proposition 3.3.

There exists a family of DRACC instances whose corresponding Dd-MDPs do not admit a deterministic chasing oracle with sublinear chasing regret CR.


Consider an ongoing chasing oracle that is implemented in a deterministic manner for a DRACC instance with , . The adversary chooses initial step and initial state so that , , and . Note that throughout this proof, the inventory vectors and price vectors containing two elements are presented in an ordered way, which means that the first element corresponds to the resource with the smaller index.

The target policy is chosen to have . Moreover, it maps every inventory vector to a price vector of . The adversary ensures the feasibility of such a policy by setting for each resource that is sold out at with the price vector generated by , and setting for a new resource. With this setting, it holds for every that .

The adversary configures the valuation functions for each in an adaptive way, and ensures that for all such


With the initial state chosen by the adversary, Eq. (6) holds for . Suppose it holds for some . Then the price vector generated by the oracle must be in the form of for some . Let and be the two resources in with . If , the adversary sets and . Then with the price vector generated by , payment is obtained from the user for resource , while the oracle obtains payment from the user for . The difference in rewards is


Moreover, since is sold out with , the adversary sets and for a new resource . In such case, it is guaranteed that Eq. (6) holds for .

For the case where , it can be verified that Eq. (7) still holds for and Eq. (6) holds for when the adversary sets . Since Eq. (7) is established for every , . With , taking gives the desired bound. ∎

We now turn to study chasing oracles for DRACC instances implemented by randomized procedures.

Theorem 3.4.

The Dd-MDPs corresponding to -round DRACC instances with -demand valuation functions are -chasable.


Consider some DRACC instance and fix the target pricing policy ; in what follows, we identify with a decision maker that repeatedly plays according to . Given an initial round and an initial inventory vector , we construct a randomized chasing oracle that works as follows until it is halted at the end of round . For each round , recall that is the inventory vector at time obtained by running from round to , and let be the inventory vector at time obtained by as defined in Eq. (5). We partition the set of resources active at time into and . In each round , the chasing oracle posts the (-dimensional) all- price vector with probability , where is a parameter to be determined later on; and it posts the price vector

with probability , observing that this price vector is feasible for by the definition of and . Notice that never sells a resource and that for all . Moreover, if resource arrives at time , then .

To analyze the CR, we classify the rounds under consideration into two classes: rounds in which at least one (unit of a) resource in is sold by , and the remaining rounds. In each of the remaining rounds, if posts in that round, then sells exactly the same resources as for the exact same prices; otherwise ( posts the all- price vector in that round), does not sell any resource. Hence, the CR increases in such a round by at most in expectation. In each round of the first kind, the CR increases by at most . Therefore, the total CR over the interval is upper bounded by , where and denote the numbers of rounds of the first and second kinds, respectively.

To bound , we introduce a potential function , , defined by setting

By definition, and . We argue that is non-increasing in . To this end, notice that if is a round, then , hence . If is a round and posts the all- price vector, then as sells no resource whereas sells at least one (unit of a) resource in . So, it remains to consider a round in which posts the price vector . Let and be the sets of (active) resources sold by and , respectively, in round and notice that a resource may move from to . The key observation now is that since is a -demand valuation function, it follows that , thus . As both and sell exactly one unit of each resource in and , respectively, we conclude that .

Therefore, the number of rounds of the first kind is upper bounded by plus the expected number of rounds in which does not decrease. Since strictly decreases in each round in which posts the all- price vector, it follows that the number of rounds in which does not decrease is stochastically dominated by a negative binomial random variable with parameters and . Recalling that , we conclude that . The assertion is now established by setting . ∎

Remark 3.5.

Theorem 3.4 can in fact be extended – using the exact same line of arguments – to a more general family of valuation functions defined as follows. Let be a price vector, be a subset of the active resources, and be the price vector obtained from by setting if ; and otherwise. Then, . Besides -demand valuations, this class of valuation functions includes OXS valuations Lehmann et al. (2006) and single-minded valuations Lehmann et al. (2002).

Theorem 3.6.

The Dd-MDPs corresponding to -round DRACC instances with arbitrary valuation functions are -chasable.


The proof follows the same line of arguments as that of Theorem 3.4, except that now, it no longer holds that the potential function is non-increasing in . However, it is still true that (I) for every ; (II) if is a round and posts the all- price vector in round , then ; and (III) if for some , then for all . We conclude that if posts the all- price vector in contiguous rounds, then must reach zero and following that, there are no more rounds. Therefore, the total number of rounds is stochastically dominated by times a geometric random variable with parameter . Since , it follows that . Combined with the rounds, the CR is upper bounded by . The assertion is established by setting . ∎

3.3 Putting the Pieces Together: Reduction to Online Learning with Switching Cost

Having an ongoing chasing oracle with vanishing chasing regret in hand, our remaining key technical idea is to reduce online decision making for the Dd-MDP problem to the well-studied problem of online learning with switching cost (OLSC) Kalai and Vempala (2005). The setup of this problem under full information is exactly that of the classic problem of learning from experts’ advice, except that the learner incurs an extra cost , a parameter referred to as the switching cost, whenever it switches from one expert to another. Here, we have a finite set of experts (often called actions or arms) and rounds. The expert reward function is revealed as feedback at the end of round . The goal of an algorithm for this problem is to pick a sequence of experts in an online fashion with the objective of minimizing the regret, now defined to be
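Concretely, the OLSC regret compares the best fixed expert in hindsight against the algorithm's collected reward net of switching costs. The following toy helper makes this explicit; the names (`gamma` for the switching-cost parameter, `picks` for the played sequence) are ours, not the paper's notation.

```python
def olsc_regret(reward_rounds, picks, gamma):
    """OLSC regret: best fixed expert in hindsight minus the algorithm's
    total reward after charging gamma per switch between experts.
    reward_rounds[t][i] is the reward of expert i in round t."""
    n = len(reward_rounds[0])
    best_fixed = max(sum(r[i] for r in reward_rounds) for i in range(n))
    alg = sum(r[p] for r, p in zip(reward_rounds, picks))
    switches = sum(1 for a, b in zip(picks, picks[1:]) if a != b)
    return best_fixed - (alg - gamma * switches)
```

For instance, a learner that happens to track the best expert in every round still pays for the switch it makes along the way.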

Theorem 3.7 (Kalai and Vempala (2005)).

The OLSC problem with switching cost admits an online algorithm whose regret is .

Note that the same theorem also holds for independent stochastic switching costs with as an upper bound on the expected switching cost; this follows from linearity of expectation and the fact that in algorithms for OLSC, such as Follow-The-Perturbed-Leader Kalai and Vempala (2005), the decision to switch at each time is independent of the realized cost of switching.
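As an illustration, a single-perturbation variant of Follow-The-Perturbed-Leader can be sketched as follows. The interface and parameter names are ours, and the constants are not tuned for the formal regret bound; the point is that reusing one perturbation across all rounds keeps the number of switches (and hence the total switching cost) small.

```python
import random

def ftpl(reward_rounds, epsilon=0.1, seed=0):
    """Follow-The-Perturbed-Leader with one perturbation drawn up front.

    Toy sketch: a single exponential perturbation per expert is drawn
    once and reused, so the perturbed leader changes rarely, which is
    what bounds the switching cost in OLSC."""
    rng = random.Random(seed)
    n = len(reward_rounds[0])
    perturb = [rng.expovariate(epsilon) for _ in range(n)]  # one draw per expert
    cum = [0.0] * n          # cumulative observed reward per expert
    picks, switches = [], 0
    for rewards in reward_rounds:
        leader = max(range(n), key=lambda i: cum[i] + perturb[i])
        if picks and leader != picks[-1]:
            switches += 1
        picks.append(leader)
        for i, r in enumerate(rewards):  # full-information feedback
            cum[i] += r
    return picks, switches
```

With two experts whose cumulative-reward gap grows monotonically, the perturbed leader changes at most once, so the switching cost stays bounded regardless of the horizon.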

We now present our full-information online learning algorithm for -chasable Dd-MDP instances; the reader is referred to Appendix B for the bandit version of this algorithm. Our (full-information) algorithm, called chasing and switching (C&S), requires a black box access to an algorithm for the OLSC problem with the following configuration: (1) the expert set of is identified with the policy collection of the Dd-MDP instance; (2) the number of rounds of is equal to the number of rounds of the Dd-MDP instance (); and (3) the switching cost of is set to .

The operation of C&S is described in Algorithm 1. This algorithm maintains, in parallel, the OLSC algorithm and an ongoing chasing oracle ; the former produces a sequence of policies and the latter produces a sequence of actions based on it. Specifically, is restarted, i.e., invoked from scratch with a fresh policy , whenever switches to from some policy .

Input: Policy set , OLSC algorithm , chasing oracle , initial state ;
Output: Sequence of actions, (implicit) sequence of states;
Start from initial state ;
for each round  do
       Invoke to pick a policy at the beginning of round ;
       if  and  then
             Invoke from scratch with target policy , initialized with round and state ;
             Select the action returned by ;
       else
             Continue the existing run of and select the action it returns;
       Feed with and as the state transition and reward functions of round ;
       for each  do
             Compute by simulating policy up to time (see Eq. (2));
       Feed with as the reward function of round ;
ALGORITHM 1 Online Dd-MDP algorithm C&S
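The control flow of Algorithm 1 can be sketched as follows. The `olsc.pick`/`olsc.feed` interface and the toy chasing oracle are our own stand-ins: the paper's oracle simulates the target policy with bounded chasing loss, and the real feedback simulates every policy from scratch as in Eq. (2), whereas this sketch simplifies both.

```python
class ChasingOracle:
    """Toy stand-in for the ongoing chasing oracle: it simply plays the
    target policy's action at the current state. The paper's oracle is
    more subtle and carries a bounded chasing-regret guarantee."""
    def __init__(self, policy, state):
        self.policy, self.state = policy, state
    def act(self):
        return self.policy(self.state)
    def observe(self, state):
        self.state = state

def chase_and_switch(olsc, policies, transition, reward, T, s0):
    """Skeleton of C&S: run an OLSC algorithm over whole policies and
    restart the chasing oracle whenever the chosen policy changes."""
    state, oracle, prev, total = s0, None, None, 0.0
    for t in range(T):
        p = olsc.pick()                      # policy chosen for round t
        if oracle is None or p != prev:      # policy switch: restart oracle
            oracle = ChasingOracle(policies[p], state)
        prev = p
        action = oracle.act()                # action actually played
        total += reward(t, state, action)
        state = transition(t, state, action)
        oracle.observe(state)
        # Simplified feedback: evaluate each policy at the realized state
        # (the paper instead simulates each policy from scratch, Eq. (2)).
        olsc.feed([reward(t, state, q(state)) for q in policies])
    return total
```

Any OLSC algorithm exposing this `pick`/`feed` interface (e.g. the FTPL variant above) can be plugged in without changing the loop.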
Theorem 3.8.

The regret of C&S for -round -chasable Dd-MDP instances is .


Partition the rounds into episodes so that each episode is a maximal contiguous sequence of rounds in which the policy chosen by does not change. Let and be the first and last rounds of episode , respectively. Consider some episode with corresponding policy . Since C&S follows an action sequence generated by during the round interval and since the chasing regret of is upper bounded by , it follows that

Therefore, for each policy , we have

By Theorem 3.7, the last expression is at most . ∎
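The episode partition used in this proof is mechanical; as a sanity check, it can be computed from the sequence of chosen policies as follows (0-based indices, purely illustrative).

```python
def episodes(policy_sequence):
    """Partition rounds into maximal contiguous runs during which the
    chosen policy does not change, returning (first, last) round pairs."""
    eps, start = [], 0
    for t in range(1, len(policy_sequence) + 1):
        if t == len(policy_sequence) or policy_sequence[t] != policy_sequence[start]:
            eps.append((start, t - 1))
            start = t
    return eps
```

The number of episodes exceeds the number of policy switches by exactly one, which is how the per-episode chasing-regret terms are charged against the switching cost.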

So far, we have only considered the notion of policy regret as defined in Eq. 3. An extension of our results to the notion of external regret (Arora et al., 2018) is discussed in Appendix A. Furthermore, we investigate the bandit version of the problem in Appendix B. In a nutshell, by introducing a stateless version of our full-information chasing oracle and reducing to the adversarial multi-armed bandit problem (Audibert and Bubeck, 2009), we obtain a regret bound for Dd-MDP under bandit feedback. Finally, we obtain near-matching lower bounds for both the full-information and bandit-feedback versions of the Dd-MDP problem under the chasability condition in Appendix C.

Relation to the DRACC Problem (continued)

We can now use C&S (Algorithm 1) for the Dd-MDPs that correspond to DRACC instances. The resulting mechanism is called learning based posted pricing (LBPP). It first instantiates the input parameters of C&S, namely the collection of pricing policies, the OLSC algorithm , the ongoing chasing oracle , and the initial state . It then runs C&S, posting its price vectors (actions) and updating the resulting inventory vectors (states). For , we employ the (randomized) chasing oracles promised in Theorem 3.4 and Theorem 3.6. The following theorems can now be inferred from Theorem 3.8, Theorem 3.4, and Theorem 3.6.

Theorem 3.9.

The regret of LBPP for -round DRACC instances with -demand valuation functions (or more generally, with the valuation functions defined in Remark 3.5) is .

Theorem 3.10.

The regret of LBPP for -round DRACC instances with arbitrary valuation functions is .

Note that the regret bounds in Theorem 3.9 and Theorem 3.10 depend on the parameters and of the DRACC problem; as shown in the following theorem, such a dependence is unavoidable.

Theorem 3.11.

If , then the regret of any posted price mechanism is .


Here we construct two instances of DRACC. The following settings are shared by the two instances.

  • The parameters and are chosen so that . Set .

  • For each resource , and . This setting implies that for every user , , which is consistent with . Every has the same capacity .

  • For each user , the valuation function is set as follows.

For the users , their valuation functions are different between the two instances. In particular, in the first instance, for any , while in the second instance

where is some small enough constant in .

Now consider an arbitrary deterministic mechanism . Such a mechanism will output the same sequence of price vectors for the first half of the users in these two instances. Therefore, the total number of resources that are allocated by to the first half of users must be the same in the two instances for some . Then, the revenue of is at most in the former instance, while at most in the latter one. Now consider a pricing policy that maps every inventory vector except to a price vector that only contains . The revenue of in the first instance is . Similarly, there exists a policy with revenue in the second instance. Therefore, the regret of is at least

To generalize the result above to mechanisms that can utilize random bits, we adopt Yao’s principle Yao (1977). In particular, we construct a distribution over the inputs that assigns probabilities and to the two instances constructed above, respectively. It can be verified that against this distribution, the expected regret of any randomized mechanism is at least . By Yao’s principle, the lower bound on the regret of any mechanism that utilizes random bits is also . This establishes the theorem. ∎
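The final averaging step is an instance of Yao's principle. In toy form, with hypothetical per-instance regret numbers (not taken from the paper), the bound can be computed as follows.

```python
def yao_lower_bound(det_regrets, dist):
    """Yao's principle, toy form: the expected regret of the best
    deterministic mechanism against a fixed input distribution lower
    bounds the worst-case expected regret of every randomized mechanism.
    det_regrets[a][i] = regret of deterministic mechanism a on instance i
    (hypothetical numbers for illustration)."""
    return min(sum(p * r for p, r in zip(dist, regrets))
               for regrets in det_regrets)
```

For example, if every deterministic mechanism incurs large regret on at least one of two instances, then under the uniform distribution every randomized mechanism inherits half of that regret in expectation.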

4 Applications of the DRACC Problem

The mechanism LBPP proposed for the DRACC problem can be directly applied to a large family of online pricing problems arising in practice. Two examples are presented in this section: the online job scheduling (OJS) problem and the problem of matching over dynamic bipartite graphs (MDBG).

4.1 Online Job Scheduling

The OJS problem described in this section is motivated by the application of assigning jobs that arrive online to limited-bandwidth slots so as to maximize the total payment collected from the jobs. Formally, in the OJS problem, there are strategic myopic jobs, arriving sequentially over time slots. Each slot lasts over the time interval and is associated with a bandwidth , which means that this slot can be allocated to at most jobs. For each job , the adversary specifies an arrival slot , a departure slot , a length , and a value . We emphasize that any number (including zero) of jobs may have slot as their arrival (or departure) slot. The goal of job is to get an allocation of contiguous slots within , namely, a slot interval in

with being the job’s value for each such allocation. Let and be upper bounds on and , respectively.
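For intuition, the set of contiguous slot intervals that can feasibly serve a job can be enumerated as follows; the slot indexing and the `load`/`bandwidth` maps are toy stand-ins for the paper's notation, not its actual data structures.

```python
def feasible_intervals(arrival, departure, length, bandwidth, load):
    """Enumerate contiguous slot intervals of the given length inside
    [arrival, departure] in which every slot still has spare bandwidth.
    bandwidth[s] is the capacity of slot s; load[s] is its current load."""
    out = []
    for start in range(arrival, departure - length + 2):
        window = range(start, start + length)
        if all(load[s] < bandwidth[s] for s in window):
            out.append((start, start + length - 1))
    return out
```

A job is served only if at least one such interval exists at prices it is willing to pay; a fully loaded slot blocks every window containing it.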

Job is reported to the OJS mechanism at the beginning of slot ; if several jobs share the same arrival slot, then they are reported to the mechanism sequentially in an arbitrary order. At the beginning of slot , the mechanism is also informed of the bandwidth parameter of every slot , where is defined to be the slot interval

note that the mechanism may have been informed of the bandwidth parameters of some slots in beforehand (if they belong to for ). In response, the mechanism posts a price vector and elicits the parameters ,