No-Regret Stateful Posted Pricing

05/04/2020 · Yuval Emek et al.

In this paper, a rather general online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling. As the existing online learning techniques do not yield no-regret mechanisms for this problem, we develop a new online learning framework defined over deterministic Markov decision processes with dynamic state transition and reward functions. We then prove that if the Markov decision process is guaranteed to admit a dominant state in each round and there exists an oracle that can switch the internal state with bounded loss, a condition that is satisfied in the DRACC problem, then the online learning problem can be solved with vanishing regret. Our proof technique is based on a reduction to full information online learning with switching cost (Kalai and Vempala, 2005), in which an online decision maker incurs an extra cost every time she switches from one arm to another. We demonstrate this connection formally, and further show how DRACC can be used in our proposed applications of stateful pricing.


1 Introduction

Price posting is a common selling mechanism across various corners of e-commerce. Its applications span from more traditional domains, such as selling flight tickets on Delta's website or selling products on Amazon, to more emerging domains, such as selling cloud services on AWS or pricing ride-shares on Uber. Not surprisingly, the prevalence of price posting stems from several important advantages: it is incentive compatible, simple to grasp, and can easily fit into an online (or dynamic) environment where buyers arrive sequentially over time. Therefore, online posted pricing mechanisms, also known as dynamic pricing, have been studied quite extensively in the computer science, operations research, and economics literature (for a comprehensive survey, see den Boer (2015)).

One method of devising online posted prices, which has proven to be extremely useful, is via no-regret online learning algorithms in an adversarial environment (Bubeck et al., 2019, 2017; Feldman et al., 2016; Blum and Hartline, 2005; Blum et al., 2004; Kleinberg and Leighton, 2003). Here, the sequence of buyers' valuations is arbitrary, i.e., picked by an adversary, and the goal is to pick a sequence of posted prices that performs almost as well as the best fixed posted price in hindsight. Despite its success, a technical limitation of this method forces the often less natural assumption of unlimited item supply (including in the aforementioned papers). However, in several applications of online posted pricing the platform needs to keep track of the state of the sale, and prices can potentially depend on this state. Examples include selling resources of limited supply, where the state is the remaining inventory of each product, and selling cloud computing resources for scheduling online jobs, where the state is the set of currently scheduled jobs.

The above-mentioned limitation stands in sharp contrast to the posted prices literature that considers a stochastic setting, in which the buyers' values are drawn independently and identically from unknown distributions (Badanidiyuru et al., 2013; Babaioff et al., 2015; Zhang et al., 2018), or independently from known distributions (Chawla et al., 2010; Feldman et al., 2014; Chawla et al., 2017b). By exploiting the randomness (and distributional knowledge) of the input and employing other algorithmic techniques, these papers handle the more natural assumption of limited supply, and occasionally more complicated stateful pricing scenarios. However, this approach does not encompass the (realistic) scenarios in which the buyers' values are correlated in various complicated ways, scenarios that are typically handled using adversarial models. The only exception is the work of Chawla et al. (2017a), which takes a different approach: considering online job scheduling, and given access to a collection of (truthful) FIFO posted-price scheduling mechanisms, they show how to design a (truthful) no-regret online scheduling policy against this collection in an adversarial environment.

Motivated by the above applications of stateful posted pricing in a detail-free environment, and inspired by Chawla et al. (2017a), we study the design of no-regret adversarial online learning algorithms for a rather general stateful online resource allocation framework. In this framework, which we term Dynamic Resource Allocation with Capacity Constraints (DRACC), dynamic resources with limited inventories arrive and depart over time, and an online mechanism sequentially posts prices to (myopically) strategic buyers with adversarial combinatorial valuations. The goal is to post a sequence of prices that maximizes revenue while respecting the inventory restrictions of the dynamic resources during the periods of time in which they are active. We consider a full-information setting, in which each buyer's valuation is elicited by the platform after the prices for that time step have been posted.

Given a collection of pricing policies for the DRACC framework, we aim to design a new posted pricing policy that obtains vanishing regret with respect to the best fixed pricing policy in hindsight among the policies in the collection. In order to obtain such a result, we pin down the right set of modeling assumptions that make this problem tractable (whose necessity we justify later in Section 6). Using this framework, we also investigate other stateful pricing problems for which existing online learning techniques cannot be exploited to obtain no-regret policies, but which are well suited to our framework. Interestingly, our abstract framework with its assumptions is general enough to admit as special cases two other important applications of stateful posted pricing, namely online job scheduling and matching over a dynamic bipartite graph, which we discuss in detail in Section 5.

Our Contributions and Techniques.

Our main result is a no-regret posted price mechanism for the DRACC problem (see Section 4, Theorem 4.7 for the technical details).

For any DRACC instance with T users (under our modeling assumptions in Section 2.1) and for any collection of pricing policies, our proposed posted price mechanism obtains a regret bound that is sublinear in T against the in-hindsight optimal policy of the collection (in terms of expected revenue).

We prove the above result by abstracting away the details of the pricing problem and considering the more general underlying “stateful decision making” problem. To this end, we introduce a new framework, termed Dynamic Deterministic Markov Decision Process (Dd-MDP), which generalizes the classic deterministic MDP problem to an adversarial online learning setting. In this problem, a decision maker picks a feasible action at each time given the current state of the MDP, without knowing the state transitions and the rewards associated with each transition; the state transition function and the rewards are revealed only afterwards. The goal of the decision maker is to pick a sequence of actions that maximizes her cumulative reward. In particular, we look at no-regret online learning, where the decision maker tries to minimize her regret, defined with respect to the best fixed-in-hindsight policy (i.e., a mapping from states to actions) in a given collection.

Not surprisingly, no-regret online learning is not possible for this general problem (see Proposition 2.1). Inspired by the DRACC problem and its applications, we introduce a critical structural condition on Dd-MDPs which makes no-regret online learning possible. This condition ensures the existence of “dominant” states, i.e., states with the following property: by (magically) switching its internal state to the dominant state at any time, any policy can only weakly increase its future reward. Moreover, it ensures the existence of switching procedures, termed “chasing” oracles, that can move the internal state of a policy to a dominant state with affordable loss.

Given the Dominance & Chasing condition above, our main technical contribution is to show a reduction from designing no-regret online policies for Dd-MDP to the well-studied (classic, stateless) setting of online learning with switching costs and experts' advice (Kalai and Vempala, 2005). Given any no-regret algorithm for the latter problem, we design a new policy called Chasing & Simulation (C&S) for Dd-MDP that uses this algorithm in a blackbox fashion. At a high level, we have one arm for each policy in the given collection, and C&S invokes the switching-cost algorithm to find the next policy to pick. Each time this algorithm suggests a different policy, we first invoke the chasing oracle to ensure we end up at a dominant state (chasing phase), and then we simulate the newly suggested policy from the beginning to determine which actions to pick until the next switch (simulation phase). In summary, we obtain the following result (see Theorem 3.2 in Section 3 for more details).

For any Dd-MDP instance with T rounds that satisfies the (approximate) Dominance & Chasing condition (Section 3.1), and for any collection of policies, Chasing & Simulation (Algorithm 1) obtains a regret bound sublinear in T against the best in-hindsight policy in the collection.

Our frameworks, both for stateful decision making and stateful pricing, are rather general and we believe that more problems will turn out to fit them. We study two of them in the applications section, and we leave investigating more applications as a future direction.

Additional Related Work and Discussions.

In the DRACC problem, the class of feasible prices at each time is determined by the remaining inventories, which in turn depend on the prices picked at previous times. This kind of dependency cannot be handled by conventional online learning algorithms, such as follow-the-perturbed-leader (Kalai and Vempala, 2005) and EXP3 (Auer et al., 2002). That is why we aim for a stateful model of online learning, which allows a certain degree of dependence on past actions.

Several attempts have been made to formalize and study stateful online learning models. The authors of Arora et al. (2012) and Feldman et al. (2016) consider an online learning framework where the reward (or cost) at each time depends on the m most recent actions for some fixed m. This framework can be viewed as having a reward function that depends on the system's state, which in this case encodes the last m actions.

There is an extensive line of work on online learning models that address general multi-state systems, typically formalized by means of stochastic Even-dar et al. (2005); Guan et al. (2014); Yu et al. (2008); Abbasi-Yadkori et al. (2013); Neu et al. (2014) or deterministic Dekel and Hazan (2013) MDPs. The disadvantage of these models from our perspective is that they all have at least one of the following two restrictions: (a) all actions are always feasible regardless of the current state Abbasi-Yadkori et al. (2013); Even-dar et al. (2005); Guan et al. (2014); Yu et al. (2008); or (b) the state transition function is fixed (static) and known in advance Dekel and Hazan (2013); Even-dar et al. (2005); Guan et al. (2014); Neu et al. (2014); Yu et al. (2008).

In the DRACC problem, however, not all actions (price vectors) are feasible for every state, and the state transition function of each time step is revealed only after the decision maker has committed to her action. Moreover, the aforementioned MDP-based models require a certain type of state connectivity, in the sense that the Markov chain induced by each action should be irreducible Abbasi-Yadkori et al. (2013); Even-dar et al. (2005); Guan et al. (2014); Neu et al. (2014); Yu et al. (2008), or at least the union of all induced Markov chains should form a strongly connected graph Dekel and Hazan (2013). In contrast, in the DRACC problem, depending on the inventories of the resources, it may be the case that certain surplus vectors can never be reached (regardless of the decision maker's actions).

On the algorithmic side, a common feature of all aforementioned online learning models is that for every instance, there exists some quantity, computable in a preprocessing stage (and independent of the number of rounds), such that the online learner can “catch” the state (or distribution over states) of any given sequence of actions in exactly that many time units. While this feature serves as a cornerstone for the existing online learning algorithms, it is not present in our model; hence, our online learning algorithm has to employ different ideas.

In Devanur et al. (2011); Kesselheim et al. (2014); Agrawal and Devanur (2015), a family of online resource allocation problems is investigated under a setting different from ours. The resources in their problem models are static, which means that every resource is revealed at the beginning and remains active from the first user to the last one. Unlike our adversarial model, these papers adopt various stochastic settings for the users, such as the random permutation setting, where a fixed set of users arrive in a random order Kesselheim et al. (2014); Agrawal and Devanur (2015), and the random generation setting, where the parameters of each user are drawn from some distribution Devanur et al. (2011); Agrawal and Devanur (2015). In these papers, the assignment of the resources to the requests is fully determined by a single decision maker, and the decision for each request depends on the revealed parameters of the current request and the previous ones. By contrast, we study the scenario where each strategic user makes her own decision of choosing the resources, and the price posted to each user must be specified independently of that user's valuation.

2 Preliminaries

In this section, we formalize the problem of designing no-regret posted price mechanisms for the DRACC setting, and then its generalization to no-regret online learning in the abstract Dd-MDP setting. In principle, both of these problems are examples of online learning in a stateful environment. We dig further into the connection between the two settings in Section 4.

2.1 DRACC: Notations, Basics & Online Learning

In the DRACC problem, we have dynamic resources and T strategic users arriving sequentially over times t = 1, …, T. For each dynamic resource i, C_i units of this resource arrive at some time and all of them expire at some later time. We say a resource i is active at time t if t lies within its arrival-to-expiry window, and denote the set of all resources active at this time by A_t. At each time t, the set of resources that arrive at the beginning of this time and the set of resources that depart after the end of this time are revealed. The user arriving at time t has a valuation set function v_t over subsets of the resources active at this time, and we assume v_t(∅) = 0. Moreover, users are quasi-linear, i.e., if a subset X of resources is allocated to user t and she pays a total payment of q in return, her utility is v_t(X) − q.
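To fix this setup operationally, here is a minimal Python sketch of how a DRACC instance could be represented; the names (Resource, arrival, expiry, capacity) and the dictionary-based valuations are illustrative assumptions of ours, not the paper's notation.

from dataclasses import dataclass
from typing import Dict, FrozenSet, List

@dataclass(frozen=True)
class Resource:
    name: str
    arrival: int    # first time step at which the resource is active
    expiry: int     # last time step at which the resource is active
    capacity: int   # number of identical units (C_i)

def active_resources(resources: List[Resource], t: int) -> List[Resource]:
    """Resources whose active window [arrival, expiry] contains time t."""
    return [r for r in resources if r.arrival <= t <= r.expiry]

# A user's valuation is a set function over bundles of active resources;
# here we store it as a dictionary and default unlisted bundles to 0.
Valuation = Dict[FrozenSet[str], float]

def value(v: Valuation, bundle: FrozenSet[str]) -> float:
    return v.get(bundle, 0.0)

# Example instance: two overlapping resources and three users (T = 3).
resources = [
    Resource("gpu", arrival=1, expiry=2, capacity=1),
    Resource("cpu", arrival=1, expiry=3, capacity=2),
]
valuations: List[Valuation] = [
    {frozenset({"gpu"}): 0.8, frozenset({"gpu", "cpu"}): 0.9},
    {frozenset({"cpu"}): 0.4},
    {frozenset({"cpu"}): 0.6},
]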

Posted Price Mechanisms.

We restrict our attention to dynamic posted price mechanisms that work according to the following protocol. At each time t, the mechanism first realizes which new resources arrive at this time and hence become active, and which resources departed at the end of time t−1 and hence are not active anymore. It then posts a vector p_t of prices for the active resources, where p_t(i) is the price that the arriving user t pays if she is allocated a unit of active resource i. It then elicits the arriving user's valuation function v_t. Finally, a unit of each resource in the demand set X_t of user t is allocated to that user and she pays ∑_{i ∈ X_t} p_t(i) to the mechanism, where

X_t ∈ argmax_{X ⊆ A_t} ( v_t(X) − ∑_{i ∈ X} p_t(i) ). (1)

The crux of posted price mechanisms is that if the price vector p_t is determined independently of the valuation function v_t, then it is a dominant strategy for the (myopic) user t to report v_t truthfully. To ensure inventory feasibility of the posted price mechanism, we force the price of an exhausted resource (one whose surplus has dropped to zero) to be a maximal blocking price; in this way, since every valuation is strictly smaller than this blocking price, a user's utility would be negative upon receiving any unit of an exhausted resource, which cannot happen, as the utility obtained from the empty set is 0.
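For concreteness, the following Python sketch implements one round of this protocol under the assumptions above; BLOCKING_PRICE, the brute-force bundle search, and the dictionary representations are our own illustrative choices rather than anything prescribed by the paper.

from itertools import combinations
from typing import Dict, FrozenSet, Tuple

BLOCKING_PRICE = 1.0  # assumed to strictly exceed every valuation

def post_and_sell(prices: Dict[str, float],
                  surplus: Dict[str, int],
                  valuation: Dict[FrozenSet[str], float]
                  ) -> Tuple[FrozenSet[str], float]:
    """One round: exhausted resources get the blocking price, the user
    picks a utility-maximizing bundle (Eq. (1)), and pays the posted prices."""
    effective = {i: (BLOCKING_PRICE if surplus[i] == 0 else p)
                 for i, p in prices.items()}
    items = list(effective)
    best_bundle, best_utility = frozenset(), 0.0  # empty bundle has utility 0
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            bundle = frozenset(combo)
            utility = valuation.get(bundle, 0.0) - sum(effective[i] for i in bundle)
            if utility > best_utility:
                best_bundle, best_utility = bundle, utility
    payment = sum(effective[i] for i in best_bundle)
    for i in best_bundle:            # allocate one unit of each demanded resource
        surplus[i] -= 1
    return best_bundle, payment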

In this paper, we aim for a posted price mechanism that maximizes the extracted revenue, which is the total expected payment received from all users, where the expectation is over the internal randomness of the pricing mechanism (as we allow randomized prices). (The techniques we use in this paper are applicable also to the objective of maximizing social welfare.)

Online Learning, Feasible One-shot Policies & Regret.

To design posted price mechanisms, we consider a full-information adversarial online learning framework. At each time t, a decision maker picks the price vector p_t and an oblivious adversary simultaneously picks the valuation function v_t. The adversary also selects the arrival times, expiry times, and inventory sizes of the resources upfront, and reveals them to the decision maker in an online fashion (upon the arrival or departure of a dynamic resource) as discussed earlier. Roughly speaking, the goal of the decision maker is to generate a sequence of prices that extracts as much revenue as possible.

Formally, let s_t denote the vector of remaining inventories of the active resources at time t, also termed the active resources' surplus vector. A (feasible) one-shot pricing policy π is then a mapping from each possible surplus vector s to a price vector π(s), such that (i) π(s) has the same dimension as s; and (ii) if s(i) = 0 for some resource i, then π(s) assigns resource i the blocking price. (Recall that every valuation is strictly less than the blocking price; therefore, a feasible pricing policy ensures that a resource with zero surplus will not be allocated to a user.) Given a pricing policy π, we consider as a benchmark the total revenue extracted by a decision maker who follows π. This decision maker posts prices recursively, i.e., by applying the mapping π repeatedly to the resulting surplus vectors and posting the prices π(s_t) at times t = 1, …, T. Denoting by Rev_t(π) the payment collected from user t in this process, the revenue generated by π can be written as Rev(π) = ∑_{t=1}^{T} Rev_t(π).

Now, consider a collection Π of feasible one-shot pricing policies. The quality of a posted price mechanism is measured by means of the decision maker's regret, which compares her own revenue to the revenue generated by the best pricing policy in Π in hindsight. Formally, the regret (with respect to Π) is defined to be

Regret = max_{π ∈ Π} Rev(π) − E[ ∑_{t=1}^{T} q_t ],

where q_t is the payment collected by the mechanism from user t and the expectation is taken over the decision maker's randomness. The mechanism is called no-regret if the decision maker's regret is guaranteed to be sublinear in T, which means that the average regret per time unit vanishes as T → ∞.
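To make the benchmark concrete, here is a hedged Python sketch that simulates a fixed one-shot pricing policy on a DRACC instance and computes the regret against a finite collection; it reuses the hypothetical Resource and post_and_sell helpers from the earlier sketches.

from typing import Callable, Dict, FrozenSet, List

# A one-shot pricing policy maps the current surplus vector to a price vector
# over the same active resources (assumed signature; see the definition above).
Policy = Callable[[Dict[str, int]], Dict[str, float]]

def revenue_of_policy(policy: Policy,
                      resources: List["Resource"],
                      valuations: List[Dict[FrozenSet[str], float]]) -> float:
    """Simulate a fixed one-shot policy on a DRACC instance and return Rev(pi)."""
    T = len(valuations)
    surplus: Dict[str, int] = {}
    total = 0.0
    for t in range(1, T + 1):
        for r in resources:               # arrivals enter at full capacity
            if r.arrival == t:
                surplus[r.name] = r.capacity
        prices = policy(surplus)          # post pi(s_t)
        _, payment = post_and_sell(prices, surplus, valuations[t - 1])
        total += payment
        for r in resources:               # expired resources leave the surplus vector
            if r.expiry == t:
                surplus.pop(r.name, None)
    return total

def regret(mechanism_revenue: float, policies: List[Policy],
           resources: List["Resource"],
           valuations: List[Dict[FrozenSet[str], float]]) -> float:
    """Regret of a mechanism's realized revenue against a finite policy collection."""
    best = max(revenue_of_policy(p, resources, valuations) for p in policies)
    return best - mechanism_revenue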

Modeling Assumptions.

In our treatment of DRACC, we make a few assumptions:

  • Bounded parameters: The collection of pricing policies is given as input by an oblivious adversary, and its size is assumed to satisfy

    (2)

    where the parameter is defined to be a non-negative real number such that for any dynamic resource ,

    (3)
  • Finite discrete prices: posted prices are restricted to a finite subset of admissible prices that explicitly contains the blocking price. (Since this collection is specified by the adversary, its size does not directly depend on the remaining instance parameters; thus, we do not make any further assumptions on how those parameters are bounded.)

2.2 Dd-MDP: Notations, Basics & Online Learning

We start by formalizing the static decision process. We then introduce our dynamic setting, which is a generalization of the online learning framework introduced in Dekel and Hazan (2013) to a setting where rewards of the decision process are non-stationary and selected by an oblivious adversary.

The static setting.

A deterministic Markov decision process (d-MDP) is associated with a set S of states and a set A of actions. Each state s is associated with a subset A(s) ⊆ A of actions, called the feasible actions of s. For each state s and each feasible action a ∈ A(s), a state transition function σ maps the state-action pair (s, a) to a new state σ(s, a). This induces a directed graph with vertex set S, termed the state transition graph, where a node s is connected to a node s' by a directed edge labeled a ∈ A(s) iff σ(s, a) = s'. Usually, the transition digraph is assumed to be strongly connected. There is also a transition reward function r that maps each state-action pair (s, a) with a ∈ A(s), i.e., each directed edge of the transition graph, to a real value in [0, 1]. Generally speaking, the goal of a decision maker in this setup is to pick a sequence of actions and move between states so as to maximize her accumulated reward.

Dynamic Deterministic MDP & Online Learning.

Our proposed online learning framework, which generalizes the setup of Dekel and Hazan (2013), is a sequential game played between an online decision maker and an adversary. The game is defined by a set S of states, a set A of actions, feasible action sets A(s) (as in the static d-MDP), and a number T of rounds. We further assume that states and actions are discrete and finite. The game is played as follows. The decision maker starts from an initial state s_1. At each time t, the decision maker plays a (randomized) feasible action a_t ∈ A(s_t), where s_t is the current state at the beginning of time t. Simultaneously, the adversary selects the state transition function σ_t and the reward function r_t of time t. We further restrict our attention to non-adaptive adversaries; this means that the adversary chooses its transition and reward functions at the beginning of the game, possibly based on the algorithm that the decision maker will use, but without knowing the actual realization of its actions. After the selections of the two parties, the decision maker moves to the new state s_{t+1} = σ_t(s_t, a_t) (which can be viewed as a movement along the directed edge labeled a_t in the transition graph defined by σ_t), obtains a reward of r_t(s_t, a_t), gets to know the state transition function σ_t and the transition reward function r_t, and finally the sequential game moves to its next round. Notably, the transition graphs at times t = 1, …, T are not assumed to be strongly connected; this feature of Dd-MDPs makes it difficult or even impossible to plan a path of a desired length between two given states as in Dekel and Hazan (2013). The decision maker aims to maximize the expected value of her cumulative reward.
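The following Python sketch captures the order of play in this sequential game; the interfaces (the learner callable and the per-round (sigma_t, r_t) pairs) are assumed abstractions for illustration, not anything prescribed by the paper.

from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable
Transition = Callable[[State, Action], State]   # sigma_t
Reward = Callable[[State, Action], float]       # r_t, assumed to lie in [0, 1]

def play_dd_mdp(initial_state: State,
                feasible: Callable[[State], List[Action]],
                learner: Callable[[State, List[Action]], Action],
                rounds: List[Tuple[Transition, Reward]]) -> float:
    """Protocol sketch: at each round the learner commits to a feasible action
    before seeing (sigma_t, r_t); only afterwards are they revealed and applied."""
    state, total = initial_state, 0.0
    for sigma_t, r_t in rounds:                    # adversary's choices, fixed upfront
        action = learner(state, feasible(state))   # decision made without sigma_t, r_t
        total += r_t(state, action)
        state = sigma_t(state, action)             # revealed transition is now applied
    return total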

Feasible One-shot Policies, Simulation & Regret.

Similar to our treatment of the DRACC problem in Section 2.1, we consider one-shot policies for the Dd-MDP problem. Formally, a one-shot policy π maps each state s to an action π(s). A policy is said to be feasible if π(s) ∈ A(s) for every state s. Now we can consider the simulation of a fixed policy π from time 1 to T by a decision maker in the above sequential game. Formally, define the simulated states s_t^π and actions a_t^π recursively by s_1^π = s_1, a_t^π = π(s_t^π), and s_{t+1}^π = σ_t(s_t^π, a_t^π).

The cumulative reward obtained by simulating a fixed policy π is ∑_{t=1}^{T} r_t(s_t^π, a_t^π). Now, given a (finite) set Π of feasible policies, the regret of the decision maker, who (sequentially) generates state-action pairs (s_t, a_t), with respect to the best in-hindsight fixed policy in Π is defined as

Regret = max_{π ∈ Π} ∑_{t=1}^{T} r_t(s_t^π, a_t^π) − E[ ∑_{t=1}^{T} r_t(s_t, a_t) ]. (4)

An online decision maker is said to be no-regret if her regret is sublinear in T, which means that the average regret vanishes as T → ∞. Our objective is to develop no-regret online decision-making algorithms for this setting.
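Continuing with the type aliases of the previous sketch, the next snippet shows how a fixed one-shot policy is simulated and how the policy regret of Eq. (4) would be computed against a finite collection; again, the interfaces are our own simplifications.

from typing import Callable, List, Tuple
# State, Action, Transition, Reward are the aliases defined in the previous sketch.

def simulate_policy(policy: Callable[[State], Action],
                    initial_state: State,
                    rounds: List[Tuple[Transition, Reward]]) -> float:
    """Reward of a fixed one-shot policy, simulated on the revealed (sigma_t, r_t)."""
    state, total = initial_state, 0.0
    for sigma_t, r_t in rounds:
        action = policy(state)
        total += r_t(state, action)
        state = sigma_t(state, action)
    return total

def policy_regret(algorithm_reward: float,
                  policies: List[Callable[[State], Action]],
                  initial_state: State,
                  rounds: List[Tuple[Transition, Reward]]) -> float:
    """Policy regret as in Eq. (4): best simulated policy minus the algorithm's reward."""
    best = max(simulate_policy(p, initial_state, rounds) for p in policies)
    return best - algorithm_reward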

We conclude this section by showing the impossibility of no-regret for general Dd-MDP (similar to Arora et al. (2012)).

Proposition 2.1.

There is an instance of the Dd-MDP problem in which any online learning algorithm incurs regret that is linear in the number of rounds T.

Proof.

Consider a simple scenario where there are only two states and two actions, both of which are feasible in both states. Without loss of generality, let a be the action that the algorithm would choose with probability at least 1/2 at time 1. Now consider an adversary that works in the following manner. At each time, it assigns reward 1 to every action taken at one of the two states and reward 0 to every action taken at the other, and it sets the state transitions so that playing a at time 1 traps the decision maker in the zero-reward state, whereas playing the other action moves her to the one-reward state and keeps her there. In such a case, the expected cumulative reward of the algorithm is at most half of the cumulative reward of the best fixed policy in hindsight, which always plays the other action. Therefore, the regret is linear in T. ∎

3 No-regret for Dd-MDP with Dominance and Chasing

Motivated by the DRACC problem discussed in Section 2.1, and especially by its application to online job scheduling, which we elaborate on in Section 5, we restrict our attention to a special class of Dd-MDP instances. In particular, we consider the class of instances that satisfy a structural condition called “Dominance and Chasing”.

3.1 Dominance and Chasing Conditions

For an instance of the Dd-MDP problem, we say that this instance satisfies the approximate Dominance & Chasing condition with a given parameter if there exists a sequence of states s*_1, …, s*_T, referred to as dominant states, that satisfies the following two requirements (a compact restatement in explicit notation is given after the list):

  • Domination: for any policy π and any time interval, replaying the simulated actions a_t^π = π(s_t^π) of π over that interval, but starting from the dominant state at the beginning of the interval, yields at least as much cumulative reward as the simulation of π itself; formally,

    (5)

    where the comparison states z_t are defined by z_τ = s*_τ at the first round τ of the interval and z_{t+1} = σ_t(z_t, a_t^π) thereafter. Moreover, it is required that a_t^π is a feasible action for state z_t, i.e., a_t^π ∈ A(z_t), for every t in the interval.

  • Online Chasing: there exists a chasing oracle that works as follows. For any initial time τ and initial state s, it generates a sequence of (possibly randomized) actions (and, implicitly, a sequence of visited states), starting from state s at time τ and terminating at some round τ′ ≥ τ, where

    (6)

    Moreover, each generated action is feasible for the state at which it is taken, for every round from τ to τ′. We assume the chasing oracle satisfies two properties:

    (a) Termination at a dominant state: either the dominant state s*_{τ′} is reached at round τ′, or τ′ = T (the time horizon is exhausted).

    (b) Bounded chasing regret: the Chasing Regret (CR) of the oracle is bounded by the condition's parameter.
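To fix ideas, here is one way to write these two requirements explicitly. The notation is ours (s*_t for the dominant states, s_t^π and a_t^π = π(s_t^π) for the simulated states and actions, σ_t and r_t for the revealed transition and reward functions, (ŝ_t, â_t) for the oracle's trajectory, and ε for the condition's parameter), and the formalization of the chasing regret in particular is only a plausible reading, so the exact form should be checked against the paper's own statement.

% Domination (Eq. (5)): replaying pi's simulated actions from the dominant state
% s^*_{\tau_1} never decreases the reward collected over the interval [\tau_1, \tau_2]:
\sum_{t=\tau_1}^{\tau_2} r_t\!\left(s_t^{\pi}, a_t^{\pi}\right)
  \;\le\;
\sum_{t=\tau_1}^{\tau_2} r_t\!\left(z_t, a_t^{\pi}\right),
\qquad z_{\tau_1} = s^*_{\tau_1},
\quad z_{t+1} = \sigma_t\!\left(z_t, a_t^{\pi}\right).

% Bounded chasing regret (one plausible formalization): the oracle, started at
% state s at time \tau and stopping at \tau', forgoes at most \varepsilon reward
% relative to the simulation of any policy over the same window:
\max_{\pi \in \Pi}\;
\sum_{t=\tau}^{\tau'} r_t\!\left(s_t^{\pi}, a_t^{\pi}\right)
  \;-\;
\mathbb{E}\!\left[\sum_{t=\tau}^{\tau'} r_t\!\left(\hat{s}_t, \hat{a}_t\right)\right]
  \;\le\; \varepsilon .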

3.2 Online Learning with Switching Cost

In order to design no-regret algorithms for Dd-MDPs under dominance & chasing, we develop techniques related to the full-information adversarial model of Online Learning with Switching Cost (OLSC) of Kalai and Vempala (2005). We briefly describe this framework here. Using a blackbox algorithm for this problem, Section 3.3 shows how to obtain vanishing regret for our problem.

In the OLSC problem, there is a set K of actions/arms/experts and a number T of rounds. At each round t, an adversary specifies a reward function g_t over K, which is unknown to the online algorithm at the beginning of this round. Simultaneously, the algorithm chooses an action k_t ∈ K. Then the reward function g_t is revealed to the algorithm. The goal of the algorithm is to pick a sequence of actions in an online fashion so as to maximize ∑_{t=1}^{T} g_t(k_t) − γ · |{ 1 < t ≤ T : k_t ≠ k_{t−1} }|, where the subtracted term is essentially the extra cost that the algorithm incurs due to switching its actions and the parameter γ is referred to as the switching cost. The regret is defined to be

Regret = max_{k ∈ K} ∑_{t=1}^{T} g_t(k) − E[ ∑_{t=1}^{T} g_t(k_t) − γ · |{ 1 < t ≤ T : k_t ≠ k_{t−1} }| ].

Theorem 3.1 (Kalai and Vempala (2005)).

For the OLSC problem with a switching cost , there exists an algorithm with regret .
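As a concrete (and deliberately simplified) illustration of the kind of algorithm Theorem 3.1 refers to, here is a follow-the-perturbed-leader sketch for the full-information setting. The switch-limiting "lazy leader" coupling that Kalai and Vempala (2005) use to control the switching cost is omitted, so this is only an interface-level sketch and not their algorithm.

import random
from typing import Dict, Hashable, List

Arm = Hashable

class PerturbedLeader:
    """Minimal follow-the-perturbed-leader sketch for full-information OLSC."""

    def __init__(self, arms: List[Arm], scale: float, seed: int = 0):
        rng = random.Random(seed)
        # One perturbation per arm, drawn once upfront.
        self.noise: Dict[Arm, float] = {a: rng.expovariate(1.0 / scale) for a in arms}
        self.cumulative: Dict[Arm, float] = {a: 0.0 for a in arms}

    def select(self) -> Arm:
        # Play the arm whose perturbed cumulative reward is largest so far.
        return max(self.cumulative, key=lambda a: self.cumulative[a] + self.noise[a])

    def update(self, rewards: Dict[Arm, float]) -> None:
        # Full information: the reward of every arm is revealed after the round.
        for a, g in rewards.items():
            self.cumulative[a] += g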

3.3 Chasing & Simulation Algorithm and Analysis

We now present our algorithm for the class of Dd-MDP instances satisfying the Dominance & Chasing condition. Our algorithm, called Chasing and Simulation (C&S), requires blackbox access to an algorithm A for the OLSC problem, where the action set is the collection Π of policies for the Dd-MDP problem, there are T rounds, and the switching cost is set to account for the loss incurred during each chasing phase (i.e., the bound on the chasing regret). With blackbox access to the chasing oracle (with bounded chasing regret) and to an algorithm A for OLSC (with the regret guarantee of Theorem 3.1), the procedure of C&S is described in Algorithm 1. It runs from the given initial state as follows. At the beginning of each round t, C&S invokes algorithm A to choose a policy π_t from Π. If π_t is different from the policy chosen in the previous round, or t = 1, then C&S stops the chasing oracle if it is running, starts a new run of the oracle with the current state as the initial state and t as the initial time, and takes the action generated by the oracle. If π_t is unchanged and the oracle is still running, then C&S again takes the action yielded by the oracle. Otherwise, C&S computes the action by simulating the policy π_t up to time t, and takes this action. After C&S makes its decision for round t, the reward function r_t is revealed. C&S then computes, by simulation, the simulated reward of every policy π ∈ Π at round t, and feeds it to algorithm A as the reward of arm π at round t.

Input: Policy set Π, OLSC algorithm A, chasing oracle O, initial state s_1;
Output: Sequence of actions a_1, …, a_T and, implicitly, the sequence of visited states;
Start from the initial state s_1;
for each round t = 1, …, T do
      Invoke algorithm A to pick a policy π_t ∈ Π at the beginning of round t;
      if t = 1 or π_t ≠ π_{t−1} then
            Stop the current run of oracle O, if it is still running;
            Start a new run of oracle O with the current state as the initial state and t as the initial time;
            Pick the action chosen by oracle O;
      else if oracle O is still running then
            Pick the action chosen by oracle O;
      else
            Compute the action a_t^{π_t} by simulating policy π_t up to time t, and pick it;
      end if
      for each policy π ∈ Π do
            Compute the simulated reward r_t(s_t^π, π(s_t^π)) by simulating policy π up to time t;
            Feed this value to A as the reward of arm π at round t;
      end for
end for
ALGORITHM 1 Chasing & Simulation
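For readers who prefer code to pseudocode, the following Python sketch mirrors Algorithm 1 at the interface level; olsc, make_oracle, simulate_prefix, and step are assumed abstractions (our names), and the bookkeeping is deliberately simplified. If the arms of olsc are policy indices, this is compatible with the PerturbedLeader sketch given earlier.

from typing import Callable, Hashable, List

State = Hashable
Action = Hashable

def chasing_and_simulation(policies: List[Callable[[State], Action]],
                           olsc,                 # OLSC algorithm A (select/update interface)
                           make_oracle,          # factory: (state, time) -> chasing oracle
                           simulate_prefix,      # (policy, t) -> (simulated state, action) at round t
                           initial_state: State,
                           T: int,
                           step) -> float:
    """Sketch of C&S: chase a dominant state after every policy switch,
    then follow the newly chosen policy by simulation."""
    state, total, prev_idx, oracle = initial_state, 0.0, None, None
    for t in range(1, T + 1):
        idx = olsc.select()                       # index of the policy suggested by A
        if idx != prev_idx:                       # covers t == 1 as well (prev_idx is None)
            oracle = make_oracle(state, t)        # restart chasing from the current state
        if oracle is not None and not oracle.done():
            action = oracle.next_action(state)    # chasing phase
        else:
            oracle = None
            _, action = simulate_prefix(policies[idx], t)       # simulation phase
        reward, next_state, revealed = step(t, state, action)   # environment reveals (r_t, sigma_t)
        total += reward
        # Feed every arm its simulated reward at round t, as Algorithm 1 prescribes.
        olsc.update({i: revealed.simulated_reward(p, t) for i, p in enumerate(policies)})
        state, prev_idx = next_state, idx
    return total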
Theorem 3.2.

The regret of C&S (Algorithm 1) is bounded by .

Proof.

Divide the rounds 1, …, T into a sequence of consecutive episodes so that the policy chosen by A does not change during an episode, and a new episode starts whenever A changes its chosen policy. Let τ and τ′ denote the first and the last round of a generic episode, respectively, and consider an arbitrary episode.

First, suppose that in this episode the chasing oracle does not terminate before or at the last round of the episode. Since the C&S algorithm follows the actions generated by the oracle throughout the episode, the definition of chasing regret gives

Now consider the case where, in this episode, the chasing oracle terminates at some round before the last one. Again, by the definition of chasing regret, we have:

Moreover, the chasing oracle ensures that C&S is at the dominant state at the beginning of the round that follows the termination of chasing. Using the domination condition, we have

where the comparison is with the sequence of states generated by following the actions obtained from simulating the chosen policy from the beginning, but this time starting from the dominant state at the round where chasing terminated (as described in the domination property in Section 3.1). Therefore,

For any policy π ∈ Π, we have

where, for every episode and every round, the quantity fed to A is the simulated reward of the corresponding policy, as defined in Algorithm 1. By Theorem 3.1, the formula above is bounded by the regret guarantee of A, which completes the proof. ∎

So far, we have only considered the notion of policy regret defined in Eq. 4. To complement our result, we consider another natural alternative definition for regret, known as external regret, defined as follows (see Arora et al. (2018) for more details).

(7)

In words, while policy regret is the difference between the simulated reward of the optimal fixed policy and the actual reward of the algorithm, in external regret the reward credited to the optimal fixed policy in each round is the reward that this policy would have obtained from the actual state of the algorithm (rather than from its own simulated state). In Section 7, we show that for Dd-MDPs satisfying the dominance & chasing condition, obtaining both sublinear external regret and sublinear policy regret is impossible; hence, we focus on obtaining vanishing policy regret.

4 No-regret Posted Pricing: Reduction from DRACC to Dd-MDP

This section designs a learning-based posted price mechanism (LBPP) for the DRACC problem using the C&S algorithm, and proves that this mechanism is no-regret.

The mechanism LBPP works as follows. It first provides the initial input parameters to the C&S algorithm, including the collection Π of pricing policies, the OLSC algorithm A, the chasing oracle, and the initial state. In particular, the existence of the algorithm A is guaranteed by Theorem 3.1, and the desired chasing oracle is constructed at the end of this section. Without loss of generality, we assume that the earliest resource arrival occurs at time 1, because otherwise the users arriving before any resource becomes active can never get any resource. Therefore, the initial state can be specified by setting it to the surplus vector in which every resource arriving at time 1 has its full capacity, and this can be done before the arrival of the first user.

When the first user arrives, LBPP invokes C&S to obtain the first action and posts it as the price vector for user 1. Next, upon the arrival of each user t ≥ 2, LBPP constructs the reward function r_{t−1} and the state transition function σ_{t−1} of the previous round, which are described in detail below. After these two functions are fed to C&S, C&S chooses a new action, and LBPP posts it as the price vector for user t. This process proceeds iteratively until all users have been processed.

For every round t, the reward function r_t is constructed for every state-action pair (s, p) with p feasible for s as follows:

r_t(s, p) = ∑_{i ∈ X_t} p(i), (8)

where X_t is the demand set of user t under the price vector p, as defined in Eq. (1); thus r_t(s, p) equals the payment of user t given the price vector p. Note that the reward function of the final round can be defined in the same way, although it is only used in the analysis. By Eq. (1), the computation in Eq. (8) relies only on v_t, which is available once user t has (truthfully) reported her valuation, due to the truthfulness of LBPP. Therefore, the reward function is computable.

Recall that the valuation function of every user is fixed by the adversary in advance and does not change with the state. It can therefore be inferred from Eq. (8) that for any two states and any action that is feasible for both, performing the action on either state gives the same reward. A formal statement of this observation follows.

Lemma 4.1.

For each round t and any two states s and s', it holds for every action p that is feasible for both states that r_t(s, p) = r_t(s', p).

Now let us consider the construction of the state transition function σ_t. Recall that each state corresponds to a surplus vector of the currently active resources. The surplus vector of the next state σ_t(s, p) is constructed as follows: for each resource i that remains active, its surplus is decreased by one if i belongs to the demand set of user t under the price vector p (cf. Eq. (1)) and is left unchanged otherwise; resources that expire at the end of time t are dropped, and resources that arrive at the next time are added with their full capacities.

For two states s and s', we say s ⪰ s' if s and s' encode the surpluses of the same set of active resources and, for each such resource i, s(i) ≥ s'(i). The construction of the state transition function then gives the following lemma.
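This reduction step can be sketched as follows in Python; demand_set is an assumed helper that computes the Eq. (1) bundle (as inside post_and_sell above), and all names are ours.

from typing import Dict, FrozenSet, List

def build_round_functions(valuation: Dict[FrozenSet[str], float],
                          arriving: List["Resource"],
                          expiring: List[str]):
    """Sketch of the reduction: given user t's elicited valuation, build the
    reward r_t and transition sigma_t that are fed to C&S."""

    def reward(surplus: Dict[str, int], prices: Dict[str, float]) -> float:
        bundle = demand_set(prices, surplus, valuation)      # Eq. (1)
        return sum(prices[i] for i in bundle)                # Eq. (8): user t's payment

    def transition(surplus: Dict[str, int], prices: Dict[str, float]) -> Dict[str, int]:
        bundle = demand_set(prices, surplus, valuation)
        nxt = {i: c - (1 if i in bundle else 0) for i, c in surplus.items()}
        for i in expiring:                                    # departed resources drop out
            nxt.pop(i, None)
        for r in arriving:                                    # new resources enter at capacity
            nxt[r.name] = r.capacity
        return nxt

    return reward, transition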

Lemma 4.2.

For any two states s and s' with s ⪰ s', any action p that is feasible for both states, and any round t, it holds that σ_t(s, p) ⪰ σ_t(s', p).

Proof.

Let u = σ_t(s, p) and u' = σ_t(s', p). For every newly arriving resource i, it trivially holds that u(i) = u'(i), as both equal the full capacity of i.

Now consider a resource i that is active at both time t and the next time. By definition, whether i belongs to the demand set of user t is fully determined by the valuation function of user t and the price vector p; therefore, this indicator does not depend on the state. Then

u(i) = s(i) − 1[i is purchased by user t] ≥ s'(i) − 1[i is purchased by user t] = u'(i).

The inequality above holds because s ⪰ s'. ∎

Lemma 4.3.

For any two states s and s' with s ⪰ s', if an action p is feasible for s', then it is feasible for s.

Proof.

Let R be the set of active resources whose surpluses are encoded by s and s'. For any resource i ∈ R, if s(i) = 0, then by the definition of ⪰ we have s'(i) = 0 as well. Since p is feasible for s', we know that for every such i, p(i) is the blocking price. Therefore, p is feasible for s. ∎

For each time t, let s*_t be the surplus vector in which every resource that is active at time t appears with its full capacity. Then we have the following results.

Lemma 4.4.

The sequence of states s*_1, …, s*_T defined above satisfies the domination condition.

Proof.

Fix a policy π and a time interval, and let z_t denote the states obtained by replaying the simulated actions a_t^π = π(s_t^π) starting from the dominant state at the beginning of the interval, i.e., z_{t+1} = σ_t(z_t, a_t^π) for every t in the interval. Since the dominant state encodes the full capacities of the resources active at the beginning of the interval, for any policy π it trivially holds that z ⪰ s^π at that time. Now suppose that z_t ⪰ s_t^π for some t. By definition, the action a_t^π is feasible for s_t^π; therefore, it can be inferred from Lemma 4.3 that a_t^π is also feasible for z_t. Using Lemma 4.2, we obtain that z_{t+1} ⪰ s_{t+1}^π.

Therefore, it can be proved inductively that z_t ⪰ s_t^π holds for every t in the interval. By Lemma 4.3, this ensures that the action a_t^π is feasible for z_t. Furthermore, by Lemma 4.1, it holds for every t that r_t(z_t, a_t^π) = r_t(s_t^π, a_t^π). This completes the proof. ∎

Lemma 4.5.

There exists a chasing oracle whose CR is bounded by .

Proof.

We show the existence of such an oracle by constructing one that satisfies the online chasing condition and attains the desired CR bound. Given a starting round τ and a starting state as its initial parameters, the oracle works as follows. At each round from τ onward, if every currently active resource has its full capacity remaining (i.e., the current state is a dominant state), then the oracle terminates; otherwise, it posts the price vector that assigns the blocking price to every active resource. Such an action is trivially feasible, and it makes the arriving user purchase nothing, since every valuation is strictly smaller than the blocking price.

It is easy to see that when the oracle terminates at some round, it satisfies the requirement that either a dominant state has been reached or the time horizon has been exhausted. Now we proceed to analyze the CR.
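A minimal Python sketch of this blocking oracle, under the surplus-dictionary representation used in the earlier sketches (the function name and the default blocking_price value are assumptions):

from typing import Dict

def chasing_oracle_step(surplus: Dict[str, int],
                        capacities: Dict[str, int],
                        blocking_price: float = 1.0):
    """One step of the assumed chasing oracle: block all sales until every
    active resource is back at full capacity (a dominant state).

    Returns None when the oracle terminates, otherwise the price vector to post."""
    if all(surplus[i] == capacities[i] for i in surplus):
        return None                                  # dominant state reached: stop chasing
    # Post the blocking price everywhere; no user buys, so depleted resources
    # simply expire over time while newly arriving ones enter at full capacity.
    return {i: blocking_price for i in surplus}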

Proposition 4.6.

Let , and , then .

Proof.

By contradiction, suppose that . This means that , since otherwise . In such case, let be the smallest integer in so that . For every resource , we have . This is because if there exists a resource with , then by Eq. (3), we have for every with , as . Then it holds for every with that . This gives , which conflicts with the setting .

For every , since , it cannot be allocated to any user because is set to . Therefore, the state reached by at round contains the full capacity of every active resource, which means that it is a dominant state. This conflicts with the termination condition of the chasing oracle. Therefore, this proposition holds. ∎

Proposition 4.6 implies that the total number of resources that can be assigned to the users in is no more than . Since the payment for any resource is less than , it holds for any that

where because the payment obtained from every user in is less than . Since the total payment is always non-negative, the CR is at most . ∎

Lemma 4.4 and Lemma 4.5 imply that the constructed Dd-MDP instance satisfies the Dominance & Chasing condition. The following result can be directly inferred from Theorem 3.2, Lemma 4.4, and Lemma 4.5.

Theorem 4.7.

The regret of LBPP is bounded by