1. Introduction
Everyday life is replete with settings where we have to make decisions while facing uncertainty over future outcomes. Examples include allocating cloud resources, matching an empty car to a ridesharing passenger, displaying online ads, selling airline seats and hotel rooms, and hiring candidates to fill open positions. In many of these instances, the underlying arrivals arise from some known generative process. Even when the underlying model is unknown, companies can turn to ever-improving machine learning tools to build predictive models based on past data. This raises a fundamental question in online decision-making:
how can we use predictive models to make good decisions?
Broadly speaking, an online decision-making problem is defined by a current state and a set of actions, which together determine the next state as well as generate rewards. In stochastic online decision-making settings (a.k.a. Markov decision processes or MDPs), the rewards and state transitions are also affected by some random shock. Optimal policies for such problems are known only in some special cases, when the underlying problem is sufficiently simple and knowledge of the generative model is sufficiently detailed. More generally, MDP theory (bertsekas1995dynamic) asserts that optimal policies for general MDPs can be computed via stochastic dynamic programming. For many problems of interest, however, such an approach is infeasible for two reasons: insufficiently detailed models of the generative process of the randomness, and the complexity of computing the optimal policy (the so-called ‘curse of dimensionality’). These shortcomings have inspired a long line of work on approximate dynamic programming (ADP). Our paper follows in this tradition, but also aims to build deeper connections to Bayesian learning theory to better understand the use of prediction oracles in decision-making.
Our work focuses on two important classes of online decision-making problems: online packing and online matching. In brief, these problems involve a set of distinct resources, and a principal with some initial budget vector of these resources, which have to be allocated among incoming agents. Each agent has a type comprising some specific requirements for resources and associated rewards. The agent types are known in aggregate, but the exact type becomes known only when the agent arrives. The principal must make irrevocable accept/reject decisions to try to maximize rewards, while obeying the budget constraints.
Online packing and matching problems are fundamental in MDP theory; they have a rich existing literature and widespread applications in many domains. Nevertheless, our work develops new policies for both these problems which admit performance guarantees that are order-wise better than existing approaches. These policies can be stated in classical ADP terms (for example, see Algorithms 2 and 3), but draw inspiration from ideas in Bayesian learning. In particular, our policies can be derived from a meta-algorithm, the Bayes selector (Algorithm 1), which makes use of a black-box prediction oracle to obtain statistical information about a chosen offline benchmark, and then acts on this information to make decisions. Such policies are simple to define and implement in practice, and our work provides new tools for bounding their regret vis-à-vis the offline benchmark. Though we focus on online packing and matching problems, we believe our approach provides a new way of designing and analyzing online decision-making policies using predictive models.
1.1. Our Contributions
We believe our contributions in this work are threefold:

Technical: We present a new stochastic coupling technique, which we call the compensated coupling, for evaluating the regret of online decision-making policies vis-à-vis offline benchmarks.

Methodological: Inspired by ideas from Bayesian learning, we propose a class of policies, expressed as the Bayes Selector, for general online decision-making problems.

Algorithmic: For online packing and matching problems, we prove that the Bayes Selector gives regret guarantees that are independent of the size of the state space, i.e., constant with respect to the horizon length and budgets.
Organization of the paper:
We first formally introduce the online packing and matching problems in Section 2, define the notion of prophet benchmarks, and discuss the shortcomings of prevailing approaches to such problems. Next, in Section 3, we present our compensated coupling approach in the general context of finite-state finite-horizon MDPs. We then introduce the Bayes selector policy, and discuss how the compensated coupling approach helps provide a generic recipe for obtaining regret bounds for such a policy. Finally, in Sections 4 and 5, we use these techniques for the online packing and matching problems. In particular, in Section 4, we propose a Bayes Selector policy for such problems and demonstrate the following performance guarantee:
Theorem 1.1 (Informal).
For any online packing problem with a finite number of resources and finite type space, the Bayes Selector achieves regret which is independent of the horizon and resource budgets (both in expectation and with high probability).
In more detail, our regret bounds depend on the ‘resource matrix’ and the distribution of arriving types, but are independent of the horizon and the budgets. Moreover, the results hold under weak assumptions on the arrival process, including Multinomial and Poisson arrivals, time-dependent processes, and Markovian arrivals. This result generalizes prior and contemporaneous results (Reiman_nrm; jasin2012; wang_resolve; wu2015algorithms; itai_secretary). We show similar results for matching problems in Section 5.
1.2. Related Work
Our work is related to several active areas of research in MDPs and online algorithms. We now briefly survey some of the most relevant connections.
Approximate Dynamic Programming:
The complexity of computing optimal MDP solutions can scale with the state space, which often makes it impractical (the so-called ‘curse of dimensionality’ (powell2011approximate)). This has inspired a long line of work on approximate dynamic programming (ADP) (powell2011approximate; tsitsiklis2001regression) to develop lower-complexity heuristics. Although these methods often work well in practice, they require careful choice of basis functions, and any bounds are usually in terms of quantities which are difficult to interpret. Our work provides an alternate framework, which is simpler and has interpretable guarantees.
Model Predictive Control:
Another popular heuristic for ADP and control, which is closer to our paradigm and widely used in practice, is model predictive control (or receding horizon control) (morari1993model; borrelli2003constrained). Recently, MPC techniques have also been connected with online convex optimization (OCO) (huang2015receding; chen2016using; chen2015online) to show how prediction oracles can be used for OCO, and the resulting policies have been applied to problems in power systems and network control. These techniques, however, generally require continuous controls and do not handle combinatorial constraints.
Information Relaxation:
Parallel to the ADP focus on developing better heuristics, there is a line of work on deriving upper bounds via martingale duality, sometimes referred to as information relaxations (brown2013optimal; desai2012pathwise; brown2014information). These work by adding a suitable martingale term to the current reward, so as to penalize ‘future information’. Our approach follows a similar construction of bounds via offline benchmarks; however, we then use them to derive control policies.
Online Packing and Prophet Inequalities:
Though online packing has been widely studied in the literature, the majority of work focuses on competitive ratio bounds under worst-case distributions. In particular, there is an extensive literature on the so-called Prophet Inequalities, from the pioneering work of (hill_iid) to more recent extensions and applications to algorithmic economics (kleinberg2012matroid; duetting2017prophet; correa2017posted; alaei_bayesian). We note, however, that any competitive ratio guarantee essentially implies a linear regret, in comparison to our sublinear regret guarantees; the cost of this, however, is that our results hold under more restrictive assumptions on the inputs.
Regret bounds in online packing:
The first work to prove constant regret in a context similar to ours is (itai_secretary), who prove a similar result for the multi-secretary setting with multinomial arrivals, using an elegant policy which we revisit in Section 4.1. This result follows in a long line of work in applied probability, notable among which are those of (Reiman_nrm), which provides an asymptotically optimal policy under the diffusion scaling (i.e., scaling arrivals by and then renormalizing by ), and (jasin2012), who provide a resolving policy with constant regret for distributions obeying a certain non-degeneracy condition. More recently, (wang_resolve) extended the constant regret result of (itai_secretary) to more general packing problems with Poisson arrivals, based on a partial resolving policy. However, their resulting policy is complex and specialized for packing with i.i.d. arrivals; moreover, simulations done by the authors indicate that their policy is highly suboptimal compared to a simple ‘resolve-and-round’ heuristic, which is identical to one of our proposed Bayesian selection policies (cf. Algorithm 3).
2. Problem Setting and Overview
Our focus in this work is on the subclass of online packing problems. This is a subclass of the wider class of finite-horizon online decision-making problems: given a time horizon with discrete time slots , we need to make a decision at each time leading to some cumulative reward. Note that, throughout, our time-slot index indicates the time to go rather than elapsed time. We present the details of our technical approach in this more general context whenever possible, indicating additional assumptions when required.
In what follows, we use to indicate the set , and denote the th entry of any given matrix interchangeably by or . We work in an underlying probability space , and the complement of any event is denoted . For any optimization problem , we use to indicate its objective value.
2.1. Online Packing and Matching Problems
Online packing is a canonical subclass of online decision problems, which is widely studied across several communities. In particular, this class encompasses problems related to network resource allocation in control, network revenue management in operations research, and posted pricing with single-minded buyers in algorithmic economics.
The basic setup in online packing is as follows: There are distinct resource types denoted by the set , and at time , we have an initial availability (budget) vector . At every time , nature draws an arrival with type from a finite set of distinct types , via some distribution which is known to the algorithm designer (or principal). We denote to be the cumulative vector of the last arrivals, where .
An arrival of type corresponds to a resource request with associated reward and resource requirement , where denotes the units of resource required to serve the arrival. At each time, the principal must decide whether to accept the request (thereby generating the associated reward while consuming the required resources), or reject it (no reward and no resource consumption). Accepting a request requires that there is sufficient budget of each resource to cover the request. The principal’s aim is to make irrevocable decisions so as to maximize overall rewards.
Online matching problems are a closely related class of problems, wherein we have the same setup with resources with fixed budget , but now each type comprises a menu of (requirement, reward) pairs, and is satisfied with any option from the given set, resulting in the associated reward. The canonical example here is that of online weighted bipartite matching with static nodes (with potentially multiple copies of each node) and types of dynamic nodes, each corresponding to a subset of compatible static nodes. The types can be represented by a reward matrix and an adjacency matrix ; if the arrival is of type , we can allocate at most one resource such that , leading to a reward of . As before, any allocation to an arrival must respect the budget constraints.
Arrival Processes:
To completely define a packing/matching problem, we need to specify the generative model for the type sequence .
An important subclass here is that of stationary independent arrivals, which further admits two widely studied cases:
The Multinomial process is defined by a known distribution over ; at each time, the arrival is of type with probability , thus .
The Poisson arrival process is characterized by a known rate vector . Arrivals of each class are assumed to be independent such that . Note that, though this is a continuous-time arrival process, the principal needs to make decisions only at discrete times, based on arrivals; however, the number of arrivals (i.e., the horizon) is now random.
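To make the Multinomial arrival model and the cumulative arrival vectors concrete, the following is a minimal Python sketch. The function names, the example probabilities, and the convention of storing arrivals in order of arrival (so that the last t entries are the arrivals with at most t periods to go) are our own illustrative assumptions rather than notation from the text.

```python
import random


def sample_multinomial_arrivals(type_probs, horizon, rng):
    """Draw one type index per period, independently, from `type_probs`."""
    return rng.choices(range(len(type_probs)), weights=type_probs, k=horizon)


def cumulative_counts(arrivals, num_types, t):
    """Count arrivals of each type among the last `t` periods of the sequence."""
    counts = [0] * num_types
    # Slice from an explicit start index so that t = 0 gives an empty window.
    for j in arrivals[len(arrivals) - t:]:
        counts[j] += 1
    return counts


rng = random.Random(0)  # fixed seed only so that runs are repeatable
arrivals = sample_multinomial_arrivals([0.5, 0.3, 0.2], horizon=20, rng=rng)
print(cumulative_counts(arrivals, num_types=3, t=20))
```

Under this model the count vector over t periods has a Multinomial distribution with t trials, which is what the regularity conditions on the cumulative vectors are stated in terms of.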
More general models allow for nonstationary and/or correlated arrival processes – for example, nonhomogeneous Poisson processes, Markovian models, etc. An important feature of our framework is that it is capable of handling a wide variety of such processes in a unified manner, without requiring extensive information regarding the generative model. In particular, our results hold under fairly weak regularity conditions on the cumulative vectors . We provide the most general conditions in Section 4.3.
The prophet benchmark:
Both online packing and matching are subclasses of the more general class of finite-state, finite-horizon Markov decision processes; in Section 3.1, we introduce this more general class of problems. We discussed in Section 1 the drawbacks of using dynamic programming, hence our work instead follows the approach of designing heuristic policies with rigorous guarantees with respect to prophet benchmarks. We now describe this approach for packing problems, and formalize it more generally in Section 3.1.
Our performance guarantees are best illustrated by adopting the view that a given packing/matching problem is simultaneously solved by two ‘agents’, and , who are primarily differentiated based on their access to information. can only take non-anticipatory actions, which at each time can be based on the current state and arrival, past trajectory, and distributional information. On the other hand, at time is allowed to make decisions with full knowledge of future arrivals . Both agents start from a common initial state , experience the same arrivals, and want to maximize their total reward. Denoting the total realized rewards of and on any problem instance (i.e., sequence of arrivals) as and respectively, we define the regret of an online policy w.r.t. an offline benchmark to be the additive loss . Observe that depends on the policy used by ; the underlying policy will always be clear from context. Our aim is to design policies with low and, in particular, low dependence on the size of the state space (since the complexity of the optimal solution grows with the state space).
A natural benchmark is the optimal policy in hindsight, wherein makes allocation decisions with knowledge of future arrivals. Observe that this corresponds to solving an integer programming problem. A weaker, but more tractable benchmark is given by an LP relaxation of this policy: given arrivals vector , we assume solves the following:
(1) 
The corresponding (random) reward is denoted . Note that realizing the solution to the above LP requires to make fractional allocations. On the other hand, is constrained to make integer allocation decisions (i.e., whether to accept or reject a request) in a non-anticipatory manner.
The state space in online packing/matching problems grows at least as fast as (for independent, stationary arrivals); our results provide simple policies for such problems with regret which is independent of the state-space size.
The fluid problem and randomized allocation rules:
To understand the deficiencies in prevailing approaches for designing online policies, it is useful to focus on a canonical example: the so-called stochastic multi-secretary problem (itai_secretary) (or unit-weight online stochastic knapsack). This comprises a single resource with initial budget ; arriving requests each require one unit of resource, and a request of type has associated reward . In this case, ’s solution (based on the LP in Eq. 1) corresponds to sorting all arrivals by their reward and picking the highest . For multinomial arrivals, the optimal policy for this setting can be computed in time; nevertheless, it is instructive to study heuristics for this problem since these are useful for more complex arrival processes where the state space grows quickly.
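As a small illustration of the hindsight solution just described (sort all arrivals by reward and keep the highest ones, up to the budget), consider the following Python sketch. The function name and the example numbers are hypothetical, and we assume non-negative rewards and unit resource requirements.

```python
def offline_multisecretary_value(arrival_rewards, budget):
    """Hindsight value with unit requirements: keep the `budget` largest rewards.

    Assumes non-negative rewards, so that filling the budget is never harmful.
    """
    kept = sorted(arrival_rewards, reverse=True)[:max(budget, 0)]
    return sum(kept)


# Rewards of the arrivals on one sample path, in order of arrival.
sample_path = [1, 2, 4, 2, 1, 4]
print(offline_multisecretary_value(sample_path, budget=2))
```

The order of arrival is irrelevant for the offline agent here, which is exactly why it has freedom in how to break ties between equal-reward arrivals.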
The most common technique for obtaining online packing policies is based on the so-called fluid (or deterministic) LP benchmark . It is easy to see via Jensen’s Inequality that , and hence the fluid LP is an upper bound for any online policy. This also leads to a natural randomized control policy, wherein given any solution to the fluid LP , each class is admitted with probability . Such a policy is known to give regret w.r.t. the fluid benchmark (talluri2006theory); moreover, subsequent works (Reiman_nrm; jasin2012; wu2015algorithms) have shown that by resolving the fluid LP at each time, one can obtain regret bounds w.r.t. which are tighter in special cases (in particular, they grow when the fluid LP is close to being dual-degenerate).
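The fluid benchmark and the resulting randomized admission rule can be sketched for the multi-secretary problem with Multinomial arrivals as follows. This is a sketch under our own assumptions: the fluid LP is taken to replace each type's arrival count by its mean (horizon times probability), rewards are non-negative so that a greedy fill by reward solves the single-resource LP, and each type is admitted with probability equal to its fluid allocation divided by its expected number of arrivals. All names are hypothetical.

```python
def fluid_multisecretary(type_rewards, type_probs, horizon, budget):
    """Solve the single-resource fluid LP greedily and derive admission probabilities."""
    expected = [horizon * p for p in type_probs]  # mean arrivals per type
    allocation = [0.0] * len(type_rewards)
    remaining = float(budget)
    # With one resource and unit requirements, filling by decreasing reward is optimal.
    for j in sorted(range(len(type_rewards)), key=lambda j: -type_rewards[j]):
        allocation[j] = min(expected[j], remaining)
        remaining -= allocation[j]
    value = sum(r * x for r, x in zip(type_rewards, allocation))
    admit_prob = [
        allocation[j] / expected[j] if expected[j] > 0 else 0.0
        for j in range(len(type_rewards))
    ]
    return value, admit_prob


value, admit_prob = fluid_multisecretary([1, 2, 4], [0.5, 0.3, 0.2], horizon=10, budget=4)
print(value, admit_prob)
```

In this example the highest-reward type is always admitted, the middle type is admitted only part of the time, and the lowest type is never admitted; the fractional middle type is where the randomized rule loses relative to hindsight.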
Despite these successes, the following result shows that the approach of using as a benchmark can never lead to a constant regret policy, as the fluid benchmark can be far off from the optimal solution in hindsight.
Proposition 2.1.
For any online packing problem, if the arrival process satisfies the Central Limit Theorem and the fluid LP is dual degenerate, then
.
This gap has been reported in the literature, both informally and formally (see (itai_secretary; wang_resolve)); for completeness, we provide a proof in Appendix A. Note though that this gap does not pose a barrier to showing constant-factor competitive ratio guarantees (and hence the fluid LP benchmark is widely used for prophet inequalities), but rather, that it is a barrier for obtaining regret bounds. Breaking this barrier thus requires a fundamentally new approach.
Overview of our results:
Our approach can be viewed as a meta-algorithm that uses black-box prediction oracles to make decisions. The quantities estimated by the oracles are related to our offline benchmark and can be interpreted as probabilities of regretting each particular action in hindsight. Note that such estimates can easily be obtained, for example, via simulation given knowledge of the arrival process. Moreover, a natural ‘Bayesian selection’ strategy given such estimators is to adopt the action that is least likely to cause regret in hindsight. This is precisely what we do in Algorithm 1, and hence we refer to it as the Bayes Selector policy.
We note that Bayesian selection techniques are not new. In fact, they are often used as heuristics in practice. The main theoretical challenge in analyzing such policies is that they are based on adaptive and potentially noisy estimates; this is perhaps why such policies have not been formally analyzed for packing and matching problems. Our work, however, shows that such policies in fact have excellent performance in such settings. In particular, we show that for matching and packing problems:

There are easy-to-compute estimators (in particular, ones which are based on simple adaptive LP relaxations) that, when used for Algorithm 1, give constant regret for a wide range of distributions (see Theorems 4.1, 4.3, 4.9 and 5.1).

The above results also provide structural insights for online packing and matching problems, which show that using other types of estimators for Algorithm 1 yields comparable performance guarantees (see Corollaries 4.2, 4.4, 4.10 and 5.2). This holds, for example, if the estimates are obtained through sampling.
At the core of our analysis is a novel stochastic coupling technique for analyzing online policies based on offline (or prophet) benchmarks. In particular, unlike traditional approaches to regret analysis, which are based on showing that an online policy tracks a fixed offline policy, our approach is instead based on forcing to follow ’s actions. We describe this in more detail in the next section.
3. Compensated Coupling and the Bayes Selector
We now introduce our two main technical ideas: the compensated coupling technique, and the Bayes selector heuristic for online decision-making. In particular, we describe them here in the broader context of finite-state finite-horizon MDPs, and defer the details of their application in online packing and matching problems to Sections 4 and 5.
3.1. MDPs and Offline Benchmarks
The basic MDP setup is as follows: at each time (where represents the time-to-go), based on previous decisions, the system state is one of a set of possible states . Next, nature generates an arrival , following which we need to choose from among a set of available actions . The state updates and rewards are determined via a transition function and a reward function : for current state , arrival and action , we transition to the state and collect a reward . Infeasible actions for a given state correspond to . The sets , as well as the measure over arrival process , are known in advance. Finally, though we focus mainly on maximizing rewards, the formalism naturally ports over to cost minimization.
As before, we eschew solving the MDP optimally via backward induction and instead focus on providing performance guarantees for policies with respect to any given offline benchmark. As in packing and matching problems, we again adopt the view that the problem is simultaneously solved by two ‘agents’: and . can only take nonanticipatory actions while makes decisions with knowledge of future arrivals. To keep the notation simple, we restrict ourselves to deterministic policies for and , thereby implying that the only source of randomness is due to the arrival process (our results can be extended to randomized policies).
Let denote the set of all arrival sequences . For a given sample path and time to go, ’s value function is specified via the deterministic Bellman equations
(2) 
with boundary condition for all . The notation is used to emphasize that, given sample path , ’s value function is a deterministic function of and . Note though that the sequence of actions that achieves (and hence, the sequence of states) may not be unique.
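The deterministic Bellman recursion for the offline value on a fixed sample path can be sketched generically as below. This is an illustration under our own assumptions: the value with zero periods to go is zero, infeasible actions are signalled by a reward of negative infinity, states are hashable, and arrivals are stored in order of arrival so that the arrival seen with t periods to go sits at index (horizon - t). The names `hindsight_value`, `transition` and `reward` are hypothetical.

```python
from functools import lru_cache

NEG_INF = float("-inf")


def hindsight_value(arrivals, initial_state, actions, transition, reward):
    """Offline value on one sample path via backward induction with memoization."""
    horizon = len(arrivals)

    @lru_cache(maxsize=None)
    def value(t, state):
        if t == 0:  # boundary condition: nothing left to collect
            return 0.0
        arrival = arrivals[horizon - t]
        best = NEG_INF
        for action in actions:
            gain = reward(state, arrival, action)
            if gain == NEG_INF:  # infeasible action in this state
                continue
            best = max(best, gain + value(t - 1, transition(state, arrival, action)))
        return best

    return value(horizon, initial_state)


# Multi-secretary instance: the state is the remaining budget, action 1 = accept.
def secretary_reward(budget, arrival_reward, action):
    if action == 1 and budget == 0:
        return NEG_INF
    return float(arrival_reward * action)


def secretary_transition(budget, arrival_reward, action):
    return budget - action


print(hindsight_value([1, 2, 4, 2, 1, 4], 2, (0, 1), secretary_transition, secretary_reward))
```

For the multi-secretary instance this recursion returns the sum of the two largest rewards on the path, and several action sequences attain it, which illustrates the non-uniqueness noted above.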
On the other hand, chooses actions based on a policy . At time , if is in state and observes , then it takes action . The restriction that is non-anticipatory imposes that it be adapted to the filtration generated by the current algebra . Let denote ’s state at time as determined by the policy and the arrivals (note this is a random variable). We can write ’s value function, for a given policy , as
For notational ease, we omit explicit indexing of on policy .
Now, denoting the total realized rewards of and on any samplepath as and , we can define the regret of an online policy to be the additive loss incurred by using w.r.t. , i.e.,
Note that as defined, is a random variable – we are interested in bounding for different policies, and also providing tail bounds for the same.
We are now in a position to introduce our main technical tool, the compensated coupling, which we use for obtaining regret bounds for our policies. We introduce this first in the context of online packing problems, before presenting it in more generality. Finally, in Section 3.4, we discuss how the compensated coupling naturally leads to the Bayes selector policy.
3.2. Warm-Up: Compensated Coupling for Online Packing
At a high level, the compensated coupling is a sample-pathwise charging scheme, wherein we try to couple the trajectory of a given policy to a sequence of offline policies. Given any non-anticipatory policy (played by the agent) and any offline benchmark (played by the agent), the technique works by making follow ; formally, we couple the actions of to those of , while compensating to preserve its collected value along every sample path. This allows us to bound the expected regret for the given policy in terms of its ‘disagreement’ with respect to the offline benchmark.
To build some intuition for our approach, consider the multi-secretary problem with budget and three arriving types with . Suppose for the arrivals on a given sample path are . Note that (i.e., the agent attempting to achieve the optimal offline reward) will accept exactly one arrival of type , but is indifferent to which arrival. While analyzing , we have the freedom to choose a benchmark by specifying the tie-breaking rule for ; for example, we can compare the decisions of to an agent who chooses to front-load the decision by accepting the arrival at (i.e., as early in the sequence as possible) or back-load it by accepting the arrival at . This complicates the analysis, as typically regret bounds are obtained with respect to a fixed policy. Many existing works (talluri2006theory; jasin2012; wu2015algorithms) attempt to circumvent this by using the fluid LP; however, Proposition 2.1 shows that this approach cannot break the barrier.
Now suppose instead that we choose to reject the first arrival (), and then want to accept the type arrival at ; this would lead to a decrease in ’s final reward. The crucial observation is that we can still incentivize to accept arrival type by offering a compensation (i.e., additional reward) of for doing so. The basic idea behind the compensated coupling is to generalize this argument: in particular, for general online decision-making problems, we want to couple the states of and by requiring to follow the actions of at each period, while paying an appropriate compensation whenever they diverge (see Fig. 1).
Henceforth, we define as the maximum reward over all classes. Also, for simplicity, we assume that all resource requirements are binary, i.e., . Now for a given sample path and any given budget , if decides to accept the arrival at , we can instead make it reject the arrival while still earning a greater or equal reward by paying a compensation of . On the other hand, note that can at most extract in the future for every resource uses; hence on sample paths where wants to reject , it can be made to accept instead with a compensation of .
We define a particular policy for , i.e., a sequence of actions that collects the value given by the Bellman Eq. 2. To create a compensated coupling, we specify ’s policy as follows: given budget and arrival , suppose chooses an action . Given the same budget , chooses the action that maximizes the Bellman Eq. 2 for . If the maximizer for is , we define and otherwise define as the opposite action. Intuitively, by defining this we specify ’s tie-breaking rule. Next, for given budget and time , we define the disagreement set to be the set of sample paths where the action chosen by is not a maximizer of Eq. 2, i.e., . Now we have the following result.
Lemma 3.1 (Compensated Coupling).
For the Online Packing Problem, under any policy for
Proof.
We first prove the following claim: For every time ,
Suppose the arrival at is of class . If both agents take the same action at time , then we have and the claim holds. When the agents disagree, we have two cases: In the first case, if rejects and accepts, then we have and thus . On the other hand, if accepts and rejects, then we have . This finishes the proof of our claim. Telescoping, we obtain the result. ∎
Before tackling the general case, we point out some notable features of the above result.


Lemma 3.1 is a sample-path property that makes no reference to the arrival process. Though we use it primarily for analyzing MDPs, it can also be used for adversarial settings, for example, if the arrival sequence is arbitrary but satisfies some regularity properties (e.g., bounded variance). We do not explore this further, but believe it is a promising avenue.

For stochastic arrivals, by linearity of expectation we have ; it follows that, if the disagreement probabilities are summable over all , then the expected regret is constant. In Sections 4 and 5 we show how to bound for different problems.

The Lemma also provides a distributional characterization of the regret in terms of a weighted sum of Bernoulli variables. This allows us to get high-probability bounds.

Lemma 3.1 gives a tractable way of bounding the regret which does not require either reasoning about the past decisions of , or the complicated process may follow. In particular, it suffices to bound , i.e., the probability that, given budget levels at time , loses optimality in trying to follow .

Another advantage of Lemma 3.1 is that it is agnostic to the particular benchmark that uses (as long as it admits a natural Bellman recursion). For example, can use an approximation algorithm (lower bound) or a relaxation (upper bound).
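The sample-path accounting behind Lemma 3.1 can be checked numerically on a multi-secretary instance. In the sketch below (our own illustration, with hypothetical names and a hypothetical threshold policy), the compensation charged at each step is the offline value before the step minus the reward collected plus the offline value after the step, both evaluated at the online agent's budget. These per-step terms telescope to the realized regret, vanish whenever the online action is a maximizer for the offline agent, and never exceed the maximum reward.

```python
def offline_value(future_rewards, budget):
    """Hindsight value: sum of the `budget` largest rewards among the given arrivals."""
    return sum(sorted(future_rewards, reverse=True)[:max(budget, 0)])


def compensated_coupling(rewards, budget, policy):
    """Run `policy` on one sample path and return (per-step compensations, online reward)."""
    compensations, collected = [], 0
    for i, r in enumerate(rewards):
        accept = budget > 0 and policy(r, budget, len(rewards) - i)
        gain = r if accept else 0
        next_budget = budget - 1 if accept else budget
        before = offline_value(rewards[i:], budget)            # offline value at this step
        after = offline_value(rewards[i + 1:], next_budget)    # value after following online
        compensations.append(before - (gain + after))
        collected += gain
        budget = next_budget
    return compensations, collected


def threshold_policy(reward, budget, periods_to_go):
    """Hypothetical online rule: accept anything with reward at least 2."""
    return reward >= 2


path, initial_budget = [1, 2, 4, 2, 1, 4], 2
comps, online_reward = compensated_coupling(path, initial_budget, threshold_policy)
regret = offline_value(path, initial_budget) - online_reward
print(comps, regret)
```

Here the only non-zero compensation arises at the step where the online rule spends a unit of budget on a mid-reward arrival that the offline agent would have skipped.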
3.3. Generalized Compensated Coupling
To extend Lemma 3.1 to general decisionmaking problems, first we need some definitions.
Given sample path with arrivals , recall denotes ’s value starting from state with periods to go. obeys the Bellman Eq. 2.
Definition 3.2 (Satisfying Action).
For any given state and time , we say is satisfied with an action at if is a maximizer in the Bellman equation, i.e.,
Example 3.3.
Consider the multi-secretary problem with , initial budget , types with , and a particular sequence of arrivals . The optimal value of is , and this is achieved by accepting the sole type arrival as well as any one out of the two type arrivals. At time , is satisfied with either accepting or rejecting . Further, at , for any budget the only satisfying action is to accept.
Although may be satisfied with multiple actions (see above example), its value remains unchanged under any satisfying action; in fact, the Bellman optimality principle is equivalent to requiring to choose a satisfying action for every state and time , on every sample path . We define a valid policy for to be any anticipatory functional mapping to satisfying the optimality principle: for all ,
For ease of exposition, we henceforth assume that ’s policy is deterministic; however, the same methodology extends to settings where is randomized. Given sample path , we define to be ’s value-to-go on this sample path; note this depends on the specific policy we consider. The regret incurred by on this sample path is thus given by . Moreover, we denote as the reward collected by at time , and hence . Note that the state is a random process, but is completely determined given deterministic policy and sample path .
Next, we quantify by how much we need to compensate when ’s action is not satisfying, as follows.
Definition 3.4 (Marginal Compensation).
For action , time and state , we denote the random variable
And also define
The random variable captures exactly how much we need to compensate , while provides a uniform (over ) bound on the compensation required when errs on an arrival of type by choosing an action . Though there are several ways of bounding , we choose as it is clean and expressive, and admits good bounds in many problems; in particular, for the online packing problem, we have .
The final step is to fix ’s policy (in terms of tie-breaking) to be one which ‘follows ’ as closely as possible. For this, given a policy , on any sample path we set if is satisfying, and otherwise set to an arbitrary satisfying action.
Definition 3.5 (Disagreement Set).
For any state and time , and any action , we define the disagreement set to be the set of sample paths where is not satisfying for , i.e.,
Finally, let be the event when cannot follow , i.e., the set of sample paths in (this depends on , but we omit the indexing). Observing that only under we need to compensate , we get the following.
Lemma 3.6 (General Compensated Coupling).
For any online decisionmaking problem, fix any policy with resulting state process . Then we have:
and thus .
Proof.
The proof follows an argument similar to that of Lemma 3.1. We first claim that, for every time ,
(3) 
To see this, let . If is satisfied taking action in state , then . On the other hand, if is not satisfied taking action , then by the definition of marginal compensation (Definition 3.4) we have, . Since by definition and , we obtain Eq. 3. Finally, our first result follows by telescoping the summands and the second by linearity of expectation. ∎
Lemma 3.6 thus gives a generic tool for obtaining regret bounds against the offline optimum for any online policy. Note also that the compensated coupling argument generalizes to settings where the transition and reward functions are time dependent. The compensated coupling also suggests a natural greedy policy, which we define next.
3.4. The Bayes Selector Policy
Using the formalism defined in the previous sections, let be the disagreement probability of action at time in state (i.e., the probability that is not a satisfying action). Now for any , suppose we have an oracle that gives us for every feasible action . This could be done via an approximation technique (for example, in Sections 4 and 5 we estimate the probabilities with a natural LP relaxation), by simulating future arrivals, by learning the probability from past data, etc. The results below are essentially agnostic to how we obtain this oracle.
Given oracle access to , a natural greedy policy suggested by Lemma 3.6 is that of choosing action that minimizes the disagreement. This is similar in spirit to the Bayes selector (i.e., hard thresholding) in statistical learning – given an estimate of the bias of a variable, one can maximize the probability of correct prediction by thresholding the estimate. Algorithm 1 formalizes the use of this idea in online decisionmaking.
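To make the greedy step concrete, the following is a minimal Python sketch of the selection rule described above: among the feasible actions, pick the one whose estimated disagreement probability is smallest. The names `bayes_selector` and `disagreement_estimate` are hypothetical and introduced only for illustration; the oracle is passed in as an arbitrary callable, consistent with the results being agnostic to how the estimates are obtained.

```python
from typing import Callable, Hashable, Iterable


def bayes_selector(
    actions: Iterable[Hashable],
    disagreement_estimate: Callable[[Hashable], float],
) -> Hashable:
    """Return the feasible action with the smallest estimated
    disagreement probability (ties go to the first such action)."""
    feasible = list(actions)
    if not feasible:
        raise ValueError("no feasible action in the current state")
    return min(feasible, key=disagreement_estimate)


# Illustrative estimates for a single state and time (made-up numbers).
estimates = {"accept": 0.12, "reject": 0.81}
chosen = bayes_selector(estimates.keys(), estimates.get)
print(chosen)  # prints "accept"
```

In an online loop, this function would be called once per arrival with the estimates for the current state and time.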
From Lemma 3.6, we immediately have the following:
Corollary 3.7 (Regret Of Bayes Selector).
Consider Algorithm 1 with overestimates . If denotes the policy’s action at time , then
In some scenarios, the probabilities must be estimated through sampling or simulation. The next result states that, if we can bound the estimation error uniformly over states and actions, then the regret guarantee of the algorithm degrades only additively in the error (not multiplicatively, as one might suspect).
Corollary 3.8 (Bayes Selector w/ Imperfect Estimators).
Assume we have estimators of the probabilities such that for all . If we run Algorithm 1 with overestimates , and denotes the policy’s action at time , then
Proof.
Given the condition on , it follows that is an overestimate, so we can apply Corollary 3.7. ∎
Observe that the total error induced by estimation is a constant if, e.g., we can guarantee or .
4. Regret Bounds for Online Packing
We now turn to the use of the Bayes selector in the online packing and matching problems defined in Section 2. In this section, we show that for online packing, the Bayes Selector achieves a regret which is independent of the number of arrivals and the initial budgets ; in Section 5, we extend this to matching problems.
One challenge in characterizing the performance of the Bayes Selector is that there is no general closed-form oracle for the exact disagreement probabilities in such settings. We circumvent this by showing that the dynamic fluid relaxation in Eq. 4 provides a good estimator for , and moreover, that the Bayes Selector based on these statistics reduces to a simple re-solve and threshold policy. For ease of exposition, we directly present the resulting policy, but the connections to Algorithm 1 will become apparent by the end of this section.
Recall that denotes the cumulative arrivals in the last periods. Given knowledge of and state , we define the ex-post relaxation and fluid relaxation as follows.
(4) 
Observe that both problems depend on ’s budget at ; this is a crucial technical point and can only be accomplished due to the coupling we have developed.
Now let be the solution of and the solution of . We present our policy in Algorithm 2, which is equivalent to running the Bayes Selector (Algorithm 1) using the fluid relaxation as a proxy for the estimators .
Intuitively, we ‘frontload’ classes such that and backload the rest. Now if is satisfied accepting a frontloaded class (resp. rejecting a backloaded class), he will do so. Accepting class is therefore an error if , given the same budget as , picks no future arrivals of that class (i.e., ). On the other hand, rejecting is an error if . We summarize this as follows:

Incorrect rejection: if and .

Incorrect acceptance: if and .
Observe that a compensation is paid only when the fluid solution is far off the correct stochastic solution. Below, we formalize the fact that, since estimates , such an event is highly unlikely – this along with the compensated coupling provides our desired regret guarantees.
We need some additional notation before presenting our results. Let () be the expectation (probability) conditioned on the arrival at time being of type , i.e., . We denote and .
4.1. Warmup: Single Resource Allocation with Multinomial Arrivals
We consider the multinomial arrival process, where with probability . In this subsection we prove the following.
Theorem 4.1 ().
The regret of the Fluid Bayes Selector (Algorithm 2) for the multisecretary problem with multinomial arrivals is at most .
This recovers the best-known regret bound for this problem, shown in recent work (itai_secretary). However, while the result in (itai_secretary) depends on a complex martingale argument, our proof is much more succinct, and provides explicit and stronger guarantees; in particular, in Section 4.4, we provide concentration bounds for the same.
Moreover, Theorem 4.1, along with Corollary 3.8, provides a critical intermediate step for characterizing the performance of Algorithm 1 for the multisecretary problem.
Corollary 4.2 ().
For the multisecretary problem with multinomial arrivals, the regret of a Bayes selector policy (Algorithm 1) with any imperfect estimators is at most , where is the accuracy defined by .
Observe that, if is summable, e.g., or , then Corollary 4.2 implies constant regret for all these types of estimators we can use in Algorithm 1.
Proof of Theorem 4.1. Assume w.l.o.g. that . This one-dimensional version can be written as follows.
The optimal solution to is to sort all the arrivals and pick the top ones. Now define the probability of ‘arrival or better’ by . The solution to is to pick the largest such that , then make for and . Rounding this solution, we arrive at the following policy: first, always accept class . Second, if class arrives, accept if and reject if .
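The single-resource construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: the fluid problem is taken to be the greedy-fillable LP with caps equal to the expected remaining arrivals of each class, the acceptance threshold is exposed as a hypothetical parameter `threshold` (set here to one half of the expected remaining arrivals), and the convention that the current arrival counts among the remaining periods is also an assumption. All function names are hypothetical.

```python
import random


def fluid_solution(rewards, probs, budget, periods_left):
    """Greedy solution of the assumed one-dimensional fluid LP:
    maximize sum_j r_j x_j  s.t.  sum_j x_j <= budget,
    0 <= x_j <= periods_left * p_j."""
    x = [0.0] * len(rewards)
    remaining = float(budget)
    # Fill classes in decreasing order of reward until the budget runs out.
    for j in sorted(range(len(rewards)), key=lambda j: -rewards[j]):
        x[j] = min(remaining, periods_left * probs[j])
        remaining -= x[j]
    return x


def fluid_policy_accepts(j, rewards, probs, budget, periods_left, threshold=0.5):
    """Accept class j if its fluid allocation covers at least a
    `threshold` fraction of its expected remaining arrivals (assumed rule)."""
    if budget <= 0:
        return False
    x = fluid_solution(rewards, probs, budget, periods_left)
    return x[j] >= threshold * periods_left * probs[j]


def offline_value(arrival_rewards, budget):
    """Offline benchmark: sort all arrivals and keep the top `budget`."""
    return sum(sorted(arrival_rewards, reverse=True)[:budget])


def simulate_regret(rewards, probs, budget, horizon, seed=0):
    """Regret of the re-solve-and-threshold sketch on one sample path."""
    rng = random.Random(seed)
    types = rng.choices(range(len(rewards)), weights=probs, k=horizon)
    collected, b = 0.0, budget
    for s, j in enumerate(types):
        periods_left = horizon - s  # current arrival included (assumption)
        if fluid_policy_accepts(j, rewards, probs, b, periods_left):
            collected += rewards[j]
            b -= 1
    return offline_value([rewards[j] for j in types], budget) - collected
```

Calling `simulate_regret` with the same reward and probability vectors but increasing `horizon` and `budget` gives a quick empirical check of how the regret of this sketch scales on individual sample paths.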
Recall that is the probability that is not satisfied with ’s action and is this probability conditioned on . Our aim in the rest of the section is to show that is summable over .
As we observed before: (1) is not satisfied rejecting a class iff he accepts all future arrivals of type , i.e., . (2) is not satisfied accepting class iff he rejects all future type arrivals, i.e., . We now bound these using the following standard Chernoff bounds: for :
(5) 
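As a numerical illustration of the kind of exponential tail bound used here, the Python sketch below compares the exact upper tail of a binomial random variable with Hoeffding's inequality, P[Bin(n, p) - np >= n·eps] <= exp(-2·n·eps²). Hoeffding's inequality is used as a stand-in: it is a standard bound of the same flavor, but it is not claimed to be the exact form of Eq. 5. The function names are hypothetical.

```python
import math


def binomial_upper_tail(n, p, k):
    """Exact P[Bin(n, p) >= k], computed by summing the pmf."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def hoeffding_bound(n, p, k):
    """Hoeffding upper bound on P[Bin(n, p) >= k] for k >= n * p."""
    eps = k / n - p  # deviation above the mean, per trial
    return math.exp(-2 * n * eps**2)


# The exact tail and the bound both decay exponentially in n.
for n in (50, 100, 200):
    k = 2 * n // 5  # deviation of n/10 above the mean when p = 0.3
    print(n, binomial_upper_tail(n, 0.3, k), hoeffding_bound(n, 0.3, k))
```

The printed values show the exact tail sitting below the bound at each n, with both shrinking as n grows; this geometric decay is what makes the disagreement probabilities summable over time.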
We now bound the disagreement probabilities . Take rejected by , i.e., it must be that and . Since we are rejecting, a compensation is paid only when condition (1) applies, thus . By the structure of ’s solution, all classes are accepted in the last rounds, i.e., it must be that for all . We must be in the event . We know that . Since , the probability of error is:
Using Eq. 5, it follows that .
Now let us consider when is accepted by . A compensation is paid only when and condition (2) applies, thus . Again, by the structure of , necessarily for . Therefore we must be in the event . Recall that is accepted iff , thus
This event is also exponentially unlikely. Using Eq. 5, we conclude . Overall we can bound the total compensation as:
Using compensated coupling (Lemma 3.1), we get our result.
Proof of Corollary 4.2. We denote the action accept and reject. In the proof of Theorem 4.1 we concluded that the following are overestimates of the disagreement probabilities (in the sense ):
Crucially, observe that for every and every type , the estimate is independent of the state . This proves that . The proof is now completed by invoking the Compensated Coupling (Lemma 3.1) and Corollary 3.8.
4.2. Online Packing with Multinomial Arrivals
We consider now the case and (we lift this restriction in Section 4.3). Since the matrix is binary, we do not need to check feasibility as it is guaranteed by the fact that is a feasible solution of . In the remainder of this subsection we generalize our ideas to prove the following.
Theorem 4.3 ().
The regret of the Fluid Bayes Selector (Algorithm 2) for online packing with binary matrix and multinomial arrivals is at most , where and is a constant that depends only on the matrix .
Just as before, Theorem 4.3, along with Corollary 3.8, provides a performance guarantee for Algorithm 1. We state the corollary without proof, since it is identical to that of Corollary 4.2.
Corollary 4.4 ().
For online packing with binary matrix and multinomial arrivals, the regret of the Bayes Selector (Algorithm 1) with any imperfect estimators is at most , where the constant is as in Theorem 4.3 and is the additive estimator accuracy at time (i.e., ).
To prove Theorem 4.3, we first need a result from linear programming, which characterizes the sensitivity of the solution to an LP in terms of perturbations to the budget vector. The following proposition is based on a more general result from (mangasarian).
Proposition 4.5 (LP Lipschitz Property).
Given , and any norm in , consider the following LP
Then there exists a constant such that, for any and any solution to , there exists a solution solving such that .
This result implies that small changes in the arrivals vector do not change the solution by much. To bound the changes, we use the following concentration bound for multinomial random variables (taken from (multinomial_bound, Lemma 3)), based on a standard Poissonization argument.
(6) 
Proof of Theorem 4.3. Recall our two observations: (1) Incorrect rejection of happens if and . (2) Incorrect acceptance of means and . Now we can upper bound the probability of paying a compensation as follows. Call the event . In this event, Proposition 4.5 implies , hence does not err when occurs. By setting in Eq. 6 we can bound as long as the condition is satisfied. Putting these arguments together, we have
(7) 
Finally, summing up over time, we get