Consider the following revenue maximization problem in a repeated setting, called the online posted pricing problem. In each period, the seller has a single item to sell, and a new prospective buyer. The seller offers to sell the item to the buyer at a given price; the buyer buys the item if and only if the price is below his private valuation for the item. The private valuation of the buyer itself is never revealed to the seller. How should a monopolistic seller iteratively set the prices if he wishes to maximize his revenue? What if he also cares about the market share, i.e. the fraction of time periods at which the item is sold?
Estimating price sensitivities and demand models in order to optimize revenue and market share is the bedrock of econometrics. The emergence of online marketplaces has enabled sellers to costlessly change prices, as well as collect huge amounts of data. This has renewed interest in understanding best practices for data-driven pricing. The extreme case, in which the price is updated for each buyer, is the online pricing problem described above; an algorithm for this case can always be used with less frequent price updates. Moreover, this problem is intimately related to classical experimentation and estimation procedures.
This problem has been studied from an online learning perspective, as a variant of the multi-armed bandit problem. In this variant, there is an arm for each possible price (presumably after an appropriate discretization). The revenue of the arm for a given price is either that price or zero, depending on whether the arriving value is at least the price or smaller than the price, respectively. The total revenue of a pricing algorithm is then compared to the total revenue of the best fixed posted price in hindsight. The difference between the two, called the regret, is then bounded from above. No assumption is made on the distribution of values; the regret bounds are required to hold for the worst case sequence of values. Blum et al. (2004) assume that the buyer valuations are in , and show the following multiplicative plus additive bound on the regret: for any , the regret is at most times the revenue of the optimal price, plus . Blum and Hartline (2005) show that the additive factor can be made to be , trading off a factor for an extra factor.
An undesirable aspect of these bounds is that they scale linearly with ; this is particularly problematic when is an estimate and we might set it to be a generous upper bound on the range of prices we wish to consider. A typical use case is when the same algorithm is used for many different products, with widely varying price ranges. We may not be able to manually tune the range for each product separately.
One might wonder if this dependence on is unavoidable, as it seems to be reflected by the existing lower bounds for this problem in the literature (lower bounds are discussed in more detail later in the introduction). Interestingly, in all of these lower-bound instances the best fixed price is equal to itself; therefore, it is not clear whether this dependence on is required for instances where is only a pessimistic upper bound on the best fixed price. We now ask the following question:
Question: do online learning algorithms exist for the online posted pricing problem, such that their regrets are proportional to the best fixed price instead of the highest value?
Standard off-the-shelf bounds allow regret to depend on the loss of the best arm instead of the worst case loss. However, even such bounds still depend linearly on the maximum range of all the losses, and thus they would not allow us to replace the highest value by the best fixed price.
Fortunately, in the online pricing problem the reward function of the arms is well structured. In particular, the reward of the arm for a given price is upper-bounded by that price (and not only by the maximum value). Can we use this structure to improve the standard regret bounds? We answer this question in the affirmative by reducing the problem to a pure learning problem that we call multi-scale online learning.
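To make the setup concrete, the following is a minimal Python sketch of posted pricing revenue and of regret against the best fixed price in hindsight. The function names and the toy value sequence are illustrative assumptions, not taken from the text; a sale is assumed to occur when the value is at least the price.

```python
from typing import Sequence, Tuple


def sale_revenue(price: float, value: float) -> float:
    """Revenue from one buyer: the price if the buyer accepts, else zero."""
    return price if value >= price else 0.0


def fixed_price_revenue(price: float, values: Sequence[float]) -> float:
    """Total revenue of posting the same price in every period."""
    return sum(sale_revenue(price, v) for v in values)


def best_fixed_price(values: Sequence[float]) -> Tuple[float, float]:
    """Best fixed price in hindsight and its revenue.

    It suffices to try the observed values: lowering a price that lies
    strictly between two values never gains a sale.
    """
    best = max(set(values), key=lambda p: fixed_price_revenue(p, values))
    return best, fixed_price_revenue(best, values)


def regret(posted: Sequence[float], values: Sequence[float]) -> float:
    """Benchmark revenue minus the revenue of the posted price sequence."""
    earned = sum(sale_revenue(p, v) for p, v in zip(posted, values))
    _, benchmark = best_fixed_price(values)
    return benchmark - earned


# Hypothetical toy instance: four buyers, four posted prices.
values = [1.0, 2.0, 2.0, 8.0]
posted = [2.0, 2.0, 4.0, 4.0]
print(best_fixed_price(values))  # (8.0, 8.0)
print(regret(posted, values))    # 2.0
```

In this toy sequence the best fixed price coincides with the highest value, which is the situation in the known lower-bound instances mentioned above.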
1.1 Multi-scale online learning
The main technical ingredients in our results are variants of the classical problems of learning from expert advice and multi-armed bandit. We introduce the multi-scale versions of these problems, where each action has its reward bounded in a different range. Here, we seek to design online learning algorithms that guarantee multi-scale regret bounds, i.e., the regret with respect to each action scales with the range of that particular action, instead of the maximum possible range. These guarantees are in contrast with the regret bounds of the standard versions, which scale with the maximum range.
Main result (informal): we give algorithms for the full information and bandit information versions of the multi-scale online learning problem with multi-scale regret guarantees.
While we use these bounds mostly for designing online auctions and pricing mechanisms, we expect such bounds to be of independent interest.
The main idea behind our algorithms is to use a tailored variant of online (stochastic) mirror descent (OSMD) (Bubeck, 2011). In this tailored version, the algorithm uses a weighted negative entropy as the Legendre function (also known as the mirror map), where the weight of each term (corresponding to arm ) is actually equal to the range of that arm. More formally, assuming the range of arm is equal to , our mirror descent algorithms (Algorithm 1 for full information, and Algorithm 3 for the bandit information) use the following mirror map:
Intuitively speaking, these algorithms take into account different ranges for different arms by first normalizing the reward of each arm by its range (i.e., dividing the reward of each arm by its corresponding range), and then projecting the updated weights by performing a smooth multi-scale projection into the simplex. This projection is an instance of the more general Bregman projection (Bubeck, 2011) for the special case of weighted negative entropy as the mirror map. The mirror descent framework then gives regret bounds in terms of a “local norm” as well as an “initial divergence”, which we then bound differently for each version of the problem. In the technical sections we highlight how the subtle variations arise as a result of different techniques used to bound these two terms.
While our algorithms have the style of the multiplicative weights update (up to a normalization of the rewards), the smooth projection step at each iteration makes them drastically different. To shed some light on this projection step, which plays an important role in our analysis, consider a very special case of the problem where the reward of each arm is deterministically equal to . The multiplicative weights algorithm picks arm with a probability proportional to . However, as is clear from the description of Algorithm 1, our algorithm uniformly scales the weight of each arm first. Then, in the projection step the weight of each arm is multiplied by for some parameter . Hence, arm will be sampled with a probability proportional to (which is a smooth approximation to , but in a different way compared to the vanilla multiplicative weights).
The multi-scale versions exhibit subtle variations that do not appear in the standard versions. First of all, our applications to auctions and pricing have non-negative rewards, and this actually makes a difference. For both the expert and the bandit versions, the minimax regret bounds for non-negative rewards are provably better than those when rewards could be negative. Further, for the bandit version, we can prove a better bound if we only require the bound to hold with respect to the best action, rather than all actions (for non-negative rewards). The various regret bounds and the comparison to standard bounds are summarized in Table 1.
[Table 1: standard regret bounds versus the multi-scale bounds of this paper (upper and lower bounds). The table entries are not recoverable from this copy.]
1.2 The implications for online auctions and pricing
As a direct application of our multi-scale online learning framework, somewhat surprisingly, we obtain the following.
Second contribution: we show that we can get regret proportional to the best fixed price instead of the highest value for the online posted pricing problem.
That is, we can replace the highest value by the best fixed price, which is used in the definition of the benchmark. In particular, we show that the additive bound can be made to be , where is the best fixed price in hindsight. This allows us to use a very generous estimate for and let the algorithm adapt to the actual range of prices; we only lose a factor. The algorithm balances exploration probabilities of different prices carefully and automatically zooms in on the relevant price range. This does not violate known lower bounds, since in those instances is close to .
Bar-Yossef et al. (2002), Blum et al. (2004), and Blum and Hartline (2005) also consider the “full information” version of the problem, or what we call the online (single buyer) auction problem, where the valuations of the buyers are revealed to the algorithm after the buyer has made a decision. Such information may be available in a context where the buyers have to bid for the items, and are awarded the item if their bid is above a hidden price. In this case, the additive term can be improved to , which is tight. Once again, by a reduction to multi-scale online learning, we show that can be replaced with ; in particular, we show that the additive term can be made to be .
1.3 Purely multiplicative bounds and sample complexity
The regret bounds mentioned above can be turned into a purely multiplicative factor in the following way: for any , the algorithm is guaranteed to get a fraction of the best fixed price revenue, provided the number of periods where is the additive term in the regret bounds above. This follows from the observation that a revenue of is a lower bound on the best fixed price revenue. Define the number of periods required to get a multiplicative approximation (as a function of ) to be the convergence rate of the algorithm.
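The conversion can be written out as a short worked derivation. This is a sketch under stated assumptions: the symbols $T$, $\mathrm{OPT}$, $\mathrm{ALG}$ and $E$ are labels introduced here for illustration, all values are assumed to be at least $1$, and the constant in the final factor is only indicative.

```latex
% Assumed notation (introduced for this sketch): T = number of periods,
% OPT = best fixed price revenue, ALG = revenue of the algorithm,
% E = additive term in the multiplicative plus additive regret bound.
\begin{align*}
\mathrm{ALG} &\ge (1-\epsilon)\,\mathrm{OPT} - E
  && \text{(multiplicative plus additive bound)} \\
\mathrm{OPT} &\ge T
  && \text{(a price of $1$ sells in every period if all values are at least $1$)} \\
E &\le \epsilon\, T \le \epsilon\,\mathrm{OPT}
  && \text{(whenever $T \ge E/\epsilon$)} \\
\mathrm{ALG} &\ge (1-2\epsilon)\,\mathrm{OPT}
  && \text{(combining the three lines above)}
\end{align*}
```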
A multiplicative factor is also the target in the recent line of work, on the sample complexity of auctions, started by Balcan et al. (2008); Elkind (2007); Dhangwatnotai et al. (2014); Cole and Roughgarden (2014). (We give a more comprehensive discussion of this line of work in Section 1.4.) Here, i.i.d. samples of the valuations are given from a fixed but unknown distribution, and the goal is to find a price such that its revenue with respect to the hidden distribution is a fraction of the optimum revenue for this distribution. The sample complexity is the minimum number of samples needed to guarantee this (as a function of ).
The sample complexity and the convergence rate (for the full information setting) are closely related to each other. The sample complexity is always smaller than the convergence rate: the problem is easier because of the following.
The valuations are i.i.d. in the case of sample complexity, whereas they can be arbitrary (worst case) in the case of convergence rate.
Sample complexity corresponds to an offline problem: you get all the samples at once. Convergence rate corresponds to an online problem: you need to decide what to do on a given valuation without knowing what valuations arrive in the future.
This is formalized in terms of an online to offline reduction [folklore] which shows that a convergence rate upper bound can be automatically translated to a sample complexity upper bound. This lets us convert sample complexity lower bounds into lower bounds on the convergence rate, and in turn into lower bounds on the additive error in an additive plus multiplicative regret bound. For example, the additive error for the online auction problem (and hence also for the posted pricing problem) cannot be (Huang et al., 2015b). (We conjecture that the lower bound for the posted pricing problem should be worse by a factor of , since one needs to explore about different prices.) Moreover, it is insightful to compare the convergence rates we show with the best known sample complexity upper bounds; proving better convergence rates would mean improving these bounds as well.
A natural target convergence rate for a problem is therefore the corresponding sample complexity, but achieving this is not always trivial. In particular, we consider an interesting version of the sample complexity bound for auctions, for which no analogous convergence rate bound is known in the literature. This version takes into account both revenue and market share, and gets sample complexity bounds that are scale free; there is no dependence on , which means it works for unbounded valuations! For any , the best fixed price benchmark is relaxed to ignore those prices whose market share (which is equivalent to the probability of sale) is below a fraction; as increases the benchmark is lower. This is a meaningful benchmark since in many cases revenue is not the only goal, even if you are a monopolist. A more reasonable goal is to maximize revenue subject to the constraint that the market share is above a certain threshold. What is more, this gives a sample complexity of (Huang et al., 2015b). In fact can be set to without loss of generality when the values are in , and the above bound then matches the sample complexity with respect to the best fixed price revenue. (When the values are in , we can guarantee a revenue of by posting a price of 1, and to beat this, any other price (and in particular a price of ) would have to sell at least times.) In addition, this bound gives a precise interpolation: as the target market share increases, the number of samples needed decreases almost linearly.
Third contribution: we show a convergence rate that almost matches the above sample complexity, for the full information setting.
We have a mild dependence on ; the rate is proportional to . Further, we also show a near optimal convergence rate for the online posted pricing problem. (Unfortunately, we cannot yet guarantee that our online algorithm itself gets a market share of , although we strongly believe that it does. Showing such bounds on the market share of the algorithm is an important avenue for future research.)
All of our results in the full information (online auction) setting extend to the multiple buyer model. In this model, in each time period, a new set of buyers competes for a single item. The seller runs a truthful auction that determines the winning buyer and his payment. The benchmark here is the set of all “Myerson-type” mechanisms. These are mechanisms that are optimal when each period has buyers of potentially different types, and the value of each buyer is drawn independently from a type dependent distribution. In fact, our convergence rates also imply new sample complexity bounds for these problems (except that they are not computationally efficient).
[Table: lower and upper bounds for the online single buyer auction, online posted pricing, and online multi buyer auction problems. Column headings: best known (sample complexity), best known (convergence rate), this paper (Thm. 4.2). The table entries are not recoverable from this copy.]
[Table: lower bounds (sample complexity) and upper bounds for the online single buyer auction, online posted pricing, and online multi buyer auction problems. Column headings: best known (sample complexity), this paper (Thm. 4.4). The table entries are not recoverable from this copy.]
1.4 Other related work
The online pricing problem, also called dynamic pricing, is a much studied topic, across disciplines such as operations research and management science (Talluri and Van Ryzin, 2006), economics (Segal, 2003), marketing, and of course computer science. The multi-armed bandit approach to pricing is particularly popular. See den Boer (2015) for a recent survey on various approaches to the problem.
Kleinberg and Leighton (2003) consider the online pricing problem, under the assumption that the values are in , and considered purely additive factors. They showed that the minimax additive regret is , where is the number of periods. This is similar in spirit to regret bounds that scale with , since one has to normalize the values so that they are in . The finer distinction about the magnitude of the best fixed price is absent in this work. Recently, Syrgkanis (2017) also consider the online auction problem, with an emphasis on a notion of “oracle based” computational efficiency. They assume the values are all in and do not consider the scaling issue that we do; this makes their contribution orthogonal to ours.
Starting with Dhangwatnotai et al. (2014), there has been a spate of recent results analyzing the sample complexity of pricing and auction problems. Cole and Roughgarden (2014) and Devanur et al. (2016) consider multiple buyer auctions with regular distributions (with unbounded valuations) and give sample complexity bounds that are polynomial in and , where is the number of buyers. Morgenstern and Roughgarden (2015) consider arbitrary distributions with values bounded by , and give bounds that are polynomial in and . Roughgarden and Schrijvers (2016); Huang et al. (2015b) give further improvements on the single- and multi-buyer versions respectively; Tables 2 and 3 give a comparison of these results with our bounds, for the problems we consider. The dynamic pricing problem has also been studied when there is a given number of copies of the item to sell (limited supply) (Agrawal and Devanur, 2014; Babaioff et al., 2015; Badanidiyuru et al., 2013; Besbes and Zeevi, 2009). There are also variants where the seller interacts with the same buyer repeatedly, and the buyer can strategize to influence his utility in the future periods (Amin et al., 2013).
Foster et al. (2017) also consider the multi-scale online learning problem motivated by a model selection problem. They consider additive bounds, for the symmetric case, for full information, but not bandit feedback. Their regret bounds are not comparable to ours in general; our bounds are better for the pricing/auction applications we consider, and their bounds are better for their application.
We start in Section 2 by showing regret upper bounds for the multi-scale experts problem with non-negative rewards (Theorem 2.2). The corresponding upper bounds for the bandit version are in Section 3 (Theorem 3.1). In Section 4 we show how the multi-scale regret bounds (Theorems 2.2 and 3.1) imply the corresponding bounds for the auction/pricing problems (Theorems 4.2 and 4.4). Finally, the regret (upper and lower) bounds for the symmetric range are discussed in Section 5 (Theorem 5.1 and related results).
2 Full Information Multi-scale Online Learning
We consider a variety of online algorithmic problems that are all part of the multi-scale online learning framework. We start by defining this framework, in which different actions have different ranges. We exploit this structure and express our results in terms of action-specific regret bounds for this general problem. To obtain these results, we use a variant of online mirror descent and propose a multiplicative-weight update style learning algorithm for our problem, which we call the Multi-Scale Multiplicative-Weight (MSMW) algorithm.
Next, we investigate the single buyer auction problem (or equivalently the full-information single buyer dynamic pricing problem) as a canonical application, and show how to get multiplicative cum additive approximations here with the help of the multi-scale online learning framework. To show the tightness of our bounds, we compare the convergence rate of our dynamic pricing with the sample complexity of a closely related offline problem, i.e., near optimal Bayesian revenue maximization from samples (Cole and Roughgarden, 2014).
2.1 The framework
Our full-information multi-scale online learning framework is basically the classical learning from expert advice problem. The main difference is that the range of rewards of different experts could be different. More formally, suppose there is a set of actions . (We use the terms experts, arms and actions interchangeably in this paper.) The online problem proceeds in rounds, where in each round (we use the notation , for any ):
The adversary picks a reward function , where is the reward of action .
The algorithm picks an action simultaneously.
Then the algorithm gets the reward and observes the entire reward function .
The total reward of the algorithm is denoted by
The standard “best fixed action” benchmark is
We further assume that the action set is finite. Without loss of generality, if the action set is of size , we identify . The reward is such that for all , , where is the range of action .
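The protocol above can be sketched as a small simulation. This is an illustrative sketch only: the names `play`, `choose` and `per_action_regret` are hypothetical, and the reward table is fixed in advance (an oblivious adversary), which is a simplifying assumption.

```python
from typing import Callable, List, Sequence


def play(rewards: Sequence[Sequence[float]],
         choose: Callable[[int, List[Sequence[float]]], int]) -> float:
    """Run the full-information protocol; return the algorithm's total reward.

    rewards[t][i] is the reward of action i in round t.
    choose(t, history) may only look at reward vectors of earlier rounds.
    """
    history: List[Sequence[float]] = []
    total = 0.0
    for t, g in enumerate(rewards):
        i = choose(t, history)  # the action is picked before g is revealed
        total += g[i]           # the algorithm collects the reward of its action
        history.append(g)       # full information: the whole vector is observed
    return total


def per_action_regret(rewards: Sequence[Sequence[float]], total: float) -> List[float]:
    """Regret with respect to each fixed action (one number per action)."""
    n = len(rewards[0])
    return [sum(g[i] for g in rewards) - total for i in range(n)]


# Hypothetical instance: action 0 has rewards in [0, 1], action 1 in [0, 10].
rewards = [[1.0, 0.0], [1.0, 10.0], [0.0, 0.0]]
total = play(rewards, lambda t, history: 0)  # a player that always picks action 0
print(total)                                 # 2.0
print(per_action_regret(rewards, total))     # [0.0, 8.0]
```

The best fixed action benchmark is the largest per-action total, so the standard regret is the maximum entry of the list returned by `per_action_regret`.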
2.2 Multi-scale regret bounds
We prove action-specific regret bounds, which we also call multi-scale regret guarantees. Towards this end, we define the following quantities.
The regret bound w.r.t. action , i.e., an upper bound on , depends on the range , as well as any prior distribution over the action set ; this way, we can handle countably many actions. Let and (if applicable) be the minimum and the maximum range. We first state a version of the regret bound which is parameterized by ; such bounds are stronger than type bounds which are more standard. [Main Result] There exists an algorithm for the full-information multi-scale online learning problem that takes as input any distribution over , the ranges and a parameter , and satisfies:
Compare this to what you get by using the standard analysis for the experts problem (Arora et al., 2012), where the second term in the regret bound is . Choosing to be the uniform distribution in the above theorem gives . Also, one can compare the pure-additive version of this bound with the classic pure-additive regret bound for the experts problem by setting (Corollary 3). There exists an algorithm for the full-information multi-scale online learning problem that takes as input the ranges , and satisfies:
We emphasize that in a multi-scale regret guarantee, we provide a separate regret bound for each action, where the bound on the regret of action only scales linearly with . This type of guarantee should not be mistaken for a bound on the worst action.
Here is the map of the rest of this section. In Section 2.3 we propose an algorithm that exploits the reward structure, and later in Section 2.4 we show how this algorithm is an online mirror descent with weighted negative entropy as its mirror map. For reward-only instances, we prove the regret bound in Section 2.5. We finally turn our attention to the single buyer online auction problem in Section 2.6.
2.3 Multi-Scale Multiplicative-Weight (MSMW) algorithm
We achieve our regret bound in Theorem 2.2 by using the MSMW algorithm (Algorithm 1). The main idea behind this algorithm is to take into account different ranges for different experts, and therefore:
We normalize the reward of each expert accordingly, i.e. divide the reward of expert by its corresponding range ;
We project the updated weights by performing a smooth multi-scale projection into the simplex: the algorithm finds a such that multiplying the current weight of each expert by yields a probability distribution over the experts. It then uses this resulting probability distribution for sampling the next expert.
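The two steps above can be sketched in Python as follows. Algorithm 1 itself is not reproduced in this text, so the concrete formulas are assumptions: with `c[i]` the range of expert `i` and `eta` a learning rate, the sketch uses the weight update `p[i] * exp(eta * g[i] / c[i])` and the projection multiplier `exp(-lam / c[i])`, which is the form that a weighted negative entropy mirror map produces. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def multiscale_project(w: Sequence[float], c: Sequence[float]) -> List[float]:
    """Scale w[i] by exp(-lam / c[i]) with lam chosen so the result sums to 1."""
    def mass(lam: float) -> float:
        return sum(wi * math.exp(-lam / ci) for wi, ci in zip(w, c))

    # mass(lam) is strictly decreasing in lam, so bracket the root, then bisect.
    scale = min(c)
    lo, hi = -scale, scale
    while mass(lo) < 1.0:
        lo *= 2.0
    while mass(hi) > 1.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [wi * math.exp(-lam / ci) for wi, ci in zip(w, c)]


def msmw_step(p: Sequence[float], g: Sequence[float],
              c: Sequence[float], eta: float) -> List[float]:
    """One update: normalize each reward by its range, then project."""
    w = [pi * math.exp(eta * gi / ci) for pi, gi, ci in zip(p, g, c)]
    return multiscale_project(w, c)


# Hypothetical instance with three experts whose ranges differ by orders of magnitude.
c = [1.0, 10.0, 100.0]
p = [1.0 / 3, 1.0 / 3, 1.0 / 3]
p_next = msmw_step(p, g=[1.0, 5.0, 0.0], c=c, eta=0.5)
print(p_next, sum(p_next))
```

When all ranges are equal, the multiplier is the same for every expert and the projection reduces to ordinary normalization, so the step coincides with a vanilla multiplicative weights update. With unequal ranges the same `lam` shrinks small-range experts more aggressively than large-range ones.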
2.4 Equivalence to online mirror descent with weighted negative entropy
While it is possible to analyze the regret of the MSMW algorithm (Algorithm 1) by using first principles, we take a different approach (the elementary analysis can still be found in the appendix, Section A.2). We show how this algorithm is indeed an instance of the Online Mirror Descent (OMD) algorithm for a particular choice of the Legendre function (also known as the mirror map).
2.4.1 Preliminaries on online mirror descent.
Fix an open convex set and its closure , which in our case are and respectively, and a closed-convex action set , which in our case is , i.e. the set of all probability distributions over experts in . At the heart of an OMD algorithm there is a Legendre function , i.e. a strictly convex function that admits continuous first order partial derivatives on and , where denotes the gradient map of . One can think of OMD as a member of projected gradient descent algorithms, where the gradient update happens in the dual space rather than in primal , and the projection is defined by using the Bregman divergence associated with rather than -distance (see Figure 1).
[Bregman Divergence (Bubeck, 2011)] Given a Legendre function over , the Bregman divergence associated with , denoted as , is defined by
[Online Mirror Descent (Bubeck, 2011)] Suppose is a Legendre function. At every time , the online mirror descent algorithm with Legendre function selects an expert drawn from distribution , and then updates and given rewards by:
where is called the learning rate of OMD.
We use the following standard regret bound of OMD (refer to Bubeck (2011) for a thorough discussion of OMD; for completeness, a proof is also provided in the appendix, Section A.3). Roughly speaking, this lemma upper-bounds the regret by the sum of two separate terms: the “local norm” (the first term), which captures the total deviation between and , and the “initial divergence” (the second term), which captures how far the initial distribution is from the target distribution. For any learning rate parameter and any benchmark distribution over , the OMD algorithm with Legendre function admits the following:
2.4.2 MSMW algorithm as an OMD
For our application, we focus on a particular choice of Legendre function that captures different learning rates proportional to for different experts, as we saw earlier in Algorithm 1. We start by defining the weighted negative entropy function. Given expert-ranges , the weighted negative entropy is defined by
It is straightforward to see is a non-negative Legendre function over . Moreover, and . We now have the following lemma that shows Algorithm 1 is indeed an OMD algorithm. The MSMW algorithm, i.e. Algorithm 1, is equivalent to an OMD algorithm associated with the weighted negative entropy as its Legendre function.
and therefore, . Moreover, for the Bregman projection step we have
This is a convex minimization over a convex set. To find a closed form solution, we look at the Lagrangian dual function and the Karush-Kuhn-Tucker (KKT) conditions . We have
As , should be the unique number such that , and then . So, Algorithm 1 is equivalent to OMD with weighted negative entropy as its Legendre function.
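For reference, the calculation behind this equivalence can be sketched as follows. The notation is assumed for this sketch: $c_i$ is the range of expert $i$, $p_t$ the current distribution, $w_{t+1}$ the unprojected weights, $g_t$ the reward vector, $\eta$ the learning rate, and $\lambda$ the Lagrange multiplier of the simplex constraint. The weighted negative entropy is taken in the form $\sum_i c_i x_i \ln x_i$; adding terms that are linear in $x$ would change neither the update nor the divergence.

```latex
% Sketch under the assumed notation described in the lead-in.
\begin{align*}
F(x) &= \sum_i c_i\, x_i \ln x_i,
  \qquad \nabla F(x)_i = c_i\,(1 + \ln x_i), \\
\nabla F(w_{t+1}) &= \nabla F(p_t) + \eta\, g_t
  \;\Longrightarrow\;
  w_{t+1}(i) = p_t(i)\,\exp\!\big(\eta\, g_t(i)/c_i\big), \\
D_F(x, y) &= \sum_i c_i \Big( x_i \ln\frac{x_i}{y_i} - x_i + y_i \Big), \\
0 &= \frac{\partial}{\partial x_i}
  \Big[ D_F(x, w_{t+1}) + \lambda \Big( \sum_j x_j - 1 \Big) \Big]
  = c_i \ln\frac{x_i}{w_{t+1}(i)} + \lambda \\
  &\Longrightarrow\; p_{t+1}(i) = w_{t+1}(i)\,\exp(-\lambda/c_i),
  \qquad \text{with } \lambda \text{ chosen so that } \sum_i p_{t+1}(i) = 1 .
\end{align*}
```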
By combining Lemma 1, Corollary 2.4.2 and finally Lemma 8 we prove the following regret bound for the MSMW algorithm. We encourage the reader to also look at the appendix, Section A.2, for an extra proof using first principles.
For any initial distribution over , and any learning rate parameter , and any benchmark distribution over , the MSMW algorithm satisfies that:
Proof of Proposition 2.4.2. We have:
By applying the regret bound of OMD (Lemma 1) to upper-bound the RHS, we have
To bound the first term in regret, a.k.a local norm, we have:
Note that because and . By for and that , the above is upper bounded by . We can also rewrite the second term in regret. In fact, if we set , then
By summing the upper-bounds on each term of local norm in (13) for and putting all the pieces together, we get the desired bound.
2.5 Regret analysis for non-negative rewards
There exists an algorithm for the full-information multi-scale online learning problem that takes as input any distribution over , the ranges and a parameter , and satisfies:
By , the second term on the RHS is upper bounded as:
Similarly, by , the third term on the RHS is upper bounded as
Finally, note that for all in reward-only instances. So the LHS is lower bounded by
Putting all this together, we get that
The theorem then follows by choosing and rearranging terms.
2.6 A canonical application: online single buyer auction
The simple auction design problem that we consider is as follows. There is a seller with infinite identical copies of an item. Buyers arrive over time. At each round, the seller picks a price and the arriving buyer reports her value. If the value is no less than the price, the trade happens; money goes to the seller and the copy of the item goes to the arriving buyer. The goal is to maximize the revenue of the seller.
Formally, we look at this problem as an instance of the full information multi-scale online learning framework. The action set is . (Here, we allow an infinite action set; later, we show how to discretize to get around this issue.) The reward function is such that at round the adversary (i.e. the arriving buyer) picks a value and for any price picked by the seller (i.e. the algorithm), the reward is . This is a full information setting, because the value is revealed to the algorithm after each round .
The additive/multiplicative approximation.
In order to obtain a -approximation of the optimal revenue, i.e. the revenue of the best fixed price in hindsight, it suffices to consider prices of the form for . As a result, we reduce the online single buyer auction problem to the multi-scale online learning with full information and finite actions. The action set has actions whose ranges form a geometric sequence , .
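The discretization step can be illustrated with a short Python sketch. It assumes values in the interval from 1 to an upper bound `h` and grid prices that are powers of `1 + eps`; these specifics, the function names and the toy data are assumptions made for illustration. Rounding any price down to the grid keeps every sale and loses at most a factor of `1 + eps` in price.

```python
import math
from typing import List, Sequence


def price_grid(h: float, eps: float) -> List[float]:
    """Prices (1 + eps)^j for j = 0, 1, ..., up to the largest power not above h."""
    k = int(math.floor(math.log(h) / math.log1p(eps)))
    return [(1.0 + eps) ** j for j in range(k + 1)]


def fixed_price_revenue(price: float, values: Sequence[float]) -> float:
    """Revenue of posting `price` to every buyer; a sale needs value >= price."""
    return sum(price for v in values if v >= price)


def best_revenue(prices: Sequence[float], values: Sequence[float]) -> float:
    """Revenue of the best fixed price among the given candidates."""
    return max(fixed_price_revenue(p, values) for p in prices)


# Hypothetical data: values between 1 and h = 100, accuracy eps = 0.1.
values = [1.0, 3.0, 3.0, 7.5, 20.0]
grid = price_grid(h=100.0, eps=0.1)
opt = best_revenue(values, values)      # some observed value is an optimal price
opt_grid = best_revenue(grid, values)
print(len(grid), opt, opt_grid)
```

Each grid price is then an action whose reward is at most the price itself, so the ranges of the actions form a geometric sequence.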
Recall the definition of in Section 2.1, and let be the best fixed price in hindsight, which is the price that achieves . We now show how to get a multiplicative cum additive approximation for this problem with as the benchmark, à la Blum et al. (2004); Blum and Hartline (2005). The main improvement over these results is that the additive term scales with the best price rather than . There is an algorithm for the online single buyer auction problem that takes as input a parameter , and satisfies , where:
Also, even if is not known up front, there is a (slightly modified) algorithm that achieves a similar approximation guarantee for online single buyer auction with:
Proof of Theorem 2.6.
[Part 1: known ] Recall the above formulation of the problem as an online learning problem with full information. The proof then follows from Theorem 2.2 by taking the prior distribution to be the uniform distribution over the actions, i.e., the discretized prices.
[Part 2: unknown ] When is not known up front, we consider a variant of our algorithm (Algorithm 2) that picks the next price in each round from the set of relevant prices (denoted by ), updates this set if necessary, and then updates the weights of prices in this set as in Algorithm 1. The main new idea here is to update the set of prices so that it only includes prices that are at most the highest value we have seen so far (let the highest seen value be at the beginning). Now, for the sake of analysis, consider a hypothetical algorithm (called ) that considers a countably infinite action space comprising all prices of the form , for . We first show this hypothetical algorithm satisfies the required approximation guarantee in Theorem 2.6. We then show the expected revenue of Algorithm 2 is at least the expected revenue of (minus a constant that is negligible in our bound), and hence the final proof.
The proof of the regret bound of Theorem 2.2 works when we have countably many actions (although we cannot implement such algorithms directly). Now, consider simulating and let the prior distribution be such that for any price , (this choice will become more clear later in the proof; in short we need to be proportional to ). The approximation guarantee in Theorem 2.6 then follows by Theorem 2.2. We now argue the following:
For any price , consider the first time a value at least shows up. Algorithm 2 suffers a loss of at most compared to , due to ’s probability of playing in that round, where is the probability of playing in the initial distribution. This is because the probability that plays in this round is at most as has not got any positive gains before this round.
Then, by choosing to be inversely proportional to , we can show that Algorithm 2 has an additive loss of compared to , where is the normalization constant of the initial distribution . This finishes the proof.
Bounds on the sample complexity of auctions for single buyer problem (Huang et al., 2015a) imply that the first bound in this theorem is tight up to factors: the lower bound is in an instance where is actually equal to . Also, the best upper bound known is by Blum et al. (2004); Blum and Hartline (2005), which is
We conclude that Theorem 2.6 generalizes the known tight sample complexity upper-bound for the offline single buyer Bayesian revenue maximization to the online adversarial setting.
3 Multi-Scale Online Learning with Bandit Feedback
In this section, we look at the bandit feedback version of multi-scale online learning framework proposed in Section 2.1. Essentially, the only difference here is that after the algorithm picks an arm at time , it only observes the obtained reward, i.e. , and does not observe the entire reward function .
Inspired by the online stochastic mirror descent algorithm (Bubeck, 2011), we introduce the Bandit-MSMW algorithm. Our algorithm follows the standard bandit route of using unbiased estimators for the rewards in a full information strategy (in this case MSMW). We also mix the MSMW distribution with an extra uniform exploration, and use a tailored initial distribution to obtain the desired multi-scale regret bounds.
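The standard bandit route mentioned above can be sketched as follows. Algorithm 3 is not reproduced in this text, so this is only a generic illustration under stated assumptions: the sampling distribution is the full-information distribution mixed with uniform exploration at rate `gamma`, and the reward estimate is the usual importance-weighted one. All names are hypothetical.

```python
from typing import List, Sequence


def mix_with_uniform(p: Sequence[float], gamma: float) -> List[float]:
    """Sampling distribution: (1 - gamma) * p plus gamma times uniform."""
    n = len(p)
    return [(1.0 - gamma) * pi + gamma / n for pi in p]


def estimated_rewards(played: int, reward: float, q: Sequence[float]) -> List[float]:
    """Importance-weighted estimate: reward / q[played] on the played arm, 0 elsewhere."""
    return [reward / q[i] if i == played else 0.0 for i in range(len(q))]


def expected_estimate(g: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact expectation of the estimate when the played arm is drawn from q."""
    n = len(g)
    expectation = [0.0] * n
    for played in range(n):
        estimate = estimated_rewards(played, g[played], q)
        for i in range(n):
            expectation[i] += q[played] * estimate[i]
    return expectation


# Hypothetical instance: three arms with very different reward scales.
p = [0.7, 0.2, 0.1]
g = [1.0, 5.0, 40.0]
q = mix_with_uniform(p, gamma=0.1)
print(q)
print(expected_estimate(g, q))  # matches g up to rounding: the estimator is unbiased
```

The uniform mixing keeps every sampling probability at least `gamma / n`, which bounds the size of the importance-weighted estimates that are then fed to the full-information update.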
3.1 Bandit multi-scale regret bounds
For the bandit version, we can get similar regret guarantees as in Section 2.2 for the full-information variant, but only for the best action. If we require the regret bound to hold for all actions, then we can only get a weaker bound, where the second term has instead of . The difference between the bounds for the bandit and the full information setting is essentially a factor of , which is unavoidable. There exists an algorithm for the online multi-scale problem with bandit feedback that takes as input the ranges , and a parameter , and satisfies,
for all ,
Also, one can compute the pure-additive versions of the bounds in Theorem 3.1 by setting and respectively (Corollary 3.1), and compare with the pure-additive regret bound for the adversarial multi-armed bandit problem (Audibert and Bubeck, 2009; Auer et al., 1995). There exist algorithms for the online multi-scale bandits problem that satisfy,
For all ,
3.2 Bandit Multi-Scale Multiplicative Weight (Bandit-MSMW) algorithm
We present our Bandit algorithm (Algorithm 3) when the set of actions is finite (with ). Let be the learning rate and be the exploration probability. We show the following regret bound.
For any exploration probability and any learning rate parameter , the Bandit-MSMW algorithm achieves the following regret bound when the gains are non-negative:
Proof of Lemma 3.2. We further define:
In expectation over the randomness of the algorithm, we have:
for any .
Hence, to upper bound , it suffices to upper bound .
By the definition of the probability that the algorithm picks each arm, i.e., , we have: