# Waterfall Bandits: Learning to Sell Ads Online

A popular approach to selling online advertising is the waterfall, where a publisher makes sequential price offers to ad networks for an inventory, and chooses the winner in that order. The publisher picks the order and prices to maximize her revenue. A traditional solution is to learn the demand model first and then solve the optimization problem for the learned model. This approach incurs linear regret. We design an online learning algorithm for solving this problem, which interleaves learning and optimization, and prove that this algorithm has sublinear regret. We evaluate the algorithm on both synthetic and real-world data, and show that it quickly learns high-quality pricing strategies. This is the first principled study of learning a waterfall design online by sequential experimentation.

## 1 Introduction

Online publishers typically generate revenue by placing advertisements. For example, when a user visits a webpage, there are locations called slots each of which may have an impression of an advertisement (ad).

• A slot may be sold directly to a specific brand advertiser. In that case, when a user arrives at the webpage, the publisher calls the advertiser and places the ad returned as the impression.

• A slot may be sold via third parties such as Google’s DoubleClick Ad Exchange. In this case, when a user arrives at the webpage, the publisher calls the ad exchange which in turn calls many intermediaries called ad networks. Each ad network has several advertisers as its customers and bids on behalf of one of its chosen customers. The ad exchange runs an auction among the bids and returns the winner to the publisher which becomes the ad impression for the user.

• A slot may be sold directly to different ad networks. In this case, publishers typically run what is called a waterfall. In the waterfall, the publisher chooses a permutation of the ad networks. The publisher calls each ad network sequentially according to the permutation and offers a price. The ad network has to bid above that price to win the opportunity to place the ad at that slot. If the ad network does not make an adequate bid, the slot is offered to the next ad network and so on. The publisher gets to choose the permutation and reserve prices.

The three methods above trade off control, margins, and relationships among publishers, ad networks, and advertisers. Publishers often combine these methods. For example, a publisher might use direct deals for premium slots (such as the top of a homepage), waterfall variants for torso inventory, and ad exchanges for tail or remnant slots that were not sold by the other methods. Readers who want more background on the ad business and the role of waterfalls can consult [19, 17] or DoubleClick’s support pages.

In this paper, we address the central question of how publishers can design the waterfall. We formalize this problem as learning the optimal order of ad networks with their offered prices. Our objective is to maximize the expected revenue of the publisher online in a sequence of steps, which is equivalent to minimizing the expected regret with respect to the best solution in hindsight.

• We formalize and study the problem of publisher learning and optimizing ad revenue from waterfall design as an online learning problem with partial feedback.

• We develop a bandit-style solution and propose a computationally-efficient UCB-like algorithm for this problem, which we call WaterfallUCB1. Our learning problem is challenging for two reasons. First, the space of feasible solutions, all permutations of ad networks and their offered prices, is exponentially large. Second, our problem suffers from partial feedback, which is similar to that in cascading bandits [11, 12]. In particular, if an ad network accepts an offer, the learning agent does not learn if any of the subsequent ad networks would have accepted their offered prices.

• We prove an upper bound on the expected $n$-step regret of WaterfallUCB1. The upper bound is sublinear in $n$ and polynomial in all other quantities of interest. The key step in our analysis is a new regret decomposition, which is of independent interest beyond our motivating domain of online advertising.

• We conduct extensive experiments on both synthetic and real-world data, which show that WaterfallUCB1 learns high-quality solutions. In addition, we investigate several practical settings that are encountered by publishers:

• Publishers typically have many web pages with multiple ad slots per page. We show with real-world data that waterfall learning for all ad slots yields good solutions when the ad networks behave similarly across the ad slots.

• Instead of going directly to ad networks, publishers can go to third parties that aggregate over ad networks. We show our algorithm can also learn to sell in this setting and it does not overfit.

Taken together, the above represents the first principled study of publisher revenue when using waterfall to optimize ad placement.

## 2 Selling in the Waterfall

The problem of selling one ad slot in the waterfall can be formalized as follows. Let be a set of ad networks. Let be a set of prices, where for all . We use discrete prices as in [6]; they are a first reasonable approach for the waterfall. We will discuss this further in Section 5. Then any instance of our problem can be defined by a tuple , where

is a probability distribution over the valuation of ad network

. Without loss of generality, we assume that all prices in are in , and that the support of is a subset of for all . We assume that the valuation of any ad network , , is drawn independently from the valuations of all other ad networks.

The publisher sells to the ad networks as follows. First, it chooses a permutation of the ad networks and offered price for each ad network , where is the set of all permutations of . Then the publisher contacts the ad networks sequentially, from to , and tries to sell the ad slot to them. In particular:

• Ad network is contacted first.

• If ad network is contacted and , the offered price is lower than or equal to the valuation of ad network , the offer is accepted. Then the publisher earns and does not contact any of the remaining ad networks.

• If ad network is contacted and , the offered price is higher than the valuation of ad network , the offer is rejected. Then the publisher contacts ad network if . If , the publisher does not sell the ad slot and earns zero.
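The selling protocol above can be sketched as a short simulation. This is a minimal illustration rather than the authors' code: the function name `run_waterfall`, the string network identifiers, and the example valuations are all hypothetical. It also records which offers were actually observed, since networks after the first acceptance are never contacted.

```python
from typing import Dict, List, Tuple

Observation = Tuple[str, float, bool]  # (ad network, offered price, accepted?)


def run_waterfall(
    order: List[str],
    prices: List[float],
    valuations: Dict[str, float],
) -> Tuple[float, List[Observation]]:
    """Contact ad networks in `order`, offering prices[i] to order[i].

    Returns the publisher's revenue and the observations for the
    networks that were actually contacted.
    """
    observed: List[Observation] = []
    for network, price in zip(order, prices):
        # An offer is accepted when the price does not exceed the valuation.
        accepted = price <= valuations[network]
        observed.append((network, price, accepted))
        if accepted:
            # The slot is sold; the remaining networks are not contacted.
            return price, observed
    # Every offer was rejected: the slot is unsold and the revenue is zero.
    return 0.0, observed


# Hypothetical example: "a" rejects 0.8, "b" accepts 0.5, "c" is never contacted.
revenue, observed = run_waterfall(
    order=["a", "b", "c"],
    prices=[0.8, 0.5, 0.2],
    valuations={"a": 0.6, "b": 0.7, "c": 0.9},
)
```

In this example the publisher earns 0.5 and learns nothing about network "c", which is the partial feedback that the learning problem has to cope with.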

We denote by the action of the publisher. The set of feasible actions is . For any ad network and price , we define acceptance probability , the probability that ad network accepts price under valuation distribution . We refer to any pair of the ad network and price, for and , as an item; and define the set of all items as . Note that . For any action and weight function , we define

$$f(A,u)=\sum_{i=1}^{K}\left[\prod_{k=1}^{i-1}\bigl(1-u(a_k,p_k)\bigr)\right]u(a_i,p_i)\,p_i. \tag{1}$$

This is the expected revenue of the publisher under action and acceptance probabilities . In particular, assuming the valuations of ad networks are independent, is the probability that all ad networks before ad network do not accept their offered prices, which is equal to the probability that is contacted. Moreover, is the conditional probability of ad network accepting its offered price after it is contacted. The objective of the publisher is to maximize its expected revenue by choosing ,

$$A^{*}=\arg\max_{A\in\mathcal{A}} f(A,\bar{w}). \tag{2}$$

We refer to as the optimal solution.
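As an illustration of (1) and (2), the sketch below evaluates the expected revenue of an action and finds the maximizer by exhaustive search. The function names and the toy acceptance probabilities are assumptions made for this example only, and the exhaustive search is feasible only for very small instances.

```python
from itertools import permutations, product
from typing import Callable, Sequence, Tuple


def expected_revenue(
    order: Sequence[int],
    prices: Sequence[float],
    u: Callable[[int, float], float],
) -> float:
    """Expected revenue f(A, u) from (1) for action A = (order, prices)."""
    total = 0.0
    reach = 1.0  # probability that the current ad network is contacted
    for a, p in zip(order, prices):
        total += reach * u(a, p) * p
        reach *= 1.0 - u(a, p)
    return total


def brute_force_optimum(
    num_networks: int,
    price_set: Sequence[float],
    u: Callable[[int, float], float],
) -> Tuple[Tuple[Tuple[int, ...], Tuple[float, ...]], float]:
    """Solve (2) by enumerating every permutation and price assignment."""
    best_action, best_value = None, float("-inf")
    for order in permutations(range(num_networks)):
        for prices in product(price_set, repeat=num_networks):
            value = expected_revenue(order, prices, u)
            if value > best_value:
                best_action, best_value = (order, prices), value
    return best_action, best_value


def toy_w(a: int, p: float) -> float:
    """Hypothetical acceptance probabilities: valuations uniform on
    [0, 1] for network 0 and on [0, 0.5] for network 1."""
    return max(0.0, 1.0 - p) if a == 0 else max(0.0, 1.0 - 2.0 * p)


value = expected_revenue([0, 1], [0.5, 0.25], toy_w)
best_action, best_value = brute_force_optimum(2, [0.25, 0.5, 0.75], toy_w)
```

For the action that offers 0.5 to network 0 and then 0.25 to network 1, the value is 0.5 · 0.5 + 0.5 · 0.5 · 0.25 = 0.3125.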

### 2.1 Oracles

No polynomial-time algorithm is known for solving all instances of problem (2). However, computationally-efficient approximations exist [6, 7]. In this work, we consider approximation algorithms whose inputs are a weight function , the number of ad networks , and a set of prices ; and the output is . We say that algorithm is a -approximation for if for any .

Note that when ad networks are assigned prices, the optimal order of the ad networks is in the descending order of their assigned prices. This follows from the definition of the revenue in (1). Since the output of the approximation algorithm can always be ordered to satisfy this property, we assume without loss of generality that it is satisfied. We consider two oracles in this paper: a greedy oracle and an oracle based on linear programming (LP).
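One way to see the ordering property is a short exchange argument, sketched here for two ad networks with fixed prices. We write $u_i = u(a_i, p_i)$ as shorthand; this abbreviation is introduced for the example only. By (1),

$$\begin{aligned}
f\bigl((a_1,a_2)\bigr)-f\bigl((a_2,a_1)\bigr)
&=\bigl[u_1p_1+(1-u_1)\,u_2p_2\bigr]-\bigl[u_2p_2+(1-u_2)\,u_1p_1\bigr]\\
&=u_1u_2\,(p_1-p_2),
\end{aligned}$$

which is non-negative whenever $p_1\ge p_2$. In a longer waterfall, swapping two adjacent ad networks leaves the terms before them unchanged, and the probability of reaching the ad networks after them is $(1-u_1)(1-u_2)$ times a common factor in both orders, so the same difference, scaled by the probability of reaching the pair, decides which order is better.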

The pseudocode of the greedy oracle is in Algorithm 1. The oracle has two main stages. First, it assigns to each ad network the price that maximizes the expected revenue of that ad network conditioned on being contacted, . Second, it orders the ad networks in the descending order of their assigned prices. This oracle is easy to implement and performs well in our experiments. It does not have any approximation guarantee though.
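The two stages of the greedy oracle can be sketched as follows. This is an illustrative reading of the description above, not a transcription of Algorithm 1: the function name, the tie-breaking rule (first maximizing price wins), and the toy weight function are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_oracle(
    networks: Sequence[int],
    price_set: Sequence[float],
    u: Callable[[int, float], float],
) -> Tuple[List[int], List[float]]:
    """Return (order, prices) chosen greedily from weight function u."""
    assigned = []
    for a in networks:
        # Stage 1: the price that maximizes the expected revenue of this
        # ad network conditioned on it being contacted, u(a, p) * p.
        best_price = max(price_set, key=lambda p: u(a, p) * p)
        assigned.append((a, best_price))
    # Stage 2: contact ad networks in descending order of assigned price.
    assigned.sort(key=lambda pair: pair[1], reverse=True)
    order = [a for a, _ in assigned]
    prices = [p for _, p in assigned]
    return order, prices


def toy_w(a: int, p: float) -> float:
    """Hypothetical acceptance probabilities for two ad networks."""
    return max(0.0, 1.0 - p) if a == 0 else max(0.0, 1.0 - 2.0 * p)


order, prices = greedy_oracle([0, 1], [0.25, 0.5, 0.75], toy_w)
```

With these toy probabilities, network 0 is assigned price 0.5 and network 1 is assigned price 0.25, so network 0 is contacted first. Because each price is chosen in isolation, the sketch ignores how an earlier acceptance changes the value of later offers, which is consistent with the lack of an approximation guarantee.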

The pseudocode of the LP oracle is in Algorithm 2. The oracle is based on linear programming and is a -approximation algorithm [6]. The oracle has three main stages. First, it solves an LP to obtain the value of the dual variable corresponding to the last constraint . Second, it assigns to each ad network the price that maximizes . This price is denoted by . Finally, it orders the ad networks in the descending order of their assigned prices.

The optimized variables in the linear program are and , for and . The variable represents the probability that ad network is offered price . The variable represents the joint probability that ad network is offered price and accepts. The objective is the expected return. The constraints guarantee that the probabilities are consistent and non-negative.

Both discussed oracles can find high-quality strategies for selling a single ad slot in the waterfall. In practice, publishers may be interested in maximizing the revenue from all of their many ad slots. We return to these practical issues in Section 6.

## 3 Waterfall Bandit

As we discussed earlier, publishers often do not know the valuation distributions of ad networks in advance. However, since they repeatedly sell ad slots to the ad networks, they can learn them. This motivates our study of the waterfall as a multi-armed bandit (MAB) [14, 3], which we call a waterfall bandit. Formally, the waterfall bandit is a tuple , where the valuation distributions of ad networks are unknown to the publisher. Let be the stochastic valuation of ad network at time . We assume that is drawn independently from , both across ad networks and in time.

We evaluate the performance in the waterfall bandit by a form of regret, where the cumulative reward of the optimal solution is weighted by a factor of $\gamma$. In particular, the scaled $n$-step regret is defined as

$$R_{\gamma}(n)=n\gamma f(A^{*},\bar{w})-\mathbb{E}\left[\sum_{t=1}^{n}f(A_t,\bar{w})\right], \tag{3}$$

where $\gamma$ is the aforementioned scaling factor and $A^{*}$ is the optimal solution in (2). The reason for the scaling factor is that no polynomial-time algorithm is known for solving our offline optimization problem (Section 2.1). Therefore, it is unreasonable to assume that we can learn such solutions online, and it is reasonable to compete with the best offline $\gamma$-approximation. Note that the scaled $n$-step regret reduces to the standard $n$-step regret when $\gamma = 1$.

Naive Solutions. The waterfall bandit can be solved as a multi-armed bandit problem where the expected revenue of each action is estimated separately. This solution would not be statistically efficient. The reason is that the number of actions, all permutations of ad networks combined with all assignments of prices, is exponential in the number of ad networks, and so the regret of a naive unstructured solution would also be exponential in the number of ad networks.

The key structure in our learning problem is that the publisher receives feedback on individual ad networks in each action. This setting is reminiscent of stochastic combinatorial semi-bandits [9, 8, 13, 18], which can be solved statistically efficiently. The challenge is that the publisher may not receive feedback on all ad networks. More specifically, when an ad network accepts its offer, the publisher does not learn whether any ad network later in the order would have accepted its price had it been offered. Therefore, our problem cannot be formulated and solved as a stochastic combinatorial semi-bandit.

A similar form of partial feedback was studied in cascading bandits [11], where the learning agent receives feedback on a ranked list of items, for all items in the list up to the first clicked item. The difference in our setting is that Kveton et al. [11] do not consider pricing. Nevertheless, it is reasonable to assume that a similar learning algorithm, which maintains upper confidence bounds on all acceptance probabilities , for any and , could solve our problem. We present such an algorithm in Section 4.

## 4 WaterfallUCB1 Algorithm

In this section, we propose a UCB-like algorithm for the waterfall bandit, which we call WaterfallUCB1. The algorithm is presented in Algorithm 3.

The inputs to WaterfallUCB1 are an approximation oracle , the number of ad networks , and a set of prices . At each time , WaterfallUCB1 proceeds as follows. First, it computes an upper confidence bound (UCB) on the acceptance probability of all , where is the set of all ad-network and price pairs (Section 2), is the fraction of accepted offers in trials when ad network is offered price , is the number of times that ad network is offered price up to time , and is the radius of a confidence interval around after steps such that holds with a high probability. We trim at so that it can be interpreted as a probability.

After the UCBs are computed, WaterfallUCB1 computes its action at time , , using the oracle and UCBs . Then it takes that action and receives feedback . Finally, WaterfallUCB1 updates its statistics for all observed pairs of ad networks and prices, such that .
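The loop described above (compute UCBs, call the oracle, act, update the observed items) can be sketched as follows. This is a hedged illustration and not a transcription of Algorithm 3. In particular, the confidence radius `sqrt(1.5 * log(t) / T)` is the usual UCB1-style choice and is an assumption here, as are the class name, the stand-in greedy oracle, the initial UCB of 1 for unobserved items, and the uniform valuations in `demo`.

```python
import math
import random
from collections import defaultdict


def greedy_oracle(networks, price_set, u):
    """Stand-in oracle: best single-offer price per network, sorted by price."""
    assigned = [(a, max(price_set, key=lambda p: u(a, p) * p)) for a in networks]
    assigned.sort(key=lambda pair: pair[1], reverse=True)
    return [a for a, _ in assigned], [p for _, p in assigned]


class WaterfallUCB1Sketch:
    def __init__(self, networks, price_set, oracle):
        self.networks = list(networks)
        self.price_set = list(price_set)
        self.oracle = oracle
        self.pulls = defaultdict(int)   # times item (a, p) was observed
        self.mean = defaultdict(float)  # fraction of accepted offers
        self.t = 0

    def ucb(self, a, p):
        n_obs = self.pulls[(a, p)]
        if n_obs == 0:
            return 1.0  # assumption: unobserved items are fully optimistic
        # Assumed UCB1-style radius; the paper's exact constant may differ.
        radius = math.sqrt(1.5 * math.log(max(self.t, 1)) / n_obs)
        return min(self.mean[(a, p)] + radius, 1.0)  # trim at 1

    def act(self):
        self.t += 1
        return self.oracle(self.networks, self.price_set, self.ucb)

    def update(self, observed):
        # Only contacted ad networks appear in `observed` (partial feedback).
        for a, p, accepted in observed:
            key = (a, p)
            self.pulls[key] += 1
            self.mean[key] += (float(accepted) - self.mean[key]) / self.pulls[key]


def demo(steps=2000, seed=0):
    """Average per-step revenue on a hypothetical two-network problem."""
    rng = random.Random(seed)
    upper = {"a": 1.0, "b": 0.5}  # valuation of x is Uniform[0, upper[x]]
    learner = WaterfallUCB1Sketch(upper, [0.25, 0.5, 0.75], greedy_oracle)
    total = 0.0
    for _ in range(steps):
        order, prices = learner.act()
        observed = []
        for a, p in zip(order, prices):
            accepted = p <= rng.uniform(0.0, upper[a])
            observed.append((a, p, accepted))
            if accepted:
                total += p
                break  # later networks are not contacted
        learner.update(observed)
    return total / steps
```

Calling `demo()` runs the sketch for 2000 steps and returns the average revenue per step. Any oracle with the same `(networks, price_set, u)` signature, a convention of this sketch, can replace `greedy_oracle`.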

## 5 Analysis

The regret of WaterfallUCB1 is bounded in the following theorem.

###### Theorem 1.

Let be run in the waterfall bandit with a -approximation oracle . Then where is the number of ad networks and is the number of prices in .

###### Proof.

We sketch the proof of Theorem 1 below. The detailed proof is in Appendix A.

The proof proceeds as follows. First, we show that is monotone in weight function for any fixed action (Lemma A in Appendix A). Second, based on this monotonicity property, we bound the per-step scaled regret at time , , under “good event” , that all are inside of their confidence intervals at time . This novel regret decomposition is presented in Lemma 5 below.

###### Lemma 5.

Conditioned on “good event” , the per-step scaled regret at time is bounded as where , , and is the event that item is observed at time .

The proof of Lemma 5 is in Appendix A. Note that the lemma decomposes the regret at time into those of observed items. Based on the definition of “good event” and some algebra, we have On the other hand, we bound the regret under “bad event” , that at least one is outside of its confidence interval, using Hoeffding’s inequality. Specifically, we get . The bound in Theorem 1 follows directly from combining the above two inequalities. ∎

Theorem 1 provides a gap-free upper bound on the scaled $n$-step regret of WaterfallUCB1. We discuss the tightness of this bound below. The dependence on $n$ is standard in gap-free bounds in similar problems [13, 12], and it is considered close to tight. The dependence is expected, since WaterfallUCB1 estimates values, one for each ad network and price. However, linear dependence on may not be tight. We obtain it for two reasons. First, WaterfallUCB1 learns separately for each item , and does not exploit any generalization across ad networks and prices. Second, our bound is proved directly from the “self-normalization” of confidence interval radii (Appendix A), not through a gap-dependent bound as in related papers [13, 12].

Our analysis also provides a sublinear regret bound with respect to the optimal continuous-price solution. In particular, let all prices be in and suppose the publisher intends to maximize expected revenue up to time . When the prices are discretized on a uniform grid with points over , the maximum instantaneous loss due to discretization is , where is a problem-specific Lipschitz factor. Under this assumption, the scaled -step regret with respect to the optimal continuous-price solution is bounded by . Now we choose

 M=√nC/(√nlogn+γ)

and we get a regret bound.

## 6 Experiments

In this section, we empirically evaluate the effectiveness of our algorithm. We also investigate the settings where our algorithm may be deployed in practice.

### 6.1 Methods and Metrics

The input to WaterfallUCB1 is an oracle, which orders ad networks and assigns prices to them for any model (Section 2.1). We experiment with two oracles, Greedy (Algorithm 1) and LP (Algorithm 2). We compare the following offline and online approaches, where X refers to one of the aforementioned oracles:

1. Offline-X is an offline approximation algorithm. The inputs to the algorithm are all acceptance probabilities, for any and . The probabilities are used by oracle to order ad networks and assign prices to them. The ordering and prices are computed only once and used in all steps. Although this approach is unrealistic because it assumes that the acceptance probabilities are known, it is a useful baseline for evaluating revenue loss due to not knowing the dynamics of the system.

2. UCB-X is WaterfallUCB1 in Algorithm 3.

3. Exp-X is an online approximation algorithm, which explores in the first steps and then exploits [16]. In the first steps, the algorithm offers random prices to randomly ordered ad networks and collects observations. Then it estimates all acceptance probabilities from its observations. The probabilities are used by oracle to order ad networks and assign prices to them. The ordering and prices are computed in step and then used in all remaining steps. The exploration parameter tends to be small in practice because random exploration hurts experience.

The performance of all compared algorithms is evaluated by their expected -step reward,

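$$r(n)=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{K}\left[\prod_{j=1}^{i-1}\mathbb{1}\{p_{t,j}>v_{t,a_{t,j}}\}\right]\mathbb{1}\{p_{t,i}\le v_{t,a_{t,i}}\}\,p_{t,i},$$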

where is the action of the publisher at time and is the valuation of ad network at time . We choose this metric instead of the scaled regret in (3) because the optimal solution to our offline optimization problem cannot be computed efficiently (Section 2.1). The optimal solution is necessary to evaluate (3).
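The reward metric above averages the realized revenue over steps: in each step, the product of indicators selects the first ad network whose offered price does not exceed its valuation. A minimal sketch, in which the function names and the format of `history` are hypothetical:

```python
from typing import Dict, List, Sequence, Tuple

Step = Tuple[Sequence[str], Sequence[float], Dict[str, float]]


def step_reward(order, prices, valuations) -> float:
    """Realized revenue of one step: the first accepted price, or zero."""
    for a, p in zip(order, prices):
        if p <= valuations[a]:  # all earlier offers were rejected
            return p
    return 0.0


def average_reward(history: List[Step]) -> float:
    """Average realized revenue r(n) over the steps in `history`.

    Each entry is (order, prices, valuations) for one step.
    """
    if not history:
        return 0.0
    return sum(step_reward(*step) for step in history) / len(history)


# Hypothetical two-step history: the slot sells for 0.5, then goes unsold.
history = [
    (["a", "b"], [0.8, 0.5], {"a": 0.6, "b": 0.7}),
    (["a", "b"], [0.8, 0.5], {"a": 0.1, "b": 0.2}),
]
r = average_reward(history)
```

Here `r` is (0.5 + 0.0) / 2 = 0.25.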

We report the expected reward in hypothetical dollars to highlight the business value of our algorithm.

### 6.2 Synthetic Data

In this experiment, we show that the expected reward of our algorithm approaches that of the best approximation in hindsight. We also demonstrate that our algorithm outperforms Exp-X irrespective of the oracle.

We consider a synthetic problem with a single ad slot and four ad networks. The valuation of each ad network at time is drawn i.i.d. from beta distribution , which is parameterized by and . As a result, the minimum and maximum valuations of each ad network are and respectively. The valuation of ad network is high, . The valuations of the remaining three ad networks are low, for any . The learning problem is to offer a high price to the ad network , ahead of the other ad networks.

The prices in all algorithms are discretized to price levels, namely . We experiment with both Greedy and LP oracles. The number of exploration steps in Exp-X is . In this experiment, this setting yields approximately observations on average for each pair of the ad network and price.

Our results with Greedy oracle are reported in Figure 1a. We observe two major trends. First, UCB-Greedy learns quickly. In particular, its expected reward is around dollars in k steps and exceeds dollars after k steps. We note that UCB-Greedy slightly outperforms Offline-Greedy after k steps. Indeed, since Greedy oracle is not guaranteed to return the optimal solution, it is possible to learn a better approximation online than offline. Second, Exp-Greedy is consistently worse than UCB-Greedy and its expected reward is only dollars in k steps. This shows that random exploration steps in Exp-Greedy are less statistically efficient than more intelligent continuous exploration in UCB-Greedy.

The results with LP oracle are reported in Figure 1b. We observe similar trends to those in Figure 1a. One minor difference is that UCB-LP performs worse than Exp-LP in the first k steps. However, it outperforms Exp-LP after k steps and its expected reward approaches dollars in k steps. The reason UCB-LP learns more slowly than UCB-Greedy is that the linear program in UCB-LP is not sensitive to small perturbations of model dynamics. That is, minor changes in the optimistic estimates of acceptance probabilities do not affect the output of the linear program. Therefore, UCB-LP explores all parameters of the model in the descending order of prices, which is inefficient. This is because higher prices are always preferred if the acceptance probabilities at all prices do not differ much. Only when the acceptance probabilities at higher prices become lower does UCB-LP explore lower prices.

### 6.3 Publisher Insights

From the perspective of a publisher, our plots of the expected reward in the first steps can answer the following questions: (1) What is the revenue of a strategy up to step ? (2) What is the difference in revenues of strategies and up to step ?

The first question can be answered as follows. The revenue of a strategy up to step is equal to its expected reward up to step times . In Figure 1a, for instance, the expected reward of Offline-Greedy in k steps is dollars. Therefore, the revenue of Offline-Greedy in k steps is k dollars. The expected reward of UCB-Greedy in k steps is dollars. Therefore, the revenue of UCB-Greedy in k steps is k dollars. By the same line of reasoning, the revenue of Exp-Greedy in k steps is k dollars.

The second question can be answered as follows. The difference in revenues of strategies and up to step is equal to the difference of their expected rewards up to step times . We illustrate this in Figure 2. The expected rewards of UCB-Greedy and Exp-Greedy in k steps are and dollars, respectively. Therefore, the difference in their expected rewards is dollars, and the difference in their revenues in k steps is k dollars. This increase in revenue is a result of the improved statistical efficiency of UCB-Greedy relative to Exp-Greedy.

### 6.4 Real Data

#### 6.4.1 Selling a Single Ad Slot

In this experiment, we show that our algorithm can learn to sell a single ad slot whose dynamics are estimated from a real-world dataset.

We experiment with a real-world dataset of Real-Time Bidding (RTB) iPinYou [15]. This dataset contains information regarding bidding on ad slots, such as the identity of the ad slot, the winning advertiser, and the winning price. We treat each advertiser as an ad network. Perhaps surprisingly, the winning price of any advertiser on any ad slot does not change throughout the dataset. This is common in practice because many advertisers do not behave very strategically.

We estimate the valuations of ad networks as follows. Fix the ad slot. Let be the number of times that advertiser wins bidding and be its winning price, which does not change throughout the dataset. Then ad network accepts price , independently of all other ad networks, with probability

$$\bar{w}(a,p)=\begin{cases}\mathbb{1}\{p\le v_a\}\,\dfrac{n_a}{\sum_{a'}n_{a'}}, & p>0;\\[4pt] 1, & p=0;\end{cases} \tag{4}$$

where is the frequency with which advertiser wins bids. Basically, is the empirical distribution of the acceptance probability of ad network when offered price . We assume that the zero price is always accepted. This does not fundamentally change our problem and allows us to avoid boundary cases in our simulations.
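A small sketch of how the acceptance probability in (4) could be computed from per-advertiser win counts and winning prices. The function name, the dictionary-based inputs, and the example numbers are hypothetical and are not taken from the iPinYou data.

```python
from typing import Dict


def acceptance_probability(
    a: str,
    p: float,
    win_counts: Dict[str, int],
    win_prices: Dict[str, float],
) -> float:
    """Acceptance probability of ad network `a` at price `p`, as in (4)."""
    if p == 0:
        return 1.0  # the zero price is assumed to be always accepted
    if p > win_prices[a]:
        return 0.0  # offers above the winning price are never accepted
    # Otherwise accept with the frequency with which `a` wins bids.
    return win_counts[a] / sum(win_counts.values())


# Hypothetical counts for two advertisers on one ad slot.
win_counts = {"x": 30, "y": 10}
win_prices = {"x": 0.4, "y": 0.8}
w_x = acceptance_probability("x", 0.3, win_counts, win_prices)
```

In this example `w_x` is 30 / 40 = 0.75, while any price above 0.4 offered to "x" is accepted with probability zero.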

We experiment with the 20 most active ad slots in the iPinYou RTB dataset, and refer to this subset of data as Active20. Specifically, there are nine advertisers bidding on these ad slots. The prices in the dataset are in . We divide each price by in order to normalize all prices to . As in Section 6.2, all algorithms operate on discrete price levels. The only major difference from Section 6.2 is that the valuations of ad networks are distributed according to (4).

Our results on the most active ad slot in Active20 are reported in Figure 3a. We observe two major trends. First, Offline-X has the same performance irrespective of the oracle. The expected rewards of both Offline-Greedy and Offline-LP are dollars in k steps, or equivalently k dollars in revenue. Second, UCB-Greedy learns faster than UCB-LP. In particular, the expected reward of UCB-Greedy is dollars in k steps, or equivalently k dollars in revenue. The expected reward of UCB-LP is dollars in k steps, or equivalently k dollars in revenue. The difference in the revenues of two approaches in k steps is k dollars.

We also report the average performance of our algorithms on most active ad slots in Active20 in Figure 3b. These trends are extremely similar to those in Figure 3a. This experiment validates that our findings from Figure 3a are not limited to the most active ad slot, and that they apply to different ad slots.

#### 6.4.2 Selling Multiple Ad Slots

Publishers often have different pages and sell hundreds of ad slots. To facilitate operations and speed up learning, one option is to learn a single selling strategy across multiple ad slots. In this experiment, we evaluate this option. In particular, if the acceptance probabilities of ad networks do not change much with the ad slots, learning of one common model is expected to lead to much faster learning of a near-optimal policy.

Our results on most active ad slots in Active20 are reported in Figure 4a. We observe two major trends. First, the expected rewards of both UCB-X and Joint-UCB-X improve over time. The expected reward of UCB-Greedy is dollars in k steps, or equivalently k dollars in revenue. The expected reward of Joint-UCB-Greedy is dollars in k steps, or equivalently k dollars in revenue. Second, Joint-UCB-X learns faster than UCB-X. In particular, the difference in the expected rewards of Joint-UCB-Greedy and UCB-Greedy is dollars in k steps, or equivalently k dollars in revenue. This highlights a common trade-off in learning. Although Joint-UCB-Greedy learns only an approximate model, this model is easier to learn in finite time because it has times fewer parameters than UCB-Greedy. We observe the same trends with LP oracle.

Our results on all the ad slots in Active20 are reported in Figure 4b. These trends are similar to those in Figure 4a. We note that the benefits of Joint-UCB-Greedy and Joint-UCB-LP increase with .

#### 6.4.3 Selling to Aggregated Ad Networks

A common scenario is that publishers interact with third parties, which aggregate multiple ad networks. In this section, we study the impact of ad network aggregation on learning publisher revenue.

The third parties are modeled as follows. All ad networks are partitioned into groups, . The values for will be specified later. When price is offered to group , any ad network accepts the offered price with probability in (4), independently of all other ad networks. If at least one accepts, accepts. From the point of view of the publisher and our algorithms, each group is treated as an ad network.
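Under the independence assumption stated above, the probability that a group accepts a price is one minus the probability that every member rejects it. A minimal sketch, where the function name and the constant member probabilities are hypothetical:

```python
import math
from typing import Callable, Iterable


def group_acceptance_probability(
    group: Iterable[str],
    p: float,
    w: Callable[[str, float], float],
) -> float:
    """Probability that at least one ad network in `group` accepts price p,
    assuming members accept independently with probability w(a, p)."""
    all_reject = math.prod(1.0 - w(a, p) for a in group)
    return 1.0 - all_reject


# Hypothetical example: two members, each accepting with probability 0.5.
prob = group_acceptance_probability(["x", "y"], 0.3, lambda a, p: 0.5)
```

Here `prob` is 1 − 0.5 · 0.5 = 0.75, so a group is at least as likely to accept a price as any single one of its members.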

Learning with Aggregated Ad Networks.   We first show that our algorithm can learn to sell to aggregated ad networks. We also show that LP oracle leads to faster learning than Greedy oracle when the dynamics of selling is more complicated.

We set and evaluate Offline-X and UCB-X on the most active ad slot in Active20 dataset under two settings. In the first experiment, we fix six ad networks in and put the remaining three ad networks in . In the second experiment, we put six random ad networks in and the remaining three ad networks in . This experiment is repeated with random partitions.

The results of the first experiment are shown in Figure 5a. We observe one major trend. The expected reward of UCB-X converges to that of the best approximation in hindsight irrespective of the oracle. For example, the expected reward of Offline-X is around dollars in k steps, or equivalently k dollars in revenue. The expected reward of UCB-X reaches almost dollars in k steps, or equivalently k dollars in revenue. The difference in revenues is merely dollars, which indicates that UCB-X can learn a very good approximation in this experiment.

The results of the second experiment are shown in Figure 5b. We make two additional observations. First, the trends are very similar to those in Figure 5a. This shows that our algorithm UCB-X does not overfit to a specific group of ad networks. Second, algorithms with LP oracle learn slightly faster than those with Greedy oracle. For example, the expected rewards of UCB-Greedy and UCB-LP are respectively and dollars in k steps. The difference of the expected rewards is dollars in k steps, or equivalently dollars in revenue.

Publisher revenue with Aggregated Ad Networks.   Finally, we study the impact of ad network aggregation on the expected revenue of publisher.

Again, we evaluate Offline-X and UCB-X on the most active ad slot in Active20 dataset but under three different configurations:

1. Configuration 1: with group sizes of six and three.

2. Configuration 2: with group sizes of four, four and one.

3. Configuration 3: where all group sizes are one.

These configurations represent different degrees of ad network aggregation. In all the configurations, the ad networks are partitioned in a uniformly random fashion. Each configuration is repeated for times.

The results for the Greedy and LP oracles are reported in Figure 6a and Figure 6b, respectively. As in the previous experiment, we observe that our algorithms can learn to sell under all configurations of ad network aggregation. Moreover, we observe two additional interesting trends.

First, less aggregation of ad networks results in higher expected reward. Take the Greedy oracle as an example. As shown in Figure 6a, the expected rewards of Offline-Greedy and UCB-Greedy under Configuration 3 are respectively and dollars in k steps. They are both higher than the expected rewards in the other configurations, where ad networks are aggregated into groups. One explanation is that less aggregation of ad networks allows the publisher to better customize prices to ad networks, and hence the expected reward is higher.

Second, less aggregation of ad networks requires more time to find the optimal solution, especially for the algorithm UCB-LP. To illustrate this phenomenon, we run all algorithms with oracle LP for more steps (k). As shown in Figure 6b, the expected reward of UCB-LP reaches dollars in k steps when there are nine individual ad networks (Configuration 3). It exceeds the expected reward of dollars in the case of two aggregated groups (Configuration 1) and is close to dollars of three groups (Configuration 2). With less aggregation, although our algorithm statistically should be able to collect more responses per waterfall run, it needs to learn the behavior of more groups.

## 7 Related Work

Our work is at the intersection of online advertising and online learning with partial feedback.

The problem of waterfall optimization was studied before under the name of “sequential posted price mechanisms” [7, 6, 1, 4, 10]. In [7, 6, 1], the acceptance probabilities of ad networks are assumed to be known by the publisher. [4, 10] study the waterfall optimization problem in an online setting, under the assumption that all ad networks have the same acceptance probabilities. We do not make any of these assumptions.

Our work is a generalization of online learning to rank in the cascade model [11, 12]. More specifically, cascading bandits can be viewed as waterfall bandits when . This seemingly minor change has major implications. For instance, when , the optimal solution in (2) can be computed greedily. In our case, no polynomial-time algorithm is known for solving (2). From the learning point of view, we learn statistics. In cascading bandits, only statistics are learned because .

Our problem is a form of partial monitoring [5, 2], which is a harder class of learning problems than multi-armed bandits. The general algorithms in partial monitoring cannot solve our problem computationally efficiently because their computational cost is , where is exponential in the number of ad networks.

Our setting is also reminiscent of stochastic combinatorial semi-bandits [9, 8, 13, 18], which can be solved statistically efficiently by UCB-like algorithms. The difference is that our feedback is weaker than semi-bandit feedback. In particular, if an ad network accepts an offer, the learning agent does not learn whether any of the subsequent ad networks would have accepted their offered prices. In combinatorial semi-bandits, all of these events are assumed to be observed. Therefore, our problem cannot be solved as a combinatorial semi-bandit.
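The gap between this feedback and semi-bandit feedback can be made concrete with a small simulation of one waterfall run. This is a sketch under assumptions: the function name `run_waterfall`, the representation of an action as an ordered list of (network, price) pairs, and the example acceptance probabilities are all hypothetical and chosen only for illustration.

```python
import random


def run_waterfall(action, accept_prob, rng):
    """Offer prices in order and stop at the first acceptance.

    action: list of (network, price) pairs in calling order.
    accept_prob: dict mapping (network, price) to an acceptance probability.
    Returns (revenue, observations), where observations holds
    (item, accepted) only for the items that were actually offered.
    """
    observations = []
    for item in action:
        accepted = rng.random() < accept_prob[item]
        observations.append((item, accepted))
        if accepted:
            # Later networks are never called, so nothing is learned about them.
            return item[1], observations
    return 0.0, observations


# Hypothetical example: the second network always accepts, the first never does.
action = [("a", 0.9), ("b", 0.5), ("c", 0.2)]
accept_prob = {("a", 0.9): 0.0, ("b", 0.5): 1.0, ("c", 0.2): 1.0}
revenue, observations = run_waterfall(action, accept_prob, random.Random(0))
observed_items = [item for item, _ in observations]
```

In this run the item `("c", 0.2)` is never observed, whereas a semi-bandit learner would see the outcome of all three offers.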

## 8 Conclusions

For the waterfall, we propose the algorithm , a computationally and sample efficient online algorithm for learning to price, which maximizes the expected revenue of the publisher. We derive a sublinear upper bound on the -step regret of . Note that solves a general problem of learning to maximize (2) from partial feedback. Therefore, although our main focus is online advertising, the algorithm may have other applications, especially in learning to price.

We evaluate on both synthetic and real-world data, and show that it quickly learns strategies that are competitive with the best approximations in hindsight. In addition, we investigate multiple real-world scenarios that are of particular interest to publishers. We show that can learn to sell in these scenarios and that it does not overfit.

We leave open several questions of interest. For instance, note that the update of statistics in can be easily modified to leverage the following two monotonicity properties. When ad network accepts price , it would have accepted any lower price . Similarly, when ad network does not accept price , it would not have accepted any higher price . Roughly speaking, this would make more statistically efficient. However, it is non-trivial to prove that this would result in a better regret bound than that in Section 5. We leave these questions for future work.
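One possible way to exploit the two monotonicity properties is sketched below. This is a hypothetical illustration, not part of the proposed algorithm: the class name `MonotoneStats`, the finite price grid, and the simple pooled averages are assumptions, and no regret guarantee is claimed for this update.

```python
from collections import defaultdict


class MonotoneStats:
    """Acceptance counts per (network, price), shared across implied prices."""

    def __init__(self, prices):
        self.prices = sorted(prices)      # hypothetical finite price grid
        self.pulls = defaultdict(int)     # number of (implied) observations
        self.accepts = defaultdict(int)   # number of (implied) acceptances

    def update(self, network, price, accepted):
        for q in self.prices:
            # Acceptance at `price` implies acceptance at every q <= price;
            # rejection at `price` implies rejection at every q >= price.
            implied = (accepted and q <= price) or (not accepted and q >= price)
            if implied:
                self.pulls[(network, q)] += 1
                self.accepts[(network, q)] += int(accepted)

    def mean(self, network, price):
        n = self.pulls[(network, price)]
        return self.accepts[(network, price)] / n if n else None


stats = MonotoneStats([0.2, 0.5, 0.8])
stats.update("a", 0.5, True)    # also counts as an acceptance at 0.2
stats.update("a", 0.5, False)   # also counts as a rejection at 0.8
```

A single offer thus updates several items at once, which is the source of the hoped-for statistical gain.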

## References

• [1] Marek Adamczyk, Allan Borodin, Diodato Ferraioli, Bart De Keijzer, and Stefano Leonardi. Sequential posted-price mechanisms with correlated valuations. ACM Transactions on Economics and Computation (TEAC), 5(4):22, 2017.
• [2] Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: Finite parameter space. IEEE Transactions on Automatic Control, 34(3):258–267, 1989.
• [3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
• [4] Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg, and Aleksandrs Slivkins. Dynamic pricing with limited supply. ACM Transactions on Economics and Computation (TEAC), 3(1):4, 2015.
• [5] Gabor Bartok, Navid Zolghadr, and Csaba Szepesvari. An adaptive algorithm for finite stochastic partial monitoring. In Proceedings of the 29th International Conference on Machine Learning, 2012.
• [6] Tanmoy Chakraborty, Eyal Even-Dar, Sudipto Guha, Yishay Mansour, and S Muthukrishnan. Approximation schemes for sequential posted pricing in multi-unit auctions. In WINE, pages 158–169. Springer, 2010.
• [7] Shuchi Chawla, Jason D Hartline, David L Malec, and Balasubramanian Sivan. Multi-parameter mechanism design and sequential posted pricing. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 311–320. ACM, 2010.
• [8] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. CoRR, abs/1407.8339, 2014.
• [9] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.
• [10] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on, pages 594–605. IEEE, 2003.
• [11] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
• [12] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems 28, pages 1450–1458, 2015.
• [13] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
• [14] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
• [15] Hairen Liao, Lingxiao Peng, Zhenchuan Liu, and Xuehua Shen. iPinYou global RTB bidding algorithm competition dataset. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–6. ACM, 2014.
• [16] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
• [17] Satya Vinnakota. In your waterfall - how publishers monetise their ad inventory, June 2017.
• [18] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
• [19] Maciej Zawadzinski. What is waterfalling and how does it work?, September 2016.

## Appendix A: Proof of Theorem 1

We first prove that the function $f(A, \bar{w})$ is monotone in the weight function $\bar{w}$, for any fixed action $A$.

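Lemma A. Consider weight functions $\bar{w}$ and $\bar{v}$ such that $\bar{w}(e) \le \bar{v}(e)$ for all $e \in E$. If the items of action $A$ are sorted in descending order of prices, then

$$f(A, \bar{w}) \le f(A, \bar{v}) \tag{5}$$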
###### Proof.

We prove Lemma A by induction on $K$, the number of ad networks.

Induction base: We first prove that (5) holds for the case with $K = 1$. Notice that for $K = 1$,

$$f(A, \bar{w}) = p_1 \bar{w}(a_1, p_1) \le p_1 \bar{v}(a_1, p_1) = f(A, \bar{v}) \tag{6}$$

Induction step: For any integer , we then prove that if equation 5 holds for , then it also holds for . Recall that for , . To simplify the exposition, we also define the term . Notice that

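$$\begin{aligned}
f(A, \bar{w}) &= \bar{w}(a_1, p_1)\, p_1 + \left(1 - \bar{w}(a_1, p_1)\right) f(\tilde{A}, \bar{w}) \\
&\overset{(a)}{\le} \bar{w}(a_1, p_1)\, p_1 + \left(1 - \bar{w}(a_1, p_1)\right) f(\tilde{A}, \bar{v})
\end{aligned}$$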

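where (a) follows from the induction hypothesis. Moreover, since $f(\tilde{A}, \bar{v})$ is the expected revenue of $\tilde{A}$, whose prices are all at most $p_1$, we have $f(\tilde{A}, \bar{v}) \le p_1$. Therefore,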

$$0 \le \left(p_1 - f(\tilde{A}, \bar{v})\right)\left(\bar{v}(a_1, p_1) - \bar{w}(a_1, p_1)\right) \tag{7}$$

which implies

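$$\bar{w}(a_1, p_1)\, p_1 + \left(1 - \bar{w}(a_1, p_1)\right) f(\tilde{A}, \bar{v}) \le \bar{v}(a_1, p_1)\, p_1 + \left(1 - \bar{v}(a_1, p_1)\right) f(\tilde{A}, \bar{v}) = f(A, \bar{v})$$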

As a result, $f(A, \bar{w}) \le f(A, \bar{v})$. ∎
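The monotonicity claim can also be checked numerically. The sketch below is only a sanity check, not a substitute for the proof: it unrolls the recursion used in the induction step into a closed-form sum, and the function name `expected_revenue`, the random instances, and the tolerance are assumptions made for the example.

```python
import random


def expected_revenue(prices, probs):
    """f(A, w) = sum_i [prod_{j<i} (1 - w_j)] * w_i * p_i."""
    total, reach = 0.0, 1.0  # reach = probability that network i is called
    for p, w in zip(prices, probs):
        total += reach * w * p
        reach *= 1.0 - w
    return total


rng = random.Random(0)
violations = 0
for _ in range(10000):
    k = rng.randint(1, 6)
    # Prices sorted in descending order, as the lemma requires.
    prices = sorted((rng.random() for _ in range(k)), reverse=True)
    w = [rng.random() for _ in range(k)]
    v = [wi + (1.0 - wi) * rng.random() for wi in w]  # v >= w item-wise
    if expected_revenue(prices, w) > expected_revenue(prices, v) + 1e-12:
        violations += 1

# Without the ordering assumption, monotonicity can fail:
ascending_low = expected_revenue([0.1, 1.0], [0.0, 1.0])   # skips the cheap offer
ascending_high = expected_revenue([0.1, 1.0], [1.0, 1.0])  # sells at the cheap price
```

The last two lines show why the descending order matters: raising the acceptance probability of a cheap first offer can lower the revenue when a more expensive offer follows it.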

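We now prove Theorem 1. First, we define the “bad event” $\mathcal{E}_t$ at time $t$ as the event that at least one $\bar{w}(e)$ is outside its confidence interval at time $t$,

$$\mathcal{E}_t = \left\{ \exists e \in E \ \text{s.t.}\ \left| \bar{w}(e) - \hat{w}_{T_{t-1}(e)}(e) \right| > c_{t-1,\, T_{t-1}(e)} \right\}. \tag{8}$$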

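Notice that $\bar{\mathcal{E}}_t$, the complement of $\mathcal{E}_t$, is considered as the “good event” at time $t$. Similar to [11], we define the event $G_{e,t}$ as the event that item $e = (a, p)$ is “observed” at time $t$ (i.e. ad network $a$ is called and offered price $p$ at time $t$):

$$G_{e,t} = \left\{ \exists\, 1 \le k \le K \ \text{s.t.}\ d_{t,k} = e,\ B_t \ge k \right\} \tag{9}$$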

where . In addition, we define where is the history of all actions and feedbacks until time plus the action , which is determined by under Algorithm 3. The following lemma bounds the per-step scaled regret under the “good event” :

###### Proof.

Conditioning on the event , we have . Lemma A states that implies . Then we have the following bound on :

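$$\begin{aligned}
R^{\gamma}_t &\overset{(a)}{=} \gamma f(A^*, \bar{w}) - f(A_t, \bar{w}) \\
&\overset{(b)}{\le} \gamma f(A^*, U_t) - f(A_t, \bar{w}) \\
&\le \gamma \max_{A} f(A, U_t) - f(A_t, \bar{w}) \\
&\overset{(c)}{\le} f(A_t, U_t) - f(A_t, \bar{w}),
\end{aligned} \tag{10}$$

where (a) follows from the definition of $R^{\gamma}_t$, (b) follows from Lemma A, and (c) follows from the fact that $A_t$ is computed by a $\gamma$-approximation algorithm.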

To simplify the exposition, in the rest of this proof, we use and to respectively denote and , and use to denote . Then we have

 Rγt(a)≤ K∑i=1[∏j

where (b) follows from for all under event , and (d) follows from for all . Notice that is the “item-wise” difference between the upper confidence and the mean, and is the conditional probability that the th ad network will be called. We have

 ∏j

From (11), we have

$$R^{\gamma}_t \le \sum_{i=1}^{K} \mathbb{E}_t\left[ \mathbb{1}\{G_{d_{t,i},\,t}\} \right] \phi_{(a_{t,i},\,p_{t,i}),\,t}. \tag{13}$$

We use Lemma 5 to bound as follows. Notice that . We use to refer to an item in . As discussed, all prices are less than or equal to 1; so, . Hence, we have . As a result,

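$$\begin{aligned}
\mathbb{E}\left[ \sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\} R^{\gamma}_t \right]
&\le \gamma\, \mathbb{E}\left[ \sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\} \right] \\
&\le \gamma \sum_{e \in E} \sum_{t=1}^{n} \sum_{s=1}^{t} P\left( \left| \bar{w}(e) - \hat{w}_s(e) \right| \ge c_{t,s} \right) \\
&\overset{(a)}{\le} 2\gamma \sum_{e \in E} \sum_{t=1}^{n} \sum_{s=1}^{t} \exp(-3 \log t) \\
&\le 2\gamma \sum_{e \in E} \sum_{t=1}^{n} t^{-2} \le \gamma \frac{\pi^2}{3} |E| = \gamma \frac{\pi^2}{3} MK
\end{aligned}$$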

In the above derivation, (a) follows from Hoeffding’s inequality. Notice that for all and , we have (1) under event , (2) . Based on Lemma 5, we have

 E[n∑t=11{¯Et}Rγt]≤∑e∈EE[n∑t=11{¯Et,Ge,t}ϕe,t] ≤ 2∑e∈EE[n∑t=11{¯Et,Ge,t}ct−1,Tt−1(e)]≤2√1.5logn∑e∈EE⎡⎣