Algorithms for slate bandits with non-separable reward functions

April 21, 2020 · Jason Rhuggenaath et al.

In this paper, we study a slate bandit problem where the function that determines the slate-level reward is non-separable: the optimal value of the function cannot be determined by learning the optimal action for each slot. We are mainly concerned with cases where the number of slates is large relative to the time horizon, so that trying each slate as a separate arm in a traditional multi-armed bandit would not be feasible. Our main contribution is the design of algorithms that still have sub-linear regret with respect to the time horizon, despite the large number of slates. Experimental results on simulated and real-world data show that our proposed method outperforms popular benchmark bandit algorithms.


1 Introduction

In many practical problems an agent needs to choose an action from a set where each action leads to a random reward. The objective is to devise a policy that maximizes expected cumulative rewards over a finite time horizon. Often the reward distribution is unknown, and as a consequence, the agent faces an exploration-exploitation trade-off. The multi-armed bandit problem  [13, 3] is a standard framework for studying such exploration-exploitation problems.

Many problems in the domain of web-services, such as e-commerce, online advertising and streaming, require the agent to select not only one but multiple actions at the same time. After the agent makes a choice, a collective reward characterizing the quality of the entire selection is observed. Problems of this type are typically referred to as slate bandits or combinatorial bandits  [4, 7]. In a slate bandit problem, a slate consists of a number of slots and each slot has a number of base actions. Given a particular action for each slot, a reward function defined at the slate-level determines the reward for each slate.

One example of a slate bandit problem is the design of recommender systems for a streaming service such as Netflix. When a user logs in to their account, the streaming service displays a page that recommends various movies and shows. This can be interpreted as a slate bandit problem where the slots are the different genres (e.g. comedy, romance, etc.) and the base actions are the titles in each genre. The goal is to recommend a set of titles such that the probability that the user plays something from the set is maximized.

Another example is the reserve price optimization problem on a header bidding platform (see Section 3.2 for more details). In this problem, a publisher has to select a reserve price for each partner on the header bidding platform and the revenue at the selected reserve price is stochastic. The revenue for the publisher is the maximum of the revenues of all of the partners on the header bidding platform.

Previous studies (see e.g. [11, 17, 8]) assume that the reward function at the slate level is additive, or that the expected reward at the slate level is a non-decreasing function of the expected rewards at the slot level (this is also called the monotonicity assumption). This implies that the optimal action at the slate level can be found by finding the optimal base action for each individual slot. In some applications the monotonicity assumption may be reasonable, but in others it does not hold. For example, if the slate-level reward is the maximum (or minimum) of the rewards at the slot level, then the monotonicity assumption is no longer satisfied. This is an example of a non-separable slate-level reward function. The reserve price optimization problem mentioned above is thus a concrete example of a slate bandit problem with a non-separable reward function. Another example arises in maintenance and reliability problems, where the failure time of a system is the minimum of a set of random variables.

In this paper we study slate bandits with non-separable reward functions. To the best of our knowledge, this variant of the slate bandit problem has not been studied before and existing algorithms cannot be applied. We are mainly concerned with cases where the number of slates is large relative to the time horizon, so that trying each slate as a separate arm in a traditional multi-armed bandit would not be feasible. In such cases it is not immediately clear whether sub-linear regret is possible, and we therefore study the design of algorithms that achieve sub-linear regret. We summarize the main contributions of this paper as follows:

  • To the best of our knowledge, we are the first to study slate bandits with non-separable reward functions.

  • We provide a theoretical analysis and derive problem-dependent and problem-independent regret bounds. We provide algorithms that have sub-linear regret with respect to the time horizon.

  • Experimental results on simulated data and using real-world data show that our proposed method outperforms popular benchmark bandit algorithms.

The remainder of this paper is organized as follows. In Section 2 we discuss the related literature. Section 3 provides a formal formulation of the problem. In Section 4 we present our proposed algorithms for the slate bandit problem and provide a theoretical analysis. In Section 5 we perform experiments and compare our method with baseline strategies in order to assess the quality of our proposed algorithms. Section 6 concludes our work and provides some interesting directions for further research.

2 Related Literature

The slate bandit problem has been studied in multiple prior papers, which consider different variants of the problem under different assumptions. The main variants of the slate bandit problem center around three properties: (i) whether the slot-level rewards in the slate are observed or not (the situation where the slot-level rewards are observed is often referred to as semi-bandit feedback in the literature); (ii) whether the function that determines the slate-level reward is known or not; (iii) the structural properties of the function that determines the slate-level reward.

In [11, 4, 14, 17, 5, 21, 12, 7, 20] slate bandits with semi-bandit feedback are studied. In [11, 18, 21, 12] it is assumed that the slate-level reward is an additive function of the rewards of the individual slots. In [18] the slot-level rewards are assumed to be unobserved, while [11] assumes that the slate-level reward function is known. Some papers make other structural assumptions about the slate-level reward function. In [17, 5, 14, 6] two key structural assumptions are made: a monotonicity assumption and a bounded smoothness (or Lipschitz continuity) assumption. In addition, [17, 5, 14] do not assume that the slate-level reward function is known. Instead, they assume that an approximation oracle is available.

The related work in [8] assumes neither that the slot-level rewards are observed nor that the slate-level reward function is known. It exploits a monotonicity assumption (similar to [17, 5, 14, 6]) that relates the slot-level rewards to the slate-level reward and proposes an approach based on Thompson sampling in order to balance exploration and exploitation.

The main difference between our paper and the aforementioned works is that we do not assume that the slate-level reward is additive or that the expected reward at the slate level is a non-decreasing function of the expected rewards at the slot level. However, unlike in [8], we assume that the slate-level reward function is known. Furthermore, we do not make use of approximation oracles as in [17, 5, 14, 6].

To the best of our knowledge, this variant of the slate bandit problem has not been studied before and existing algorithms cannot be applied. We are mainly concerned with cases where the number of slates is large relative to the time horizon, so that trying each slate as a separate arm in a traditional multi-armed bandit would not be feasible. In such cases it is not immediately clear whether sub-linear regret is possible. The main contribution of this paper is the design of algorithms that still have sub-linear regret with respect to the time horizon, despite the large number of slates.

3 Problem formulation

3.1 Problem definition and notation

We consider a slate bandit problem that is similar to [8] and the unordered slate bandit problem in [11]. A slate consists of $m$ slots, and slot $j$ has a set of base actions $\mathcal{A}_j$ with $|\mathcal{A}_j| = n_j$. Each action (slate) is a vector $s = (s_1, \dots, s_m)$ that specifies one base action $s_j \in \mathcal{A}_j$ for every slot $j$. The set of slates is given by $\mathcal{S} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_m$, and we write $K = |\mathcal{S}|$ for the number of slates. If action $s \in \mathcal{S}$ is selected, then the reward is a random variable $R_s$. We make the following assumptions regarding the slot-level action sets.

Assumption 1.

Without loss of generality we assume that for .

Assumption 2.

Without loss of generality we assume that for .

Given an action $s = (s_1, \dots, s_m) \in \mathcal{S}$, the random variable $R_s$ satisfies $R_s = f(X_{1, s_1}, \dots, X_{m, s_m})$, where $s_j$ is the $j$-th element of action $s$ and where $X_{j, s_j}$ for $j = 1, \dots, m$ is a random variable representing the slot-level reward of base action $s_j$ in slot $j$. We make the following assumptions regarding the slate-level reward function.

Assumption 3.

Let $j \neq k$ be two different slots. Then, $X_{j, a}$ is independent of $X_{k, b}$ for all $a \in \mathcal{A}_j$ and $b \in \mathcal{A}_k$.

Assumption 4.

The rewards are bounded such that $X_{j, a} \in [0, 1]$ for all slots $j$ and all $a \in \mathcal{A}_j$.

Assumption 5.

The function $f$ is known and satisfies $f(x_1, \dots, x_m) \in [0, 1]$ for all $(x_1, \dots, x_m) \in [0, 1]^m$.

For $s \in \mathcal{S}$ define the quantity $\mu_s = \mathbb{E}[R_s]$ and let $s^* = \arg\max_{s \in \mathcal{S}} \mu_s$. The optimality gap for action $s$ is defined as $\Delta_s = \mu_{s^*} - \mu_s$. Define $\Delta = \min_{s \neq s^*} \Delta_s$. Here $\Delta$ measures the optimality gap between the best action and the second-best action. We assume that the optimality gaps satisfy $\Delta_s \geq \Delta > 0$ for all $s \neq s^*$. This assumption enforces that the optimality gap is bounded away from zero and ensures that the notions of ‘the best action’ and ‘the second-best action’ are well-defined.

We assume that the decisions are implemented according to the following online protocol: for each round $t = 1, \dots, T$,

  1. the agent selects a slate $s^t \in \mathcal{S}$.

  2. the agent observes the slot-level rewards $X^t_{j, s^t_j}$ for $j = 1, \dots, m$. The agent receives the reward $R_{s^t}$, where $R_{s^t} = f(X^t_{1, s^t_1}, \dots, X^t_{m, s^t_m})$. The rewards are independent over the rounds.

For a fixed sequence of selected actions $s^1, \dots, s^T$, the pseudo-regret over $T$ rounds is defined as $\mathcal{R}_T = \sum_{t=1}^{T} \left( \mu_{s^*} - \mu_{s^t} \right)$. The expected pseudo-regret is defined as $\mathbb{E}[\mathcal{R}_T]$, where the expectation is taken with respect to possible randomization in the selection of the actions $s^1, \dots, s^T$.

The slate bandit problem is challenging due to the number of actions growing exponentially in the number of slots $m$, and due to the non-separable reward function, which implies that a local optimization of slate rewards does not necessarily yield a global optimum. That is, $\mu_s$ cannot necessarily be maximized by choosing, for each slot, the base action with the highest expected slot-level reward. Note that we allow for an arbitrary function $f$ in Assumption 5 and that the reward distributions at the slot level can also be arbitrary (as long as they are bounded in $[0, 1]$). Existing papers assume that $f$ is an additive function or that $f$ satisfies a monotonicity property. In Example 1 below we give a concrete example which shows that existing algorithms that exploit this monotonicity property (such as [8, 17, 5, 14, 6]) can fail to learn the best slate. Therefore, existing algorithms are in general not guaranteed to solve our problem. Assumption 3 may seem restrictive, but even under this assumption the problem is still non-trivial and, to the best of our knowledge, there are no existing algorithms that solve it. Note that Example 1 shows that, even under Assumption 3, existing algorithms can fail to learn the best slate.

Example 1.

Consider a simple instance of the slate bandit problem with $m = 2$ slots. For each slot, there are 2 base actions, so there are 4 slates in total. The slot-level reward of each base action follows a uniform distribution (one distribution per slot and base action), and the rewards at the slate level are obtained by applying the slate-level reward function to the slot-level rewards.

Existing algorithms in [8, 17, 5, 14, 6] make a monotonicity assumption. This assumption states that if the vector of mean rewards of the slots in a slate (say slate $s$) dominates the vector of mean rewards of the slots in another slate (say slate $s'$), then the expected reward of slate $s$ is at least as high as the expected reward of slate $s'$.

Note that the slot-level mean rewards follow directly from the properties of the uniform distribution. In this example, the vector of slot-level mean rewards of one slate dominates that of another slate, so the monotonicity assumption used in [8, 17, 5, 14, 6] implies that the dominating slate should have an expected slate-level reward that is at least as high as that of the dominated slate. However, it can be shown that the dominated slate has the strictly higher expected slate-level reward, so this implication is false. Existing algorithms that rely on the monotonicity assumption are therefore not guaranteed to learn the best action in this slate bandit problem.
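A small numerical sketch of this phenomenon, using hypothetical uniform slot-level distributions (the values below are illustrative assumptions, not necessarily those of Example 1) and the non-separable choice $f = \max$:

```python
# Numerical check: componentwise dominance of slot-level means does not
# imply dominance of the expected slate-level reward when f = max.
# The uniform supports below are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Slate s: both slots draw from U(0.45, 0.55); slot means are (0.50, 0.50).
# Slate s': slot 1 draws from U(0.00, 0.90), slot 2 from U(0.45, 0.55);
# slot means are (0.45, 0.50), so s dominates s' componentwise.
s_slot1 = rng.uniform(0.45, 0.55, n)
s_slot2 = rng.uniform(0.45, 0.55, n)
sp_slot1 = rng.uniform(0.00, 0.90, n)
sp_slot2 = rng.uniform(0.45, 0.55, n)

# Slate-level rewards under f = max.
mean_s = np.maximum(s_slot1, s_slot2).mean()     # approx 0.517
mean_sp = np.maximum(sp_slot1, sp_slot2).mean()  # approx 0.589

# Monotonicity would require mean_s >= mean_sp, but the dominated
# slate s' has the higher expected slate-level reward.
print(mean_s, mean_sp)
```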

3.2 Example application: reserve price optimization and header bidding

One of the main mechanisms that web publishers use in online advertising in order to sell their advertisement space is the real-time bidding (RTB) mechanism [19]. In RTB there are three main platforms: supply side platforms (SSPs), demand side platforms (DSPs) and an ad exchange (ADX) which connects SSPs and DSPs. The SSPs collect inventory of different publishers and thus serve the supply side of the market. Advertisers that are interested in showing online advertisements are connected to DSPs. A real-time auction decides which advertiser is allowed to display its ad and the amount that the advertiser needs to pay. Most of the ad inventory is sold via second-price auctions with a reserve price [19, 15]. In this auction, the publisher specifies a value (the reserve price) which represents the minimum price that he wants for the impression. The revenue for the publisher (at a particular reserve price) is random and depends on the highest bid ($b^{(1)}_t$) and second-highest bid ($b^{(2)}_t$) in the auction. The revenue of the publisher in round $t$ at reserve price $r$ is given by $\mathbb{1}\{b^{(1)}_t \geq r\} \cdot \max(b^{(2)}_t, r)$.

In header bidding (see e.g. [10]), the publisher can connect to multiple SSPs for a single impression. The publisher specifies a reserve price for each SSP and each SSP runs a second-price auction. After the SSPs run their auctions, they return a value to the header bidding platform indicating the revenue for the publisher if the impression is sold on that particular SSP. The slate bandit problem studied in this paper can be used to model a reserve price optimization problem on a header bidding platform. The connection is as follows. There are $m$ SSPs on the header bidding platform. In every round the publisher needs to choose a vector of reserve prices, one price per SSP, each from a finite set of candidate prices. The revenue on the header bidding platform under the chosen vector of reserve prices is the maximum of the revenues returned by the individual SSPs. Note that Assumption 3 is reasonable in this setting since (i) the pool of advertisers and their bidding strategies can differ across DSPs, (ii) advertisers do not observe the bids (of their competitors) on other DSPs, and (iii) SSPs can be connected to different DSPs.
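To make the mapping to the slate bandit concrete, the sketch below computes the publisher's revenue for one round under a chosen vector of reserve prices, assuming each SSP runs a standard second-price auction with a reserve; the bid values and reserve prices are hypothetical illustrations.

```python
# Sketch: publisher revenue on a header bidding platform for one round,
# assuming each SSP runs a second-price auction with a reserve price.
# Bid values and reserve prices below are hypothetical illustrations.
from typing import List, Tuple


def ssp_revenue(b1: float, b2: float, reserve: float) -> float:
    """Revenue of a single second-price auction with reserve `reserve`,
    given the highest bid `b1` and the second-highest bid `b2`."""
    if b1 < reserve:
        return 0.0           # no sale: the highest bid is below the reserve
    return max(b2, reserve)  # winner pays the larger of second bid and reserve


def slate_revenue(bids: List[Tuple[float, float]], reserves: List[float]) -> float:
    """Slate-level reward: the best revenue over all SSPs, i.e. a
    non-separable (max-type) function of the slot-level revenues."""
    return max(ssp_revenue(b1, b2, r) for (b1, b2), r in zip(bids, reserves))


# Example round with m = 3 SSPs: (highest bid, second-highest bid) per SSP.
bids = [(1.20, 0.80), (0.90, 0.60), (2.00, 0.40)]
reserves = [1.00, 0.70, 1.50]          # one reserve price per SSP (the slate)
print(slate_revenue(bids, reserves))   # -> max(1.00, 0.70, 1.50) = 1.50
```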

4 Algorithms and Analysis

4.1 The ETC-SLATE algorithm

In this section we discuss our proposed algorithm, which we refer to as ETC-SLATE (Explore-Then-Commit slate bandit algorithm). The main idea underlying the algorithm is to exploit Assumption 3, the independence of the slot-level rewards. This is best illustrated using an example.

Example 2.

Consider a simple instance of the slate bandit problem where there are $m = 3$ slots, each with two base actions, so that there are $2^3 = 8$ slates in total.

Suppose that, for every slate $s$, we want to have $n$ i.i.d. (independent and identically distributed) samples from $R_s$. The straightforward way to do this is to collect the samples by selecting every slate exactly $n$ times. Thus you would need $8n$ samples in total.

A more efficient approach is to select only the two slates $(1, 1, 1)$ and $(2, 2, 2)$ exactly $n$ times each and save the observed slot-level rewards of all six (slot, base action) pairs. By Assumption 3, we can use these samples to obtain i.i.d. samples from $R_s$ for all $s \in \mathcal{S}$. To get an i.i.d. sample from, say, $R_{(1, 2, 2)}$, we simply combine a saved sample of base action 1 in slot 1 with saved samples of base action 2 in slots 2 and 3, and apply $f$. Note that this approach only requires $2n$ samples in total, which is less than the $8n$ samples of the previous approach. Note in particular that this approach allows us to obtain samples for actions that have not been selected. In our example, action $(1, 2, 2)$ was never selected; however, by selecting actions $(1, 1, 1)$ and $(2, 2, 2)$ we do obtain the information needed to construct an artificial i.i.d. sample from $R_{(1, 2, 2)}$.
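A minimal sketch of this reuse step, assuming slot-level rewards are stored per (slot, base action) pair; the distributions, the choice $f = \max$ and the example slate are hypothetical illustrations.

```python
# Sketch: constructing artificial i.i.d. slate-level samples from stored
# slot-level samples, relying on independence across slots (Assumption 3).
# Distributions, f, and the example slate are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 1000                       # m slots, n exploration samples per slate

# Exploring only the two "diagonal" slates (1,1,1) and (2,2,2) already yields
# n stored samples for every (slot, base action) pair; simulated here.
stored = {(j, a): rng.uniform(0, 1, n) for j in range(m) for a in (1, 2)}

def f(x):                            # slate-level reward function (here: max)
    return np.max(x, axis=0)

def artificial_samples(slate):
    """n artificial i.i.d. samples of R_slate, built from stored slot samples."""
    return f(np.stack([stored[(j, slate[j])] for j in range(m)]))

# Samples for a slate that was never selected during exploration:
print(artificial_samples((1, 2, 2)).mean())
```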

The pseudo-code for ETC-SLATE is given by Algorithm 1. The main idea is to divide the horizon into two phases. The first phase (the exploration phase) has length $T_0$ and the second phase (the commit phase) has length $T - T_0$. In the first phase, the algorithm determines the empirically best action $\hat{s}$ in the action set. In the second phase, the algorithm commits to using action $\hat{s}$ in each round.

In the first phase, the algorithm takes a subset of actions from the action set and selects each action in this subset $n$ times. Each time an action is selected, the rewards of the slots are observed (Line 6) and stored for later use (Line 7). In Lines 11-16, the stored rewards for the slots are used to generate $n$ i.i.d. samples of the slate-level reward $R_s$ for every action $s$ in the action set. In Line 17, the empirical mean of these samples is determined for each action $s$, and the action $\hat{s}$ is chosen as the action with the highest empirical mean. The value of $n$ is determined by the inputs of the algorithm: the horizon $T$, the remaining parameters, and the action sets. In Sections 4.2 and 4.3 we show that this choice of $n$ leads to sub-linear regret for suitably chosen parameter values.

0:  Input: horizon $T$, the remaining parameters, and the slot-level action sets.
1:  Set the number of exploration samples $n$. Set the exploration length $T_0$.
2:  Set the subset of actions to be explored and initialize the storage for the slot-level rewards.
Explore Phase.
3:  for each action in the exploration subset do
4:     for $n$ repetitions do
5:        Select the action.
6:        Observe the rewards of the slots.
7:        Store the observed slot-level rewards.
8:        Update the counters.
9:     end for
10:  end for
11:  for each action $s$ in the action set do
12:     for $i = 1, \dots, n$ do
13:        Select one stored slot-level sample for every slot of $s$.
14:        Set the $i$-th artificial sample of $R_s$ by applying $f$ to the selected slot-level samples.
15:     end for
16:  end for
Find best arm in the action set.
17:  Find $\hat{s}$ such that its empirical mean is at least as high as that of every other action.
Commit Phase.
18:  for each remaining round do
19:     Play action $\hat{s}$.
20:  end for
Algorithm 1 ETC-SLATE
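For concreteness, the following is a minimal sketch of the explore-then-commit structure of Algorithm 1 under Assumptions 3-5. The environment interface (`env.play`), the covering subset, the exploration-length rule `n`, and the choice `f = max` in the usage example are illustrative assumptions rather than the exact choices of Algorithm 1.

```python
# Sketch of the ETC-SLATE structure: explore a small covering subset of
# slates, build artificial samples for every slate via Assumption 3, then
# commit to the empirically best slate.
import itertools
import numpy as np

def etc_slate(env, action_sets, f, T, n):
    m = len(action_sets)
    slates = list(itertools.product(*action_sets))
    # Covering subset: the k-th explored slate plays the k-th base action of
    # every slot (padding with the last action if the sets have unequal size).
    width = max(len(A) for A in action_sets)
    cover = [tuple(A[min(k, len(A) - 1)] for A in action_sets) for k in range(width)]

    history, stored = [], {}                 # stored[(j, a)] = slot-reward samples
    for s in cover:                          # --- explore phase ---
        for _ in range(n):
            slot_rewards = env.play(s)       # returns one reward per slot
            history.append(f(slot_rewards))
            for j, a in enumerate(s):
                stored.setdefault((j, a), []).append(slot_rewards[j])

    def empirical_mean(s):                   # artificial i.i.d. samples of R_s
        cols = np.stack([np.asarray(stored[(j, a)][:n]) for j, a in enumerate(s)])
        return np.mean([f(cols[:, i]) for i in range(n)])

    best = max(slates, key=empirical_mean)   # --- find best arm ---
    for _ in range(T - len(cover) * n):      # --- commit phase ---
        history.append(f(env.play(best)))
    return best, history

class UniformEnv:
    """Toy environment: slot j with base action a yields U(0, highs[j][a])."""
    def __init__(self, highs, rng=np.random.default_rng(0)):
        self.highs, self.rng = highs, rng
    def play(self, s):
        return [self.rng.uniform(0, self.highs[j][a]) for j, a in enumerate(s)]

env = UniformEnv({0: {1: 1.0, 2: 0.6}, 1: {1: 0.8, 2: 1.0}})
best, hist = etc_slate(env, action_sets=[(1, 2), (1, 2)], f=max, T=2000, n=100)
print(best, np.mean(hist))
```

The covering subset generalizes the two "diagonal" slates of Example 2 to slots whose base action sets may have more than two elements.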

4.2 Problem-dependent regret bounds

Lemma 1 ([9]).

Let $Y_1, \dots, Y_n$ be independent random variables such that $Y_i \in [0, 1]$ for $i = 1, \dots, n$. Let $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$, and let $\mu = \mathbb{E}[\bar{Y}]$. Then, $\mathbb{P}\left( |\bar{Y} - \mu| \geq \epsilon \right) \leq 2 \exp\left( -2 n \epsilon^2 \right)$ for all $\epsilon > 0$.

Proposition 1.

Let be i.i.d draws from for an action . Assume that and are independent if and . Let and for . Let , where ties are broken arbitrarily if there are multiple candidates for . Then, .

Proof.

Define . Then we have that,

Inequality (a) follows from applying a union bound over the set . Inequality (b) follows from the fact that . Inequality (c) follows from applying Lemma 1 to the differences for . Inequality (d) follows from the fact that and for .

Recall that denotes the action in that is identified as by Algorithm 1. The following proposition bounds the probability that is incorrectly identified.

Proposition 2.

Let . Let denote the action in that is identified as by Algorithm 1. If Algorithm 1 is run with the inputs: , , , and action set , then .

Proof.

From the description of Algorithm 1, it follows that . Given this choice of , we are able to generate i.i.d draws from for each . Let denote the action that has the highest empirical mean based on the samples and recall that is the action with the highest expected return. By Proposition 1 it follows that .

We can now state the main result of this subsection (Proposition 3).

Proposition 3.

Let . If Algorithm 1 is run with inputs: , , , and action set , then with .

Proof.

Note that the regret can be decomposed as . Here denotes the regret over the first rounds and denotes the regret over the last rounds. In order to bound it suffices to bound each term.

Bounding
Note that is trivially bounded by since by Assumption 4 the regret for any period is at most . From the description of Algorithm 1, it follows that . Given , it follows that Phase I has length . By substituting the quantities for and , we conclude that .

Bounding
We decompose according to two cases:

  • . If case (i) occurs, then is trivially bounded by . Therefore we conclude that in case (i) .

  • . If case (ii) occurs, then . This follows from the fact that, .

By combining the results for the two cases above and noting that by Proposition 2 we have , we obtain .

Corollary 1.

Let and . Suppose that Algorithm 1 is run with inputs: , , , and action set . Then, .

If $\Delta$ is not precisely known, we can still run Algorithm 1 using a lower bound for $\Delta$ if this is available. In Proposition 3, the dependence of the regret on the horizon would then still be logarithmic in $T$ but with a different problem-dependent constant.

4.3 Problem-independent regret bounds

The results of the previous section show that expected regret of order $\log T$ is possible if the gaps are known. However, as $\Delta \to 0$, the regret bounds in Proposition 3 and Corollary 1 become vacuous. Therefore, it is useful to study whether sub-linear regret is possible when the gaps are unknown. In this section we prove problem-independent regret bounds and show that sub-linear regret is still achievable.

Proposition 4.

Let . If Algorithm 1 is run with inputs: , , , and action set , then with .

Proof.

The proof uses similar arguments as in Proposition 2 and 3. Define the set . Let denote the action in that is identified as by Algorithm 1. Using similar arguments as in the proof of Proposition 2, we conclude that . We again decompose the regret as .

Bounding
Note that is trivially bounded by since by Assumption 4 the regret for any period is at most . From the description of Algorithm 1, it follows that . Given , it follows that Phase I has length . By substituting the quantities for and , we conclude that .

Bounding
We decompose according to two cases:

  • . If case (i) occurs, then is trivially bounded by . Therefore, we conclude that in case (i) .

  • . If , then from the definition of , it follows that .

By combining the results for the two cases above and noting that , we obtain
.

Corollary 2.

Let and . Suppose that Algorithm 1 is run with inputs: , , , and action set . Then, .

It is useful to compare the obtained bounds with previously known results. If we consider every slate as a separate action in a standard multi-armed bandit algorithm such as UCB1, then regret of order $\sqrt{K T \log T}$ is possible [3, 16], where $K$ denotes the number of slates. If we compare this with Corollary 2, then we have a worse dependence on the horizon (of order $T^{2/3}$ instead of $\sqrt{T}$) but a better dependence on the number of slates. It is an open problem whether the dependence on $T$ can be improved further.

5 Experiments

In this section we conduct experiments in order to test the performance of our proposed algorithm. We conduct experiments using both simulated data and real-world data.

5.1 Experiments using simulated data

The main purposes of the experiments with simulated data are to verify the theoretical results that were derived, and to investigate the effects of ignoring the non-separability of the slate-level reward function on the regret.

5.1.1 Experimental settings

In the experiments we set and for . We consider three choices for the slate-level reward function. These choices are:

  • ,

  • .

In our experiments the rewards for follow a uniform distribution on where is chosen uniformly from independently for and for all , and is chosen uniformly from independently from . In total we have three experimental settings: Exp1, Exp2, Exp3. The abbreviation Exp1 means that is used. The other abbreviations have a similar interpretation.
The main motivation for the choice of slate-level reward functions and reward distributions is that the slate-level reward functions are non-separable, but, since the reward distributions are uniform, the optimal slate and the regret can still be calculated analytically.
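For instance, if the slate-level reward is the maximum of independent uniform slot-level rewards (a max-type reward is one natural non-separable choice, assumed here for illustration), the expected slate-level reward can be computed exactly from the product of the slot-level CDFs. A small sketch with hypothetical interval endpoints:

```python
# Sketch: computing the expected slate-level reward when the reward is the
# maximum of independent uniform slot-level rewards. The interval endpoints
# below are hypothetical illustrations.
import numpy as np
from scipy.integrate import quad

intervals = [(0.2, 0.9), (0.4, 0.6), (0.1, 0.8)]   # (a_j, b_j) per slot

def cdf(x, a, b):
    return np.clip((x - a) / (b - a), 0.0, 1.0)

def expected_max(intervals):
    # E[max_j X_j] = integral over [0, 1] of P(max_j X_j > x)
    #              = integral of 1 - prod_j F_j(x).
    integrand = lambda x: 1.0 - np.prod([cdf(x, a, b) for a, b in intervals])
    return quad(integrand, 0.0, 1.0)[0]

# Monte Carlo check of the computation above.
rng = np.random.default_rng(0)
draws = np.column_stack([rng.uniform(a, b, 200_000) for a, b in intervals])
print(expected_max(intervals), draws.max(axis=1).mean())
```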

In the experiments, ETC-SLATE is tuned according to Corollary 2. To the best of our knowledge, there are no existing algorithms for our slate bandit problem with non-separable rewards. For this reason we use the following benchmark: we run a standard multi-armed bandit algorithm on the base actions at the slot level (one independent bandit per slot) and combine the base actions chosen by these independent bandits to form the action at the slate level. This is a reasonable benchmark in the sense that, under a non-decreasing slate-level reward function, it should learn the optimal action over time. In the experiments we use UCB1 [2] and Thompson sampling (TS) [1] as the multi-armed bandit algorithms at the slot level; a sketch of this benchmark is given below.
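A minimal sketch of this benchmark with per-slot UCB1. The environment interface and the feedback choice (updating each slot's learner with the observed slate-level reward) are assumptions of the sketch, not necessarily the exact implementation used in the experiments.

```python
# Sketch of the benchmark: one independent UCB1 learner per slot; the chosen
# base actions are combined into a slate each round.
import math
import numpy as np

class UCB1:
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.sums = np.zeros(n_arms)
    def select(self, t):
        if np.any(self.counts == 0):                 # play each arm once first
            return int(np.argmin(self.counts))
        ucb = self.sums / self.counts + np.sqrt(2 * math.log(t) / self.counts)
        return int(np.argmax(ucb))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def run_benchmark(env, n_actions_per_slot, f, T):
    learners = [UCB1(n) for n in n_actions_per_slot]
    rewards = []
    for t in range(1, T + 1):
        slate = tuple(l.select(t) for l in learners)   # combine slot choices
        slot_rewards = env.play(slate)                 # hypothetical interface
        r = f(slot_rewards)                            # slate-level reward
        for l, a in zip(learners, slate):
            l.update(a, r)                             # assumed feedback choice
        rewards.append(r)
    return rewards

class ToyEnv:
    """Toy environment: slot j with arm a yields U(0, highs[j][a])."""
    def __init__(self, highs, rng=np.random.default_rng(1)):
        self.highs, self.rng = highs, rng
    def play(self, slate):
        return [self.rng.uniform(0, self.highs[j][a]) for j, a in enumerate(slate)]

env = ToyEnv([[1.0, 0.6], [0.8, 1.0]])
print(np.mean(run_benchmark(env, n_actions_per_slot=[2, 2], f=max, T=5000)))
```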

5.1.2 Results

In Figure 1 the cumulative regret is shown for different experimental settings and different values of the problem horizon. Each point in the graph shows the cumulative regret over $T$ rounds for a slate bandit problem with horizon $T$, averaged over 200 simulations. The results indicate that ETC-SLATE clearly outperforms the benchmarks. The regret of UCB1 is at least twice as high as that of ETC-SLATE. TS tends to outperform UCB1, but its regret is still at least 40% higher than that of ETC-SLATE. Also, we note that ETC-SLATE performs similarly on all the test functions, whereas the performance of UCB1 and TS differs across the test functions. The results in Figure 1 also confirm that the regret bound from Corollary 2 indeed holds. However, by comparing the regret curve with the expression for the regret bound, it appears that the bound is not tight, which suggests that the bound could be improved further.

Figure 1: Performance of algorithms averaged over 200 runs. Lines indicate the mean and shaded regions indicate the 95% confidence interval.

5.2 Experiments using real-world data

In this section we perform experiments on the reserve price optimization problem with header bidding. In this problem, there are $m$ SSPs on the header bidding platform. In every round $t$, the publisher needs to choose a reserve price for each SSP from a finite set of candidate prices. The revenue on the header bidding platform under the chosen vector of reserve prices is the maximum of the revenues returned by the individual SSPs.

5.2.1 Dataset description

In order to evaluate our method we use real-life data from ad auction markets from the publicly available iPinYou dataset [22]. It contains bidding information from the perspective of nine advertisers on a Demand Side Platform (DSP) during a week. The dataset contains information about the top bid and the second bid if the advertiser wins an auction. We use the iPinYou dataset to construct synthetic data for the top bid and second bid in order to test our proposed approach.

We use data from the advertisers to model the bids from an SSP. Fix an advertiser (say advertiser $i$) and fix an hour of the day (say hour $h$). For advertiser $i$ we take the values of the second-highest bid in hour $h$ and we filter these values by the ad exchange (there are two ad exchanges) on which the bids were placed. Next, we sample (with replacement) 10000 values for each ad exchange to approximate the distribution of the second bid. After these steps we end up with two lists $L_1$ and $L_2$ of size 10000, one per ad exchange, for advertiser $i$ in hour $h$. Define $v_{\max}$ as the maximum value over all values in $L_1$ and $L_2$. We use the following procedure to construct the bids for a horizon of length $T$. For round $t$ we draw one value uniformly at random from $L_1$ and one value uniformly at random from $L_2$. The highest bid in round $t$ is given by the larger of the two draws and the second-highest bid is given by the smaller of the two draws. This determines the joint distribution of the highest and second-highest bid used in the experiments; a sketch of this construction is given below.
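The following sketch mirrors this construction for one SSP. The column names, the two-exchange split of the data frame, and the max/min pairing of the two draws are assumptions of the sketch.

```python
# Sketch: constructing synthetic (highest bid, second-highest bid) pairs for
# one SSP from iPinYou-style second-bid values. Column names and the pairing
# of the two draws are assumptions of this sketch.
import numpy as np
import pandas as pd

def build_bid_generator(df, advertiser, hour, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    rows = df[(df["advertiser"] == advertiser) & (df["hour"] == hour)]
    # One bootstrap list of second-bid values per ad exchange.
    lists = [
        rng.choice(rows.loc[rows["exchange"] == ex, "second_bid"].to_numpy(),
                   size=n_boot, replace=True)
        for ex in sorted(rows["exchange"].unique())
    ]
    v_max = max(l.max() for l in lists)        # used to scale the reserve grid

    def draw_round():
        u = [rng.choice(l) for l in lists]     # one draw per ad exchange
        return max(u), min(u)                  # (highest bid, second-highest bid)

    return draw_round, v_max

# Tiny synthetic frame illustrating the assumed layout.
df = pd.DataFrame({
    "advertiser": [1458] * 6, "hour": [18] * 6,
    "exchange": [1, 1, 1, 2, 2, 2],
    "second_bid": [30.0, 55.0, 42.0, 20.0, 61.0, 35.0],
})
draw_round, v_max = build_bid_generator(df, advertiser=1458, hour=18)
print(draw_round(), v_max)
```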

5.2.2 Experimental settings

In the experiments we assume that there are $m = 4$ SSPs, one per advertiser listed below. The action set for each SSP is given by a set of reserve prices that are equally spaced in the interval $[0, v_{\max}]$. We consider three experimental settings, and in each setting the underlying bid distributions are different. The different experimental settings are summarized as follows.

  • Setting Exp1. In Exp1 we use data from advertisers 1458, 3358, 3386 and 3427 on day 2 and from hour 18.

  • Setting Exp2. In Exp2 we use data from advertisers 1458, 3358, 3386 and 3427 on day 2 and from hour 15.

  • Setting Exp3. In Exp3 we use data from advertisers 1458, 2261, 2821 and 3427 on day 3 and from hour 18.

In order to measure the performance of the methods we look at the per-period reward, which is defined as $\frac{1}{T} \sum_{t=1}^{T} r_t$. Here $r_t$ is the observed reward in round $t$.

In the experiments, ETC-SLATE is tuned according to Corollary 2 and we use the same benchmarks as in the previous experiments.

5.2.3 Results

Figure 2 shows the per period reward. From this figure we observe that the difference in performance is quite substantial as ETC-SLATE has a cumulative reward that is on average 10% higher than the benchmarks. Furthermore, the results indicate that the difference in performance is not sensitive with respect to the underlying distributions at the slot-level.

Figure 2: Performance of algorithms averaged over 200 runs. Lines indicate the mean and shaded region indicates 95% confidence interval.

6 Conclusion

In this paper we study slate bandits with non-separable reward functions at the slate-level. Previous papers have only considered the case where the slate-level reward satisfies a monotonicity property. The non-separability property implies that choosing the optimal base action for each slot does not necessarily lead to the highest expected reward at the slate-level. We provide a theoretical analysis and derive problem-dependent and problem-independent regret bounds. Furthermore, we provide algorithms that have sub-linear regret with respect to the time horizon.

The work presented in this paper can be improved in a number of ways. In our analysis we made the assumption that the slot-level rewards are independent of each other. However, other papers (e.g. [8]) do not make such an assumption. It is not clear how to tackle the slate bandit problem in that situation, and future research can be directed towards deriving sub-linear regret bounds for this case.

References

  • [1] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 99–107. PMLR, 2013.
  • [2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.
  • [3] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [4] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404 – 1422, 2012. JCSS Special Issue: Cloud Computing 2011.
  • [5] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 151–159. PMLR, 17–19 Jun 2013.
  • [6] W. Chen, Y. Wang, Y. Yuan, and Q. Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. J. Mach. Learn. Res., 17(1):1746–1778, Jan. 2016.
  • [7] R. Combes, M. S. Talebi Mazraeh Shahi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems 28, pages 2116–2124. Curran Associates, Inc., 2015.
  • [8] M. Dimakopoulou, N. Vlassis, and T. Jebara. Marginal posterior sampling for slate bandits. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 2223–2229. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
  • [9] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • [10] G. Jauvion, N. Grislain, P. Dkengne Sielenou, A. Garivier, and S. Gerchinovitz. Optimization of a SSP’s header bidding strategy using Thompson sampling. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, pages 425–432, New York, NY, USA, 2018. ACM.
  • [11] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, NIPS’10, pages 1054–1062, USA, 2010. Curran Associates Inc.
  • [12] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 535–543. PMLR, 09–12 May 2015.
  • [13] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • [14] S. Li, B. Wang, S. Zhang, and W. Chen. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 1245–1253. JMLR.org, 2016.
  • [15] M. Mohri and A. M. n. Medina. Learning algorithms for second-price auctions with reserve. J. Mach. Learn. Res., 17(1):2632–2656, Jan. 2016.
  • [16] R. Munos. From bandits to monte-carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129, 2014.
  • [17] L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM, 2014.
  • [18] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudik, J. Langford, D. Jose, and I. Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30, pages 3632–3642. Curran Associates, Inc., 2017.
  • [19] J. Wang, W. Zhang, and S. Yuan. Display advertising with real-time bidding (RTB) and behavioural targeting. Foundations and Trends® in Information Retrieval, 11(4-5):297–435, 2017.
  • [20] Y. Wang, H. Ouyang, C. Wang, J. Chen, T. Asamov, and Y. Chang. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 2746–2753. AAAI Press, 2017.
  • [21] Z. Wen, B. Kveton, and A. Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 1113–1122. JMLR.org, 2015.
  • [22] W. Zhang, S. Yuan, J. Wang, and X. Shen. Real-time bidding benchmarking with iPinYou dataset. arXiv preprint arXiv:1407.7073, 2014.