
Best-of-K Bandits

This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds based on a particular construction which force a learner to consider all N-choose-K subsets, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain, favorable distributions because the influence of high-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.





1 Introduction

This paper addresses a variant of the stochastic multi-armed bandit problem, where given $N$ arms associated with random variables $X_1, \dots, X_N$, and some fixed $K$, the goal is to identify the subset $S$ of $K$ arms that maximizes the objective $\mathbb{E}[\max_{i \in S} X_i]$. We refer to this problem as “Best-of-K” bandits to reflect the reward structure and the limited information setting where, at each round, a player queries a set $S$ of size at most $K$, and only receives information about arms $i \in S$: e.g. the vector of values of all arms in $S$ (semi-bandit), the index of a maximizer (marked bandit), or just the maximum reward over all arms in $S$ (bandit). The game and its valid forms of feedback are formally defined in Figure 1.

While approximating the Best-of-K problem and its generalizations has been given considerable attention from a computational angle in the regret setting (Yue and Guestrin, 2011; Hofmann et al., 2011; Raman et al., 2012; Radlinski et al., 2008; Streeter and Golovin, 2009), this work aims at characterizing its intrinsic statistical difficulty as an identification problem. Not only do identification algorithms typically imply low regret algorithms by first exploring and then exploiting, but every result in this paper can also be easily extended to the PAC learning setting where we aim to find a set whose reward is within $\epsilon$ of the optimal, a pure-exploration setting of interest for science applications (Kaufmann et al., 2015; Kaufmann and Kalyanakrishnan, 2013; Hao et al., 2013).

For joint reward distributions with high-order correlations, we present distribution-dependent lower bounds which force a learner to consider all subsets in each feedback model of interest, and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain, favorable distributions because the influence of high-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs in the bandit and marked bandit feedback models. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.

1.1 Motivation

In the setting where , one can interpret the objective as trying to find the set of items which affords the greatest coverage. For example, instead of using spread spectrum antibiotics which have come under fire for leading to drug-resistant “super bugs” (Huycke et al., 1998), consider the doctor that desires to identify the best subset of narrow spectrum antibiotics that leads to as many favorable outcomes as possible. Here each draw from represents the th treatment working on a random patient, and for antibiotics, we may assume that there are no synergistic effects between different drugs in the treatment. Thus, the antibiotics example falls under the bandit feedback setting since treatments are selected but it is only observed if at least one -tuple of treatment led to a favorable outcome: no information is observed about any particular treatment.

Now consider content recommendation tasks where items are suggested and the user clicks on either 1 or none. Here each draw from represents a user’s potential interest in the -th item, which we assume is independent of the other items which are shown with it. Nevertheless, due to the variety and complexity of users’ preferences, the ’s have a highly dependent joint distribution, and we only get to observe marked-bandit feedback, namely one item which the user has clicked on. Our final example comes from virology where multiple experiments are prepared and performed at a time, resulting in simultaneous, noisy responses (Hao et al., 2013); this motivates our consideration of the semi-bandit feedback setting.

Best-of-K Bandits Game for Player picks and adversary simultaneously picks Player observes

Figure 1: Best-of-K Bandits game for the different types of feedback considered. While this work is primarily interested in stochastic adversaries, our lower bound construction also has consequences for non-stochastic adversaries. Moreover, in marked feedback, we might consider non-uniform and even adversarial marking.
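To make the three feedback models concrete, the following minimal sketch shows what the player would observe in one round. It assumes binary rewards, uniform marking among the arms that returned 1, and a `None` observation when no queried arm returned 1; the function name `feedback` and the string labels are illustrative and do not come from the paper.

```python
import random

def feedback(x, query, model, rng=random):
    """Return the player's observation after querying `query` on reward vector `x`.

    Assumes binary rewards; `model` is one of the illustrative labels
    "semi-bandit", "marked-bandit" or "bandit".
    """
    arms = sorted(query)
    if model == "semi-bandit":
        # The full vector of values of all queried arms.
        return {i: x[i] for i in arms}
    if model == "bandit":
        # Only the maximum reward over the queried arms.
        return max(x[i] for i in arms)
    if model == "marked-bandit":
        # Index of one maximizer; here, uniform among arms with reward 1,
        # and None if every queried arm returned 0 (an assumption).
        winners = [i for i in arms if x[i] == 1]
        return rng.choice(winners) if winners else None
    raise ValueError(f"unknown feedback model: {model}")

x = [0, 1, 0, 1, 0]
semi = feedback(x, {1, 2, 3}, "semi-bandit")
marked = feedback(x, {1, 2, 3}, "marked-bandit")
bandit = feedback(x, {1, 2, 3}, "bandit")
```

Semi-bandit feedback determines the other two observations, which is one way to see why it is the most informative of the three.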

1.2 Problem Description

We denote . For a finite set , we let denote its power set, denote the set of all subsets of of size , and write to denote that is drawn uniformly from . If is a length vector (binary, real or otherwise) and , we let denote the sub-vector indexed by entries .

In what follows, let be a random vector drawn from the probability distribution over . We refer to the index as the -th arm, and let denote the marginal distribution of its corresponding entry in , e.g. . We define , and for a given , we call the expected reward of , and refer casually to the random instantiations as simply the reward of .

At each time , nature draws a rewards vector where is i.i.d from . Simultaneously, our algorithm queries a subset of of arms, and we refer to the entries as the arms pulled by the query. As we will describe later, this problem has previously been studied in a regret framework, where a time horizon is fixed and an algorithm’s objective is to minimize its regret


In this work, we are more concerned with the problem of identifying the best subset of arms. More precisely, for a given measure , denote the optimal subset


and let denote the (possibly random) number of times a particular subset has been played before our algorithm terminates. The identification problem is then

Definition 1 (Best-of-K Subset Identification).

For any measure and fixed , return an estimate such that , and which minimizes the sum either in expectation, or with high probability.

Again, we remind the reader that an algorithm for Best-of-K Subset Identification can be extended to an active PAC learning algorithm, and to an online learning algorithm with low regret (with high probability) (Kaufmann et al., 2015; Kaufmann and Kalyanakrishnan, 2013; Hao et al., 2013).

1.3 Related Work

Variants of Best-of-K have been studied extensively in the context of online recommendation and ad placement (Yue and Guestrin, 2011; Hofmann et al., 2011; Raman et al., 2012). For example, Radlinski et al. (2008) introduce “Ranked Bandits” where the arms are stochastic random variables, which take a value if the -th user finds item relevant, and otherwise. The goal is to recommend an ordered list of items which maximizes the probability of a click on any item in the list, i.e. , and observes the first item (if any) that the user clicked on. Streeter and Golovin (2009) generalize to online maximization of a sequence of monotone, submodular functions subject to knapsack constraints , under a variety of feedback models. Since the function is submodular, identifying corresponds to a special case of optimizing the monotone, submodular function subject to these same constraints.

Streeter and Golovin (2009), Yue and Guestrin (2011), and Radlinski et al. (2008) propose online variants of a well-known greedy offline submodular optimization algorithm (see, for example, Iyer and Bilmes (2013)), which attain approximate regret guarantees of the form


where is some regret term that decays as . Computationally, this is the best one could hope for: Best-of-K and Ranked Bandits are online variants of the Max-K-Coverage problem, which cannot be approximated to within a factor of for any fixed under standard hardness assumptions (Vazirani, 2013). For completeness, we provide a formal reduction from Best-of-K identification to Max-K-Coverage in Appendix A.
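The offline greedy rule that these online methods build on can be sketched as follows. This is an illustration only: it runs on a fixed table of sampled binary reward vectors rather than online, and the names `expected_max` and `greedy_subset` are hypothetical. For monotone submodular objectives under a cardinality constraint, this greedy rule is known to achieve a $(1 - 1/e)$ fraction of the optimal value.

```python
def expected_max(samples, subset):
    """Empirical estimate of E[max_{i in subset} X_i] from rows of 0/1 rewards."""
    if not subset:
        return 0.0
    return sum(max(row[i] for i in subset) for row in samples) / len(samples)

def greedy_subset(samples, k):
    """Add, k times, the arm with the largest marginal gain in empirical coverage."""
    n = len(samples[0])
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(n) if i not in chosen]
        best = max(remaining, key=lambda i: expected_max(samples, chosen + [i]))
        chosen.append(best)
    return chosen

# Each row is one sampled reward vector (one "user"); columns are arms.
samples = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
picked = greedy_subset(samples, 2)
value = expected_max(samples, picked)
```

In this toy table, arm 0 covers the most rows on its own, and arm 2 adds more new rows than arm 1 does, so greedy selects arms 0 and 2.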

1.4 Our Contributions

Focusing on the stochastic pure-exploration setting with binary rewards, our contributions are as follows:

  • We propose a family of joint distributions such that any algorithm that solves the Best-of-K identification problem with high probability must essentially query all combinations of arms. Our lower bounds for the bandit case are nearly matched by trivial identification and regret algorithms that treat each -subset as an independent arm. For semi-bandit feedback, our lower bounds are exponentially higher in than those for bandit feedback (though still requiring exhaustive search). To better understand this gap, we sketch an upper bound that achieves the lower bound for a particular instance of our construction. While in the general binary case, the difficulty of marked bandit feedback is sandwiched between bandit and semi-bandit feedback, in our particular construction we show that marked bandit feedback has no benefit over bandit feedback. In particular, for worst-case instances, our lower bounds for marked bandits are matched by upper bounds based on algorithms which only take advantage of bandit feedback.

  • Our construction plants a -wise dependent set among k-wise independent sets, creating a needle-in-a-haystack scenario. One weakness of this construction is that the gap between the rewards of the best and second best subset is exponentially small in . This is particular to our construction, but not to our analysis: We present a partial converse which establishes that, for any two -wise independent distributions defined over with identical marginal means , the difference in expected reward is exponentially small in . (Footnote 1: Note that our construction requires all subsets of of to be independent.) This raises the question: can low order correlation statistics allow us to neglect higher order dependencies? And can this property be exploited to avoid combinatorially large sample complexity in favorable scenarios with moderate gaps?

  • We lay the groundwork for algorithms for identification under favorable, though still dependent, measures by designing a computationally efficient algorithm for independent measures for the marked, semi-bandit, and bandit feedback models. Though independent semi-bandits is straightforward (Jun et al., 2016), special care needs to be taken in order to address the information occlusion that occurs in the bandit and marked-bandit models, even in this simplified setting. We provide nearly matching lower bounds, and conclude that even for independent measures, bandit feedback may require exponentially (in ) more samples than in the semi-bandit setting.

2 Lower Bound for Dependent Arms

Intuitively, the best-of-K problem is hard for the dependent case because the high reward subsets may appear as a collection of individually low-payoff arms if not sampled together. For instance, for , if , , and for all , then clearly is the best subset because and for all . However, identifying set appears difficult as presumably one would have to consider all sets since if and are not queried together, they appear as .
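A small exact computation illustrates this intuition. The parameters below are assumptions chosen for illustration, not necessarily those of the example above: arm 0 is a fair coin, arm 1 is its complement, and every other arm is an independent fair coin. Every arm then has marginal mean 1/2, yet one pair of arms is strictly better than all others.

```python
from itertools import combinations, product

def joint_distribution(n):
    """Yield (x, probability) pairs for the illustrative dependent measure.

    Arm 0 and arms 2..n-1 are independent fair coins; arm 1 equals 1 - arm 0.
    """
    for bits in product([0, 1], repeat=n - 1):
        x = (bits[0], 1 - bits[0]) + bits[1:]
        yield x, 0.5 ** (n - 1)

def expected_reward(n, subset):
    """Exact E[max_{i in subset} X_i] by enumerating the joint distribution."""
    return sum(p * max(x[i] for i in subset) for x, p in joint_distribution(n))

n = 5
marginals = [expected_reward(n, (i,)) for i in range(n)]
rewards = {s: expected_reward(n, s) for s in combinations(range(n), 2)}
best = max(rewards, key=rewards.get)
```

Here the pair (0, 1) has expected reward 1, while each of the other pairs has expected reward 3/4, including pairs that contain only one of arms 0 and 1. No single-arm statistic distinguishes arms 0 and 1 from the rest.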

Our lower bound generalizes this construction by introducing a measure such that (1) the arms in the optimal set are dependent but (2) the arms in every other non-optimal subset of arms are mutually independent. This construction amounts to hiding a “needle-in-a-haystack” among all other subsets, requiring any identification algorithm to examine most elements of .

We now state our theorem, which characterizes the difficulty of recovering arms in terms of the gap between the expected reward of and of the second best subset:

Theorem 2.1 (Dependent).

Fix such that . For any and there exists a distribution with such that any algorithm that identifies with probability at least requires, in expectation, at least

observations. In particular, for any there exists a distribution with that requires just (marked-)bandit observations. And for any there exists a distribution with that requires just semi-bandit observations.

Remark 2.1.

Marked-bandit feedback provides strictly less information than semi-bandit feedback but at least as much as bandit feedback. The above lower bound for marked-bandit feedback and the nearly matching upper bound for bandit feedback remarked on below suggest that marked-bandit feedback may provide no more information than bandit feedback. However, the lower bound holds for just a particular construction and in Section 3 we show that there exist instances in which marked-bandit feedback provides substantially more information than merely bandit feedback.

In the construction of the lower bound, and all other subsets behave like completely independent arms. Each individual arm has mean , i.e. for all , so each has a bandit reward of . The scaling in the number of bandit and marked-bandit observations corresponds to the variance of this reward and captures the property that the number of times a set needs to be sampled to accurately predict its reward is proportional to its variance. Since , we note that the term is typically very close to , unless is nearly and is nearly .

While the lower bound construction makes it necessary to consider each subset individually for all forms of feedback, semi-bandit feedback presumably allows one to detect dependencies much faster than bandit or marked-bandit feedback, resulting in an exponentially smaller bound in . Indeed, Remark E.2 describes an algorithm that uses the parity of the observed rewards that nearly achieves the lower bound for semi-bandits for the constructed instance when . However, the authors are unaware of more general matching upper bounds for the semi-bandit setting and consider this a possible future avenue of research.

2.1 Comparison with Known Upper Bounds

By treating each set as an independent arm, standard best-arm identification algorithms can be applied to identify . The KL-based LUCB algorithm from Kaufmann and Kalyanakrishnan (2013) requires samples, matching our bandit lower bound up to a multiplicative factor of (which is typically dwarfed by ). The lil’UCB algorithm of Jamieson et al. (2014) avoids paying this multiplicative factor, but at the cost of not adapting to the variance term . Perhaps a KL- or variance-adaptive extension of lil’UCB could attain the best of both worlds.
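The reduction of treating each subset as a separate arm can be sketched with a deliberately simple, non-adaptive baseline: pull every subset the same number of times under bandit feedback and return the empirical best. This is neither LUCB nor lil’UCB; the function name, the fixed per-subset budget, and the use of independent Bernoulli arms to simulate rewards are all assumptions made for illustration.

```python
import random
from itertools import combinations

def naive_best_subset(means, k, pulls_per_subset, seed=0):
    """Uniformly sample every k-subset under bandit feedback; return the empirical best.

    Rewards are simulated from independent Bernoulli arms with the given means
    (an assumption made only so that the sketch is runnable).
    """
    rng = random.Random(seed)
    estimates = {}
    for subset in combinations(range(len(means)), k):  # one "arm" per k-subset
        total = 0
        for _ in range(pulls_per_subset):
            # Bandit feedback: only the max over the queried arms is observed.
            total += max(1 if rng.random() < means[i] else 0 for i in subset)
        estimates[subset] = total / pulls_per_subset
    return max(estimates, key=estimates.get), len(estimates)

best, num_arms = naive_best_subset([0.9, 0.8, 0.1, 0.1], k=2, pulls_per_subset=2000)
```

The number of "arms" grows as N-choose-K, which is the exhaustive search that the lower bound shows cannot be avoided in the worst case.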

From a regret perspective, the exact construction as used in the proof of Theorem 2.1 can be used in Theorem 17 of Kaufmann et al. (2015) to state a lower bound on the regret after bandit observations. Specifically, if an algorithm obtains a stochastic regret for all , then for all , we have where is given in Theorem 2.1. Alternatively, in an adversarial setting, the above construction with also implies a lower bound of for any algorithm over a time budget . Both of these regret bounds are matched by upper bounds found in Bubeck and Cesa-Bianchi (2012).

2.2 Do Complicated Dependencies Require Small Gaps?

While Theorem 2.1 proves the existence of a family of instances in which samples are necessary to identify the best -subset, the possible gaps are restricted to be no larger than . It is natural to wonder if this is an artifact of our analysis, a fundamental limitation of -wise independent sets, or a property of dependent sets that we can potentially exploit in algorithms. The following theorem suggests, but does not go as far as to prove, that if there are very high-order dependencies, then these dependencies cannot produce gaps substantially larger than the range described by Theorem 2.1. More precisely, the next theorem characterizes the maximum gap for -wise independent instances.

Theorem 2.2.

Let be a random variable supported on with -wise independent marginal distributions, such that for all . Then there is a one-to-one correspondence between joint distributions over and probability assignments . When , all such assignments lie in the range



where is the largest odd integer , and the largest even integer . Moreover, when , all such assignments lie in the range


Noting that , the preceding theorem implies that the difference between the largest possible and smallest possible expected rewards for a set of arms where each arm has mean and the distribution is -wise independent is no greater than , a gap of the same order as the gaps used in our lower bounds above. This implies that, in the absence of low order correlations, very high order correlations can only have a limited effect on the expected rewards of sets.

If it were possible to make more precise statements about the degree to which high order dependencies can influence the reward of a subset, strategies could exploit this diminishing returns property to more efficiently search for subsets while also maintaining large-time horizon optimality. In particular, one could use such bounds to rule out sets that need to be considered based just on their performance using lower order dependency statistics. To be clear, such algorithms would not contradict our lower bounds, but they may perform much better than trivial approaches in favorable conditions.

3 Best of K with Independent Arms

While the dependent case is of considerable practical interest, the remainder of this paper investigates the best-of-K problem where is assumed to be a product distribution of independent Bernoulli distributions. We show that even in this presumably much simpler setting, there remain highly nontrivial algorithm design challenges related to the information occlusion that occurs in the bandit and marked-bandit feedback settings. We present an algorithm and analysis which tries to mitigate information occlusion which we hope can inform strategies for favorable instances of dependent measures.

Under the independent Bernoulli assumption, each arm is associated with a mean and the expected reward of playing any set is equal to and hence the best subset of arms is precisely the set of arms with the greatest means .
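For independent Bernoulli arms, the probability that at least one queried arm returns 1 is one minus the product of the individual failure probabilities, so the expected reward of a set $S$ is $1 - \prod_{i \in S} (1 - \mu_i)$. The short check below, with arbitrary illustrative means, confirms by brute force that the maximizing subset is the set of arms with the largest means.

```python
from itertools import combinations
from math import prod

def expected_reward(means, subset):
    """E[max_{i in subset} X_i] for independent Bernoulli arms."""
    return 1.0 - prod(1.0 - means[i] for i in subset)

means = [0.30, 0.05, 0.20, 0.45, 0.10]  # illustrative values
k = 2

# Exhaustive search over all k-subsets.
brute_force = max(
    combinations(range(len(means)), k),
    key=lambda s: expected_reward(means, s),
)

# Indices of the k largest means, sorted for comparison.
top_k = tuple(sorted(sorted(range(len(means)), key=lambda i: -means[i])[:k]))
```

Both computations select arms 0 and 3, whose joint expected reward is 1 - 0.70 * 0.55 = 0.615.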

3.1 Results

Without loss of generality, suppose the means are ordered . Assuming ensures that the set of top means is unique, though our results could be easily extended to a PAC Learning setting with little effort. Define the gaps and variances via

and (7)

For , introduce the transformation


where hides logarithmic factors of its argument. We present guarantees for the Stagewise Elimination of Algorithm 3 in our three feedback models of interest; the broad brush strokes of our analysis are addressed in Appendix B, and the details are fleshed out in Appendices C and B.2. Our first result holds for semi-bandits, and slightly improves upon the best known result for the -batch setting (Jun et al., 2016) by adapting to unknown variances:

Theorem 3.1 (Semi Bandit).

With probability , Algorithm 3 with semi-bandit feedback returns the arms with the top means using no more than


queries where


and is a permutation so that .

The above result also holds in the more general setting where the rewards have arbitrary distributions bounded in almost surely (where is just the variance of arm .)

In the marked-bandit and bandit settings, our upper bounds incur a dependence on information-sharing terms (marked) and (bandit) which capture the extent to which the operator occludes information about the rewards of arms in each query.

Theorem 3.2 (Marked Bandit).

Suppose we require each query to pull exactly arms. Then Algorithm 3 with marked bandit feedback returns the arms with the top means with probability at least using no more than


queries. Here, is given by


is a permutation so that , and is an “information sharing term” given by


If we can pull fewer than arms per round, then we can achieve


We remark that as long as the means are no more than , , and thus the two differ by a constant factor when the means are not too close to (this difference comes from losing a term in a Bernoulli variance in the marked case). Furthermore, note that . Hence, when we are allowed to pull fewer than arms per round, Stagewise Elimination with marked-bandit feedback does no worse than a standard LUCB algorithm for stochastic best arm identification.

When the means are on the order of , then , and thus Stagewise Elimination gives the same guarantees for marked bandits as for semi bandits. The reason is that, when the means are , we can expect each query to have only a constant number of arms for which , and so not much information is being lost by observing only one of them.

Finally, we note that our guarantees depend crucially on the fact that the marking is uniform. We conjecture that adversarial marking is as challenging as the bandit setting, whose guarantees are as follows:

Theorem 3.3 (Bandit).

Suppose we require each query to pull exactly arms, , and . Then Algorithm 3 with bandit feedback returns the arms with the top means with probability at least using no more than


queries where is an “information sharing term”,

and is a permutation so that .

The condition that ensures identifiability (see Remark B.11). The condition is an artifact of using a Balancing Set defined in Algorithm 4; without , our algorithm succeeds for all , albeit with slightly looser guarantees (see Remark B.9).

Remark 3.1.

Suppose the means are greater than where and is a constant; for example, think . Then . Hence, Successive Elimination requires on the order of more queries to identify the top -arms than the classic stochastic MAB setting where you get to pull -arm at a time, despite the seeming advantage that the bandit setting lets you pull arms per query. When , then is at least polynomially large in , and when , is exponentially large in (e.g, ).

On the other hand, when the means are all on the order of for , then , but the term is at least . For this case, our sample complexity looks like


which matches, but does not out-perform, the standard -arm-per-query MAB guarantees, with variance adaptation (e.g., Theorem 3.1 with , note that captures the variance). Hence, when the means are all roughly on the same order, it’s never worse to pull arm at a time and observe its reward, than to pull and observe their max. Once the means vary wildly, however, this is certainly not true; we direct the reader to Remark B.12 for further discussion.
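The occlusion effect in this remark can be illustrated numerically. For independent Bernoulli arms, the expected bandit observation is $1 - \prod_{j \in S}(1 - \mu_j)$, so its sensitivity to the mean of a single arm $i$ is $\prod_{j \in S \setminus \{i\}} (1 - \mu_j)$. The sketch below evaluates this attenuation factor in the two regimes discussed above; it is an illustration of the phenomenon and not the paper's information-sharing term, and the specific means are assumptions.

```python
from math import prod

def bandit_sensitivity(means, query, arm):
    """Derivative of E[max over query] with respect to the mean of `arm`.

    Equals the product of (1 - mu_j) over the other queried arms.
    """
    return prod(1.0 - means[j] for j in query if j != arm)

k = 10
query = range(k)
high = [0.5] * k      # means bounded away from zero
low = [1.0 / k] * k   # means on the order of 1/k

s_high = bandit_sensitivity(high, query, 0)  # 0.5 ** 9, about 0.002
s_low = bandit_sensitivity(low, query, 0)    # 0.9 ** 9, about 0.387
```

With constant means the signal about any single arm shrinks geometrically in the query size, whereas with means of order 1/k it stays bounded below by a constant.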

3.2 Algorithm

At each stage , our algorithm maintains an accept set of arms which we are confident lie in the top , a reject set of arms which we are confident lie in the bottom , and an undecided set containing arms for which we have not yet rendered a decision. The main obstacle is to obtain estimates of the relative performance of , since the bandit and marked bandit observation models occlude isolated information about any one given arm in a pull. The key observation is that, if we sample , then for , the following differences have the same sign as (stated formally in Lemma B.2):


This motivates a sampling strategy where we partition uniformly at random into subsets of size , and query each , . We record all arms for which in the semi/marked-bandit settings (Algorithm 1, Line 1), and, in the bandit setting, mark down all arms in if we observe - i.e, we observe a reward of 1 (Algorithm 1, Line 1). This recording procedure is summarized in Algorithm 1:

1 Input ,
2 Play
3 Semi/Marked Bandit Setting: for all for which we observe
4 Bandit Setting: If returns a reward of , for all
5 Return
Algorithm 1 PlayAndRecord
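A Python rendering of this recording step may help fix ideas. It is a sketch based on the prose description only: the argument names, the dictionary return type, and the choice to mark uniformly among arms with reward 1 in the marked-bandit case are assumptions rather than the paper's exact specification.

```python
import random

def play_and_record(x, record_set, top_off, model, rng=random):
    """Query the union of record_set and top_off on binary reward vector x.

    Returns {arm: 0 or 1} for arms in record_set only; arms in top_off are
    played but never recorded.
    """
    played = sorted(set(record_set) | set(top_off))
    marks = {i: 0 for i in record_set}
    if model == "semi-bandit":
        # Every queried arm's reward is visible; record those in record_set.
        for i in record_set:
            marks[i] = x[i]
    elif model == "marked-bandit":
        # One arm with reward 1 (if any) is revealed, uniformly at random.
        winners = [i for i in played if x[i] == 1]
        if winners:
            marked = rng.choice(winners)
            if marked in marks:
                marks[marked] = 1
    elif model == "bandit":
        # Only the max is visible: if it is 1, mark every arm in record_set.
        if any(x[i] == 1 for i in played):
            for i in record_set:
                marks[i] = 1
    else:
        raise ValueError(f"unknown feedback model: {model}")
    return marks

semi = play_and_record([1, 0, 0, 1], [0, 1], [3], "semi-bandit")
band = play_and_record([1, 0, 0, 1], [0, 1], [3], "bandit")
```

In the bandit case, a reward of 1 marks every recorded arm regardless of which arm produced it, which is the source of the occlusion that the rest of the algorithm has to work around.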

Note that PlayAndRecord plays the union of and , but only records entries of whose indices lie in . UniformPlay (Algorithm 2) outlines our sampling strategy. Each call to UniformPlay returns a vector , supported on entries , for which


where and is empty unless or we are allowed to pull fewer than arms per query in which case elements of are drawn from as outlined in Algorithm 2, Line 2 otherwise.

There are a couple of nuances worth mentioning. When , we cannot sample arms from the undecided set ; hence UniformPlay pulls only from per query. If we are forced to pull exactly arms per query, UniformPlay adds in a “Top-Off” set of an additional arms, from and (Lines 2-2). Furthermore, observe that lines 2-2 in UniformPlay carefully handle divisibility issues so as to not “double mark” entries , thus ensuring the correctness of Equation 18. Finally, note that each call to UniformPlay makes exactly queries.

1 Inputs: , , , sample size
2 Uniformly at random, partition into sets of size and place remainders in // thus , but not indep
3 If Require Arms per Pull and // Construct Top-Off Set
4 //
5 //sample as many items from reject as possible
6 If : // sample remaining items from accept
8 Else // Top-Off set unnecessary
9 ,
10 Initialize rewards vector
11 For
12 // only mark
13 If // if remainder
14 Draw // thus
15 // only mark to avoid duplicate marking
Algorithm 2 UniformPlay
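The divisibility handling can be sketched in isolation. The helper below covers only the partition-and-remainder logic (no Top-Off set, no feedback), and assumes the undecided set has at least as many arms as the block size; its name and return format are hypothetical.

```python
import random

def partition_with_remainder(undecided, size, rng):
    """Randomly split `undecided` into blocks of `size`.

    Returns a list of (arms_played, arms_marked) pairs. A leftover block is
    padded with other undecided arms, but only the leftover arms are marked,
    so every undecided arm is marked exactly once per pass.
    """
    arms = list(undecided)
    if len(arms) < size:
        raise ValueError("this sketch assumes at least `size` undecided arms")
    rng.shuffle(arms)
    full = len(arms) // size * size
    queries = []
    for start in range(0, full, size):
        block = arms[start:start + size]
        queries.append((block, block))
    remainder = arms[full:]
    if remainder:
        others = arms[:full]  # arms not in the remainder
        padding = rng.sample(others, size - len(remainder))
        queries.append((remainder + padding, remainder))
    return queries

queries = partition_with_remainder(range(7), 3, random.Random(0))
```

With seven arms and blocks of three, this yields three queries of three arms each, and the single leftover arm is the only one marked in the last query.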

We deploy the passive sampling in UniformPlay in a stagewise successive elimination procedure formalized in Algorithm 3. At each round , use a doubling sample size to , and set the parameter for UniformPlay to be (line 3). Next, we construct the sets from which UniformPlay samples: in the marked and semi-bandit setting, these are just (Line 3), while in the bandit setting, they are obtained by from Algorithm 4 which transfers a couple low mean arms from into (Line 3). This procedure ameliorates the effect of information occlusion for the bandit case.

Line 3 through 3 average together independent, and identically distributed samples from to produce unbiased estimates of the quantity defined in Equation 18. are Binomial, so we apply an empirical Bernstein’s inequality from Maurer and Pontil (2009) to build tight confidence intervals where (19)

Note that coincide with the canonical definition of sample variance. The variance-dependence of our confidence intervals is crucial; see Remarks B.7 and B.8 for more details. For any let
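As an illustration of a variance-dependent confidence width, the sketch below computes an empirical Bernstein radius for samples in [0, 1], using the one-sided form stated in Maurer and Pontil (2009): with probability at least $1 - \delta$, the mean exceeds the empirical mean by at most $\sqrt{2 V_n \ln(2/\delta)/n} + 7\ln(2/\delta)/(3(n-1))$, where $V_n$ is the sample variance. The constants and the choice of $\delta$ in Equation 19 may differ; the function name is hypothetical.

```python
from math import log, sqrt

def empirical_bernstein_radius(samples, delta):
    """One-sided empirical Bernstein radius for i.i.d. samples in [0, 1]."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate the variance")
    mean = sum(samples) / n
    # Unbiased sample variance (normalized by n - 1).
    variance = sum((z - mean) ** 2 for z in samples) / (n - 1)
    log_term = log(2.0 / delta)
    return sqrt(2.0 * variance * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

constant = empirical_bernstein_radius([1] * 11, delta=0.1)
noisy_small = empirical_bernstein_radius([0, 1] * 50, delta=0.1)
noisy_large = empirical_bernstein_radius([0, 1] * 500, delta=0.1)
```

When the sample variance is zero, only the lower-order term remains and the radius shrinks at rate 1/n rather than 1/sqrt(n). A two-sided interval follows by applying the bound to both tails with a union bound.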


As mentioned above, Lemma B.2 ensures if and only if . Thus, accepting an arm for is in the top .

1 Input , Batch Size
2 While // fewer than k arms accepted
3 Sample Size , Rewards Vector ,
4  // Sampling Sets for UniformPlay, identical to and in marked/semi bandits
5 If Bandit Setting  // Add low mean arms from to
7 For
8  // get fresh samples
9  // normalize
11  // Equation 19
13 If // arms rejected
Algorithm 3 Stagewise Elimination

The Balance Procedure is described in Algorithm 4, and ensures that contains sufficiently many arms that don’t have very high (top ) means. The motivation for the procedure is somewhat subtle, and we defer its discussion to the analysis in Appendix B.3.3, following Remark B.8:

1 Input
2 //Balancing Set
3 , // Transfer from to
Algorithm 4 Balance()

4 Lower bound for Independent Arms

In the bandit and marked-bandit settings, the upper bounds of the previous section depended on “information sharing” terms that quantified the degree to which other arms occlude the performance of a particular arm in a played set. Indeed, great care was taken in the design of the algorithm to minimize the impact of this information sharing. The next theorem shows that the upper bounds of the previous section for bandit and semi-bandit feedback are nearly tight up to a similarly defined information sharing term.

Theorem 4.1 (Independent).

Fix . Let be a product distribution where each is an independent Bernoulli with mean . Assume (the ordering is unknown to any algorithm). At each time the algorithm queries a set and observes . Then any algorithm that identifies the top arms with probability at least requires, in expectation, at least

observations where

where .

Our lower bounds apply to our upper bounds when . In the bandit setting, considering reveals a trade-off between the information sharing term, which decreases with larger , and the benefit of a factor gained from querying arms at once. One can construct different instances that are optimized by the entire range of . Future research may consider varying the subset size in an adaptive setting to optimize this trade-off.

The information sharing terms defined in the upper and lower bounds correspond to the most pessimistic and optimistic scenarios, respectively, and result from applying coarse bounds in exchange for simpler proofs. Thus, our algorithm may fare considerably better in practice than is predicted by the upper bounds. Moreover, when is dominated by our upper and lower bounds differ by constant factors.

Finally, we note that our upper and lower bounds for independent measures are tailored to Bernoulli payoffs, where the best -subset corresponds to the top means. However, for general product distributions on , this is no longer true (see Remark B.1). This leaves open the question: how difficult is Best-of-K for general, independent bounded product measures? And, in the marked feedback setting (where one receives an index of the best element in the query), is this problem even well-posed?


We thank Elad Hazan for illuminating discussions regarding the computational complexity of the Best-of-K problem, and for pointing us to resources addressing online submodularity and approximate regret. Max Simchowitz is supported by an NSF GRFP award. Ben Recht and Kevin Jamieson are generously supported by ONR awards , N00014-15-1-2620, and N00014-13-1-0129. BR is additionally generously supported by ONR award N00014-14-1-0024 and NSF awards CCF-1148243 and CCF-1217058. This research is supported in part by gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Intel, Microsoft, Pivotal, Samsung, Schlumberger, Splunk, State Farm, Virdata and VMware.


  • Bubeck and Cesa-Bianchi [2012] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1–122, 2012.
  • Hao et al. [2013] Linhui Hao, Qiuling He, Zhishi Wang, Mark Craven, Michael A Newton, and Paul Ahlquist. Limited agreement of independent rnai screens for virus-required host genes owes more to false-negative than false-positive factors. PLoS Comput Biol, 9(9):e1003235, 2013.
  • Hofmann et al. [2011] Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 249–258, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8. doi: 10.1145/2063576.2063618. URL
  • Huycke et al. [1998] Mark M Huycke, Daniel F Sahm, and Michael S Gilmore. Multiple-drug resistant enterococci: the nature of the problem and an agenda for the future. Emerging infectious diseases, 4(2):239, 1998.
  • Iyer and Bilmes [2013] Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Advances in Neural Information Processing Systems, pages 2436–2444, 2013.
  • Jamieson et al. [2014] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, pages 423–439, 2014.
  • Jun et al. [2016] Kwang-Sung Jun, Kevin Jamieson, Rob Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In The 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
  • Kaufmann and Kalyanakrishnan [2013] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251, 2013.
  • Kaufmann et al. [2015] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 2015.
  • Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • Radlinski et al. [2008] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
  • Raman et al. [2012] Karthik Raman, Pannaga Shivaswamy, and Thorsten Joachims. Online learning to diversify from implicit feedback. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 705–713, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1462-6. doi: 10.1145/2339530.2339642.
  • Streeter and Golovin [2009] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In Advances in Neural Information Processing Systems, pages 1577–1584, 2009.
  • Vazirani [2013] Vijay V Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.
  • Yue and Guestrin [2011] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, pages 2483–2491, 2011.

Appendix A Reduction from Max-K-Coverage to Best-of-K

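As in the main text, let $X = (X_1, \dots, X_N) \in \{0,1\}^N$ be a binary reward vector, let $\mathcal{S}^*$ be the set of all optimal $K$-subsets of $[N]$ (we allow for non-uniqueness), and define the gap $\Delta := \mathbb{E}[\max_{i \in S^*} X_i] - \max_{S \notin \mathcal{S}^*} \mathbb{E}[\max_{i \in S} X_i]$, for any $S^* \in \mathcal{S}^*$, as the minimum gap between the rewards of an optimal and a sub-optimal $K$-set. We say $S$ is $\alpha$-optimal for $\alpha \le 1$ if $\mathbb{E}[\max_{i \in S} X_i] \ge \alpha \, \mathbb{E}[\max_{i \in S^*} X_i]$, where $S^* \in \mathcal{S}^*$. We formally introduce the classical Max-K-Coverage problem: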

Definition 2 (Max-K-Coverage).

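A Max-K-Coverage instance is a tuple $(M, K, \mathcal{W})$, where $\mathcal{W}$ is a collection of subsets $W_1, \dots, W_N \subseteq [M]$. We say $\mathcal{H} \subseteq \mathcal{W}$ is a solution to Max-K-Coverage if $|\mathcal{H}| = K$ and $\mathcal{H}$ maximizes $|\bigcup_{W \in \mathcal{H}} W|$. Given $\alpha \le 1$, we say $\mathcal{H}$ is an $\alpha$-approximation if $|\bigcup_{W \in \mathcal{H}} W| \ge \alpha \max_{\mathcal{H}' \subseteq \mathcal{W},\, |\mathcal{H}'| = K} |\bigcup_{W \in \mathcal{H}'} W|$.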
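To make the definition concrete, the following is a minimal Python sketch of Max-K-Coverage solved by exhaustive search, together with a check of the approximation condition. The function names (`coverage`, `max_k_coverage`, `is_alpha_approximation`) and the toy instance are illustrative assumptions and do not appear in the paper; the search enumerates every size-K sub-collection and is therefore exponential in K.

```python
from itertools import combinations


def coverage(sets, chosen):
    """Number of distinct elements covered by the sets with the chosen indices."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)


def max_k_coverage(sets, k):
    """Exhaustive search over all k-subsets of indices (exponential time)."""
    return max(combinations(range(len(sets)), k),
               key=lambda chosen: coverage(sets, chosen))


def is_alpha_approximation(sets, chosen, k, alpha):
    """True if `chosen` covers at least alpha times the optimal coverage."""
    optimum = coverage(sets, max_k_coverage(sets, k))
    return coverage(sets, chosen) >= alpha * optimum


# Hypothetical toy instance: four subsets of {1, ..., 6}, with K = 2.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
best = max_k_coverage(sets, 2)
print(best, coverage(sets, best))
```

On this toy instance the pair of indices (0, 2) covers all six elements, while every other pair covers four, so any pair is a 1/2-approximation but only (0, 2) is an exact solution.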

It is well known that Max-K-Coverage is NP-hard, and cannot be approximated to within a factor of $1 - 1/e + o(1)$ under standard hardness assumptions [14]. The following theorem gives a reduction from Max-K-Coverage to Best-of-K identification (under any feedback model):

Theorem A.1.

Fix $\alpha \le 1$, and let $\mathcal{A}$ be an algorithm which identifies an $\alpha$-optimal $K$-subset of $N$ arms in time polynomial in $N$, $K$, and $\Delta^{-1}$, with probability at least $1 - \delta$ (under any feedback model). Then there is a polynomial time $\alpha$-approximation algorithm for Max-K-Coverage which succeeds with probability at least $1 - \delta$. When $\alpha = 1$, this implies a polynomial time algorithm for exact Max-K-Coverage.
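One natural way to see why such a reduction is plausible is sketched below in Python. It is an illustrative assumption, not necessarily the construction used in the proof: each set in a coverage instance is treated as an arm, a ground element is drawn uniformly at random, and arm i pays 1 exactly when its set contains that element. Under this encoding the expected Best-of-K reward of a subset equals its coverage divided by the size of the ground set, so an optimal subset of arms is an optimal cover. The names `sample_rewards`, `expected_max`, and `estimate_max`, and the toy instance, are hypothetical.

```python
import random
from fractions import Fraction
from itertools import combinations


def sample_rewards(sets, universe, rng):
    """Draw one ground element uniformly; arm i pays 1 iff its set contains it."""
    omega = rng.choice(universe)
    return [int(omega in s) for s in sets]


def expected_max(sets, universe, subset):
    """Exact expected max reward of a subset: covered elements over ground-set size."""
    covered = set().union(*(sets[i] for i in subset))
    return Fraction(len(covered), len(universe))


def estimate_max(sets, universe, subset, n_samples, rng):
    """Monte Carlo estimate of the same quantity from max-only observations."""
    hits = 0
    for _ in range(n_samples):
        x = sample_rewards(sets, universe, rng)
        hits += max(x[i] for i in subset)
    return hits / n_samples


# Hypothetical toy instance: four subsets of {1, ..., 6}, with K = 2.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
universe = sorted(set().union(*sets))
values = {s: expected_max(sets, universe, s)
          for s in combinations(range(len(sets)), 2)}
best = max(values, key=values.get)
# Gap between the optimal value and the best strictly sub-optimal value.
gap = values[best] - max(v for v in values.values() if v < values[best])
print(best, values[best], gap)
```

Because coverages are integers, any two distinct expected rewards in this encoding differ by at least one over the ground-set size, so an algorithm whose running time is polynomial in the inverse gap remains polynomial in the size of the coverage instance.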


Consider an instance of