1 Introduction
This paper addresses a variant of the stochastic multi-armed bandit problem, where, given N arms associated with random variables X_1, …, X_N and some fixed K ≤ N, the goal is to identify the subset S ⊂ [N] of size K that maximizes the objective E[max_{i∈S} X_i]. We refer to this problem as "Best-of-K" bandits to reflect the reward structure and the limited-information setting where, at each round, a player queries a set I of size at most K and only receives information about the arms in I: e.g., the vector of values of all arms in I (semi-bandit), the index of a maximizer (marked bandit), or just the maximum reward over all arms in I (bandit). The game and its valid forms of feedback are formally defined in Figure 1. While approximating the Best-of-K problem and its generalizations have been given considerable attention from a computational angle in the regret setting (Yue and Guestrin, 2011; Hofmann et al., 2011; Raman et al., 2012; Radlinski et al., 2008; Streeter and Golovin, 2009), this work aims at characterizing its intrinsic statistical difficulty as an identification problem. Not only do identification algorithms typically imply low-regret algorithms by first exploring and then exploiting, every result in this paper can easily be extended to the PAC learning setting where we aim to find a set whose reward is within ε of the optimal, a pure-exploration setting of interest for science applications (Kaufmann et al., 2015; Kaufmann and Kalyanakrishnan, 2013; Hao et al., 2013).
For joint reward distributions with high-order correlations, we present distribution-dependent lower bounds which force a learner to consider all subsets in each feedback model of interest, and which match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain favorable distributions because the influence of high-order correlations may be dominated by lower-order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising, nontrivial information occlusion that occurs in the bandit and marked-bandit feedback models. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds.
1.1 Motivation
In the setting where the rewards are binary, one can interpret the objective as trying to find the set of K items which affords the greatest coverage. For example, instead of using broad-spectrum antibiotics, which have come under fire for leading to drug-resistant "super bugs" (Huycke et al., 1998), consider a doctor who wishes to identify the best subset of K narrow-spectrum antibiotics that leads to as many favorable outcomes as possible. Here each draw of X_i represents the ith treatment working on a random patient, and for antibiotics we may assume that there are no synergistic effects between different drugs in the treatment. Thus, the antibiotics example falls under the bandit feedback setting, since K treatments are selected but it is only observed whether at least one of the treatments led to a favorable outcome: no information is observed about any particular treatment.
Now consider content recommendation tasks where K items are suggested and the user clicks on either one or none. Here each draw of X_i represents a user's potential interest in the ith item, which we assume is independent of the other items shown alongside it. Nevertheless, due to the variety and complexity of users' preferences, the X_i's have a highly dependent joint distribution, and we only get to observe marked-bandit feedback, namely one item which the user has clicked on. Our final example comes from virology, where multiple experiments are prepared and performed at a time, resulting in simultaneous, noisy responses (Hao et al., 2013); this motivates our consideration of the semi-bandit feedback setting.
1.2 Problem Description
We denote [N] := {1, …, N}. For a finite set S, we let 2^S denote its power set, (S choose k) denote the set of all subsets of S of size k, and write A ~ Unif(S) to denote that A is drawn uniformly from S. If x is a length-N vector (binary, real or otherwise) and A ⊂ [N], we let x_A denote the subvector indexed by the entries in A.
In what follows, let X = (X_1, …, X_N) be a random vector drawn from a probability distribution ν over {0,1}^N. We refer to the index i as the ith arm, and let ν_i denote the marginal distribution of its corresponding entry in X, i.e. X_i ~ ν_i. We define μ_i := E[X_i], and for a given S ⊂ [N] we call E[max_{i∈S} X_i] the expected reward of S, referring casually to the random instantiations max_{i∈S} X_i as simply the reward of S. At each time t, nature draws a reward vector X(t), where X(t) is i.i.d. from ν. Simultaneously, our algorithm queries a subset I_t ⊂ [N] of at most K arms, and we refer to the entries of I_t as the arms pulled by the query. As we will describe later, this problem has previously been studied in a regret framework, where a time horizon T is fixed and an algorithm's objective is to minimize its regret
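For concreteness, the three feedback models can be simulated in a few lines. This is our own illustration, not the paper's notation: the function `feedback` and its 0/1 dictionary encoding of a reward draw are assumptions made for the sketch.

```python
import random

def feedback(x, query, model, rng=random):
    """Return the observation for one round of Best-of-K.

    x     : dict mapping arm index -> realized reward in {0, 1} this round
    query : set of arm indices pulled (the set I, |I| <= K)
    model : 'semi', 'marked', or 'bandit'
    """
    rewards = {i: x[i] for i in query}
    if model == 'semi':            # full reward vector on the queried arms
        return rewards
    winners = [i for i in query if rewards[i] == 1]
    if model == 'marked':          # index of one maximizer (uniform), or None
        return rng.choice(winners) if winners else None
    if model == 'bandit':          # only the max reward over the query
        return 1 if winners else 0
    raise ValueError(model)
```

Note how bandit feedback occludes the most: a reward of 1 reveals nothing about which arm in the query fired.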
R_T := T · max_{S ∈ ([N] choose K)} E[max_{i∈S} X_i] − E[ Σ_{t=1}^{T} max_{i∈I_t} X_i(t) ].   (1)
In this work, we are more concerned with the problem of identifying the best subset of K arms. More precisely, for a given measure ν, denote the optimal subset
S* := argmax_{S ∈ ([N] choose K)} E[max_{i∈S} X_i],   (2)
and let T_S denote the (possibly random) number of times a particular subset S has been played before our algorithm terminates. The identification problem is then
Definition 1 (Best-of-K Subset Identification).
For any measure ν and fixed δ ∈ (0, 1), return an estimate Ŝ such that P(Ŝ = S*) ≥ 1 − δ, and which minimizes the sum Σ_S T_S either in expectation, or with high probability.
1.3 Related Work
Variants of Best-of-K have been studied extensively in the context of online recommendation and ad placement (Yue and Guestrin, 2011; Hofmann et al., 2011; Raman et al., 2012). For example, Radlinski et al. (2008) introduce "Ranked Bandits," where the arms are stochastic random variables X_{t,i} which take the value 1 if the tth user finds item i relevant, and 0 otherwise. The goal is to recommend an ordered list of K items which maximizes the probability of a click on any item in the list, i.e. P(max_{i∈S} X_i = 1), and the learner observes the first item (if any) that the user clicked on. Streeter and Golovin (2009) generalize to online maximization of a sequence of monotone, submodular functions subject to knapsack constraints, under a variety of feedback models. Since the function S ↦ E[max_{i∈S} X_i] is submodular, identifying S* corresponds to a special case of optimizing a monotone, submodular function subject to these same constraints.
Streeter and Golovin (2009), Yue and Guestrin (2011), and Radlinski et al. (2008) propose online variants of a well-known greedy offline submodular optimization algorithm (see, for example, Iyer and Bilmes (2013)), which attain approximate regret guarantees of the form
(1 − 1/e) · max_{S ∈ ([N] choose K)} Σ_{t=1}^{T} E[max_{i∈S} X_i(t)] − E[ Σ_{t=1}^{T} max_{i∈I_t} X_i(t) ] ≤ R_T,   (3)
where R_T is a regret term that grows sublinearly in T. Computationally, this is the best one could hope for: Best-of-K and Ranked Bandits are online variants of the Max-K-Coverage problem, which cannot be approximated to within a factor of 1 − 1/e + ε for any fixed ε > 0 under standard hardness assumptions (Vazirani, 2013). For completeness, we provide a formal reduction from Best-of-K identification to Max-K-Coverage in Appendix A.
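The offline greedy subroutine underlying these online variants repeatedly adds the set with the largest marginal coverage. A minimal sketch of that offline routine, with names of our own choosing:

```python
def greedy_max_coverage(subsets, K):
    """Offline greedy (1 - 1/e)-approximation for Max-K-Coverage.

    subsets : dict mapping a subset's name to the set of elements it covers
    K       : number of subsets to pick
    """
    covered, chosen = set(), []
    for _ in range(K):
        # pick the subset covering the most still-uncovered elements
        best = max(subsets, key=lambda s: len(subsets[s] - covered))
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered
```

For instance, with subsets A = {1, 2, 3}, B = {3, 4}, C = {5, 6} and K = 2, greedy picks A first (3 new elements) and then C (2 new elements), even though B would have been a valid choice; this myopic rule is what the online variants emulate under noisy feedback.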
1.4 Our Contributions
Focusing on the stochastic pure-exploration setting with binary rewards, our contributions are as follows:

We propose a family of joint distributions such that any algorithm that solves the Best-of-K identification problem with high probability must essentially query all (N choose K) combinations of arms. Our lower bounds for the bandit case are nearly matched by trivial identification and regret algorithms that treat each subset as an independent arm. For semi-bandit feedback, our lower bounds are exponentially lower in K than those for bandit feedback (though they still require exhaustive search). To better understand this gap, we sketch an upper bound that achieves the lower bound for a particular instance of our construction. While in the general binary case the difficulty of marked-bandit feedback is sandwiched between bandit and semi-bandit feedback, in our particular construction we show that marked-bandit feedback has no benefit over bandit feedback. In particular, for worst-case instances, our lower bounds for marked bandits are matched by upper bounds based on algorithms which only take advantage of bandit feedback.

Our construction plants a single dependent set among (K−1)-wise independent sets, creating a needle-in-a-haystack scenario. One weakness of this construction is that the gap between the rewards of the best and second-best subsets is exponentially small in K. This is particular to our construction, but not to our analysis: we present a partial converse which establishes that, for any two (K−1)-wise independent distributions defined over {0,1}^K with identical marginal means μ, the difference in expected reward is exponentially small in K.¹ This begs the question: can low-order correlation statistics allow us to neglect higher-order dependencies? And can this property be exploited to avoid combinatorially large sample complexity in favorable scenarios with moderate gaps?
¹Note that our construction requires all subsets of S* of size K−1 to be independent.

We lay the groundwork for identification algorithms under favorable, though still dependent, measures by designing a computationally efficient algorithm for independent measures in the marked-bandit, semi-bandit, and bandit feedback models. Though independent semi-bandits are straightforward (Jun et al., 2016), special care must be taken to address the information occlusion that occurs in the bandit and marked-bandit models, even in this simplified setting. We provide nearly matching lower bounds, and conclude that even for independent measures, bandit feedback may require exponentially (in K) more samples than the semi-bandit setting.
2 Lower Bound for Dependent Arms
Intuitively, the Best-of-K problem is hard in the dependent case because high-reward subsets may appear as collections of individually low-payoff arms if not sampled together. For instance, for K = 2, if X_1 ~ Bernoulli(1/2), X_2 = 1 − X_1, and X_i ~ Bernoulli(1/2) independently for all i ≥ 3, then clearly {1, 2} is the best subset, because E[max(X_1, X_2)] = 1 while E[max(X_i, X_j)] = 3/4 for every other pair. However, identifying the set {1, 2} appears difficult, as presumably one would have to consider all (N choose 2) sets: if arms 1 and 2 are not queried together, they appear as independent Bernoulli(1/2) arms.
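The K = 2 example can be checked exactly by enumerating the joint distribution of a pair (a small illustration in our own notation; `expected_max` is not a routine from the paper):

```python
from itertools import product

def expected_max(joint):
    """Exact E[max(X_i, X_j)] given a joint pmf over {0,1}^2."""
    return sum(p * max(a, b) for (a, b), p in joint.items())

# anti-correlated pair: X_2 = 1 - X_1, each marginal is Bernoulli(1/2)
dependent = {(0, 1): 0.5, (1, 0): 0.5}
# a pair of independent Bernoulli(1/2) arms: identical marginals
independent = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}
```

Both pairs have identical marginals, so no amount of querying the two arms separately can distinguish the always-rewarded pair (expected max 1) from an ordinary pair (expected max 3/4); only joint queries reveal the dependence.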
Our lower bound generalizes this construction by introducing a measure such that (1) the arms in the optimal set S* are dependent but (2) the arms in every other non-optimal subset of K arms are mutually independent. This construction amounts to hiding a "needle in a haystack" among all other subsets, requiring any possible identification algorithm to examine most elements of ([N] choose K).
We now state our theorem, which characterizes the difficulty of recovering S* in terms of the gap between the expected reward of S* and that of the second-best subset,
Δ := E[max_{i∈S*} X_i] − max_{S ∈ ([N] choose K) : S ≠ S*} E[max_{i∈S} X_i].   (4)
Theorem 2.1 (Dependent).
Fix K ≤ N/2. For any Δ > 0 and δ ∈ (0, 1) there exists a distribution with gap Δ such that any algorithm that identifies S* with probability at least 1 − δ requires, in expectation, at least
observations. In particular, for any Δ there exists a distribution with gap Δ attaining this bound under (marked-)bandit observations, and for any Δ there exists a distribution with gap Δ attaining the corresponding bound under semi-bandit observations.
Remark 2.1.
Marked-bandit feedback provides strictly less information than semi-bandit feedback, but at least as much as bandit feedback. The above lower bound for marked-bandit feedback and the nearly matching upper bound for bandit feedback remarked on below suggest that marked-bandit feedback may provide no more information than bandit feedback. However, the lower bound holds only for a particular construction, and in Section 3 we show that there exist instances in which marked-bandit feedback provides substantially more information than mere bandit feedback.
In the construction of the lower bound, S* and all other subsets behave like collections of completely independent arms. Each individual arm has mean μ, i.e. μ_i = μ for all i ∈ [N], so each set has a bandit reward of 1 − (1 − μ)^K. The scaling γ in the number of bandit and marked-bandit observations corresponds to the variance of this reward, and captures the property that the number of times a set needs to be sampled to accurately predict its reward is proportional to its variance. We note that the term γ is typically very close to 1, unless μ is nearly 1. While the lower bound construction makes it necessary to consider each subset individually for all forms of feedback, semi-bandit feedback presumably allows one to detect dependencies much faster than bandit or marked-bandit feedback, resulting in an exponentially smaller bound in K. Indeed, Remark E.2 describes an algorithm that uses the parity of the observed rewards and nearly achieves the lower bound for semi-bandits on the constructed instance. However, the authors are unaware of more general matching upper bounds for the semi-bandit setting and consider this a possible avenue for future research.
2.1 Comparison with Known Upper Bounds
By treating each set as an independent arm, standard best-arm identification algorithms can be applied to identify S*. The KL-based LUCB algorithm from Kaufmann and Kalyanakrishnan (2013) matches our bandit lower bound up to a multiplicative factor (which is typically dwarfed by the leading combinatorial term). The lil'UCB algorithm of Jamieson et al. (2014) avoids paying this multiplicative factor, but at the cost of not adapting to the variance term γ. Perhaps a KL- or variance-adaptive extension of lil'UCB could attain the best of both worlds.
From a regret perspective, the exact construction used in the proof of Theorem 2.1 can be combined with Theorem 17 of Kaufmann et al. (2015) to state a lower bound on the regret after T bandit observations. Specifically, if an algorithm obtains small stochastic regret for all T, then its regret is lower bounded in terms of the gap Δ given in Theorem 2.1. Alternatively, in an adversarial setting, the above construction also implies a regret lower bound for any algorithm over a time budget T. Both of these regret bounds are matched by upper bounds found in Bubeck and Cesa-Bianchi (2012).
2.2 Do Complicated Dependencies Require Small Gaps?
While Theorem 2.1 proves the existence of a family of instances in which on the order of (N choose K) samples are necessary to identify the best subset, the possible gaps are restricted to be no larger than a quantity exponentially small in K. It is natural to wonder whether this is an artifact of our analysis, a fundamental limitation of (K−1)-wise independent sets, or a property of dependent sets that we can potentially exploit in algorithms. The following theorem suggests, but does not go as far as to prove, that if there are only very high-order dependencies, then these dependencies cannot produce gaps substantially larger than the range described by Theorem 2.1. More precisely, the next theorem characterizes the maximum gap for (K−1)-wise independent instances.
Theorem 2.2.
Let X be a random vector supported on {0,1}^K with (K−1)-wise independent marginal distributions, such that E[X_i] = μ for all i ∈ [K]. Then there is a one-to-one correspondence between such joint distributions over {0,1}^K and probability assignments P(max_{i∈[K]} X_i = 1). When K is even, all such assignments lie in the range
(5)
where the endpoints are indexed by the largest odd integer no greater than K and the largest even integer no greater than K, respectively. Moreover, when K is odd, all such assignments lie in the range
(6)
Theorem 2.2 implies that the difference between the largest possible and smallest possible expected rewards for a set of K arms, where each arm has mean μ and the distribution is (K−1)-wise independent, is exponentially small in K, a gap of the same order as the gaps used in our lower bounds above. This implies that, in the absence of low-order correlations, very high-order correlations can have only a limited effect on the expected rewards of sets.
If it were possible to make more precise statements about the degree to which high-order dependencies can influence the reward of a subset, strategies could exploit this diminishing-returns property to search for subsets more efficiently while maintaining large-time-horizon optimality. In particular, one could use such bounds to rule out sets from consideration based solely on their lower-order dependency statistics. To be clear, such algorithms would not contradict our lower bounds, but they may perform much better than trivial approaches under favorable conditions.
3 Best-of-K with Independent Arms
While the dependent case is of considerable practical interest, the remainder of this paper investigates the Best-of-K problem where ν is assumed to be a product distribution of N independent Bernoulli distributions. We show that even in this presumably much simpler setting, there remain highly nontrivial algorithm-design challenges related to the information occlusion that occurs in the bandit and marked-bandit feedback settings. We present an algorithm and analysis which tries to mitigate information occlusion, and which we hope can inform strategies for favorable instances of dependent measures.
Under the independent Bernoulli assumption, each arm i is associated with a mean μ_i, the expected reward of playing any set S is equal to 1 − Π_{i∈S}(1 − μ_i), and hence the best subset of K arms is precisely the set of K arms with the greatest means.
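Because 1 − Π_{i∈S}(1 − μ_i) is increasing in each μ_i, a brute-force search over subsets agrees with simply taking the K largest means. A quick check (illustrative code with our own names, assuming the product-Bernoulli model above):

```python
from itertools import combinations
from math import prod

def expected_reward(mu, S):
    """E[max_{i in S} X_i] = 1 - prod_{i in S}(1 - mu_i) for independent Bernoulli arms."""
    return 1 - prod(1 - mu[i] for i in S)

# brute force over all subsets of size K agrees with picking the top-K means,
# since the objective is monotone in each mu_i
mu = [0.9, 0.5, 0.4, 0.1]
K = 2
best = max(combinations(range(len(mu)), K), key=lambda S: expected_reward(mu, S))
```

Here `best` is (0, 1), the indices of the two largest means, with expected reward 1 − 0.1 · 0.5 = 0.95.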
3.1 Results
Without loss of generality, suppose the means are ordered μ_1 ≥ μ_2 ≥ … ≥ μ_N. Assuming μ_K > μ_{K+1} ensures that the set of top-K means is unique, though our results could easily be extended to a PAC learning setting. Define the gaps and variances via
and  (7) 
For , introduce the transformation
(8) 
where Õ(·) hides logarithmic factors of its argument. We present guarantees for the Stagewise Elimination procedure of Algorithm 3 in our three feedback models of interest; the broad brush strokes of our analysis are addressed in Appendix B, and the details are fleshed out in Appendices C and B.2. Our first result holds for semi-bandits, and slightly improves upon the best known result for the batch setting (Jun et al., 2016) by adapting to unknown variances:
Theorem 3.1 (Semi-Bandit).
With probability 1 − δ, Algorithm 3 with semi-bandit feedback returns the K arms with the top means using no more than
(9) 
queries where
(10) 
and is a permutation so that .
The above result also holds in the more general setting where the rewards have arbitrary distributions bounded in [0, 1] almost surely (where σ_i² is simply the variance of arm i).
In the marked-bandit and bandit settings, our upper bounds incur a dependence on information-sharing terms (marked) and (bandit), which capture the extent to which the max operator occludes information about the rewards of individual arms in each query.
Theorem 3.2 (Marked Bandit).
Suppose we require each query to pull exactly K arms. Then Algorithm 3 with marked-bandit feedback returns the K arms with the top means with probability at least 1 − δ using no more than
(11) 
queries. Here, is given by
(12) 
is a permutation so that , and is an “information sharing term” given by
(13) 
If we can pull fewer than K arms per round, then we can achieve
(14) 
We remark that the two quantities differ by a constant factor when the means are not too close to 1 (this difference comes from losing a term in a Bernoulli variance in the marked case). Hence, when we are allowed to pull fewer than K arms per round, Stagewise Elimination with marked-bandit feedback does no worse than a standard LUCB algorithm for stochastic best-arm identification.
When the means are on the order of 1/K, the information-sharing term is a constant, and thus Stagewise Elimination gives the same guarantees for marked bandits as for semi-bandits. The reason is that, when the means are this small, we can expect each query to contain only a constant number of arms i for which X_i = 1, and so not much information is lost by observing only one of them.
Finally, we note that our guarantees depend crucially on the fact that the marking is uniform. We conjecture that adversarial marking is as challenging as the bandit setting, whose guarantees are as follows:
Theorem 3.3 (Bandit).
Suppose we require each query to pull exactly K arms, and that the two conditions discussed below hold. Then Algorithm 3 with bandit feedback returns the K arms with the top means with probability at least 1 − δ using no more than
(15) 
queries where is an “information sharing term”,
and is a permutation so that .
The first condition ensures identifiability (see Remark B.11). The second condition is an artifact of using the Balancing Set defined in Algorithm 4; without it, our algorithm succeeds for all K, albeit with slightly looser guarantees (see Remark B.9).
Remark 3.1.
Suppose the means are all at least a small constant. Then the information-sharing term is large, and Successive Elimination requires on the order of that many more queries to identify the top K arms than the classic stochastic MAB setting where one pulls a single arm at a time, despite the seeming advantage that the bandit setting lets you pull K arms per query. Depending on the size of the means, the information-sharing term is at least polynomially large in K, and can be exponentially large in K.
On the other hand, when the means are all on the order of 1/K, the information-sharing term is a constant. For this case, our sample complexity looks like
(16)
which matches, but does not outperform, the standard one-arm-per-query MAB guarantees with variance adaptation (e.g., Theorem 3.1 with K = 1; note that the relevant term captures the variance). Hence, when the means are all roughly on the same order, it is never worse to pull one arm at a time and observe its reward than to pull K arms and observe their max. Once the means vary wildly, however, this is certainly not true; we direct the reader to Remark B.12 for further discussion.
3.2 Algorithm
At each stage, our algorithm maintains an accept set of arms which we are confident lie in the top K, a reject set of arms which we are confident lie in the bottom N − K, and an undecided set containing arms for which we have not yet rendered a decision. The main obstacle is to obtain estimates of the relative performance of the undecided arms, since the bandit and marked-bandit observation models occlude isolated information about any one given arm in a pull. The key observation is that, if we sample the queried sets uniformly at random, then the following differences have the same sign as μ_i − μ_j (stated formally in Lemma B.2):
(17)  
This motivates a sampling strategy where we partition the undecided set uniformly at random into subsets of a given size and query each one. We record all arms i for which X_i = 1 in the semi-/marked-bandit settings (Algorithm 1, Line 1), and, in the bandit setting, mark down all arms in the query if we observe that its max is 1, i.e., we observe a reward of 1 (Algorithm 1, Line 1). This recording procedure is summarized in Algorithm 1:
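A stripped-down sketch of one partition-and-record pass may help fix ideas; the paper's UniformPlay and PlayAndRecord do more bookkeeping (top-off sets, divisibility handling), so the function name, signature, and `pull` callback here are our simplification.

```python
import random

def uniform_play(undecided, k, pull, rng=random):
    """One pass of the partition-and-record sampling sketch.

    undecided : list of undecided arms
    k         : number of undecided arms placed in each query
    pull      : callable(set) -> set of arms observed to equal 1
                (stands in for the feedback on one query)
    Returns a dict mapping each arm to its 0/1 record for this pass.
    """
    arms = undecided[:]
    rng.shuffle(arms)                  # uniformly random partition
    record = {}
    for j in range(0, len(arms), k):
        block = set(arms[j:j + k])
        fired = pull(block)            # query the block, observe which arms fired
        for i in block:
            record[i] = 1 if i in fired else 0
    return record
```

Because the partition is uniform, averaging such records over many passes yields unbiased comparisons between undecided arms, which is exactly the property the differences above rely on.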
Note that PlayAndRecord plays the union of the two sets it is given, but only records entries of X whose indices lie in the first. UniformPlay (Algorithm 2) outlines our sampling strategy. Each call to UniformPlay returns a vector, supported on the pulled entries, for which
(18) 
where the remaining "top-off" set is empty unless we are required to pull exactly K arms per query, in which case its elements are drawn as outlined in Algorithm 2.
There are a couple of nuances worth mentioning. When the undecided set is small, we cannot draw the whole query from it; hence UniformPlay pulls only part of each query from the undecided set. If we are forced to pull exactly K arms per query, UniformPlay adds in a "Top-Off" set of additional arms drawn from the accept and reject sets. Furthermore, observe that UniformPlay carefully handles divisibility issues so as not to "double mark" entries, thus ensuring the correctness of Equation 18. Finally, note that each call to UniformPlay makes the same fixed number of queries.
We deploy the passive sampling of UniformPlay in a stagewise successive elimination procedure formalized in Algorithm 3. At each round, we double the sample size and set the partition parameter for UniformPlay accordingly. Next, we construct the sets from which UniformPlay samples: in the marked- and semi-bandit settings these are used directly, while in the bandit setting they are obtained from Algorithm 4, which transfers a couple of low-mean arms into the query pool. This procedure ameliorates the effect of information occlusion in the bandit case.
Algorithm 3 then averages together independent, identically distributed samples from UniformPlay to produce unbiased estimates of the quantity defined in Equation 18. These estimates are Binomial, so we apply an empirical Bernstein inequality from Maurer and Pontil (2009) to build tight confidence intervals, where
(19)
Note that these coincide with the canonical definition of sample variance. The variance dependence of our confidence intervals is crucial; see Remarks B.7 and B.8 for more details. For any pair of arms, let
(20) 
As mentioned above, Lemma B.2 ensures that the estimated differences have the same sign as the true gaps μ_i − μ_j. Thus, accepting or rejecting an arm on the basis of these estimates correctly decides whether it lies in the top K.
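The Maurer and Pontil (2009) bound referenced above takes the following form for [0, 1]-valued samples. This is a one-sided sketch of the published inequality; the algorithm's actual intervals include additional union-bound factors over arms and stages.

```python
from math import log, sqrt

def empirical_bernstein_radius(samples, delta):
    """One-sided empirical Bernstein radius (Maurer & Pontil, 2009) for
    i.i.d. samples in [0, 1]: with probability >= 1 - delta,
    E[X] <= sample_mean + radius."""
    n = len(samples)
    mean = sum(samples) / n
    # canonical unbiased sample variance, as in the intervals of (19)
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return sqrt(2 * var * log(2 / delta) / n) + 7 * log(2 / delta) / (3 * (n - 1))
```

The first term scales with the empirical standard deviation, which is what lets the intervals adapt to low-variance arms; the second, variance-free term decays at the faster 1/n rate.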
4 Lower bound for Independent Arms
In the bandit and marked-bandit settings, the upper bounds of the previous section depend on "information-sharing" terms that quantify the degree to which other arms occlude the performance of a particular arm in a played set. Indeed, great care was taken in the design of the algorithm to minimize the impact of this information sharing. The next theorem shows that the upper bounds of the previous section for bandit and semi-bandit feedback are nearly tight, up to a similarly defined information-sharing term.
Theorem 4.1 (Independent).
Fix δ ∈ (0, 1). Let ν be a product distribution where each X_i is an independent Bernoulli with mean μ_i. Assume μ_1 ≥ μ_2 ≥ … ≥ μ_N (the ordering is unknown to any algorithm). At each time t the algorithm queries a set I_t and observes the corresponding feedback. Then any algorithm that identifies the top K arms with probability at least 1 − δ requires, in expectation, at least
observations where
where .
Our lower bounds match our upper bounds up to the information-sharing terms. In the bandit setting, varying K reveals a trade-off between the information-sharing term, which decreases with larger K, and the benefit of a factor gained from querying K arms at once. One can construct different instances that are optimized by choices across the entire range of K. Future research may consider varying the subset size adaptively to optimize this trade-off.
The information-sharing terms defined in the upper and lower bounds correspond to the most pessimistic and optimistic scenarios, respectively, and result from applying coarse bounds in exchange for simpler proofs. Thus, our algorithm may fare considerably better in practice than is predicted by the upper bounds. Moreover, when the dominant terms coincide, our upper and lower bounds differ by constant factors.
Finally, we note that our upper and lower bounds for independent measures are tailored to Bernoulli payoffs, where the best subset corresponds to the top K means. However, for general product distributions on [0, 1], this is no longer true (see Remark B.1). This leaves open the question: how difficult is Best-of-K for general, independent, bounded product measures? And, in the marked feedback setting (where one receives the index of a best element in the query), is this problem even well-posed?
Acknowledgements
We thank Elad Hazan for illuminating discussions regarding the computational complexity of the Best-of-K problem, and for pointing us to resources addressing online submodularity and approximate regret. Max Simchowitz is supported by an NSF GRFP award. Ben Recht and Kevin Jamieson are generously supported by ONR awards N000141512620 and N000141310129. BR is additionally generously supported by ONR award N000141410024 and NSF awards CCF1148243 and CCF1217058. This research is supported in part by gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Intel, Microsoft, Pivotal, Samsung, Schlumberger, Splunk, State Farm, Virdata and VMware.
References
 Bubeck and Cesa-Bianchi [2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 Hao et al. [2013] Linhui Hao, Qiuling He, Zhishi Wang, Mark Craven, Michael A. Newton, and Paul Ahlquist. Limited agreement of independent RNAi screens for virus-required host genes owes more to false-negative than false-positive factors. PLoS Computational Biology, 9(9):e1003235, 2013.
 Hofmann et al. [2011] Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 249–258, New York, NY, USA, 2011. ACM. ISBN 9781450307178. doi: 10.1145/2063576.2063618. URL http://doi.acm.org/10.1145/2063576.2063618.
 Huycke et al. [1998] Mark M. Huycke, Daniel F. Sahm, and Michael S. Gilmore. Multiple-drug resistant enterococci: the nature of the problem and an agenda for the future. Emerging Infectious Diseases, 4(2):239, 1998.
 Iyer and Bilmes [2013] Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Advances in Neural Information Processing Systems, pages 2436–2444, 2013.
 Jamieson et al. [2014] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, pages 423–439, 2014.

 Jun et al. [2016] Kwang-Sung Jun, Kevin Jamieson, Rob Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In The 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 Kaufmann and Kalyanakrishnan [2013] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251, 2013.
 Kaufmann et al. [2015] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multiarmed bandit models. The Journal of Machine Learning Research, 2015.
 Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
 Radlinski et al. [2008] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
 Raman et al. [2012] Karthik Raman, Pannaga Shivaswamy, and Thorsten Joachims. Online learning to diversify from implicit feedback. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 705–713, New York, NY, USA, 2012. ACM. ISBN 9781450314626. doi: 10.1145/2339530.2339642. URL http://doi.acm.org/10.1145/2339530.2339642.
 Streeter and Golovin [2009] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In Advances in Neural Information Processing Systems, pages 1577–1584, 2009.
 Vazirani [2013] Vijay V Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.
 Yue and Guestrin [2011] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, pages 2483–2491, 2011.
Appendix A Reduction from Max-K-Coverage to Best-of-K
As in the main text, let X ∈ {0,1}^N be a binary reward vector, let S* denote the set of all optimal subsets of size K (we allow for non-uniqueness), and define the gap Δ as the minimum gap between the rewards of an optimal and a suboptimal set. We say a set is optimal if it attains the maximal expected reward. We formally introduce the classical Max-K-Coverage problem:
Definition 2 (Max-K-Coverage).
A Max-K-Coverage instance is a tuple consisting of a universe of elements and a collection of subsets of that universe. We say a subcollection is a solution to Max-K-Coverage if it consists of at most K subsets and maximizes the number of covered elements. Given α ∈ (0, 1], we say a subcollection is an α-approximation if it covers at least an α fraction of the elements covered by an optimal solution.
It is well known that Max-K-Coverage is NP-Hard, and cannot be approximated to within a factor of 1 − 1/e + ε under standard hardness assumptions [14]. The following theorem gives a reduction from Best-of-K identification (under any feedback model) to Max-K-Coverage:
Theorem A.1.
Fix δ ∈ (0, 1), and let A be an algorithm which identifies an optimal subset of K arms in time polynomial in N, K, and 1/Δ, with probability at least 1 − δ (under any feedback model). Then there is a polynomial-time approximation algorithm for Max-K-Coverage which succeeds with probability at least 1 − δ. For a suitable gap, this implies a polynomial-time algorithm for exact Max-K-Coverage.
Proof.
Consider an instance of