
# The Max K-Armed Bandit: A PAC Lower Bound and tighter Algorithms

We consider the Max K-Armed Bandit problem, where a learning agent is faced with several sources (arms) of items (rewards), and is interested in finding the best item overall. At each time step the agent chooses an arm, and obtains a random real valued reward. The rewards of each arm are assumed to be i.i.d., with an unknown probability distribution that generally differs among the arms. Under the PAC framework, we provide lower bounds on the sample complexity of any (ϵ,δ)-correct algorithm, and propose algorithms that attain this bound up to logarithmic factors. We compare the performance of these multi-arm algorithms to the variant in which the arms are not distinguishable by the agent and are chosen randomly at each stage. Interestingly, when the maximal rewards of the arms happen to be similar, the latter approach may provide better performance.


## 1 Introduction

In the classic stochastic multi-armed bandit (MAB) problem the learning agent faces a set of stochastic arms, and wishes to maximize its cumulative reward (in the regret formulation), or find the arm with the highest mean reward (the pure exploration problem). This model has been studied extensively in the statistical and learning literature, see for example [1] for a comprehensive survey.

We consider a variant of the MAB problem called the Max $K$-Armed Bandit problem (Max-Bandit for short). In this variant, the objective is to obtain a sample with the highest possible reward (namely, the highest value in the support of the probability distribution of any arm). More precisely, considering the PAC setting, the objective is to return an $(\epsilon,\delta)$-correct sample, namely a sample whose reward value is $\epsilon$-close to the overall best possible reward with a probability larger than $1-\delta$. In addition, we wish to minimize the sample complexity, namely the expected number of samples observed by the learning algorithm before it terminates.

For the classical MAB problem, algorithms that find the best arm (in terms of its expected reward) in the PAC sense were presented in [2, 3, 4], and lower bounds on the sample complexity were presented in [5] and [3]. The essential difference with respect to this work is in the objective, which is to find an $(\epsilon,\delta)$-correct sample in our case. The scenario considered in the Max-Bandit model is relevant when a single best item needs to be selected from among several (large) clustered sets of items, with each set represented as a single arm. These sets may represent parts that come from different manufacturers or are produced by different processes, job candidates that are referred by different employment agencies, the best match to certain genetic characteristics in different populations, or the best channel among different frequency bands in a cognitive radio wireless network.

The Max-Bandit problem was apparently first proposed in [6]. For reward distribution functions in a specific family, an algorithm with an upper bound on the sample complexity that increases as was provided in [7]. For the case of discrete rewards, another algorithm was presented in [8], without performance analysis. Later, a similar model in which the objective is to maximize the expected value of the largest sampled reward for a given number of samples () was studied in [9]. In that work the attained best reward is compared with the expected reward obtained by an oracle that samples the best arm time. An algorithm is suggested and shown to secure an upper bound of order on that difference, where is determined by the properties of the distribution functions and decreases as they are further away from a specific functions family.

Our basic assumption in the present paper is that a known lower bound is available on the tail distributions, namely on the probability that the reward of each given arm will be close to its maximum. A special case is when the probability densities near the maximum are larger than a given value, but we consider more general function classes. Under that assumption, we provide an algorithm for which the sample complexity increases as at most . This provides an improvement by a factor of over the result of [7], which was obtained for a more specific model. To compare with the result in [9], we observe that with a choice of in our algorithm, we obtain that the expected shortfall of the largest sample with respect to the maximal reward possible is at most of order (as compared to with ). Furthermore, we provide a lower bound on the sample complexity of every $(\epsilon,\delta)$-correct algorithm, which holds when several arms possess maximal rewards that are close to that of the best arm. This lower bound is shown to coincide, up to a logarithmic term, with the upper bound derived for the proposed algorithm.

A basic feature of the Max-Bandit problem (and the associated algorithms) is the goal of quickly focusing on the best arm (in terms of maximal reward), and sampling from that arm as much as possible. It is of interest to compare the obtained results with the alternative approach, which ignores the distinction between arms, and simply draws a sample from a random arm (say, with uniform probabilities) at each round. This can be interpreted as mixing the items associated with each arm before sampling; we accordingly refer to this variant as the unified-arm problem. This problem actually coincides with the so-called infinitely-many armed bandit model studied in [10, 11, 12, 13, 14], for the specific case of deterministic arms studied in [15]. There is no conclusive answer as to whether the multi-arm approach or the unified-arm approach should be applied. However, as a rule of thumb, when the maximal possible rewards of many arms are far from the optimal, the multi-arm approach has better performance.

The paper proceeds as follows. In the next section we present our model. In Section 3 we provide a lower bound on the sample complexity of every $(\epsilon,\delta)$-correct algorithm. In Section 4 we present two $(\epsilon,\delta)$-correct algorithms, and we provide an upper bound on the sample complexity of one of them. The first algorithm is simple, and its bound has the same order as the lower bound up to a logarithmic term that involves the number of arms $|K|$. The second algorithm is more complicated, and we believe that its bound exceeds the lower bound by at most a double logarithmic term. In Section 5, we consider for comparison the unified-arm case. In Section 6 we close the paper with some concluding remarks. Certain proofs are deferred to the Appendix due to space limitations.

## 2 Model Definition

We consider a finite set of arms, denoted by $K$. At each stage $t=1,2,\dots$ the learning agent chooses an arm $k\in K$, and a real valued reward is obtained from that arm. The rewards obtained from each arm are independent and identically distributed, with a distribution function (CDF) $F_k(\mu)$, $k\in K$. We denote the maximal possible reward of each arm by $\mu^*_k$, assumed finite, and the maximal reward among all arms by $\mu^*=\max_{k\in K}\mu^*_k$.

Throughout the paper, we shall make the following assumption.

###### Assumption 1.

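There exist known constants $A>0$, $\beta>0$ and $\epsilon_0>0$ such that, for every $k\in K$ and $0\le\epsilon\le\epsilon_0$, it holds that

$$P(\mu_k>\mu^*_k-\epsilon)\ge A\epsilon^{\beta},$$

where $\mu_k$ stands for a random variable with distribution $F_k$.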

The bound in the above assumption can also be expressed as $F_k(\mu^*_k-\epsilon)\le 1-A\epsilon^{\beta}$. This condition requires $F_k$ to have a certain mass near its maximal reward. Note that the specific case of $\beta=1$ is satisfied if the densities are lower bounded by a constant $A$ near the maximum. Values of $\beta>1$ accommodate leaner tails.

The upper bound on the CDF ensures that for each arm, an $\epsilon$-optimal reward can be observed within a finite number of samples. The bound in the above assumption is similar to those assumed in [12] and [15].
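As a small illustration of this remark, the sketch below computes a number of samples from a single arm that suffices to observe an $\epsilon$-optimal reward with probability larger than $1-\delta$. It relies only on Assumption 1 and the elementary inequality $(1-x)^N\le e^{-xN}$. The function name and the example values are illustrative assumptions and do not come from the paper.

```python
import math


def samples_for_eps_optimal(A, beta, eps, delta):
    """Return an N with (1 - A * eps**beta)**N < delta.

    Under Assumption 1, one sample misses the interval (mu_k^* - eps, mu_k^*]
    with probability at most 1 - A * eps**beta, so N independent samples all
    miss it with probability at most (1 - A * eps**beta)**N.
    """
    p = A * eps ** beta  # lower bound on the per-sample success probability
    if not 0.0 < p <= 1.0:
        raise ValueError("A * eps**beta must lie in (0, 1]")
    # Any N > ln(1/delta) / p works, because (1 - p)**N <= exp(-p * N).
    return math.floor(math.log(1.0 / delta) / p) + 1


# Hypothetical example: density at least 1 near the maximum (A = 1, beta = 1).
n = samples_for_eps_optimal(A=1.0, beta=1.0, eps=0.1, delta=0.05)
```

The returned value is sufficient but not necessarily minimal, since it is derived from the exponential upper bound rather than from the exact expression.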

An algorithm for the Max-Bandit model samples an arm at each time step, based on the observed history so far (i.e., the previously selected arms and observed rewards). We require the algorithm to terminate after a random number $T$ of samples, which is finite with probability 1, and return a reward $V$ which is the maximal reward observed over the entire period. An algorithm is said to be $(\epsilon,\delta)$-correct if

$$P(V>\mu^*-\epsilon)>1-\delta.$$

The expected number of samples taken by the algorithm is the sample complexity, which we wish to minimize.

## 3 A Lower Bound

Before turning to our proposed algorithm, we provide a lower bound on the sample complexity of any $(\epsilon,\delta)$-correct algorithm. The bound holds under Assumption 1 when . The case of is more complicated to analyze, and it is still unclear whether our lower bound holds for this case.

The following result specifies the lower bound of this section.

###### Theorem 1.

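Suppose , and let and . Let $k^*$ denote some optimal arm, such that $\mu^*_{k^*}=\mu^*$. Then, under Assumption 1, for every $(\epsilon,\delta)$-correct algorithm, it holds that

$$E[T]\ge\sum_{k\in K\setminus\{k^*\}}\frac{1}{8A\left(\min(\epsilon_0,\epsilon+\mu^*-\mu^*_k)\right)^{\beta}}\ln\left(\frac{3}{16\delta}\right). \tag{1}$$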

This lower bound can be interpreted as summing over the minimal number of times that each arm, other than the optimal arm $k^*$, needs to be sampled. It is important to observe that if there are several optimal arms, only one of them is excluded from the summation. Indeed, the bound is most effective when there are several optimal (or near-optimal) arms, as the denominator of the summand is smaller, and hence the summand is larger, for such arms. This may appear surprising at first, as more sources of good rewards are available; however, when there is a single arm that is strictly better than the others it can be quickly singled out, while if many arms have nearly optimal rewards, more samples are "wasted" on determining which arm is best.
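To see how the bound behaves, the following sketch evaluates the right-hand side of Equation (1) for a given list of maximal rewards. The function name and the numeric values are illustrative assumptions; the conditions under which Theorem 1 holds are not checked by the code.

```python
import math


def max_bandit_lower_bound(max_rewards, A, beta, eps, eps0, delta):
    """Evaluate the right-hand side of Equation (1).

    max_rewards[k] plays the role of mu_k^*. Exactly one optimal arm is
    excluded from the sum, even if several arms attain the maximum.
    """
    mu_star = max(max_rewards)
    k_star = max_rewards.index(mu_star)  # the single excluded optimal arm
    log_term = math.log(3.0 / (16.0 * delta))
    total = 0.0
    for k, mu_k in enumerate(max_rewards):
        if k == k_star:
            continue
        width = min(eps0, eps + mu_star - mu_k)
        total += log_term / (8.0 * A * width ** beta)
    return total


# Hypothetical settings: three arms, all optimal versus one clear winner.
common = dict(A=1.0, beta=1.0, eps=0.1, eps0=0.5, delta=0.01)
bound_many_optimal = max_bandit_lower_bound([1.0, 1.0, 1.0], **common)
bound_single_best = max_bandit_lower_bound([1.0, 0.6, 0.6], **common)
```

With these values the bound is larger when all arms are optimal than when a single arm is clearly best, which matches the interpretation given above.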

The proof of Theorem 1 is provided in Appendix A and proceeds by showing that if an algorithm is $(\epsilon,\delta)$-correct and its sample complexity is lower than a certain threshold for some set of reward distributions, then this algorithm cannot be $(\epsilon,\delta)$-correct for some related reward distributions.

## 4 Algorithms

Here we provide two $(\epsilon,\delta)$-correct algorithms. The first algorithm samples, at each time step, the arm with the highest upper confidence bound on its maximal reward. The second algorithm is based on arm elimination.

### 4.1 Maximal Confidence Bound

The algorithm starts by sampling a certain number of times from each arm. Then, it repeatedly calculates an index for each arm which can be interpreted as a certain upper bound on the maximal reward of this arm, and samples once from the arm with the largest index. The algorithm terminates when the number of samples from the arm with the largest index is above a certain threshold. This idea is similar to that in the UCB1 Algorithm provided in [16].
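The description above can be sketched in code as follows. The listing of Algorithm 1 is not reproduced in this text, so the sketch is a reconstruction under stated assumptions rather than the authors' exact procedure: the index is taken to be the largest observed reward plus $\epsilon_{UB}(N)=\left(\frac{L-\ln(\delta)}{AN}\right)^{1/\beta}$, with $L=6\ln\left(|K|\left(\frac{-\ln(\delta)}{A\epsilon^{\beta}}+1\right)\right)$, as inferred from the proof of Theorem 2; the stopping rule is taken to be $\epsilon_{UB}<\epsilon$ for the leading arm; and the initial sample count `n0` is a free parameter. The function and variable names are hypothetical.

```python
import math
import random


def max_confidence_bound(arms, A, beta, eps, delta, n0=1, rng=None):
    """Sketch of an upper-confidence-bound rule for the Max-Bandit problem.

    arms: list of callables; arms[k](rng) draws one reward from arm k.
    n0:   initial samples per arm (must be >= 1; an assumption of this sketch).
    Returns (largest observed reward, total number of samples).
    """
    rng = rng or random.Random()
    K = len(arms)
    L = 6.0 * math.log(K * (-math.log(delta) / (A * eps ** beta) + 1.0))
    c = L - math.log(delta)

    def eps_ub(n):
        # Confidence width after n samples from one arm.
        return (c / (A * n)) ** (1.0 / beta)

    counts = [0] * K
    best = [-math.inf] * K

    def pull(k):
        best[k] = max(best[k], arms[k](rng))
        counts[k] += 1

    for k in range(K):
        for _ in range(n0):
            pull(k)

    while True:
        index = [best[k] + eps_ub(counts[k]) for k in range(K)]
        lead = max(range(K), key=index.__getitem__)
        # Stop once the leading arm has been sampled often enough that its
        # confidence width drops below the required accuracy.
        if eps_ub(counts[lead]) < eps:
            return max(best), sum(counts)
        pull(lead)
```

A possible use, with uniform rewards on $[0,\mu^*_k]$ and $\mu^*_k\le 1$ (so that Assumption 1 holds with $A=1$, $\beta=1$), is `max_confidence_bound([lambda r: r.uniform(0.0, 1.0), lambda r: r.uniform(0.0, 0.5)], 1.0, 1.0, 0.1, 0.1)`.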

###### Theorem 2.

Under Assumption 1, for , Algorithm 1 is $(\epsilon,\delta)$-correct with a sample complexity of

$$E[T]\le\sum_{k\in K}\frac{L-\ln(\delta)}{A\left(\max(\epsilon,\mu^*-\mu^*_k)\right)^{\beta}}+|K|N_0,$$

where $L$ and $N_0$ are as defined in the algorithm.

In the following corollary we present the ratio between the lower bound presented in Theorem 1 to the upper bound in Theorem 2.

###### Corollary 1.

If there are more than one arm for which , then the upper bound on the sample complexity is of the same order as the lower bound in Theorem 1, up to a logarithmic factor in .

###### Proof.

For every it follows that

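$$\Theta^1_k\triangleq\frac{1+2^{\beta}}{\left(\min(\epsilon_0,\epsilon+\mu^*-\mu^*_k)\right)^{\beta}}\ge\frac{2^{\beta}}{(\epsilon+\mu^*-\mu^*_k)^{\beta}}+\frac{1}{\epsilon_0^{\beta}}\ge\frac{1}{\left(\max(\epsilon,\mu^*-\mu^*_k)\right)^{\beta}}+\frac{1}{\epsilon_0^{\beta}}\triangleq\Theta^2_k,$$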

and for every two arms and for which and it is obtained that

$$\Theta^1_{k'}\ge 2^{-\beta}\Theta^1_{k^*}. \tag{2}$$

In addition, the lower bound is of the same order as

$$-\ln(\delta)\sum_{k\in K\setminus\{k^*\}}\Theta^1_k, \tag{3}$$

and the upper bound is of the same order as

$$\left(L-\ln(\delta)\right)\sum_{k\in K}\Theta^2_k.$$

Therefore, the upper bound in Theorem 2 is of the same order of the lower bound in Theorem 1 up to an order of , which is logarithmic in .

To establish Theorem 2, we first bound the probability of the event under which the upper bound of the best arm is below the maximal reward. Then, we bound the largest number of samples after which the algorithm terminates under the assumption that the upper bound of the best arm is above the maximal reward.

###### Proof (Theorem 2).

We denote the time step of the algorithm by $t$, and the value of the counter $C(k)$ at time step $t$ by $C_t(k)$. Recall that $T$ stands for the random final time step. By the condition in step 5 of the algorithm, for every arm $k\in K$, it follows that

$$C_T(k)\le\left\lfloor\frac{L-\ln(\delta)}{A\epsilon^{\beta}}\right\rfloor+1. \tag{4}$$

Note that by the fact that for it follows that , and by the fact that for it follows that it is obtained that

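$$L'\triangleq|K|\left(\frac{-\ln(\delta)}{A\epsilon^{\beta}}+1\right)>6\ln\left(|K|\left(\frac{-\ln(\delta)}{A\epsilon^{\beta}}+1\right)\right)=L,$$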

for . So, by the fact that , for it follows that

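$$T\le|K|\left(\frac{L-\ln(\delta)}{A\epsilon^{\beta}}+1\right)<|K|\left(\frac{L'-\ln(\delta)}{A\epsilon^{\beta}}+1\right)\le L'^{2}=e^{L/3}. \tag{5}$$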

Now, we begin by proving the $(\epsilon,\delta)$-correctness property of the algorithm. Recall that for every arm $k$ the rewards are distributed according to the CDF $F_k$. Let us assume w.l.o.g. that $\mu^*_1=\mu^*$. Then, for and by the fact that for every , for it follows that

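$$P\left(V^1_N\le\mu^*-\epsilon_{UB}(N)\right)=\left(F_1(\mu^*-\epsilon_{UB}(N))\right)^{N}\le\left(1-A(\epsilon_{UB}(N))^{\beta}\right)^{N}\le\delta e^{-L}, \tag{6}$$

where $V^1_N$ is the largest reward observed from arm $1$ after this arm has been sampled $N$ times. Hence, at every time step $t$, by the definition of $Y^1_{C_t(1)}$ and Equations (5) and (6), by applying the union bound, it follows that

$$P\left(Y^1_{C_t(1)}\le\mu^*\right)\le P\left(V^1_{C_t(1)}\le\mu^*-\epsilon_{UB}(C_t(1))\right)\le\sum_{N=1}^{\exp(L/3)}P\left(V^1_N\le\mu^*-\epsilon_{UB}(N)\right)\le\delta e^{-2L/3}. \tag{7}$$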

Since by the condition in step 5, it is obtained that when the algorithm stops

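$$V^{k^*}_{C_t(k^*)}>Y^{k^*}_{C_t(k^*)}-\epsilon,$$

and by the fact that for every time step

$$Y^{k^*}_{C_t(k^*)}\ge Y^1_{C_t(1)},$$

it follows by Equation (7) that

$$P\left(V^{k^*}_{C_t(k^*)}\le\mu^*-\epsilon\right)\le P\left(Y^1_{C_t(1)}\le\mu^*\right)\le\delta e^{-2L/3}.$$

Therefore, it follows that the algorithm returns a reward greater than $\mu^*-\epsilon$ with a probability larger than $1-\delta$. So, it is $(\epsilon,\delta)$-correct.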

For proving the bound on the expected sample complexity of the algorithm we define the following sets:

$$M(\epsilon)=\{l\in K\mid\mu^*-\mu^*_l<\epsilon\},\qquad N(\epsilon)=\{l\in K\mid\mu^*-\mu^*_l\ge\epsilon\}.$$

As before, we assume w.l.o.g. that $\mu^*_1=\mu^*$. For the case in which

$$E_1\triangleq\bigcap_{1\le t\le T}\left\{Y^1_{C_t(1)}\ge\mu^*\right\}$$

occurs, since for every , and every time step, it follows that the necessary condition for sampling from arm ,

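$$Y^k_{C_t(k)}\ge Y^1_{C_t(1)},$$

occurs only when the event

$$E_2(t)\triangleq\left\{\mu^*_k+\epsilon_{UB}(C_t(k))\ge\mu^*\right\}$$

occurs. But

$$E_2(t)\subseteq\left\{C_t(k)\le\frac{L-\ln(\delta)}{A(\mu^*-\mu^*_k)^{\beta}}\right\}.$$

Therefore, it is obtained that

$$C_T(k)\le\left\lfloor\frac{L-\ln(\delta)}{A(\mu^*-\mu^*_k)^{\beta}}\right\rfloor+1. \tag{8}$$

By using the bound in Equation (4) for the arms in the set $M(\epsilon)$, the bound in Equation (8) for the arms in the set $N(\epsilon)$ and the bound in Equation (5), it is obtained that

$$E[T]\le\left(1-P(E_1)\right)e^{L/3}+P(E_1)\Phi(\epsilon), \tag{9}$$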

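where $\Phi(\epsilon)$ collects the per-arm bounds of Equations (4) and (8):

$$\Phi(\epsilon)\triangleq\sum_{k\in M(\epsilon)}\left(\left\lfloor\frac{L-\ln(\delta)}{A\epsilon^{\beta}}\right\rfloor+1\right)+\sum_{k\in N(\epsilon)}\left(\left\lfloor\frac{L-\ln(\delta)}{A(\mu^*-\mu^*_k)^{\beta}}\right\rfloor+1\right).$$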

In addition, by Equation (7), the bound in Equation (5) and by applying the union bound, it follows that

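$$P(E_1)\ge 1-\sum_{t=1}^{T}P\left(Y^1_{C_t(1)}<\mu^*\right)\ge 1-\delta e^{-2L/3}e^{L/3}=1-\delta e^{-L/3}.$$

So,

$$1-P(E_1)\le\delta e^{-L/3}. \tag{10}$$

Furthermore, by the definitions of the sets $M(\epsilon)$ and $N(\epsilon)$, it can be obtained that

$$\Phi(\epsilon)\le\sum_{k\in K}\left(\left\lfloor\frac{L-\ln(\delta)}{A\left(\max(\epsilon,\mu^*-\mu^*_k)\right)^{\beta}}\right\rfloor+1\right). \tag{11}$$

Therefore, by Equations (9), (10) and (11) the bound on the sample complexity is obtained.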

### 4.2 Maximal Eliminator

The algorithm starts by sampling a certain number of times from each arm. Then, it repeatedly calculates an index for each arm which can be interpreted as a certain upper bound on the maximal reward of this arm, and eliminates arms for which that index is below the maximal sampled reward so far. Then it samples only from the retained arms (those arms which have not been eliminated) a number of times that is doubled at each sampling phase. This idea is similar to that in the Median Elimination Algorithm provided in [2].

We do not provide performance analysis for Algorithm 2. However, since the number of times at which the confidence bounds should be correct (times at which the algorithm eliminates arms) is only logarithmic in the number of total samples, we have (where is defined in Algorithm 1 and the factor arises because of the doubling). Therefore, we believe that the upper bound on the sample complexity of Algorithm 2 would be that of Algorithm 1 multiplied by . So, the upper bound would be of the same order of the lower bound in Theorem 1 up to double logarithmic terms.

## 5 Comparison with The Unified-Arm Model

In this section, we analyze the improvement in the sample complexity obtained by utilizing the multi-arm property (the ability to choose from which arm to sample at each time step) compared to a model in which all the arms are merged into a single unified arm, so that the sample is effectively obtained from a random arm. In the unified-arm model, when the agent samples from this unified arm, one of the original arms is chosen uniformly at random and a reward is sampled from that arm. We denote the CDF of the unified arm as $F(\mu)$, with $F(\mu)=\frac{1}{|K|}\sum_{k\in K}F_k(\mu)$. By Assumption 1, $1-F(\mu^*-\epsilon)\ge\frac{A\epsilon^{\beta}}{|K|}$, and the corresponding maximal reward is $\mu^*$.

In the remainder of this section, we provide a lower bound on the sample complexity and an $(\epsilon,\delta)$-correct algorithm that attains the same order of this bound for the unified-arm model. (Note that the lower bound in Theorem 1 is meaningless for $|K|=1$.) Then, we discuss which approach (multi-arm or unified-arm) is better for different model parameters, and provide examples that illustrate these cases.

### 5.1 Lower Bound

The following Theorem provides a lower bound on the sample complexity for the unified-arm model.

###### Theorem 3.

Suppose , and let , . Then, under Assumption 1, for every $(\epsilon,\delta)$-correct algorithm, it holds that

$$E[T]\ge\frac{|K|}{4A\epsilon^{\beta}}\ln\left(\frac{3}{5\delta}\right). \tag{12}$$

The proof is provided in Appendix B and is based on an idea similar to that of the proof of Theorem 1.

### 5.2 Algorithm

In Algorithm 3 a certain number of rewards is sampled, and the algorithm chooses the best one among them. In the following Theorem we provide a bound on the sample complexity achieved by Algorithm 3.

###### Theorem 4.

Under Assumption 1, Algorithm 3 is $(\epsilon,\delta)$-correct, with a sample complexity bound of

$$E[T]\le\frac{|K|\ln(\delta^{-1})}{A\epsilon^{\beta}}+2.$$

The proof is provided in Appendix C. Note that the upper bound on the sample complexity is of the same order as the lower bound in Theorem 3.
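A minimal sketch of the unified-arm procedure follows. The listing of Algorithm 3 is not reproduced in this text, so the fixed sample size used here, $N=\left\lfloor\frac{|K|\ln(\delta^{-1})}{A\epsilon^{\beta}}\right\rfloor+1$, is an assumption chosen to satisfy Equation (13) and to stay within the bound of Theorem 4. The function names are hypothetical.

```python
import math
import random


def unified_arm_sample_size(num_arms, A, beta, eps, delta):
    """A fixed N with (1 - A * eps**beta / num_arms)**N < delta."""
    p = A * eps ** beta / num_arms  # per-sample chance of an eps-optimal reward
    return math.floor(math.log(1.0 / delta) / p) + 1


def unified_arm_max(arms, A, beta, eps, delta, rng=None):
    """Draw N rewards from uniformly chosen arms and return the largest.

    arms: list of callables; arms[k](rng) draws one reward from arm k.
    Returns (largest observed reward, number of samples N).
    """
    rng = rng or random.Random()
    n = unified_arm_sample_size(len(arms), A, beta, eps, delta)
    best = -math.inf
    for _ in range(n):
        arm = rng.choice(arms)  # the agent cannot tell the arms apart
        best = max(best, arm(rng))
    return best, n
```

The number of samples is deterministic in this sketch, which is why the sample complexity bound follows directly from the choice of $N$.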

### 5.3 Comparison and Examples

To find when the multi-arm algorithm is helpful, we can compare the upper bound on the sample complexity provided in Theorem 2 for Algorithm 1 (multi-arm case) with the lower bound for the unified-arm model in Theorem 3.

Case 1: Suppose first that arm 1 is best: , while all the other arms fall short significantly compared to the required accuracy : , for .
In this case , for . Hence the upper bound on sample complexity of Algorithm 1 (multi-arm case) will be smaller than the lower bound for the unified-arm model in Theorem 3. We now provide an example which illustrate case 1 numerically.

###### Example 1 (Case 1).

Let , , , and . For and the sample complexity attained by Algorithm 1 is . The lower bound for the unified-arm model is . The sample complexity attained by Algorithm 3 (for the unified-arm model model) is .

Case 2: Consider next the opposite case, where there are many optimal arms and few that are worse: say , while for all .
In this case , for . Hence, since there is a logarithmic-in- multiplicative factor in the upper bound on the sample complexity of Algorithm 1 (multi-arm case), this bound will be larger than the lower bound for the unified-arm model in Theorem 3. The following example illustrate case 2 numerically.

###### Example 2 (Case 2).

Let , , , and remain the same as in Example 1, and let and . The sample complexity of Algorithm 1 is , which is larger than the sample complexity of Algorithm 3 which is .

As shown in Example 2, in some cases the bound on the sample complexity of Algorithm 1 (multi-arm) is larger than that of Algorithm 3 (unified-arm). By comparing the upper bounds of these algorithms, we believe that the logarithmic factor in the bound of Algorithm 1 may not be required.

As observed by comparing the lower and upper bounds for the multi-arm and the unified-arm models, the unified-arm algorithm provides a tighter upper bound (compared to the matching lower bound). Therefore, when the benefit obtained by the multi-arm model is small (i.e., when there are many good arms), applying the multi-arm algorithm turns out to be a loss rather than a gain.
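The two cases can also be explored numerically by evaluating the bounds of Theorems 2, 3 and 4 directly. The sketch below does this for hypothetical parameter values; they are not the values used in Examples 1 and 2, and the conditions of the theorems are not checked. The function names and the choice $N_0=1$ are assumptions made for illustration.

```python
import math


def alg1_upper_bound(max_rewards, A, beta, eps, delta, n0=1):
    """Upper bound of Theorem 2, with L as defined in the proof of Theorem 2."""
    K = len(max_rewards)
    mu_star = max(max_rewards)
    L = 6.0 * math.log(K * (-math.log(delta) / (A * eps ** beta) + 1.0))
    c = L - math.log(delta)
    per_arm = (c / (A * max(eps, mu_star - m) ** beta) for m in max_rewards)
    return sum(per_arm) + K * n0


def unified_lower_bound(K, A, beta, eps, delta):
    """Lower bound of Theorem 3 for the unified-arm model."""
    return K / (4.0 * A * eps ** beta) * math.log(3.0 / (5.0 * delta))


def unified_upper_bound(K, A, beta, eps, delta):
    """Upper bound of Theorem 4 for the unified-arm model."""
    return K * math.log(1.0 / delta) / (A * eps ** beta) + 2.0


# Hypothetical parameters (not taken from the paper's examples).
params = dict(A=1.0, beta=1.0, eps=0.001, delta=0.01)
case1 = [1.0] + [0.5] * 999   # one best arm, all others far from optimal
case2 = [1.0] * 1000          # all arms optimal

multi_case1 = alg1_upper_bound(case1, **params)
multi_case2 = alg1_upper_bound(case2, **params)
unified_lo = unified_lower_bound(1000, **params)
unified_hi = unified_upper_bound(1000, **params)
```

For these values, `multi_case1` falls below `unified_lo`, while `multi_case2` exceeds `unified_hi`, in line with the qualitative discussion of Case 1 and Case 2.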

## 6 Conclusion

In this paper we have developed lower and upper bounds on the sample complexity for the Max $K$-Armed Bandit problem, which are of essentially the same order up to a logarithmic term.

These results were compared to the unified-arm model, where the learning algorithm effectively unifies the different arms into one. While the multi-arm algorithm usually performs better, in some cases, in particular when most arms are optimal, the unified arm algorithm may provide better performance. It still remains to be shown whether an algorithm that provides the performance benefits of both approaches may be devised.

Another direction for future work concerns the relaxation or generalization of our Assumption 1, which requires a known lower bound on the tail distribution of the rewards.

## References

• [1] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
• [2] E. Even-Dar, S. Mannor, and Y. Mansour, “PAC bounds for multi-armed bandit and markov decision processes,” in Computational Learning Theory, pp. 255–270, 2002.
• [3] J.-Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in COLT-23th Conference on Learning Theory-2010, pp. 13–p, 2010.
• [4] V. Gabillon, M. Ghavamzadeh, and A. Lazaric, “Best arm identification: A unified approach to fixed budget and fixed confidence,” in Advances in Neural Information Processing Systems 25, pp. 3212–3220, Curran Associates, Inc., 2012.
• [5] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” Journal of Machine Learning Research, vol. 5, pp. 623–648, 2004.
• [6] V. A. Cicirello and S. F. Smith, “The max k-armed bandit: A new model of exploration applied to search heuristic selection,” in Proceedings of the National Conference on Artificial Intelligence, vol. 20, p. 1355, 2005.
• [7] M. J. Streeter and S. F. Smith, “An asymptotically optimal algorithm for the max k-armed bandit problem,” in Proceedings of the National Conference on Artificial Intelligence, vol. 21, p. 135, 2006.
• [8] M. J. Streeter and S. F. Smith, “A simple distribution-free approach to the max k-armed bandit problem,” in Principles and Practice of Constraint Programming-CP 2006, pp. 560–574, Springer, 2006.
• [9] A. Carpentier and M. Valko, “Extreme bandits,” in Advances in Neural Information Processing Systems 27, pp. 1089–1097, Curran Associates, Inc., 2014.
• [10] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp, “Bandit problems with infinitely many arms,” The Annals of Statistics, pp. 2103–2116, 1997.
• [11] O. Teytaud, S. Gelly, and M. Sebag, “Anytime many-armed bandits,” in CAP, (Grenoble, France), 2007.
• [12] Y. Wang, J.-Y. Audibert, and R. Munos, “Infinitely many-armed bandits,” Advances in Neural Information Processing Systems, vol. 8, pp. 1–8, 2008.
• [13] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal, “Mortal multi-armed bandits,” in Advances in Neural Information Processing Systems 21, pp. 273–280, Curran Associates, Inc., 2009.
• [14] T. Bonald and A. Proutiere, “Two-target algorithms for infinite-armed bandits with Bernoulli rewards,” in Advances in Neural Information Processing Systems 26, pp. 2184–2192, Curran Associates, Inc., 2013.
• [15] Y. David and N. Shimkin, “Infinitely many-armed bandits with unknown value distribution,” in Machine Learning and Knowledge Discovery in Databases, pp. 307–322, Springer, 2014.
• [16] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, pp. 235–256, 2002.

## 7 Appendix A

###### Proof (Theorem 1).

Let for every . Then, we define the following set of hypotheses :

$$H_0:\ f^{H_0}_k(\mu)=f_k(\mu)\quad\forall k\in K,$$

and, for every ,

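$$\begin{aligned}
H_k:\ & f^{H_k}_l(\mu)=f_l(\mu),\quad l\ne k,\\
\text{if } \epsilon_0\le\mu^*-\mu^*_k+\epsilon:\ & f^{H_k}_k(\mu)=\gamma^1_k f_k(\mu)\,1_{(-\infty,\mu^*_k)}(\mu)+A\beta(\mu^*+\epsilon-\mu)^{\beta-1}\,1_{(\mu^*+\epsilon-\epsilon_0,\,\mu^*+\epsilon]}(\mu),\\
\text{if } \epsilon_0>\mu^*-\mu^*_k+\epsilon:\ & f^{H_k}_k(\mu)=\gamma^2_k f_k(\mu)\,1_{(-\infty,\bar{\mu}_k)}(\mu)+\gamma^3_k f_k(\mu)\,1_{(\mu=\bar{\mu}_k)}+f_k(\mu)\,1_{(\bar{\mu}_k,\,\mu^*_k]}(\mu)\\
& \qquad\qquad +A\beta(\mu^*+\epsilon-\mu)^{\beta-1}\,1_{(\mu^*_k,\,\mu^*+\epsilon]}(\mu),
\end{aligned}$$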

where

is the probability density function of arm

, stand for the indicator function of the set , , and is chosen such that .

Note that since for every it follows that for , Assumption 1 holds for hypotheses .

To further bound and , note that since ,

$$1-2A(\mu^*-\mu^*_k+\epsilon)^{\beta}\le\gamma^2_k\le 1.$$

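Let $P_k$ stand for the mass of an atom in the probability function of arm $k$ at the point $\bar{\mu}_k$ (if there is one); then we note that

$$1=\int_{-\infty}^{\infty}f^{H_k}_k(\mu)\,d\mu=\gamma^2_k\left(F_k(\bar{\mu}_k)-P_k\right)+\gamma^3_k P_k+1-F_k(\bar{\mu}_k)+A(\mu^*-\mu^*_k+\epsilon)^{\beta}\triangleq\Phi(\gamma^2_k,\gamma^3_k),$$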

but, since , for it follows that . So, since increases in it is obtained that . Finally, it follows that in the case of ,

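$$\gamma_k\le\gamma^1_k,$$

and in the case of $\epsilon_0>\mu^*-\mu^*_k+\epsilon$,

$$\gamma_k\le\min(\gamma^2_k,\gamma^3_k),$$

where

$$\gamma_k\triangleq 1-2A\left(\min(\epsilon_0,\mu^*-\mu^*_k+\epsilon)\right)^{\beta}.$$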

If hypothesis () is true, then for all , hence the algorithm should provide a reward from arm with probability larger than . We use and to denote the expectation and probability, respectively, under the algorithm being considered and hypothesis . Further, for every let

$$t_k=\frac{1}{4(1-\gamma_k)}\ln\left(\frac{3}{16\delta}\right),$$

and let $T_k$ stand for the number of samples from arm $k$.

Suppose now that our algorithm is -correct under , and that for some . We will show that this algorithm cannot be -correct under hypothesis . Therefore, an -correct algorithm must have for all .

Define the following events:

• . It easily follows from that if , then .

• Let stand for the event under which the chosen arm at termination is , and for its complement. Since can hold for one arm at most, it follows that for every for some .

• Let to be the event under which all the samples obtained from arm are on the interval . Clearly, .

Define now the intersection event . We have just shown that for every it holds that , and , from which it follows that for . Further, observe that for every history of samples for which the event holds, it holds that . We therefore obtain the following inequalities,

$$P_{H_k}(B^C_k)>\frac{1}{4}\gamma_k^{-4t_k}\ge\frac{1}{4}e^{-\ln\frac{3}{16\delta}}\ge\delta,$$

where in the last inequality we used the facts that .

We found that if an algorithm is -correct under hypothesis and for some , then, under hypothesis this algorithm returns a sample that is smaller by at least than the maximal possible reward with probability of or more, hence the algorithm is not -correct. Therefore, any -correct algorithm must satisfy for all of arms except possibly for one (namely, for the one for which ). In addition , where is the optimal arm (namely, ). Hence the lower bound is obtained.

## 8 Appendix B

###### Proof (Theorem 3).

First, we define the following hypotheses:

$$H_0:\ f^{H_0}(\mu)=f(\mu),$$

and

$$H_1:\ f^{H_1}(\mu)=\gamma f(\mu)+\frac{A}{|K|}\beta(\mu^*+\epsilon-\mu)^{\beta-1}\,1_{(\mu^*,\,\mu^*+\epsilon]}(\mu),$$

where, as in the proof of Theorem 1, $f$ is the probability density function of the unified arm, $1_S$ stands for the indicator function of the set $S$, and $\gamma$ is chosen such that $\int_{-\infty}^{\infty}f^{H_1}(\mu)\,d\mu=1$.

Note that since for every it follows that for , Assumption 1 holds for hypothesis .

To further bound , note that

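$$1=\int_{-\infty}^{\infty}f^{H_1}(\mu)\,d\mu=\gamma+\frac{A\epsilon^{\beta}}{|K|}.$$

Therefore,

$$\gamma=1-\frac{A\epsilon^{\beta}}{|K|}.$$

If hypothesis $H_1$ is true, the algorithm should provide a reward greater than $\mu^*$. We use $E_l$ and $P_l$ (where $l\in\{0,1\}$) to denote the expectation and probability respectively, under the algorithm being considered and under hypothesis $H_l$. Now, let

$$t=\frac{1}{4(1-\gamma)}\ln\left(\frac{3}{5\delta}\right),$$

and recall that $T$ stands for the total number of samples from the arm.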

Now, we assume we run an algorithm which is $(\epsilon,\delta)$-correct under $H_0$ and that $E_0[T]\le t$ for this algorithm. We will show that this algorithm cannot be $(\epsilon,\delta)$-correct under hypothesis $H_1$. Therefore, an $(\epsilon,\delta)$-correct algorithm must have $E_0[T]>t$.

Define the following events:

• . By the same consideration as in the proof of Theorem 1 (for the events ), it follows that if , then .

• Let stand for the event under which the chosen sample is smaller or equal to , and for its complementary. Clearly, .

• We define the event to be the event under which all the samples obtained from the unified arm are on the interval . Clearly, .

Define now the intersection event . We have shown that , and , from which it is obtained that . In addition, since for every history of samples, for which the event holds, it is obtained that , we have the following,

$$P_1(B)\ge P_1(S)=E_0\left[\frac{dP_1}{dP_0}I(S)\right]\ge\gamma^{-4t}P_0(I(S))\ge\frac{3}{4}\gamma^{-4t}\ge\frac{3}{4}e^{-\ln\frac{3}{5\delta}}\ge\delta,$$

where in the last inequality we used the facts that .

We found that if an algorithm is -correct under hypothesis and , then, under hypothesis this algorithm returns a sample that is smaller by at least than the maximal possible reward with a probability of or more, hence the algorithm is not -correct. Therefore, any -correct algorithm, must satisfy . Hence the lower bound is obtained.

## 9 Appendix C

###### Proof (Theorem 4).

Since sampling from the unified arm consists of choosing one arm out of the arms (with equal probability), and then, sampling from this arm, it follows that, . Also, we note that for every . Therefore, for ,

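$$P\left(V^1_N<\mu^*-\epsilon\right)=\left(F(\mu^*-\epsilon)\right)^{N}\le\left(1-\frac{A\epsilon^{\beta}}{|K|}\right)^{N}<\delta, \tag{13}$$

where $V^1_N$ is the largest reward observed among the first $N$ samples. Hence, the algorithm is $(\epsilon,\delta)$-correct. The bound on the sample complexity is immediate from the definition of the algorithm.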