1 Introduction
In the classic stochastic multi-armed bandit (MAB) problem the learning agent faces a set of stochastic arms, and wishes to maximize its cumulative reward (in the regret formulation), or find the arm with the highest mean reward (the pure exploration problem). This model has been studied extensively in the statistical and learning literature; see for example [1] for a comprehensive survey.
We consider a variant of the MAB problem called the Max Armed Bandit problem (MaxBandit for short). In this variant, the objective is to obtain a sample with the highest possible reward (namely, the highest value in the support of the probability distribution of any arm). More precisely, considering the PAC setting, the objective is to return an correct sample, namely a sample whose reward value is close to the overall best possible reward with a probability larger than . In addition, we wish to minimize the sample complexity, namely the expected number of samples observed by the learning algorithm before it terminates.
For the classical MAB problem, algorithms that find the best arm (in terms of its expected reward) in the PAC sense were presented in [2, 3, 4], and lower bounds on the sample complexity were presented in [5] and [3]. The essential difference with respect to this work is in the objective, which is to find an correct sample in our case. The scenario considered in the MaxBandit model is relevant when a single best item needs to be selected from among several (large) clustered sets of items, with each set represented as a single arm. These sets may represent parts that come from different manufacturers or are produced by different processes, job candidates that are referred by different employment agencies, the best match to certain genetic characteristics in different populations, or the best channel among different frequency bands in a cognitive radio wireless network.
The MaxBandit problem was apparently first proposed in [6]. For reward distribution functions in a specific family, an algorithm with an upper bound on the sample complexity that increases as was provided in [7]. For the case of discrete rewards, another algorithm was presented in [8], without performance analysis. Later, a similar model in which the objective is to maximize the expected value of the largest sampled reward for a given number of samples () was studied in [9]. In that work the attained best reward is compared with the expected reward obtained by an oracle that samples the best arm time. An algorithm is suggested and shown to secure an upper bound of order on that difference, where is determined by the properties of the distribution functions and decreases as they move further away from a specific family of functions.
Our basic assumption in the present paper is that a known lower bound is available on the tail distributions, namely on the probability that the reward of each given arm will be close to its maximum. A special case is when the probability densities near the maximum are larger than a given value, but we consider more general function classes. Under that assumption, we provide an algorithm for which the sample complexity increases as at most . This provides an improvement by a factor of over the result of [7], which was obtained for a more specific model. To compare with the result in [9], we observe that with a choice of in our algorithm, we obtain that the expected shortfall of the largest sample with respect to the maximal reward possible is at most of order (as compared to with ). Furthermore, we provide a lower bound on the sample complexity of every correct algorithm, which holds when several arms possess maximal rewards that are close to that of the best arm. This lower bound is shown to coincide, up to a logarithmic term, with the upper bound derived for the proposed algorithm.
A basic feature of the MaxBandit problem (and the associated algorithms) is the goal of quickly focusing on the best arm (in terms of maximal reward), and sampling from that arm as much as possible. It should be of interest to compare the obtained results with the alternative approach, which ignores the distinction between arms, and simply draws a sample from a random arm (say, with uniform probabilities) at each round. This can be interpreted as mixing the items associated with each arm before sampling; we accordingly refer to this variant as the unified-arm problem. This problem actually coincides with the so-called infinitely-many armed bandit model studied in [10, 11, 12, 13, 14], for the specific case of deterministic arms studied in [15]. The comparison does not yield a conclusive answer as to whether the multi-arm approach or the unified-arm approach should be applied. However, as a rule of thumb, when the maximal possible rewards of many arms are far from the optimal, the multi-arm approach has better performance.
The paper proceeds as follows. In the next section we present our model. In Section 3 we provide a lower bound on the sample complexity of every correct algorithm. In Section 4 we present two correct algorithms, and we provide an upper bound on the sample complexity of one of them. The first algorithm is simple and its bound has the same order as the lower bound up to a logarithmic term in (where stands for the number of arms); the second algorithm is more complicated and we believe that its bound is larger by up to a double logarithmic term in than the lower bound. In Section 5, we consider for comparison the unified-arm case. In Section 6 we close the paper with some concluding remarks. Certain proofs are deferred to the Appendix due to space limitations.
2 Model Definition
We consider a finite set of arms, denoted by . At each stage the learning agent chooses an arm , and a real valued reward is obtained from that arm. The rewards obtained from each arm are independent and identically distributed, with a distribution function (CDF) , . We denote the maximal possible reward of each arm by , assumed finite, and the maximal reward among all arms by .
Throughout the paper, we shall make the following assumption.
Assumption 1.
There exist known constants , and such that, for every and , it holds that
where stands for a random variable with distribution . The bound in the above assumption can also be expressed as . This condition requires each reward distribution to have a certain mass near its maximal reward. Note that the specific case of is satisfied if the densities are lower bounded by a constant . Values of accommodate leaner tails.
The upper bound on the CDF ensures that for each arm, an optimal reward can be observed by a finite number of samples. The bound in the above assumption is similar to those assumed in [12] and [15].
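The finite-sample remark above can be illustrated with a small calculation. Suppose, purely for illustration, that a single draw from an arm lands within the desired distance of that arm's maximal reward with probability at least p. Then n independent draws all miss with probability at most (1 - p)^n, and the sketch below returns the smallest n for which this is at most a target failure probability. The function name and the parameters `p` and `delta` are hypothetical and are not taken from the paper.

```python
import math


def samples_to_hit(p: float, delta: float) -> int:
    """Smallest n with (1 - p)**n <= delta, for a per-sample hit probability p."""
    if not (0.0 < p <= 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("need 0 < p <= 1 and 0 < delta < 1")
    if p == 1.0:
        return 1  # a single draw always hits
    # (1 - p)**n <= delta  <=>  n >= ln(delta) / ln(1 - p)
    return math.ceil(math.log(delta) / math.log(1.0 - p))


# Illustrative values only: 10% hit probability per draw, 5% allowed failure.
n_needed = samples_to_hit(0.1, 0.05)
```

The required n grows only logarithmically in 1/delta but inversely in p, which is why a lower bound on the tail mass is what keeps the sample requirement finite.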
An algorithm for the MaxBandit model samples an arm at each time step, based on the observed history so far (i.e., the previously selected arms and observed rewards). We require the algorithm to terminate after a random number of samples, which is finite with probability 1, and return a reward which is the maximal reward observed over the entire period. An algorithm is said to be correct if
The expected number of samples taken by the algorithm is the sample complexity, which we wish to minimize.
3 A Lower Bound
Before turning to our proposed algorithm, we provide a lower bound on the sample complexity of any correct algorithm. The bound holds under Assumption 1 when . The case of is more complicated to analyze, and it is still unclear whether our lower bound holds for this case.
The following result specifies the lower bound of this section.
Theorem 1.
Suppose , and let and . Let denote some optimal arm, such that . Then, under Assumption 1, for every correct algorithm, it holds that
(1) 
This lower bound can be interpreted as summing over the minimal number of times that each arm, other than the optimal arm , needs to be sampled. It is important to observe that if there are several optimal arms, only one of them is excluded from the summation. Indeed, the bound is most effective when there are several optimal (or near-optimal) arms, as the denominator of the summand is larger for such arms. This may appear surprising at first, as more sources of good rewards are available; however, when there is a single arm that is strictly better than the others it can be quickly singled out, while if many arms have nearly optimal rewards, more samples are "wasted" on determining which arm is best.
The proof of Theorem 1 is provided in Appendix A and proceeds by showing that if an algorithm is correct and its sample complexity is lower than a certain threshold for some set of reward distributions, then this algorithm cannot be correct for some related reward distributions.
4 Algorithms
Here we provide two correct algorithms. The first algorithm is based on sampling, at each time step, the arm which has the highest upper confidence bound on its maximal reward, and the second algorithm is based on arm elimination.
4.1 Maximal Confidence Bound
The algorithm starts by sampling a certain number of times from each arm. Then, it repeatedly calculates an index for each arm which can be interpreted as a certain upper bound on the maximal reward of this arm, and samples once from the arm with the largest index. The algorithm terminates when the number of samples from the arm with the largest index is above a certain threshold. This idea is similar to that in the UCB1 Algorithm provided in [16].
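The loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's Algorithm 1, whose exact index and stopping threshold are not reproduced here. The sketch assumes a tail bound of the form P(X >= mu_k - w) >= A * w**beta for each arm, so that after n samples the largest observed reward is within w(n) = (log_term / (A * n))**(1 / beta) of the arm's maximum except with probability about exp(-log_term). The names `max_confidence_bound`, `log_term` and `n_init` are hypothetical, and `log_term` is treated as a caller-supplied constant.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def max_confidence_bound(
    arms: Sequence[Callable[[], float]],
    A: float,
    beta: float,
    eps: float,
    log_term: float,
    n_init: int = 1,
) -> Tuple[float, int]:
    """Return (largest reward seen, total number of samples). Needs n_init >= 1."""
    K = len(arms)
    counts = [0] * K
    best = [-math.inf] * K

    def pull(k: int) -> None:
        counts[k] += 1
        best[k] = max(best[k], arms[k]())

    def width(n: int) -> float:
        # Assumed confidence width: solves exp(-n * A * w**beta) = exp(-log_term).
        return (log_term / (A * n)) ** (1.0 / beta)

    for k in range(K):  # initial sampling phase
        for _ in range(n_init):
            pull(k)

    threshold = log_term / (A * eps ** beta)  # count at which width <= eps
    while True:
        index = [best[k] + width(counts[k]) for k in range(K)]
        k_top = max(range(K), key=index.__getitem__)
        if counts[k_top] >= threshold:  # top-index arm sampled often enough
            return max(best), sum(counts)
        pull(k_top)


# Illustrative run: uniform rewards on [0, m], which satisfy the assumed
# tail bound with A = 1 and beta = 1 whenever m <= 1.
rng = random.Random(0)
demo_arms = [lambda m=m: rng.uniform(0.0, m) for m in (0.5, 0.8, 1.0)]
reward, total = max_confidence_bound(demo_arms, A=1.0, beta=1.0, eps=0.1, log_term=5.0)
```

Because an arm is sampled only while its count is below the threshold, the total number of samples in this sketch is at most roughly the number of arms times the threshold, and arms whose index drops below the leader's largest observed reward stop being sampled.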
Theorem 2.
In the following corollary we present the relation between the lower bound presented in Theorem 1 and the upper bound in Theorem 2.
Corollary 1.
If there is more than one arm for which , then the upper bound on the sample complexity is of the same order as the lower bound in Theorem 1, up to a logarithmic factor in .
Proof.
For every it follows that
and for every two arms and for which and it is obtained that
(2) 
In addition, the lower bound is of the same order as
(3) 
the upper bound is of the same order as
Therefore, the upper bound in Theorem 2 is of the same order as the lower bound in Theorem 1 up to an order of , which is logarithmic in .
∎
To establish Theorem 2, we first bound the probability of the event under which the upper bound of the best arm is below the maximal reward. Then, we bound the largest number of samples after which the algorithm terminates under the assumption that the upper bound of the best arm is above the maximal reward.
Proof (Theorem 2).
We denote the time step of the algorithm by , and the value of the counter at time step by . Recall that stands for the random final time step. By the condition in step 5 of the algorithm, for every arm , it follows that,
(4) 
Note that by the fact that for it follows that , and by the fact that for it follows that it is obtained that
for . So, by the fact that , for it follows that
(5) 
Now, we begin by proving the correctness property of the algorithm. Recall that for every arm the rewards are distributed according to the CDF . Assume w.l.o.g. that . Then, for and by the fact that for every , for it follows that
(6) 
where is the largest reward observed from arm after this arm has been sampled for times. Hence, at every time step , by the definition of and Equations (5) and (6), by applying the union bound, it follows that
(7) 
Since by the condition in step 5, it is obtained that when the algorithm stops
and by the fact that for every time step
it follows by Equation (7) that
Therefore, it follows that the algorithm returns a reward greater than with a probability larger than . So, it is correct.
For proving the bound on the expected sample complexity of the algorithm we define the following sets:
As before, we assume w.l.o.g. that . For the case in which
occurs, since for every , and every time step, it follows that the necessary condition for sampling from arm ,
occurs only when the event
occurs. But
Therefore, it is obtained that
(8) 
By using the bound in Equation (4) for the arms in the set , the bound in Equation (8) for the arms in the set and the bound in Equation (5), it is obtained that
(9) 
where
In addition, by Equation (7), the bound in Equation (5) and by applying the union bound, it follows that
So,
(10) 
Furthermore, by the definitions of the sets and , it can be obtained that
(11) 
∎
4.2 Maximal Eliminator
The algorithm starts by sampling a certain number of times from each arm. Then, it repeatedly calculates an index for each arm which can be interpreted as a certain upper bound on the maximal reward of this arm, and eliminates arms for which that index is below the maximal sampled reward so far. Then it samples only from the retained arms (those arms which have not been eliminated) a number of times that is doubled at each sampling phase. This idea is similar to that in the Median Elimination Algorithm provided in [2].
We do not provide a performance analysis for Algorithm 2. However, since the number of times at which the confidence bounds should be correct (times at which the algorithm eliminates arms) is only logarithmic in the number of total samples, we have (where is defined in Algorithm 1 and the factor arises because of the doubling). Therefore, we believe that the upper bound on the sample complexity of Algorithm 2 would be that of Algorithm 1 multiplied by . So, the upper bound would be of the same order as the lower bound in Theorem 1 up to double logarithmic terms.
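The elimination scheme can be sketched in the same spirit. Again this is a hedged illustration and not the paper's Algorithm 2: it assumes the tail bound P(X >= mu_k - w) >= A * w**beta, uses the confidence width w(n) = (log_term / (A * n))**(1 / beta) with a fixed, caller-supplied `log_term` (the actual algorithm presumably lets this term depend on the phase), and stops once the width is at most `eps`. All names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def max_eliminator(
    arms: Sequence[Callable[[], float]],
    A: float,
    beta: float,
    eps: float,
    log_term: float,
    n_init: int = 1,
) -> Tuple[float, int]:
    """Return (largest reward seen, total number of samples). Needs n_init >= 1."""
    K = len(arms)
    counts = [0] * K
    best = [-math.inf] * K
    active: List[int] = list(range(K))
    batch = n_init
    while True:
        for k in active:  # sample every retained arm `batch` times
            for _ in range(batch):
                counts[k] += 1
                best[k] = max(best[k], arms[k]())
        top = max(best)  # maximal sampled reward so far
        # Retained arms always share the same sample count, hence one width.
        w = (log_term / (A * counts[active[0]])) ** (1.0 / beta)
        # Eliminate arms whose upper bound falls below the best sample.
        active = [k for k in active if best[k] + w >= top]
        if w <= eps:
            return top, sum(counts)
        batch *= 2  # the per-arm batch size doubles at each phase


# Illustrative run with the same uniform arms as one might use for Algorithm 1.
rng = random.Random(1)
demo_arms = [lambda m=m: rng.uniform(0.0, m) for m in (0.5, 0.8, 1.0)]
reward, total = max_eliminator(demo_arms, A=1.0, beta=1.0, eps=0.1, log_term=5.0)
```

The arm holding the largest sample always survives the filter, so the retained set is never empty. With doubling batches, elimination decisions are made only a logarithmic number of times, which is the property the paragraph above appeals to.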
5 Comparison with The UnifiedArm Model
In this section, we analyze the improvement in the sample complexity obtained by utilizing the multi arm property (the ability to choose from which arm to sample at each time step) compared to a model in which all the arms are unified into a unified arm, so that the sample is effectively obtained from a random arm. In the unifiedarm model, when the agent samples from this unified arm, a certain arm (among the multi arm) is chosen uniformly and a reward is sampled from this arm. We denote the CDF of the unified arm as , with . By Assumption 1, , and the corresponding maximal reward is .
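A minimal sketch of the unified arm, assuming each original arm is available as a zero-argument sampling function: a draw first picks one arm uniformly at random and then samples from it, so the unified CDF is the plain average of the individual CDFs. The helper names `make_unified_arm` and `unified_cdf` are illustrative only.

```python
import random
from typing import Callable, Sequence


def make_unified_arm(
    arms: Sequence[Callable[[], float]], rng: random.Random
) -> Callable[[], float]:
    """Wrap several arms into one: pick an arm uniformly, then sample it."""
    def draw() -> float:
        return rng.choice(arms)()
    return draw


def unified_cdf(cdfs: Sequence[Callable[[float], float]], x: float) -> float:
    """CDF of the uniform mixture: the average of the individual CDFs at x."""
    return sum(F(x) for F in cdfs) / len(cdfs)


# Illustrative check with two degenerate arms that always return 1.0 and 2.0.
unified = make_unified_arm([lambda: 1.0, lambda: 2.0], random.Random(0))
draws = [unified() for _ in range(20)]
half = unified_cdf([lambda x: float(x >= 1.0), lambda x: float(x >= 2.0)], 1.5)
```

Since the best arm is selected only with probability one over the number of arms, the tail mass of the unified arm near the overall maximum is correspondingly diluted.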
In the remainder of this section, we provide a lower bound on the sample complexity and an correct algorithm that attains the same order of this bound for the unifiedarm model. (Note that the lower bound in Theorem 1 is meaningless for .) Then, we discuss which approach (multiarm or unifiedarm) is better for different model parameters, and provide examples that illustrate these cases.
5.1 Lower Bound
The following Theorem provides a lower bound on the sample complexity for the unifiedarm model.
Theorem 3.
Suppose , and let , . Then, under Assumption 1, for every correct algorithm, it holds that
(12) 
The proof is provided in Appendix B and is based on an idea similar to that of Theorem 1.
5.2 Algorithm
In Algorithm 3 a certain number of rewards is sampled, and the algorithm chooses the best one among them. In the following Theorem we provide a bound on the sample complexity achieved by Algorithm 3.
The proof is provided in Appendix C. Note that the upper bound on the sample complexity is of the same order as the lower bound in Theorem 3.
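The sample-then-pick-the-best scheme can be sketched as follows. The sample size used here is an assumption made for illustration, not necessarily the one in Algorithm 3: it supposes that one draw from the unified arm lands within `eps` of the overall maximum with probability at least p = (A / K) * eps**beta, and uses (1 - p)**n <= exp(-n * p) to choose n. The function names are hypothetical.

```python
import math
from typing import Callable


def unified_sample_size(K: int, A: float, beta: float, eps: float, delta: float) -> int:
    """Number of draws n with exp(-n * p) <= delta, where p = (A / K) * eps**beta."""
    p = (A / K) * eps ** beta  # assumed per-draw probability of an eps-good reward
    if not 0.0 < p <= 1.0:
        raise ValueError("assumed hit probability must lie in (0, 1]")
    return math.ceil(math.log(1.0 / delta) / p)


def best_of_n(draw: Callable[[], float], n: int) -> float:
    """Draw n rewards from the unified arm and return the largest one."""
    return max(draw() for _ in range(n))


# Illustrative values only: 4 arms, A = 1, beta = 1, eps = 0.1, delta = 0.05.
n_draws = unified_sample_size(K=4, A=1.0, beta=1.0, eps=0.1, delta=0.05)
stream = iter([0.2, 0.9, 0.4])
picked = best_of_n(lambda: next(stream), 3)
```

Under these assumptions the number of draws is fixed in advance, so the sample complexity is deterministic and scales linearly with the number of arms.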
5.3 Comparison and Examples
To find when the multiarm algorithm is helpful, we can compare the upper bound on the sample complexity provided in Theorem 2 for Algorithm 1 (multiarm case) with the lower bound for the unifiedarm model in Theorem 3.
Case 1: Suppose first that arm 1 is best: , while all the other arms fall short significantly compared to the required accuracy : , for .
In this case , for . Hence the upper bound on the sample complexity of Algorithm 1 (multi-arm case) will be smaller than the lower bound for the unified-arm model in Theorem 3. We now provide an example which illustrates Case 1 numerically.
Example 1 (Case 1).
Case 2: Consider next the opposite case, where there are many optimal arms and few that are worse: say , while for all .
In this case , for . Hence, since there is a multiplicative factor that is logarithmic in in the upper bound on the sample complexity of Algorithm 1 (multi-arm case), this bound will be larger than the lower bound for the unified-arm model in Theorem 3. The following example illustrates Case 2 numerically.
Example 2 (Case 2).
As shown in Example 2, in some cases the bound on the sample complexity of Algorithm 1 (multiarm) is larger than that of Algorithm 3 (unifiedarm). By comparing the upper bounds of these algorithms, we believe that the logarithmic in factor in the bound of Algorithm 1 may not be required.
As observed by comparing the lower and upper bounds for the multi-arm and the unified-arm models, the unified-arm algorithm provides a tighter upper bound (compared to the matching lower bound). Therefore, when the benefit obtained by the multi-arm model is small (i.e., when there are many good arms), applying the multi-arm algorithm turns out to be a loss rather than a gain.
6 Conclusion
In this paper we have developed corresponding lower and upper bounds on the sample complexity of the Max Armed Bandit problem, which are essentially of the same order up to a logarithmic term in .
These results were compared to the unifiedarm model, where the learning algorithm effectively unifies the different arms into one. While the multiarm algorithm usually performs better, in some cases, in particular when most arms are optimal, the unified arm algorithm may provide better performance. It still remains to be shown whether an algorithm that provides the performance benefits of both approaches may be devised.
Another direction for future work concerns the relaxation or generalization of our Assumption 1, which requires a known lower bound on the tail distribution of the rewards.
References
[1] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
[2] E. Even-Dar, S. Mannor, and Y. Mansour, “PAC bounds for multi-armed bandit and Markov decision processes,” in Computational Learning Theory, pp. 255–270, 2002.
[3] J.-Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in COLT - 23th Conference on Learning Theory - 2010, pp. 13–p, 2010.
[4] V. Gabillon, M. Ghavamzadeh, and A. Lazaric, “Best arm identification: A unified approach to fixed budget and fixed confidence,” in Advances in Neural Information Processing Systems 25, pp. 3212–3220, Curran Associates, Inc., 2012.
[5] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” Journal of Machine Learning Research, vol. 5, pp. 623–648, 2004.
[6] V. A. Cicirello and S. F. Smith, “The max k-armed bandit: A new model of exploration applied to search heuristic selection,” in Proceedings of the National Conference on Artificial Intelligence, vol. 20, p. 1355, 2005.
[7] M. J. Streeter and S. F. Smith, “An asymptotically optimal algorithm for the max k-armed bandit problem,” in Proceedings of the National Conference on Artificial Intelligence, vol. 21, p. 135, 2006.
[8] M. J. Streeter and S. F. Smith, “A simple distribution-free approach to the max k-armed bandit problem,” in Principles and Practice of Constraint Programming - CP 2006, pp. 560–574, Springer, 2006.
[9] A. Carpentier and M. Valko, “Extreme bandits,” in Advances in Neural Information Processing Systems 27, pp. 1089–1097, Curran Associates, Inc., 2014.
[10] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp, “Bandit problems with infinitely many arms,” The Annals of Statistics, pp. 2103–2116, 1997.
[11] O. Teytaud, S. Gelly, and M. Sebag, “Anytime many-armed bandits,” in CAP, (Grenoble, France), 2007.
[12] Y. Wang, J.-Y. Audibert, and R. Munos, “Infinitely many-armed bandits,” Advances in Neural Information Processing Systems, vol. 8, pp. 1–8, 2008.
[13] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal, “Mortal multi-armed bandits,” in Advances in Neural Information Processing Systems 21, pp. 273–280, Curran Associates, Inc., 2009.
[14] T. Bonald and A. Proutiere, “Two-target algorithms for infinite-armed bandits with Bernoulli rewards,” in Advances in Neural Information Processing Systems 26, pp. 2184–2192, Curran Associates, Inc., 2013.
[15] Y. David and N. Shimkin, “Infinitely many-armed bandits with unknown value distribution,” in Machine Learning and Knowledge Discovery in Databases, pp. 307–322, Springer, 2014.
[16] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, pp. 235–256, 2002.
7 Appendix A
Proof (Theorem 1).
Let for every . Then, we define the following set of hypotheses :
and, for every ,
where is the probability density function of arm , stands for the indicator function of the set , , and is chosen such that . Note that since for every it follows that for , Assumption 1 holds for hypotheses .
To further bound and , note that since ,
Let stand for the mass of an atom in the probability function of arm at the point (if there is one); then we note that
but, since , for it follows that . So, since increases in it is obtained that . Finally, it follows that in the case of ,
and in the case of ,
where
If hypothesis () is true, then for all , hence the algorithm should provide a reward from arm with probability larger than . We use and to denote the expectation and probability, respectively, under the algorithm being considered and hypothesis . Further, for every let
and let stand for the number of samples from arm .
Suppose now that our algorithm is correct under , and that for some . We will show that this algorithm cannot be correct under hypothesis . Therefore, an correct algorithm must have for all .
Define the following events:

. It easily follows from that if , then .

Let stand for the event under which the chosen arm at termination is , and for its complement. Since can hold for one arm at most, it follows that for every for some .

Let be the event under which all the samples obtained from arm are on the interval . Clearly, .
Define now the intersection event . We have just shown that for every it holds that , and , from which it follows that for . Further, observe that for every history of samples for which the event holds, it holds that . We therefore obtain the following inequalities,
where in the last inequality we used the facts that .
We found that if an algorithm is correct under hypothesis and for some , then, under hypothesis this algorithm returns a sample that is smaller by at least than the maximal possible reward with probability of or more, hence the algorithm is not correct. Therefore, any correct algorithm must satisfy for all arms except possibly for one (namely, for the one for which ). In addition , where is the optimal arm (namely, ). Hence the lower bound is obtained.
∎
8 Appendix B
Proof (Theorem 3).
First, we define the following hypotheses:
and
where, as in the proof of Theorem 1, is the probability density function of the unified arm, stands for the indicator function of the set , and is chosen such that .
Note that since for every it follows that for , Assumption 1 holds for hypothesis .
To further bound , note that
Therefore,
If hypothesis is true, the algorithm should provide a reward greater than . We use and (where ) to denote the expectation and probability respectively, under the algorithm being considered and under hypothesis . Now, let
and recall that stands for the total number of samples from the arm.
Now, we assume we run an algorithm which is correct under and that for this algorithm. We will show that this algorithm cannot be correct under hypothesis . Therefore, an correct algorithm must have .
Define the following events:

. By the same consideration as in the proof of Theorem 1 (for the events ), it follows that if , then .

Let stand for the event under which the chosen sample is smaller than or equal to , and for its complement. Clearly, .

We define the event to be the event under which all the samples obtained from the unified arm are on the interval . Clearly, .
Define now the intersection event . We have shown that , and , from which it is obtained that . In addition, since for every history of samples, for which the event holds, it is obtained that , we have the following,
where in the last inequality we used the facts that .
We found that if an algorithm is correct under hypothesis and , then, under hypothesis this algorithm returns a sample that is smaller by at least than the maximal possible reward with a probability of or more, hence the algorithm is not correct. Therefore, any correct algorithm must satisfy . Hence the lower bound is obtained.
∎
9 Appendix C
Proof (Theorem 4).
Since sampling from the unified arm consists of choosing one arm out of the arms (with equal probability), and then, sampling from this arm, it follows that, . Also, we note that for every . Therefore, for ,
(13) 
where is the largest reward observed among the first samples. Hence, the algorithm is correct. The bound on the sample complexity is immediate from the definition of the algorithm.
∎