In the classic stochastic multi-armed bandit (MAB) problem the learning agent faces a set of stochastic arms, and wishes to maximize its cumulative reward (in the regret formulation), or find the arm with the highest expected reward (the pure exploration problem). This model has been studied extensively in the statistical and learning literature, see for example  for a comprehensive survey.
We consider a variant of the MAB problem called the Max $K$-Armed Bandit problem (Max-Bandit for short). In this variant, the objective is to obtain a sample with the highest possible reward (namely, the highest value in the support of the probability distribution of any arm). More precisely, considering the PAC setting, the objective is to return an $(\epsilon,\delta)$-correct sample, namely a sample whose value is $\epsilon$-close to the overall best with a probability larger than $1-\delta$. In addition, we wish to minimize the sample complexity, namely the expected number of samples observed by the learning algorithm before it terminates.
For the classical MAB problem, algorithms that find the best arm (in terms of its expected reward) in the PAC sense were presented in [11, 1, 12], and lower bounds on the sample complexity were presented in  and . The essential difference with respect to this work is in the objective, which is to find an $(\epsilon,\delta)$-correct sample in our case. The scenario considered in the Max-Bandit model is relevant when a single best item needs to be selected from among several (large) clustered sets of items, with each set represented as a single arm. These sets may represent parts that come from different manufacturers or are produced by different processes, job candidates that are referred by different employment agencies, populations in which the best match to certain genetic characteristics is sought, or frequency bands among which the best channel is chosen in a cognitive radio wireless network.
The Max-Bandit problem was apparently first proposed in . For reward distribution functions in a specific family, an algorithm with an upper bound on the sample complexity that increases as was provided in . For the case of discrete rewards, another algorithm was presented in , without performance analysis. Later, a similar model in which the objective is to maximize the expected value of the largest sampled reward for a given number of samples () was studied in . In that work the attained best reward is compared with the expected reward obtained by an oracle that samples the best arm times. An algorithm is suggested and shown to secure an upper bound of order on that difference, where and are determined by the properties of the distribution functions and decreases as they are further away from a specific family of functions. Recently, a similar model in which the goal is to find the arm for which the value of a given quantile () is the largest was studied in . Their model can be compared to ours by allowing an error of the same size as the given quantile. In this case, the bound on the sample complexity provided in  increases as .
Our basic assumption in the present paper is that a known lower bound (, formally defined in Section 2) is available on the tail distributions, namely on the probability that the reward of each given arm will be close to its maximum. A special case is a lower bound on the probability densities near the maximum. Under that assumption, we provide an algorithm for which the sample complexity increases at most as . In the context of , and in the context of  . Therefore, the proposed algorithm provides an improvement by a factor of over the result of , which was obtained for a more specific model, and an improvement by the same factor over the result of  which was derived for a similar, but different objective. To compare with the result in , we note that by considering the expected maximal value as the maximal possible value, it follows that. With a choice of in our algorithm, we obtain that the expected deficit of the largest sample with respect to the maximal reward possible is at most of order (as compared to with ). Furthermore, we provide a lower bound on the sample complexity of every $(\epsilon,\delta)$-correct algorithm, which is shown to coincide, up to a logarithmic term, with the upper bound derived for the proposed algorithm. To the best of our knowledge, this is the first lower bound for the present problem. In addition, we analyze the robustness of the algorithm to our choice of the tail function bound , both for the case where this choice is too optimistic (i.e., the actual distributions do not obey the assumed bound) and for the case where our choice is overly conservative.
A basic feature of the Max-Bandit problem (and the associated algorithms) is the goal of quickly focusing on the best arm (in terms of maximal reward), and sampling from that arm as much as possible. It is natural to compare the obtained results with an alternative approach, which ignores the distinction between arms, and simply draws a sample from a random arm at each round. This can be interpreted as mixing the items associated with each arm before sampling; we accordingly refer to this variant as the unified-arm problem. This problem actually coincides with the so-called infinitely-many armed bandit model studied in [3, 18, 19, 8, 4], for the specific case of deterministic arms studied in . As may be expected, the unified-arm approach provides the best results when the reward distributions of all arms are identical. However, when many arms are suboptimal, the multi-armed approach provides superior performance.
The paper proceeds as follows. In the next section we present our model. In Section 3 we provide a lower bound on the sample complexity of every $(\epsilon,\delta)$-correct algorithm. In Section 4 we present an $(\epsilon,\delta)$-correct algorithm, and we provide an upper bound on its sample complexity. The algorithm is simple and its bound has the same order as the lower bound up to a logarithmic term in (where stands for the number of arms). Then, in Section 5, we provide an analysis of the algorithm’s performance for the case in which our assumption does not hold. In Section 6, we consider for comparison the unified-arm approach. In Section 7 we close the paper with some concluding remarks. Certain proofs are deferred to the Appendix due to space limitations.
2 Model Definition
We consider a finite set of arms, denoted by . At each stage the learning agent chooses an arm , and a real valued reward is obtained from that arm. The rewards obtained from each arm are independent and identically distributed, with a distribution function (CDF) , . We denote the maximal possible reward of each arm by , assumed finite, and the maximal reward among all arms by . The tail function of each arm is defined as follows.
For every arm , the tail function is defined by
For example, when is uniform on , then . In addition, we note that CDFs are nondecreasing functions and therefore the tail functions are non-increasing. It should be observed that does not reveal the maximal value , which remains unknown.
Throughout the paper, we shall use the following assumption.
There exists a known function and a known constant such that, for every and , it holds that
We note that for every , where stands for a random variable with distribution. Furthermore, noting that the tail functions are non-negative and non-increasing, we assume the same for their lower bound . Moreover, for convenience we shall assume that is strictly decreasing in , and denote its inverse function by .
An important special case is when one assumes that the probability density function (pdf) of each arm is lower bounded by a certain constant, so that . We shall often use the more general bound of the form to illustrate our results.
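As an illustration of a tail function and a lower bound on it, the following Python sketch evaluates the tail of a uniform arm alongside a polynomial bound and its inverse. It assumes that the tail function of an arm is $P(X \ge \mu - \epsilon)$ and that the polynomial bound has the form $A\epsilon^{\beta}$; the function names and the constants `A` and `beta` are hypothetical choices made for this example only.

```python
def uniform_tail(eps, a, b):
    """Tail of a Uniform[a, b] arm, assumed to be P(X >= b - eps), clipped to [0, 1]."""
    return min(max(eps / (b - a), 0.0), 1.0)


def poly_tail_bound(eps, A, beta):
    """Assumed polynomial lower bound on the tail: A * eps**beta."""
    return A * eps ** beta


def poly_tail_bound_inv(p, A, beta):
    """Inverse of the assumed polynomial bound: the eps at which the bound equals p."""
    return (p / A) ** (1.0 / beta)


# A Uniform[0, 2] arm has density 1/2, so A = 0.5 and beta = 1 bound its tail from below.
print(uniform_tail(0.5, 0.0, 2.0), poly_tail_bound(0.5, 0.5, 1.0))
```

In this sketch the bound is tight for the uniform arm; an arm whose density grows near its maximum would have a tail strictly above the same bound.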
An algorithm for the Max-Bandit model samples an arm at each time step, based on the observed history so far (i.e., the previously selected arms and observed rewards). We require the algorithm to terminate after a random number of samples, which is finite with probability 1, and return a reward which is the maximal reward observed over the entire period. An algorithm is said to be -correct if
The expected number of samples taken by the algorithm is the sample complexity, which we wish to minimize.
3 A Lower Bound
Before turning to our proposed algorithm, we provide a lower bound on the sample complexity of any -correct algorithm. The bound is established under Assumption 1, and the additional provision that is concave. The case of non-concave turns out to be more complicated for analysis, and it is currently unclear whether our lower bound holds in that case.
For example, when for some known constants and ,
The following result specifies our lower bound.
Let denote some optimal arm, such that . Let Assumption 1 hold with a concave function and let and . Then, for every $(\epsilon,\delta)$-correct algorithm,
We note that the specific requirement on is not fundamental, and can be relaxed at the cost of a smaller constant in the bound.
This lower bound can be interpreted as summing over the minimal number of times that each arm, other than the optimal arm , needs to be sampled. It is important to observe that if there are several optimal arms, only one of them is excluded from the summation. Indeed, the bound is large when there are several optimal (or near-optimal) arms, as the denominator of the summand is small for such arms. This follows since the algorithm needs to obtain more samples to verify that a given arm is -optimal.
The proof of Theorem 1 proceeds by considering any given set of reward distributions that obeys the Assumption, and showing that if an algorithm samples some suboptimal arm less than a certain number of times, it cannot be -correct for some related set of reward distributions for which this arm is optimal.
Proof of Theorem 1. We begin by defining the following set of hypotheses , where stands for the CDF of arm under hypothesis and stands for the indicator function of the set . Hypothesis is the true hypothesis, namely,
For , we define as follows. For each arm , its CDF coincides with the true one, namely,
For arm , we construct a CDF such that its maximal value is , while it still satisfies Assumption 1. To define , we use the notation
where is provided to the algorithm. We consider two cases.
Case 1: . Let
Case 2: . Define , and let
denote the value for which reaches probability . Set
If hypothesis () were true, then for all , hence the algorithm should provide a reward from arm with probability larger than . We use and to denote the expectation and probability, respectively, under the algorithm being considered and hypothesis . For every let
where if and if . In addition, we let stand for the number of samples from arm .
Suppose now that our algorithm is -correct under , and that for some . We will show that this algorithm cannot be -correct under hypothesis . Therefore, an -correct algorithm must have for all .
Define the following events , for :
. It easily follows from that if , then .
Let stand for the event under which the chosen arm at termination is , and for its complement. Since can hold for one arm at most, it follows that for every for some .
Let be the event under which all the samples obtained from arm are in the interval . Clearly, .
For for which , is still defined as before, so (and ). Now, for every , we let denote the event under which for any number of samples from arm , the number of samples which are on the interval is bounded as follows:
where is an RV that equals if the -th sample from arm is in that interval and otherwise. Below we upper bound using Kolmogorov’s inequality.
Kolmogorov’s inequality states that the sum of zero-mean iid random variables satisfies (Theorem 22.4, in p. 287 of ). By applying it to the RVs , we obtain
where is the complement of .
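For reference, the standard form of Kolmogorov's inequality for independent zero-mean random variables $X_1,\dots,X_n$ with finite variances can be written as follows. The symbols $X_i$, $S_k$, and $\alpha$ are generic notation chosen for this statement and need not match the notation used in the proof.

```latex
\[
  P\Bigl( \max_{1 \le k \le n} |S_k| \ge \alpha \Bigr)
  \;\le\; \frac{\mathrm{Var}(S_n)}{\alpha^{2}},
  \qquad S_k = \sum_{i=1}^{k} X_i, \quad \alpha > 0 .
\]
```

The bound controls all partial sums simultaneously, which is what allows the event above to be stated for any number of samples from the arm.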
So, for the case of , by the fact that , it follows that .
For the case of , it follows that by its definition, so, again by definition we obtain that and therefore . So it follows that since by assumption . For simplicity, we use the bound for every .
Define now the intersection event . We have just shown that for every it holds that , , and , from which it follows that for .
Now, we let be the history of the process (the sequence of chosen arms and obtained rewards). For every , we denote the number of rewards under by . For a given history, at time , for every , the probability of choosing the next arm is the same under and under . Also, by the hypotheses definition, the reward probability is the same, unless the chosen arm is . Therefore, by the definition of the hypotheses,
where is defined before, by the fact that for , it follows that for and that otherwise ( and are defined before). In addition, represents the contribution of samples from arm with rewards strictly larger than .
Now we assume that the intersection event occurs. Then, occurs, so . Also, occurs, so . Therefore, for ,
Now, by the fact that , we obtain the following inequalities,
We found that if an algorithm is $(\epsilon,\delta)$-correct under hypothesis and for some , then, under hypothesis this algorithm returns a sample that is smaller by at least than the maximal possible reward with probability of or more, hence the algorithm is not $(\epsilon,\delta)$-correct. Therefore, any $(\epsilon,\delta)$-correct algorithm must satisfy for all arms except possibly for one (namely, for the one for which ). In addition , where is the optimal arm (namely, ). Hence,
Now, by the fact that is concave, it follows that where . So, for the case of , for , by the fact that is non-negative, it follows that and for the case of , for , it follows that . Then since is a non-decreasing function, the lower bound is obtained.
Here we provide an -correct algorithm. The algorithm is based on sampling the arm which has the highest upper confidence bound on its maximal reward.
The algorithm starts by sampling a fixed number of times from each arm. Then, it repeatedly calculates an index for each arm which can be interpreted as an upper bound on the maximal reward of this arm, and samples once from the arm with the largest index. The algorithm terminates when the number of samples from the arm with the largest index is above a certain threshold. This idea is similar to that in the UCB1 Algorithm of .
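The sampling scheme described above can be sketched in Python as follows. This is a minimal illustration and not the algorithm analyzed in Theorem 2: the index (largest observed reward plus a confidence width), the width `tail_inv(level / n)`, the single initial sample per arm, and the stopping rule (width of the leading arm at most `eps`) are all assumptions made here, as are the names `max_bandit_sketch`, `make_tail_inv`, and `level`.

```python
import math
import random


def make_tail_inv(A, beta):
    """Inverse of an assumed tail bound of the form A * eps**beta (hypothetical)."""
    return lambda p: (p / A) ** (1.0 / beta)


def max_bandit_sketch(arms, tail_inv, level, eps, rng):
    """Sample the arm with the largest upper bound on its maximal reward.

    arms:     list of callables rng -> float, one per arm
    tail_inv: assumed inverse of the tail lower bound
    level:    hypothetical confidence level (e.g. log(K / delta))
    Returns (largest observed reward, total number of samples).
    """
    K = len(arms)
    counts = [1] * K
    best = [arm(rng) for arm in arms]  # one initial sample per arm (assumed)
    total = K
    while True:
        # Index of each arm: largest reward seen so far plus a confidence width.
        widths = [tail_inv(level / counts[k]) for k in range(K)]
        k = max(range(K), key=lambda j: best[j] + widths[j])
        # Stop once the leading arm has been sampled enough for its width to be small.
        if widths[k] <= eps:
            return max(best), total
        best[k] = max(best[k], arms[k](rng))
        counts[k] += 1
        total += 1


rng = random.Random(0)
arms = [lambda r: r.uniform(0.0, 1.0),
        lambda r: r.uniform(0.0, 0.6),
        lambda r: r.uniform(0.0, 0.3)]
value, total = max_bandit_sketch(arms, make_tail_inv(1.0, 1.0), math.log(3 / 0.05), 0.05, rng)
print(value, total)
```

Because an arm is sampled only while its width exceeds `eps`, each arm is drawn at most about `level / (A * eps**beta)` times in this sketch, so the loop always terminates.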
As observed by comparing the bounds in Equations (3) and (4), the upper bound in Theorem 2 has the same dependence of and , up to a logarithmic term. It should be noted though that while the lower bound is currently restricted to concave tail function bounds, the algorithm and its bound are not restricted to this case.
To establish Theorem 2, we first bound the probability of the event under which the upper bound of the best arm is below the maximal reward, using an extreme value bound. Then, we bound the largest number of samples after which the algorithm terminates under the assumption that the upper bound of the best arm is above the maximal reward.
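A generic extreme value bound of the kind used in the first step can be sketched as follows. The sketch assumes that the tail function of an arm is $G(u) = P(X \ge \mu - u)$ with $G(u) \ge G_*(u)$, and writes $V_n$ for the largest of $n$ i.i.d. samples from that arm; the symbols $V_n$, $u$, $G_*$, and $\delta'$ are notation assumed here, and the constants in the actual proof may differ.

```latex
\[
  P\bigl( V_n < \mu - u \bigr)
  = \bigl( 1 - G(u) \bigr)^{n}
  \le \bigl( 1 - G_*(u) \bigr)^{n}
  \le e^{-n\, G_*(u)} .
\]
% Setting the right-hand side equal to \delta' and solving for u:
\[
  u = G_*^{-1}\!\Bigl( \tfrac{1}{n} \ln \tfrac{1}{\delta'} \Bigr)
  \quad\Longrightarrow\quad
  P\bigl( V_n + u < \mu \bigr) \le \delta' .
\]
```

Under these assumptions, $V_n + u$ is an upper bound on the maximal reward of the arm with probability at least $1 - \delta'$, which is the role played by the index in the algorithm.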
Proof of Theorem 2 We denote the time step of the algorithm by , and the value of the counter at time step by . Recall that stands for the random final time step. By the condition in step 5 of the algorithm, for every arm , it follows that,
Note that by the fact that for it follows that , and by the fact that for it follows that it is obtained that
for . So, by the fact that , for it follows that
We begin by proving the $(\epsilon,\delta)$-correctness property of the algorithm. Recall that for every arm the rewards are distributed according to the CDF . Let us assume w.l.o.g. that . Then, for and by the fact that for every , for it follows that
where is the largest reward observed from arm after this arm has been sampled for times. Hence, at every time step , by the definition of and Equations (6) and (7), by applying the union bound, it follows that
Since by the condition in step 5, it is obtained that when the algorithm stops
and by the fact that for every time step
it follows by Equation (8) that
Therefore, it follows that the algorithm returns a reward greater than with a probability larger than . So, it is -correct.
To prove the bound on the expected sample complexity of the algorithm, we define the following sets:
As before, we assume w.l.o.g. that . For the case in which
occurs, since for every , and every time step, it follows that the necessary condition for sampling from arm ,
occurs only when the event
Therefore, it is obtained that
Furthermore, by the definitions of the sets and and since , it can be obtained that
The performance bounds presented for our algorithm depend directly on the choice of the lower bound on the tail functions. A natural question is what happens if our choice of is too optimistic, so that Assumption 1 is violated. In the opposite direction, how tight is our bound when our choice of is too conservative? We address these two questions in turn.
5.1 Optimistic Tails Estimate
Here Equation (1) does not hold for , but holds for for some . The fact that Equation (1) does not hold for leads to the situation in which the probability is larger (where is the index calculated in step 4 of the algorithm) than the value on which the proof of Theorem 2 relies. In the following proposition we provide the -correctness and sample complexity of Algorithm 1.
5.2 Conservative Tails Estimate
Here, Assumption 1 holds for the provided function and also holds for for some . Therefore, in this case the probability is smaller than the value on which the proof of Theorem 2 relies. So, Algorithm 1 returns an -optimal value with a larger probability. The probability of returning a false value is given in the following proposition.
6 Comparison with the Unified-Arm Model
In this section, we analyze the improvement in the sample complexity obtained by utilizing the multi-armed framework (the ability to choose from which arm to sample at each time step) compared to a model in which all the arms are unified into a single arm, so that the sample is effectively obtained from a random arm. In the unified-arm model, when the agent samples from this unified arm, one of the original arms is chosen uniformly at random, and a reward is sampled from this arm. The CDF of the unified arm is therefore , and the corresponding maximal reward is . Assumption 1 implies that .
In the remainder of this section, we provide a lower bound on the sample complexity and an $(\epsilon,\delta)$-correct algorithm that attains the same order as this bound for the unified-arm model. (Note that the lower bound in Theorem 1 is meaningless for .) Then, we discuss which approach (multi-armed or unified-arm) is better for different model parameters, and provide examples that illustrate these cases.
6.1 Lower Bound
The following theorem provides a lower bound on the sample complexity for the unified-arm model.
For every -correct algorithm, under Assumption 1, when is concave and , it holds that
In Algorithm 2, a fixed number of instances is sampled, and the algorithm chooses the best one among them. In the following theorem we provide a bound on the sample complexity achieved by Algorithm 2.
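The fixed-sample scheme just described can be sketched in Python as follows. The sample size used here is an assumption of this sketch, not a quantity taken from the theorem: if the unified arm has tail at least $G_*(\epsilon)/K$, then $N \ge K \ln(1/\delta) / G_*(\epsilon)$ samples miss the top $\epsilon$-range with probability at most $e^{-N G_*(\epsilon)/K} \le \delta$. The names `unified_sample_size` and `unified_arm_best` are hypothetical.

```python
import math
import random


def unified_sample_size(K, tail_bound_at_eps, delta):
    """Smallest N with exp(-N * tail_bound_at_eps / K) <= delta (assumed rule)."""
    return math.ceil(K * math.log(1.0 / delta) / tail_bound_at_eps)


def unified_arm_best(arms, n_samples, rng):
    """Draw n_samples rewards, each from a uniformly chosen arm, and return the largest."""
    return max(rng.choice(arms)(rng) for _ in range(n_samples))


rng = random.Random(0)
arms = [lambda r: r.uniform(0.0, 1.0),
        lambda r: r.uniform(0.0, 0.6),
        lambda r: r.uniform(0.0, 0.3)]
n = unified_sample_size(K=len(arms), tail_bound_at_eps=0.05, delta=0.05)
print(n, unified_arm_best(arms, n, rng))
```

Unlike the multi-armed scheme, the number of samples here is fixed in advance and grows linearly with the number of arms, regardless of how far the suboptimal arms fall short.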
6.3 Comparison and Examples
To find when the multi-armed algorithm is useful, we may compare the upper bound on the sample complexity provided in Theorem 2 for Algorithm 1 (multi-armed case) with the lower bound for the unified-arm model in Theorem 3. We consider two extreme cases.
Case 1: Suppose that arm 1 is best: , while all the other arms fall short significantly compared to the required accuracy : , for .
Here , for <