1 Introduction
One of the biggest challenges in machine learning is to make learning scalable. A natural way to speed up the learning process is to introduce multiple learners/agents, and let them learn the target function collaboratively. A fundamental question in this direction is to quantify the power of collaboration under limited interaction, as interaction is expensive in many settings. In this paper we approach this general question via the study of a central problem in online learning – best arm identification (or, pure exploration) in multi-armed bandits. We present efficient collaborative learning algorithms and complement them with almost tight lower bounds.
Best Arm Identification.
In multi-armed bandits (MAB) we have n alternative arms, where the i-th arm is associated with an unknown reward distribution D_i with mean μ_i. Without loss of generality we assume that each D_i has support on [0, 1]; this can always be satisfied with proper rescaling. We are interested in the best arm identification problem in MAB, in which we want to identify the arm with the largest mean. In the standard setting we only have one agent, who tries to identify the best arm by a sequence of arm pulls. Upon each pull of the i-th arm the agent observes an i.i.d. sample/reward from D_i. At any time step, the index of the next pull (or, the final output at the end of the game) is decided by the indices and outcomes of all previous pulls and the randomness of the algorithm (if any). Our goal is to identify the best arm using the minimum number of arm pulls, which is equivalent to minimizing the running time of the algorithm; we can just assume that each arm pull takes a unit time.
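As a concrete illustration of the centralized setting (this sketch is not the paper's algorithm, only a standard successive-elimination baseline with Bernoulli arms and illustrative constants):

```python
import math
import random

def successive_elimination(means, delta=0.05, seed=0):
    """Identify the arm with the largest mean by repeated uniform sampling
    and elimination.  `means` are the unknown Bernoulli means; the
    algorithm only observes pull outcomes."""
    rng = random.Random(seed)
    n = len(means)
    active = list(range(n))
    sums = [0.0] * n
    pulls = 0
    t = 0
    while len(active) > 1 and t < 10**6:
        t += 1
        for i in active:
            sums[i] += 1.0 if rng.random() < means[i] else 0.0
            pulls += 1
        # Hoeffding confidence radius, union-bounded over arms and steps
        rad = math.sqrt(math.log(4 * n * t * t / delta) / (2 * t))
        best_emp = max(sums[i] / t for i in active)
        active = [i for i in active if sums[i] / t >= best_emp - 2 * rad]
    return active[0], pulls
```

The number of pulls made before termination is the running time in the unit-cost convention above.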
MAB has been studied for more than half a century [37, 20], due to its wide practical applications in clinical trials [36], adaptive routing [5], financial portfolio design [39], model selection [31], computer game play [40], and stories/ads display on websites [2], just to name a few. In many of these scenarios we are interested in finding the best arm (strategy, choice, etc.) as soon as possible and committing to it. For example, in the Monte Carlo Tree Search used by computer game play engines, we want to find the best move among a huge number of possible moves. In the task of high-quality website design, we hope to find the best design among a set of alternatives for display. In almost all such applications the arm pull is the most expensive component: in the real-time decision making of computer game play, it is time-expensive to perform a single Monte Carlo simulation; in website design tasks, having a user test each alternative is both time and capital expensive (often a fixed monetary reward is paid for each trial a tester carries out).
In the literature of best arm identification in MAB, two variants have been considered:

Fixed-time best arm: Given a time budget T, identify the best arm with the smallest error probability.
¹ In the literature this is often called fixed-budget best arm. Here we use time instead of budget in order to be consistent with the collaborative learning setting, where it is easier to measure the performance of the algorithm by its running time.
Fixed-confidence best arm: Given an error probability δ, identify the best arm with error probability at most δ using the smallest amount of time.
We will study both variants in this paper.
Collaborative Best Arm Identification.
In this paper we study best arm identification in the collaborative learning model, where we have K agents who try to learn the best arm together. The learning proceeds in rounds. In each round each agent pulls a (multi)set of arms without communication. For each agent at any time step, based on the indices and outcomes of all previous pulls, all the messages received, and the randomness of the algorithm (if any), the agent, if not in the wait mode, takes one of the following actions: (1) makes the next pull; (2) requests a communication step and enters the wait mode; (3) terminates and outputs the answer. A communication step starts if all non-terminated agents are in the wait mode. After a communication step all non-terminated agents exit the wait mode and start a new round. During each communication step each agent can broadcast a message to every other agent. While we do not restrict the size of the message, in practice it will not be too large.² Once terminated, the agent takes no further actions. The algorithm terminates when all agents terminate, at which point every agent should agree on the same best arm. The number of rounds of computation, denoted by R, is the number of communication steps plus one. ² The information of all pull outcomes of an agent can be described by an array of size at most n, with each coordinate storing a pair (c_i, s_i), where c_i is the number of pulls on the i-th arm and s_i is the sum of the rewards of those pulls.
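The round structure above can be made concrete with a toy simulation (the pull strategy used here, uniform pulls followed by halving the surviving arms at each communication step, is a placeholder of our own, not an algorithm from this paper):

```python
import random

def collaborative_rounds(means, K, R, budget_per_round, seed=0):
    """Toy simulation of the model: in each round every agent pulls
    arms without communication, then a communication step merges the
    agents' (pull-count, reward-sum) arrays, after which all agents
    agree on which arms survive to the next round."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n
    active = list(range(n))
    for _ in range(R):
        for _ in range(K):                      # K agents act in parallel
            for _ in range(budget_per_round):
                i = rng.choice(active)
                counts[i] += 1
                sums[i] += 1.0 if rng.random() < means[i] else 0.0
        # communication step: all agents now share (counts, sums);
        # keep the empirically better half of the surviving arms
        active.sort(key=lambda i: -(sums[i] / max(counts[i], 1)))
        active = active[:max(1, len(active) // 2)]
    return active[0]
```

The running time of one round is the maximum number of pulls made by any single agent in that round, here `budget_per_round` for every agent.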
Our goal in the collaborative learning model is to minimize the number of rounds R and the running time T = t_1 + … + t_R, where t_r is the maximum number of pulls made among the K agents in round r. The motivation for minimizing R is that initiating a communication step always comes with a large time overhead, due to network bandwidth, latency, and protocol handshaking. Round-efficiency is one of the major concerns in all parallel/distributed computational models, such as the BSP model [42] and MapReduce [16]. The total cost of the algorithm is a weighted sum of R and T, where the coefficients depend on the concrete applications. We are thus interested in the best round-time tradeoffs for collaborative best arm identification.
Speedup in Collaborative Learning.
As the time complexity of best arm identification in the centralized setting is already well-understood (see, e.g., [17, 30, 3, 23, 22, 24, 11, 15]), we would like to interpret the running time of a collaborative learning algorithm as the speedup over that of the best centralized algorithm, which also expresses the power of collaboration. Intuitively speaking, if the running time of the best centralized algorithm is T_C, and that of a proposed collaborative learning algorithm A is T_A, then we say the speedup of A is T_C / T_A. However, due to the parameters in the definition of best arm identification and the instance-dependent bounds for the best centralized algorithms, the definition of the speedup of a collaborative learning algorithm needs to be a bit more involved.
Given an algorithm A and an input instance I, let err(A, I, T) be the error probability of A on I given time budget T. Given an algorithm A and an error probability δ, let time(A, I, δ) be the smallest time needed for A to succeed on I with probability at least 1 − δ. Given two algorithms A, B and two time horizons T_A, T_B, we say (A, T_A) dominates (B, T_B) if for any input instance I, we have err(A, I, T_A) ≤ err(B, I, T_B). We define the speedup of collaborative learning algorithms for the two variants of the best arm identification problem separately.

Fixed-time: we define the speedup of a collaborative learning algorithm A at time horizon T as the worst-case ratio T_B / T over all centralized algorithms B, where T_B is the smallest time horizon such that (B, T_B) dominates (A, T). That is, for each centralized algorithm B we measure how much centralized time is needed to match A given budget T, and we then define the speedup to be the worst case (infimum) of this ratio over all centralized algorithms B.

Fixed-confidence: we define the speedup of a collaborative learning algorithm A at confidence level δ as follows. That is, for each centralized algorithm B, we define the ratio of A and B to be the worst-case ratio, over all possible inputs I, between the running time of B for achieving error probability δ on I and that of A for achieving error probability δ on I, i.e., the infimum over I of time(B, I, δ) / time(A, I, δ). We then define the speedup to be the worst-case ratio over all centralized algorithms B.
In both cases, let β_{K,R} = sup_A β_A, where the sup is taken over all R-round algorithms A for the collaborative learning model with K agents.³ ³ A similar concept of speedup was introduced in the previous work [21]. However, no formal definition was given in [21].
Clearly there is a tradeoff between β_{K,R} and R: when R = 1 (i.e., there is no communication step), each agent needs to solve the problem by itself, and thus the speedup is 1. As R increases, β_{K,R} may increase. On the other hand, we always have β_{K,R} ≤ K. Our goal is to find the best round-speedup tradeoffs, which is essentially equivalent to the round-time tradeoffs that we mentioned earlier.
As one of our goals is to understand the scalability of the learning process, we are particularly interested in one end of the tradeoff curve: what is the smallest R such that β_{K,R} = Ω̃(K)? In other words, how many rounds are needed to make best arm identification fully scalable in the collaborative learning model? In this paper we address this question by giving almost tight round-speedup tradeoffs.
Our Contributions.
problem  number of rounds⁴  UB/LB  ref.  ⁴ We note again that the number of rounds equals the number of communication steps plus one.
fixedtime  1  1  –  trivial 
UB  [21]  
LB  [21]  
UB  new  
when  LB  new  
fixedconfidence  UB  [21]  
LB  new 
Our results are shown in Table 1. For convenience we use tilde notation (Õ, Ω̃) to hide logarithmic factors; these factors are made explicit in the actual theorems. Our contributions include:

Almost tight round-speedup tradeoffs for fixed-time. In particular, we prove a lower bound on the number of rounds needed by any algorithm for the fixed-time best arm identification problem in the collaborative learning model with K agents to achieve a given speedup, and we complement this lower bound with an algorithm whose round-speedup tradeoff matches it up to lower-order terms (see Table 1).

Almost tight round-speedup tradeoffs for fixed-confidence. In particular, we prove a lower bound on the number of rounds needed by any algorithm for the fixed-confidence best arm identification problem in the collaborative learning model with K agents to achieve a given speedup, which almost matches the round-speedup tradeoff of an algorithm in [21].

A separation between the two problems. The two results above give a separation on the round complexity of fully scalable algorithms between the fixed-time case and the fixed-confidence case. In particular, the fixed-time case has smaller round complexity for certain input instances, which indicates that knowing the “right” time budget is useful for reducing the number of rounds of the computation.

A generalization of the round elimination technique. In the lower bound proof for the fixed-time case, we develop a new technique which can be seen as a generalization of the standard round elimination technique: we perform the round reduction on classes of input distributions. We believe that this new technique will be useful for proving round-speedup tradeoffs for other problems in collaborative learning.

A new technique for instance-dependent round complexity. In the lower bound proof for the fixed-confidence case, we develop a new technique for proving instance-dependent lower bounds on round complexity. The distribution exchange lemma we introduce for handling different input distributions at different rounds may be of independent interest.
Related Works.
There are two main research directions in the literature for MAB in the centralized setting: regret minimization and pure exploration. In the regret minimization setting (see, e.g., [4, 9, 27]), the player aims at maximizing the total reward gained within the time horizon, which is equivalent to minimizing the regret, defined as the difference between the total reward achieved by the offline optimal strategy (where all information about the input instance is known beforehand) and the total reward achieved by the player. In the pure exploration setting (see, e.g., [17, 18, 3, 23, 22, 15]), the goal is to maximize the probability of successfully identifying the best arm, while minimizing the number of sequential samples used by the player. Motivated by various applications, other exploration goals have also been studied, e.g., identifying the top-k best arms [10, 46, 13], and identifying the set of arms with means above a given threshold [29].
The collaborative learning model for MAB studied in this paper was first proposed by [21], and has proved to be practically useful – the authors of [44] and [25] applied the model to distributed wireless network monitoring and collective sensemaking.
Agarwal et al. [1] studied the problem of minimum adaptivity needed in pure exploration. Their model can be viewed as a restricted collaborative learning model, where the agents are not fully adaptive and have to determine their strategy at the beginning of each round. Some solid bounds on the round complexity are proved in [1], including a lower bound using the round elimination technique. As we shall discuss shortly, we develop a generalized round elimination framework and prove a much better round complexity lower bound for a more sophisticated hard instance.
There are other works studying the regret minimization problem under various distributed computing settings. For example, motivated by applications in cognitive radio networks, a line of research (e.g., [28, 38, 7]) studied the regret minimization problem where the radio channels are modeled as the arms and the rewards represent the utilization rates of the radio channels, which could be deeply discounted if an arm is simultaneously played by multiple agents and a collision occurs. Regret minimization algorithms were also designed for distributed settings with an underlying communication network, such as peer-to-peer environments (e.g., [41, 26, 43]). In [6, 12], the authors studied distributed regret minimization in the adversarial case. The authors of [34] studied the regret minimization problem in the batched setting.
Blum et al. [8] studied PAC learning of a general function in the collaborative setting, and their results were further strengthened by [14, 33]. However, in the collaborative learning model they studied, each agent can only sample from one particular distribution, and is thus different from the model this paper focuses on.
2 Techniques Overview
In this section we summarize the high level ideas of our algorithms and lower bounds. For convenience, the parameters used in this overview are only for illustration purposes.
Lower bound for fixed-time algorithms.
A standard technique for proving round lower bounds in communication/sample complexity is round elimination [32]. Roughly speaking, one shows that if there exists an r-round algorithm with a certain error probability and sample complexity on an input distribution, then there also exists an (r − 1)-round algorithm with comparable error probability and sample complexity on a related input distribution. Finally, one shows that there is no 0-round algorithm with small error probability on a nontrivial input distribution.
In [1] the authors used the round elimination technique to prove an Ω(log* n) round lower bound for the best arm identification problem under a fixed total pull budget.⁵ In their hard input there is a single best arm with a slightly larger mean, and the remaining n − 1 arms share a common smaller mean. This “one-spike” structure makes it relatively easy to perform the standard round elimination. The basic argument in [1] goes as follows: suppose the best arm is chosen from the n arms uniformly at random. If the agents do not make enough pulls in the first round, then conditioned on the pull outcomes of the first round, the posterior distribution of the index of the best arm can be written as a convex combination of a set of distributions, each of which has large support and is close (in terms of the total variation distance) to the uniform distribution on its support, and is thus again hard for an algorithm with one fewer round. This step can be seen as an input embedding. ⁵ log* x is the number of times the logarithm function must be iteratively applied to x before the result is less than or equal to 1.
However, since our goal is to prove a much higher, logarithmic round lower bound, we have to restrict the total pull budget in terms of the instance-dependent parameter H = Σ_{i≥2} Δ_i^{−2} (where Δ_i is the difference between the mean of the best arm and that of the i-th best arm in the input), and create a hard input distribution with logarithmically many levels of arms in terms of their means.⁶ Roughly speaking, we take a random subset of arms and assign them the largest mean, a larger random subset of arms with a slightly smaller mean, and so on. With such a “pyramid-like” structure, it seems difficult to take the same path of arguments as that for the one-spike structure in [1]. In particular, it is not clear how to decompose the posterior distribution of the means of arms into a convex combination of a set of distributions, each of which is close to the same pyramid-like distribution. We note that such a decomposition is nontrivial even for the one-spike structure. Now with a pyramid-like structure we have to guarantee that the arms of the j-th level are chosen randomly from the remaining arms, for each level j, which appears technically challenging. ⁶ H is a standard parameter for describing the pull complexity of algorithms in the multi-armed bandits literature (see, e.g., [9]).
We take a different approach: we perform the round elimination on classes of input distributions. More precisely, we show that if there is no (r − 1)-round algorithm with a certain error probability and pull complexity on any distribution in a distribution class C_{r−1}, then there is no r-round algorithm with comparable error probability and pull complexity on any distribution in a distribution class C_r. When working with a class of distributions, we do not need to show that the posterior distribution of some input distribution is close to a particular distribution, but only that it belongs to the class C_{r−1}.
Although we now have more flexibility in selecting hard input distributions, we still want to find classes of distributions that are easy to work with. To this end we introduce two more ideas. First, at the beginning we sample the mean of each arm independently from the same distribution, in which the pyramid-like structure is encoded. We found that making the means of the arms independent of each other at any time (conditioned on the observations obtained so far) dramatically simplifies the analysis. Second, we choose to publish some arms after each round, so that the posterior distribution of the set of unpublished arms stays within the target distribution class. By publishing an arm we mean to fully exploit the arm and learn its mean exactly. With the ability to publish arms we can keep the classes of distributions relatively simple for the round elimination process.
Further, unlike [1], in which the set of arms pulled by each agent in each round is predetermined at the beginning of the round (i.e., the pulls are oblivious in each round), we allow the agents to act adaptively in each round. Allowing adaptivity inside each round adds another layer of technical challenge to our lower bound proof. Using a coupling-like argument, we manage to show that when the number of arms is smaller than the number of agents K, adaptive pulls do not have much advantage over oblivious pulls in each round. We note that such an argument does not hold when the number of arms exceeds K, and this is why the round lower bound we can prove in the adaptive case is smaller than the one in the oblivious case when the speedup is Ω̃(K). Surprisingly, this is almost the best that we can achieve – our next result shows that there is an adaptive algorithm with matching speedup using a matching number of rounds of computation.
Upper bound for fixed-time algorithms.
Our algorithm is conceptually simple, and proceeds in two phases. The goal of the first phase is to eliminate most of the suboptimal arms and make sure that the number of remaining arms is at most K, the number of agents. This is achieved by assigning each arm to a random agent; each agent then uses its time budget for this phase to identify the best arm among its assigned arms using a state-of-the-art centralized algorithm. Note that no communication is needed in this phase, and all the remaining rounds are left for the second phase, with the remaining time budget split among them. The goal of each round in the second phase is to further cut down the number of surviving arms, so that after the last round only the optimal arm survives. To achieve this, we spend the round's time budget uniformly on each remaining arm. We are able to prove that this simple strategy works, and our analysis crucially relies on the guarantee on the number of surviving arms at the beginning of each round.
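The two-phase structure can be sketched as follows (this is a schematic of our own with placeholder budgets and a simple empirical-mean subroutine standing in for the centralized fixed-time algorithm; it is not the algorithm analyzed in the paper):

```python
import math
import random

def two_phase_best_arm(means, K, T, seed=0):
    """Phase 1: partition the n arms randomly among the K agents; each
    agent keeps the empirically best arm in its bucket (a stand-in for
    a centralized fixed-time subroutine).  Phase 2: with at most K
    survivors, split the remaining budget evenly over halving rounds."""
    rng = random.Random(seed)
    n = len(means)
    buckets = [[] for _ in range(K)]
    for i in range(n):
        buckets[rng.randrange(K)].append(i)

    def emp_mean(i, t):
        return sum(1.0 if rng.random() < means[i] else 0.0 for _ in range(t)) / t

    # Phase 1: one survivor per non-empty bucket (no communication needed)
    budget1 = max(1, T // (2 * max(1, n // K + 1)))
    survivors = [max(b, key=lambda i: emp_mean(i, budget1)) for b in buckets if b]

    # Phase 2: halving rounds; K agents share the per-round budget evenly
    rounds = max(1, math.ceil(math.log2(max(2, len(survivors)))))
    per_round = T // (2 * rounds)
    for _ in range(rounds):
        t = max(1, per_round * K // len(survivors))
        survivors.sort(key=lambda i: -emp_mean(i, t))
        survivors = survivors[:max(1, len(survivors) // 2)]
    return survivors[0]
```

After phase 1 there are at most K surviving arms, so the uniform budget per arm in each phase-2 round stays large, which mirrors the guarantee the analysis relies on.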
We note that when , the speedup of our algorithm is , matching that of the round algorithm presented in [21]. Our algorithm also provides the optimal speedup guarantee for , matching our lower bound result mentioned above.
The algorithm mentioned above only guarantees to identify the best arm with constant error probability. When the input time horizon T is larger, one would expect an algorithm whose error probability diminishes exponentially in T. To this end, we strengthen our basic algorithm into a meta-algorithm that invokes the basic algorithm several times in parallel and returns the plurality vote. One technical difficulty here is that the optimal error probability depends on the input instance and is not known beforehand. One has to guess the right problem complexity and make sure that the basic algorithm does not consistently return the same suboptimal arm when the given time horizon is less than the problem complexity (otherwise the meta-algorithm would recognize the suboptimal arm as the best arm with high confidence).
We manage to resolve this issue via novel algorithmic ideas that may be applied to strengthen fixed-time bandit algorithms in general. In particular, in the first phase of our basic algorithm, we assign a random time budget (instead of the fixed budget described above) to the centralized algorithm invoked by each agent, and this proves to be useful for preventing the algorithm from identifying a suboptimal arm with overwhelmingly high probability. We note that in [21], the authors got around this problem by allowing the algorithm to have access to both the time horizon and the confidence parameters, which does not fall into the standard fixed-time category.
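The plurality-vote boosting step can be sketched as below; the random budget drawn from [T/2, T] reflects the randomized-budget device described above, while the base learner `stub` and the copy count are hypothetical stand-ins:

```python
import random
from collections import Counter

def plurality_boost(base_algorithm, T, copies=9, seed=0):
    """Run `copies` independent instances of a constant-error fixed-time
    base algorithm, each with a budget drawn uniformly from [T/2, T],
    and return the plurality vote among their answers."""
    rng = random.Random(seed)
    votes = Counter()
    for c in range(copies):
        budget = rng.randint(T // 2, T)
        votes[base_algorithm(budget, seed=seed + c)] += 1
    return votes.most_common(1)[0][0]

def stub(budget, seed=0):
    # hypothetical base learner: wrong (returns arm 1) on one third of
    # its runs, correct (returns arm 0) on the rest
    return 1 if seed % 3 == 0 else 0
```

If each copy errs with constant probability (and, crucially, not always on the same suboptimal arm), the plurality vote is wrong only when a near-majority of copies err, an event whose probability decays exponentially in the number of copies.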
Lower bound for fixed-confidence algorithms.
We first reduce the task of proving a round lower bound for best arm identification algorithms to that of showing a round lower bound for a closely related problem, SignId, which has proved to be a useful proxy in studying lower bounds for bandit exploration in the centralized setting [19, 22, 15]. The goal of SignId is to identify (with fixed confidence) whether the mean reward of the only input arm is greater or less than 1/2. The gap Δ between the mean of the arm and 1/2 corresponds to the gap parameter in the best arm identification problem, and our new task becomes to show a round lower bound for the SignId problem that increases as Δ approaches 0.
While our lower bound proof for the fixed-time setting can be viewed as a generalization of the round elimination technique, our lower bound for the SignId problem in the fixed-confidence setting uses a completely different approach, for the following reasons. First, the online learning algorithm that our lower bound is against aims at achieving an instance-dependent optimal time complexity as it gradually learns the underlying distribution. In other words, the hardness stems from the fact that the algorithm does not know the underlying distribution beforehand, while traditional round elimination proofs do not utilize this property. Second, our lower bound proof introduces a sequence of arm distributions and inductively shows that any algorithm needs at least k rounds on the k-th input distribution. While traditional round elimination achieves such an induction by embedding the (k − 1)-st input distribution into the k-th input distribution, it is not clear how to perform such an embedding in our proof, as our distributions are very different.
Intuitively, in our inductive proof we set the k-th input distribution to be the Bernoulli arm with gap Δ_k, where Δ_k decreases with k at a rate depending on K (the number of agents) and the speedup of the algorithm. We hope to show that any algorithm needs at least k rounds on the k-th input distribution. Suppose we have shown the lower bound for the k-th input distribution. Since the algorithm has the claimed speedup, it performs a bounded number of pulls on the k-th instance. We will show via a distribution exchange lemma (which will be explained in detail shortly) that this number of pulls is not sufficient to tell Δ_k from Δ_{k+1}. Hence the algorithm also makes too few pulls during the first k rounds on the (k + 1)-st instance to decide the sign of the (k + 1)-st instance. Therefore the algorithm needs at least k + 1 rounds on the (k + 1)-st instance, completing the induction.
To make the intuition rigorous, we need to strengthen our inductive hypothesis as follows. The goal of the k-th inductive step is to show that any algorithm needs at least k rounds on the k-th input distribution and makes a bounded number of pulls across the K agents during the first k rounds. While the base case holds straightforwardly, we go from the k-th inductive step to the (k + 1)-st inductive step via a progress lemma and the distribution exchange lemma mentioned above.
Given the hypothesis for the k-th inductive step, the progress lemma guarantees that the algorithm has to proceed to the (k + 1)-st round and perform more pulls. Thanks to the strengthened hypothesis, the total number of pulls performed in the first k rounds is bounded. Hence the statistical difference between the pulls drawn from the k-th input distribution and its negated distribution (where the outcomes 0 and 1 are flipped) is small due to Pinsker's inequality, and this is not enough for the algorithm to correctly decide the sign of the arm.
The distribution exchange lemma guarantees that the algorithm performs a bounded number of pulls across the K agents during the first k rounds on the (k + 1)-st input distribution. By setting Δ_{k+1} appropriately, one can verify the pull bound required by the hypothesis for the (k + 1)-st inductive step. The intuition behind the distribution exchange lemma is as follows. While the algorithm needs at least k rounds on the k-th input distribution (by the progress lemma), the algorithm cannot make too many pulls, by the speedup constraint. These pulls are not enough to tell the difference between the k-th and the (k + 1)-st distribution, and hence we can change the underlying distribution and show that the same happens for the (k + 1)-st input distribution.
However, this intuition is not easy to formalize. If we simply used the statistical difference between the distributions induced by Δ_k and Δ_{k+1} to upper bound the difference in each agent's behavior on the two input arms, we would incur a constant probability error for each agent; summed over all K agents, this becomes far too large. To overcome this difficulty, we need to prove a more refined probabilistic upper bound on the behavior discrepancy of each agent on different arms. This is achieved via a technical lemma that provides a much better upper bound on the difference between the probabilities that two product distributions assign to the same event, given that the event does not happen very often. This technical lemma may be of independent interest.
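The Pinsker step above can be illustrated numerically (parameter values here are purely illustrative): the KL divergence between Bernoulli arms is additive over a product of pulls, so the total variation distance after m pulls is at most sqrt(m · KL / 2), which stays a constant below 1 when m is on the order of 1/Δ².

```python
import math

def kl_bernoulli(p, q):
    """KL divergence KL(Ber(p) || Ber(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# m pulls from Ber(1/2 + eps) versus the negated arm Ber(1/2 - eps):
# KL is additive over the product distribution, and Pinsker gives
#   TV <= sqrt(m * KL(Ber(1/2 + eps) || Ber(1/2 - eps)) / 2).
eps, m = 0.01, 100
kl_one = kl_bernoulli(0.5 + eps, 0.5 - eps)
tv_bound = math.sqrt(m * kl_one / 2)
# with m ~ 1/eps^2 pulls the bound is a constant below 1, so the sign
# of the arm cannot yet be decided with high confidence
```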
3 Lower Bounds for Fixed-Time Distributed Algorithms
In this section we prove a lower bound for fixed-time collaborative learning algorithms. We start by considering the non-adaptive case, where in each round each agent fixes the (multi)set of arms to pull, as well as the order of the pulls, at the very beginning. We will then extend the proof to the adaptive case.
When we write x = y ± z, we mean that x is in the range [y − z, y + z].
3.1 Lower Bound for Non-Adaptive Algorithms
We prove the following theorem in this section.
Theorem 1.
For any , any speedup randomized non-adaptive algorithm for the fixed-time best arm identification problem in the collaborative learning model with K agents and n arms needs rounds in expectation.
Parameters.
We list a few parameters to be used in the proof. Let be the parameter in the statement of Theorem 1. Set (thus ), , , and .
3.1.1 The Class of Hard Distributions
We first define a class of distributions which is hard for the best arm identification problem.
Let be a parameter to be chosen later (in (7)). Define to be the class of distributions with support
such that if , then

(only defined for )

For any , , where is a normalization factor (to make ).
Note that when , only contains a single distribution; slightly abusing the notation, define to denote that particular distribution. For , define . That is, we set by default.
We introduce a few threshold parameters: , , . It is easy to see that .
The following lemma gives some basic properties of pulling from an arm with mean . We leave the proof to Appendix B.
Lemma 2.
Consider an arm with mean θ. We pull the arm m times. Let X_1, …, X_m be the pull outcomes, and let X̄ = (1/m) Σ_j X_j. We have the following.

If for , then with probability at least .

If for , then with probability at least .

If for , then with probability at least .
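Concentration statements of this kind can be checked empirically; the following sketch (values illustrative, not tied to the lemma's actual thresholds) compares the empirical deviation frequency of a Bernoulli mean estimate against the Hoeffding bound:

```python
import math
import random

def deviation_frequency(mean, pulls, radius, trials=2000, seed=0):
    """Fraction of trials in which the empirical mean of `pulls`
    Bernoulli(mean) samples deviates from `mean` by more than `radius`."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        s = sum(1 for _ in range(pulls) if rng.random() < mean)
        if abs(s / pulls - mean) > radius:
            bad += 1
    return bad / trials

# Hoeffding: P(|empirical mean - mean| > r) <= 2 * exp(-2 * pulls * r^2)
freq = deviation_frequency(0.5, 400, 0.1)
bound = 2 * math.exp(-2 * 400 * 0.1 ** 2)
```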
The next lemma states important properties of distributions in the classes defined above. Intuitively, if the mean of an arm is distributed according to some distribution in the class, then after pulling it a number of times, we learn by Lemma 2 that at least one of the following holds: (1) the sequence of pull outcomes is very rare; (2) very likely the mean of the arm is below a certain threshold; (3) very likely the mean of the arm is above a certain threshold. In the first two cases we publish the arm; that is, we fully exploit the arm and learn its mean exactly. We will show that if the arm is not published, then the posterior distribution of the mean of the arm (given the outcomes of the pulls) belongs to the next distribution class.
Lemma 3.
Consider an arm with mean where for some . We pull the arm times. Let be the pull outcomes, and let . If , then we publish the arm. Let be the posterior distribution of after observing . If the arm is not published, then we must have .
Proof.
We analyze the posterior distribution of after observing for any with .
Let denote the event that , and let denote the event that . Since , we have
(1) 
For the convenience of writing, let . Thus where . Let , and .
For any with , we have
(2)  
where
(3) 
We next analyze . For small enough , we have , and . Taking the natural logarithm on both sides of (3) and using two inequalities for and above, we have
(4)  
Plugging (4) back to (2), we have
(5) 
where the last inequality holds since and . Therefore satisfies the first condition of the distribution class .
For any with and , we have
(6)  
where is a normalization factor, and in the last equality, since , and , we can set . Therefore satisfies the second condition of the distribution class .
3.1.2 The Hard Input Distribution
Input Distribution :
We pick the hard input distribution for the best arm identification problem as follows: the mean of each of the arms is , where .
Set , where is the normalization factor of the distribution . This implies
(7) 
We try to use the running time of a good deterministic sequential algorithm as an “upper bound” for that of any collaborative learning algorithm we consider.
Lemma 4.
Given budget , the deterministic sequential algorithm in [3] has expected error at most on input distribution .
Proof.
We first bound the probability that there is only one best arm with mean when . Denote this event by .
(8) 
Given budget , the error of the algorithm in [3] (denoted by ) on an input instance is bounded by
(9) 
where
(10) 
where Δ_i is the difference between the mean of the best arm and that of the i-th best arm in the instance. We next upper bound this quantity conditioned on the event defined above.
Recall that in the distribution , for where is a normalization factor. Let be the number of arms with mean . By the Chernoff-Hoeffding bound and a union bound, we have that with probability , for all ,
Thus for a large enough universal constant , with probability ,
(11) 
Plugging (11) into (9), we get
(12) 
where the equality holds since and . Therefore, conditioned on event and under time budget , the expected error of on input distribution is at most . By (8) and (12), the expected error of under time budget on input distribution is at most . ∎
3.1.3 Proof of Theorem 1
We say a collaborative learning algorithm is m-cost if the total number of pulls made by the K agents is at most m. By Yao's Minimax Lemma [45], and the fact that any speedup-achieving collaborative learning algorithm yields, by Lemma 4, an m-cost collaborative learning algorithm for an appropriate m, Theorem 1 follows immediately from the following lemma.
Lemma 5.
Any deterministic m-cost non-adaptive algorithm that solves the best arm identification problem in the collaborative learning model with K agents and n arms with error probability at most on input distribution needs rounds.
Let . In the rest of this section we prove Lemma 5 by induction.
The Induction Step.
The following lemma intuitively states that if there is no good (r − 1)-round m-cost non-adaptive algorithm, then there is no good r-round m-cost non-adaptive algorithm.
Lemma 6.
For any , if there is no round cost deterministic non-adaptive algorithm with error probability on any input distribution in for any , then there is no round cost deterministic non-adaptive algorithm with error probability on any input distribution in for any .
Proof.
Consider any round cost deterministic non-adaptive algorithm that succeeds with probability on any input distribution in for any . Since we are considering a non-adaptive algorithm, at the beginning of the first round, the total numbers of pulls by the agents on each of the arms in the first round are fixed. Let (p_1, …, p_n) be such a pull configuration, where p_i denotes the number of pulls on the i-th arm. For an cost algorithm, by a simple counting argument, at least fraction of satisfies . Let be the set of arms with . Since
we have .
We augment the first round of Algorithm as follows.
Algorithm Augmentation.
We publish all arms in .
For the rest of the arms , we keep pulling them until the total number of pulls reaches . Let be the pull outcomes. If , we publish the arm.
If the number of unpublished arms is not in the range of , or there is a published arm with mean , then we return “error”.
We note that the first two steps will only help the algorithm, and thus will only lead to a stronger lower bound. We will show that the extra error introduced by the last step is small, which will be counted in the error probability increase in the induction.
The following claim bounds the number of arms that are not published after the first round.
Claim 7.
For any , with probability at least , the number of unpublished arms after the first round is in the range .
Proof.
For each arm , let be its mean where . Let be the indicator variable of the event that arm is not published. By Lemma 2,
By the Chernoff-Hoeffding bound, and the fact that we publish all arms in , we have
with probability . Plugging the fact that , we have that with probability over distribution ,
Therefore, if , then with probability , . ∎
The following claim shows that the best arm is not likely to be published in the first round.
Claim 8.
For any , the probability that there is a published arm with mean is at most .
Proof.
Since the input distribution to belongs to the class , the probability that contains an arm with mean , conditioned on , can be upper bounded by
For each of these arms, by Lemma 2 we have that if the arm has mean , then with probability at least we have . The claim follows by a union bound. ∎
By Claim 7, Claim 8 and Lemma 3 (which states that if an arm is not published, then its posterior distribution belongs to ), for , if there is no round cost algorithm with error probability on any input distribution in for any , then there is no round cost algorithm with error probability on any input distribution in for any , which proves Lemma 6. ∎
The Base Case.
Recall that in our collaborative learning model, if an algorithm uses round then it needs to output the answer immediately (without any further arm pull). We have the following lemma.
Lemma 9.
Any round deterministic algorithm must have error probability at least on any distribution in for any .
Proof.
First we have
(13) 
Thus the probability that there exists at least one arm with mean is