# Collaborative Learning with Limited Interaction: Tight Bounds for Distributed Exploration in Multi-Armed Bandits

Best arm identification (or, pure exploration) in multi-armed bandits is a fundamental problem in machine learning. In this paper we study the distributed version of this problem where we have multiple agents, and they want to learn the best arm collaboratively. We want to quantify the power of collaboration under limited interaction (or, communication steps), as interaction is expensive in many settings. We measure the running time of a distributed algorithm as the speedup over the best centralized algorithm where there is only one agent. We give almost tight round-speedup tradeoffs for this problem, along which we develop several new techniques for proving lower bounds on the number of communication steps under time or confidence constraints.

## 1 Introduction

One of the biggest challenges in machine learning is to make learning scalable. A natural way to speed up the learning process is to introduce multiple learners/agents, and let them learn the target function collaboratively. A fundamental question in this direction is to quantify the power of collaboration under limited interaction, as interaction is expensive in many settings. In this paper we approach this general question via the study of a central problem in online learning – best arm identification (or, pure exploration) in multi-armed bandits. We present efficient collaborative learning algorithms and complement them with almost tight lower bounds.

##### Best Arm Identification.

In multi-armed bandits (MAB) we have n alternative arms, where the i-th arm is associated with an unknown reward distribution D_i with mean θ_i. Without loss of generality we assume that each D_i has support on [0, 1]; this can always be satisfied with proper rescaling. We are interested in the best arm identification problem in MAB, in which we want to identify the arm with the largest mean. In the standard setting we only have one agent, who tries to identify the best arm by a sequence of arm pulls. Upon each pull of the i-th arm the agent observes an i.i.d. sample/reward from D_i. At any time step, the index of the next pull (or, the final output at the end of the game) is determined by the indices and outcomes of all previous pulls and the randomness of the algorithm (if any). Our goal is to identify the best arm using the minimum number of arm pulls, which is equivalent to minimizing the running time of the algorithm if we assume that each arm pull takes unit time.
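As a point of reference for the centralized setting just described, the sketch below simulates one simple single-agent strategy, successive halving. This is only an illustration: the Bernoulli arms, the budget split across elimination rounds, and all parameter choices are assumptions for this example, not the paper's algorithm.

```python
import random

def pull(mean):
    """One Bernoulli pull from an arm with the given (hidden) mean."""
    return 1 if random.random() < mean else 0

def successive_halving(means, budget):
    """Illustrative centralized fixed-time strategy: split the budget over
    ~log2(n) elimination rounds, pull surviving arms equally, keep the
    empirically better half each round."""
    alive = list(range(len(means)))
    rounds = max(1, len(means).bit_length() - 1)
    per_round = budget // rounds
    for _ in range(rounds):
        if len(alive) == 1:
            break
        per_arm = max(1, per_round // len(alive))
        scores = {i: sum(pull(means[i]) for _ in range(per_arm)) / per_arm
                  for i in alive}
        alive.sort(key=lambda i: scores[i], reverse=True)
        alive = alive[:max(1, len(alive) // 2)]
    return alive[0]

random.seed(0)
means = [0.5, 0.45, 0.4, 0.3, 0.8, 0.35, 0.4, 0.42]
best = successive_halving(means, budget=40000)  # arm 4 has the largest mean
```

With a budget this large relative to the gaps, the empirically best arm coincides with the true best arm with overwhelming probability.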

MAB has been studied for more than half a century [37, 20], due to its wide practical applications in clinical trials [36], adaptive routing [5], financial portfolio design [39], model selection [31], computer game play [40], and stories/ads display on websites [2], just to name a few. In many of these scenarios we are interested in finding the best arm (strategy, choice, etc.) as soon as possible and committing to it. For example, in the Monte Carlo Tree Search used by computer game-playing engines, we want to find the best move among a huge number of possible moves. In the task of high-quality website design, we hope to find the best design among a set of alternatives for display. In almost all such applications the arm pull is the most expensive component: in the real-time decision making of computer game play, it is time-expensive to perform a single Monte Carlo simulation; in website design tasks, having a user test each alternative is both time- and capital-expensive (often a fixed monetary reward is paid for each trial a tester carries out).

In the literature of best arm identification in MAB, two variants have been considered:

1. Fixed-time best arm: Given a time budget T, identify the best arm with the smallest error probability.[1]

[1] In the literature this is often called fixed-budget best arm. Here we use time instead of budget in order to be consistent with the collaborative learning setting, where it is easier to measure the performance of the algorithm by its running time.

2. Fixed-confidence best arm: Given an error probability δ, identify the best arm with error probability at most δ using the smallest amount of time.

We will study both variants in this paper.
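The two stopping criteria can be contrasted on a toy two-arm instance. The rule below is a generic Hoeffding-based fixed-confidence stopping rule, not an algorithm from the paper; the confidence radius and the instance are assumptions made for illustration.

```python
import math
import random

def pull(mean):
    return 1 if random.random() < mean else 0

def fixed_confidence_two_arms(means, delta):
    """Generic fixed-confidence rule for two arms: pull both until their
    Hoeffding confidence intervals separate, then declare the winner.
    The stopping time is data-dependent, unlike the fixed-time variant
    where the budget T is given up front."""
    counts = [0, 0]
    sums = [0, 0]
    while True:
        for i in (0, 1):
            sums[i] += pull(means[i])
            counts[i] += 1
        t = counts[0]
        # anytime Hoeffding radius (union bound over time steps)
        rad = math.sqrt(math.log(4 * t * t / delta) / (2 * t))
        gap = sums[0] / counts[0] - sums[1] / counts[1]
        if abs(gap) > 2 * rad:
            return (0 if gap > 0 else 1), counts[0] + counts[1]

random.seed(1)
winner, time_used = fixed_confidence_two_arms([0.7, 0.3], delta=0.05)
```

Note that `time_used` grows as the gap between the two means shrinks, which is exactly the instance-dependent behavior discussed later.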

##### Collaborative Best Arm Identification.

In this paper we study best arm identification in the collaborative learning model, where we have K agents who try to learn the best arm together. The learning proceeds in rounds. In each round each agent pulls a (multi)set of arms without communication. For each agent at any time step, based on the indices and outcomes of all previous pulls, all the messages received, and the randomness of the algorithm (if any), the agent, if not in the wait mode, takes one of the following actions: (1) makes the next pull; (2) requests a communication step and enters the wait mode; (3) terminates and outputs the answer. A communication step starts if all non-terminated agents are in the wait mode. After a communication step all non-terminated agents exit the wait mode and start a new round. During each communication step each agent can broadcast a message to every other agent. While we do not restrict the size of the message, in practice it will not be too large.[2] Once terminated, an agent takes no further actions. The algorithm terminates when all agents have terminated, at which point all agents should agree on the same best arm. The number of rounds of computation, denoted by R, is the number of communication steps plus one.

[2] The information of all pull outcomes of an agent can be described by an array of size at most n, with each coordinate storing a pair (c_i, s_i), where c_i is the number of pulls on the i-th arm, and s_i is the sum of the rewards of those pulls.

Our goal in the collaborative learning model is to minimize both the number of rounds R and the running time T = t_1 + … + t_R, where t_r is the maximum number of pulls made by any agent in round r. The motivation for minimizing R is that initiating a communication step always comes with a large time overhead, due to network bandwidth, latency, and protocol handshaking. Round-efficiency is one of the major concerns in all parallel/distributed computational models, such as the BSP model [42] and MapReduce [16]. The total cost of the algorithm is a weighted sum of R and T, where the coefficients depend on the concrete application. We are thus interested in the best round-time tradeoffs for collaborative best arm identification.
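The accounting of a round's cost can be made concrete in a small simulation: a round is communication-free, and its running time is the maximum number of pulls made by any single agent. The helper below and its budget parameters are hypothetical, used only to illustrate this cost model.

```python
import random

def pull(mean):
    return 1 if random.random() < mean else 0

def collaborative_round(means, assignments, budget_per_agent):
    """One communication-free round (illustrative helper): each agent pulls
    its assigned arms equally often and records (pull count, reward sum)
    per arm, which is what it would broadcast at the communication step."""
    reports = []
    for arms in assignments:
        stats = {}
        per_arm = max(1, budget_per_agent // max(1, len(arms)))
        for a in arms:
            stats[a] = (per_arm, sum(pull(means[a]) for _ in range(per_arm)))
        reports.append(stats)
    return reports

random.seed(2)
means = [0.6, 0.5, 0.4, 0.55]
assignments = [[0, 1], [2, 3]]   # 2 agents, 2 arms each
reports = collaborative_round(means, assignments, budget_per_agent=100)
# The round's contribution to T is the max pulls over agents;
# the total running time T is the sum of these maxima over rounds.
round_time = max(sum(c for c, _ in r.values()) for r in reports)
```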

##### Speedup in Collaborative Learning.

As the time complexity of best arm identification in the centralized setting is already well understood (see, e.g., [17, 30, 3, 23, 22, 24, 11, 15]), we would like to interpret the running time of a collaborative learning algorithm as the speedup over that of the best centralized algorithm, which also expresses the power of collaboration. Intuitively speaking, if the running time of the best centralized algorithm is T_C and that of a proposed collaborative learning algorithm A is T_A, then the speedup of A is T_C/T_A. However, due to the parameters in the definition of best arm identification and the instance-dependent bounds for the best centralized algorithms, the definition of the speedup of a collaborative learning algorithm needs to be a bit more involved.

Given an algorithm A, an input instance I, and a time budget T, let δ(A, I, T) be the error probability of A on I given time budget T. Given an algorithm A, an input instance I, and an error probability δ, let T(A, I, δ) be the smallest time needed for A to succeed on I with probability at least 1 − δ. Given two algorithms A, O and two time horizons T′, T, we say (A, T′) dominates (O, T), denoted by (A, T′) ≽ (O, T), if for any input instance I we have δ(A, I, T′) ≤ δ(O, I, T). We define the speedup of collaborative learning algorithms for the two variants of the best arm identification problem separately.

• Fixed-time: we define the speedup of a collaborative learning algorithm A as

 β_A = inf_T inf_{centralized alg O} sup_{T′ : (A,T′) ≽ (O,T)} T/T′.

That is, for each centralized algorithm O and time horizon T, we consider the ratio T/T′, where T′ is the smallest time horizon such that (A, T′) dominates (O, T). We then define the speedup β_A to be the worst-case such ratio over all centralized algorithms O and time horizons T.

• Fixed-confidence: we define the speedup of a collaborative learning algorithm A as

 β_A = inf_δ inf_{centralized alg O} inf_I T(O, I, δ)/T(A, I, δ).

That is, for each centralized algorithm O and error probability δ, we consider the worst-case ratio, over all possible inputs I, of the running time of O for achieving error probability δ on I to that of A for achieving error probability δ on I. We then define the speedup β_A to be the worst-case such ratio over all centralized algorithms O and error probabilities δ.

In both cases, let β_R = sup_A β_A, where the sup is taken over all R-round algorithms A for the collaborative learning model with K agents.[3]

[3] A similar concept of speedup was introduced in the previous work [21]. However, no formal definition was given in [21].

Clearly there is a tradeoff between R and β_R: when R = 1 (i.e., there is no communication step), each agent needs to solve the problem by itself, and thus collaboration yields essentially no speedup. When R increases, β_R may increase. On the other hand, we always have β_R ≤ K. Our goal is to find the best round-speedup tradeoffs, which is essentially equivalent to the round-time tradeoffs that we mentioned earlier.

As one of our goals is to understand the scalability of the learning process, we are particularly interested in one end of the tradeoff curve: what is the smallest R such that β_R is (up to logarithmic factors) linear in K? In other words, how many rounds are needed to make best arm identification fully scalable in the collaborative learning model? In this paper we will address this question by giving almost tight round-speedup tradeoffs.

##### Our Contributions.

Our results are shown in Table 1. For convenience we use the tilde notation on O and Ω to hide logarithmic factors, which will be made explicit in the actual theorems. Our contributions include:

1. Almost tight round-speedup tradeoffs for fixed-time. In particular, we show a lower bound on the number of rounds needed by any algorithm for the fixed-time best arm identification problem in the collaborative learning model with K agents to achieve near-K speedup. We complement this lower bound with an algorithm whose round complexity almost matches it.

2. Almost tight round-speedup tradeoffs for fixed-confidence. In particular, we show a lower bound on the number of rounds needed by any algorithm for the fixed-confidence best arm identification problem in the collaborative learning model with K agents to achieve near-K speedup, which almost matches the round complexity of an algorithm in [21].

3. A separation between the two problems. The two results above give a separation on the round complexity of fully scalable algorithms between the fixed-time case and the fixed-confidence case. In particular, the fixed-time case has smaller round complexity for a range of input instances, which indicates that knowing the "right" time budget is useful for reducing the number of rounds of the computation.

4. A generalization of the round-elimination technique. In the lower bound proof for the fixed-time case, we develop a new technique which can be seen as a generalization of the standard round-elimination technique: we perform the round reduction on classes of input distributions. We believe that this new technique will be useful for proving round-speedup tradeoffs for other problems in collaborative learning.

5. A new technique for instance-dependent round complexity. In the lower bound proof for the fixed-confidence case, we develop a new technique for proving instance-dependent lower bounds on the round complexity. The distribution exchange lemma we introduce for handling different input distributions at different rounds may be of independent interest.

##### Related Works.

There are two main research directions in the literature for MAB in the centralized setting: regret minimization and pure exploration. In the regret minimization setting (see, e.g., [4, 9, 27]), the player aims at maximizing the total reward gained within the time horizon, which is equivalent to minimizing the regret, defined as the difference between the total reward achieved by the offline optimal strategy (where all information about the input instance is known beforehand) and the total reward achieved by the player. In the pure exploration setting (see, e.g., [17, 18, 3, 23, 22, 15]), the goal is to maximize the probability of successfully identifying the best arm, while minimizing the number of sequential samples used by the player. Motivated by various applications, other exploration goals have also been studied, e.g., identifying the top-k best arms [10, 46, 13], and identifying the set of arms with means above a given threshold [29].

The collaborative learning model for MAB studied in this paper was first proposed by [21], and has proved to be practically useful – authors of [44] and [25] applied the model to distributed wireless network monitoring and collective sensemaking.

Agarwal et al. [1] studied the problem of minimum adaptivity needed in pure exploration. Their model can be viewed as a restricted collaborative learning model, where the agents are not fully adaptive and have to determine their strategy at the beginning of each round. Some solid bounds on the round complexity are proved in [1], including a lower bound using the round elimination technique. As we shall discuss shortly, we develop a generalized round elimination framework and prove a much better round complexity lower bound for a more sophisticated hard instance.

There are other works studying the regret minimization problem under various distributed computing settings. For example, motivated by the applications in cognitive radio network, a line of research (e.g., [28, 38, 7]) studied the regret minimization problem where the radio channels are modeled by the arms and the rewards represent the utilization rates of radio channels which could be deeply discounted if an arm is simultaneously played by multiple agents and a collision occurs. Regret minimization algorithms were also designed for the distributed settings with an underlying communication network for the peer-to-peer environments (e.g., [41, 26, 43]). In [6, 12], the authors studied distributed regret minimization in the adversarial case. Authors of [34] studied the regret minimization problem in the batched setting.

Blum et al. [8] studied PAC learning of a general function in the collaborative setting, and their results were further strengthened by [14, 33]. However, in the collaborative learning model they studied, each agent can only sample from one particular distribution; their model is thus different from the one this paper focuses on.

## 2 Techniques Overview

In this section we summarize the high level ideas of our algorithms and lower bounds. For convenience, the parameters used in this overview are only for illustration purposes.

##### Lower bound for fixed-time algorithms.

A standard technique for proving round lower bounds in communication/sample complexity is round elimination [32]. Roughly speaking, one shows that if there exists an r-round algorithm with a given error probability and sample complexity on an input distribution σ_r, then there also exists an (r−1)-round algorithm with a slightly worse error probability and sample complexity on a related input distribution σ_{r−1}. Finally, one shows that there is no 0-round algorithm with small error probability on a nontrivial input distribution σ_0.

In [1] the authors used the round elimination technique to prove an Ω(log* n) round lower bound for the best arm identification problem under a fixed total pull budget.[5] In their hard input there is a single best arm with mean slightly above 1/2, and all remaining arms have mean 1/2. This "one-spike" structure makes it relatively easy to perform the standard round elimination. The basic arguments in [1] go as follows: suppose the best arm is chosen from the n arms uniformly at random. If the agents do not make enough pulls in the first round, then conditioned on the pull outcomes of the first round, the posterior distribution of the index of the best arm can be written as a convex combination of a set of distributions, each of which has a sufficiently large support, is close (in terms of the total variation distance) to the uniform distribution on its support, and is thus again hard for an algorithm with one fewer round. This step can be seen as an input embedding.

[5] log* n is the number of times the logarithm function must be iteratively applied to n before the result is less than or equal to 1.
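The "one-spike" hard input described above can be sketched as follows; the gap value is an illustrative assumption.

```python
import random

def one_spike_instance(n, gap, rng):
    """One-spike hard input (illustrative): a uniformly random arm gets
    mean 1/2 + gap, every other arm gets mean exactly 1/2. Conditioned on
    few uninformative pulls, the posterior of the spike's index stays
    close to uniform, which is what round elimination exploits."""
    means = [0.5] * n
    best = rng.randrange(n)
    means[best] += gap
    return means, best

rng = random.Random(3)
means, best = one_spike_instance(16, 0.1, rng)
```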

However, since our goal is to prove a much higher, logarithmic round lower bound, we have to restrict the total pull budget in terms of the instance-dependent parameter H = Σ_{i≥2} Δ_i^{−2} (where Δ_i is the difference between the mean of the best arm and that of the i-th best arm in the input), and create a hard input distribution with logarithmically many levels of arms in terms of their means.[6] Roughly speaking, we take a few random arms and assign them one mean value, a larger set of random arms and assign them a slightly smaller mean, and so on. With such a "pyramid-like" structure, it seems difficult to take the same path of arguments as that for the one-spike structure in [1]. In particular, it is not clear how to decompose the posterior distribution of the means of the arms into a convex combination of a set of distributions, each of which is close to the same pyramid-like distribution. We note that such a decomposition is non-trivial even for the one-spike structure. With a pyramid-like structure we would further have to guarantee that the arms of the ℓ-th level are chosen randomly from the remaining arms for each level ℓ, which looks technically challenging.

[6] H is a standard parameter for describing the pull complexity of algorithms in the multi-armed bandits literature (see, e.g., [9]).
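A pyramid-like instance and the instance-dependent parameter H can be sketched as follows; the level means and level sizes here are illustrative assumptions, not the paper's exact construction.

```python
def pyramid_instance(levels, arms_per_level):
    """Pyramid-like hard instance sketch: one best arm with mean 1/2, and
    level j holding arms_per_level[j-1] arms whose gap to the best mean
    shrinks geometrically with j (values chosen for illustration)."""
    means = [0.5]  # the single best arm
    for j in range(1, levels + 1):
        means += [0.5 - 2.0 ** -j / 4] * arms_per_level[j - 1]
    return means

def hardness_H(means):
    """H = sum over suboptimal arms i of Delta_i^-2, where Delta_i is the
    gap between the best mean and the i-th arm's mean."""
    best = max(means)
    return sum((best - m) ** -2 for m in means if m < best)

means = pyramid_instance(3, [2, 4, 8])
H = hardness_H(means)  # deeper levels dominate H despite tiny gaps
```

Note how the lowest level (smallest gaps, most arms) contributes the bulk of H, which is why the hard instance stacks many near-optimal arms.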

We take a different approach: we perform the round elimination on classes of input distributions. More precisely, we show that if there is no r-round algorithm with a given error probability and pull complexity on any distribution in one distribution class, then there is no (r+1)-round algorithm with comparable error probability and pull complexity on any distribution in the next distribution class. When working with a class of distributions, we do not need to show that the posterior distribution of some input distribution is close to a particular distribution, but only that it stays within the class.

Although we now have more flexibility in selecting hard input distributions, we still want to find classes of distributions that are easy to work with. To this end we introduce two more ideas. First, at the very beginning we sample the mean of each arm independently from the same distribution, in which the pyramid-like structure is encoded. We found that keeping the means of the arms independent of each other at all times (conditioned on the observations obtained so far) dramatically simplifies the analysis. Second, we choose to publish some arms after each round so that the posterior distribution of each unpublished arm stays within the distribution class. By publishing an arm we mean to fully exploit the arm and learn its mean exactly. With the ability to publish arms we can keep the classes of distributions relatively simple for the round elimination process.

A further difference from [1], in which the (multi)set of arms pulled by each agent in each round is pre-determined at the beginning of the round (i.e., the pulls in each round are oblivious), is that we allow the agents to act adaptively within each round. Allowing adaptivity inside each round adds another layer of technical challenge to our lower bound proof. Using a coupling-like argument, we manage to show that when the number of arms is smaller than the number of agents K, adaptive pulls do not have much advantage over oblivious pulls in each round. We note that such an argument does not hold when the number of arms exceeds K, and this is why the round lower bound we can prove in the adaptive case is weaker than the one we can prove in the oblivious case at near-full speedup. Surprisingly, this is almost the best one can achieve: our next result shows that there is an adaptive algorithm with near-optimal speedup whose round complexity almost matches the adaptive lower bound.

##### Upper bound for fixed-time algorithms.

Our algorithm is conceptually simple, and runs in two phases. The goal of the first phase is to eliminate most of the suboptimal arms and make sure that the number of remaining arms is at most K, the number of agents. This is achieved by assigning each arm to a random agent, and letting each agent use its time budget for the phase to identify the best arm among its assigned arms using a state-of-the-art centralized algorithm. Note that no communication is needed in this phase, so all the remaining rounds are left for the second phase, and each of those rounds is allotted an equal share of the time budget. The goal of each round in the second phase is to further reduce the number of remaining arms, so that after the last round only the optimal arm survives. To achieve this, we spend the round's time budget uniformly on each remaining arm. We are able to prove that this simple strategy works, and our analysis crucially relies on the guarantee on the number of arms remaining at the beginning of each round.
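The two-phase strategy described above can be sketched as follows. This is a simplification: the budgets, the halving schedule in the second phase, and the use of plain empirical means in place of a state-of-the-art centralized subroutine are all illustrative assumptions.

```python
import random

def pull(mean, rng):
    return 1 if rng.random() < mean else 0

def two_phase_collaborative(means, K, rounds2, budget_per_round, seed=0):
    """Sketch of the two-phase strategy (simplified, not the paper's exact
    parameters)."""
    rng = random.Random(seed)
    n = len(means)
    # Phase 1 (no communication): assign each arm to a random agent;
    # each agent keeps the empirically best of its assigned arms.
    buckets = [[] for _ in range(K)]
    for a in range(n):
        buckets[rng.randrange(K)].append(a)
    survivors = []
    for arms in buckets:
        if not arms:
            continue
        per_arm = max(1, budget_per_round // len(arms))
        est = {a: sum(pull(means[a], rng) for _ in range(per_arm)) / per_arm
               for a in arms}
        survivors.append(max(est, key=est.get))
    # Phase 2: at most K arms remain; each round spends the (pooled) budget
    # uniformly over the remaining arms and keeps the better half.
    for _ in range(rounds2):
        if len(survivors) == 1:
            break
        per_arm = max(1, K * budget_per_round // len(survivors))
        est = {a: sum(pull(means[a], rng) for _ in range(per_arm)) / per_arm
               for a in survivors}
        survivors.sort(key=lambda a: est[a], reverse=True)
        survivors = survivors[:max(1, len(survivors) // 2)]
    return survivors[0]

means = [0.5] * 15 + [0.75]   # arm 15 is the best
best = two_phase_collaborative(means, K=4, rounds2=4, budget_per_round=2000)
```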

We note that when , the speedup of our algorithm is , matching that of the -round algorithm presented in [21]. Our algorithm also provides the optimal speedup guarantee for , matching our lower bound result mentioned above.

The algorithm mentioned above is only guaranteed to identify the best arm with constant error probability. When the input time horizon is larger, one would expect an algorithm whose error probability diminishes exponentially in the time horizon. To this end, we strengthen our basic algorithm to a meta-algorithm that invokes the basic algorithm several times in parallel and returns the plurality vote. One technical difficulty here is that the optimal error probability depends on the input instance and is not known beforehand. One has to guess the right problem complexity and make sure that the basic algorithm does not consistently return the same suboptimal arm when the given time horizon is less than the problem complexity (otherwise the meta-algorithm would recognize a suboptimal arm as the best arm with high confidence).

We manage to resolve this issue via novel algorithmic ideas that may be applied to strengthen fixed-time bandit algorithms in general. In particular, in the first phase of our basic algorithm, we assign a random time budget (instead of the fixed budget described above) to the centralized algorithm invoked by each agent, which proves to be useful for preventing the algorithm from identifying a suboptimal arm with overwhelmingly high probability. We note that in [21], the authors got around this problem by allowing the algorithm access to both the time horizon and the confidence parameter, which does not fall into the standard fixed-time category.
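A minimal sketch of the boosting idea follows, with a hypothetical noisy base algorithm standing in for the basic algorithm; the random budget choices and the base algorithm's success probabilities are invented for illustration.

```python
import random
from collections import Counter

def boost_by_plurality(base_algorithm, repetitions, rng):
    """Meta-algorithm sketch: run the basic algorithm several times (here
    sequentially; in the paper the copies run in parallel) with a *random*
    time budget each run, and return the plurality vote. Randomizing the
    budget is the device described above for avoiding a consistently
    returned suboptimal arm."""
    votes = Counter()
    for _ in range(repetitions):
        budget = rng.choice([1000, 2000, 4000])  # illustrative budgets
        votes[base_algorithm(budget, rng)] += 1
    return votes.most_common(1)[0][0]

def noisy_base(budget, rng):
    # Hypothetical basic algorithm: returns the true best arm (arm 2) with
    # probability growing in the budget, else errs on a random arm.
    p_correct = 1 - 0.5 * 1000.0 / budget
    return 2 if rng.random() < p_correct else rng.randrange(4)

rng = random.Random(4)
answer = boost_by_plurality(noisy_base, 25, rng)
```

Because wrong answers are spread over many arms while correct runs concentrate on one, the plurality vote fails only exponentially rarely in the number of repetitions.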

##### Lower bound for fixed-confidence algorithms.

We first reduce the task of proving a lower bound for best arm identification algorithms to that of showing a round lower bound for a closely related problem, SignId, which has proved to be a useful proxy in studying lower bounds for bandit exploration in the centralized setting [19, 22, 15]. The goal of SignId is to identify (with fixed confidence) whether the mean reward of the only input arm is greater or less than 1/2. The difference between 1/2 and the mean of the arm, denoted by Δ, corresponds to the gap parameter in the best arm identification problem, and our new task becomes to show a round lower bound for the SignId problem that increases as Δ approaches 0.
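A generic fixed-confidence strategy for SignId (a simple Hoeffding-based rule, not the one analyzed in the paper) might look like:

```python
import math
import random

def sign_id(pull_fn, delta):
    """SignId sketch: keep pulling the single arm until an anytime Hoeffding
    interval around the empirical mean excludes 1/2, then report the sign
    of (empirical mean - 1/2). The number of pulls needed scales roughly
    like 1/Delta^2, where Delta = |mean - 1/2|."""
    total, count = 0, 0
    while True:
        total += pull_fn()
        count += 1
        rad = math.sqrt(math.log(4 * count * count / delta) / (2 * count))
        mean = total / count
        if abs(mean - 0.5) > rad:
            return (1 if mean > 0.5 else -1), count

rng = random.Random(5)
sign, pulls = sign_id(lambda: 1 if rng.random() < 0.5 + 0.1 else 0, delta=0.05)
```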

While our lower bound proof for the fixed-time setting can be viewed as a generalization of the round elimination technique, our lower bound for the SignId problem in the fixed-confidence setting uses a completely different approach, for the following reasons. First, the online learning algorithm that our lower bound is against aims at achieving an instance-dependent optimal time complexity as it gradually learns the underlying distribution. In other words, the hardness stems from the fact that the algorithm does not know the underlying distribution beforehand, while traditional round elimination proofs do not utilize this property. Second, our lower bound proof introduces a sequence of arm distributions and inductively shows that any algorithm needs at least k rounds on the k-th input distribution. While traditional round elimination achieves such an induction by embedding the (k−1)-st input distribution into the k-th input distribution, it is not clear how to perform such an embedding in our proof, as our distributions are very different.

Intuitively, in our inductive proof we set the k-th input distribution to be a Bernoulli arm whose gap Δ_k depends on K (the number of agents) and on the speedup of the algorithm. We hope to show that any algorithm needs at least k rounds on the k-th input distribution. Suppose we have shown the lower bound for the k-th input distribution. Since the algorithm has the claimed speedup, it performs a bounded number of pulls on the k-th instance. We will show via a distribution exchange lemma (which will be explained in detail shortly) that this number of pulls is not sufficient to tell the k-th distribution from the (k+1)-st. Hence the algorithm also uses only a bounded number of pulls during its first k rounds on the (k+1)-st instance, which is not sufficient to decide the sign of the (k+1)-st instance. Therefore the algorithm needs at least k+1 rounds on the (k+1)-st instance, completing the induction for the (k+1)-st instance.

To make the intuition rigorous, we need to strengthen our inductive hypothesis as follows. The goal of the k-th inductive step is to show that any algorithm needs at least k rounds on the k-th input distribution and makes a bounded number of pulls across the K agents during its first k−1 rounds. While the base case holds straightforwardly, we go from the k-th inductive step to the (k+1)-st inductive step via a progress lemma and the distribution exchange lemma mentioned above.

Given the hypothesis for the k-th inductive step, the progress lemma guarantees that the algorithm has to proceed to the (k+1)-st round and perform more pulls. Thanks to the strengthened hypothesis, the total number of pulls performed in the first k rounds is bounded. Hence the statistical difference between the pulls drawn from the k-th input distribution and its negated distribution (where the outcomes 0 and 1 are flipped) is small by Pinsker's inequality, and this is not enough for the algorithm to correctly decide the sign of the arm.
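The Pinsker step can be checked numerically: with m pulls of a Bernoulli(1/2 + Δ) arm versus its negated Bernoulli(1/2 − Δ) arm, the total variation distance is at most sqrt(m · KL / 2), which is o(1) whenever m = o(1/Δ²). The numbers below are only a sanity check of this bound.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def pinsker_tv_bound(m, delta):
    """Pinsker's inequality: TV between m i.i.d. pulls of Bernoulli(1/2+delta)
    and m pulls of Bernoulli(1/2-delta) is at most sqrt(m * KL / 2), with KL
    the per-pull divergence (KL tensorizes over independent pulls)."""
    kl = kl_bernoulli(0.5 + delta, 0.5 - delta)
    return math.sqrt(m * kl / 2)

delta = 0.01                               # 1/delta^2 = 10^4 pulls needed
bound_few = pinsker_tv_bound(50, delta)    # m << 1/delta^2: TV stays small
bound_many = pinsker_tv_bound(10**6, delta)  # m >> 1/delta^2: bound is vacuous
```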

The distribution exchange lemma guarantees that the algorithm performs only a bounded number of pulls across the K agents during its first k rounds on the (k+1)-st input distribution. With a suitable setting of Δ_{k+1}, one can then verify the hypothesis for the (k+1)-st inductive step. The intuition behind the distribution exchange lemma is as follows. While the algorithm needs k rounds on the k-th input distribution (by the progress lemma), the speedup constraint implies that the algorithm cannot make too many pulls on it. That many pulls are not enough to tell the difference between the k-th and the (k+1)-st distributions, and hence we can change the underlying distribution and show that the same pull bound holds for the (k+1)-st input distribution.

However, this intuition is not easy to formalize. If we simply use the statistical difference between the distributions induced by the two arms to upper bound the difference in each agent's behavior on the two input arms, we incur a constant probability error for each agent; summed over all K agents, this is too much. To overcome this difficulty, we need to prove a more refined probabilistic upper bound on the behavior discrepancy of each agent on different arms. This is achieved via a technical lemma that provides a much better upper bound on the difference between the probabilities that two product distributions assign to the same event, given that the event does not happen very often. This technical lemma may be of independent interest.

## 3 Lower Bounds for Fixed-Time Distributed Algorithms

In this section we prove a lower bound for the fixed-time collaborative learning algorithms. We start by considering the non-adaptive case, where in each round each agent fixes the (multi-)set of arms to pull as well as the order of the pulls at the very beginning. We will then extend the proof to the adaptive case.

When we write x = y ± z, we mean that x lies in the range [y − z, y + z].

### 3.1 Lower Bound for Non-Adaptive Algorithms

We prove the following theorem in this section.

###### Theorem 1.

For any α, any (K/α)-speedup randomized non-adaptive algorithm for the fixed-time best arm identification problem in the collaborative learning model with K agents and n arms needs Ω(ln n/(ln ln n + ln α)) rounds in expectation.

##### Parameters.

We list a few parameters to be used in the proof. Let be the parameter in the statement of Theorem 1. Set (thus ), , , and .

#### 3.1.1 The Class of Hard Distributions

We first define a class of distributions which is hard for the best arm identification problem.

Let L be a parameter to be chosen later (in (7)). Define Λ_j to be the class of distributions with support

 {B^{−1}, …, B^{−(j−1)}, B^{−j}, …, B^{−L}},

such that if X is distributed according to a member of Λ_j, then

1. Pr[X > B^{−j}] ≤ n^{−9} (only defined for j ≥ 2);

2. For any ℓ ∈ {j, …, L}, Pr[X = B^{−ℓ}] = λ_j ⋅ B^{−2ℓ}(1 ± ρ^{−ℓ}η), where λ_j is a normalization factor (to make the probabilities sum to 1).

Note that when j = L, Λ_L contains only a single distribution; slightly abusing the notation, we also use Λ_L to denote that particular distribution. For j > L, define Λ_j = Λ_L. That is, we set Λ_j to be Λ_L by default.
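Dropping the (1 ± ρ^{−ℓ}η) slack, a representative member of the class described above can be constructed as follows; the values of B, j, and L are illustrative.

```python
def hard_distribution(B, j, L):
    """Sketch of a distribution in the class described above: support
    {B^-j, ..., B^-L}, with Pr[X = B^-l] proportional to B^-2l (the
    multiplicative slack term is dropped for simplicity)."""
    support = [B ** -l for l in range(j, L + 1)]
    weights = [B ** (-2 * l) for l in range(j, L + 1)]
    lam = 1.0 / sum(weights)        # normalization factor lambda_j
    probs = [lam * w for w in weights]
    return support, probs

support, probs = hard_distribution(B=2, j=1, L=4)
```

Note that consecutive levels differ in probability by a factor of B², so most of the mass sits on the largest surviving mean.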

We introduce a few threshold parameters: , , . It is easy to see that .

The following lemma gives some basic properties of pulling from an arm with mean . We leave the proof to Appendix B.

###### Lemma 2.

Consider an arm with mean . We pull the arm times. Let be the pull outcomes, and let . We have the following.

1. If for , then with probability at least .

2. If for , then with probability at least .

3. If for , then with probability at least .

The next lemma states important properties of the distribution classes defined above. Intuitively, if the mean of an arm is distributed according to some distribution in a given class, then after pulling the arm a number of times, we learn by Lemma 2 that at least one of the following holds: (1) the sequence of pull outcomes is very rare; (2) very likely the mean of the arm is at most the lower threshold; (3) very likely the mean of the arm is more than the upper threshold. In the first two cases we publish the arm, that is, we fully exploit the arm and learn its mean exactly. We will show that if the arm is not published, then the posterior distribution of the mean of the arm (given the outcomes of the pulls) still belongs to one of the classes defined above.

###### Lemma 3.

Consider an arm with mean where for some . We pull the arm times. Let be the pull outcomes, and let . If , then we publish the arm. Let be the posterior distribution of after observing . If the arm is not published, then we must have .

###### Proof.

We analyze the posterior distribution of after observing for any with .

Let denote the event that , and let denote the event that . Since , we have

$$\Pr[\chi>j]\ \ge\ \frac{1}{10B^2}. \tag{1}$$

For notational convenience, let . Thus where . Let , and .

For any with , we have

$$\begin{aligned}
\Pr[\chi\le j \mid \Theta=\theta]
&= \frac{\Pr[\Theta=\theta \mid \chi\le j]\cdot\Pr[\chi\le j]}{\Pr[\Theta=\theta]} \\
&= \frac{\Pr[\Theta=\theta \mid \chi\le j]\cdot\Pr[\chi\le j]}{\Pr[\Theta=\theta \mid \chi\le j]\cdot\Pr[\chi\le j]+\Pr[\Theta=\theta \mid \chi> j]\cdot\Pr[\chi> j]} \\
&\le \frac{\Pr[\Theta=\theta \mid X=\epsilon]}{\Pr[\Theta=\theta \mid X=\epsilon']\cdot 1/(10B^2)} \qquad\text{(by (1) and monotonicity)} \\
&\le 10B^2\cdot\frac{(1/2-\epsilon)^{\zeta_1}(1/2+\epsilon)^{m-\zeta_1}}{(1/2-\epsilon')^{\zeta_1}(1/2+\epsilon')^{m-\zeta_1}} \qquad\text{(by monotonicity)} \\
&= 10B^2\cdot A^m,
\end{aligned} \tag{2}$$

where

$$A=\frac{(1-2\epsilon)^{1/2-z}\,(1+2\epsilon)^{1/2+z}}{(1-2\epsilon')^{1/2-z}\,(1+2\epsilon')^{1/2+z}}. \tag{3}$$

We next analyze . For small enough , we have , and . Taking the natural logarithm on both sides of (3) and using two inequalities for and above, we have
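For concreteness, the first step of this computation is simply the logarithm of (3), split according to the two exponents:

$$\ln A=\left(\tfrac{1}{2}-z\right)\ln\frac{1-2\epsilon}{1-2\epsilon'}+\left(\tfrac{1}{2}+z\right)\ln\frac{1+2\epsilon}{1+2\epsilon'},$$

and (4) is then obtained by bounding each logarithm with the quadratic and cubic expansions just mentioned.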

$$\begin{aligned}
\ln A &\le \left(\tfrac{1}{2}-z\right)\left(-2\epsilon-2\epsilon^2+2\epsilon'+2(\epsilon')^2\right)+\left(\tfrac{1}{2}+z\right)\left(2\epsilon-2\epsilon^2+8\epsilon^3-2\epsilon'+2(\epsilon')^2+8(\epsilon')^3\right) \\
&= \tfrac{1}{2}\cdot\left(-4\epsilon^2+8\epsilon^3+4(\epsilon')^2+8(\epsilon')^3\right)+z\left(4\epsilon+8\epsilon^3-4\epsilon'+8(\epsilon')^3\right) \\
&\le -2B^{-2j}+4B^{-j}\left(B^{-1}+\sqrt{10\ln n/\gamma}\right)B^{-j}+O(B^{-2j-1}) \\
&\le -B^{-2j}.
\end{aligned} \tag{4}$$

Plugging (4) back to (2), we have

$$\Pr[\chi\le j \mid \Theta=\theta]\ \le\ 10B^2\cdot e^{-B^{-2j}\cdot\gamma B^{2j}}\ \le\ n^{-9}, \tag{5}$$

where the last inequality holds since and . Therefore satisfies the first condition of the distribution class .

For any with and , we have

$$\begin{aligned}
\Pr[X=B^{-\ell} \mid \Theta=\theta]
&= \frac{\Pr[\Theta=\theta \mid X=B^{-\ell}]\cdot\Pr[X=B^{-\ell}]}{\Pr[\Theta=\theta]} \\
&= \frac{1}{\Pr[\Theta=\theta]}\cdot\left(\frac{1}{2\sqrt{2\pi\gamma B^{2j}}}\cdot\frac{1}{\sqrt{1-4B^{-2\ell}}}\cdot\left(1\pm B^{-\ell}\right)^{B^{j+1}/100}\right)\cdot\lambda_j B^{-2\ell}\left(1\pm\rho^{-\ell}\eta\right) \\
&= \left(\frac{1}{\Pr[\Theta=\theta]}\cdot\frac{1}{2\sqrt{2\pi\gamma B^{2j}}}\cdot\lambda_j\right)\cdot\frac{1}{\sqrt{1-4B^{-2\ell}}}\cdot\left(1\pm B^{-\ell}\right)^{B^{j+1}/100}\cdot B^{-2\ell}\left(1\pm\rho^{-\ell}\eta\right) \\
&= \lambda'_j\cdot\left(1\pm 3B^{-2\ell}\right)\cdot\left(1\pm B^{-\ell+j+1}/50\right)\cdot B^{-2\ell}\left(1\pm\rho^{-\ell}\eta\right) \\
&= \lambda'_j\cdot B^{-2\ell}\left(1\pm\rho^{-\ell}\eta'\right),
\end{aligned} \tag{6}$$

where is a normalization factor, and in the last equality, since , and , we can set . Therefore satisfies the second condition of the distribution class .

By (2) and (6), we have . ∎

#### 3.1.2 The Hard Input Distribution

##### Input Distribution σ:

We pick the hard input distribution for the best arm identification problem as follows: the mean of each of the arms is , where .

Set , where is the normalization factor of the distribution . This implies

$$L=\frac{\ln(n\lambda_1)}{2\ln B}=\Theta\!\left(\frac{\ln n}{\ln\ln n+\ln\alpha}\right). \tag{7}$$
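As an illustration, one can sample arm means from this kind of hard distribution. The sketch below is a plausible reading in which the gap of each arm is B^(-ℓ) with probability proportional to B^(-2ℓ); the exact normalization constant λ₁ was lost in extraction, so the constants here are assumptions.

```python
import random

def sample_gap(B, L, rng):
    """Sample a gap Delta = B^(-l), l in {1,...,L}, with probability
    proportional to B^(-2l) -- a plausible reading of the class; the
    paper's exact normalization lambda_1 is not reproduced here."""
    weights = [B ** (-2 * l) for l in range(1, L + 1)]
    r = rng.random() * sum(weights)
    for l, w in enumerate(weights, start=1):
        r -= w
        if r <= 0:
            return B ** (-l)
    return B ** (-L)  # numerical fallback

def sample_instance(n, B, L, rng):
    """Means of the n arms under the hard input distribution:
    each mean is 1/2 + Delta with Delta drawn i.i.d. as above."""
    return [0.5 + sample_gap(B, L, rng) for _ in range(n)]
```

Under this weighting, most arms have the largest gap B^(-1), and the expected number of arms at gap B^(-ℓ) decays geometrically in ℓ, mirroring the counts k_ℓ used later in the proof of Lemma 4.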

We try to use the running time of a good deterministic sequential algorithm as an “upper bound” for that of any collaborative learning algorithm we consider.

###### Lemma 4.

Given budget , the deterministic sequential algorithm in [3] has expected error at most on input distribution .

###### Proof.

We first bound the probability that there is only one best arm with mean when . Denote this event by .

$$\Pr[E_0]=n\cdot\lambda_1 B^{-2L}\left(1-\lambda_1 B^{-2L}\right)^{n-1}=\left(1-1/n\right)^{n-1}\ \ge\ 1/e. \tag{8}$$

Given budget , the error of the algorithm in [3] (denoted by ) on an input instance is bounded by

$$\mathrm{err}(I)\ \le\ n^2\cdot\exp\!\left(-\frac{W}{2\ln n\cdot H(I)}\right), \tag{9}$$

where

$$H(I)=\sum_{i=2}^{n}\frac{1}{\Delta_i^2}\,, \tag{10}$$

where is the difference between the mean of the best arm and that of the -th best arm in . We try to upper bound when conditioned on event .

Recall that in the distribution , for where is a normalization factor. Let be the number of arms with mean . By the Chernoff-Hoeffding bound and a union bound, with probability , for all ,

$$k_\ell=\Theta\!\left(\lambda_1 B^{-2\ell} n\right)=\Theta\!\left(B^{2L-2\ell}\right).$$

Thus for a large enough universal constant , with probability ,

$$H(I)=\sum_{\ell=1}^{L-1}k_\ell\cdot\frac{1}{\left(B^{-\ell}-B^{-L}\right)^2}\ \le\ c_H\, L\, B^{2L}. \tag{11}$$

Plugging (11) into (9), we get

$$\mathrm{err}(I)\ \le\ n^2\cdot\exp\!\left(-\frac{n\ln^3 n\cdot B^2}{2\ln n\cdot c_H L B^{2L}}\right)=o(1), \tag{12}$$

where the equality holds since and . Therefore, conditioned on event and under time budget , the expected error of on input distribution is at most . By (8) and (12), the expected error of under time budget on input distribution is at most . ∎
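The hardness quantity H(I) of (10) is easy to compute for a concrete instance. A minimal sketch, assuming (as in the conditioning on event E₀ in the proof) a unique best arm:

```python
def hardness(means):
    """H(I) = sum over i >= 2 of 1 / Delta_i^2, where Delta_i is the
    gap between the best mean and the i-th best mean (equation (10)).
    Assumes a unique best arm, as in the conditioning on E_0."""
    ordered = sorted(means, reverse=True)
    best = ordered[0]
    return sum(1.0 / (best - m) ** 2 for m in ordered[1:])
```

For instances drawn from the hard distribution, (11) says this quantity concentrates at Θ(L·B^(2L)): each gap level contributes roughly B^(2L) because the number of arms at level ℓ scales as B^(2L-2ℓ) while each such arm contributes about B^(2ℓ).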

#### 3.1.3 Proof of Theorem 1

We say a collaborative learning algorithm is -cost if the total number of pulls made by the agents is . By Yao’s minimax lemma [45], together with the fact that any -speedup collaborative learning algorithm yields, via Lemma 4, an -cost collaborative learning algorithm, Theorem 1 follows immediately from the following lemma.

###### Lemma 5.

Any deterministic -cost non-adaptive algorithm that solves the best arm identification problem in the collaborative learning model with agents and arms with error probability at most on input distribution needs rounds.

Let . In the rest of this section we prove Lemma 5 by induction.

##### The Induction Step.

The following lemma intuitively states that if there is no good -round -cost non-adaptive algorithm, then there is no good -round -cost non-adaptive algorithm.

###### Lemma 6.

For any , if there is no -round -cost deterministic non-adaptive algorithm with error probability on any input distribution in for any , then there is no -round -cost deterministic non-adaptive algorithm with error probability on any input distribution in for any .

###### Proof.

Consider any -round -cost deterministic non-adaptive algorithm that succeeds with probability on any input distribution in for any . Since the algorithm is non-adaptive, the total number of pulls by the agents on each of the arms in the first round is fixed at the beginning of the first round. Let be such a pull configuration, where denotes the number of pulls on the -th arm. For an -cost algorithm, by a simple counting argument, at least a fraction of satisfies . Let be the set of arms with . Since

$$\frac{\alpha\kappa W}{n_j}\ \le\ \frac{\alpha\kappa\cdot n\ln^3 n\cdot B^2}{\left(\left(1-\frac{1}{L}\right)B^{-2}\right)^{j-1} n}\ \le\ \gamma B^{2j},$$

we have .

We augment the first round of Algorithm as follows.

Algorithm Augmentation.

1. We publish all arms in .

2. For the rest of the arms , we keep pulling them until the total number of pulls reaches . Let be the pull outcomes. If , we publish the arm.

3. If the number of unpublished arms is not in the range of , or there is a published arm with mean , then we return “error”.

We note that the first two steps will only help the algorithm, and thus will only lead to a stronger lower bound. We will show that the extra error introduced by the last step is small, which will be counted in the error probability increase in the induction.
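To make the augmentation concrete, here is a schematic Python version. The pull budget, the publish threshold, and the admissible range for the number of unpublished arms are placeholders for the exact parameters, which were lost in extraction.

```python
import random

def augment_first_round(means, heavy, pulls, threshold, lo, hi, rng):
    """Schematic version of the three augmentation steps:
      1. publish every arm in `heavy` (arms that received too many pulls);
      2. pull each remaining arm `pulls` times and publish it when its
         empirical mean deviates from 1/2 by more than `threshold`;
      3. flag an error when the number of unpublished arms falls
         outside [lo, hi].
    Returns (published, unpublished, ok)."""
    published = set(heavy)
    unpublished = []
    for i, mean in enumerate(means):
        if i in published:
            continue
        emp = sum(rng.random() < mean for _ in range(pulls)) / pulls
        if abs(emp - 0.5) > threshold:
            published.add(i)
        else:
            unpublished.append(i)
    ok = lo <= len(unpublished) <= hi
    return published, unpublished, ok
```

The real augmentation additionally declares an error when a published arm turns out to have near-maximal mean (the event bounded in Claim 8); that check is omitted in this sketch.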

The following claim bounds the number of arms that are not published after the first round.

###### Claim 7.

For any , with probability at least , the number of unpublished arms after the first round is in the range .

###### Proof.

For each arm , let be its mean where . Let be the indicator variable of the event that arm is not published. By Lemma 2,

$$\begin{aligned}
\Pr[Y_z=1] &= \sum_{\ell>j}\Pr[X=B^{-\ell}]\ \pm\ n^{-9} \\
&= \left(1\pm\tfrac{1}{B}\right)B^{2j}\cdot B^{-2(j+1)}\left(1\pm\rho^{-(j+1)}\cdot\rho^{j}\right)\ \pm\ n^{-9} \\
&= \left(1\pm\tfrac{1}{L^2}\right)\cdot B^{-2}.
\end{aligned}$$

By the Chernoff-Hoeffding bound, and the fact that we publish all arms in , we have

$$\sum_{z\in[n_j]}Y_z=\left(1\pm\tfrac{2}{L^2}\right)B^{-2}\left(n_j-|S|\right)$$

with probability . Plugging in the fact that , we have that with probability over distribution ,

$$\sum_{z\in[n_j]}Y_z=\left(1\pm\tfrac{2}{L^2}\right)\left(1\pm\tfrac{1}{\kappa}\right)B^{-2}n_j=\left(1\pm\tfrac{1}{L}\right)B^{-2}n_j.$$

Therefore, if , then with probability , . ∎

The following claim shows that the best arm is not likely to be published in the first round.

###### Claim 8.

For any , the probability that there is a published arm with mean is at most .

###### Proof.

Since the input distribution to belongs to the class , the probability that contains an arm with mean , conditioned on , can be upper bounded by

$$\begin{aligned}
1-\left(1-\lambda_j B^{-2L}\cdot\left(1+\rho^{-L+j}\right)\right)^{\frac{n_j}{\kappa}}
&\le 1-\left(1-\lambda_j B^{-2L}\cdot\left(1+\rho^{-L+j}\right)\right)^{\left(\left(1+\frac{1}{L}\right)B^{-2}\right)^{j-1}\cdot\frac{n}{\kappa}} \\
&= 1-\left(1-\frac{\lambda_j}{B^{2L}}\cdot\left(1+\rho^{-L+j}\right)\right)^{\left(\left(1+\frac{1}{L}\right)B^{-2}\right)^{j-1}\cdot\frac{B^{2L}}{\lambda_1}\cdot\frac{1}{\kappa}} \\
&= O\!\left(\frac{1}{\kappa}\right).
\end{aligned}$$

For each such arm, by Lemma 2 we have that if the arm has mean , then with probability at least we have . The claim follows by a union bound. ∎

By Claim 7, Claim 8 and Lemma 3 (which states that if an arm is not published, then its posterior distribution belongs to ), for , if there is no -round -cost algorithm with error probability on any input distribution in for any , then there is no -round -cost algorithm with error probability on any input distribution in for any , which proves Lemma 6. ∎

##### The Base Case.

Recall that in our collaborative learning model, if an algorithm uses round then it needs to output the answer immediately (without any further arm pull). We have the following lemma.

###### Lemma 9.

Any -round deterministic algorithm must have error probability at least on any distribution in for any .

###### Proof.

First we have

$$n_{L/2}=\left(\left(1\pm\tfrac{1}{L}\right)B^{-2}\right)^{\frac{L}{2}-1}n=\left(\left(1\pm\tfrac{1}{L}\right)B^{-2}\right)^{\frac{L}{2}-1}\cdot\frac{B^{2L}}{B^2}=\Theta\!\left(B^{L}\right). \tag{13}$$

Thus the probability that there exists at least one arm with mean is

 1−(1−(1±1