# Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret

We study the communication complexity of distributed multi-armed bandits (MAB) and distributed linear bandits for regret minimization. We propose communication protocols that achieve near-optimal regret bounds and result in optimal speed-up under mild conditions. We measure the communication cost of a protocol by the total number of communicated numbers. For multi-armed bandits, we give two protocols with low communication cost: one is independent of the time horizon T and the other is independent of the number of arms K. In particular, for a distributed K-armed bandit with M agents, our protocols achieve near-optimal regret O(√(MKT log T)) with O(M log T) and O(MK log M) communication cost respectively. We also propose two protocols for d-dimensional distributed linear bandits that achieve near-optimal regret with O(M^1.5 d^3) and O((Md + d log log d) log T) communication cost respectively. The communication cost can be independent of T, or almost linear in d.


## 1 Introduction

Bandit learning is a central topic in online learning, and has various real-world applications, including clinical trials [23], model selection [16] and recommendation systems [3, 15, 2]. In many tasks using bandit algorithms, it is appealing to employ more agents to learn collaboratively and concurrently in order to speed up the learning process. In many other tasks, the sequential decision making is distributed by nature. For instance, multiple spatially separated labs may be working on the same clinical trial. In such distributed applications, communication between agents is critical, but may also be expensive or time-consuming. This motivates us to consider efficient protocols for distributed learning in bandit problems.

A straightforward communication protocol for bandit learning is immediate sharing: each agent shares every new sample immediately with others. Under this scheme, agents can have good collaborative behaviors close to that in a centralized setting. However, the amount of communicated data is directly proportional to the total size of collected samples. When the bandit is played for a long timescale, the cost of communication would render this scheme impractical. A natural question to ask is: How much communication is actually needed for near-optimal performance? In this work, we show that the answer is somewhat surprising: The required communication cost has almost no dependence on the time horizon.

In this paper, we consider the distributed learning of stochastic multi-armed bandits (MAB) and stochastic linear bandits. There are $M$ agents interacting with the same bandit instance in a synchronized fashion. In time steps $t = 1, \dots, T$, each agent pulls an arm and observes the associated reward. Between time steps, agents can communicate via a server-agent network. Following the typical formulation of single-agent bandit learning, we consider the task of regret minimization [13, 9, 7]. The total regret of all agents is used as the performance criterion of a communication protocol. The communication cost is measured by the total amount of data communicated in the network. Our goal is to minimize communication cost while maintaining near-optimal performance, that is, regret comparable to the optimal regret of a single agent in $MT$ interactions with the bandit instance.

For multi-armed bandits, we propose the DEMAB protocol, which achieves near-optimal regret. The amount of transmitted data per agent in DEMAB is independent of $T$, and is logarithmic with respect to other parameters. For linear bandits, we propose the DELB protocol, which achieves near-optimal regret, and has communication cost with at most logarithmic dependence on $T$.

### 1.1 Problem Setting

#### Communication Model

The communication network we consider consists of a server and several agents. Agents can communicate with the server by sending or receiving packets. Each data packet contains an integer or a real number. We define the communication cost of a protocol as the number of integers or real numbers communicated between server and agents. (Footnote 1: In our protocols, the number of bits each integer or real number uses is only logarithmic w.r.t. instance scale. Using the number of bits as the definition of communication complexity instead will only result in an additional logarithmic factor. The number of communicated bits is analyzed in the appendix.) Several previous works consider the total number of communication rounds [10, 21], while we are more interested in the total amount of data transmitted among all rounds. We assume that communication between server and agents has zero latency. Note that protocols in our model can be easily adapted to a network without a server, by designating an agent as the server.

#### Distributed Multi-armed Bandits

In distributed multi-armed bandits, there are $M$ agents, labeled $1, \dots, M$. Each agent is given access to the same stochastic $K$-armed bandit instance. Each arm $k$ in the instance is associated with a reward distribution $\mathcal{P}_k$, which is supported on $[0, 1]$ with mean $\mu(k)$. Without loss of generality, we assume that arm 1 is the best arm (i.e., $\mu(1) \ge \mu(k)$ for all $k \in [K]$). At each time step $t = 1, \dots, T$, each agent $i$ chooses an arm $a_{t,i}$ and receives a reward $r_{t,i}$ independently sampled from $\mathcal{P}_{a_{t,i}}$. The goal of the agents is to minimize their total regret, which is defined as

$$\mathrm{REG}(T) = \sum_{t=1}^{T} \sum_{i=1}^{M} \left( \mu(1) - \mu(a_{t,i}) \right).$$

For single-agent MAB, i.e., $M = 1$, the optimal regret bound is $\Theta(\sqrt{KT})$ [4].
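As a concrete reading of the regret definition above, the following minimal sketch computes the total regret from a table of pulled arms. The function name `total_regret` and the 0-based arm indices are illustrative assumptions, not notation from the paper.

```python
def total_regret(mu, pulls):
    """Total regret: sum over time steps t and agents i of mu(best) - mu(a_{t,i}).

    mu    -- list of mean rewards, one per arm (0-based arm indices; assumption)
    pulls -- pulls[t][i] is the arm chosen by agent i at time step t
    """
    best = max(mu)
    return sum(best - mu[a] for step in pulls for a in step)


# Two agents, three time steps, three arms; arm 0 is the best arm.
mu = [0.9, 0.5, 0.4]
pulls = [[0, 1], [0, 2], [0, 0]]
print(total_regret(mu, pulls))  # only the pulls of arms 1 and 2 incur regret
```

Note that the regret depends only on the means of the pulled arms, not on the realized rewards.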

#### Distributed Linear Bandits

In distributed linear bandits, the agents are given access to the same $d$-dimensional stochastic linear bandits instance. In particular, we assume that at time step $t$, agents are given an action set $\mathcal{D} \subseteq \mathbb{R}^d$. Agent $i$ chooses action $x_{t,i} \in \mathcal{D}$ and observes reward $y_{t,i}$. We assume that the mean of the reward is decided by an unknown parameter $\theta^* \in \mathbb{R}^d$: $y_{t,i} = x_{t,i}^\top \theta^* + \eta_{t,i}$, where the noise terms $\eta_{t,i}$ are independent and have zero mean. For simplicity, we assume . For distributed linear bandits, the cumulative regret is defined as the sum of the individual agents' regrets:

$$\mathrm{REG}(T) = \sum_{t=1}^{T} \sum_{i=1}^{M} \left( \max_{x \in \mathcal{D}} x^\top \theta^* - x_{t,i}^\top \theta^* \right).$$

Here, we assume that the action set is fixed. A more general setting considers a time-varying action set $\mathcal{D}_t$. In both cases, algorithms with $\tilde{O}(d\sqrt{T})$ regret have been proposed [1], while a regret lower bound of $\Omega(d\sqrt{T})$ is shown in [9].

For both distributed multi-armed bandits and distributed linear bandits, our goal is to use as little communication as possible to achieve near-optimal regret. Since any $M$-agent protocol running for $T$ steps can be simulated by a single-agent bandit algorithm running for $MT$ time steps, the regret of any protocol is lower bounded by the optimal regret of a single-agent algorithm running for $MT$ time steps. Therefore, we consider $\tilde{O}(\sqrt{MKT})$ regret for multi-armed bandits and $\tilde{O}(d\sqrt{MT})$ regret for linear bandits to be near-optimal.

We are mainly interested in the case where the time horizon is the dominant factor (compared to or ). Unless otherwise stated, we assume that in the multi-armed bandits case and in the linear bandits case.

### 1.2 Our Contribution

Now we give an overview of our results. In both settings, we present communication-efficient protocols that achieve near-optimal regret. Our results are summarized in Table 1.

Our results are compared with a naive baseline solution called immediate sharing in Table 1: each agent sends the index of the arm he pulled and the corresponding reward he received to every other agent via the server immediately. This protocol can achieve near-optimal regret for both MAB and linear bandits ( and ), but comes with high communication cost ( and ).

#### Distributed MAB

For distributed multi-armed bandits, we propose DEMAB (Distributed Elimination for MAB) protocol, which achieves near optimal regret () with communication cost. The communication cost is independent of the number of time steps and grows only logarithmically w.r.t. the number of arms. We also prove the following lower bound: When expected communication cost is less than ( is a universal constant), the total regret is trivially . That is, in order to achieve near-optimal regret, the communication cost of DEMAB matches the lower bound except for logarithmic factors.

#### Distributed Linear Bandits

We propose DELB (Distributed Elimination for Linear Bandits), an elimination based protocol for distributed linear bandits which achieves near-optimal regret bound () with communication cost . The communication cost of DELB enjoys nearly linear dependence on both and , and has at most logarithmic dependence on . For the more general case where the action set is time-varying, the DisLinUCB (Distributed LinUCB) protocol still achieves near-optimal regret, but requires communication cost.

## 2 Related Work

There has been growing interest in bandits problems with multiple players. One line of research considers the challenging problem of multi-armed bandits with collisions [19, 6, 11], in which the reward for an arm is 0 if it is chosen by more than one player. The task is to minimize regret without communication. Their setting is motivated by problems in cognitive radio networks, and is fairly different from ours.

In [20] and [12], the authors consider the distributed learning of MAB and linear bandits with restrictions on the communication network. In [20], motivated by fully decentralized applications, the authors consider P2P communication networks, where an agent can communicate with only two other agents at each time step. A gossip-based $\epsilon$-greedy algorithm is proposed for distributed MAB. Their algorithm achieves a speedup linear in in terms of error rate, but the communication cost is linear in . The work of [12] uses a gossip protocol for regret minimization in distributed linear bandits. The main difference between their setting and ours is that each agent is only allowed to communicate with one agent at each time step in [12]. (Footnote 2: Our algorithms can be modified to meet this restriction with almost no change in performance.) Their algorithm achieves near-optimal () total regret using communication cost.

Another setting in the literature concerns distributed pure exploration in multi-armed bandits [10, 21], where the communication model is the most similar to ours. These works use elimination-based protocols for collaborative exploration, and establish tight bounds for the communication-speedup tradeoff. In the fixed confidence setting, near optimal () speedup is achieved with only communication rounds when identifying an -optimal arm [10]. In the fixed-time setting, near optimal speedup in exploration can be achieved with only communication rounds [21]. However, their task (speedup in pure exploration) is not directly comparable to ours (i.e., the two tasks are not reducible to each other). Moreover, in [10, 21], the number of communication rounds is used as the measure of communication, while we use the amount of transmitted data.

## 3 Main Results for Multi-armed Bandits

In this section, we first summarize the single-agent elimination algorithm [5], and then present our Distributed Elimination for MAB (DEMAB) protocol. The regret and communication efficiency of the protocol are then analyzed in Sec. 3.3. A communication lower bound is presented in Sec. 3.4.

### 3.1 Elimination Algorithm for Single-agent MAB

The elimination algorithm [5] is a near-optimal algorithm for single-agent MAB. The agent acts in phases , and maintains a set of active arms . Initially, (all arms). In phase , each arm in is pulled for times; arms with average reward lower than the maximum are then eliminated from .

For each arm $k$, define its suboptimality gap to be $\Delta_k = \mu(1) - \mu(k)$. In the elimination algorithm, with high probability arm $k$ will be eliminated after approximately phases, in which it is pulled for at most times. It follows that regret is , which is almost instance-optimal. The regret is in the worst case.
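The phased elimination scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact algorithm: phases are numbered from 0, the per-phase pull count defaults to `ceil(4**(l + 2) * log(K * T))`, and the elimination threshold is `2**-l`; the callback `pull` and all parameter names are hypothetical.

```python
import math
import random


def eliminate(pull, num_arms, horizon, pulls_per_arm=None):
    """Phased elimination for a single agent (illustrative sketch).

    pull(a)       -- returns a reward in [0, 1] for arm a (hypothetical callback)
    pulls_per_arm -- maps phase l to pulls per active arm; the default
                     ceil(4**(l + 2) * log(K * T)) is an assumption of this sketch
    Returns the set of arms still active when the budget runs out.
    """
    if pulls_per_arm is None:
        pulls_per_arm = lambda l: math.ceil(4 ** (l + 2) * math.log(num_arms * horizon))
    active = list(range(num_arms))
    used, phase = 0, 0
    while True:
        m = pulls_per_arm(phase)
        if used + m * len(active) > horizon:
            return set(active)  # not enough budget left to finish another phase
        means = {a: sum(pull(a) for _ in range(m)) / m for a in active}
        used += m * len(active)
        best = max(means.values())
        # Drop arms whose average is more than 2**-phase below the best average.
        active = [a for a in active if means[a] >= best - 2.0 ** (-phase)]
        phase += 1


# Two Bernoulli arms with a large gap; the worse arm should be dropped early.
random.seed(0)
arm_means = [0.9, 0.2]
print(eliminate(lambda a: float(random.random() < arm_means[a]), 2, 100_000))
```

Because the threshold halves each phase while the pull count quadruples, an arm with gap $\Delta$ stops being pulled once the threshold falls below roughly $\Delta$.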

### 3.2 The DEMAB Protocol

The DEMAB protocol executes in two stages. In the first stage, each agent directly runs the single-agent elimination algorithm for time steps. The remaining arms of agent are denoted as . In time steps, an agent completes at least phases. The purpose of this separate burn-in period is to eliminate the worst arms quickly without communication, so that in the second stage, elimination can begin with a small threshold of . and is chosen so that the total regret within the first stage is .

Between the two stages, the remaining arms are randomly allocated to agents. Public randomness is used to allocate the remaining arms to save communication. Agents first generate , uniformly random numbers in , from a public random number generator. Agent then computes . By doing so, agent keeps each arm in with probability , and the resulting sets are disjoint. Meanwhile, every arm in is kept in , so that the best arm remains in with high probability. (Footnote 3: may not be a subset of , which is not a problem in the regret analysis.)
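A minimal sketch of this communication-free allocation is given below. It assumes that the public random number generator can be modeled by a seed known to every agent, and that each arm is assigned a uniformly random owner among the agents; the function name `allocate` and the 0-based indices are illustrative.

```python
import random


def allocate(local_arms, agent, num_agents, num_arms, shared_seed):
    """Arms that `agent` keeps after the random allocation (illustrative sketch).

    Every agent calls this with the same shared_seed, which stands in for the
    public random number generator (an assumption of this sketch), so all
    agents derive the same owner for every arm without exchanging messages.
    """
    rng = random.Random(shared_seed)
    owner = [rng.randrange(num_agents) for _ in range(num_arms)]
    return {a for a in local_arms if owner[a] == agent}


# Three agents whose stage-one survivors differ slightly.
survivors = [{0, 1, 2, 5}, {0, 1, 3, 5}, {0, 2, 5, 7}]
kept = [allocate(s, i, 3, 8, shared_seed=2024) for i, s in enumerate(survivors)]
print(kept)
```

Since each arm has exactly one owner, the kept sets are disjoint, and an arm that survived stage one at every agent is kept by exactly one of them.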

In the second stage, agents start to simulate a single-agent elimination algorithm starting from phase . Initially, the arm set is . In phase , each arm in will be pulled for at least times. Denote the average reward of arm in phase by . If , it will be eliminated; the arm set after the elimination is .

This elimination in the second stage is performed over agents in one of two modes: distributed or centralized. Let be the number of remaining arms at the start of phase . If is larger than , the elimination is performed in distributed mode. That is, agent keeps a set of arms , and pulls each arm in for times in phase . Each agent only needs to send the highest average reward to the server, who then computes and broadcasts . Agent then eliminates low-rewarding arms from on its local copy.
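One elimination step in distributed mode can be sketched as follows. The threshold `2**-phase` and the way the cost is counted (one number up per reporting agent, one number down per agent) are assumptions of this sketch; the point is that the per-phase traffic does not depend on how many arms each agent holds.

```python
def distributed_elimination_round(local_means, phase):
    """One elimination step in distributed mode (illustrative sketch).

    local_means -- local_means[i] maps each arm held by agent i to its
                   average reward in this phase
    Returns the surviving arm sets and the number of values communicated.
    """
    # Uplink: each agent reports only its highest local average (one number).
    reports = [max(means.values()) for means in local_means if means]
    # Downlink: the server broadcasts the global maximum to every agent.
    global_best = max(reports)
    cost = len(reports) + len(local_means)
    threshold = global_best - 2.0 ** (-phase)
    survivors = [{a for a, m in means.items() if m >= threshold} for means in local_means]
    return survivors, cost


means = [{0: 0.82, 3: 0.40}, {1: 0.55, 4: 0.79}, {2: 0.30}]
print(distributed_elimination_round(means, phase=2))
```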

When , the elimination is performed in centralized mode. That is, will be kept and updated by the server. (Footnote 4: In the conversion from distributed mode to centralized mode, agents send their local copy to the server, which has communication cost.) In phase , the server assigns an arm in to agents, and asks each of them to pull it times. (Footnote 5: The indivisible case is handled in Appendix A.) The server waits for the average rewards to be reported, and then performs elimination on .

One critical issue here is load balancing, especially in distributed mode. Suppose that , . Then the length of phase is determined by . Agent would need to keep pulling arms for times until the start of the next communication round. This will cause an arm to be pulled for much more than times in phase , and can hurt the performance. Therefore, it is vital that at the start of phase , is balanced. (Footnote 6: We say that a vector of numbers is balanced if its maximum is at most twice its minimum.)

The subroutine Reallocate is designed to ensure this by reallocating arms when is not balanced. First, the server announces the total number of arms; then, agents with more-than-average number of arms donate surplus arms to the server; the server then distributes the donated arms to the other agents, so that every agent has the same number of arms. However, calling Reallocate is communication-expensive: it takes communication cost, where is the current phase and is the last phase where Reallocate is called. Fortunately, since are generated randomly, it is unlikely that one of them contains too many good arms or too many bad arms. By exploiting shared randomness, we greatly reduce the expected communication cost needed for load balancing.
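The donate-and-redistribute idea behind Reallocate can be sketched as follows. This is an illustration, not the paper's pseudocode: target sizes are chosen to differ by at most one, and the number of moved arm indices is used as a rough proxy for communication cost. Both choices, and the names `is_balanced` and `reallocate`, are assumptions.

```python
def is_balanced(sizes):
    """A vector is balanced if its maximum is at most twice its minimum."""
    return max(sizes) <= 2 * min(sizes)


def reallocate(arm_sets):
    """Rebalance arms across agents via the server (illustrative sketch).

    Returns the new arm sets and the number of arm indices donated, a rough
    proxy for the communication cost (an assumption of this sketch).
    """
    sizes = [len(s) for s in arm_sets]
    if is_balanced(sizes):
        return [set(s) for s in arm_sets], 0
    total, n = sum(sizes), len(arm_sets)
    # Target sizes differ by at most one: the first (total % n) agents get one extra.
    target = [total // n + (1 if i < total % n else 0) for i in range(n)]
    new_sets, pool = [], []
    for s, t in zip(arm_sets, target):
        ordered = sorted(s)
        new_sets.append(set(ordered[:t]))  # keep at most the target number of arms
        pool.extend(ordered[t:])           # donate the surplus to the server
    moved = len(pool)
    for s, t in zip(new_sets, target):
        while len(s) < t:
            s.add(pool.pop())              # server hands out donated arms
    return new_sets, moved


print(reallocate([{0, 1, 2, 3, 4, 5, 6}, {7}, {8, 9}]))
```

The early return when the sizes are already balanced reflects the point made above: a call only costs communication when the allocation has actually drifted out of balance.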

Detailed descriptions of the single-agent elimination algorithm, the Reallocate subroutine, and the DEMAB protocol are provided in Appendix A.

Access to a public random number generator, which is capable of generating and sharing random numbers with all agents with negligible communication cost, is assumed in DEMAB. This is not a strong assumption, since it is well known that a public random number generator can be replaced by private random numbers with a little additional communication [17]. In our case, only additional bits of communication, or additional communication cost, are required for all of our theoretical guarantees to hold. See Appendix B for detailed discussion.

### 3.3 Regret and Communication Efficiency of DEMAB

In this subsection, we show that the DEMAB protocol achieves near-optimal regret with efficient communication, as captured by the following theorem.

###### Theorem 1.

The DEMAB protocol incurs regret, communication cost with probability , and communication cost in expectation.

The worst-case regret bound above can be improved to an instance-dependent near-optimal regret bound by changing the choice of and to . In that case the communication cost is , which is a small increase. See Theorem 5 in Appendix B for detailed discussion.

We now give a sketch of the proof of Theorem 1.

#### Regret

In the first stage, each agent runs a separate elimination algorithm for timesteps, which has regret . Total regret for all agents in this stage is . After the first stage, each agent must have completed at least phases. Hence, with high probability, before the second stage, contains the optimal arm and only arms with suboptimality gap less than .

In the second stage, if , it will be pulled at most times in phase because of our load balancing effort. Therefore, if arm has suboptimality gap , it will be pulled for times. It follows that regret in the second stage is , and that total regret is .

#### Communication

In the first stage and the random allocation of arms, no communication is needed. The focus is therefore on the second stage.

During a phase, apart from the potential cost of calling Reallocate, communication cost is . The communication cost of calling Reallocate in phase is at most , where is the last phase where Reallocate is called. Therefore, total cost for calling Reallocate in one execution is at most , where is the first phase in which Reallocate is called. From the definition of and , we can see that there are at most phases in the second stage. Therefore in the worst case, communication cost is since .

However, in expectation, is much smaller than . Because of the random allocation, when is large enough, would be balanced with high probability. In fact, with probability , . Setting , we can show that the expected communication complexity is .

### 3.4 Lower Bound

Intuitively, in order to avoid a scaling of regret, amount of communication cost is necessary; otherwise, most of the agents can hardly do better than a single-agent algorithm. We prove this intuition in the following theorem.

###### Theorem 2.

For any protocol with expected communication cost less than , there exists a MAB instance such that total regret is .

The theorem is proved using a reduction from single-agent bandits to multi-agent bandits, i.e. a mapping from protocols to single-agent algorithms.

One can trivially achieve regret with communication cost by running an optimal MAB algorithm separately. Therefore, Theorem 2 essentially gives a lower bound on communication cost for achieving non-trivial regret. The communication cost of DEMAB is only slightly larger than this lower bound, but DEMAB achieves near-optimal regret. This suggests that the communication-regret trade-off for distributed MAB is a steep one: with communication cost, regret can be near-optimal; with slightly less communication, regret necessarily deteriorates to the trivial case.

## 4 Main Results for Linear Bandits

In this section, we summarize the single-agent elimination algorithm (Algorithm 12 in [14]), and present the Distributed Elimination for Linear Bandits (DELB) protocol. This protocol is designed for the case where the action set is fixed, and has communication cost with almost linear dependence on $M$ and $d$. Our results for linear bandits with time-varying action sets are presented in Sec. 4.4. For convenience, we assume $\mathcal{D}$ is a finite set, which is without loss of generality. (Footnote 7: When $\mathcal{D}$ is infinite, we can replace $\mathcal{D}$ with an $\epsilon$-net of $\mathcal{D}$, and only take actions in the $\epsilon$-net. If , this will not influence the regret. This is a feasible approach, but may not be efficient.)

### 4.1 Elimination Algorithm for Single-agent Linear Bandit

The elimination algorithm for linear bandits [14] also iteratively eliminates arms from the initial action set. In phase , the algorithm maintains an active action set . It computes a distribution over and pulls arms according to . Suppose pulls are made in this phase according to . We use linear regression to estimate the mean reward of each arm based on these pulls. Arms with estimated rewards lower than the maximum are eliminated at the end of the phase.

To eliminate arms with suboptimality gap in phase with high probability, the estimation error in phase needs to be less than . On the other hand, to achieve tight regret, the number of pulls we make in phase needs to be as small as possible. Let and . According to the analysis in [14] (Chapter 21), if we choose each arm exactly times, the estimation error for any arm is at most with high probability. This means that we need to find a distribution that minimizes , which is equivalent to a well-known problem called $G$-optimal design [18]. One can find a distribution minimizing with . The support set of (a.k.a. the core set) has size at most . As a result, only pulls are needed in phase .
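For reference, the standard textbook formulation of $G$-optimal design (as in [14], Chapter 21) is stated below. The symbols $\mathcal{A}$ (a finite active action set spanning $\mathbb{R}^d$), $V(\pi)$ and $g(\pi)$ are the usual textbook notation and are assumed here; they are not recovered from the text above.

$$V(\pi) = \sum_{x \in \mathcal{A}} \pi(x)\, x x^\top, \qquad g(\pi) = \max_{x \in \mathcal{A}} \lVert x \rVert^2_{V(\pi)^{-1}} = \max_{x \in \mathcal{A}} x^\top V(\pi)^{-1} x .$$

By the Kiefer and Wolfowitz equivalence theorem, $\min_{\pi} g(\pi) = d$, and a minimizer exists whose support has size at most $d(d+1)/2$. Because $g(\pi)$ controls the worst-case variance of the least-squares prediction over $\mathcal{A}$, a design with small $g(\pi)$ yields uniformly accurate reward estimates from few pulls.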

### 4.2 The DELB Protocol

In this protocol, we parallelize the data collection part of each phase by sending instructions to agents in a communication-efficient way. In phase , the server and the agents both locally solve the same $G$-optimal design problem on , the remaining set of actions. We only need to find a -approximation to the optimal . That is, we only need to find satisfying . On the other hand, we require the solution to have a support smaller than . This is feasible since the Frank-Wolfe algorithm under appropriate initialization can find such an approximate solution for finite action sets (see Proposition 3.17 [22]). After that, the server assigns arms to agents. Since both the server and agents obtain the same core set by solving $G$-optimal design, the server only needs to send the index among arms to identify and allocate each arm. After pulling arms, agents send the results to the server, who summarizes the results with linear regression. Agents and the server then eliminate low-rewarding arms from their local copy of .

For convenience, we define

### 4.3 Regret and Communication Efficiency of DELB

We state our results for the elimination-based protocol for distributed linear bandits. The full proof is given in Appendix F.

###### Theorem 3.

The DELB protocol achieves expected regret with communication cost .

Proof sketch: In round , the number of pulls is at most . Based on the analysis for elimination-based algorithm, we can show that the suboptimality gap is at most with probability for any arm pulled in phase . Suppose there are at most phases, we can prove that

In each phase, communication cost comes from three parts: assigning arms to agents, receiving average rewards of each arm and sending to agents. In the first and second part, each arm is designated to as few agents as possible. We can show that the communication cost of these parts is . In the third part, the cost of sending is . Since is at most , the total communication is

### 4.4 Protocol for Linear Bandits with Time-varying Action Set

In some previous work on linear bandits [8, 1], the action set available at timestep may be time-varying. That is, players can only choose actions from at time , while regret is defined against the optimal action in . The DELB protocol does not apply in this scenario. To handle this setting, we propose a different protocol DisLinUCB (Distributed LinUCB) based on LinUCB [1]. We only state the main results here. Detailed description of the protocol and the proof are given in Appendix G and Appendix H.

###### Theorem 4.

DisLinUCB protocol achieves expected regret of with communication cost.

Although the regret bound is still near-optimal, the communication cost has worse dependencies on $M$ and $d$ compared to that of DELB.

## References

• [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
• [2] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
• [3] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems, pages 17–24, 2009.
• [4] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
• [5] Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
• [6] Ilai Bistritz and Amir Leshem. Distributed multi-player bandits-a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222–7232, 2018.
• [7] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
• [8] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
• [9] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 355–366, 2008.
• [10] Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854–862, 2013.
• [11] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.
• [12] Nathan Korda, Balázs Szörényi, and Li Shuai. Distributed clustering of linear bandits in peer to peer networks. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 48, pages 1301–1309. International Machine Learning Society, 2016.
• [13] Tze Leung Lai et al. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091–1114, 1987.
• [14] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2019.
• [15] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
• [16] Oded Maron and Andrew W Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in neural information processing systems, pages 59–66, 1994.
• [17] Ilan Newman. Private vs. common random bits in communication complexity. Information processing letters, 39(2):67–71, 1991.
• [18] Friedrich Pukelsheim. Optimal design of experiments. SIAM, 2006.
• [19] Jonathan Rosenski, Ohad Shamir, and Liran Szlak. Multi-player bandits–a musical chairs approach. In International Conference on Machine Learning, pages 155–163, 2016.
• [20] Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Society, 2013.
• [21] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. arXiv preprint:1904.03293, 2019.
• [22] Michael J Todd. Minimum-volume ellipsoids: Theory and algorithms, volume 23. SIAM, 2016.
• [23] You-Gan Wang. Sequential allocation in clinical trials. Communications in Statistics-Theory and Methods, 20(3):791–805, 1991.

## Appendix A Detailed Description of DEMAB

In this section, we give a detailed description of the DEMAB protocol and some subroutines used in the protocol.

Eliminate: Eliminate executes the single-agent elimination algorithm. In this function, each agent runs the single-agent elimination algorithm for time steps, then returns the remaining arms.

Reallocate: In Reallocate, the server announces the average number of arms; agents with more-than-average arms then donate surplus arms to the server; the server then distributes the donated arms to the other agents, so that every agent has nearly the same number of arms. After calling Reallocate, becomes balanced again. The function contains the following two parts: One running on the server, and the other running on each agent.

Centralize: When the number of arms drops below , the subroutine Centralize is called, in which agents send their local copy of remaining arms, , to the server, and server receives .

Assignment Strategy: In centralized mode, the server assigns arms to agents in the following way. Let . If is exactly divisible by , for each arm in , the server asks separate agents to play it for times. If not, we allocate pulls to agents in the following way: Let denote the average pulls each agent needs to perform. Our assignment starts from the arm with the smallest index and agent 1. For arm and agent , if agent has been assigned pulls, we turn to agent . If we have finished allocating pulls for arm , we continue with arm . The assignment finishes once all pulls are scheduled to agents.
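The sequential assignment for the indivisible case can be sketched as follows. The per-agent quota `ceil(total / num_agents)` is an assumption standing in for the "average pulls" in the text, and agents are indexed from 0; the function name `assign_pulls` is illustrative.

```python
import math


def assign_pulls(arms, pulls_per_arm, num_agents):
    """Sequentially assign pulls to agents in centralized mode (illustrative sketch).

    Returns a list of (agent, arm, count) instructions. Each agent receives at
    most ceil(total / num_agents) pulls; this quota is an assumption of this
    sketch for the "average pulls" mentioned in the text.
    """
    total = len(arms) * pulls_per_arm
    quota = math.ceil(total / num_agents)
    plan, agent, load = [], 0, 0
    for arm in sorted(arms):              # start from the smallest arm index
        remaining = pulls_per_arm
        while remaining > 0:
            if load == quota:             # this agent is full; move to the next one
                agent, load = agent + 1, 0
            count = min(remaining, quota - load)
            plan.append((agent, arm, count))
            load += count
            remaining -= count
    return plan


print(assign_pulls(arms=[2, 5], pulls_per_arm=6, num_agents=4))
```

Filling agents in order means each arm is split across a contiguous run of agents, which keeps the number of (agent, arm) instructions, and hence the messages needed to issue them, small.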

## Appendix B Proof of Theorem 1

In this section, we give a full proof of Theorem 1, which bounds the total regret and communication cost of the DEMAB protocol. In the analysis below, we will use to represent (in the distributed mode) or (in the centralized mode). It refers to the set of remaining arms at the start of the -th phase at stage 2, either held separately by the agents or held by the server.

Suppose that the protocol terminates when . We also let be the number of times arm is pulled before time step . Without loss of generality, we assume that arm is the best arm, and define .

We first state a few facts and lemmas.

###### Fact 1.

We state some facts regarding the execution of the algorithm.

1. At line of the server’s part in Reallocate, ;

2. After Reallocate is called, is balanced;

3. For any player , the number of completed phases in stage 1 is at least ;

4. The number of phases at stage 2 is at most .

###### Proof.

1. Let and . At line 3 of server’s reallocate code, server receives arms. At line 5, arms are removed. So at line 6 .

2. Let . If , the reallocation procedure will do nothing, and is by definition balanced. If , then at the end of the reallocation procedure, every player has a new set of arms such that . This implies that the number of arms is balanced, since when reallocation is called, .

3. The length of the -th phase at stage is at most . After phases, the number of timesteps satisfies . Set . We can see that the number of phases at stage 1 is at least .

4. Suppose that phase is completed. Since at least pulls are made in phase , we can show that . At the same time, . Therefore the number of phases at stage 2 satisfies

 l−l0≤⌈log4(C2M2K2)⌉≤4+1.5log(MK)=L=O(log(MK)).

Via a direct application of Hoeffding’s inequality and union bound, we have the following lemma.

###### Lemma B.1.

Let be the maximum number of phases for all agents at stage 1. Denote by $\hat{\mu}_{i,l}(a)$ the average reward of arm $a$ computed by agent $i$ in phase $l$ of stage 1. With probability at least , for all phases , any agent , any arm ,

$$\left| \hat{\mu}_{i,l}(a) - \mu(a) \right| \le 2^{-l-1}.$$

###### Proof.

Observe that and by Fact 1.3.

For any agent $i$ and any phase $l$, denote the empirical mean of arm $a$ in phase $l$ by $\hat{\mu}_{i,l}(a)$. For any fixed $i$, $l$ and $a$, Hoeffding's bound gives

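$$\Pr\left[ \left| \hat{\mu}_{i,l}(a) - \mu(a) \right| > 2^{-l-1} \right] \le 2\exp\left\{ -\tfrac{1}{2} m_l \cdot 4^{-l-1} \right\} \le \frac{2}{(MKT)^2}.$$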

Taking a union bound over all agents, arms, and phases proves the desired result. ∎
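As a numerical sanity check of the Hoeffding step, note that the last inequality in the display holds whenever $m_l \ge 4^{l+2} \log(MKT)$. This threshold is derived here by solving the displayed inequality for $m_l$; it is not quoted from the protocol's definition. The helper names below are illustrative.

```python
import math


def min_pulls(phase, num_agents, num_arms, horizon):
    """Smallest integer m with 2*exp(-m * 4**-(phase+1) / 2) <= 2 / (M*K*T)**2."""
    return math.ceil(4 ** (phase + 2) * math.log(num_agents * num_arms * horizon))


def failure_bound(m, phase):
    """Hoeffding failure bound 2*exp(-m * 4**-(phase+1) / 2) for m pulls."""
    return 2 * math.exp(-0.5 * m * 4.0 ** (-(phase + 1)))


M, K, T, l = 10, 50, 10**6, 3
m = min_pulls(l, M, K, T)
print(m, failure_bound(m, l) <= 2 / (M * K * T) ** 2)
```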

###### Lemma B.2.

At the end of stage 1, the following holds with probability :

1. , for all , (If exists);

2. , for all , , (If exists);

3. ;

4. , ;

5. ;

Denote the event that the above holds by .

###### Proof.

These results are direct implications of Lemma B.1.

1. Notice that

$$\hat{\mu}_{i,l}(1) \ge \mu(1) - 2^{-l-1} \ge \mu(a) - 2^{-l-1} \ge \hat{\mu}_{i,l}(a) - 2^{-l}$$

with probability . Thus, arm 1 will never be eliminated throughout the first phases.

2. For any ,

$$\hat{\mu}_{i,l}(a) \ge \max_{k \in A^{(i)}_l} \hat{\mu}_{i,l}(k) - 2^{-l} \ge \hat{\mu}_{i,l}(1) - 2^{-l}$$

with probability , which means

 μ(a)≥^μi,l(a)−