# Multiplayer bandits without observing collision information

We study multiplayer stochastic multi-armed bandit problems in which the players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider two feedback models: a model in which the players can observe whether a collision has occurred, and a more difficult setup when no collision information is available. We give the first theoretical guarantees for the second model: an algorithm with a logarithmic regret, and an algorithm with a square-root regret type that does not depend on the gaps between the means. For the first model, we give the first square-root regret bounds that do not depend on the gaps. Building on these ideas, we also give an algorithm for reaching approximate Nash equilibria quickly in stochastic anti-coordination games.


## 1 Introduction

The stochastic multi-armed bandit problem is a well-studied problem of machine learning: consider an agent that has to choose among several actions in each round of a game. To each action $i$ is associated a real-valued parameter $\mu_i$. Whenever the player performs the $i$-th action, she receives a random reward with mean $\mu_i$. If the player knew the means associated with the actions before starting the game, she would play an action with the highest mean during all rounds. The problem is to design a strategy for the player to maximize her reward in the setting where she does not know the means. The regret of the strategy is the difference between the accumulated rewards in the two scenarios.

This problem encapsulates the well-known exploration/exploitation trade-off: the player never learns the means exactly, but can estimate them. As the game proceeds, she learns that some of the actions probably have better means, so she can ‘exploit’ these actions to obtain a better reward, but at the same time she has to ‘explore’ other actions as well, since they might have higher means. We refer the reader to Bubeck and Cesa-Bianchi (2012) for a survey on this problem. Traditionally, actions are called ‘arms’ and ‘pulling an arm’ refers to performing an action.

We study a multiplayer version of this game, in which each player pulls an arm in each round, and if two or more players pull the same arm, a collision occurs and all players pulling that arm receive zero reward. The players’ goal is to maximize the collective received reward.

One application for this model is opportunistic spectrum access with multiple users in a cognitive radio network: we have a radio network with several channels (corresponding to the arms) that have been purchased by primary users. There are also secondary users (the players) that can try to use these channels during the rounds when the primary users are not transmitting. Successfully using a channel to transmit a message means a unit reward, and not transmitting means zero reward. If more than one secondary user tries to use the same channel in the same round, a collision occurs and none of them can transmit. If a unique secondary user tries to use a channel, she will succeed if the primary user owning that channel happens to be idle in that round, which happens with a certain probability. Thus, the reward of the secondary user is a Bernoulli random variable whose mean depends on the activity of the corresponding primary user, and on whether other secondary users have tried to use the same channel. See Liu and Zhao (2010, Section I.D) for other applications.

One may consider (at least) two possible feedback models: in the first model, whenever a player pulls an arm, she observes whether a collision has occurred on that arm, and receives a reward. In the second model the player just receives a reward, without observing whether a collision has occurred (of course, if the reward is positive, she can infer that no collision has occurred, but if the reward is zero, it is not clear whether a collision has occurred or not). The first feedback model has been studied in a series of work where theoretical guarantees have been proved. The second feedback model was introduced by Bonnefoi, Besson, Moy, Kaufmann, and Palicot (2017), motivated by large scale IoT applications, and further studied by Besson and Kaufmann (2018), but for this model no theoretical guarantees have been proved.

Our main contributions are summarized as follows.

1. We offer the first theoretical guarantees for the second model, where the players do not observe collision information. We propose an algorithm with a logarithmic regret (in terms of the number of rounds), and we also give an algorithm with a sublinear regret that does not depend on the gaps between the means.

2. For the first model, in which the players observe collision information, we prove the first sublinear regret bounds that do not depend on the gaps between the means.

3. One may also view this setup as a stochastic anti-coordination game. Using the algorithmic ideas introduced here, we give an algorithm for reaching an approximate Nash equilibrium quickly in such games.

### 1.1 The model, results, and organization

Let $K$ be a positive integer and let $\mu_1,\dots,\mu_K$ be nonnegative numbers corresponding to the arm means. Let $Y_{i,t}$ be the reward of arm $i$ in round $t$, so the $Y_{i,t}$ are independent, identically distributed across rounds, and $\mathbb{E}Y_{i,t}=\mu_i$. We may assume, by relabelling the arms if necessary, that $\mu_1\ge\mu_2\ge\dots\ge\mu_K$. The players are of course unaware of this labelling.

A set of $m$ players play the following game for $T$ rounds: in each round $t\in[T]$, player $j\in[m]$ chooses an arm $A_j(t)\in[K]$. Let $C_i(t)$ be the collision indicator for arm $i$ in round $t$, that is, $C_i(t)=1$ if and only if there exist distinct $j,j'\in[m]$ with $A_j(t)=A_{j'}(t)=i$. Player $j$ receives the reward $Y_{A_j(t),t}\,\bigl(1-C_{A_j(t)}(t)\bigr)$ in round $t$.

We will also consider a stronger feedback model, in which each player $j$ also observes $C_{A_j(t)}(t)$ in each round $t$; this is called ‘the model with collision information.’

The regret of a strategy is defined as

$$\mathrm{Regret} = T\sum_{i\in[m]}\mu_i \;-\; \sum_{t\in[T]}\sum_{j\in[m]}\mu_{A_j(t)}\bigl(1-C_{A_j(t)}(t)\bigr) \tag{1}$$

Note that regret is a random variable (since the strategy can randomize, the choices $A_j(t)$ can be random) and we will bound its expected value, although ‘with high probability’ bounds can also be derived from our proofs.
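As a concrete reading of definition (1), the following minimal sketch computes the regret of a recorded sequence of pulls. The function name `regret`, the 0-based arm indices, and the representation of the history as a list of per-round pull lists are illustrative assumptions, not notation from the text.

```python
from collections import Counter
from typing import Sequence


def regret(mu: Sequence[float], pulls: Sequence[Sequence[int]]) -> float:
    """Regret as in (1) for a recorded history.

    mu[i] is the mean of arm i (0-based, any order); pulls[t][j] is the arm
    chosen by player j in round t (hypothetical encoding of A_j(t)).
    """
    T = len(pulls)
    m = len(pulls[0])
    # Benchmark: T times the sum of the m largest means.
    benchmark = T * sum(sorted(mu, reverse=True)[:m])
    collected = 0.0
    for round_pulls in pulls:
        counts = Counter(round_pulls)
        # A pull contributes its mean only if no other player chose that arm.
        collected += sum(mu[arm] for arm in round_pulls if counts[arm] == 1)
    return benchmark - collected
```

For example, with means `[0.9, 0.5, 0.1]` and two players, a round in which both players pull arm 0 contributes nothing to the collected term, so that round alone adds 1.4 to the regret.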

To simplify the statements and proofs of our main theorems, we make three additional assumptions, which can be relaxed at the expense of getting worse bounds, as discussed in Section 4.

1. $K\ge m$: there are at least as many arms as players.

2. $Y_{i,t}$ is supported on $[0,1]$, so the means and the rewards are also in $[0,1]$.

3. All players know the values of both $m$ and $T$.

Note that we assume no communication between the players, and our algorithms are totally distributed. All of our algorithms are explicit, simple, and efficient.

We can now state our main theorems. Let $\Delta := \mu_m-\mu_{m+1}$. All the following results correspond to the weak feedback model (i.e., no collision information), unless stated otherwise. Any regret upper bound for this model automatically carries over to the stronger feedback model as well.

###### Theorem 1.1.

There is an algorithm with expected regret $O\bigl(mK\log(T)/\Delta^2\bigr)$.

In this theorem and throughout, the notation $A=O(B)$ means there exists an absolute constant $C$ such that for all admissible parameters, $A\le CB$.

A shortcoming of Theorem 1.1 is that it gives a vacuous bound if $\Delta=0$, and a very bad bound if $\Delta$ is very small. Moreover, one may wonder whether a square-root-type regret bound that is independent of the gaps is possible, as in the single player case. The following theorem shows this is possible, under some weak assumptions. Let . Observe that , and that is positive unless all arms have the same mean.

###### Theorem 1.2.

(a) Suppose all players know a lower bound $\mu$ for $\mu_m$. Then there is an algorithm with expected regret $O\bigl(K^2m\log^2(T)/\mu + Km\min\{\sqrt{T\log T},\ \log(T)/\Delta'\}\bigr)$.

(b) For the stronger feedback model, in which the players observe the collision information, there is an algorithm with expected regret

$$O\Bigl(K^2m\log^2(T) + Km\min\bigl\{\sqrt{T\log T},\ \log(T)/\Delta'\bigr\}\Bigr) = O\bigl(K^2m\sqrt{T\log T}\bigr).$$

(c) Suppose each player has the option of leaving the game at any point; that is, she can choose not to pull from some round onwards. Then, for any , there exists an algorithm with expected regret . In particular, setting gives an algorithm whose expected regret is .

We do not know whether our regret upper bounds are tight; proving lower bounds is left for further work. Some asymptotic lower bounds for the stronger feedback model have been proved by Besson and Kaufmann (2018, Section 3).

Another interesting avenue for future research is the setting in which the rewards are not i.i.d., but are chosen by an adversary.

The three algorithms proving Theorem 1.2 are quite similar. All of our algorithms have the property that, eventually each player fixates on one arm. This can be viewed as ‘reaching an equilibrium’ in a game-theoretic framework, where the actions correspond to arms, and the outcome of each action is the mean of the arm if no two players choose that action, and zero otherwise. Games with the property that ‘if two or more players choose the same action then their outcome is zero’ are called ‘anti-coordination games.’ Using our techniques for multiplayer bandits, we also provide an algorithm for converging to an approximate Nash equilibrium quickly in such a game.

More precisely, we define a stochastic anti-coordination game as follows: for each player and action , there is an outcome , such that if player performs action while no other player performs it, she will get a random reward in with mean , while if two or more players perform the same action , all get reward 0. An assignment of players to actions is an -Nash equilibrium if, no player can improve her expected reward by more than by switching to another action, while other players’ actions are unchanged. Then, we would like to design an algorithm for each player that reaches an -Nash equilibrium quickly. We prove the following theorem in this direction.

###### Theorem 1.3.

There is a distributed algorithm that with probability at least converges to an -Nash equilibrium in any stochastic anti-coordination game within many rounds.

In proving this theorem we assume each player also has the option of choosing a ‘dummy’ action with zero reward, which is given index 0. This is a realistic assumption in most applications.

Next we review some related work. Theorems 1.1 and 1.2 are proved in Sections 2 and 3, respectively. In Section 4 we discuss how to relax Assumptions 1–3 above. Finally, the proof of Theorem 1.3 appears in Section 5.

### 1.2 Related work

There is little previous work on the model without observing collision information: the model was introduced by Bonnefoi et al. (2017) and further studied by Besson and Kaufmann (2018). These papers introduce an algorithm and study it empirically, but no theoretical guarantee is given. In particular, it is argued in Besson and Kaufmann (2018, Appendix E) that the expected regret of that algorithm is linear.

We now review previous work on the stronger feedback model with collision information available to the players. The multiplayer multi-armed bandit games were introduced by Anantharam, Varaiya, and Walrand (1987) and further studied by Komiyama, Honda, and Nakagawa (2015). They studied a centralized algorithm, that is, when there is a single centre that controls the players, and observes the rewards of all players. The distributed setting was introduced by Liu and Zhao (2010), where an algorithm was given with expected regret bounded by , with depending on the game parameters, that is, , , and the arm means. They also showed that any algorithm must have regret . The dependence of on the parameters was further improved by Anandkumar, Michael, Tang, and Swami (2011); Rosenski, Shamir, and Szlak (2016); Besson and Kaufmann (2018).

In Rosenski et al. (2016) a ‘musical chairs’ subroutine was introduced to reduce the number of collisions; we have further developed and used this subroutine in our algorithms. Their final algorithm requires the knowledge of and its regret is bounded by , which is at least as large as the bound of Theorem 1.1.

Besson and Kaufmann (2018) tightened the previous lower bounds, and also developed an algorithm whose regret is bounded by

 O(log(T))(K∑i=m+1mkl(μi,μm)+∑1≤i

where . This bound is not comparable with the bound of Theorem 1.1 in general; however if and , then their bound becomes , which is worse than our bound by a multiplicative factor of .

We emphasize that all the previously known upper bounds become vacuous if , whereas our Theorem 1.2 gives sublinear bounds in this case.

Finally, Avner and Mannor (2014); Rosenski, Shamir, and Szlak (2016) also study a dynamic version of the problem, in which the players can leave the game and new players can arrive, and prove sublinear regret bounds. We do not study such scenarios here.

##### Preliminaries.

We denote $[n] := \{1,\dots,n\}$. All logarithms are in the natural base. We will use the following versions of Chernoff-Hoeffding concentration inequalities; see, e.g., McDiarmid (1998, Theorem 2.3):

###### Proposition 1.4.

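Let the random variables $X_1,\dots,X_n$ be independent, with $0\le X_i\le 1$ for each $i$. Let $\hat\mu = \frac1n\sum_{i=1}^n X_i$ and $\mu=\mathbb{E}\hat\mu$. Then we have,

(a) for any $t>0$,

$$\mathbb{P}\{|\hat\mu-\mu|>t\} < 2\exp(-2nt^2),$$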

(b) and, for any $\varepsilon\in(0,1)$,

$$\mathbb{P}\{\hat\mu<(1-\varepsilon)\mu\} < \exp(-\varepsilon^2 n\mu/2).$$
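To get a feel for part (a), the sketch below evaluates the two-sided Hoeffding bound and inverts it to obtain a sufficient sample size. The helper names `hoeffding_two_sided` and `samples_needed` are hypothetical and introduced only for this illustration.

```python
import math


def hoeffding_two_sided(n: int, t: float) -> float:
    """Right-hand side of part (a): 2 * exp(-2 * n * t^2)."""
    return 2.0 * math.exp(-2.0 * n * t * t)


def samples_needed(t: float, delta: float) -> int:
    """Smallest n for which the bound of part (a) is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * t * t))
```

For instance, `samples_needed(0.1, 0.05)` asks how many samples in $[0,1]$ make a deviation of more than 0.1 have bounded probability at most 0.05. Taking $n=2^i$ samples and $t=\sqrt{g/2^i}$ gives the value $2\exp(-2g)$, which is the instance of part (a) used later in the proof of Lemma 3.4.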

## 2 Proof of Theorem 1.1

Each of the players follows the same algorithm, which has four phases, described next. Note that the phases are not synchronized, that is, each phase may have different starting and stopping times for each player. Let .

1. The player pulls arms uniformly at random, and maintains an estimate for the mean of each arm: the estimate for arm $i$ is the average reward received from arm $i$, divided by $(1-1/K)^{m-1}$. Note that $(1-1/K)^{m-1}$ is precisely the probability of not getting a conflict for each pull provided that the other players are also pulling arms uniformly at random, hence this is indeed an unbiased estimate for $\mu_i$. In other words, for any round $t$ in which arm $i$ is pulled and reward $r(t)$ is received, since conflicts and rewards are independent we have

$$\mu_i=\mathbb{E}Y_{i,t}=\mathbb{E}r(t)/\mathbb{E}\bigl(1-C_i(t)\bigr)=\mathbb{E}r(t)/(1-1/K)^{m-1}.$$

For each round $t$, the player maintains a sorted list of estimated means; let $\tau$ be the first round $t$ in which the gap between the $m$th and the $(m+1)$th largest estimates is at least $3\sqrt{g/t}$. The first phase finishes at the end of round $\tau$. By this time, the player has learned the best $m$ arms with high probability (as we prove later), and so has a list of $m$ arms with the highest means.

2. For $24\tau$ rounds, the player just pulls arms uniformly at random, without updating the estimates.

3. The player plays a so-called Musical Chairs algorithm until she occupies an arm: in each round, she pulls a uniformly random arm $i$ from her list of the $m$ best arms. If she gets a positive reward (which means no other player has pulled arm $i$), we say the player has ‘occupied’ arm $i$, and this phase is finished for the player. Note that, by construction, at most one player will occupy any given arm.

4. The player pulls the occupied arm forever.

The pseudocode is shown as Algorithm 1. We next analyze the regret of this algorithm, starting with some preliminary lemmas.
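The bookkeeping of Phase 1 can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function names are hypothetical, the parameter `g` is passed in because its definition does not survive in this text, the threshold $3\sqrt{g/t}$ is read off from the proof of Corollary 2.2, and treating the missing $(m+1)$th estimate as 0 when $K=m$ is an assumption of the sketch.

```python
import math
from typing import List, Sequence


def no_collision_prob(K: int, m: int) -> float:
    """Probability that none of the other m-1 uniform players hits a given arm."""
    return (1.0 - 1.0 / K) ** (m - 1)


def debiased_means(reward_sums: Sequence[float], pull_counts: Sequence[int],
                   K: int, m: int) -> List[float]:
    """Average observed reward per arm, divided by the no-collision probability."""
    p = no_collision_prob(K, m)
    return [(s / n) / p if n > 0 else 0.0
            for s, n in zip(reward_sums, pull_counts)]


def phase1_can_stop(estimates: Sequence[float], m: int, g: float, t: int) -> bool:
    """True once the m-th and (m+1)-th largest estimates differ by 3*sqrt(g/t)."""
    ranked = sorted(estimates, reverse=True)
    next_best = ranked[m] if len(ranked) > m else 0.0  # assumption when K == m
    return ranked[m - 1] - next_best >= 3.0 * math.sqrt(g / t)
```

Dividing by `no_collision_prob(K, m)` is what turns the raw average, which is deflated by collisions, into an unbiased estimate, and this only holds while all other players are still pulling uniformly at random.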

(Algorithm 1: pseudocode not available in this version.)

###### Lemma 2.1.

Suppose . Consider any fixed player and let denote her estimated mean for arm after rounds of Phase 1. Then we have

$$\mathbb{P}\bigl\{\exists i\in[K],\,t\in[\tau]:\ |\hat\mu_{i,t}-\mu_i|>\sqrt{g/t}\bigr\}<3KT\exp\bigl(-g/(128K)\bigr).$$

###### Proof.

Fix an arm . Observe that , so we have deterministically, so for we have . Now fix a . Let denote the number of times this player has pulled arm by round , which is a binomial random variable with mean , hence Proposition 1.4(b) implies . Thus, the union bound gives

 P{|ˆμi,t−μi|>√g/t}√g/t∣∣Ti(t)=s}.

Also, conditioned on any , is the difference between an empirical average of i.i.d. random variables bounded in and their expected value, thus Proposition 1.4(a) gives

$$\mathbb{P}\bigl\{|\hat\mu_{i,t}-\mu_i|>\sqrt{g/t}\ \big|\ T_i(t)=s\bigr\}<2\exp\bigl(-sg/(8t)\bigr),$$

giving

 P{|ˆμi,t−μi|>√g/t}

We now apply a union bound over to get

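$$\begin{aligned}
\mathbb{P}\bigl\{\exists t\in[\tau]:\ |\hat\mu_{i,t}-\mu_i|>\sqrt{g/t}\bigr\}
&<\Bigl(\sum_{t=\lceil g/16\rceil}^{T}\exp\bigl(-t/(8K)\bigr)\Bigr)+2T\exp\bigl(-g/(16K)\bigr)\\
&<17\exp\bigl(-g/(128K)\bigr)+2T\exp\bigl(-g/(16K)\bigr)\\
&<3T\exp\bigl(-g/(128K)\bigr),
\end{aligned}$$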

since and . Applying a union bound over the arms concludes the proof of the lemma. ∎

###### Corollary 2.2.

With probability at least , the following are true:
(i) all players have learned the best $m$ arms by the end of their Phase 1,
(ii) we have

$$g/\Delta^2\le\tau\le 25g/\Delta^2$$

for all players, and
(iii) the first two phases are finished for all players after at most many rounds.

###### Proof.

By the choice of , Lemma 2.1 and a union bound over the players, we have that with probability at least , all players have mean estimates that are -close to the actual means, uniformly for all arms and all . If this event holds then the three parts follow.

Part (i) follows by noting that a player would stop Phase 1 when she has found a gap of size $3\sqrt{g/\tau}$ between the $m$th and the $(m+1)$th arm. However, by this time she has learned the means of all arms within an error of $\sqrt{g/\tau}$, therefore by the triangle inequality, she has correctly determined that the $m$th mean is larger than the $(m+1)$th mean, whence she has learned the best $m$ arms.

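For part (ii), using the triangle inequality and the definition of $\tau$ we have

$$3\sqrt{g/\tau}\le\hat\mu_{m,\tau}-\hat\mu_{m+1,\tau}\le(\hat\mu_{m,\tau}-\mu_m)+(\mu_m-\mu_{m+1})+(\mu_{m+1}-\hat\mu_{m+1,\tau})\le\sqrt{g/\tau}+\Delta+\sqrt{g/\tau},$$

whence $\tau\ge g/\Delta^2$. On the other hand, for $t\ge 25g/\Delta^2$,

$$\hat\mu_{m,t}-\hat\mu_{m+1,t}\ge(\mu_m-\mu_{m+1})-|\hat\mu_{m,t}-\mu_m|-|\mu_{m+1}-\hat\mu_{m+1,t}|\ge\Delta-2\sqrt{g/t}\ge 3\sqrt{g/t},$$

whence $\tau\le 25g/\Delta^2$.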

Part (iii) follows from part (ii) and noting that the duration of Phase 2 is . ∎
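The two ‘whence’ steps in part (ii) are short rearrangements, spelled out here under the reading that the Phase 1 stopping threshold is $3\sqrt{g/\tau}$. For the lower bound, subtracting $2\sqrt{g/\tau}$ from both ends of the first chain gives

$$3\sqrt{g/\tau}\le\Delta+2\sqrt{g/\tau}\ \Longrightarrow\ \sqrt{g/\tau}\le\Delta\ \Longrightarrow\ \tau\ge g/\Delta^2.$$

For the upper bound, any $t\ge 25g/\Delta^2$ satisfies $\sqrt{g/t}\le\Delta/5$, so

$$\Delta-2\sqrt{g/t}\ \ge\ 5\sqrt{g/t}-2\sqrt{g/t}\ =\ 3\sqrt{g/t},$$

which means the stopping condition is already met at such a round $t$, and hence $\tau$ cannot exceed $25g/\Delta^2$ (up to rounding to an integer round).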

The curious reader may wonder about the role of Phase 2 and ask why a player cannot proceed to Phase 3 right after she has learned the best $m$ arms. The reason is to help other players find the best arms. Indeed, it is possible that a player finishes Phase 1 by round $g/\Delta^2$, but the algorithm asks her to continue pulling arms at random, so that the other players continue to have unbiased estimators for the means, for at least $24g/\Delta^2$ many more rounds, at which point we are guaranteed that all players have finished their Phase 1. Otherwise, if a player switches to Phase 3 too quickly, then this would skew the collision probabilities, and the other players will not have unbiased estimators of the means.

We now proceed to analyze Phase 3, the musical chairs subroutine. By this point all players have learned the best arms, hence they just want to share these arms between themselves as quickly as possible. Note that by definition of the subroutine, once this phase is finished, each player has occupied a distinct arm.
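A small simulation can illustrate the musical chairs phase. This is a sketch under simplifying assumptions that are not part of the text: rewards are Bernoulli, all players start Phase 3 in the same round, and all players share the same list of $m$ best arms (indexed from 0). The function name and the `max_rounds` safeguard are hypothetical.

```python
import random
from typing import List, Optional


def musical_chairs(means: List[float], rng: random.Random,
                   max_rounds: int = 100_000) -> List[Optional[int]]:
    """Simulate m = len(means) players competing for the m best arms.

    Returns, for each player, the arm she occupies (or None if she never
    succeeded within max_rounds).
    """
    m = len(means)
    occupied: List[Optional[int]] = [None] * m
    for _ in range(max_rounds):
        if all(arm is not None for arm in occupied):
            break
        # Players holding an arm keep pulling it; the others pull uniformly.
        choices = [arm if arm is not None else rng.randrange(m)
                   for arm in occupied]
        for player, arm in enumerate(choices):
            if occupied[player] is not None:
                continue
            alone = choices.count(arm) == 1
            # A positive reward requires no collision and a successful draw.
            if alone and rng.random() < means[arm]:
                occupied[player] = arm
    return occupied
```

Because an occupied arm is pulled by its owner in every round, any newcomer pulling it collides and receives zero, so no two players can end up on the same arm.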

###### Lemma 2.3.

With probability at least , Phase 3 takes at most many rounds for all players.

###### Proof.

We use the fact that each reward is bounded in $[0,1]$, hence $\mathbb{P}\{Y_{i,t}>0\}\ge\mu_i$. Fix any player in her Phase 3 who has not occupied an arm, and suppose there are still $a$ unoccupied arms available for her. (There are $m$ players, and each occupies at most 1 arm, hence $a\ge 1$.) Whenever she tries to occupy an unoccupied arm, her success probability is at least

$$\frac{a}{m}\,\Delta\,(1-1/m)^{m-a}\ \ge\ \Delta/(4m).$$

Here, $a/m$ is the probability that she pulls an unoccupied arm, $\Delta$ is a lower bound for the probability that that arm produces a positive reward, and $(1-1/m)^{m-a}$ is the probability that none of the other players pull that arm. Hence, the probability that the player has not occupied an arm after attempts can be bounded by . Letting makes this probability . Applying the union bound over all players completes the proof. ∎

###### Proof of Theorem 1.1.

By Corollary 2.2 and Lemma 2.3, with probability at least , the first three phases finish for all players after at most many rounds, and after this time, each player has occupied one of the best $m$ arms, and different players have occupied distinct arms. During each round, the regret is at most $m$, hence the total regret incurred during the first three phases is bounded by , and the regret afterwards would be 0. On the other hand, with the remaining probability, the regret is at most $mT$. Therefore, the expected regret is at most $O(mK\log(KT)/\Delta^2)$ as required. The $\log(KT)$ can be replaced with $\log(T)$, noting that

$$\min\bigl\{mT,\ O\bigl(mK\log(KT)/\Delta^2\bigr)\bigr\}=O\bigl(mK\log(T)/\Delta^2\bigr).$$

(Similar replacements of $\log(KT)$ with $\log(T)$ will also be done in a few other places in the following.) ∎

## 3 Proof of Theorem 1.2

### 3.1 The modified musical chairs subroutines

We need a modified version of the musical chairs algorithm, which we call MusicalChairs2. For any ‘target’ set of arms and any number of rounds, this subroutine consists of precisely rounds as follows: in each round the player pulls a uniformly random arm . If she gets a positive reward, and , then she occupies arm , pulls arm for the remaining rounds, and outputs . Otherwise she tries again. If after rounds she has not occupied any arm, she outputs 0. See the pseudocode below. The following lemma bounds the failure probability of this subroutine.

(Pseudocode of MusicalChairs2: not available in this version.)
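The verbal description of MusicalChairs2 can be sketched as follows. This is an illustration, not the paper's pseudocode: the callback `pull`, which plays an arm and returns the observed reward, is a hypothetical stand-in for the environment, arms are numbered $1,\dots,K$ so that 0 can signal failure, and sampling uniformly from all $K$ arms is an assumption suggested by the factor $a/K$ in the proof of Lemma 3.1.

```python
import random
from typing import Callable, Set


def musical_chairs2(target: Set[int], rounds: int, K: int,
                    pull: Callable[[int], float], rng: random.Random) -> int:
    """Run exactly `rounds` pulls; return the occupied arm, or 0 on failure."""
    occupied = 0
    for _ in range(rounds):
        if occupied:
            pull(occupied)          # keep the arm for the remaining rounds
            continue
        arm = rng.randint(1, K)     # uniform over arms 1..K (assumption)
        reward = pull(arm)
        # Occupy only on a positive reward from an arm in the target set.
        if reward > 0 and arm in target:
            occupied = arm
    return occupied
```

The subroutine always consumes the full budget of rounds, even after success, which keeps players that run it in parallel synchronized.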

###### Lemma 3.1.

Let be arbitrary. Suppose a player executes MusicalChairs2(), while any other player either has occupied an arm, or is executing MusicalChairs2, or is pulling arms uniformly at random during these rounds. We say the player is ‘successful’ if after the execution of the subroutine she has occupied an arm, or each arm in with mean is occupied by other players. The probability of ‘failure’ is upper bounded by if , and by in general.

###### Proof.

At any round during the subroutine, suppose the player has not occupied an arm and that there are still unoccupied arms of mean in . Whenever she tries to occupy one of her target arms, her success probability is at least

$$\frac{a}{K}\,\mu\,(1-1/K)^{m-1}\ \ge\ \mu\exp(-2m/K)/K.$$

Here, $a/K$ is the probability that she pulls an unoccupied arm in her target set with mean $\ge\mu$, $\mu$ is a lower bound for the probability that that arm produces a positive reward, and $(1-1/K)^{m-1}$ is the probability that none of the other players pull the same arm. (Note that her success probability may indeed be larger than this, because she may also occupy arms in her target set with mean $<\mu$.) Hence, the probability that she has not occupied an arm after attempts can be bounded by . The argument when is identical, except we would use the tighter bound . ∎

The following corollary provides a guarantee when many players execute MusicalChairs2 in parallel. The proof is via applying a union bound over the participating players.

###### Corollary 3.2.

Suppose a subset of players execute MusicalChairs2 for the same number of rounds, but with potentially different target sets, while the other players are either pulling random arms or have occupied arms during these rounds. We say the subroutine is successful if all players are successful. The probability that the subroutine fails can be bounded by if , and in general.

In the stronger feedback model in which the players observe the collision information, we modify the MusicalChairs2 algorithm such that for a player to occupy an arm, instead of receiving a positive reward, she would look at the collision information, and would occupy the arm if there was no collision. Call this subroutine MusicalChairs3. We get the following corollary for the failure probability, whose statement and proof are identical to that for Corollary 3.2, except there is no parameter .

(Pseudocode of MusicalChairs3: not available in this version.)

###### Corollary 3.3.

Consider the stronger feedback model with collision information available. Suppose a subset of players execute MusicalChairs3 for the same number of rounds, but with potentially different target sets, while the other players are either pulling random arms or have occupied arms during these rounds. We say the subroutine is successful if all players are successful. The probability that the subroutine fails is at most if , and in general.

### 3.2 The whole algorithm

We focus on proving part (a), and then explain how the algorithm should be modified to prove parts (b) and (c). Recall that $\mu$ is a lower bound for $\mu_m$ that all players know in advance.

We describe the algorithm each player executes, first informally and then formally. The player maintains estimates $\hat\mu_j$ for the means, which get closer to the actual means as the algorithm proceeds. She also keeps a ‘confidence interval’ for each arm $j$, which is centred at $\hat\mu_j$ and has the property that $\mu_j$ lies in this interval with sufficiently high probability. If arm has been pulled times, this interval has length . Once the player makes sure that some arm is not among the best $m$ arms, she marks it as bad and puts it in a set . This can happen either if it is determined that the arm has mean $<\mu$, or if it is determined that there are at least $m$ arms whose confidence intervals lie strictly above this arm’s interval (we say interval lies strictly above if ). On the other hand, once the player makes sure that some arm is within the best $m$ arms, she marks it as a ‘golden’ arm, and puts it in a set . More precisely, this would happen as soon as there are at least $K-m$ arms that are either determined to be bad, or whose confidence intervals lie strictly below this arm. Other arms (whose status is yet unknown) are called ‘silver’ arms and kept in a set .

Initially all arms are silver. The algorithm proceeds in epochs with increasing lengths. In each epoch, the player explores all the silver arms and hopes to characterize each silver arm as golden or bad at the end of the epoch. As time proceeds, arms whose means are far away from the $m$th arm will be characterized as either golden or bad. Golden arms will be occupied quickly, and bad arms will not be pulled again, and this will control the algorithm’s regret.
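The informal golden/silver/bad rule can be sketched as a function of the current confidence intervals. This covers only the informal description above: the full algorithm imposes further conditions (for example, an extra condition for golden arms mentioned in the proof) that are not reproduced here, and the names `classify` and `mu_lower`, as well as the test "upper endpoint below `mu_lower`" for a mean determined to be below the known lower bound, are assumptions of this sketch.

```python
from typing import List, Tuple


def classify(intervals: List[Tuple[float, float]], m: int,
             mu_lower: float) -> List[str]:
    """Label each arm 'golden', 'bad' or 'silver' from its (low, high) interval."""
    K = len(intervals)
    bad = []
    for low, high in intervals:
        # Number of intervals lying strictly above this arm's interval.
        above = sum(1 for other_low, _ in intervals if other_low > high)
        bad.append(high < mu_lower or above >= m)
    labels = []
    for k, (low, high) in enumerate(intervals):
        if bad[k]:
            labels.append("bad")
            continue
        # Arms already bad, or whose interval lies strictly below this one.
        dominated = sum(1 for j, (_, other_high) in enumerate(intervals)
                        if j != k and (bad[j] or other_high < low))
        labels.append("golden" if dominated >= K - m else "silver")
    return labels
```

With well-separated intervals every arm is resolved at once, while identical overlapping intervals leave every arm silver, which is the case in which further exploration is needed.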

Special care is needed to ensure all players explore all the silver arms without conflicts, and this is done via careful sequences of MusicalChairs2 subroutines. In each epoch, each player maintains a set of explored arms, which is empty when the epoch starts. The epoch has iterations. In each iteration, if there exists some arm in (i.e., an unexplored silver arm), the player tries to occupy such an arm; otherwise, the player has finished exploring the arms in , and she will try to occupy and pull an arbitrary arm from , while other players are exploring their silver arms. Say an arm is $\mu$-good if its mean is at least $\mu$, and is $\mu$-bad otherwise. Note that by assumption, any $\mu$-bad arm is bad. The lengths of the MusicalChairs2 subroutines are chosen such that any $\mu$-good arm in , which is not marked as golden by any other player, will be explored in each epoch by each player. Thus, if an arm is not explored by the end of the epoch (that is, if it lies in ), that would mean that either the arm is $\mu$-bad or the arm is golden and is occupied by another player in the beginning of the epoch. The two cases can be distinguished by checking the empirical reward received from that arm.

The complete pseudocode appears as Algorithm 2 below. Note that this algorithm is synchronized: for all players the epochs and the iterations within the epochs begin and end at the same round.

To analyze the regret of the algorithm, we first define two bad events. The first bad event is that some of the MusicalChairs2 subroutines fail, and the second bad event is that some player’s confidence interval is incorrect, that is, the actual mean does not lie in the confidence interval. We start by bounding the probability of the bad events. Let and .

###### Lemma 3.4.

The probability that some bad event happens is at most .

###### Proof.

The probability that some MusicalChairs2 subroutine fails is bounded by by Corollary 3.2. Applying the union bound over the (at most ) epochs and the times MusicalChairs2 is executed in each epoch, gives the probability that some MusicalChairs2 subroutine fails is at most

 3Km×T×mexp(−αM/4K)≤1/2mT,

by the choice of .

Whenever a confidence interval for some arm is updated in some epoch (Line 2), the arm has been pulled precisely times right before that (Line 2). The probability that some confidence interval is incorrect for some player, say in epoch , is hence bounded via Proposition 1.4(a) by

$$\mathbb{P}\bigl\{|\hat\mu_j-\mu_j|>\sqrt{g/2^i}\bigr\}<2\exp\bigl(-2\times 2^i g/2^i\bigr)=2\exp(-2g).$$

Now applying the union bound over the players, the (at most ) epochs, and the number of updating of the confidence intervals within each epoch, gives the probability of some incorrect confidence interval is at most

$$m\times T\times Km\times 2\exp(-2g)\le 1/(2mT),$$

by the choice of , as required. ∎
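To see what the last inequality demands of $g$ (whose definition is not reproduced in this text), one can solve it directly. This is a sufficient condition derived from the display above, not necessarily the authors' exact choice:

$$m\cdot T\cdot Km\cdot 2\exp(-2g)\le\frac{1}{2mT}
\iff \exp(-2g)\le\frac{1}{4Km^3T^2}
\iff g\ge\tfrac12\log\bigl(4Km^3T^2\bigr).$$

Since $m\le K$ by Assumption 1, any such threshold is $O(\log(KT))$, which is consistent with the $\log(KT)$ factors in the final regret bounds.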

We are now ready to prove Theorem 1.2(a).

###### Proof of Theorem 1.2(a).

We bound the regret assuming no bad events happen, and the bound for the expected regret follows as in the proof of Theorem 1.1.

Note that each epoch has two types of rounds: estimation rounds (Line 2) in which each arm is pulled by at most one player, during which she updates her estimate of its mean, and other rounds during which some of the players are executing MusicalChairs2, hence we call them MusicalChairs2 rounds.

Observe that, since there are at least -good arms, we always have and since the MusicalChairs2 subroutines are always successful, there can be no conflict during the estimation rounds.

The first claim is the following: consider a player that has just executed her Line 2 in epoch , and has not occupied a golden arm by end of epoch . Consider also a -good arm that is silver, suppose this arm is not occupied by another player as a golden arm in their Line 2. Then the claim is that the player will pull arm at least times during epoch Line 2, and it will be put in at the end of the iterations. To see this, note that the epoch has iterations. In each iteration, if the player has any unexplored silver arm, in the first rounds attempts to occupy one of those (Line 2), while the other players pull random arms. By Lemma 3.5 below and since the MusicalChairs2 subroutines are successful, after the iterations, each player has explored any such arm . Therefore, the confidence interval of each such arm will have length .

The second claim is that if no bad event occurs, then the algorithm never makes a mistake in characterizing the arms as golden/bad. First, the characterizations based on confidence intervals (Lines 22) are correct because all confidence intervals are correct. Next note that if on Line 2, then is not explored, and that can be because of one of two reasons: first, its mean may be smaller than , hence it is not occupied during the MusicalChairs2 subroutines. Or, it may be a golden arm occupied by another player on her Line 2. In the latter case, let be the average reward received from this arm by that other player. Suppose the arm was marked as golden by the other player in epoch . Then we must have (see Line 2). This implies

$$\mu_j\ \ge\ \hat\mu'_j-\sqrt{g/2^{i'}}\ >\ \mu+2\sqrt{g/2^{i'}}\ \ge\ \mu+2\sqrt{g/2^{i-1}}.$$

On the other hand, since was silver at the end of epoch , we have , hence Line 2 is executed and the player marks as golden. If this latter case has not happened, we are in the first case, so because the confidence intervals are correct, lies in the confidence interval for arm , which has length . This means , so Line 2 is executed and the player correctly marks as bad.

The third claim is that any arm with mean has been marked as bad by all players by the end of epoch . Let be such an arm and suppose we are at the end of epoch . By definition of confidence intervals, it suffices to show there exist at least arms such that either or . In fact this holds for all , since for any such , either , or , which implies .

The fourth claim, whose proof is similar to the third claim, is that any arm with mean has been marked as golden by all players by the end of epoch . The only difference is the extra condition , which is satisfied by any such arm, since by correctness of confidence intervals.

Now we bound the algorithm’s regret. First observe that the number of epochs is fewer than . The number of iterations per epoch is , whence the total number of MusicalChairs2 rounds can be bounded by . Let us now bound the regret of the estimation rounds. The regret of the first epoch can be bounded by . Next note that any arm with mean has been put in by the end of epoch by all players (by the fourth claim above), and so some player occupies it in the beginning of epoch . During epoch , each of the remaining players pulls either a silver or a golden arm, which are at most away from the best available arms (by the third claim above). Since the probability that some bad event happens is (Lemma 3.4), and in this case the total regret can be bounded by , the total expected regret can be bounded by

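$$mT \times (1/mT) + 10mK\alpha\log(T) + 2Km + m \times \sum_{i=2}^{\lceil \log_2(T) \rceil} \left( 2K \times 2^i \times 18\sqrt{g/2^{i-1}} \right) = O\left( K^2 m \log^2(KT)/\mu + Km\sqrt{T\log(KT)} \right).$$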
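As a sanity check on the square-root term, note that the sum over epochs in the display above is geometric. The following is only a sketch: it drops the constant $18$ and the factor $2Km$, and writes $n = \lceil \log_2(T) \rceil$ (a shorthand introduced here), so that $2^n < 2T$.

$$\sum_{i=2}^{n} 2^i \sqrt{g/2^{i-1}} = \sqrt{2g} \sum_{i=2}^{n} 2^{i/2} \le \sqrt{2g} \cdot \frac{2^{(n+1)/2}}{\sqrt{2}-1} < \frac{2\sqrt{2g}}{\sqrt{2}-1} \sqrt{T} = O\left(\sqrt{gT}\right).$$

If $g$ is of order $\log(KT)$, which is what the stated bound suggests (an assumption at this point, since $g$ is defined earlier in the paper), this matches the $Km\sqrt{T\log(KT)}$ term.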

Recall that . Let be the smallest integer such that . So after the first epochs, any remaining silver arm would have mean precisely , and the regret will be zero after epoch . The algorithm’s regret can be alternatively bounded by

$$mT \times (1/mT) + 10Km\alpha\log(T) + 2Km + \sum_{i=2}^{j} 8\sqrt{g/2^{i-1}}\,(2Km)\,2^i = O\left( K^2 m \log^2(KT)/\mu + Km\log(KT)/\Delta' \right).$$ ∎

The following lemma is the last piece in completing the proof of Theorem 1.2(a).

###### Lemma 3.5.

Fix an epoch and suppose all MusicalChairs2 subroutines of Line 2 are successful. Then, during the iterations of the epoch, each player will occupy each -good silver arm at least once.

###### Proof.

Build a bipartite graph with one part being the players and the other part the arms, with an edge between a player and an arm if the arm is silver for that player. Say an edge is good if the corresponding arm is -good. Say two edges are neighbours if they share a vertex, and the degree of an edge is its number of neighbours. Initially, the degree of each edge is at most . Observe that, whenever the MusicalChairs2 subroutine in Line 2 is executed, the set of edges corresponding to players and their occupied arms forms a matching in this graph, that is, a set of edges such that no two of them are neighbours. Moreover, since the MusicalChairs2 subroutine is successful by assumption, this matching has the property that, for any good edge , either or some neighbour of lies in . After the execution of this subroutine, we delete this matching from the graph, and hence the degree of each good edge decreases by 1. In particular, the maximum degree of good edges decreases by 1, and once this maximum degree is 0, in the next iteration all good edges will be deleted. Therefore, after at most iterations, all good edges will be deleted, as required. ∎
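The matching-deletion argument can be checked on small examples. The sketch below is illustrative and not part of the paper's algorithm: it treats every edge as good, and it replaces a successful MusicalChairs2 call with a greedy maximal matching, which has the same property used in the proof (every edge is in the matching or has a neighbour in it). All function names are hypothetical.

```python
import random


def edge_degree(edge, edges):
    """Number of other edges sharing a player or an arm with `edge`."""
    p, a = edge
    return sum(1 for (q, b) in edges if (q, b) != edge and (q == p or b == a))


def maximal_matching(edges, rng):
    """Greedy maximal matching: every edge is in it or neighbours an edge in it.

    Stand-in (assumption) for the set of (player, occupied arm) pairs after
    one successful MusicalChairs2 call.
    """
    order = sorted(edges)
    rng.shuffle(order)
    used_players, used_arms, matching = set(), set(), set()
    for p, a in order:
        if p not in used_players and a not in used_arms:
            matching.add((p, a))
            used_players.add(p)
            used_arms.add(a)
    return matching


def iterations_to_delete_all(edges, rng):
    """Repeatedly delete a maximal matching; return the number of iterations."""
    remaining = set(edges)
    iterations = 0
    while remaining:
        remaining -= maximal_matching(remaining, rng)
        iterations += 1
    return iterations


if __name__ == "__main__":
    rng = random.Random(0)
    # 3 players, 5 arms, every arm silver for every player.
    edges = {(p, a) for p in range(3) for a in range(5)}
    max_degree = max(edge_degree(e, edges) for e in edges)
    print(iterations_to_delete_all(edges, rng), "<=", max_degree + 1)
```

The number of iterations never exceeds the initial maximum edge degree plus one, which is the counting step the proof relies on.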

The proof of Theorem 1.2(b) is identical, except we would choose in the algorithm, replace MusicalChairs2 with MusicalChairs3, and use Corollary 3.3 instead of Corollary 3.2.

###### Proof of Theorem 1.2(c).

The algorithm would be similar, except that if a player has not occupied an arm when she wants to start an estimation period, she would simply leave the game and never pull any arm again. To be more precise, add the following line before line 2: ‘if then leave the game.’ This could happen if there are fewer than many -good arms, and so players may fail to find and occupy an arm. Suppose of the best arms are -bad. Once players have left the game, we will have players and many -good arms, so the algorithm will work as in part (a), and the same analysis works. We would only lose a reward of at most , due to players who have left the game. ∎

## 4 Relaxing the assumptions

### 4.1 Unknown time horizon

If is not known, we can apply a simple doubling trick: we execute the algorithm assuming , then we execute it assuming , and so on, until the actual time horizon is reached. If the expected regret of the algorithm for a known time horizon is , then the expected regret of the modified algorithm for unknown time horizon would be For example, if the players have the option of leaving the game, we can apply Theorem 1.2(c) with to get the regret upper bound

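$$R'(T) \le \sum_{i=0}^{\lfloor \log_2(T) \rfloor} O\left( Km\log(2^i)\sqrt{2^i} \right) \le O(Km\log(T)) \times \sum_{i=0}^{\lfloor \log_2(T) \rfloor} O\left( 2^{i/2} \right) \le O\left( Km\sqrt{T}\log(T) \right),$$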

which is within a constant multiplicative factor of the upper bound for .
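The doubling trick and the geometric sum behind the last inequality can be illustrated numerically. This is a minimal sketch assuming the assumed horizons are the powers of two $2^0, 2^1, \dots$, as the summation index in the display suggests; the function names are hypothetical.

```python
import math


def doubling_schedule(T):
    """Return (assumed_horizon, rounds_played) pairs covering exactly T rounds.

    The i-th restart assumes horizon 2**i; the last one is cut short when the
    true horizon T is reached.
    """
    schedule, played, i = [], 0, 0
    while played < T:
        rounds = min(2 ** i, T - played)
        schedule.append((2 ** i, rounds))
        played += rounds
        i += 1
    return schedule


def geometric_sum(T):
    """Sum of 2**(i/2) for i = 0, ..., floor(log2(T)), for an integer T >= 1."""
    top = T.bit_length() - 1  # equals floor(log2(T)) for integers T >= 1
    return sum(2 ** (i / 2) for i in range(top + 1))


if __name__ == "__main__":
    T = 1000
    constant = math.sqrt(2) / (math.sqrt(2) - 1)
    print(doubling_schedule(T))
    print(geometric_sum(T), "<=", constant * math.sqrt(T))
```

The sum of $2^{i/2}$ stays below $\frac{\sqrt{2}}{\sqrt{2}-1}\sqrt{T}$, so restarting costs only a constant factor in the $\sqrt{T}$ term.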

### 4.2 Other reward distributions

In Theorems 1.1 and 1.2 we assumed the rewards are supported on . We used this assumption in three ways: first, we used that the expected regret incurred in any round can be bounded by , second, that the rewards satisfy the Chernoff-Hoeffding concentration inequality (Proposition 1.4(a)), and third, for correctness proofs of the MusicalChairs1,2 subroutines we used the fact that for any random variable .

A random variable is -sub-Gaussian if , for example a standard normal random variable is -sub-Gaussian. The first two facts hold for -sub-Gaussian random variables whose means lie in a bounded interval (with appropriate adjustments). For the proofs, see Vershynin (2018, Section 2.5). The third fact also holds up to a logarithmic factor, see Lemma 4.1 below. Hence, our main theorems can be readily extended to such distributions, with appropriate adjustments.

###### Lemma 4.1.

Let be a random variable with mean that satisfies . Then we have .

###### Proof.

By dividing by we may assume . Let be a parameter to be chosen later, and define and . Note that and . We next write as

$$\mathbf{E}Y = \int_{0}^{t+\mu} \mathbf{P}\{Y>s\}\,ds + \int_{t+\mu}^{\infty} \mathbf{P}\{Y>s\}\,ds.$$

For , we have if and only if if and only if , whence . For the second term we have

$$\int_{t+\mu}^{\infty} \mathbf{P}\{Y>s\}\,ds < \int_{t}^{\infty} \exp(-s^2/2)\,ds.$$

Consequently,

$$\mu = \mathbf{E}Y + \mathbf{E}Z < (t+\mu+1/2t)\exp(-t^2/2) + \mathbf{P}\{X>0\}\,(t+\mu),$$

which implies

$$\mathbf{P}\{X>0\} > \frac{\mu - (t+\mu+1/2t)\exp(-t^2/2)}{t+\mu}.$$

Now, if then setting gives that the right-hand side is greater than . (Here, we have used the inequality which holds for all .)

On the other hand, if , setting gives that the right-hand side is greater than , as required. (Here, we have used the inequality , which holds for any .) ∎

### 4.3 More players than arms

If , the term in the definition of regret (1) is not well defined, hence we must redefine the regret. There are two natural ways to do this.

#### 4.3.1 Original model

In the original model, the best strategy of the players had they known the means would be for of them to occupy the best arms, and for the rest to occupy the worst arm; so the regret in this case can be defined as

$$\mathrm{Regret} = T\sum_{i\in[K-1]} \mu_i - \sum_{t\in[T]} \sum_{j\in[m]} \mu_{A_j(t)} \left( 1 - C_{A_j(t)}(t) \right).$$
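To make this definition concrete, the sketch below evaluates it on a recorded sequence of pulls. It assumes that the benchmark sums the $K-1$ largest means and that a player collects the mean of her arm exactly when no other player pulled it in that round; the function name and the data layout are hypothetical.

```python
from collections import Counter


def regret_more_players_than_arms(means, pulls):
    """Regret in the original model when there are more players than arms.

    means: list of the K arm means.
    pulls: pulls[t][j] is the index of the arm pulled by player j in round t.
    """
    K = len(means)
    T = len(pulls)
    # Benchmark: K - 1 players hold the K - 1 best arms, the rest collide.
    benchmark = T * sum(sorted(means, reverse=True)[:K - 1])
    collected = 0.0
    for round_pulls in pulls:
        counts = Counter(round_pulls)
        # An arm pays out only if exactly one player pulled it (no collision).
        collected += sum(means[a] for a in round_pulls if counts[a] == 1)
    return benchmark - collected


if __name__ == "__main__":
    means = [0.9, 0.5, 0.1]
    # Four players: two hold the best arms, two collide on the worst arm.
    optimal = [[0, 1, 2, 2]] * 10
    print(regret_more_players_than_arms(means, optimal))
```

Under these assumptions, the profile in which the surplus players all collide on the worst arm has regret zero, which is the benchmark the definition encodes.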

Let . We give an algorithm with regret .

The algorithm is similar to Algorithm 1. Let be the probability of no-conflict, when the players pull arms uniformly at random, and let , for a sufficiently large constant . Each player pulls arms randomly until at some round she finds a gap of between the th and th arm, and continues for more rounds to make sure that all others have also found this gap. An argument similar to that of Corollary 2.2 gives that these two phases will take many rounds. Moreover, each player has learned that and that (see Corollary 2.2(ii)). Then the player executes MusicalChairs2 on the set of best arms, for many rounds, for a large enough constant . Since , Lemma 3.1 implies that with probability at least all players will be successful, meaning that the best arms are occupied. After MusicalChairs2 is finished, if the player has occupied an arm she will pull it until the end of the game, otherwise she pulls the worst arm for the rest of the game. Thus the regret will be zero after at most