# Multi-Player Bandits -- a Musical Chairs Approach

We consider a variant of the stochastic multi-armed bandit problem, where multiple players simultaneously choose from the same set of arms and may collide, receiving no reward. This setting has been motivated by problems arising in cognitive radio networks, and is especially challenging under the realistic assumption that communication between players is limited. We provide a communication-free algorithm (Musical Chairs) which attains constant regret with high probability, as well as a sublinear-regret, communication-free algorithm (Dynamic Musical Chairs) for the more difficult setting of players dynamically entering and leaving throughout the game. Moreover, both algorithms do not require prior knowledge of the number of players. To the best of our knowledge, these are the first communication-free algorithms with these types of formal guarantees. We also rigorously compare our algorithms to previous works, and complement our theoretical findings with experiments.


## 1 Introduction

The stochastic multi-armed bandit (MAB) problem is a classic and well-studied setting of sequential decision-making, which exemplifies the dilemma of exploration vs. exploitation (see bubeck2012regret for a comprehensive review). In this problem, a player sequentially chooses from a set of actions, denoted as ‘arms’. At every round, each arm produces a reward sampled from some unknown distribution on [0,1], and the player receives the reward of the chosen arm, but does not observe the rewards of the other arms. The player’s goal, of course, is to maximize the cumulative reward. The dilemma of exploration vs. exploitation here is that the more the player ‘explores’ by trying different arms, the better her understanding of each arm’s expected reward; the more the player ‘exploits’ the arm which she thinks is best, the fewer rounds are wasted on exploring bad arms.
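As a toy illustration of the explore/exploit tradeoff in the single-player setting, the following sketch uses Bernoulli arms and a simple explore-then-exploit strategy (chosen for concreteness; this is not the paper's algorithm):

```python
import random

class BernoulliBandit:
    """K arms; arm k yields reward 1 with (unknown) probability means[k], else 0."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Only the chosen arm's reward is observed.
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

def explore_then_exploit(bandit, K, T, explore_rounds):
    """Explore round-robin, then commit to the empirically best arm."""
    counts, sums = [0] * K, [0.0] * K
    total = 0.0
    for t in range(T):
        if t < explore_rounds:
            arm = t % K  # round-robin exploration
        else:
            arm = max(range(K), key=lambda k: sums[k] / max(counts[k], 1))
        r = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

reward = explore_then_exploit(BernoulliBandit([0.2, 0.5, 0.9]), K=3, T=2000, explore_rounds=300)
print(round(reward))
```

Spending more rounds on exploration sharpens the mean estimates but leaves fewer rounds for exploiting the best arm, which is exactly the tradeoff described above.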

In this work, we study a variant of this problem, where there are many players who choose from the same set of arms. If two or more players choose the same arm, there is a ‘collision’ and no reward is provided by that arm. Moreover, we assume that players may not communicate. The goal is to find a distributed algorithm for the players that maximizes the sum of their rewards. One motivation for this setting (discussed in the Related Work section below) comes from the field of cognitive radio networks, where several users utilize the same set of channels, the quality of the different channels varies, and direct coordination between the users is not possible. We use the standard notion of (expected) regret to measure performance, namely the difference between the expected cumulative reward of the best static allocation of players to arms, and the expected cumulative rewards of the players.

We focus on a particularly challenging situation, where the players cannot communicate, there is no central control, and the players do not even know how many other players are participating. At every round, each player decides which arm to sample. After the round is over, the player receives the reward associated with the chosen arm, or an indication that the arm was chosen by at least one other player, in which case she receives no reward. The event that more than one player chooses the same arm will be referred to as a collision.

We will consider two variants in this work - a static setting in which all players start the game simultaneously and play for T rounds, and a dynamic setting, in which players may enter and exit throughout the game. Our main results are the following:

• For the static case we propose and analyze the Musical Chairs (MC) algorithm, which achieves, with high probability and assuming a fixed gap between the mean rewards, a constant regret independent of T.

• For the dynamic setting we propose the Dynamic Musical Chairs (DMC) algorithm, which achieves an Õ(√(xT)) regret (with high probability and assuming a fixed gap between rewards), where x is a bound on the total number of players entering and leaving throughout the game.

• We study the behavior of previous algorithms for this problem, and show that in the dynamic setting, there are some reasonable scenarios leading to their regret being linear in T. For other scenarios, we show that our regret guarantees improve on previous ones.

• We present several experiments which validate our theoretical findings.

All guarantees hold assuming all players implement the algorithm, but do not require any communication or coordination during the game.

The paper is organized as follows: Section 2 provides a formal description of the problem setting. Section 3 introduces the algorithms and regret analysis for the static and dynamic settings. Section 4 considers previous work, and studies scenarios where the regret performance differs substantially from our results. Section 5 presents some experiments. Section 6 provides concluding remarks, discussion, and open questions. Finally, Appendix A contains most of the proofs.

### Related Work

Most previous work on multi-player multi-armed bandits assumed that players can communicate, and included elements such as a negotiation phase or exact knowledge of the number of players, which remains fixed throughout the game, e.g. liu2009distributed; anandkumar2011distributed; kalathil2014decentralized. However, in modeling problems such as cognitive radio networks, where players may be unable or unwilling to coordinate, these are not always realistic assumptions. For example, the algorithm proposed in liu2009distributed relies on the players agreeing on a time-division schedule for sharing the best arms: each player stays on one of the best arms for a certain time period, and at the end of the time period, players switch. The algorithm requires all players to know the number of players, which needs to be fixed.

anandkumar2011distributed provide an algorithm which is communication and cooperation free, but requires knowledge of the number of players. To overcome this added requirement, the authors present an algorithm which estimates the number of players: each player forms an estimate based on the number of collisions seen so far, chooses one of the estimated best arms uniformly at random, and stays on that arm until a collision occurs, at which time the player repeats the procedure. The performance guarantees of this algorithm are rather vague, and do not hold for the dynamic setting. Rather than estimating the number of players directly, as the algorithm presented in this work does, their algorithm uses an estimation technique which converges to the correct number of players, given that the number of players is static. Another approach, which requires communication, is the algorithm proposed in kalathil2014decentralized, in which players negotiate in order to reach an agreement where every player chooses her own unique arm, and thus there are no collisions. Their algorithm uses Bertsekas’ auction algorithm to have each player choose her own unique arm. The paper also proposes an algorithm which addresses stochastic rewards changing according to a Markov process. These algorithms do not apply to our setting, in which communication is forbidden.

The work most similar to ours is avner2014concurrent, where communication is not allowed and there is no knowledge of the number of players. The proposed algorithm, named MEGA, is based on an elegant combination of the well-known ε-greedy MAB algorithm with a collision avoidance protocol, known as the ALOHA protocol, used in signal and control processing. The ε-greedy algorithm (see e.g. sutton1998reinforcement) is an algorithm for the single-player multi-armed bandit setting, which ensures that the majority of exploration occurs at the beginning of the game; after accumulating sufficient information on the mean rewards, most of the remaining iterations are used to exploit the arm with the highest expected profit. This is done by having an exploration probability ε_t that decreases with the current iteration t. In every iteration, the player chooses an arm uniformly at random with probability ε_t, and with probability 1−ε_t she exploits by choosing the arm with the highest empirical mean reward. The ALOHA protocol is a collision avoidance protocol used in multi-player signal processing schemes. The protocol dictates that a player, in the event of a collision, should decide, by some random process, whether to persist on the same arm, or to leave the arm and not return to it for a time period, also chosen at random. This time period is called the ‘unavailability time’.

avner2014concurrent analyze the MEGA algorithm, and show that in the static setting, assuming parameters are chosen appropriately, it achieves O(T^{2/3}) regret. A full analysis of the algorithm in the dynamic setting is not provided, although a sublinear expected regret bound is shown for some specific dynamic scenarios. However, as we discuss in detail in Section 4, the MEGA algorithm may perform poorly in some reasonable dynamic scenarios. Essentially, this is because the collision frequency decreases as the game proceeds, but never reaches zero. Although this frequency can be tuned via the algorithm’s parameters, it is difficult to find a single combination of parameters that works well in all scenarios.

## 2 Setting

In the standard (single-player) stochastic MAB setting there are K arms, with the rewards of each arm k sampled independently from some distribution on [0,1] with expected reward μ_k. Every round, a player chooses an arm, and would like to receive the highest possible cumulative reward over T rounds. In this work, we focus for simplicity on the finite-horizon case, where T is fixed and known in advance.

The multi-player MAB setting is similar, but with several players instead of a single one. In fact, we consider two cases: one where the set of players, and therefore the number of players N, is fixed, and another where the number of players, N_t, can change at any round t. In our model we would like to minimize, or even eliminate, any central control and communication, and assume that players do not even possess knowledge of the value of N. Generally, we assume K, N, and N_t are all much smaller than T. We will refer to the set of N arms with the highest expected rewards as ”the top N arms”.

The performance in the standard single-player MAB setting is usually measured by the regret (where we take expectations over the rewards of the arms):

 R := T·μ* − ∑_{t=1}^{T} μ(t)

where μ(t) is the expected reward of the arm chosen by the single player at round t, and μ* is the expected reward of the arm with the highest expected reward. The regret is non-trivial if it is sub-linear in T.

In the multi-player setting, we generalize this notion, and define our regret with respect to the best static allocation of players to arms (in expectation over the rewards), as follows:

 R := ∑_{t=1}^{T} ∑_{k∈K*_t} μ_k(t) − ∑_{t=1}^{T} ∑_{j=1}^{N_t} μ^j(t)·(1−η^j(t))

where μ^j(t) is the expected reward of the arm chosen by player j at round t, N_t is the number of players at round t, K*_t is the set of the N_t highest ranked arms (where the rank is taken over the expected rewards), and η^j(t) is a collision indicator, which equals 1 if player j had a collision at round t, and 0 otherwise. We define a collision as the event where more than one player chose the same arm at a given round, and assume that no reward is obtained in that case.

Since achieving sublinear regret is trivially impossible when there are more players than arms, we assume throughout that the number of players, in both the static and dynamic settings, is always less than the number of arms.
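The regret definition above can be computed directly given the mean rewards and a log of the arms chosen by the active players at each round; the helper below is our own illustrative sketch, not code from the paper:

```python
def multiplayer_regret(mu, choices):
    """mu: list of arm mean rewards.  choices: list over rounds, where
    choices[t] lists the arm chosen by each active player at round t.
    An arm chosen by more than one player (a collision) yields no reward."""
    best = 0.0
    obtained = 0.0
    for round_choices in choices:
        n_t = len(round_choices)
        # best static allocation: the n_t arms with the highest means
        best += sum(sorted(mu, reverse=True)[:n_t])
        for arm in round_choices:
            if round_choices.count(arm) == 1:  # pulled alone, reward counts
                obtained += mu[arm]
    return best - obtained

# Two rounds, two players: the first round collides on arm 0, the second splits.
regret = multiplayer_regret([0.9, 0.5, 0.1], [[0, 0], [0, 1]])
print(regret)
```

Here the collision in the first round forfeits the whole optimal allocation (0.9 + 0.5), while the second round is optimal, so the total regret is 1.4.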

## 3 Algorithms and Analysis

### 3.1 The Musical Chairs (MC) Algorithm

We begin by considering the static case, where no players enter or leave. The MC algorithm, which we present below for this setting, is based on the idea that after a finite time of random exploration, all players will have learned a correct ranking of the arms with high probability (assuming gaps between the mean rewards). If, after this time, all players could fix on one of the top arms and never leave, then from this point onward no additional regret would accumulate. The algorithm we present is composed of a learning phase, with enough rounds of random exploration for all players to learn the ranking of the arms and the number of players; a ‘Musical Chairs’ phase, in which the players fix on the top arms; and a ‘fixed’ phase, in which all players remain fixed on their arms.

The Musical Chairs subroutine works by having each player randomly choose an arm among the top N arms, until she chooses one without experiencing a collision. From that point onwards, she chooses only that arm. It can be shown that if all players implement this subroutine, then after a bounded number of rounds (in expectation), all players will fix on different arms, and no more regret will be added. The Musical Chairs subroutine’s success depends on each player being able to estimate a correct ranking of the arms (accurate enough to distinguish the N best arms from the rest) and to estimate the correct value of N.
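A toy synchronous simulation of the subroutine, assuming each player already knows the set of top arms (the round-based structure and function names are ours):

```python
import random

def musical_chairs_round(rng, players_state, top_arms):
    """One synchronous round of the Musical Chairs subroutine.
    players_state[j] is the arm player j has fixed on, or None if she is
    still choosing uniformly at random among the top arms."""
    picks = {}
    for j, fixed in enumerate(players_state):
        arm = fixed if fixed is not None else rng.choice(top_arms)
        picks.setdefault(arm, []).append(j)
    for arm, players in picks.items():
        # a player fixes on an arm only if she pulled it without a collision
        if len(players) == 1 and players_state[players[0]] is None:
            players_state[players[0]] = arm
    return players_state

rng = random.Random(0)
state = [None] * 4            # 4 players, none fixed yet
rounds = 0
while any(arm is None for arm in state):
    state = musical_chairs_round(rng, state, top_arms=[0, 1, 2, 3])
    rounds += 1
print(sorted(state), rounds)
```

Since a player only fixes after a collision-free pull, no two players can ever fix on the same arm, and with 4 players over 4 arms the final state is a perfect matching.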

### 3.2 Analysis of the MC algorithm

Let N be the number of players and let K* denote the N best ranked arms. For each player j, we denote by μ̂^j_i that player’s measured empirical mean reward of arm i. We use the following definition from avner2014concurrent:

###### Definition 1.

An ε-correct ranking of K arms is a sorted list of the empirical mean rewards of the arms such that, for every pair of arms i, j: arm i is listed before arm j if μ_i − μ_j > ε.

###### Theorem 1.

Let ε be the gap between the expected reward of the N-th best arm and the (N+1)-th best arm. Then for all δ ∈ (0,1), with probability at least 1−δ, the expected regret of N players using the MC algorithm with K arms for T rounds, with the learning-phase length T0 set suitably (as determined by Lemmas 1 and 2 below),
is at most

 T0·N + 2·exp(2)·N².

Note that the bound we give is in expectation over the rewards and the algorithm’s randomness, conditioned on the event (occurring with probability at least 1−δ) that all players learn an ε-correct ranking and estimate the true number of players.

The proof of Theorem 1 is composed of three lemmas presented below, whose formal proofs appear in Appendix A.

We begin by showing that with high probability, all players will learn an ε-correct ranking after a time period independent of T:

###### Lemma 1.

For every δ ∈ (0,1) and ε > 0, after a sufficient number of rounds of random exploration (independent of T), all players have an ε-correct ranking of the arms with probability at least 1−δ.

We then show that estimating the number of players also requires only a number of rounds independent of T, with high probability. Knowing the value of N exactly is required in order for the players to run the Musical Chairs subroutine and choose an arm from the top N arms. To estimate N, each player keeps track of the number of collisions she experienced up to time t, and after a fixed number of rounds computes an estimate N* by inverting the expected collision frequency and rounding to the nearest integer. The following lemma proves that N* will indeed equal N with high probability:
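One plausible reconstruction of such an estimator (the exact expression used in the paper may differ): during uniform random exploration over K arms, each of the other N−1 players avoids a given player's arm with probability 1−1/K, so the per-round no-collision probability is (1−1/K)^{N−1}, which can be inverted:

```python
import math

def estimate_num_players(K, rounds, collisions):
    """Invert P(no collision) = (1 - 1/K)**(N - 1) to estimate N from the
    observed fraction of collision-free rounds.  This is a reconstruction
    of the estimator described in the text, not the paper's exact formula."""
    obs_no_collision = (rounds - collisions) / rounds
    return round(math.log(obs_no_collision) / math.log(1 - 1 / K)) + 1

# Sanity check: feeding in the *exact* collision frequency recovers N.
K, N = 10, 5
p_no_collision = (1 - 1 / K) ** (N - 1)
rounds = 10_000
collisions = round(rounds * (1 - p_no_collision))
print(estimate_num_players(K, rounds, collisions))
```

With enough exploration rounds, the empirical collision frequency concentrates around its expectation, so the rounded estimate equals N with high probability, which is the content of Lemma 2.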

###### Lemma 2.

Let δ ∈ (0,1). If the number of rounds used to estimate N is sufficiently large (independent of T), then with probability at least 1−δ we have that N* = N.

Finally, given that the players were able to learn an ε-correct ranking and the number of players, we can upper bound the expected time (and hence the regret) needed for all the players to fix on different arms:

###### Lemma 3.

Denote by R_fix the regret accumulated due to players running the Musical Chairs subroutine. Conditioned on the event that all players learned an ε-correct ranking and that N* = N, it holds that the expected value of R_fix is at most 2·exp(2)·N².

Combining the three lemmas above, we get Theorem 1.

### 3.3 The Dynamic Musical Chairs (DMC) Algorithm

In this subsection we consider the case where players can enter and leave. For the dynamic setting we suggest an extension of the MC algorithm, which simply runs that algorithm in epochs, restarting at the end of each epoch (see pseudocode below). We call this algorithm the Dynamic MC (DMC) algorithm; it requires a shared clock between all players in order to synchronize the epochs. We note that having a shared clock is a mild assumption which has been used previously in several works (see for example avner2015learning; shukla2014synchronization; nieminen2009time). This clock means that at any round t, players know the value of t mod T1, where T1 (a parameter of the algorithm) is the number of rounds in an epoch. However, communication between players is still not allowed, and the shared clock is not used for resource allocation or for synchronization between players regarding which arm to choose.

We emphasize that in the dynamic setting, some restriction on the frequency at which players enter or leave is necessary for any algorithm to obtain a sub-linear regret bound: if players may enter or leave at every round, then it is possible that no player stays long enough to even learn the true ranking of any arm, in which case any algorithm will incur linear regret. For this reason, we assume that the overall number of players entering and leaving is sublinear in T. Moreover, since time periods are synchronized, we allow ourselves to assume that players can only enter and leave after the learning period in each epoch. According to our analysis, the rounds belonging to learning periods form a vanishing proportion of the total number of rounds T, so this assumption is not overly restrictive. Moreover, under some conditions, this assumption can be weakened to cover only leaving players, without significantly changing our regret bounds (for example, if entering players can refrain from picking arms during the learning phase, instead accumulating regret: since the total length of the learning phases is less than our regret bounds, this will not affect the bounds by more than a small constant).
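The epoch structure driven by the shared clock can be sketched as follows (phase names are illustrative; within an epoch, the rounds after the learning phase are spent in the Musical Chairs subroutine and then fixed on an arm):

```python
def dmc_schedule(T, T1, T0):
    """Epoch structure of the DMC algorithm: the game is split into epochs
    of T1 rounds, each starting with T0 learning rounds.  Using the shared
    clock, every player derives her current phase from t mod T1."""
    phases = []
    for t in range(T):
        s = t % T1  # position within the current epoch (shared clock)
        phases.append("learn" if s < T0 else "musical_chairs_or_fixed")
    return phases

sched = dmc_schedule(T=20, T1=10, T0=3)
print(sched[:12])
```

Note that the schedule depends only on the shared clock, so every player computes the same phase boundaries without any communication, and the learning fraction T0/T1 can be made vanishing.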

### 3.4 Analysis of the DMC Algorithm

The main result here is the following theorem:

###### Theorem 2.

Let N be an upper bound on the number of active players at any time point; let ε be the minimal gap between the expected rewards of the N+1 best arms, with a known lower bound; and let x be an upper bound on the total number of players entering and leaving during the T rounds. Then with arbitrarily high probability, the expected regret of the DMC algorithm (over the rewards), using suitably chosen epoch and learning-phase lengths T1 and T0, is at most:

 Õ(√(xT))

where the Õ hides factors logarithmic in T and polynomial in K, N, and 1/ε.

As in Theorem 1, the bound is in expectation over rewards and the algorithm’s randomness, conditioned on the high-probability event that in each epoch, the players learn the correct ranking and the number of players.

The bound in the theorem hides several factors to simplify the presentation. More specifically, the bound is based on the following lemma, taking T1 = Θ(√(T/x)):
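The scaling follows from a standard tradeoff; a sketch, with all constants, logarithmic factors, and the dependence on K, N, and 1/ε suppressed:

```latex
% Each of the T/T_1 epochs contributes a bounded learning/fixing cost c
% (independent of T), and each of the x entering or leaving players can
% disrupt at most one epoch, costing at most on the order of T_1 rounds:
R \;\lesssim\; \frac{T}{T_1}\cdot c \;+\; x \cdot T_1 .
% Minimizing the right-hand side over T_1 gives T_1 = \sqrt{cT/x}, hence
R \;=\; O\!\left(\sqrt{xT}\right).
```

Longer epochs amortize the learning cost but let each entering or leaving player do more damage; the square-root bound balances the two.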

###### Lemma 4.

Let e be the total number of players entering, l the total number of players leaving, T_f the expected time for any player to fix on an arm (bounded, in expectation, via Lemma 3), N_m an upper bound on the number of active players, and let a known lower bound on the expected reward of the best arm at any round be given.

Then for every δ ∈ (0,1), with probability at least 1−δ, the expected regret of the Dynamic MC algorithm played for T rounds, with the learning-phase length T0 and epoch length T1 suitably chosen (with T0 < T1), is at most

 (T/T1)·(N_m·(T0 + 2·T_f)) + e·2·(T1 − T0) + l·(T1 − T0).

Note that N is not known to the players; however, it is always possible to upper bound it, since we are in the setting where the number of players does not exceed the number of arms, i.e., N ≤ K. Thus we can calculate a sufficient learning time T0 by replacing N with K.

The lemma is proven by using Lemma 1 and Lemma 2 with the confidence parameters scaled down by the number of epochs, and taking a union bound over all epochs. This ensures that, with high probability, the players learn the true rankings and estimate the number of players correctly in each epoch; for this reason T0 includes a logarithmic factor, as stated above. We then separately bound the regret arising from the learning phase and from fixing on an arm, as well as the regret due to entering and leaving players. The formal proof of this lemma appears in Appendix A.

## 4 Comparison to the MEGA Algorithm

As discussed in the introduction, the most relevant existing algorithm for our setting (at least, that we are aware of) is the MEGA algorithm presented in avner2014concurrent. In terms of formal guarantees, the algorithm attains O(T^{2/3}) regret in the static setting. A full analysis of the algorithm in the dynamic setting is lacking, but it is shown that if a single player leaves at some time point, the system re-stabilizes at an optimal configuration within a sublinear number of rounds. The algorithm is clever, based on well-established techniques, allows players to enter and leave at any round, and, compared to our approach, is not based on repeatedly restarting the algorithm, which can be wasteful in practice (an issue we shall return to later on). On the flip side, our algorithms have fewer parameters, attain considerably better performance in the static setting, and can provably cope with the general dynamic setting. In this section, we show that this is not just a matter of analysis, and that the approach taken by the MEGA algorithm indeed has some deficiencies in the dynamic setting. We begin by outlining the MEGA algorithm at a level sufficient to understand our analysis, and then demonstrate how it may perform poorly in some natural dynamic scenarios.

### 4.1 Outline of the algorithm

The MEGA algorithm uses a well-known ε-greedy MAB approach, augmented with a collision avoidance mechanism. Initially, players mostly explore arms in order to learn their ranking, and then gradually move to exploiting the best arms, while trying to avoid arms on which they have collided. Specifically, each player has an exploration probability which decays with the current round t. The exploration probability also depends on two input parameters, one of which is a lower bound on the gap between the N-th and (N+1)-th best arms. Each player also has a persistence probability p, whose initial value p0 is another input parameter. p is increased (at a rate governed by another input parameter) in every round in which the player picks the same arm consecutively; otherwise, if the player switches arms, p is reset to p0. In the case of a collision, each colliding player flips a coin with her own respective persistence probability p to decide whether to persist on the arm on which she collided. In case a player does not persist after a collision, she marks this arm unavailable until a time point sampled uniformly at random from an interval whose length is controlled by another input parameter of the algorithm.
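To make the outline concrete, here is a minimal sketch of a player following this kind of mechanism. The parameter names (c, p0, alpha, beta) and the exact persistence and backoff updates are our own illustrative assumptions, not the precise rules of MEGA:

```python
import random

class MegaLikePlayer:
    """Simplified sketch of the mechanism outlined above: epsilon-greedy arm
    selection plus an ALOHA-style backoff on collisions.  The actual MEGA
    algorithm of avner2014concurrent differs in its details."""

    def __init__(self, K, c=0.1, p0=0.5, alpha=0.5, beta=0.8, seed=0):
        self.K, self.c, self.p0 = K, c, p0
        self.alpha, self.beta = alpha, beta
        self.p = p0                           # persistence probability
        self.rng = random.Random(seed)
        self.means = [0.0] * K                # empirical mean rewards
        self.counts = [0] * K
        self.unavailable_until = [0] * K      # backoff deadline per arm
        self.last_arm = None

    def choose(self, t):
        avail = [k for k in range(self.K) if self.unavailable_until[k] <= t]
        if not avail:                         # all arms backed off: reset
            avail = list(range(self.K))
        eps = min(1.0, self.c / max(t, 1))    # decaying exploration probability
        if self.rng.random() < eps:
            arm = self.rng.choice(avail)      # explore
        else:
            arm = max(avail, key=lambda k: self.means[k])  # exploit
        # persistence grows while staying on the same arm, else resets to p0
        self.p = self.p * self.alpha + (1 - self.alpha) if arm == self.last_arm else self.p0
        self.last_arm = arm
        return arm

    def observe(self, t, arm, reward, collided):
        if collided:
            if self.rng.random() > self.p:    # do not persist: back off
                self.unavailable_until[arm] = self.rng.randint(t + 1, int(t + 1 + self.beta * t) + 1)
                self.last_arm = None
        else:
            self.counts[arm] += 1
            self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

The key structural points, which our analysis below relies on, are visible here: exploration decays with t and never reaches zero, and the persistence/backoff coin flips mean collisions are avoided only probabilistically.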

Note that both our algorithm and the MEGA algorithm require a lower bound on the gap between the N-th best arm and the (N+1)-th best arm.

One issue with the MEGA algorithm’s approach is that players never entirely stop colliding, even late in the game. At least in the static case, it seems advantageous to fix each player’s choice after a while, hence avoiding all future collisions and the resulting additional regret. The motivation for the MC algorithm is to create a procedure which guarantees that once learning is complete, each player will choose one of the best arms for the rest of the game.

In the dynamic setting, however, the issue is quite the reverse: the ε-greedy mechanism, on which the MEGA algorithm is based, is not good at adapting to changing circumstances. In the next subsection, we illustrate two problematic ramifications: one is that players entering late in the game are not able to learn the ranking of the best arms, and the other is that when players leave, the best arms may stay vacant for a long period of time before being sampled by other players. A third issue is that if the reward distributions change over time, a rapidly decreasing exploration probability is problematic. The DMC algorithm can address this as it runs in epochs, hence any mistake in one epoch is undone in the next.

### 4.2 Problematic Scenarios for the MEGA algorithm

Below, we study the realistic situation where players both enter and leave, and demonstrate that the regret of the MEGA algorithm can be substantially worse than our regret guarantees (both in terms of regret guarantees and in terms of the actual regret obtained), sometimes even linear in T. For the proofs of the theorems presented in this section we refer the reader to Appendix A.

The first scenario we wish to discuss is the simple setting of two players and two arms, where the second player enters at some round in the game and the first player then leaves at some later round. We will describe what will happen, intuitively, if the players are following the MEGA algorithm, with a formal theorem presented below. In the scenario we described, the first player will learn a correct ranking of the two arms with high probability, and will proceed to exploit the highest ranked arm, thus making her persistence probability very high and her exploration probability very low. If the second player enters late in the game, then any attempt to sample the highest ranked arm will cause a collision in which the first player stays and the second player fails to sample the arm, since the first player’s persistence probability is so high and the new player’s persistence probability is reset to its lower initial value. This means that the second player will not be able to learn the true ranking of the two arms. Thus, if the first player leaves after a period of time long enough that the second player is unlikely to still explore, then the second player will exploit the second ranked arm, causing linear regret. This scenario can be extended to multiple players and arms, by adding players one by one at time intervals that ensure that players who entered late will not succeed in learning the true ranking, due to the collision avoidance mechanism.

The formal result regarding this scenario is the following:

###### Theorem 3.

Consider a multi-player MAB setting as described above, where the second player enters at some round, and the first player leaves at some later round, both determined by a parameter of the construction. Then for all choices of the MEGA algorithm parameters, if the parameter that controls the collision avoidance mechanism is chosen within a suitable constant range, and the entry and exit rounds are chosen accordingly, then:

• The expected regret of the MEGA algorithm is linear in T.

• The conditional expected regret of the DMC algorithm (using suitably chosen epochs) is Õ(√T).

Notice that for the DMC algorithm we have an upper bound on the expected regret conditioned on the event that all players learn an ε-correct ranking, which happens with arbitrarily high probability.

In particular, if we choose the relevant parameter to be a constant in the required range, we get a scenario where the regret bounds above hold for any T (and any possible values of the other parameters of the MEGA algorithm). We note that when this parameter is larger than the required range, the persistence probability will hardly deviate from its initial value, which makes the persistence mechanism non-functional and can easily lead to large regret, even in the static setting.

We now turn to discuss a second reasonable scenario, in which players alternate between entering and exiting at fixed intervals. We will show that in this scenario, however the parameters are chosen, the regret bound of the MEGA algorithm (as given in avner2014concurrent, using recommended parameter values, and even counting only regret due to players leaving) is worse than the regret bound of the DMC algorithm (which incorporates regret due to both players leaving and players entering). Note that unlike Theorem 3, here we compare the available regret upper bounds, rather than proving a regret lower bound.

The setting is defined as follows: one player exits (or enters, alternating) at fixed intervals of rounds. In the worst case, the player who left was occupying the highest ranked arm. In the analysis of avner2014concurrent, players following the MEGA algorithm might take a long time before being able to access this arm, due to the collision avoidance mechanism (the bound depends on the round at which the player exited and on a parameter of the algorithm, whose recommended value is based on the static-setting analysis). For players following the DMC algorithm, a player leaving can affect the regret of the current epoch only. Intuitively, if we set the epoch length to be compatible with the rate of exiting players, we can achieve a better regret bound than the one indicated by the MEGA algorithm’s analysis. Formally, we have the following:

###### Theorem 4.

In the multi-player MAB setting with one player leaving and one player entering in alternation at fixed intervals, and under suitable conditions on the interval length, we have that

1. The regret upper bound of the MEGA algorithm is at least

2. The expected regret upper bound of the DMC algorithm is

As in the previous theorem, the expected regret of the DMC algorithm is conditioned on the event that all players learn an ε-correct ranking, which happens with arbitrarily high probability. Also, the assumption on the interval length is required in order to apply the existing MEGA analysis. The theorem is illustrated graphically in Figure 1, which shows how the exponent in the regret bound is uniformly better for our algorithm when we pick the recommended value of the relevant MEGA parameter. Note that if this parameter is chosen differently, then the regret bound of avner2014concurrent increases, even in the static setting.

## 5 Experiments

For our experiments, we implemented the DMC algorithm for the dynamic case and the MC algorithm for the static case. For comparison, we implemented the MEGA algorithm of avner2014concurrent, which is the current state-of-the-art for our problem setting.

For each experimental setup and algorithm, we repeated the experiment 20 times, and plotted the average and standard deviation of the resulting regret (the standard deviation is shown as a shaded region around the average regret). In dynamic scenarios, we mark the times at which a player enters or leaves with dashed lines. In most figures, we plot the average per-round regret as a function of the number of rounds so far.

For the parameters of the MEGA algorithm, we used the empirical values suggested in avner2014concurrent (rather than the theoretical values, which are overly conservative). The only exception is the gap between the mean rewards of the N-th and (N+1)-th best arms, which was taken as the actual gap rather than a rough lower bound. Note that this only gives the MEGA algorithm more power. Moreover, in all experiments, the gap is at least 0.05, which is the heuristic value suggested as the lower bound in avner2014concurrent. For the dynamic scenarios, where the players and the number of players change, we use the minimum gap between the N-th and (N+1)-th best arms over all rounds. For example, if at the beginning there are 2 players and the gap between the second and third arm is 0.3, but by the end there are 4 players and the gap between the fourth and fifth arm is 0.01, then we use 0.01 as the value of this gap.

For the MC and DMC algorithms, the remaining parameters were set identically in all experiments. For the DMC epoch parameter, we use either the theoretically optimal value presented in this work or that value scaled by a small constant (see details below for the specific value in each experiment).

In the DMC algorithm, a potential source of waste is that newly entering players can accumulate linear regret until the next epoch begins. Therefore, in the DMC experiments, we added the following heuristic: when a player enters in the middle of an epoch, she chooses an arm with probability proportional to the empirical mean of its rewards (as observed by her so far, initially set to 1), multiplied by the empirical probability of not colliding on that arm (initially set to 1). After the epoch is over, she chooses arms by following the DMC algorithm. Intuitively, this quickly suppresses collisions with players who are already fixed, and encourages newly entered players to exploit rather than only explore: any arm that has a player 'fixed' on it will give a newly entered player an empirical non-collision probability of zero.
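The heuristic above can be sketched as follows (a minimal sketch in Python; the function name and data layout are our own, not part of the paper's implementation):

```python
import random

def midepoch_arm_choice(emp_mean, emp_free):
    """Sample an arm with probability proportional to
    (empirical mean reward) * (empirical probability of no collision).
    Both lists are initialized to 1.0 for arms not yet tried, as in the heuristic."""
    weights = [m * f for m, f in zip(emp_mean, emp_free)]
    total = sum(weights)
    if total == 0:  # every arm looks occupied or worthless: fall back to uniform
        return random.randrange(len(emp_mean))
    r, acc = random.uniform(0, total), 0.0
    for j, w in enumerate(weights):
        acc += w
        if r <= acc:
            return j
    return len(emp_mean) - 1
```

Note that an arm on which the player has always collided gets weight zero, so she effectively stops contending for arms held by fixed players.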

We begin with a simple scenario corresponding to the static setting. There is an initial set of 6 players, which remains fixed throughout the game, and 10 arms. The mean rewards of the arms are chosen uniformly at random in $[0,1]$ (with a gap of at least 0.05 between the 6th and 7th best arms). At every round of the game, the reward of each arm is 1 with probability equal to its mean reward, and zero otherwise.
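This reward model can be reproduced with a short simulation sketch (the helper names are ours; rejection sampling is one simple way to realize "uniform means subject to the gap condition"):

```python
import random

def draw_rewards(means, rng=random):
    """One round of the stochastic setting: each arm independently yields
    reward 1 with probability equal to its mean, and 0 otherwise."""
    return [1 if rng.random() < m else 0 for m in means]

def sample_means(n_arms, n_players, min_gap, rng=random):
    """Draw arm means uniformly in [0, 1], rejecting draws until the gap
    between the n_players-th and (n_players+1)-th best arms is >= min_gap."""
    while True:
        means = sorted((rng.random() for _ in range(n_arms)), reverse=True)
        if means[n_players - 1] - means[n_players] >= min_gap:
            rng.shuffle(means)
            return means
```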

In this scenario, we can see the short time period in which players running the MC algorithm are learning (and the average regret is constant); then, with high probability, they all know which are the best arms and never make any more mistakes or collisions. The average regret is shown in Figure 1(a). The regret added at every round after learning is zero, while in the MEGA algorithm, even though the exploration probability decreases with time, it is never zero. Also, in the MEGA algorithm, every time the best arm becomes 'available' to a player, that player will try to exploit it, likely colliding with other players who also want to exploit that arm. Therefore, in the MEGA algorithm there will always be collisions, even though they occur less frequently over time. This is further exemplified in Figure 1(b), where we can see that after the learning stage there is no further accumulated regret for the MC algorithm, while the MEGA algorithm never stops accumulating regret.

Another scenario we simulate is that of Theorem 3 in Section 4. The game starts with one player; at some round a second player enters, and after the same number of additional rounds the first player leaves. The results can be seen in Figure 2(a). In Figure 2(b) we show a generalization of this scenario to multiple players: the game starts with one player, and another player enters at regular intervals until the number of players reaches 4; then, after one more such interval, the first player leaves. For both of these scenarios the rewards are chosen deterministically. For Figure 2(a) there is a lower bound of 0.8 on the gap, and for Figure 2(b) there is a gap of 0.7 between the expected rewards of the $N$-th and $(N+1)$-th best arms. In Figure 2(a) there are 4 arms, and in Figure 2(b) there are 10 arms.

As discussed in Section 4, in the scenario of Figure 2(a), a second player who enters late is not able to learn the best arm, because the first player always exploits the best arm and has a very high persistence probability. Therefore, once the first player leaves the game and frees the best arm for the second player, it will take the second player time proportional to the number of rounds she has already played to explore this arm: since her exploration probability will be very low by then, exploring this arm will take a very long time. The DMC algorithm runs in epochs, so this inflexibility, i.e. an inability to adapt in dynamic settings, does not arise. This phenomenon can be seen in Figures 2(a) and 2(b): when the first player leaves (marked by a dotted green line), the average regret of the MEGA algorithm increases dramatically, while for the DMC algorithm the decreasing trend continues. This suggests that the MEGA algorithm may be susceptible to large regret in some natural dynamic scenarios.

Another dynamic player scenario we simulate (demonstrating Theorem 4 in Section 4) starts with a set of five players and 10 arms; at regular intervals, we alternate between a player leaving and a player entering. The leaving player is chosen at random from the set of current players. Figure 3(a) shows the outcome for a shorter interval between changes, and Figure 3(b) for a longer one. Although our algorithm performs better when the interval between changes is large enough (confirming the theoretical evidence in Theorem 2), we note that this is not the case for smaller intervals. We believe that this is due to the epoch-based nature of the DMC algorithm, which can be wasteful when the interval is moderate. However, when it is sufficiently large (as in Figure 3(b)), the DMC algorithm outperforms the MEGA algorithm.

## 6 Discussion

In this work, we propose new algorithms for the stochastic multi-player multi-armed bandit problem with no communication or central control. We provide an analysis for the static setting, showing that the proposed MC algorithm achieves a better upper bound on the regret (as a function of the number of rounds) than the current state of the art. We also provide the DMC algorithm, which is, to the best of our knowledge, the first algorithm with formal guarantees that copes with the general dynamic setting. We also study some natural dynamic scenarios in which the behavior of previous approaches is problematic, sometimes even leading to linear regret.

This work leaves several questions open. For example, as noted earlier, both the DMC algorithm and the earlier MEGA algorithm require knowing a lower bound on the gap between the $N$-th best arm and the $(N+1)$-th best arm, and it would be interesting to remove this assumption while attaining similar guarantees. Another issue with the DMC algorithm is its epoch-based nature, which in practice considerably degrades the regret (especially if the total number of rounds is not too large). Can we develop algorithms with provable guarantees for the general dynamic setting which are not epoch-based?

More generally, there are several interesting variants of the multi-player MAB setting that are currently unexplored. For example, it would be quite interesting to develop algorithms for multi-player MAB in the non-stochastic (adversarial) setting, where the rewards are arbitrary. In the adversarial case, one cannot rely on high-reward arms remaining such in the future, and it is not at all clear what algorithmic mechanism can work here. Another interesting direction is to remove the assumption that players faithfully execute a given algorithm: in practice, players may be non-cooperative and greedy, and it would be interesting to devise algorithms that are also incentive-compatible, and to study related game-theoretic questions.

#### Acknowledgments

This research is partially supported by Israel Science Foundation grant 425/13, and an FP7 Marie Curie CIG grant. We thank Nicolò Cesa-Bianchi and Yishay Mansour for several discussions which helped initiate this line of work.

## Appendix A Proofs

We will use the following standard concentration bounds. Let $X_1,\dots,X_m$ be independent random variables such that each $X_j$ always lies in the interval $[0,1]$. Define $\bar{X}=\frac{1}{m}\sum_{j=1}^{m}X_j$ and $\mu=\mathbb{E}\left[\bar{X}\right]$.

###### Theorem 5.

(Chernoff Bound) For any $\delta\in(0,1)$,

$$\Pr\left(\sum_{j=1}^{m}X_j\le(1-\delta)\cdot\mathbb{E}\left[\sum_{j=1}^{m}X_j\right]\right)\le\exp\left(-\frac{\mathbb{E}\left[\sum_{j=1}^{m}X_j\right]\cdot\delta^2}{2}\right)$$
###### Theorem 6.

(Hoeffding’s Inequality) For any $\delta>0$,

$$\Pr\left(\left|\bar{X}-\mu\right|\ge\delta\right)\le 2\exp\left(-2m\delta^2\right)$$
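As a quick numerical illustration of Hoeffding's inequality (a simulation sketch, not part of the proofs; all names are ours), the empirical frequency of large deviations stays well below the bound:

```python
import math
import random

def hoeffding_violation_rate(m, delta, trials, mu=0.5, seed=0):
    """Estimate Pr(|sample mean - mu| >= delta) over Bernoulli(mu) samples."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        xbar = sum(rng.random() < mu for _ in range(m)) / m
        if abs(xbar - mu) >= delta:
            bad += 1
    return bad / trials

rate = hoeffding_violation_rate(m=200, delta=0.1, trials=2000)
bound = 2 * math.exp(-2 * 200 * 0.1 ** 2)  # Hoeffding bound for m=200, delta=0.1
```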

### A.1 MC algorithm proofs

We prove the following lemmas from which Theorem 1 follows.

#### A.1.1 Proof of Lemma 1

We will first upper bound the probability that not all players have an $\epsilon$-correct ranking, given that they have enough observations of each arm. Then we upper bound the probability that not all players have enough observations of each arm, given that they have played $T_0$ rounds. We then take a union bound over both of these bad events to upper bound the probability that not all players have an $\epsilon$-correct ranking after $T_0$ rounds of random exploration.
Given $\epsilon>0$, let $\mu_j$ denote the expected reward of arm $j$, and let $\tilde{\mu}_j$ denote player $i$'s empirical mean reward for arm $j$. Note that if for player $i$ it is true that $|\tilde{\mu}_j-\mu_j|\le\frac{\epsilon}{2}$ for all arms $j$, then player $i$ must have an $\epsilon$-correct ranking.
We now calculate the required number of observations, $C$, of each arm in order to get

$$\Pr\left(\text{player }i\text{ does not have an }\epsilon\text{-correct ranking}\;\middle|\;\text{player }i\text{ viewed}\ge C\text{ observations of each arm}\right)<\frac{\delta_1}{N}.$$

Specifically, we have the following:

$$\begin{aligned}
&\Pr\left(\text{player }i\text{ does not have an }\epsilon\text{-correct ranking}\;\middle|\;\ge C\text{ observations of each arm}\right)\\
&\quad\le\Pr\left(\exists j\text{ s.t. }|\tilde{\mu}_j-\mu_j|>\tfrac{\epsilon}{2}\;\middle|\;\ge C\text{ observations of each arm}\right)\\
&\quad\overset{\text{union bound}}{\le}\sum_{j=1}^{K}\Pr\left(|\tilde{\mu}_j-\mu_j|>\tfrac{\epsilon}{2}\;\middle|\;\ge C\text{ observations of each arm}\right)\\
&\quad=\sum_{j=1}^{K}\sum_{n=C}^{\infty}\Pr\left(|\tilde{\mu}_j-\mu_j|>\tfrac{\epsilon}{2}\;\middle|\;\#\text{ of views}=n\right)\Pr\left(\text{viewed }n\;\middle|\;n\ge C\right)
\end{aligned}$$

Using Hoeffding’s inequality, this is at most

$$\overset{\text{Hoeffding's inequality}}{\le}\sum_{j=1}^{K}\sum_{n=C}^{\infty}2\exp\left(-\frac{n\epsilon^2}{2}\right)\Pr\left(\text{viewed }n\;\middle|\;n\ge C\right)\le\sum_{j=1}^{K}2\exp\left(-\frac{C\epsilon^2}{2}\right)\sum_{n=C}^{\infty}\Pr\left(\text{viewed }n\;\middle|\;n\ge C\right)$$

Notice that we can apply Hoeffding's inequality here since each observation of the reward of an arm is sampled independently of the number of times we view it. This is true since every player samples uniformly at random at every round of learning (independently of all previous rounds).
In order for this bound to be at most $\frac{\delta_1}{N}$, we need:

$$2K\exp\left(-\frac{C\epsilon^2}{2}\right)<\frac{\delta_1}{N}\;\Longrightarrow\;C>\frac{2}{\epsilon^2}\ln\left(\frac{2KN}{\delta_1}\right)$$
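For concreteness, this bound on $C$ can be evaluated numerically (a sketch; the function name is ours):

```python
import math

def required_observations(K, N, eps, delta1):
    """Smallest integer C satisfying C > (2 / eps^2) * ln(2*K*N / delta1)."""
    return math.floor((2.0 / eps ** 2) * math.log(2 * K * N / delta1)) + 1
```

For example, with $K=10$ arms, $N=5$ players, $\epsilon=0.1$ and $\delta_1=0.05$, this gives on the order of 1,500 required observations per arm.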

Now we show that if all players have at least $C$ observations of each arm, then with probability at least $1-\delta_1$ all players have an $\epsilon$-correct ranking.
We start by defining the following events:

• $A$ will denote the event that all players have an $\epsilon$-correct ranking ($\bar{A}$ will denote its complement)

• $A_i$ will denote the event that player $i$ has an $\epsilon$-correct ranking

• $B$ will denote the event that all players have observed each arm at least $C$ times ($\bar{B}$ will denote its complement)

• $B_i$ will denote the event that player $i$ has observed each arm at least $C$ times

$$\Pr(A\mid B)\ge 1-\Pr\left(\bigvee_i\bar{A}_i\;\middle|\;B\right)\overset{\text{union bound}}{\ge}1-\sum_{i=1}^{N}\Pr\left(\bar{A}_i\mid B_i\right)\ge 1-N\cdot\frac{\delta_1}{N}=1-\delta_1$$

Now we show that there exists a large enough $T_0$ such that all players have at least $C$ observations of each arm with probability at least $1-\delta_2$.
We define $A_{i,j}(t)$ to be the indicator of the event that player $i$ observed a reward from arm $j$ (without a collision) at round $t$.
Note that $\mathbb{E}[A_{i,j}(t)]$ is the same for any round $t$ and any $i,j$.
So for any $i,j$ we have that

$$\Pr\left(\text{player }i\text{ has}\le\tfrac{1}{2}T_0\,\mathbb{E}[A_{i,j}(t)]\text{ observations of arm }j\right)=\Pr\left(\sum_{t=1}^{T_0}A_{i,j}(t)\le\tfrac{1}{2}T_0\,\mathbb{E}[A_{i,j}(t)]\right)\overset{\text{Chernoff bound}}{\le}\exp\left(-\frac{T_0\,\mathbb{E}[A_{i,j}(t)]}{8}\right)$$

Note that we can apply the Chernoff bound here since, for any $i,j$, the variables $A_{i,j}(t)$ are i.i.d. across $t$, because all players are sampling uniformly at random at every round of learning.
Using a union bound we get that:

$$\Pr\left(\exists\, i,j\text{ s.t. }\sum_{t=1}^{T_0}A_{i,j}(t)\le\tfrac{1}{2}T_0\,\mathbb{E}[A_{i,j}(t)]\right)\le N\cdot K\cdot\exp\left(-\frac{T_0\,\mathbb{E}[A_{i,j}(t)]}{8}\right)$$

In order for this probability to be upper bounded by $\delta_2$, we need:

$$T_0>\frac{8}{\mathbb{E}[A_{i,j}(t)]}\ln\left(\frac{NK}{\delta_2}\right)$$

We have shown that if $T_0>\frac{8}{\mathbb{E}[A_{i,j}(t)]}\ln\left(\frac{NK}{\delta_2}\right)$, then with probability at least $1-\delta_2$, the number of observations player $i$ has of arm $j$, namely $\sum_{t=1}^{T_0}A_{i,j}(t)$, exceeds $\frac{1}{2}T_0\,\mathbb{E}[A_{i,j}(t)]$.
We also need the total number of observations each player has of each arm to be at least $C$,
i.e.

$$\sum_{t=1}^{T_0}A_{i,j}(t)>\frac{1}{2}T_0\,\mathbb{E}[A_{i,j}(t)]\ge C>\frac{2}{\epsilon^2}\ln\left(\frac{2KN}{\delta_1}\right)\;\Longrightarrow\;T_0\ge\frac{2}{\mathbb{E}[A_{i,j}(t)]}\cdot\frac{2}{\epsilon^2}\ln\left(\frac{2KN}{\delta_1}\right)$$

So we have two constraints on $T_0$, thus we take
$$T_0=\left\lceil\max\left\{\frac{8}{\mathbb{E}[A_{i,j}(t)]}\ln\left(\frac{NK}{\delta_2}\right),\;\frac{2}{\mathbb{E}[A_{i,j}(t)]}\cdot\frac{2}{\epsilon^2}\ln\left(\frac{2KN}{\delta_1}\right)\right\}\right\rceil.$$
We remind the reader that $\mathbb{E}[A_{i,j}(t)]=\frac{1}{K}\left(1-\frac{1}{K}\right)^{N-1}$ for all $t$.
So we take $\delta_1=\delta_2=\frac{\delta}{2}$ and the result holds. Using the events $A$ and $B$ as defined above, we get that

$$\Pr(A)=1-\Pr(\bar{A})=1-\left(\Pr(\bar{A}\mid B)\Pr(B)+\Pr(\bar{A}\mid\bar{B})\Pr(\bar{B})\right)\ge 1-\left(\Pr(\bar{A}\mid B)+\Pr(\bar{B})\right)\ge 1-(\delta_1+\delta_2)\ge 1-\delta$$

Notice that letting $T_0$ take this value is only possible if one knows $N$. If $N$ is unknown, then one can increase $T_0$ and set it to

$$T_0=\left\lceil\max\left\{\frac{8}{\mathbb{E}[A_{i,j}(t)]}\ln\left(\frac{K^2}{\delta_2}\right),\;\frac{2}{\mathbb{E}[A_{i,j}(t)]}\cdot\frac{2}{\epsilon^2}\ln\left(\frac{2K^2}{\delta_1}\right)\right\}\right\rceil,$$

and the lemma would still hold, since we assume that $N\le K$.
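Putting the pieces together, the choice of $T_0$ can be computed numerically (a sketch consistent with the two constraints above; the function name is ours):

```python
import math

def exploration_rounds(K, N, eps, delta1, delta2):
    """T0 combining both constraints, with
    E[A_ij(t)] = (1/K) * (1 - 1/K)**(N - 1)."""
    e_obs = (1.0 / K) * (1.0 - 1.0 / K) ** (N - 1)
    t_conc = (8.0 / e_obs) * math.log(N * K / delta2)                   # concentration constraint
    t_rank = (4.0 / (e_obs * eps ** 2)) * math.log(2 * K * N / delta1)  # ranking constraint
    return math.ceil(max(t_conc, t_rank))
```

As expected, the required number of exploration rounds grows as $\epsilon$ shrinks or as the number of players approaches the number of arms.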

#### A.1.2 Proof of Lemma 2

Fix some player, and let $C_t$ be the number of collisions observed by the player until time $t$. Also, let $p$ be the true probability of a collision when $N$ players are choosing arms uniformly at random among $K$ arms. The probability of a player not experiencing a collision is

$$\Pr(\text{no collision})=\sum_{j=1}^{K}\Pr(\text{choose arm }j)\cdot\Pr(\text{no other player chooses arm }j)=\sum_{j=1}^{K}\frac{1}{K}\left(1-\frac{1}{K}\right)^{N-1}=\left(1-\frac{1}{K}\right)^{N-1}$$

Thus the probability of a collision at any round of learning is $p=1-\left(1-\frac{1}{K}\right)^{N-1}$. Note that $p<1$ for any finite $N$. Inverting this equation, we get

$$N=\frac{\log(1-p)}{\log\left(1-\frac{1}{K}\right)}+1.$$

Therefore, if we let $\hat{p}_t=\frac{C_t}{t}$ be the empirical estimate of the collision probability after $t$ rounds, it is natural to take the estimator $N^*$ defined as

$$N^*=\mathrm{round}\left(\frac{\log(1-\hat{p}_t)}{\log\left(1-\frac{1}{K}\right)}+1\right)=\mathrm{round}\left(\frac{\log\left(\frac{t-C_t}{t}\right)}{\log\left(1-\frac{1}{K}\right)}+1\right)$$

Our goal will be to show that when $t$ is sufficiently large, $N^*=N$ with arbitrarily high probability. Specifically, we will upper bound the probability of the estimator being far from the true value (which also includes the unlikely case $\hat{p}_t=1$, in which case the estimator is infinite).
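The estimator above is straightforward to implement (a sketch; names are ours, with the degenerate $\hat{p}_t=1$ case returned as `None`):

```python
import math

def estimate_num_players(t, collisions, K):
    """N* = round(log(1 - p_hat) / log(1 - 1/K) + 1) with p_hat = collisions / t.
    Returns None in the degenerate case p_hat = 1 (infinite estimator)."""
    if collisions >= t:
        return None
    p_hat = collisions / t
    return round(math.log(1.0 - p_hat) / math.log(1.0 - 1.0 / K) + 1)

# When the empirical collision rate matches the true p for N players,
# the estimator recovers N exactly:
K, N, t = 10, 4, 10_000
p = 1 - (1 - 1 / K) ** (N - 1)
```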

Recalling that $p=1-\left(1-\frac{1}{K}\right)^{N-1}$, and the definition of $N^*$, to ensure that $N^*=N$ it is enough to require

$$\left|\frac{\log(1-\hat{p}_t)}{\log\left(1-\frac{1}{K}\right)}-\frac{\log(1-p)}{\log\left(1-\frac{1}{K}\right)}\right|\le\gamma$$

for some $\gamma<\frac{1}{2}$, which is equivalent to requiring

$$\left|\frac{\log\left(\frac{1-\hat{p}_t}{1-p}\right)}{\log\left(1-\frac{1}{K}\right)}\right|\le\gamma.$$

Let $\beta$ denote the actual difference between $\hat{p}_t$ and $p$, so that $\hat{p}_t=p+\beta$. Therefore, the above is equivalent to

$$\begin{aligned}
&-\gamma\le\frac{\log\left(\frac{1-p-\beta}{1-p}\right)}{\log\left(1-\frac{1}{K}\right)}\le\gamma\\
\Longleftrightarrow\;&\gamma\log\left(1-\tfrac{1}{K}\right)\le\log\left(\frac{1-p-\beta}{1-p}\right)\le-\gamma\log\left(1-\tfrac{1}{K}\right)\\
\Longleftrightarrow\;&\left(1-\tfrac{1}{K}\right)^{\gamma}\le\frac{1-p-\beta}{1-p}\le\left(1-\tfrac{1}{K}\right)^{-\gamma}\\
\Longleftrightarrow\;&(1-p)\left(1-\tfrac{1}{K}\right)^{\gamma}\le 1-p-\beta\le(1-p)\left(1-\tfrac{1}{K}\right)^{-\gamma}\\
\Longleftrightarrow\;&-1+p+(1-p)\left(1-\tfrac{1}{K}\right)^{\gamma}\le-\beta\le-1+p+(1-p)\left(1-\tfrac{1}{K}\right)^{-\gamma}\\
\Longleftrightarrow\;&1-p-(1-p)\left(1-\tfrac{1}{K}\right)^{-\gamma}\le\beta\le 1-p-(1-p)\left(1-\tfrac{1}{K}\right)^{\gamma}\\
\Longleftrightarrow\;&(1-p)\left(1-\left(1-\tfrac{1}{K}\right)^{-\gamma}\right)\le\beta\le(1-p)\left(1-\left(1-\tfrac{1}{K}\right)^{\gamma}\right)
\end{aligned}$$

Therefore, it suffices to ensure that $|\beta|\le\epsilon_1$, where

$$\epsilon_1=\min\left\{\left|(1-p)\left(1-\left(1-\tfrac{1}{K}\right)^{-\gamma}\right)\right|,\;\left|(1-p)\left(1-\left(1-\tfrac{1}{K}\right)^{\gamma}\right)\right|\right\}$$

for some $\gamma<\frac{1}{2}$ (say, $\gamma=0.49$), as we then get $N^*=N$ as required. If $t$ is sufficiently large, this can be done using Hoeffding's inequality: $\hat{p}_t$ is an average of $t$ i.i.d. random variables with expectation $p$, hence $|\beta|\le\epsilon_1$ with probability at least $1-\delta$, provided that $t\ge\frac{1}{2\epsilon_1^2}\ln\left(\frac{2}{\delta}\right)$.

We now replace the expression for $\epsilon_1$ above, which is a bit unwieldy, with a simpler lower bound (where we also take $\gamma=0.49$). First, plugging in the expression for $1-p$, we get

$$\epsilon_1=\min\left\{\left|\left(1-\tfrac{1}{K}\right)^{N-1}\left(1-\left(1-\tfrac{1}{K}\right)^{-0.49}\right)\right|,\;\left|\left(1-\tfrac{1}{K}\right)^{N-1}\left(1-\left(1-\tfrac{1}{K}\right)^{0.49}\right)\right|\right\}$$

We first lower bound the first expression:

$$\left|\left(1-\tfrac{1}{K}\right)^{N-1}\left(1-\left(1-\tfrac{1}{K}\right)^{-0.49}\right)\right|=-\left(\left(1-\tfrac{1}{K}\right)^{N-1}\left(1-\left(1-\tfrac{1}{K}\right)^{-0.49}\right)\right)$$

We use a Taylor expansion to lower bound $\left(1-\frac{1}{K}\right)^{-0.49}$: considering $f(x)=(1-x)^{-0.49}$, the first derivative is $f'(x)=0.49(1-x)^{-1.49}$ and the second derivative is $f''(x)=0.49\cdot 1.49\cdot(1-x)^{-2.49}$, which is non-negative for any $x<1$. Therefore $f(x)\ge f(0)+f'(0)\,x=1+0.49x$ for any $x\in[0,1)$, and replacing $x$ with $\frac{1}{K}$ (and using $N\le K$) we get that

$$-\left(\left(1-\tfrac{1}{K}\right)^{N-1}\left(1-\left(1-\tfrac{1}{K}\right)^{-0.49}\right)\right)\ge\left(1-\tfrac{1}{K}\right)^{N-1}\cdot\frac{0.49}{K}\ge\left(1-\tfrac{1}{K}\right)^{K-1}\cdot\frac{0.49}{K}\ge\frac{0.49}{\exp(1)\cdot K}.$$

Similarly, we lower bound the second expression:

$$\left(1-\tfrac{1}{K}\right)^{N-1}\left(1-\left(1-\tfrac{1}{K}\right)^{0.49}\right)\ge\left(1-\tfrac{1}{K}\right)^{K-1}\left(1-\left(1-\tfrac{1}{K}\right)^{0.49}\right)\ge\frac{1}{\exp(1)}\left(1-\left(1-\tfrac{1}{K}\right)^{0.49}\right)$$

We use a Taylor expansion again to lower bound $1-\left(1-\frac{1}{K}\right)^{0.49}$. We look at the function $g(x)=(1-x)^{0.49}$. The first derivative of $g$ is $g'(x)=-0.49(1-x)^{-0.51}$ and the second derivative of $g$ is $g''(x)=-0.49\cdot 0.51\cdot(1-x)^{-1.51}$. Note that $g''(x)\le 0$ for any $x<1$. Thus we get $(1-x)^{0.49}\le 1-0.49x$, so $1-\left(1-\frac{1}{K}\right)^{0.49}\ge\frac{0.49}{K}$, and the lower bound is again $\frac{0.49}{\exp(1)\cdot K}$.

Combining the above, we showed that

$$\epsilon_1\ge\frac{0.49}{\exp(1)\cdot K}\ge\frac{0.1}{K}.$$

Taking this value for $\epsilon_1$, we get that if we run the learning phase for $t\ge\frac{1}{2\epsilon_1^2}\ln\left(\frac{2}{\delta}\right)$ rounds, then with probability at least $1-\delta$ we have $N^*=N$.
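The lower bound $\epsilon_1\ge\frac{0.1}{K}$ can be checked numerically against the exact expression (a sketch; names are ours, and we assume $N\le K$ as in the proof):

```python
def eps1(K, N, gamma=0.49):
    """Exact eps_1 from the min{...} expression, with 1 - p = (1 - 1/K)**(N - 1)."""
    base = 1.0 - 1.0 / K
    one_minus_p = base ** (N - 1)
    a = abs(one_minus_p * (1.0 - base ** (-gamma)))
    b = abs(one_minus_p * (1.0 - base ** gamma))
    return min(a, b)
```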

#### A.1.3 Proof of Lemma 3

We remind the reader that the musical chairs phase is when a set of players who, with high probability, have each learned an $\epsilon$-correct ranking choose an arm uniformly at random from the $N$ best arms, and stay 'fixed' on that arm until the end of the epoch or game. Thus, once a player has become fixed, the only way she can contribute regret is if another, non-fixed player collides with her.
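For intuition, the subroutine itself can be sketched as follows (our own minimal rendering, not the paper's implementation; `observe_collision` is an assumed environment callback that plays an arm and reports whether a collision occurred):

```python
import random

def musical_chairs(rounds, top_arms, observe_collision, rng=random):
    """Sketch of the musical chairs subroutine: pick uniformly at random
    among the top arms until a collision-free round, then stay fixed."""
    fixed_arm = None
    for _ in range(rounds):
        if fixed_arm is not None:
            observe_collision(fixed_arm)  # keep playing the fixed arm
            continue
        arm = rng.choice(top_arms)
        if not observe_collision(arm):
            fixed_arm = arm  # collision-free round: fix on this arm
    return fixed_arm
```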

We will denote by $N$ the number of players starting the musical chairs phase, and let $t$ be the number of rounds since the start of this phase (i.e., when the musical chairs phase starts, $t=0$).

We first bound the time it takes for one player running the musical chairs subroutine to become fixed. In the dynamic player setting (rather than the static one), we also consider the maximum number of players over the course of the game.

We start by fixing some player who is running the musical chairs subroutine, and consider the players that entered late and are not running the musical chairs subroutine, but rather are choosing arms uniformly at random. For any round after the musical chairs phase begins, the probability for this player to become fixed is at least:

 ∑all unfixed arms1N⋅(1−1N)N−