# Online Learning for Cooperative Multi-Player Multi-Armed Bandits

We introduce a framework for decentralized online learning in multi-armed bandits (MAB) with multiple cooperative players. The reward obtained by the players in each round depends on the actions taken by all of them. The setting is a team problem with a common objective; information asymmetry is what makes it interesting and challenging. We consider three types of information asymmetry: action information asymmetry, where the actions of the players cannot be observed but the rewards received are common; reward information asymmetry, where the actions of the other players are observable but the rewards received are IID from the same distribution; and the case with both action and reward information asymmetry. For the first setting, we propose a UCB-inspired algorithm that achieves O(log T) regret whether the rewards are IID or Markovian. For the second setting, we exhibit an environment in which the algorithm given for the first setting incurs linear regret. For the third setting, we show that a variation of the `explore then commit' algorithm achieves near-log regret.


## 1 Introduction

Multi-armed bandit (MAB) models are prototypical models for online learning to understand the exploration-exploitation tradeoff. There is a huge literature on such models, beginning with Bayesian bandit models [13] and non-Bayesian bandits [23], introduced by Lai and Robbins in [22]. A key algorithm for MAB models is the UCB algorithm introduced in [3], which spurred a lot of innovation on such optimism in the face of uncertainty (OFU) algorithms. This included multiplayer multi-armed bandit models introduced in a matching context in [12, 25, 1, 19, 30]. This was motivated by the problem of spectrum sharing in wireless networks, wherein the users want to get matched to channels, each channel can be occupied by only one wireless user, and together they want to maximize the expected sum throughput.

In this paper, we consider a general multi-agent multi-armed bandit model wherein each agent has a different set of arms. The players have a team (common) objective, and the individual rewards depend on the arms selected by all the players. This can be seen as a multi-dimensional version of a single-agent MAB model, for which the same centralized algorithm would still work. The twist is that there is information asymmetry between the players, which can be of various types. First, a player may not be able to observe the actions of the other players (action information asymmetry). Thus, despite the reward being common information, an agent cannot tell which arm-tuple the reward came from. Second, the players may observe the actions of the other players, but receive different i.i.d. rewards drawn from the distribution of the arm-tuple played (reward information asymmetry). Third, we may have both action and reward information asymmetry between the agents. These types of information asymmetry make the problem of decentralized online learning in multi-player MAB models challenging. Furthermore, as we will see, the three problems are of increasing complexity. The rewards may be i.i.d., i.e., every time an arm-tuple is chosen, players get an i.i.d. random reward from the corresponding distribution. Or they could be Markovian, i.e., the rewards come from a Markov chain corresponding to the arm-tuple chosen; such a Markov chain only evolves when the corresponding arm-tuple is chosen.

There have been a number of papers on decentralized MAB models beginning with [19], which gave the first sublinear (log-squared) regret algorithm, as well as [30] which gave the first decentralized algorithm with order-optimal regret, followed by a number of others such as [4] and [7]. All of these results are really about the matching bandits problem, i.e., the players in a decentralized manner want to learn the optimal matching. The setting in this paper is different, and in some sense more general. It is akin to players playing a game with the caveat that the reward depends on actions taken by all of them but is common (or from the same distribution).

We first consider the problem with action information asymmetry. We define a standard index function for each player, and propose a variant of the UCB algorithm we call mUCB. We show that it achieves $O(\log T)$ gap-dependent regret, and $O(\sqrt{T\log T})$ gap-independent regret. Next, we consider the problem when there is both action and reward information asymmetry, and show that the mDSEE algorithm is able to achieve a near-log regret bound for this setting. Finally, we consider the problem with reward information asymmetry, wherein actions are observable but the rewards to the various players are IID (though they come from the same distribution), and provide an environment such that mUCB gives linear regret. However, as this is a special case of the setting with both action and reward information asymmetry, it is clear that mDSEE obtains the same order of regret in this problem.

#### Related Work.

We now discuss related work in more detail.

The literature on multi-armed bandits is surveyed in [34, 13, 23]. Some classical papers worth mentioning are [22, 2, 3]. Interest in multi-player MAB models was triggered by the problem of opportunistic spectrum sharing, and some early papers were [12, 25, 1]. Other papers motivated by similar problems in communications and networks are [26, 21, 32, 9]. These papers were either for the centralized case, or considered the symmetric-user case, i.e., all users have the same reward distributions. Moreover, if two or more users choose the same arm, there is a “collision", and none of them gets a positive reward. The first paper to solve this matching problem in a general setting was [19], which obtained log-squared regret. This was then improved to log regret in [30] by employing a posterior sampling approach. These algorithms required implicit (and costly) communication between the players. Thus, there were attempts to design algorithms without it [4, 31, 7, 10]. [5] considers lower bounds on regret for certain classes of multiplayer MAB algorithms. Other recent papers on decentralized learning for multiplayer matching MAB models are [38, 33].

Another related thread is the learning and games literature [40, 11, 8]. This literature deals with strategic learning, wherein each player has its own objective, and the question is whether a learning process can reach some sort of an equilibrium. A key result here is [17], which showed that learning algorithms with uncoupled dynamics do not lead to a Nash equilibrium. It was also shown that simple adaptive procedures lead to a correlated equilibrium [14, 15, 16, 18]. In fact, the model in this paper may be regarded as more closely related to this literature than to the multiplayer/decentralized matching MAB models discussed above, except that all players have a common objective, à la team theory [27].

In the realm of multiplayer stochastic bandits, nearly all works allow for limited communication such as those in [28, 29, 35, 20, 36]. An exception is [6] where all the players select from the same set of arms and their goal is to avoid a collision, that is, they do not want to select the same arm as another player. Another exception is [39] where they developed online learning algorithms that enable agents to cooperatively learn how to maximize reward with noisy global feedback without exchanging information.

## 2 Preliminaries and Problem Statements

### 2.1 Multi-player MAB model with IID Rewards

#### Problem A: Action Information Asymmetry with unobserved actions, common rewards.

We first present a multi-player multi-armed bandit (MMAB) model, and then a series of problem formulations. Consider a set of $M$ players, in which player $i$ has a set $[K_i] = \{1,\dots,K_i\}$ of arms to pick from. At each time instant, each player picks an arm from its set, with the $M$-tuple of arms picked denoted by $a = (a^1,\dots,a^M)$. This generates a random reward $X_a(t) \in [0,1]$ (without loss of generality for bounded reward functions) from a $1$-subgaussian distribution $\nu_a$ with mean $\mu_a$. Denote $\mu^* = \max_a \mu_a$, the highest reward mean among all arm-tuples. Let $a^i(t)$ be the arm chosen by player $i$ at time $t$, and denote $a(t) = (a^1(t),\dots,a^M(t))$. If arms $a(t)$ are pulled by the players collectively, the reward to all the players equals $X_{a(t)}(t)$, i.e., the reward is common, depends on $a(t)$, and is independent across time instants. A high-level objective is for the players to collectively identify the best set of arms corresponding to mean reward $\mu^*$. But the players do not know the means $\mu_a$, nor the distributions $\nu_a$. They must learn by playing and exploring. Thus, we can capture the learning efficiency of an algorithm via the notion of per-player expected regret,

 $$R_T = \mathbb{E}\Big[T\mu^* - \sum_{t=1}^{T} X_{a(t)}(t)\Big] \qquad (1)$$

where $T$ is the number of learning instances and $X_{a(t)}(t)$ is the random reward if arm-tuple $a(t)$ is pulled. Note that if the players jointly learn the optimal arms eventually, the per-unit expected regret $R_T/T \to 0$ as $T \to \infty$. Thus, our goal is to design a multi-player decentralized learning algorithm that has sublinear (expected) regret. Note that fundamental results for single-player MAB problems [22] suggest an $\Omega(\log T)$ regret lower bound for the multi-player MAB problem as well. If we can design a multi-player decentralized learning algorithm with such a regret order, then it would imply that such a lower bound is tight for this setting as well. Multi-player decentralized learning for an MAB problem would essentially be like single-player (centralized) learning for an MAB problem with multi-dimensional arms if all players had the same information while making decisions at all times. But this may not hold in various applications, as players may not be able to observe, for example, the actions of the other players. Thus, there may be information asymmetry between the players, which makes the problem much more challenging. We specifically exclude any explicit communication between the players during learning, though they may coordinate a priori.
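To make the model concrete, here is a minimal Python sketch of the common-reward environment and the empirical counterpart of the regret in (1). The two-player mean matrix, the noise level, and the helper names (`play_round`) are hypothetical choices for illustration, not from the paper.

```python
import numpy as np

# Two players, two arms each: mu[a1][a2] is the mean reward of the
# arm-tuple (a1, a2). The matrix and noise level are hypothetical.
rng = np.random.default_rng(0)
mu = np.array([[0.9, 0.5],
               [0.3, 0.7]])
mu_star = mu.max()  # mu^*, the best achievable mean reward

def play_round(a1, a2):
    """All players observe the SAME noisy reward for the joint pull."""
    return mu[a1, a2] + rng.normal(0.0, 0.1)

# Empirical counterpart of the per-player regret (1) for a fixed policy
# that always pulls the suboptimal tuple (0, 1) with mean 0.5.
T = 1000
rewards = [play_round(0, 1) for _ in range(T)]
regret = T * mu_star - sum(rewards)  # concentrates near T * (0.9 - 0.5)
```

A policy that eventually locks onto the argmax tuple `(0, 0)` would instead make `regret / T` vanish, which is exactly the sublinear-regret goal stated above.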

Markovian reward model for Problem A. Finally, we will also consider rewards to be Markovian, i.e., for a fixed tuple of arms $a$, the reward sequence $\{X_a(t)\}$ is a Markov chain. Moreover, the Markov chains for different arm-tuples are mutually independent. The notion of per-player expected regret is the same as defined in (1).

#### Problem B’: Reward Information Asymmetry with observed actions, independent rewards.

We will see this as a special case of Problem B below. Here we consider the setting where each agent can observe the actions of all the other agents, but the rewards to the players are different, i.e., if the arms pulled at time $t$ are $a(t)$, each player gets an i.i.d. copy of the reward from the distribution $\nu_{a(t)}$, i.e., player $i$ gets $X^i_{a(t)}(t)$. Thus, the expected regret of player $i$ is obtained by replacing $X_{a(t)}(t)$ in (1) with $X^i_{a(t)}(t)$. Note that the expected regret is the same for all players, since the rewards obtained are IID. All of the players can see what arms the other players are pulling; however, the rewards are independently and identically generated for each player. This means the players end up seeing different reward realizations, but from the same distributions.

#### Problem B: Action and Reward Information Asymmetry with unobserved actions, independent rewards.

Next, we can have asymmetry of information in terms of action observations as well as rewards. We assume that no agent can observe the actions of the others, and that the rewards to the players are independent and from the same distribution if the tuple of actions taken at time $t$ is $a(t)$. In this case, the expected regret for each player is still defined as in (1). Note that Problem B is more general than Problem B'.

#### Our Contributions.

For Problem A, we provide a simple and elegant method, unseen in the previous literature, for multiple players to coordinate their arms knowing only the global rewards. For Problem B, we provide a gap-independent algorithm that attains almost optimal near-log regret, whereas the disCo algorithm in [39] is gap-dependent. For the special case Problem B', we give a counterexample which shows that Algorithm 1, designed for Problem A, obtains linear regret even when the players can see the other players' arms.

## 3 Online Multi-player Bandit Learning: Algorithms and Main Results

We now present algorithms and their regret performance for the various settings introduced above.

### 3.1 Problem A: IID and Markovian Rewards

Let us first define a UCB (Upper Confidence Bound) index to be used by each player. Suppose at time $t$, player $i$ considers arm-tuple $a$. Then, the UCB index of player $i$ for arm-tuple $a$ is given by

 $$\eta^i_a(t) = \begin{cases} \infty, & \text{if } n_a(t) = 0,\\[4pt] \hat\mu^i_a(t, n_a(t)) + \sqrt{\dfrac{2\log(1/\delta)}{n_a(t)}}, & \text{otherwise.} \end{cases} \qquad (2)$$

Here, $\hat\mu^i_a(t, n_a(t))$ is the average of the rewards received by player $i$ from arm-tuple $a$ after $t$ total rounds, $n_a(t)$ is the number of times arm-tuple $a$ has been selected in $t$ rounds, and $\delta$ is a confidence parameter which shall be selected later. Note that in the setting here, all players receive a common reward, and thus the indices are the same for all players. Further note that while each player only gets to observe the common reward but not the actions of the other players, the players can agree that in the first $K_1 \cdots K_M$ rounds they will play a predetermined sequence of actions such that together they obtain at least one reward from each arm-tuple. This initial exploration sequence is similar to that in a single-player UCB algorithm. At the end of the initial round, all players have received the same reward observations and also know the arm-tuple from which each such reward came. Thus, the index computed by each player for any arm-tuple is the same. Hence, while, say, player $i$ cannot observe the actions of the other players, since they all use the same algorithm and the information available to all is the same, player $i$ can anticipate the actions that will be taken by the others. Thus, at each time instant, the index (2) can be updated as a function of the action of player $i$ as well as the actions of the other players. The only difficulty arises if the indices for two arm-tuples, say $a$ and $a'$, are the same, in which case all players must break the tie in the same way. Thus, to state the mUCB algorithm, we define a total order on the set of arm-tuples as follows.

###### Definition 1.

We say that $a \prec a'$ if and only if there exists an $i$ such that $a^j = a'^j$ for all $j < i$, and $a^i < a'^i$.

We are now ready to define the multi-player UCB algorithm as follows:
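Since the algorithm box itself did not survive extraction here, the following is a hedged Python sketch (not a verbatim restoration of Algorithm 1) of the per-round selection each player performs: compute the common index (2) for every arm-tuple from the shared reward history, and break ties with a fixed total order on tuples in the spirit of Definition 1. The names `ucb_index` and `select_tuple` are illustrative.

```python
import math
from itertools import product

def ucb_index(mean, pulls, delta):
    """UCB index (2) for one arm-tuple; infinite while unexplored."""
    if pulls == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(1.0 / delta) / pulls)

def select_tuple(arm_sets, means, counts, delta):
    """Every player runs this identical computation on the shared reward
    history, so all players arrive at the same arm-tuple; ties in the
    index are broken by the (lexicographic) order on the tuples."""
    tuples = list(product(*arm_sets))
    return max(
        tuples,
        key=lambda a: (ucb_index(means[a], counts[a], delta), a),
    )
```

Because each player feeds the same inputs into the same deterministic function, no communication is needed for the players to agree on the next joint pull, which is exactly the coordination property the text describes.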

###### Remark 1.

(i) We note that the index (2) and Algorithm 1 are used for both cases of IID and Markovian rewards, though the regret bounds will differ, as we will see. (ii) While each player can only observe its own actions, the reward obtained depends on the actions taken by all of them together and is common. However, because the players all have the same observations, due to the coordination in the initial exploration round, they can anticipate the actions taken by the others, as well as compute the indices as a function of the actions taken by all of them. Thus, despite not being able to observe the actions of the others, nor being able to communicate among themselves, they are still able to coordinate their actions.

Thus, we can prove the following upper bounds on gap-dependent expected regret for each player.

###### Theorem 1.

If each player uses the algorithm mUCB in the setting of Problem A, and the rewards are IID, then the expected regret is upper bounded as:

 $$R_T \le 3\sum_a \Delta_a + \sum_{a:\Delta_a>0} \frac{(6+4\sqrt{2})\log T}{\Delta_a}. \qquad (3)$$

The proof can be found in Section 4.1. Note that we can anticipate a fundamental lower bound for Problem A of at least $\Omega(\log T)$ (since that is a fundamental lower bound for the single-player MAB problem). The upper bound above matches this order. Thus, we can conclude that the $\Omega(\log T)$ lower bound is achievable for Problem A.

We next present a gap-independent upper bound on expected regret for problem A.

###### Theorem 2.

If each player uses the algorithm mUCB in the setting of Problem A, and the rewards are IID, then the (gap-independent) expected regret is upper bounded as:

 $$R_T \le 3K_1\cdots K_M + \Big(1 + \sum_{a:\Delta_a>\epsilon}(6+4\sqrt{2})\Big)\sqrt{T\log T}. \qquad (4)$$

The proof can be found in Appendix A.1. This also matches the known gap-independent fundamental lower bounds on expected regret for the single player MAB problem (and hence also the multi-player MAB problem A). Next, we present regret bounds for the Markovian rewards case.

###### Theorem 3.

If each player uses the algorithm mUCB in the setting of Problem A, and the rewards are Markovian, then the expected regret is upper bounded as:

 $$R_T \le \sum_a \Delta_a\big(2C' + \alpha\log T\big) = O(\log T) \qquad (5)$$

for some universal constants $C'$ and $\alpha$.

The proof can be found in Appendix A.1.

### 3.2 Problem B’: IID Rewards

We now consider a different type of information asymmetry, namely, the actions of other players are observable but each player gets an IID copy of the rewards. As stated before, this is a special case of Problem B. The UCB (Upper Confidence Bound) index to be used by each player in this setting is the same as given in (2). Note that since the actions of other players can be observed, each player can compute the index for each arm-tuple, though unlike in Problem A, these indices will have different values since the rewards are different (and a player knows only its own rewards). The algorithm we use for Problem B' is still the UCB algorithm given as Algorithm 1, except that the actions are directly observable.

###### Remark 2.

(i) We note that the index (2) and Algorithm 1 are used for both cases of IID and Markovian rewards, though the regret bounds will differ, as we will see. (ii) While each player can observe the actions of the others, their rewards are different (and IID). The indices they compute will also be different. Thus, unlike in Problem A, they are not able to coordinate their actions, which makes the problem much more challenging.

The following result shows that mUCB no longer produces sublinear regret for this problem.

###### Theorem 4.

If each player uses the algorithm mUCB in the setting of Problem B', and the rewards are IID, then there exists a multiplayer multi-armed bandit environment in which the algorithm incurs linear regret.

The proof can be found in Section 4.2.

The above negative result motivates us to consider a different algorithm for Problem B (which is a generalization of Problem B’).

### 3.3 Problem B: Unobserved Actions and Independent (IID) rewards

We introduce the mDSEE algorithm, inspired by the DSEE algorithm of [37], for Problem B; it achieves near-logarithmic regret. We consider the setting where there are two types of information asymmetry between the players: the actions of others are unobserved and the rewards are independent. This problem is much more challenging than Problem A, and a variation of the mUCB algorithm does not work.

Let the reward for player $i$ at time $t$ on playing arm $a^i$ while the other players play arms $a^{-i}$ be denoted by $X^i_a(t)$. Then, the sample mean at time $t$ of the rewards obtained by player $i$ from playing arm-tuple $a$ is denoted by $\hat\mu^i_a(t, n_a(t))$.

Note that since Problem B’ is a special case of Problem B, this algorithm can also be used for Problem B’.

The players have an exploration phase wherein each arm-tuple is pulled $K(l)$ times before moving on to the next one. After all the arm-tuples have been pulled $K(l)$ times, the players pick the arm-tuple with the highest average reward and commit to it until the next power of $2$. $K(\cdot)$ is chosen to be a function that goes to infinity, to compensate for the exponentially increasing intervals of commitment to a single arm-tuple. This process is rigorously justified in the proof of Theorem 5. Since the rewards are independently generated for each player, it is possible that different players commit to different $M$-tuples as the optimal arm. However, the probability of this happening is low, since the probability of a wrong commitment goes to $0$. Since we are exploring at powers of $2$, the regret is near-logarithmic.
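The phase structure above can be sketched as follows. The concrete slowly growing choice of `K` (an iterated logarithm) and the helper name `schedule` are illustrative assumptions, not the paper's specification.

```python
import math

def K(l):
    """A slowly increasing function going to infinity (illustrative)."""
    return math.log(math.log(l + 3))

def schedule(T, n_tuples):
    """Label each round 1..T as 'explore' or 'commit': at each power of 2
    a new exploration phase pulls every arm-tuple ceil(K(phase)) times,
    then the players commit to the empirical best until the next power."""
    plan = []
    t, phase = 1, 1
    while t <= T:
        start_next = 2 ** phase                  # next re-exploration point
        budget = n_tuples * math.ceil(K(phase))  # pulls this phase
        for _ in range(budget):                  # exploration block
            if t > T:
                return plan
            plan.append((t, "explore"))
            t += 1
        while t < start_next and t <= T:         # commitment block
            plan.append((t, "commit"))
            t += 1
        phase += 1
    return plan
```

Because the commitment blocks double in length while each exploration block grows only like `K`, the fraction of exploration rounds up to time `T` shrinks, which is the source of the near-logarithmic regret.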

###### Remark 3.

The slower $K(\cdot)$ grows, the better the regret.

The mDSEE algorithm achieves the following (gap-independent) expected regret on Problem B (and B’).

###### Theorem 5.

If each player uses the mDSEE algorithm in the setting of Problem B (and B’), and the rewards are IID, then the (gap-independent) expected regret satisfies $R_T = O\big(K(\lfloor\log_2 T\rfloor)\log T\big)$, i.e., near-logarithmic regret.

The proof can be found in Section 4.3.

## 4 Regret Analysis: Proofs of Theorems

We need Lemma 3 for subgaussian random variables, reproduced in Appendix A. We first present the regret decomposition lemma.

###### Lemma 1.

With regret defined in (1), we have the following regret decomposition for each player:

 $$R^i_T = \sum_a \Delta_a \, \mathbb{E}[n_a(T)], \qquad (6)$$

where $n_a(T)$ is the number of times arm-tuple $a$ has been pulled up to round $T$, and $\Delta_a = \mu^* - \mu_a$.

The proof can be found in Appendix A.
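The decomposition can be sanity-checked numerically. The means and pull counts below are hypothetical; the snippet only verifies that definition (1) and the right-hand side of (6) agree for a fixed allocation of pulls.

```python
# Hypothetical means for a 2x2 arm-tuple space.
mu = {(1, 1): 1.0, (1, 2): 0.7, (2, 1): 0.6, (2, 2): 0.4}
mu_star = max(mu.values())
delta = {a: mu_star - m for a, m in mu.items()}  # gaps Delta_a

# Suppose over T = 100 rounds the arm-tuples were pulled this many times.
n = {(1, 1): 70, (1, 2): 10, (2, 1): 10, (2, 2): 10}
T = sum(n.values())

lhs = T * mu_star - sum(n[a] * mu[a] for a in mu)  # definition (1), in expectation
rhs = sum(delta[a] * n[a] for a in mu)             # Lemma 1 decomposition (6)
# Both equal 13.0 here: regret only accrues on the 30 suboptimal pulls.
```
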

### 4.1 Problem A: Unobserved Actions, Common Rewards

Now, we present proofs of theorems for problem A.

###### Proof.

(Theorem 1) We suppose that the first arm-tuple is the optimal one, that is, arm-tuple $(1,\dots,1)$ has the highest average reward. For any arm-tuple $a \ne (1,\dots,1)$, we define the following “good" event:

 $$G_a = \Big\{\mu_{(1,\dots,1)} < \min_{t\in[T]} \eta_{(1,\dots,1)}(t)\Big\} \cap \Big\{\hat\mu_a(m_a) + \sqrt{\tfrac{2\log(1/\delta)}{m_a}} < \mu_{(1,\dots,1)}\Big\} \qquad (7)$$

The set above is the event that the common average reward from arm-tuple $(1,\dots,1)$ is not underestimated by the UCB index, and that the upper confidence bound of arm-tuple $a$ is below $\mu_{(1,\dots,1)}$ after $m_a$ observations. The constant $m_a$ will be chosen later. We show that the regret is small whether or not $G_a$ occurs. Letting $I\{\cdot\}$ be the indicator function, we obtain:

 $$\mathbb{E}[n_a(T)] = \mathbb{E}[I\{G_a\}\,n_a(T)] + \mathbb{E}[I\{G^c_a\}\,n_a(T)]. \qquad (8)$$

We need the following lemma whose proof can be found in Appendix A.1.

###### Lemma 2.

When $G_a$ holds, $n_a(T) \le m_a$.

Thus, using equation (8) we have the bound:

 $$\mathbb{E}[n_a(T)] \le m_a + T\,P(G^c_a). \qquad (9)$$

We will now bound $P(G^c_a)$. Since $n_a(T) \le T$, we have the bound:

 $$\mathbb{E}[I\{G^c_a\}\,n_a(T)] \le \mathbb{E}[I\{G^c_a\}\,T] = T\,P(G^c_a). \qquad (10)$$

We have the following claim for $P(G^c_a)$, whose proof can be found in Appendix A.1.

###### Claim 1.

$G_a$ as defined in equation (7) satisfies:

 $$P(G^c_a) \le e^{-\frac{m_a c^2 \Delta_a^2}{2}} + T\delta, \qquad (11)$$

where $c \in (0,1)$ satisfies $m_a \ge \frac{2\log(1/\delta)}{(1-c)^2\Delta_a^2}$.

It follows from Claim 1 and equation (9) that

 $$\mathbb{E}[n_a(T)] \le m_a + T\Big(T\delta + e^{-\frac{m_a c^2\Delta_a^2}{2}}\Big). \qquad (12)$$

Letting $\delta = 1/T^2$, we will select $m_a$ and $c$ such that inequality (12) is optimized under the constraints given in Claim 1. We first set $m_a = \big\lceil \frac{4\log T}{(1-c)^2\Delta_a^2} \big\rceil$, so that plugging this into inequality (12), we have

 $$\mathbb{E}[n_a(T)] \le \Big\lceil \frac{4\log T}{(1-c)^2\Delta_a^2} \Big\rceil + 1 + T^{1-\frac{2c^2}{(1-c)^2}}. \qquad (13)$$

To minimize the regret, we want $1 - \frac{2c^2}{(1-c)^2} \le 0$, which gives us $c \ge \sqrt{2}-1$ since $c \in (0,1)$. The leading term is minimized when $c$ is as small as possible, so let us pick $c = \sqrt{2}-1$. This gives us the inequality:

 $$\mathbb{E}[n_a(T)] \le \Big\lceil \frac{4\log T}{(2-\sqrt{2})^2\Delta_a^2} \Big\rceil + 2 \le \frac{2(3+2\sqrt{2})\log T}{\Delta_a^2} + 3. \qquad (14)$$

Plugging this back into the regret decomposition (6) yields the desired result in Theorem 1. ∎
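For completeness, the constant in (14) can be checked directly with $c = \sqrt{2}-1$:

```latex
(1-c)^2 = (2-\sqrt{2})^2 = 6 - 4\sqrt{2},
\qquad
\frac{4}{(2-\sqrt{2})^2}
= \frac{4(6+4\sqrt{2})}{(6-4\sqrt{2})(6+4\sqrt{2})}
= \frac{4(6+4\sqrt{2})}{36-32}
= 6+4\sqrt{2} = 2(3+2\sqrt{2}).
```

Multiplying by $\Delta_a$ in the decomposition then produces the $(6+4\sqrt{2})\log T/\Delta_a$ terms appearing in (3).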

### 4.2 Problem B’: Observed Actions, Independent Rewards

###### Proof.

(Theorem 4) Consider the two-player setting where each player has two arms, with means as in Figure 1. Furthermore, suppose that for a square with mean $\mu$, the distribution is uniform across an interval centered at $\mu$. Clearly such bounded distributions are subgaussian. The rows correspond to the arms Player 1 can select from, while the columns correspond to the arms Player 2 can select from. Consider the following 'Bad' event:

 $$B := \Big\{\hat\mu^1_{(2,1)}(t,1) > \max_{a\ne(2,1)} \hat\mu^1_a(t,1)\Big\} \cap \Big\{\hat\mu^2_{(1,2)}(t,1) > \max_{a\ne(1,2)} \hat\mu^2_a(t,1)\Big\} \qquad (15)$$

$B$ is the event that, after taking one initial sample from each arm-tuple, Player 1 determines arm-tuple $(2,1)$ to be optimal, while Player 2 determines arm-tuple $(1,2)$ to be optimal. In this event, Player 1 will pull arm 2 while Player 2 will pull arm 2, thus obtaining a sample from arm-tuple $(2,2)$. In this situation the following cases are possible:

after this pull, each player updates only its sample mean $\hat\mu^i_{(2,2)}$, and the four cases correspond to whether each player's updated estimate does or does not exceed its current empirical best. In all of these cases, Player 1's empirically best arm-tuple is either $(2,1)$ or $(2,2)$, both of which require pulling arm 2; similarly, Player 2's is either $(1,2)$ or $(2,2)$, which also require pulling arm 2. Since $\hat\mu^i_a$ stays the same for all $a$ except $a = (2,2)$, the players continue to pull arm-tuple $(2,2)$ for the rest of the rounds. It is clear from the regret decomposition that

 $$R_T \ge P(B)\,\mathbb{E}[T_{(2,2)}]\,\Delta_{(2,2)} = P(B)\,(T-4)\,(0.6) \qquad (16)$$

Since $P(B) > 0$, it follows that $R_T$ is asymptotically linear, as desired. ∎
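The lock-in mechanism can be illustrated deterministically. The initial sample means below are hypothetical values realizing the bad event $B$ (Figure 1 is not reproduced here), and the players are simplified to greedy play on sample means, ignoring the confidence bonus; the point is only that once $B$ occurs, both players pull arm 2 forever.

```python
# Hypothetical post-initialization sample means realizing event B:
# Player 1's empirical best is (2,1), Player 2's is (1,2).
mu_hat_p1 = {(1, 1): 0.8, (1, 2): 0.7, (2, 1): 1.1, (2, 2): 0.9}
mu_hat_p2 = {(1, 1): 0.8, (1, 2): 1.1, (2, 1): 0.7, (2, 2): 0.9}

def greedy_arm(mu_hat, player):
    """Each player's own arm within its empirically best tuple."""
    best = max(mu_hat, key=mu_hat.get)
    return best[player - 1]

history = []
for _ in range(5):
    a1 = greedy_arm(mu_hat_p1, 1)  # best tuple (2,1) -> Player 1 plays arm 2
    a2 = greedy_arm(mu_hat_p2, 2)  # best tuple (1,2) -> Player 2 plays arm 2
    history.append((a1, a2))
    # Only the estimate of the tuple actually pulled, (2,2), changes; as
    # long as it stays below 1.1, neither player's argmax changes.
    mu_hat_p1[(2, 2)] = 0.4
    mu_hat_p2[(2, 2)] = 0.4
```

Every round the joint pull is the suboptimal tuple `(2, 2)`, mirroring the linear-regret conclusion of Theorem 4.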

### 4.3 Problem B: Unobserved Actions, Independent Rewards

###### Proof.

(Theorem 5) Decompose $R_T = R_{T,E} + R_{T,C}$, where $R_{T,E}$ is the regret incurred from the exploration sequence spaced at powers of $2$, while $R_{T,C}$ is the regret coming from committing to the arm-tuple with the highest sample mean. Since in the $l$th exploration phase each arm-tuple is pulled $K(l)$ times, it follows from the regret decomposition (6) that

 $$R_{T,E} \le \sum_a K(\lfloor \log_2 T \rfloor)\,\lceil \log_2 T \rceil\,\Delta_a. \qquad (17)$$

The following claim gives a lower bound on the number of times each arm-tuple has been explored up to time $t$.

###### Claim 2.

For all $t$, $n_a(t) \ge K_0(t)\log_2(t)$ for some function $K_0(\cdot)$ going to infinity.

###### Proof.

Note here that $K_0$ does not have to map positive integers to positive integers. The statement is proved if we can show that $\frac{1}{n}\sum_{\lambda=1}^{n} K(\lambda) \to \infty$. Suppose instead that $\frac{1}{n}\sum_{\lambda=1}^{n} K(\lambda) \to L < \infty$. Then for any $\epsilon > 0$, there exists an integer $l$ such that $\frac{1}{l}\sum_{\lambda=1}^{l} K(\lambda) \ge L - \epsilon$. We can increase $l$ so that it also satisfies $\sum_{\lambda=l+1}^{2l} K(\lambda) \ge l\,(K(l)+1)$. Now consider the quantity $\frac{1}{2l}\sum_{\lambda=1}^{2l} K(\lambda)$ as follows:

 $$\frac{\sum_{\lambda=1}^{2l} K(\lambda)}{2l} = \frac{\sum_{\lambda=1}^{l} K(\lambda) + \sum_{\lambda=l+1}^{2l} K(\lambda)}{2l} \ge \frac{l(L-\epsilon) + l(K(l)+1)}{2l} = \frac{L + K(l) + 1 - \epsilon}{2} \qquad (18)$$

Since $K$ is nondecreasing, $K(l) \ge \frac{1}{l}\sum_{\lambda=1}^{l} K(\lambda) \ge L - \epsilon$, and it follows that

 $$\frac{\sum_{\lambda=1}^{2l} K(\lambda)}{2l} \ge L + \frac{1-2\epsilon}{2}. \qquad (19)$$

When $\epsilon < 1/2$, we clearly have $\frac{1}{2l}\sum_{\lambda=1}^{2l} K(\lambda) > L$. Since $(K(\lambda))$ is a nondecreasing sequence, this contradicts the assumed convergence of the averages to $L$, and the contradiction gives the desired result. ∎

Using equation (6), and the assumption that arm-tuple $(1,\dots,1)$ is optimal,

 $$R_{T,C} \le \sum_a \Delta_a\, \mathbb{E}\Big(\sum_{t\in\mathcal{C}} I\big[\hat\mu_{(1,\dots,1)}(n_{(1,\dots,1)}(t)) \not> \max_{a\ne(1,\dots,1)} \hat\mu_a(n_a(t))\big]\Big) \qquad (20)$$

where $\mathcal{C}$ denotes the set of commitment rounds. Let $\epsilon = \frac{1}{2}\min_{a:\Delta_a>0}\Delta_a$. In this problem we consider the following good event:

 $$G_a(i) = \big\{|\hat\mu^i_a(t, n_a(t)) - \mu_a| < \epsilon\big\} \qquad (21)$$

$G_a(i)$ is the event that the observed mean of player $i$ after $n_a(t)$ samples of arm-tuple $a$ is within $\epsilon$ of the true mean of arm-tuple $a$. For any fixed $t$ in the exploitation sequence, it follows that when $\bigcap_i \bigcap_a G_a(i)$ occurs, the optimal arm-tuple is pulled and thus the regret is $0$. Thus,

 $$\begin{aligned} R_{T,C} &\le \sum_{t\in\mathcal{C}}\sum_a P\Big\{\overline{\textstyle\bigcap_i G_a(i)}\Big\}\,\Delta_a && (22)\\ &\le \sum_{t=1}^{T}\sum_a\sum_{i=1}^{M} P\big\{\overline{G_a(i)}\big\}\,\Delta_a && (23)\\ &= \sum_{t=1}^{T}\sum_a M\,P\big\{|\hat\mu_a(t,n_a(t)) - \mu_a| > \epsilon\big\}\,\Delta_a && (24)\\ &\le \sum_{t=1}^{T}\sum_a 2M e^{-\frac{n_a(T)\epsilon^2}{2}}\,\Delta_a \quad \text{by Lemma 3} && (25)\\ &\le \sum_{t=1}^{T}\sum_a 2M e^{-\frac{K_0(t)\log_2(t)\epsilon^2}{2}}\,\Delta_a \quad \text{using } n_a(t) \ge K_0(t)\log_2(t) && (26)\\ &\le \sum_a M C' \Delta_a \sum_{t=1}^{T} t^{-\frac{K_0(t)\epsilon^2}{2}} && (27)\\ &\le \sum_a M C' \Delta_a \sum_{t=1}^{\infty} t^{-\frac{K_0(t)\epsilon^2}{2}} && (28) \end{aligned}$$

Since $K_0(t)$ is monotonically increasing to infinity, there exists a $t_0$ such that $K_0(t)\epsilon^2/2 \ge 2$ for all $t \ge t_0$. Thus the series $\sum_{t=1}^{\infty} t^{-K_0(t)\epsilon^2/2}$ converges, so $R_{T,C}$ is bounded by a constant, and the total regret is $R_T = R_{T,E} + R_{T,C} = O\big(K(\lfloor\log_2 T\rfloor)\log T\big)$, as desired. ∎

## 5 Numerical Experiments

In this section, we run numerical simulations on a three-player stochastic bandit problem. All the plots are given in Figure 2. Each arm-tuple has a Gaussian reward distribution, with the means and standard deviations chosen so that the rewards fall in the desired interval with very high probability. The horizon is $T$ rounds, and the entire simulation is repeated several times in the same environment. In Figure 2, there are two regret-versus-log-time plots, for Problem A (left) and Problem B (right), which compare Algorithm mUCB with the agnostic standard UCB, and the disCo algorithm of [39] with Algorithm mDSEE. For each algorithm, the mean regret across all runs is plotted, with the shaded areas of the same color being one standard deviation above and below the mean. This gives a confidence interval, and from the plots it is clear that as time progresses, the confidence intervals of the algorithms being compared diverge from each other. This tells us that, with high probability, one algorithm has lower regret than the other.

For Problem A, we show the regret of the mUCB algorithm in black. As this curve is linear in $\log t$, it is clear that the regret is $O(\log T)$ in this environment. We also compare this to the simple agnostic UCB algorithm, in green. More explicitly, this is where each player treats the environment as a single-player stochastic multi-armed bandit environment and picks the arm with the highest UCB index at each round, as in the standard UCB algorithm. As expected, this algorithm gives linear regret.

For Problem B, a regret-versus-time plot for the mDSEE algorithm is given in black. From Theorem 5, we expect near-logarithmic regret, and from Figure 2 it is clear that the curve grows at most quadratically in $\log t$, which supports the theorem. We also plot the regret of the disCo algorithm of [39] in red. In the beginning the regret is linear, but then it becomes logarithmic. This is because the quantity used to determine when to explore is very large even for small values of $t$. Thus, a lot of exploration happens in the beginning, which incurs linear regret. However, since this quantity grows logarithmically, eventually the exploration sequences become more sparse, giving log regret. While asymptotically disCo is better, for smaller values of $T$, mDSEE gives much better regret, as one can see in this plot.

## 6 Conclusions

In this paper, we have introduced a general framework to study decentralized online learning with team objectives in a multiplayer MAB model. If all players had the same information, this would be just like a standard single-player (centralized) MAB problem. Information asymmetry is what makes the problem interesting and challenging. We introduced three types of information asymmetry: in actions, in rewards, and in both actions and rewards. We then showed that a multiplayer version of the UCB algorithm is able to achieve order-optimal regret when there is information asymmetry in actions. We also showed that we can achieve near-log regret even when there is information asymmetry in both actions and rewards. Finally, we showed that when there is information asymmetry in rewards, the algorithm designed for information asymmetry in actions gives linear regret. For future work, considering decentralized online learning in a multiplayer MDP setting would be an interesting new direction.

## References

• [1] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami (2011) Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications 29 (4), pp. 731–745. Cited by: §1, §1.
• [2] V. Anantharam, P. Varaiya, and J. Walrand (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part ii: markovian rewards. IEEE Transactions on Automatic Control 32 (11), pp. 977–982. External Links: Document Cited by: §1.
• [3] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2), pp. 235–256. Cited by: §1, §1.
• [4] O. Avner and S. Mannor (2014) Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 66–81. Cited by: §1, §1.
• [5] L. Besson and E. Kaufmann (2018) Multi-player bandits revisited. In Algorithmic Learning Theory, pp. 56–92. Cited by: §1.
• [6] I. Bistritz, T. Z. Baharav, A. Leshem, and N. Bambos (2021) One for all and all for one: distributed learning of fair allocations with multi-player bandits. IEEE Journal on Selected Areas in Information Theory 2 (2), pp. 584–598. Cited by: §1.
• [7] I. Bistritz and A. Leshem (2018) Distributed multi-player bandits-a game of thrones approach. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §1, §1.
• [8] N. Cesa-Bianchi and G. Lugosi (2006) Prediction, learning, and games. Cambridge university press. Cited by: §1.
• [9] M. Chakraborty, K. Y. P. Chua, S. Das, and B. Juba (2017) Coordinated versus decentralized exploration in multi-agent multi-armed bandits.. In IJCAI, pp. 164–170. Cited by: §1.
• [10] R. Féraud, R. Alami, and R. Laroche (2019) Decentralized exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1901–1909. Cited by: §1.
• [11] D. Fudenberg, F. Drew, D. K. Levine, and D. K. Levine (1998) The theory of learning in games. Vol. 2, MIT press. Cited by: §1.
• [12] Y. Gai, B. Krishnamachari, and R. Jain (2012) Combinatorial network optimization with unknown variables: multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking 20 (5), pp. 1466–1478. Cited by: §1, §1.
• [13] J. Gittins, K. Glazebrook, and R. Weber (2011) Multi-armed bandit allocation indices. John Wiley & Sons. Cited by: §1, §1.
• [14] S. Hart and A. Mas-Colell (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68 (5), pp. 1127–1150. Cited by: §1.
• [15] S. Hart and A. Mas-Colell (2001) A general class of adaptive strategies. Journal of Economic Theory 98 (1), pp. 26–54. Cited by: §1.
• [16] S. Hart and A. Mas-Colell (2001) A reinforcement procedure leading to correlated equilibrium. In Economics essays, pp. 181–200. Cited by: §1.
• [17] S. Hart and A. Mas-Colell (2003) Uncoupled dynamics do not lead to nash equilibrium. American Economic Review 93 (5), pp. 1830–1836. Cited by: §1.
• [18] S. Hart and A. Mas-Colell (2013) Simple adaptive strategies: from regret-matching to uncoupled dynamics. Vol. 4, World Scientific. Cited by: §1.
• [19] D. Kalathil, N. Nayyar, and R. Jain (2014) Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory 60 (4), pp. 2331–2345. Cited by: §1, §1, §1.
• [20] N. Karpov, Q. Zhang, and Y. Zhou (2020) Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 160–171. Cited by: §1.
• [21] N. Korda, B. Szorenyi, and S. Li (2016) Distributed clustering of linear bandits in peer to peer networks. In International conference on machine learning, pp. 1301–1309. Cited by: §1.
• [22] T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1, §1, §2.1.
• [23] T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: Appendix A, §1, §1, Lemma 3.
• [24] P. Lezaud (1998) Chernoff-type bound for finite markov chains. Annals of Applied Probability, pp. 849–867. Cited by: Theorem 6, Theorem 6.
• [25] K. Liu and Q. Zhao (2010) Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58 (11), pp. 5667–5681. Cited by: §1, §1.
• [26] S. Maghsudi and S. Stańczak (2014) Channel selection for network-assisted d2d communication via no-regret bandit learning with calibrated forecasting. IEEE Transactions on Wireless Communications 14 (3), pp. 1309–1322. Cited by: §1.
• [27] J. Marschak and R. Radner (1972) Economic theory of teams. Cited by: §1.
• [28] D. Martínez-Rubio, V. Kanade, and P. Rebeschini (2018) Decentralized cooperative stochastic bandits. arXiv preprint arXiv:1810.04468. Cited by: §1.
• [29] D. Martínez-Rubio, V. Kanade, and P. Rebeschini (2019) Decentralized cooperative stochastic bandits. Cited by: §1.
• [30] N. Nayyar, D. Kalathil, and R. Jain (2018) On regret-optimal learning in decentralized multiplayer multiarmed bandits. IEEE Transactions on Control of Network Systems 5 (1), pp. 597–606. Cited by: §1, §1, §1.
• [31] J. Rosenski, O. Shamir, and L. Szlak (2016) Multi-player bandits–a musical chairs approach. In International Conference on Machine Learning, pp. 155–163. Cited by: §1.
• [32] S. Shahrampour, A. Rakhlin, and A. Jadbabaie (2017) Multi-armed bandits in multi-agent networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2786–2790. Cited by: §1.
• [33] C. Shi, W. Xiong, C. Shen, and J. Yang (2020) Decentralized multi-player multi-armed bandits with no collision information. In International Conference on Artificial Intelligence and Statistics, pp. 1519–1528. Cited by: §1.
• [34] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
• [35] B. Szorenyi, R. Busa-Fekete, I. Hegedus, R. Ormándi, M. Jelasity, and B. Kégl (2013) Gossip-based distributed stochastic bandit algorithms. In International Conference on Machine Learning, pp. 19–27. Cited by: §1.
• [36] C. Tao, Q. Zhang, and Y. Zhou (2019) Collaborative learning with limited interaction: tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 126–146. Cited by: §1.
• [37] S. Vakili, K. Liu, and Q. Zhao (2013) Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing 7 (5), pp. 759–767. Cited by: §3.3.
• [38] P. Wang, A. Proutiere, K. Ariu, Y. Jedra, and A. Russo (2020) Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pp. 4120–4129. Cited by: §1.
• [39] J. Xu, C. Tekin, S. Zhang, and M. Van Der Schaar (2015) Distributed multi-agent online learning based on global feedback. IEEE Transactions on Signal Processing 63 (9), pp. 2225–2238. Cited by: §1, §2.1, Figure 2, §5, §5.
• [40] H. P. Young (2004) Strategic learning and its limits. OUP Oxford. Cited by: §1.

## Appendix A Appendix

###### Lemma 3.

[23] Assume that $X_t - \mu$, $t = 1, \dots, T$, are independent, $\sigma$-subgaussian random variables. Then, for any $\epsilon \geq 0$,

$$P(\hat{\mu} \geq \mu + \epsilon) \leq e^{-\frac{T\epsilon^2}{2\sigma^2}} \quad\text{and}\quad P(\hat{\mu} \leq \mu - \epsilon) \leq e^{-\frac{T\epsilon^2}{2\sigma^2}}, \tag{29}$$

where $\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} X_t$.

###### Proof.

The reader is encouraged to look at Corollary 5.5 of [23]. It relies on the observation that $\hat{\mu} - \mu$ is $\sigma/\sqrt{T}$-subgaussian. ∎
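The tail bound (29) can be checked numerically. The sketch below (function name and parameter values are ours, purely for illustration) draws the empirical mean of $T$ i.i.d. Gaussian samples, which are $\sigma$-subgaussian, and compares the observed upper-tail frequency with $e^{-T\epsilon^2/(2\sigma^2)}$:

```python
import math
import random

def empirical_tail(mu=0.0, sigma=1.0, T=50, eps=0.5, trials=20000, seed=0):
    """Estimate P(mu_hat >= mu + eps) for the mean of T iid N(mu, sigma^2)
    samples and return it alongside the subgaussian bound of (29)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(T)) / T
        if mu_hat >= mu + eps:
            hits += 1
    bound = math.exp(-T * eps ** 2 / (2 * sigma ** 2))
    return hits / trials, bound

emp, bound = empirical_tail()
print(emp, bound)  # the empirical frequency stays below the bound
```

For Gaussian rewards the bound is loose (the true tail is smaller still), which is consistent with (29) holding for the whole subgaussian class.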

###### Proof.

(Lemma 1) Letting $I\{\cdot\}$ be the indicator function and $a(t)$ the action taken in round $t$, we have, by the definition of regret given in equation (1),

$$R_T^i = \sum_{a \in \times_{i=1}^{M}[K_i]} \sum_{t=1}^{T} \mathbb{E}\big[(\mu^* - X_a^i(t))\, I\{a(t) = a\}\big]. \tag{30}$$

The expected reward in round $t$ conditioned on $a(t)$ is $\mu_{a(t)}$, and thus:

$$\begin{aligned} \mathbb{E}\big[(\mu^* - X_a^i(t))\, I\{a(t) = a\} \,\big|\, a(t)\big] &= I\{a(t) = a\}\, \mathbb{E}\big[\mu^* - X_a^i(t) \,\big|\, a(t)\big] &\quad (31)\\ &= I\{a(t) = a\}\,(\mu^* - \mu_{a(t)}) &\quad (32)\\ &= I\{a(t) = a\}\,(\mu^* - \mu_a) &\quad (33)\\ &= I\{a(t) = a\}\,\Delta_a. &\quad (34)\end{aligned}$$

We can now plug this into equation (30), and the result follows. ∎
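As a sanity check on the decomposition (30)–(34), the per-round form of the regret (sum of the gaps of the actions actually played) and the gap-times-count form $\sum_a \Delta_a\, n_a(T)$ agree on any realized action sequence. The sketch below uses hypothetical gaps and an arbitrary sequence:

```python
from collections import Counter

# Hypothetical gaps Delta_a for three joint actions; "a1" is optimal (gap 0).
gaps = {"a1": 0.0, "a2": 0.3, "a3": 0.7}
actions = ["a2", "a1", "a3", "a3", "a2", "a1", "a2"]  # an arbitrary run

# Per-round form: sum over rounds of the gap of the action played.
per_round = sum(gaps[a] for a in actions)

# Decomposed form: sum over actions of Delta_a times the play count n_a(T).
counts = Counter(actions)
decomposed = sum(gaps[a] * counts[a] for a in gaps)

print(per_round, decomposed)  # both equal 2.3
```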

### A.1 Addendum to Section 4.1

###### Proof.

(Lemma 2) We argue by contradiction. Suppose that the good event $G_a$ holds while $n_a(T) > m_a$. Then arm $a$ was played more than $m_a$ times over the $T$ rounds, so there must be a round $t \in [T]$ such that $n_a(t-1) = m_a$ and $a(t) = a$. We now apply the definition of $\eta_a$ to obtain:

$$\begin{aligned} \eta_a(t-1) &= \hat{\mu}_a(t-1, \cdot) + \sqrt{\frac{2\log(1/\delta)}{n_a(t-1)}} &\quad (35)\\ &= \hat{\mu}_a(t-1, m_a) + \sqrt{\frac{2\log(1/\delta)}{m_a}} &\quad (36)\\ &< \mu_{1,\dots,1} &\quad (37)\\ &< \eta_{1,\dots,1}(t-1), &\quad (38)\end{aligned}$$

and thus $a(t) \neq a$, which gives us a contradiction. ∎
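For concreteness, the index $\eta_a$ appearing in (35) can be sketched as follows (the function name and the numeric values are ours, purely illustrative):

```python
import math

def ucb_index(mean_estimate, pulls, delta):
    """UCB index eta_a = empirical mean + sqrt(2*log(1/delta)/n_a), as in (35)."""
    return mean_estimate + math.sqrt(2 * math.log(1 / delta) / pulls)

# An arm with empirical mean 0.4 after 25 pulls, at confidence delta = 0.01:
eta = ucb_index(0.4, 25, 0.01)
print(eta)  # roughly 1.007; the bonus shrinks as pulls grow
```

The contradiction in (35)–(38) is exactly that, on the good event, this index for a suboptimal arm at $m_a$ pulls falls strictly below the optimal arm's index, so the algorithm cannot select it again.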

###### Proof.

(Claim 1) Taking the complement of $G_a$,

$$G_a^c = \Big\{\mu_{1,\dots,1} \geq \min_{t \in [n]} \eta_{1,\dots,1}(t)\Big\} \cup \Big\{\hat{\mu}_a(\cdot, m_a) + \sqrt{\frac{2\log(1/\delta)}{m_a}} \geq \mu_{1,\dots,1}\Big\}, \tag{39}$$

so that

$$P(G_a^c) \leq P\Big\{\mu_{1,\dots,1} \geq \min_{t \in [n]} \eta_{1,\dots,1}(t)\Big\} + P\Big\{\hat{\mu}_a(\cdot, m_a) + \sqrt{\frac{2\log(1/\delta)}{m_a}} \geq \mu_{1,\dots,1}\Big\}. \tag{40}$$

For the first term in the equation above, we have:

$$\begin{aligned} P\Big\{\mu_{1,\dots,1} \geq \min_{t \in [n]} \eta_{1,\dots,1}(t)\Big\} &\leq P\Bigg(\bigcup_{s \in [n]} \Big\{\mu_{1,\dots,1} \geq \hat{\mu}_{1,\dots,1}(\cdot, s) + \sqrt{\frac{2\log(1/\delta)}{s}}\Big\}\Bigg) &\quad (41)\\ &\leq \sum_{s=1}^{T} P\Big\{\mu_{1,\dots,1} \geq \hat{\mu}_{1,\dots,1}(\cdot, s) + \sqrt{\frac{2\log(1/\delta)}{s}}\Big\}, &\quad (42)\end{aligned}$$

where the summation index $s$ is the number of times arm $(1,\dots,1)$ has been pulled. Since the reward of arm $(1,\dots,1)$ has a $1$-subgaussian distribution, using Lemma 3 with $\epsilon = \sqrt{2\log(1/\delta)/s}$, it follows that

$$P\Big\{\hat{\mu}_{1,\dots,1}(\cdot, s) \leq \mu_{1,\dots,1} - \sqrt{\frac{2\log(1/\delta)}{s}}\Big\} \leq \delta \tag{43}$$

and thus

$$P\Big\{\mu_{1,\dots,1} \geq \min_{t \in [n]} \eta_{1,\dots,1}(t)\Big\} \leq \sum_{s=1}^{T} \delta = T\delta. \tag{44}$$

Now, let us bound the second term on the RHS of equation (40) by first choosing $m_a$ so that

$$\Delta_a - \sqrt{\frac{2\log(1/\delta)}{m_a}} \geq c\,\Delta_a. \tag{45}$$

Then, the term we wish to bound becomes:

$$\begin{aligned} P\Big\{\hat{\mu}_a(\cdot, m_a) + \sqrt{\frac{2\log(1/\delta)}{m_a}} \geq \mu_{1,\dots,1}\Big\} &= P\Big\{\hat{\mu}_a(\cdot, m_a) - \mu_a \geq \Delta_a - \sqrt{\frac{2\log(1/\delta)}{m_a}}\Big\} &\quad (46)\\ &\leq e^{-\frac{m_a c^2 \Delta_a^2}{2}}, &\quad (47)\end{aligned}$$

where the last inequality comes from using Lemma 3 again. Combining this with inequalities (40) and (44) proves Claim 1. ∎
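Condition (45) pins down how many pulls $m_a$ suffice: solving for $m_a$ gives $m_a \geq 2\log(1/\delta)/\big((1-c)^2 \Delta_a^2\big)$. A small sketch (function name and numeric values are ours) computes this threshold and verifies (45) at it:

```python
import math

def min_pulls(delta_gap, c, delta):
    """Smallest integer m_a satisfying (45):
    Delta_a - sqrt(2*log(1/delta)/m_a) >= c*Delta_a,
    i.e. m_a >= 2*log(1/delta) / ((1-c)^2 * Delta_a^2)."""
    return math.ceil(2 * math.log(1 / delta) / ((1 - c) ** 2 * delta_gap ** 2))

m_a = min_pulls(delta_gap=0.2, c=0.5, delta=0.01)
# check that inequality (45) indeed holds at this m_a
lhs = 0.2 - math.sqrt(2 * math.log(100) / m_a)
print(m_a, lhs)  # lhs is at least c * Delta_a = 0.1
```

Note the familiar UCB scaling: the required pulls grow as $\log(1/\delta)/\Delta_a^2$, which is what drives the $\log T$ regret once $\delta$ is tied to the horizon.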

#### Reward-Independent Regret Bounds.

We now present the proof of Theorem 2.

###### Proof.

(Theorem 2) We first take the regret decomposition given by equation (6) and partition the action tuples into those whose means are at most $\epsilon$ away from the optimal and those whose means are more than $\epsilon$ away. This gives us the following inequality:

$$R_T = \sum_{a} \Delta_a \mathbb{E}[n_a(T)] = \sum_{\Delta_a > \epsilon} \Delta_a \mathbb{E}[n_a(T)] + \sum_{\Delta_a \leq \epsilon} \Delta_a \mathbb{E}[n_a(T)] \leq \sum_{\Delta_a > \epsilon} \Delta_a \mathbb{E}[n_a(T)] + \epsilon T. \tag{48}$$

Following the same reasoning as in Theorem 1 yields the following inequality:

$$R_T \leq 3\sum_{\Delta_a > \epsilon} \Delta_a + \sum_{\Delta_a > \epsilon} \frac{2(3+2\sqrt{2})\log T}{\epsilon} + \epsilon T \leq 3ab + \sum_{\Delta_a > \epsilon} \frac{2(3+2\sqrt{2})\log T}{\epsilon} + \epsilon T. \tag{49}$$

Since inequality (49) holds for all $\epsilon > 0$, we can pick $\epsilon$ to balance the last two terms and obtain the result in Theorem 2. ∎
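The balancing step at the end can be made concrete: minimizing $C\log T/\epsilon + \epsilon T$ over $\epsilon$ gives $\epsilon^* = \sqrt{C\log T / T}$ and value $2\sqrt{C\, T \log T}$. The sketch below (the constant $C$ and horizon are illustrative, not the paper's constants) checks this numerically:

```python
import math

def bound(eps, C, T):
    # The eps-dependent part of (49): C*log(T)/eps + eps*T
    return C * math.log(T) / eps + eps * T

C, T = 10.0, 10 ** 6
eps_star = math.sqrt(C * math.log(T) / T)  # equalizes the two terms

# The bound is larger at both half and double the balancing choice.
vals = [bound(eps_star * f, C, T) for f in (0.5, 1.0, 2.0)]
print(vals)
```

This is why the gap-independent bound ends up of order $\sqrt{T \log T}$ rather than $\log T$: the $\epsilon T$ term from near-optimal arms cannot be avoided.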

#### Markovian Rewards in Problem A.

We now consider the rewards to be Markovian. The regret is as defined in equation (1), and we use Algorithm 1. The optimal arm is now defined relative to the stationary distribution of its Markov reward process, rather than as the arm with the highest single-round mean. With this definition of regret, the regret decomposition stated in equation (6) still holds and will be used to derive an upper bound on the regret.

Theorem 6 is needed below and is restated here for completeness.

###### Theorem 6.

[24] Let $\{Z_t\}$ be an irreducible, aperiodic Markov chain on a finite state space $S$ with transition probability matrix $P$, an initial distribution $q$, and a stationary distribution $\pi$. Denote $N_q = \big\|\big(\tfrac{q_x}{\pi_x}\big)_{x \in S}\big\|_2$. Let

$$\epsilon(P) = 1 - \lambda_2$$

be the eigenvalue gap, where $\lambda_2$ is the second largest eigenvalue of the matrix $\hat{P} = P'P$, the multiplicative symmetrization of the transition matrix $P$, with $P'$ the adjoint of $P$ on $\ell_2(\pi)$ (the time reversal of $P$). Let $f : S \to \mathbb{R}$ be such that $\sum_{x \in S} \pi_x f(x) = 0$, $\|f\|_\infty \leq 1$ and $0 < \|f\|_2^2 \leq 1$. If $\hat{P}$ is irreducible, then for any positive integer $n$ and any $0 < \gamma \leq 1$,

$$P\Big(\frac{1}{n}\sum_{t=1}^{n} f(Z_t) \geq \gamma\Big) \leq N_q\, e^{-\frac{n\gamma^2 \epsilon(P)}{28}}.$$

###### Proof.

The reader is encouraged to look at the proof of Theorem 1.1 in [24]. ∎
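The quantities in Theorem 6 are easy to compute for a small chain. The sketch below uses an illustrative two-state chain of our choosing: it builds the adjoint $P'$ on $\ell_2(\pi)$, the multiplicative symmetrization $\hat{P} = P'P$, and the eigenvalue gap $\epsilon(P) = 1 - \lambda_2$ (for a $2\times 2$ stochastic matrix the eigenvalues are $1$ and $\mathrm{trace} - 1$):

```python
# Illustrative 2-state transition matrix P.
P = [[0.9, 0.1],
     [0.2, 0.8]]

# Stationary distribution of P (closed form for 2 states: pi P = pi).
pi = [P[1][0] / (P[0][1] + P[1][0]), P[0][1] / (P[0][1] + P[1][0])]

# Adjoint (time reversal): P'[x][y] = pi[y] * P[y][x] / pi[x].
Pp = [[pi[y] * P[y][x] / pi[x] for y in range(2)] for x in range(2)]

# Multiplicative symmetrization P_hat = P' P.
Ph = [[sum(Pp[x][z] * P[z][y] for z in range(2)) for y in range(2)]
      for x in range(2)]

# P_hat is stochastic, so its top eigenvalue is 1; for 2x2 the second
# eigenvalue is trace - 1, which yields the eigenvalue gap epsilon(P).
lam2 = Ph[0][0] + Ph[1][1] - 1.0
gap = 1.0 - lam2
print(pi, lam2, gap)
```

For this (reversible) two-state chain $P' = P$, so $\hat{P} = P^2$ and the gap is simply $1 - \lambda_2(P)^2$; a larger gap gives a faster-decaying tail bound in Theorem 6.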

###### Proof.

(Theorem 3) Suppose, without loss of generality, that the arm $(1,\dots,1)$ is optimal (has the highest expected reward). We can consider the "good" event $G_a$ as defined by equation (7) and decompose its complement as in (40). We bound the second term using Theorem 6 to obtain

$$P\Big\{\hat{\mu}_a(\cdot, m_a) + \sqrt{\frac{2\log(1/\delta)}{m_a}} \geq \mu_{1,\dots,1}\Big\} \leq C\, \delta^{\frac{K_2 T}{m_a}}\, e^{K_2 T \sqrt{\frac{\log(1/\delta)}{m_a}}} \tag{50}$$

for some constants $C, K_2 > 0$. For the other term, we can use equation (40) and bound:

$$\sum_{s=1}^{T} P\Big\{\mu_{1,\dots,1} \geq \hat{\mu}_{1,\dots,1}(\cdot, s) + \sqrt{\frac{2\log(1/\delta)}{s}}\Big\} = \sum_{s=1}^{T} P\Big\{\hat{\mu}_{1,\dots,1}(\cdot, s) \leq \mu_{1,\dots,1} - \sqrt{\frac{2\log(1/\delta)}{s}}\Big\}. \tag{51}$$

Using