# Multi-Agent Low-Dimensional Linear Bandits

We study a multi-agent stochastic linear bandit with side information, parameterized by an unknown vector θ^* ∈ℝ^d. The side information consists of a finite collection of low-dimensional subspaces, one of which contains θ^*. In our setting, agents can collaborate to reduce regret by sending recommendations across a communication graph connecting them. We present a novel decentralized algorithm, where agents communicate subspace indices with each other, and each agent plays a projected variant of LinUCB on the corresponding (low-dimensional) subspace. Through a combination of collaborative best subspace identification, and per-agent learning of an unknown vector in the corresponding low-dimensional subspace, we show that the per-agent regret is much smaller than the case when agents do not communicate. By collaborating to identify the subspace containing θ^*, we show that each agent effectively solves an easier instance of the linear bandit (compared to the case of no collaboration), thus leading to the reduced per-agent regret. We finally complement these results through simulations.


## 1 Introduction

The Multi-Armed Bandit (MAB) model features a single decision maker making sequential decisions under uncertainty. It has found a wide range of applications: advertising [10], information retrieval [25], and operation of data centers [15], to name a few; see also the books [19, 8]. As the scale of applications increases, several decision makers (a.k.a. agents), rather than a single agent, are involved in making repeated decisions. For example, in internet advertising, multiple servers are typically deployed to handle the large volume of traffic [9]. Multi-agent MAB models have recently emerged as a framework for designing algorithms that account for this scale. However, all the multi-agent models studied thus far consider only unstructured bandits with finitely many arms.

We introduce a multi-agent version of stochastic linear bandits. The linear bandit framework allows for a continuum of arms with a shared reward structure, thereby modeling many complex online learning scenarios [14, 1]. From a practical perspective, the linear bandit framework has been shown to be more appropriate than unstructured bandits in many instances (e.g. recommendations [20], clinical studies [5]). Despite its applicability, a multi-agent version of linear bandits has not been studied thus far. The key technical challenge arises from 'information leakage': the reward obtained by playing an arm gives information on the reward obtained by all other arms. In a multi-agent scenario, this is further exacerbated, making the design of collaborative algorithms non-trivial.

We take a step in this direction by considering a collaborative multi-agent low-dimensional linear bandit problem and proposing a novel decentralized algorithm. Agents in our model have side information in the form of subspaces. In our algorithm, agents collaborate by sharing these subspaces rather than the linear rewards. Our main result shows that, even with minimal communication, the regret of every agent is much lower than in the case of no collaboration.

Model Overview: Our setup consists of a single instance of a stochastic linear bandit with unknown parameter $\theta^* \in \mathbb{R}^d$, concurrently played by $N$ agents. At each time $t$, each agent $i \in [N]$ (for any positive integer $x$, $[x]$ denotes the set $\{1, \ldots, x\}$) plays an action vector $A_t^{(i)}$ from the action set and receives a reward $\langle \theta^*, A_t^{(i)} \rangle + \eta_t^{(i)}$, where $\eta_t^{(i)}$ is zero-mean sub-gaussian noise. The rewards obtained by an agent depend only on its own actions and are independent of the actions of other agents. Agents share common side information consisting of $K$ disjoint $m$-dimensional subspaces, one of which contains $\theta^*$. The subspace containing $\theta^*$ is unknown to the agents. The agents in our model are connected through a communication graph over which they can exchange messages to collaborate. Agents have a communication budget limiting the number of times they can exchange messages. We seek decentralized algorithms for agents, i.e. the choice of action vector, communication choices and messages depend only on the observed past history (of action vectors, rewards and messages) of that agent.

Motivating Example: We first give an example of linear bandits with side information in the context of internet search advertisements and subsequently motivate the multi-agent version. In a search engine advertisement, each time a keyword is queried, an ad-server needs to make several decisions on advertising choices. The query is composed of the keyword and additional context (e.g. location of the search origin, user profile of the person). Thus, the decisions taken by the ad-server for a query can be modeled as playing an action in a high dimensional vector space.

Subspaces here arise from historical context: different advertising choices for the categories of keywords can be clustered into distinct low-dimensional subspaces. Each subspace corresponds to decisions for a category of keywords (e.g. shoes, watches, etc.). Thus, the advertising system can be decomposed into running several low-dimensional linear bandit problems, one for each category of keywords. Now, when a new keyword starts being seen (which has not been classified a priori), the system needs to both (i) identify the correct category (subspace) it belongs to and (ii) make advertising choices according to the rest of the query's context within the identified category.

Further, a collection of ad-servers are typically deployed to handle the large numbers of queries. Each query is routed to one of the servers which needs to make a real-time decision. Each server can be modeled as an agent and can collaborate by exchanging messages with other servers. In such a distributed system, decentralized learning algorithms with minimal communications are desirable. Our model abstracts this setup and our algorithm demonstrates that agents can indeed benefit from collaborations with minimal communications.

Our main contributions are as follows.

1. The SubGoss Algorithm: We propose SubGoss (Algorithm 1), where agents perform pure exploration for $O(\log T)$ time steps over a time horizon of $T$ and play a projected version of LinUCB on the most likely subspace containing $\theta^*$. The pure exploration is used to estimate the subspace containing $\theta^*$. Agents use pairwise communications to recommend subspaces (not samples), i.e. agents communicate the ID of their estimated best subspace, and the number of communications is limited by the communication budget provided as input. Our algorithm constrains each agent to search for $\theta^*$ over only a small set of subspaces. This set of subspaces is updated through recommendations; agents accept new recommendations and drop subspace(s) unlikely to contain $\theta^*$, ensuring that the total number of subspaces an agent considers remains small at all times. Nevertheless, the best subspace spreads to all the agents through communications, and thus all agents eventually identify the correct subspace. Thus, due to collaboration, each agent individually solves an easier instance of the problem, leading to low regret.

2. Regret Guarantee and Benefit of Collaboration: Despite playing from a time-varying set of subspaces, every agent $i$ incurs a regret of at most $\tilde{O}(m\sqrt{T}) + O\left(\frac{d^2}{(\Delta^{(i)})^2}\log T\right)$ (Theorem 1). This scaling does not depend on either the gossip matrix or the communication budget (we require the gossip matrix to be connected and the communication budget to be at least logarithmic; see Appendix B), and we show that these communication constraints only affect the constant term in the regret. The $\tilde{O}(m\sqrt{T})$ term arises from projected LinUCB on the true subspace containing $\theta^*$, which we show every agent eventually identifies with probability 1. The logarithmic term is the regret of the pure exploration needed to identify the best subspace. The term $\Delta^{(i)}$ in this logarithmic term is the minimum pairwise distance between the subspaces considered by agent $i$ (the definition of distance between subspaces is given in the sequel). We show that the set of subspaces an agent plays from, although time-varying, "freezes" after a random time. The minimum subspace gap for this frozen set is $\Delta^{(i)}$, which dictates the rate of pure exploration. We observe that $\Delta^{(i)}$ is monotone in $N$: larger $N$ leads to larger $\Delta^{(i)}$.

Thus, as the number of agents increases, the individual regret of any agent is reduced, thereby demonstrating the benefit of collaboration. This is more pronounced in high-dimensional settings with a small, constant subspace dimension, when the regret from Projected LinUCB is comparable to that of pure exploration. In such settings, collaboration helps reduce the leading-order term in the regret. These insights on the benefit of collaboration are further confirmed numerically.

#### Related Work:

Multi-agent MABs have become popular in recent times as the scale of applications increases. However, all of the work in this space concerns simple unstructured bandits. The literature on multi-agent MAB can be classified into two categories: competitive, where different agents compete for limited resources, and collaborative, where agents jointly accomplish a shared objective (as in this paper). The canonical model of competitive multi-agent bandits is one wherein if multiple agents play the same arm, they are all blocked and receive no reward (the colliding bandit model), e.g. [22, 3, 6, 7, 16]. Such models are motivated by applications in wireless networks [4]. The collaborative models consist of settings where if multiple agents play the same arm simultaneously, they all receive independent rewards [23, 11, 9, 17, 21]. Such models have primarily been motivated by applications such as internet advertising [12]. However, these algorithms are all adapted to unstructured bandits and cannot be applied to a linear bandit setup such as ours. Nevertheless, our algorithm design adopts some of the broader principles from [23, 12] on using the gossiping paradigm for communications to spread the best subspace.

The stochastic linear bandit framework and the study of the LinUCB algorithm were initiated by [14, 1]. From a practical perspective, the linear bandit framework has been shown to be effective for various applications: for example, [2, 20] apply this framework in the context of internet advertising and [5, 24] apply it in the context of clinical trials. Further, a projected version of LinUCB on low-dimensional subspaces has recently been studied in [18]. However, a multi-agent version of the linear bandit framework has not been studied. We take a step in this direction by introducing a collaborative multi-agent linear bandit and proposing an algorithm that demonstrates the benefit of collaboration.

## 2 Problem Setup

Our problem setup builds on similar settings considered for unstructured bandits in [23, 12], where a single instance of a stochastic linear bandit is concurrently played by $N$ agents. All agents play from the same set of action vectors, which is the unit ball with respect to the Euclidean ($\ell_2$) norm. If at time $t$ any agent $i \in [N]$ plays an action vector $A_t^{(i)}$, the reward obtained is $\langle \theta^*, A_t^{(i)} \rangle + \eta_t^{(i)}$, where $\theta^* \in \mathbb{R}^d$ is unknown and $\eta_t^{(i)}$ is zero-mean sub-gaussian noise, conditional on the actions and rewards accumulated only by agent $i$ (see Appendix A, (4) for details). The noise is independent across agents. The side information available to all the agents is a collection of $K$ disjoint subspaces in $\mathbb{R}^d$ of dimension $m$. These subspaces are denoted by orthonormal matrices, each of which defines an $m$-dimensional subspace in $\mathbb{R}^d$. One of these subspaces contains $\theta^*$, but agents are unaware of which one. Let $P_k$ denote the projection matrix of the $k$-th subspace for all $k \in [K]$. Agents know the parameter $\beta$, such that $\|\theta^*\|_2 \le \beta$. Without loss of generality, we assume that $K$ is an integral multiple of $N$.

Collaboration among Agents: The agents collaborate by exchanging messages over a communication network. This network is represented through a gossip matrix, whose rows are probability distributions over $[N]$. At each time step, after playing an action vector and obtaining a reward, agents can additionally choose to communicate with each other. If an agent $i$ chooses to communicate, it does so with another agent $j$ drawn from the $i$-th row of the gossip matrix, chosen independently of everything else. However, agents have a communication budget that caps, for any time horizon, the total number of times an agent can communicate. Each time an agent chooses to communicate, it can exchange only a fixed number of bits.

Decentralized Algorithm: Each agent's decisions (arm play and communication decisions) in the algorithm depend only on its own history of plays and observations, along with the recommendation messages that it has received from others.

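Performance Metric: Each agent plays action vectors in order to minimize its individual cumulative regret. At any time $t$, the instantaneous regret of an agent $i$ is given by $\langle \theta^*, a^* \rangle - \langle \theta^*, A_t^{(i)} \rangle$, where $a^*$ is a reward-maximizing action in the action set. The expected cumulative regret of any agent $i$ is given by $\mathbb{E}[R_T^{(i)}]$, the expected sum of its instantaneous regrets up to time $T$.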
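As a small worked illustration (a sketch under the assumption, stated in Section 2, that the action set is the unit Euclidean ball; the symbols $a^*$ and $w_t^{(i)}$ are names introduced here for illustration), the optimal action and the instantaneous regret have a closed form by the Cauchy-Schwarz inequality:

$$a^* = \arg\max_{\|a\|_2 \le 1} \langle \theta^*, a \rangle = \frac{\theta^*}{\|\theta^*\|_2}, \qquad w_t^{(i)} = \langle \theta^*, a^* \rangle - \langle \theta^*, A_t^{(i)} \rangle = \|\theta^*\|_2 - \langle \theta^*, A_t^{(i)} \rangle \ge 0.$$

In particular, playing a unit vector orthogonal to $\theta^*$ incurs instantaneous regret $\|\theta^*\|_2$, while playing $a^*$ incurs none.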

## 3 SubGoss Algorithm

Key Ideas and Intuition: Our setting is one where the optimal $\theta^*$ lies in one of a large number of (low-dimensional) subspaces. In our approach, agents at any time instant identify a small active set of subspaces and play actions only within this set of subspaces (however, this set is time-varying). Namely, at each point of time, an agent first identifies, amongst its current active set of subspaces, the one most likely to contain $\theta^*$. It subsequently plays a projected version of LinUCB on this identified subspace. The communication medium is used for recommendations; whenever an agent is asked for information, it sends as its message the ID of the subspace that it thinks most likely contains $\theta^*$, which is then used to update the active set of the receiving agent. Thus, an agent's algorithm has two time-interleaved parts: (a) updating active sets through collaboration, which is akin to a distributed best-arm identification problem, and (b) determining the optimal $\theta^*$ from within its active set of subspaces, an estimation problem in low dimensions, similar to the classical linear bandit.

The SubGoss algorithm is organized in phases, with the active subspaces for each agent fixed during a phase. Within a phase, every agent solves two tasks: (i) identify the subspace amongst its active subspaces most likely to contain $\theta^*$ and (ii) within this subspace, play actions optimally to minimize regret. The first task is accomplished through pure exploration. The rate of exploration (different across agents) depends on the minimum distance between subspaces in an agent's active set, which agents can explicitly compute. Thus, in order to reduce regret, agents in our algorithm consider only a small, well-separated active set of subspaces in each phase. When not exploring, agents play action vectors within their best estimated subspace containing $\theta^*$ to minimize regret. This second task is achieved by playing a projected version of the LinUCB algorithm, which incurs regret only in the dimension of the subspace (once the true subspace is correctly identified) as opposed to the ambient dimension, thereby keeping regret low. Due to communications, the correct subspace spreads to all agents, while each agent's minimum subspace gap stays large (thus reducing the regret due to exploration).

Notation: is the surface of unit sphere and , the distance between and subspace . The distance between subspaces is denoted by .
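To make the notion of "closest subspace" concrete, here is a minimal pure-Python sketch. It assumes the distance between a vector and a subspace is the Euclidean norm of the residual after orthogonal projection; this definition, and the function names `project_onto_subspace`, `distance_to_subspace` and `closest_subspace`, are assumptions introduced for illustration rather than the paper's exact notation.

```python
import math

def project_onto_subspace(basis, x):
    """Orthogonally project x onto the span of `basis`.

    `basis` is a list of m orthonormal vectors, each a list of d floats
    (an assumed representation of a d x m orthonormal matrix).
    """
    proj = [0.0] * len(x)
    for u in basis:
        coeff = sum(ui * xi for ui, xi in zip(u, x))
        for k in range(len(x)):
            proj[k] += coeff * u[k]
    return proj

def distance_to_subspace(basis, x):
    """Euclidean norm of the part of x that lies outside the subspace."""
    p = project_onto_subspace(basis, x)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def closest_subspace(subspaces, x):
    """Return the ID (dictionary key) of the subspace nearest to x."""
    return min(subspaces, key=lambda sid: distance_to_subspace(subspaces[sid], x))

# Two one-dimensional subspaces of R^3: the x-axis (ID 0) and the y-axis (ID 1).
subspaces = {0: [[1.0, 0.0, 0.0]], 1: [[0.0, 1.0, 0.0]]}
estimate = [0.1, 0.9, 0.0]
best_id = closest_subspace(subspaces, estimate)
```

An agent would apply `closest_subspace` only to the subspaces in its current active set, using its explore-based estimate as `x`.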

### 3.1 Description

Our algorithm builds on some of the ideas developed for a (non-contextual) collaborative setting for unstructured bandits in [23, 12]. For any agent our algorithm proceeds in phases, where phase is from time slots to , both inclusive (with ). During each phase , every agent only plays from an active set of subspaces such that . Agents communicate at the end of the phase to update their active set. The phase lengths are designed so as to respect communication budget constraints.

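Initialization: This step sets the initial active set, also referred to as the sticky set (the choice of terminology will become apparent shortly). The sticky sets $(\hat{S}^{(i)})_{i=1}^{N}$ are obtained by partitioning the subspaces across agents to maximize the minimum within-set distance:

$$(\hat{S}^{(i)})_{i=1}^{N} = \arg\max\left\{ \min_{i \in [N]} \min_{l_1 \neq l_2 \in S_i} \Delta_{l_1, l_2} : (S_i)_{i=1}^{N} \text{ s.t. } 1 \le |S_i| \le K/N,\ \cup_{i=1}^{N} S_i = [K] \right\}. \tag{1}$$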

For numerical simulations, we provide a feasible partition that is fast and works well in practice.
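For intuition about the objective in Equation (1), the following sketch solves it by exhaustive search on a tiny instance. It is not the fast feasible partition used in the simulations; the function names and the precomputed pairwise gap matrix `gap` are hypothetical inputs introduced for illustration, and the search is exponential in the number of subspaces, so it is only usable for very small instances.

```python
from itertools import product

def min_gap(subset, gap):
    """Smallest pairwise gap inside a subset (infinity if fewer than two members)."""
    items = sorted(subset)
    best = float("inf")
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            best = min(best, gap[items[a]][items[b]])
    return best

def best_partition(gap, num_agents):
    """Exhaustively maximize the minimum within-set gap, as in Equation (1).

    Each agent receives between 1 and K / num_agents subspaces and every
    subspace is assigned to exactly one agent.
    """
    num_subspaces = len(gap)
    cap = num_subspaces // num_agents
    best_value, best_sets = -1.0, None
    for assignment in product(range(num_agents), repeat=num_subspaces):
        sets = [set() for _ in range(num_agents)]
        for subspace, agent in enumerate(assignment):
            sets[agent].add(subspace)
        if any(len(s) < 1 or len(s) > cap for s in sets):
            continue  # infeasible: violates the size constraint
        value = min(min_gap(s, gap) for s in sets)
        if value > best_value:
            best_value, best_sets = value, sets
    return best_value, best_sets

# Four subspaces: (0, 1) are close, (2, 3) are close, all other pairs are far apart.
gap = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
value, sets = best_partition(gap, num_agents=2)
```

The search separates the two close pairs across agents, so each agent's sticky set is well separated even though the global minimum gap is small.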

Action Vectors Chosen in a Phase: If at time in a phase , the total number of explore samples thus far (denoted by ) satisfies , then agent plays a standard basis vector chosen according to the round robin method. Otherwise, the action vector is chosen according to the Projected LinUCB [18]. The subspace agent chooses is the one closest in (denoted by ) to the unit vector , the estimate of formed from the explore samples thus far. The precise action vector chosen is given according to the following equations [18]:

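$$A_t^{(i)} = \arg\max_{a \in \mathcal{A}_t} \max_{\theta \in C_t^{(i)}} \langle \theta, P_{k_t^{(i)}} a \rangle, \quad \text{where} \quad C_t^{(i)} = \left\{ \theta \in \mathbb{R}^d : \|\hat{\theta}_t^{(i)} - \theta\|_{\bar{V}_t^{(i)}(\lambda)} \le \beta_t \right\},$$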

and , and . is a matrix whose columns are all the action vectors (explore and exploit) played up to and including time and is a column vector of the corresponding rewards.

Communications and the Active Subspaces for the Next Phase: At the end of phase , agent asks for a subspace recommendation from an agent chosen independently. Denote by to be this recommendation. Agent if asked for a recommendation at the end of phase , recommends the subspace in closest to , i.e., using only the explore samples. The next active set is given by , where . Observe that , , and thus , is denoted sticky.
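The end-of-phase update can be sketched as follows. This is a hedged illustration, not the paper's exact rule: it assumes the next active set is the union of the sticky set, the currently held subspace closest to the explore-only estimate, and the received recommendation. The names `recommend`, `next_active_set` and the dictionary `dist_to_estimate` (subspace ID mapped to its distance from the agent's estimate) are hypothetical.

```python
def recommend(active_set, dist_to_estimate):
    """ID of the active subspace closest to the explore-only estimate."""
    return min(sorted(active_set), key=lambda sid: dist_to_estimate[sid])

def next_active_set(sticky_set, active_set, dist_to_estimate, recommendation):
    """Assumed update: keep the sticky set, the current best, and the new recommendation."""
    keep = recommend(active_set, dist_to_estimate)
    return set(sticky_set) | {keep, recommendation}

# Agent holds sticky subspaces {0, 1} plus two non-sticky ones {4, 5}.
sticky = {0, 1}
active = {0, 1, 4, 5}
dist = {0: 0.9, 1: 0.8, 4: 0.2, 5: 0.7}
updated = next_active_set(sticky, active, dist, recommendation=7)
```

Under this assumed rule the sticky set is never dropped and the active set never grows beyond the sticky set plus two subspaces, which matches the description that the number of subspaces an agent considers stays small.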

Please see Appendix A for a detailed description, and Algorithm 1 for the pseudo-code.

## 4 Main Result

In order to state the result, we make two mild technical assumptions: the gossip matrix is connected and the communication budget is at least logarithmic (detailed definitions in Appendix B). We define a random variable denoting the spreading time of the following process: node initially has a rumor; at each time, an agent without a rumor calls another chosen independently from the gossip matrix and learns the rumor if the other agent knows the rumor. The stopping time denotes the first time when all agents know the rumor and . For ease of exposition, we assume that , which the agents are unaware of. Let be the minimum separation between all subspaces, i.e., and .
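The rumor-spreading process described above can be simulated directly. The sketch below is illustrative only: the function name `spreading_time`, the row-stochastic list-of-lists representation of the gossip matrix, and the choice of source node are assumptions made for this example. The loop terminates only if the rumor can reach every agent, which is what the connectivity assumption provides.

```python
import random

def spreading_time(gossip, source, rng):
    """Number of rounds until every agent knows the rumor (pull-based spreading).

    gossip[i][j] is the probability that agent i calls agent j.
    Assumes the gossip matrix is connected; otherwise this may never return.
    """
    n = len(gossip)
    informed = {source}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        newly_informed = set()
        for agent in range(n):
            if agent in informed:
                continue
            # An uninformed agent calls one agent drawn from its gossip row.
            callee = rng.choices(range(n), weights=gossip[agent])[0]
            if callee in informed:
                newly_informed.add(agent)
        # Agents that learned the rumor this round can be called from the next round on.
        informed |= newly_informed
    return rounds

# Complete graph on 8 agents: each agent calls one of the others uniformly at random.
n = 8
complete = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
rounds_needed = spreading_time(complete, source=0, rng=random.Random(0))
```

Averaging `spreading_time` over many seeds gives an empirical estimate of the expected spreading time for a given gossip matrix.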

###### Theorem 1.

Consider a system consisting of $N$ agents connected by a gossip matrix , all running Algorithm 1 with input parameters , , . Furthermore, suppose the agents have a communication budget and use parameter satisfying assumption (A.2) in Appendix B. Then the regret of any agent $i$, after any time $T$, is bounded by

 (2)

Here, , is given in Equation (5) and , where

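$$j^* = 2\max\left( B^{-1}\left( \left(N d e^{18d}\right)^{\frac{1}{\frac{\alpha\tilde{\Delta}^2}{4} - 3}} \right),\ \min\left\{ j \in \mathbb{N} : \forall j' \ge j,\ B_{j'} \ge \left\lceil \frac{32\alpha d^2}{\tilde{\Delta}^2 L^2} \log B_{j'} \right\rceil \right\} \right)$$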

and , .

### 4.1 Discussion

In order to understand the theorem better, we consider an illustrative example and state a corollary of our main theorem. First, observe that the effect of the parameter used in the definition of in Equation (2) only affects the constant pairwise communication cost term and does not affect the scaling of regret with time horizon . We now consider a simple example. Assume that the agents are connected by a complete graph, i.e., for any , , and the communication budget , for some . Thus, from the definition of in Equation (5), we have , where used in Equation (5) is sufficiently small. In this case, it is easy to see that for all and , (for ex. Corollary in [12]), , for some universal constant . Thus, we have the following corollary.

###### Corollary 2.

Suppose the agents are connected by the complete graph, namely , for all , and the communication budget and used by the agents are such that (given in Equation (5)), for all large , for some fixed . Then, when Algorithm 1 is run with the same inputs as in Theorem 1, the regret of any agent $i$ after time $T$ satisfies

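$$\mathbb{E}[R_T^{(i)}] \le \underbrace{\tilde{O}(m\sqrt{T})}_{\text{Projected LinUCB Regret}} + \underbrace{O\left(\frac{d^2}{(\Delta^{(i)})^2}\log T\right)}_{\text{Cost of Subspace Exploration}} + \underbrace{2\beta\left(\frac{\pi^2}{6} + \max\left\{ \left(N d e^{18d}\right)^{\frac{1}{\frac{\alpha\tilde{\Delta}^2}{4} - 3}},\ \left(\frac{32\alpha d^2}{\tilde{\Delta}^2 L^2}\right)^2 \right\}\right)}_{g((B_x)_{x \in \mathbb{N}})} + (2C\log(N))\,\beta\,\mathbb{E}\left[B_{2\tau_{spr}^{(P)}}\right], \tag{3}$$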

where $C$ is a universal constant.

In Equation (3), the notation $\tilde{O}$ hides only input constants and universal constants. Thus, we see from Equation (3) that the scaling of the regret of every agent ($\tilde{O}(m\sqrt{T})$) matches that of an oracle that knows the correct subspace in which $\theta^*$ lies.

### 4.2 Proof Sketch

The proof of this theorem is given in Appendix C and we describe its salient ideas here. Similar to the phenomenon in unstructured bandits [12], we prove in Proposition 1 that in our linear bandit setup, there exists a freezing phase , such that all agents have the correct subspace containing $\theta^*$. Consequently, for all phases , all agents will play Projected LinUCB [18] from the correct subspace in the exploit time instants and recommend it at the end of the phase. Therefore, the set of subspaces every agent has does not change after phase and the regret after phase can be decomposed into regret due to pure exploration and regret due to Projected LinUCB (Proposition 2).

The technical novelty of our proof lies in bounding the regret till phase , i.e. (Proposition 3), and in particular showing it to be finite. This follows from two key observations arising from pure exploration in the explore time steps. First, for any agent $i$, the estimate of $\theta^*$ using the explore samples till time , concentrates to (in norm) even with the time-varying exploration rates in different phases (Lemma 3). Second, we show in Lemma 4 that if an agent has the correct subspace containing $\theta^*$ in a phase, the probability that it will not be recommended and hence dropped at the end of the phase is small.

Combining these two observations we establish that, after a random phase denoted by in Appendix C, satisfying , agents never recommend incorrectly at the end of a phase and thus play the Projected LinUCB on the correct subspace in the exploit time instants. The formal argument follows by multiplying the relevant random variables by indicator random variables that the system has frozen. Subsequently, the expected value is bounded by either dropping the indicator random variable, or by considering an appropriate maximization of the other variable. To conclude, after random phase , the spreading time can be coupled with that of a standard rumor spreading [13], as once an agent learns the correct subspace, it is not dropped by the agent. This final part is similar to the one conducted for unstructured bandits in [12], giving us the desired bound on .

## 5 Benefit of Collaboration and Regret-Communication Trade-off

In this section, we derive insights into the behavior of our algorithm.

1. Benefit of Collaboration: The regret guarantee in Theorem 1 implies a collaborative gain, i.e., the regret of an individual agent is strictly better than in a system without collaboration. The key reason comes from the term in the cost of subspace exploration in the regret which increases as increases. The case when agents do not collaborate is equivalent to saying that , i.e., there is only a single agent. In this case, and . Whereas, in the presence of collaboration, the set of subspaces , and thus, the minimum gap . Even though increasing increases the cost of pairwise communications due to increased spreading time, this is only a constant and the dominant effect on regret is in reducing cost of exploration.

Hence, collaboration reduces regret by ensuring that each agent is solving an easier instance of the problem. This benefit of collaboration is more pronounced in ‘high-dimensional’ settings, when and is small (a constant). In such settings, both the projected LinUCB term and the cost of subspace exploration are comparable, and therefore collaboration directly impacts in reducing the leading order term in regret. These insights are also verified numerically in simulations.

2. Regret-Communication Tradeoff: Corollary 2 shows that the per-agent regret increases as we decrease the communication budget. This follows by noticing that the right hand side of Equation (3) is monotonically increasing in . Thus, a higher value of (reducing communication budget) leads to higher per-agent regret. Although one would expect this to intuitively hold, the monotonicity of the bound in Equation (3) with respect to makes it evident. Nevertheless, we see that communication cost only impacts second order terms in the regret as it only contributes a constant cost that does not scale with the time horizon.

## 6 Numerical Results

We evaluate the SubGoss algorithm empirically in synthetic simulations. We show the cumulative regret (after averaging across all agents) over random runs of the algorithm with confidence intervals. The communication budget is fixed to (i.e., ). The time horizon axis is scaled down by for clear representation. We compare its performance with two benchmarks: the SubGoss algorithm with no collaboration (i.e. a single agent playing the SubGoss algorithm) and a single agent playing the OFUL (classical LinUCB) algorithm of [1]. In this section, the condition for pure exploration is changed to for all . Note that as the constants in the SubGoss algorithm (Algorithm 1) arise from somewhat loose tail bounds, we use different parameters here.

In our experiments, the agents are connected through a complete graph and the action set is a collection of i.i.d. vectors on the surface of the unit ball. Each $m$-dimensional subspace is the orthogonal matrix obtained by the SVD of a random matrix with i.i.d. standard normal entries. The vector $\theta^*$ is the projected version of a standard Gaussian vector onto subspace (the true subspace). We set in simulations. For assigning the sticky sets to agents, we re-arrange the subspaces such that and assign , for all . Figures 1 and 2 evaluate the performance of the SubGoss algorithm for different values of problem parameters . We observe that the SubGoss algorithm effectively utilizes collaboration to reduce regret for all agents. This benefit of collaboration is demonstrated even with a sub-optimal partition for assigning sticky sets as described above. Moreover, Figure 2 demonstrates that as the number of agents increases, per-agent regret decreases. This follows because the instance each agent solves becomes easier.
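A synthetic instance of this kind can be generated along the following lines. This sketch swaps in Gram-Schmidt orthonormalization of Gaussian vectors in place of the SVD mentioned above (both yield an orthonormal basis of the column space of a random Gaussian matrix); the function names, dimensions and seed are assumptions chosen for illustration, not the paper's experimental settings.

```python
import math
import random

def random_orthonormal_basis(d, m, rng):
    """m orthonormal vectors in R^d via Gram-Schmidt on i.i.d. Gaussian vectors."""
    basis = []
    while len(basis) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in basis:
            coeff = sum(ui * vi for ui, vi in zip(u, v))
            v = [vi - coeff * ui for ui, vi in zip(u, v)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-9:  # skip the (measure-zero) degenerate case
            basis.append([vi / norm for vi in v])
    return basis

def project(basis, x):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for u in basis:
        coeff = sum(ui * xi for ui, xi in zip(u, x))
        out = [oi + coeff * ui for oi, ui in zip(out, u)]
    return out

rng = random.Random(1)
d, m = 6, 2  # illustrative dimensions only
true_basis = random_orthonormal_basis(d, m, rng)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(d)]
theta_star = project(true_basis, gaussian)  # lies in the true subspace by construction
```

Repeating `random_orthonormal_basis` once per subspace gives the full collection of side-information subspaces for a simulated instance.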

## References

• [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
• [2] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.
• [3] Orly Avner and Shie Mannor. Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 66–81. Springer, 2014.
• [4] Orly Avner and Shie Mannor. Multi-user lax communications: a multi-armed bandit approach. In IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.
• [5] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
• [6] Ilai Bistritz and Amir Leshem. Distributed multi-player bandits-a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222–7232, 2018.
• [7] Etienne Boursier and Vianney Perchet. Sic-mmab: synchronisation involves communication in multiplayer multi-armed bandits. In Advances in Neural Information Processing Systems, pages 12048–12057, 2019.
• [8] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
• [9] Swapna Buccapatnam, Jian Tan, and Li Zhang. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 2605–2613. IEEE, 2015.
• [10] Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. In Advances in neural information processing systems, pages 273–280, 2009.
• [11] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In IJCAI, pages 164–170, 2017.
• [12] Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. The gossiping insert-eliminate algorithm for multi-agent bandits. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3471–3481, Online, 26–28 Aug 2020. PMLR.
• [13] Flavio Chierichetti, Silvio Lattanzi, and Alessandro Panconesi. Almost tight bounds for rumour spreading with conductance. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 399–408, 2010.
• [14] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. COLT, 2008.
• [15] Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854–862, 2013.
• [16] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.
• [17] Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking, 26(4):1782–1795, 2018.
• [18] Sahin Lale, Kamyar Azizzadenesheli, Anima Anandkumar, and Babak Hassibi. Stochastic linear bandits with hidden low rank structure. CoRR, abs/1901.09490, 2019.
• [19] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2019.
• [20] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
• [21] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. In Advances in Neural Information Processing Systems, pages 4531–4542, 2019.
• [22] Jonathan Rosenski, Ohad Shamir, and Liran Szlak. Multi-player bandits–a musical chairs approach. In International Conference on Machine Learning, pages 155–163, 2016.
• [23] Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–35, 2019.
• [24] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
• [25] Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208, 2009.

## Appendix A Additional Details of the SubGoss Algorithm

In this section, we provide additional details about the actions (in particular the explore step) taken by an agent at time (denoted by ) in phase . Recall that

• the reward obtained is + , where for all ,

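$$\mathbb{E}\left[\exp\left(z\,\eta^{(i)}_t\right) \,\middle|\, \mathcal{F}^{(i)}_{t-1}\right] \le \exp\left(\frac{z^2}{2}\right) \quad \text{a.s.}, \tag{4}$$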

and .

• If there is no such feasible and , then . is the minimum pairwise separation between the subspaces in

• is the number of explore samples collected up to and including time ()

The action vector is chosen as follows: (i) (explore) If , agent explores, else (ii) (exploit) agent exploits and plays projected LinUCB. We now describe the explore step.

Explore: If agent explores at time , then it plays the standard basis vectors in a round-robin fashion, i.e., . Subsequently, . Only the rewards collected during the explore time steps are used to construct an estimate of by solving the least squares minimization problem . Here, is a matrix whose columns are the standard basis vectors played in the explore time slots up to and including time , and is a column vector containing the corresponding rewards. Solving the above least squares problem, for all , gives when , otherwise.
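The explore step above can be sketched in a few lines. Because only standard basis vectors are played, the least squares problem decouples by coordinate, and each coordinate of the estimate is the average of the rewards observed when that basis vector was played. The sketch below is illustrative only: the function names, the Gaussian noise model, the value of `theta_star`, and the choice to return 0.0 for a coordinate that has not yet been sampled are all assumptions made here, not details taken from the algorithm's specification.

```python
import random


def explore_estimate(rewards_by_coord, d):
    """Least-squares estimate from explore samples of standard basis vectors.

    rewards_by_coord[k] holds the rewards observed when e_k was played.
    With standard-basis actions the problem decouples by coordinate, so
    each coordinate is the mean of its own rewards. Coordinates with no
    samples default to 0.0 (an assumption of this sketch).
    """
    estimate = []
    for k in range(d):
        samples = rewards_by_coord[k]
        estimate.append(sum(samples) / len(samples) if samples else 0.0)
    return estimate


def run_explore(theta_star, num_explore, noise_std=1.0, seed=0):
    """Play e_1, ..., e_d in round-robin order for num_explore explore slots."""
    rng = random.Random(seed)
    d = len(theta_star)
    rewards_by_coord = [[] for _ in range(d)]
    for n in range(num_explore):
        k = n % d  # round-robin index of the basis vector played
        # Reward is <theta_star, e_k> plus noise (Gaussian noise is assumed here).
        reward = theta_star[k] + rng.gauss(0.0, noise_std)
        rewards_by_coord[k].append(reward)
    return explore_estimate(rewards_by_coord, d)


theta_star = [0.5, -0.2, 0.1]  # hypothetical parameter vector
noiseless = run_explore(theta_star, num_explore=7, noise_std=0.0)
noisy = run_explore(theta_star, num_explore=3000, noise_std=1.0)
```

With zero noise the estimate recovers `theta_star` after a single pass over the basis vectors; with noise, the per-coordinate error shrinks as the number of explore samples grows.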

## Appendix B Technical Assumptions for Theorem 1

Building on the communication constraints considered in [12], we make the following mild assumptions:

(A.1) [12] The gossip matrix is irreducible: for any , there exists and , with and such that the product . In words, the communication graph among the agents is connected.

(A.2) [12] The communication budget and is such that for all , there exists such that for all , (i.e., ). Furthermore, .

In order to understand assumption A.2 better, here are some examples:

• for all and for all

• When agents have the luxury to communicate in every time slot, i.e. for all for all

One can check that the conditions in assumption A.2 are satisfied by the above examples.

Assumption A.1 is needed because if the communication graph among the agents is not connected, the setup becomes degenerate: there exists at least one pair of agents that cannot exchange information. Assumption A.2 states that the number of times an agent performs information pulls over the time horizon is at least . Furthermore, the series in A.2 is convergent for all natural examples (such as polynomial and exponential). For example, when or , for all and , the series converges. Thus, the practical insights that can be obtained from our results are not affected by assumptions A.1 and A.2.
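Assumption A.1 can be checked mechanically for a given gossip matrix. The following is a minimal sketch, assuming the gossip matrix is supplied as a list of rows of nonnegative entries: the matrix is irreducible exactly when every agent can reach every other agent along edges with positive weight. The function names and the two example matrices are hypothetical.

```python
from collections import deque


def reachable_from(gossip, source):
    """Return the set of agents reachable from source via positive entries."""
    n = len(gossip)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if gossip[u][v] > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_irreducible(gossip):
    """True if every agent can reach every other agent (assumption A.1)."""
    n = len(gossip)
    return all(len(reachable_from(gossip, i)) == n for i in range(n))


# Hypothetical examples: a 4-agent ring, and two disconnected pairs.
ring = [
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]
split = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
```

The ring satisfies A.1, while the second matrix does not: agents in one pair can never receive a recommendation that originates in the other pair.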

## Appendix C Proof of Theorem 1

In this section, we assume agents know the parameters and such that . In the paper, we set for ease of exposition. Before going through the proof, we first introduce some definitions and notation.

### C.1 Definitions and Notation

We adapt the proof ideas developed in [12] for the unstructured bandit case. Recall that the communication budget is reparametrized as the sequence , where

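$$B_x = \max\left(\min\left\{t \in \mathbb{N},\ \widetilde{B}_t \ge x\right\},\ \left\lceil x^{1+\varepsilon} \right\rceil\right). \tag{5}$$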
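Equation (5) is straightforward to evaluate numerically. The sketch below assumes the original budget is available as a nondecreasing function `budget(t)` of the time index, that time starts at 1, and that a finite search horizon `t_max` suffices; these names and the two example budgets are assumptions introduced for illustration.

```python
import math


def reparametrized_budget(x, budget, eps, t_max=10**6):
    """Return B_x = max(min{t : budget(t) >= x}, ceil(x ** (1 + eps))).

    budget is assumed nondecreasing in t; the search stops at t_max.
    """
    first_t = next((t for t in range(1, t_max + 1) if budget(t) >= x), None)
    if first_t is None:
        raise ValueError("budget never reaches x within t_max steps")
    # Floating-point powers can round; this is fine for a sketch.
    return max(first_t, math.ceil(x ** (1 + eps)))


def every_slot(t):
    """Hypothetical budget: one communication allowed in every time slot."""
    return t


def sparse(t):
    """Hypothetical budget: one communication every ten time slots."""
    return t // 10


b3_dense = reparametrized_budget(3, every_slot, eps=1.0)
b3_sparse = reparametrized_budget(3, sparse, eps=1.0)
```

When communication is allowed in every slot, the second term of the maximum dominates, so the phase boundaries are set by the polynomial growth term. With the sparser budget, the first term dominates for small x.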

At any time in phase , let . In words, is an indicator variable for whether the agent will explore (or exploit) in time instant . For every agent , let be the subspace closest to with respect to the distance measure . Let be the indicator variable for the event , i.e.

which indicates whether agent , if it has the subspace , does not recommend it at the end of phase . We now define certain random times that will be useful in the analysis, in a similar manner as done in [12]:

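$$
\begin{aligned}
\hat{\tau}^{(i)}_{\mathrm{stab}} &= \inf\left\{j' \ge j^* : \forall j \ge j',\ \chi^{(i)}_j = 0\right\}, \\
\hat{\tau}_{\mathrm{stab}} &= \max_{i \in [N]} \hat{\tau}^{(i)}_{\mathrm{stab}}, \\
\hat{\tau}^{(i)}_{\mathrm{spr}} &= \inf\left\{j \ge \hat{\tau}_{\mathrm{stab}} : 1 \in S^{(i)}_j\right\} - \hat{\tau}_{\mathrm{stab}}, \\
\hat{\tau}_{\mathrm{spr}} &= \max_{i \in \{1,\cdots,N\}} \hat{\tau}^{(i)}_{\mathrm{spr}}, \\
\tau &= \hat{\tau}_{\mathrm{stab}} + \hat{\tau}_{\mathrm{spr}}.
\end{aligned}
$$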

Here, is the earliest phase, such that for all subsequent phases, if agent has the subspace , it will recommend the subspace . The term denotes the number of phases it takes after to have the subspace in its playing set. The following proposition shows that the system is frozen after phase , i.e. after phase , the set of subspaces of all agents remains fixed in the future.
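To make the random times concrete, the sketch below evaluates them on a finite recorded trace. It is only an illustration under several assumptions: phases are indexed from 0, the infimum over all future phases is replaced by the recorded horizon, the subspace containing the unknown parameter has index 1 (as the definitions above suggest), and the function names and example traces are hypothetical.

```python
def stabilization_phase(chi, j_star):
    """Smallest j' >= j_star with chi[j] == 0 for every recorded j >= j'.

    Returns None if the trace does not end in a run of zeros.
    """
    last = len(chi) - 1
    j_prime = None
    for j in range(last, j_star - 1, -1):  # scan backwards from the end
        if chi[j] != 0:
            break
        j_prime = j
    return j_prime


def freezing_phase(chi_by_agent, sets_by_agent, j_star, best=1):
    """Return (tau_stab, tau_spr, tau) on a finite trace, or None if undefined."""
    stab = [stabilization_phase(chi, j_star) for chi in chi_by_agent]
    if any(s is None for s in stab):
        return None
    tau_stab = max(stab)
    spread = []
    for sets in sets_by_agent:
        # First phase at or after tau_stab at which the agent holds `best`.
        hit = next((j for j in range(tau_stab, len(sets)) if best in sets[j]), None)
        if hit is None:
            return None
        spread.append(hit - tau_stab)
    tau_spr = max(spread)
    return tau_stab, tau_spr, tau_stab + tau_spr


# Hypothetical trace with two agents and six phases (0..5).
chi_by_agent = [
    [1, 0, 1, 0, 0, 0],  # agent 0 fails to recommend subspace 1 twice
    [0, 0, 0, 0, 0, 0],  # agent 1 never holds and withholds subspace 1
]
sets_by_agent = [
    [{1, 2}] * 6,                        # agent 0 always holds subspace 1
    [{3}, {3}, {3}, {3}, {3}, {1, 3}],   # agent 1 receives it in phase 5
]
times = freezing_phase(chi_by_agent, sets_by_agent, j_star=1)
```

In this trace the recommendations stabilize at phase 3, the correct subspace takes 2 further phases to reach the last agent, and the system is frozen from phase 5 onward.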

###### Proposition 1.

For all agents , we have almost-surely,

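$$
\begin{aligned}
\bigcap_{j \ge \tau} S^{(i)}_j &= S^{(i)}_{\tau}, \\
\hat{O}^{(i)}_l &= 1 \quad \forall l \ge \tau,\ \forall i \in \{1,\cdots,N\}.
\end{aligned}
$$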
###### Proof.

Fix any agent and any phase . Since , we have for all ,

$$\chi^{(i)}_j = 0. \tag{6}$$

However, as , we know that

$$1 \in S^{(i)}_j. \tag{7}$$

Equations (6) and (7) imply that . Moreover, is true for all phases and all agents , as they were chosen arbitrarily. Furthermore, the update step of the algorithm along with the above reasoning tells us that none of the agents will change their subspaces after any phase , as the agents already have the correct subspace in their respective playing sets. Thus, for all agents . ∎

Proposition 1 also tells us that for all time instants , in the exploit time slots, all agents will play Projected LinUCB from the subspace , because the algorithm picks the subspace closest to with respect to the distance measure in the exploit time slots. Combining Proposition 1 with the update step of the algorithm (line of Algorithm 1) tells us that for all and thus for all .

### C.2 Intermediate Results

Before stating and proving the intermediate results, we highlight the key pieces needed to prove Theorem 1. We already showed in Proposition 1 that there exists a freezing time , such that in the subsequent phases, all agents have the correct subspace containing and recommend it henceforth. Thus, the expected cumulative regret incurred can be decomposed into two parts: the regret up to phase and the regret after phase .

The expected cumulative regret incurred up to phase is a constant independent of the time horizon (Proposition 3). It is a consequence of the following important observations resulting from pure exploration in the explore time steps:

• For any agent , the estimate of concentrates to in norm, despite time-varying exploration of subspaces leading to different exploration rates in different phases (Lemma 3).

• Subsequently, we show that the probability that an agent will not recommend and thus drop the correct subspace containing is small at the end of a phase (Lemma 4).

The above observations imply that after a (random) phase, denoted by , agents always recommend (and never drop) the correct subspace. After phase , we stochastically dominate (in Proposition 4) the spreading time of the correct subspace with a standard rumor spreading process [13]. Hence, the expected cumulative regret up to phase is bounded by the total number of time steps taken to reach phase and the additional number of phases taken to spread the correct subspace.
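The spreading argument can be illustrated with a small simulation. The sketch below uses a simple synchronous, pull-only rumor spreading model in which every uninformed agent contacts one uniformly random neighbor per round. This particular model, the uniform neighbor choice, the function name, and the example graphs are assumptions made for illustration; the process analyzed in [13] and the one induced by the algorithm may differ in their details.

```python
import random


def pull_spreading_time(neighbors, informed_start, rng, max_rounds=10_000):
    """Rounds until all agents are informed under synchronous pull gossip.

    neighbors[i] lists the agents that agent i may contact. In each round,
    every uninformed agent pulls from one uniformly random neighbor and
    becomes informed if that neighbor was informed at the start of the round.
    """
    informed = set(informed_start)
    n = len(neighbors)
    rounds = 0
    while len(informed) < n:
        if rounds >= max_rounds:
            raise RuntimeError("rumor did not spread within max_rounds")
        rounds += 1
        newly_informed = set()
        for agent in range(n):
            if agent in informed:
                continue
            contact = rng.choice(neighbors[agent])
            if contact in informed:
                newly_informed.add(agent)
        informed |= newly_informed
    return rounds


rng = random.Random(0)
pair = [[1], [0]]                      # two agents connected by one edge
path = [[1], [0, 2], [1]]              # three agents on a path
complete = [[j for j in range(8) if j != i] for i in range(8)]

t_pair = pull_spreading_time(pair, {0}, rng)
t_path = pull_spreading_time(path, {0}, rng)
t_complete = pull_spreading_time(complete, {0}, rng)
```

On the two-agent graph the rumor always spreads in one round, and on the path it needs at least two rounds because the far end can only learn from the middle agent. On connected graphs the spreading time is finite with probability one, which is why its contribution to the regret does not grow with the time horizon.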

After phase , the active set of subspaces maintained by agents remains unchanged (as deduced in Proposition 1), and thus the regret can be decomposed into the sum of the regret due to pure exploration and the regret due to projected LinUCB. The regret due to projected LinUCB is adapted from the analysis of a similar algorithm conducted in [18]. This regret decomposition is further bounded (Proposition 2) by either dropping the indicator random variables corresponding to explore (exploit) time instants, or appropriately maximizing those indicator variables.

The following intermediate results will precisely characterize the intuition behind the proof of Theorem 1.

###### Proposition 2.

The regret of any agent after playing for steps is bounded by

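$$\mathbb{E}\left[R^{(i)}_T\right] \le 2S\,\mathbb{E}[B_\tau] + \mathbb{E}\left[R_{\mathrm{proj},T}\right] + 2S\left\lceil \frac{32\alpha d^2}{(\Delta^{(i)})^2 L^2} \log T \right\rceil.$$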
###### Proof.

We will first show that for any agent , the instantaneous regret . In order to obtain this bound, notice that for any

$$|\langle \theta^*, a \rangle| \le \|\theta^*\|_2 \, \|a\|_2 \le S$$

by the Cauchy-Schwarz inequality. Therefore, we have for all . From the definition of regret ,

 R(i)T =T∑t=1w