The effect of communication structure in cooperative and competitive multi-agent systems has been extensively studied in decision theory. The performance of a group of social learners can be improved by information shared among individuals. In most real-world decision-making processes, however, information sharing between agents is costly. As a result, directed communication, in which each agent only needs to observe its neighbors, has advantages over undirected communication, in which each agent both sends and receives information. Even when observation costs are high, agents can keep costs to a minimum by choosing when and whom to observe as a function of their own performance. Further, in this setting costs associated with cooperation can be avoided.
Consider the problem of a group of fishermen foraging in an uncertain environment that consists of a spatially distributed resource (fish). Because of the natural dynamics of fish, environmental conditions, and other external factors, the resource will be distributed stochastically. As a result, a fisherman will receive different reward values (number of fish harvested) at different times, even when sampling from the same patch. Thus, in order to maximize cumulative reward, fishermen need to be able to exploit, i.e., forage in well sampled patches known to provide better harvest, and to explore, i.e., forage in poorly sampled patches, which is riskier but may provide even better harvest than well sampled patches. Benefiting from exploitation requires sufficient exploration and identification of the patches that yield the highest rewards. More generally, optimal foraging performance comes from balancing the trade-off between exploring and exploiting. This is known as the explore-exploit dilemma.
This trade-off is formalized by the multi-armed bandit (MAB) problem [13, 14]. For the standard MAB problem the reward distributions associated with options are static. An agent estimates the expected reward of each option using the rewards it receives through sampling. The agent chooses among options by considering a trade-off between estimated expected reward (exploiting) and the uncertainty associated with the estimate (exploring). In the frequentist setting, the natural way of estimating the expected reward is the sample average [7, 1, 3]. The papers [5, 12] show how to incorporate prior knowledge about reward expectation into the estimation step by leveraging the theory of conditional expectation in the Bayesian setting.
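As a small illustration of the frequentist estimator mentioned above, the sample average can be maintained incrementally as rewards arrive; the function and variable names below are illustrative, not from the paper.

```python
# Hedged sketch (illustrative names): frequentist sample-mean estimate of an
# option's expected reward, updated incrementally as rewards arrive.

def update_mean(mean, count, reward):
    """Return the updated sample mean and sample count after one reward."""
    count += 1
    mean += (reward - mean) / count   # incremental form of the sample average
    return mean, count

mean, count = 0.0, 0
for r in [4.0, 6.0, 5.0]:
    mean, count = update_mean(mean, count, r)
# mean is now the sample average (4 + 6 + 5) / 3 = 5.0
```

The incremental form avoids storing the full reward history, which is why it is the standard update in bandit implementations.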
Multi-agent multi-armed bandit (MAMAB) problems consider a group of individuals facing the same MAB problem simultaneously. For an individual to maximize its own reward, it will naturally seek to observe its neighbors and use those observations to improve its performance. Individual and group performance of agents will vary according to the observation structure, i.e., who is observing whom, and the type of information they observe. For example, if the agents are cooperative and can broadcast signals, they could share their estimates of rewards. When there are constraints, such as communication costs and privacy concerns, they might instead share only their instantaneous rewards and choices. Even without the ability to broadcast, agents may still be able to use sensors to observe the instantaneous rewards and choices of neighbors. A centralized multi-agent setting is considered in [2] and a decentralized setting is considered in [4]. The papers [9, 8] use a running consensus algorithm in which agents observe the reward estimates of their neighbors. In [6, 10] an MAMAB problem is studied in which agents observe instantaneous rewards and choices in a leader-follower setting.
In all of these previous works, communication between agents is assumed to be cost free. However, in real-world settings observing neighbors or exchanging information with neighbors is costly. In the present paper, we propose a setting in which agents can decide when and whom to observe in order to receive maximum benefit from observations that incur a cost. An underlying undirected network graph defines neighbors and models the inherent observation constraints present in the network. Agents incur a fixed cost every time they observe a neighbor.
To account for the observation cost, we define cumulative regret to be the total cumulative regret agents receive from sampling suboptimal options (sampling regret) and from observing neighbors (observation regret). Deterministic [9, 10] and probabilistic [11] communication strategies proposed in the MAB literature lead to a linear cumulative observation regret. Our main contribution is the design of a new strategy for which we prove a logarithmic total cumulative regret, i.e., order-optimal performance. Our design leverages the intuition that it is most useful to observe neighbors when the uncertainty associated with the reward estimates is high.
In Section II we introduce the MAMAB problem and we propose an efficient sampling rule and a communication protocol for an agent to maximize its own total expected cumulative reward. We analyze the performance of the proposed sampling rule in Section III. In Section III-A we analytically upper bound the expected cumulative sampling regret and in Section III-B we analytically upper bound the expected observation regret. We present the upper bound for the total expected cumulative regret in Section III-C. In Section IV we provide numerical simulation results and computationally validate the analytical results. We conclude in Section V and provide additional mathematical details in the Appendix.
II Multi-agent Multi-armed Bandit Problem
In this section we present the mathematical formulation of the MAMAB problem studied here. Let $N$ be the number of options (arms) and $K$ the number of agents. Define $X_i$ as the random variable that denotes the reward associated with option $i \in \{1, \ldots, N\}$. In this paper we assume that all the reward distributions are sub-Gaussian. Let $\sigma_i^2$ be the variance proxy of $X_i$ and $\mu_i$ the expected reward of option $i$. Let $i^* = \operatorname*{arg\,max}_i \mu_i$ be the optimal option with the highest expected reward $\mu_{i^*}$. Each agent chooses one option at each time step with the goal of minimizing its cumulative regret. In MAB problems, cumulative regret is typically defined as cumulative sampling regret, which is determined by the expected number of times suboptimal options are selected. We let cumulative regret be the sum of cumulative sampling regret and a cumulative observation regret that accumulates a fixed cost for every observation of a neighbor.
We assume that the expected reward values are unknown and the variance proxy values are known to the agents. To improve its own performance, each agent observes its neighbors according to an observation protocol that we define. We use a network graph to encode hard observation constraints, and this defines the neighbors of agents. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph, where $\mathcal{V}$ is a set of nodes such that node $v_k$ corresponds to agent $k$ for $k \in \{1, \ldots, K\}$, and $\mathcal{E}$ is a set of edges between nodes in $\mathcal{V}$. If there is an edge between node $v_k$ and node $v_j$, then we say that agents $k$ and $j$ are neighbors. Since the graph is undirected, the neighbor relationship is symmetric. Let $\mathcal{N}_k$ be the set of neighbors of agent $k$ and $\eta_k = |\mathcal{N}_k|$ the number of neighbors of agent $k$.
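The undirected neighbor structure can be sketched as a simple adjacency map; the star example below anticipates the simulation setup of Section IV, and the names are illustrative.

```python
# Illustrative sketch (not the paper's notation): neighbor sets induced by an
# undirected edge set, here a star on 6 agents with agent 0 at the center.

def neighbors(num_agents, edges):
    """Build the neighbor set of every agent from undirected edges."""
    nbrs = {k: set() for k in range(num_agents)}
    for a, b in edges:          # undirected: each edge is symmetric
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs

star_edges = [(0, k) for k in range(1, 6)]
nbrs = neighbors(6, star_edges)
# center agent 0 has 5 neighbors; every leaf has exactly one (the center)
```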
Let $\varphi_t^k$ and $X_t^k$ be random variables that denote the option chosen by agent $k$ and the reward received by agent $k$ at time $t$, respectively. Let $\mathbb{1}\{\varphi_t^k = i\}$ be a random variable that takes value 1 if option $i$ is chosen by agent $k$ at time $t$ and is 0 otherwise. Let $\mathbb{1}^k_{j,t}$ be a random variable that takes value 1 if agent $k$ observes agent $j$ at time $t$ and is 0 otherwise.
In order to maximize the cumulative reward in the long run, agents need to both identify the best options through exploring and sample the best options through exploiting. Observing neighbors allows an agent to receive more information about options and hence obtain better estimates of the expected reward values of options. This leads to less exploring and more exploiting, which reduces the regret an agent receives due to sampling suboptimal options. However, since taking observations is costly, an agent is required to find a trade-off between the information gain and the cost associated with observations. Let $c > 0$ be the cost incurred by agent $k$ when it observes the instantaneous reward and choice of agent $j$ at time step $t$. In this paper we consider the case in which $c$ is the same constant for all agents and all observations.
Let the number of times that agent $k$ samples option $i$ until time $t$ be given by the random variable $n_i^k(t)$. And let the total number of times that agent $k$ observes rewards from option $i$ until time $t$ be given by the random variable $N_i^k(t)$, where
$$N_i^k(t) = n_i^k(t) + \sum_{j \in \mathcal{N}_k} \sum_{\tau=1}^{t} \mathbb{1}\left\{ \varphi_\tau^j = i \right\} \mathbb{1}^k_{j,\tau}.$$
We define a sampling rule based on the well known UCB (Upper Confidence Bound) rule for a single agent [3]. The UCB rule chooses the option at time $t+1$ that maximizes an objective function that is the sum of an exploit term, equal to the estimate of the reward mean at time $t$, and an explore term, equal to a measure of uncertainty in that estimate at time $t$. Our sampling rule for agent $k$ in the MAMAB problem accounts for the observations of neighbors by using them to improve its estimate and reduce its uncertainty. Let the estimate by agent $k$ of the expected reward from option $i$ at time $t$ be given by the random variable
$$\hat{\mu}_i^k(t) = \frac{S_i^k(t)}{N_i^k(t)},$$
where $S_i^k(t)$ is the total reward observed by agent $k$ from option $i$ until time $t$.
Definition 1 (Sampling rule). The sampling rule for agent $k$ at time $t+1$ is defined as
$$\varphi_{t+1}^k = \operatorname*{arg\,max}_{i} \left\{ \hat{\mu}_i^k(t) + C_i^k(t) \right\}, \qquad C_i^k(t) = \sigma_i \sqrt{\frac{2(\xi+1)\log t}{N_i^k(t)}},$$
where $\xi > 1$ is a tuning parameter that captures the trade-off between exploring and exploiting.
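As a rough sketch of such a UCB-style index (the exact exploration constant, the variance proxy `sigma`, the parameter `xi`, and the handling of unsampled options below are assumptions, not the paper's exact expression):

```python
import math

# Hedged sketch of a UCB-style index over all samples observed per option.
# sigma, xi, and the infinite index for unsampled options are assumptions.

def ucb_choice(means, counts, t, sigma=1.0, xi=1.0):
    """Pick the arm maximizing sample mean + exploration bonus."""
    def index(i):
        if counts[i] == 0:          # force at least one observation per arm
            return float("inf")
        bonus = sigma * math.sqrt(2 * (xi + 1) * math.log(t) / counts[i])
        return means[i] + bonus
    return max(range(len(means)), key=index)

# With equal counts the bonuses are equal, so the larger mean wins.
choice = ucb_choice([0.5, 0.9], [10, 10], t=100)
```

Note that extra observations from neighbors enter only through the counts and means, shrinking the exploration bonus without extra sampling.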
To find a balance between information gain and observation cost we define an observation rule for agents so that they choose to incur the cost of making observations of neighbors only when observations are most needed, i.e., when their own uncertainty is high. Under the following observation rule, an agent observes the instantaneous rewards and choices of all of its neighbors only when it is exploring, since it explores when uncertainty is high. If agent $k$ chooses the option at time $t+1$ that corresponds to the maximum of its estimates of reward means, $\operatorname*{arg\,max}_i \hat{\mu}_i^k(t)$, then it is exploiting and it does not observe its neighbors.

Definition 2 (Observation rule). The observation rule for agent $k$ at time $t+1$ and neighbor $j \in \mathcal{N}_k$ is defined as
$$\mathbb{1}^k_{j,t+1} = \begin{cases} 1, & \text{if } \varphi_{t+1}^k \ne \operatorname*{arg\,max}_i \hat{\mu}_i^k(t), \\ 0, & \text{otherwise.} \end{cases}$$
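The explore-triggered observation decision can be sketched as a one-line test; the function name and list representation are illustrative, not the paper's notation.

```python
# Hedged sketch of the explore-triggered observation rule: pay the cost and
# observe all neighbors only when the sampled option differs from the current
# greedy (highest-estimate) option, i.e., only when exploring.

def observe_neighbors(choice, means):
    """Return True iff the agent should observe its neighbors this step."""
    greedy = max(range(len(means)), key=means.__getitem__)
    return choice != greedy

# exploiting: the chosen arm equals the greedy arm, so no observation is made
exploit = observe_neighbors(1, [0.2, 0.8, 0.5])
# exploring: a non-greedy arm was sampled, so neighbors are observed
explore = observe_neighbors(2, [0.2, 0.8, 0.5])
```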
III Performance Analysis
In this section we analyze the cumulative regret of agent $k$ due to sampling suboptimal options and observing neighbors when employing the sampling rule of Definition 1 and the observation rule of Definition 2.
III-A Sampling Regret Analysis
Let $i \ne i^*$ be a suboptimal option. For any positive integer $\ell$, the total number of times agent $k$ samples from option $i$ until time $T$ can be upper bounded as
$$n_i^k(T) \le \ell + \sum_{t=1}^{T} \mathbb{1}\left\{ \varphi_t^k = i,\; N_i^k(t-1) \ge \ell \right\}.$$
Here $\mathbb{1}\{\cdot\}$ is an indicator function that takes value 1 if the statement in braces is true and 0 otherwise. Thus we have
$$\mathbb{E}\left[ n_i^k(T) \right] \le \ell + \sum_{t=1}^{T} \mathbb{P}\left( \varphi_t^k = i,\; N_i^k(t-1) \ge \ell \right).$$
Let $R_i^k(T)$ be the cumulative sampling regret of agent $k$ from option $i$ until time $T$. Recall that cumulative sampling regret is defined as the loss incurred by sampling suboptimal options. Define $\Delta_i = \mu_{i^*} - \mu_i$. Then we have
$$\mathbb{E}\left[ R_i^k(T) \right] = \Delta_i\, \mathbb{E}\left[ n_i^k(T) \right].$$
To analyze the expected number of samples from suboptimal options until time $T$, we first analyze concentration probability bounds on the estimates of the expected rewards of options, where $C_i^k(t)$ denotes the exploration term of the sampling rule for option $i$.

Lemma 1. For any option $i$ and for $\xi > 1$, there exists a constant $\kappa > 0$ such that
$$\mathbb{P}\left( \hat{\mu}_i^k(t) + C_i^k(t) \le \mu_i \right) \le \frac{\kappa \log t}{t^{2(\xi+1)}},$$
and, likewise, there exists a constant $\kappa > 0$ such that
$$\mathbb{P}\left( \hat{\mu}_i^k(t) - C_i^k(t) \ge \mu_i \right) \le \frac{\kappa \log t}{t^{2(\xi+1)}}.$$
We proceed to upper bound the summation of the probabilities of the events $\left\{ \varphi_t^k = i \right\}$ for $i \ne i^*$ as follows. Using equation (3) we have that the inequality $\hat{\mu}_i^k(t) + C_i^k(t) \ge \hat{\mu}_{i^*}^k(t) + C_{i^*}^k(t)$ implies that at least one of the following holds: $\hat{\mu}_i^k(t) \ge \mu_i + C_i^k(t)$, $\hat{\mu}_{i^*}^k(t) \le \mu_{i^*} - C_{i^*}^k(t)$, or $\Delta_i < 2 C_i^k(t)$. The last inequality does not hold for $N_i^k(t) \ge \ell_i(t)$, where
$$\ell_i(t) = \left\lceil \frac{8(\xi+1)\sigma_i^2 \log t}{\Delta_i^2} \right\rceil.$$
Thus we have
$$\mathbb{E}\left[ n_i^k(T) \right] \le \ell_i(T) + \sum_{t=1}^{T} \mathbb{P}\left( \hat{\mu}_i^k(t) \ge \mu_i + C_i^k(t) \right) + \sum_{t=1}^{T} \mathbb{P}\left( \hat{\mu}_{i^*}^k(t) \le \mu_{i^*} - C_{i^*}^k(t) \right).$$
From equation (5) and Lemma 1, the expected cumulative sampling regret of agent $k$ until time $T$ is upper bounded as
$$\mathbb{E}\left[ R_i^k(T) \right] \le \Delta_i \left( \ell_i(T) + \mathcal{O}(1) \right),$$
which is logarithmic in $T$.
III-B Observation Regret Analysis
Recall that $c$ is the constant unit cost associated with observations. Let $R_o^k(T)$ be the cumulative observation regret of agent $k$ at time step $T$. Then we have
$$R_o^k(T) = c \sum_{t=1}^{T} \sum_{j \in \mathcal{N}_k} \mathbb{1}^k_{j,t}.$$
Up to the constant $c$, this is equivalent to the number of observations taken by agent $k$ until time $T$. Expected cumulative observation regret can be expressed as
$$\mathbb{E}\left[ R_o^k(T) \right] = c \sum_{t=1}^{T} \sum_{j \in \mathcal{N}_k} \mathbb{P}\left( \mathbb{1}^k_{j,t} = 1 \right).$$
So expected cumulative observation regret can be upper bounded by upper bounding the expected number of observations until time $T$.
To analyze the expected number of observations, we use the fact that agent $k$ observes its neighbors at time $t$ only when it explores. We first upper bound the expected number of times agent $k$ observes its neighbors until time $T$ when it decides to explore after sampling a suboptimal option.
Lemma 2. For all suboptimal $i$ we have
$$\sum_{t=1}^{T} \mathbb{P}\left( \varphi_t^k = i,\; \varphi_t^k \ne \operatorname*{arg\,max}_{j} \hat{\mu}_j^k(t-1) \right) = \mathcal{O}(\log T).$$
The proof of Lemma 2 is given in the Appendix.
Next we analyze the expected number of times agent $k$ observes its neighbors until time $T$ when it decides to explore after sampling the optimal option.
Note that if agent $k$ explores at time $t+1$ after sampling the optimal option, then the optimal option does not have the highest estimated expected reward, so some suboptimal option $i$ satisfies $\hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t)$. Thus we have
$$\mathbb{P}\left( \varphi_{t+1}^k = i^*,\; i^* \ne \operatorname*{arg\,max}_{j} \hat{\mu}_j^k(t) \right) \le \sum_{i \ne i^*} \mathbb{P}\left( \hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t) \right).$$
From Lemma 1 we can bound each probability on the right-hand side, which leads to the following result.
Theorem 2. For all suboptimal options $i$ we have
$$\sum_{t=1}^{T} \mathbb{P}\left( \hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t) \right) = \mathcal{O}(\log T).$$
The proof of Theorem 2 is given in the Appendix.
Now we proceed to state the main result of this paper, which is that the total expected cumulative observation regret until time $T$ for agent $k$ employing the sampling rule given by Definition 1 and the observation rule given by Definition 2 is upper bounded logarithmically in $T$.

Theorem 3. Expected cumulative observation regret until time $T$ for agent $k$ can be upper bounded as
$$\mathbb{E}\left[ R_o^k(T) \right] \le c\, \eta_k \left( \sum_{i \ne i^*} \mathbb{E}\left[ n_i^k(T) \right] + \sum_{i \ne i^*} \sum_{t=1}^{T} \mathbb{P}\left( \hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t) \right) \right) = \mathcal{O}(\log T).$$
For the probabilistic observation strategy of [11] the expected cumulative observation regret until time $T$ for agent $k$ is linear in $T$:
$$\mathbb{E}\left[ R_o^k(T) \right] = c\, \eta_k\, p_k\, T,$$
where $p_k$ is the observation probability of agent $k$. Thus, our proposed sampling rule and observation rule outperform these strategies when there are cumulative observation costs.
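The gap between a linear and a logarithmic observation regret can be made concrete with a quick numerical comparison; all the numbers below (`c`, `eta`, `p`, and the constant `kappa` in front of the logarithm) are hypothetical.

```python
import math

# Illustrative comparison (all numbers hypothetical): observation regret
# growing linearly, c * eta * p * T, versus a bound proportional to
# c * eta * kappa * log(T).

def linear_obs_regret(c, eta, p, T):
    return c * eta * p * T

def log_obs_regret_bound(c, eta, kappa, T):
    return c * eta * kappa * math.log(T)

T = 10_000
lin = linear_obs_regret(c=0.05, eta=5, p=0.1, T=T)        # grows without bound
log_bound = log_obs_regret_bound(c=0.05, eta=5, kappa=20, T=T)
# for large T the logarithmic bound falls far below the linear regret
```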
III-C Total Expected Cumulative Regret
Total expected cumulative regret is defined as the summation of expected cumulative sampling regret and expected cumulative observation regret until time $T$:
$$\mathbb{E}\left[ R^k(T) \right] = \sum_{i \ne i^*} \Delta_i\, \mathbb{E}\left[ n_i^k(T) \right] + \mathbb{E}\left[ R_o^k(T) \right].$$
Let $\Delta_{\max} = \max_{i \ne i^*} \Delta_i$. Total expected cumulative regret until time $T$ of agent $k$ is upper bounded as
$$\mathbb{E}\left[ R^k(T) \right] \le \left( \Delta_{\max} + c\, \eta_k \right) \mathcal{O}(\log T),$$
i.e., logarithmically in $T$.
IV Simulation Results
In this section we present numerical simulation results for a network of 6 agents with underlying observation structure defined by the star graph: the center agent observes all other agents and all other agents only observe the center agent. Agents other than the center agent are interchangeable and their average regret and individual regret are the same. We present numerical simulations to evaluate the performance of the sampling rule and observation rule given by Definitions 1 and 2.
The 6 agents play the same MAB problem with 10 options. In all simulations the reward distributions are Gaussian with a common variance and distinct mean values across options. The observation cost $c$ and the sampling rule parameter $\xi$ are held fixed across simulations. We provide results for 1000 time steps with 1000 Monte Carlo simulations.
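A minimal sketch of this simulation setup follows; the arm means, variance, cost, and tuning parameter below are placeholders, not the paper's exact numbers, and the update order (choose, then share rewards with neighbors when exploring) is one reasonable reading of the rules above.

```python
import math
import random

# Hedged sketch: 6 agents on a star graph play a 10-armed Gaussian bandit
# with a UCB-style index; neighbors are observed only on explore steps.
# All numeric values are placeholders, not the paper's settings.

def run(T=1000, seed=0):
    random.seed(seed)
    num_agents, num_arms = 6, 10
    mu = [i / 10 for i in range(num_arms)]          # arm 9 is the best arm
    sigma, cost, xi = 1.0, 0.05, 1.0
    nbrs = {0: list(range(1, 6)), **{k: [0] for k in range(1, 6)}}
    counts = [[0] * num_arms for _ in range(num_agents)]
    sums = [[0.0] * num_arms for _ in range(num_agents)]
    sample_regret = [0.0] * num_agents
    obs_regret = [0.0] * num_agents

    for t in range(1, T + 1):
        choices, rewards, exploring = [], [], []
        for k in range(num_agents):
            def index(i):
                if counts[k][i] == 0:               # force initial sampling
                    return float("inf")
                mean = sums[k][i] / counts[k][i]
                return mean + sigma * math.sqrt(
                    2 * (xi + 1) * math.log(t) / counts[k][i])
            arm = max(range(num_arms), key=index)
            greedy = max(range(num_arms),
                         key=lambda i: sums[k][i] / counts[k][i]
                         if counts[k][i] else 0.0)
            choices.append(arm)
            rewards.append(random.gauss(mu[arm], sigma))
            exploring.append(arm != greedy)         # explore-step detector
            sample_regret[k] += max(mu) - mu[arm]
        for k in range(num_agents):
            counts[k][choices[k]] += 1
            sums[k][choices[k]] += rewards[k]
            if exploring[k]:                        # observe all neighbors
                for j in nbrs[k]:
                    counts[k][choices[j]] += 1
                    sums[k][choices[j]] += rewards[j]
                    obs_regret[k] += cost
    return sample_regret, obs_regret

sample_regret, obs_regret = run(T=200)
```

Plotting `sample_regret` and `obs_regret` against $T$ over repeated runs reproduces the qualitative picture described below: regret grows roughly logarithmically and the center agent pays the most observation cost.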
Figure 1 shows simulation results for the expected cumulative sampling regret of a group of 6 agents using the proposed sampling and observation rules. The blue dashed line shows regret of the center agent. The green dash-dot line shows the average regret of the agents not in the center. The red dotted line shows the average expected cumulative sampling regret over all agents. It can be observed that the expected cumulative sampling regret is logarithmic in time. For comparison, we plot the average expected cumulative regret of the agents when they make no observations of neighbors (solid gold line). When agents are not making observations they are interchangeable, and so the average performance and the individual performance are the same. The simulation results illustrate that the performance of every agent improves significantly when it observes neighbors according to the proposed protocol. The simulation results further show that the center agent outperforms the other agents. This is to be expected since the center agent has more neighbors than the other agents.
Figure 2 shows simulation results for expected observation regret. It can be seen that the expected observation regret is logarithmic in time, as proved in Theorem 3. Since the center agent has more neighbors than the other agents, its observation regret is the highest. However, the results illustrate that when the observation cost is small, a significant performance improvement can be obtained for a small observation regret.
We studied an MAMAB problem where agents can observe the instantaneous choices and rewards of their neighbors but incur a cost each time they make an observation of a neighbor. We proposed a sampling rule and an observation rule in which an agent observes its neighbors only when it has decided to explore. We defined total expected cumulative regret to be the regret agents receive due to sampling suboptimal options and to observing neighbors. Deterministic and stochastic observation strategies for MAB protocols in the literature yield an expected cumulative observation regret that is linear in time $T$. We analytically proved that under the proposed sampling and observation rules, total expected cumulative regret of each agent is bounded logarithmically in $T$. The accuracy of the upper bounds has been verified computationally through numerical simulations.
Proof of Theorem 2
Let $i$ be a suboptimal option with the highest estimated expected reward for agent $k$ at time $t$. Then we have $\hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t)$ and $i \ne i^*$. If the agent chooses option $i^*$ at time step $t+1$ we have $\hat{\mu}_{i^*}^k(t) + C_{i^*}^k(t) \ge \hat{\mu}_i^k(t) + C_i^k(t)$. Thus we have $C_{i^*}^k(t) \ge C_i^k(t)$.

Note that for the event $\left\{ \hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t) \right\}$ to occur, at least one of the following must hold:
$$\hat{\mu}_i^k(t) \ge \mu_i + C_i^k(t), \qquad \hat{\mu}_{i^*}^k(t) \le \mu_{i^*} - C_{i^*}^k(t), \qquad \Delta_i < C_i^k(t) + C_{i^*}^k(t).$$
Let $\ell_i(t) = \left\lceil 8(\xi+1)\max\{\sigma_i^2, \sigma_{i^*}^2\} \log t / \Delta_i^2 \right\rceil$. Then the third inequality fails once $\min\left\{ N_i^k(t), N_{i^*}^k(t) \right\} \ge \ell_i(t)$, which holds for all but at most $\mathcal{O}(\log T)$ time steps. Since, by Lemma 1, the probabilities of the first two events are summable in $t$, we have
$$\sum_{t=1}^{T} \mathbb{P}\left( \hat{\mu}_i^k(t) \ge \hat{\mu}_{i^*}^k(t) \right) = \mathcal{O}(\log T),$$
which proves the claim.
[1] R. Agrawal (1995) Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078.
[2] V. Anantharam, P. Varaiya, and J. Walrand (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays — Part I: IID rewards. IEEE Transactions on Automatic Control 32 (11), pp. 968–976.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
[4] D. Kalathil, N. Nayyar, and R. Jain (2014) Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory 60 (4), pp. 2331–2345.
[5] E. Kaufmann, O. Cappé, and A. Garivier (2012) On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pp. 592–600.
[6] R. K. Kolla, K. Jagannathan, and A. Gopalan (2018) Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking 26 (4), pp. 1782–1795.
[7] T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
[8] P. Landgren, V. Srivastava, and N. E. Leonard (2016) Distributed cooperative decision-making in multiarmed bandits: frequentist and Bayesian algorithms. In IEEE Conference on Decision and Control (CDC), pp. 167–172.
[9] P. Landgren, V. Srivastava, and N. E. Leonard (2016) On distributed cooperative decision-making in multiarmed bandits. In European Control Conference (ECC), pp. 243–248.
[10] P. Landgren, V. Srivastava, and N. E. Leonard (2018) Social imitation in cooperative multiarmed bandits: partition-based algorithms with strictly local information. In IEEE Conference on Decision and Control (CDC), pp. 5239–5244.
[11] U. Madhushani and N. E. Leonard (2019) Heterogeneous stochastic interactions for multiple agents in a multi-armed bandit problem. In European Control Conference (ECC), pp. 3502–3507.
[12] P. Reverdy, V. Srivastava, and N. E. Leonard (2014) Modeling human decision making in generalized Gaussian multiarmed bandits. Proceedings of the IEEE 102 (4), pp. 544–571.
[13] H. Robbins (1952) Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58 (5), pp. 527–535.
[14] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge.