Heterogeneous Stochastic Interactions for Multiple Agents in a Multi-armed Bandit Problem

05/21/2019 ∙ by Udari Madhushani, et al. ∙ Princeton University

We define and analyze a multi-agent multi-armed bandit problem in which decision-making agents can observe the choices and rewards of their neighbors. Neighbors are defined by a network graph with heterogeneous and stochastic interconnections. These interactions are determined by the sociability of each agent, which corresponds to the probability that the agent observes its neighbors. We design an algorithm for each agent to maximize its own expected cumulative reward and prove performance bounds that depend on the sociability of the agents and the network structure. We use the bounds to predict the rank ordering of agents according to their performance and verify the accuracy analytically and computationally.




I Introduction

Animal and robotic foraging problems that involve searching over spatially distributed patches with uncertain distribution of resource (reward) result in the explore-exploit dilemma. Each animal or robotic forager chooses which patch to sample, sequentially in time. The dilemma at every time in the sequence is whether to choose a well sampled patch that is expected to reap the highest reward (exploit) or to choose a poorly sampled patch that is expected to reduce uncertainty and possibly identify an option with an even higher reward (explore). Choices to explore can be costly or risky since they may not return much reward, but successful exploiting requires sufficient information that comes from exploration. The challenge is sorting out protocols for when and where to explore versus when and where to exploit.

In decision theory, multi-armed bandit (MAB) problems serve as models that capture the salient features of the explore-exploit trade-off [1, 2]. Thus, advances in addressing MAB problems directly benefit understanding and solving foraging and search problems. The MAB problem is analogous to a scenario in which an agent is repeatedly faced with several different options, each returning uncertain reward, and aims to make a sequence of choices to maximize the cumulative reward [3]. This is equivalent to minimizing the cumulative regret [4].
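The equivalence between maximizing cumulative reward and minimizing cumulative regret can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name, the three mean rewards, and the sample counts are hypothetical and do not come from the paper.

```python
def expected_cumulative_regret(means, expected_counts):
    """Sum over options of (best mean - option mean) * expected number of samples."""
    best = max(means)
    return sum((best - m) * n for m, n in zip(means, expected_counts))


# Hypothetical three-option problem: mean rewards and how often each was sampled.
means = [0.2, 0.5, 0.9]
counts = [10, 20, 70]

regret = expected_cumulative_regret(means, counts)
expected_reward = sum(m * n for m, n in zip(means, counts))

# Always playing the best option would earn max(means) * sum(counts) in expectation;
# the shortfall relative to that benchmark is exactly the regret.
shortfall = max(means) * sum(counts) - expected_reward
```

Because the benchmark term does not depend on the agent's choices, any rule that lowers the regret raises the expected cumulative reward by the same amount.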

In their seminal work, Lai and Robbins [4] established a lower bound for the expected cumulative regret in the finite time horizon case. Specifically, they derived a logarithmic lower bound for the expected number of times a sub-optimal option needs to be sampled by an optimal sampling rule. They also established a confidence bound and a sampling rule to achieve logarithmic cumulative regret. These results were further simplified in [5] by introducing a confidence bound using a sample mean based method. Improving on these results, a family of Upper Confidence Bound (UCB) algorithms for achieving asymptotic and uniform logarithmic cumulative regret was proposed in [6]. These UCB algorithms are based on the notion that minimizing the expected cumulative regret is realized by choosing an appropriate uncertainty model, which results in an optimal trade-off between reward gain and information gain through uncertainty.

For the standard MAB problem, the reward associated with an option is considered to be an i.i.d. stochastic process. Therefore, in the frequentist setting, the natural way of estimating the expectation of the reward is to consider the sample average [4, 5, 6]. The papers [7, 8] present how to incorporate prior knowledge about reward expectation in the estimation step by leveraging the theory of conditional expectation in the Bayesian setting.
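For a single agent, a sample-average estimate combined with an upper confidence bound gives the UCB1 rule of [6]. The following is a minimal sketch for rewards in [0, 1]; the Bernoulli options, the seed, and the horizon are hypothetical choices made only for illustration.

```python
import math
import random


def ucb1(reward_fns, horizon, rng):
    """Run UCB1 and return how many times each option was sampled."""
    k = len(reward_fns)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            choice = t - 1  # sample every option once to initialize the averages
        else:
            # Index = sample average + confidence width sqrt(2 ln t / n_i).
            choice = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = reward_fns[choice](rng)
        counts[choice] += 1
        sums[choice] += reward
    return counts


rng = random.Random(0)
# Hypothetical Bernoulli options with success probabilities 0.2, 0.5 and 0.8.
options = [lambda r, p=p: 1.0 if r.random() < p else 0.0 for p in (0.2, 0.5, 0.8)]
counts = ucb1(options, horizon=2000, rng=rng)
```

The confidence width shrinks as an option accumulates samples, so poorly sampled options are revisited occasionally while most samples go to the option with the highest average.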

The papers [9, 10] extend this work to the multi-agent setting where the explore-exploit problem is defined for a group of agents. The objective is to understand how the performance of individuals or the group can benefit from inter-agent observations. A centralized multi-agent setting is considered in [9] and a decentralized setting is considered in [11]. The papers [12, 10] use a running consensus algorithm in which agents observe the estimates of their neighbors. A multi-agent multi-armed bandit (MAMAB) problem, where agents observe instantaneous rewards and actions in a leader-follower setting, is considered in [13, 14].

In the above works, agents observe their neighbors, defined according to a static network graph. In this paper we consider a MAMAB problem where agents observe instantaneous rewards and actions of their neighbors through stochastic interactions. Observing rewards and actions rather than estimates is motivated by the foraging success of social groups where agents can observe neighbors even when their neighbors do not necessarily want to share what they know [15]. Further, we assume that each agent only observes its neighbors with some probability. This provides a framework for evaluating efficiency and robustness to changes in communication in terms of these observation probabilities.

The setting is formulated by defining an underlying undirected network graph and imposing directed observation probabilities on each edge. The underlying graph models the inherent observation constraints present in the network. Imposed observation probabilities capture the heterogeneous social effort of agents in observing neighbors. We introduce the notion of sociability of each agent as the likelihood that the agent observes its neighbors. We derive analytical upper bounds for expected cumulative regret and propose a measure to rank agents according to their relative performance as a function of each agent’s sociability and the sociability of its neighbors. We show that our model predicts how high performance requires an agent to have both high sociability and neighbors with low sociability. This is an important result: it implies that making an investment to observe neighbors may be worthwhile only if those neighbors are sufficiently explorative.

In Section II we introduce the MAMAB problem with time-varying (stochastic) observation structure. We propose an efficient sampling rule for an agent to maximize cumulative expected reward. We analyze the performance of the proposed sampling rule in Section III. In Section III-A we analytically upper bound the expected cumulative regret and in Section III-B we propose a measure to predict the ranks of agents according to their relative performance. In Section IV we provide numerical simulation results and computationally validate the analytical results. We conclude in Section V and provide additional mathematical details in the Appendix.

II Multi-agent Multi-armed Bandit Problem

Consider a MAMAB in which a group of agents faces a common set of arms, which represent options (patches). The reward associated with each option is a sub-Gaussian random variable with an option-specific variance proxy, and the expected reward of an option is the mean of this random variable. The optimal option is the one with the highest expected reward. Each agent chooses one option at each time step with the goal of maximizing its cumulative reward, which is equivalent to minimizing its cumulative regret. We assume that the expected rewards are unknown and the variance proxies are known to the agents (e.g., if the variance proxy is the magnitude of sensor noise).

Let the observation structure of the system be encoded by an undirected graph whose vertices are the agents. If there exists an observation link between two distinct agents, then the corresponding edge belongs to the graph, and we say the two agents are neighbors. For social or robotic groups searching over physical space, observation links may exist between pairs of agents when they are sufficiently close to be visible (or otherwise observable) to each other. Each agent is characterized by its set of neighbors and by the number of neighbors it has.

At each time step, the option chosen and the reward received by each agent are random variables, and successive rewards from an option are i.i.d. copies of the reward random variable of that option. We work on a probability space equipped with an increasing sequence of sub-sigma-algebras, in which the sigma algebra at each time is generated by the information available at that time. Two families of measurable indicator random variables describe the process. The first takes the value one if a given option is chosen by a given agent at a given time and is zero otherwise. The second takes the value one if a given agent can observe a given neighbor at a given time and is zero otherwise.

Each agent observes the instantaneous actions and rewards of its neighbors with an agent-specific probability, so the observation indicator of an agent for any of its neighbors has expectation equal to this probability. An agent that has a high observation probability is more likely to obtain observations from its neighbors. We introduce the notion of an agent’s sociability to refer to the value of this probability: we interpret agents with high values to be more sociable and agents with low values to be less sociable.
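One way to picture the stochastic observation structure is to draw, at every time step, a directed set of realized observations on top of the fixed undirected graph. The sketch below assumes that an agent observes each of its neighbors independently with its own sociability; that independence, the function name, and the three-agent line graph are assumptions made for illustration.

```python
import random


def sample_observations(neighbors, sociability, rng):
    """Draw one time step of directed observation indicators.

    neighbors[k] lists the neighbors of agent k in the undirected graph and
    sociability[k] is the probability that agent k observes a neighbor.
    Assumption: each neighbor is observed independently with that probability.
    """
    observed = {}
    for k, nbrs in neighbors.items():
        observed[k] = [j for j in nbrs if rng.random() < sociability[k]]
    return observed


# Hypothetical line graph 0 - 1 - 2 with three different sociability values.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
sociability = {0: 0.0, 1: 1.0, 2: 0.5}
rng = random.Random(1)
observed = sample_observations(neighbors, sociability, rng)
```

The realized observations are directed: agent 1 always sees agents 0 and 2 here, while agent 0 never sees agent 1 even though the underlying edge is undirected.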

In order to maximize the cumulative reward in the long run, agents need to both identify the best options through exploring and sample the best options through exploiting. Since agents with high sociability values are more likely to obtain a greater number of observations, they can identify the best options with less exploring. However, the usefulness of their observations is affected by the sociability values of their neighbors. Better performance can be obtained by an agent when it observes neighbors that do a lot of exploring (because they are less sociable), as compared to when it observes agents that do a lot of exploiting (because they are more sociable). That is, the agent will be able to exploit more, without compromising performance, when it has neighbors that are less sociable.

For each agent and option, we track two measurable random variables: the number of times the agent itself samples the option over a given number of trials, and the total number of times the agent observes rewards from the option over those trials, which counts its own samples together with the samples of the neighbors that it observes.

We define a sampling rule for each agent as follows. The agent's estimate of an option at a given time is a measurable random variable, defined as the total reward the agent has observed from that option divided by the total number of times it has observed rewards from that option.
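The bookkeeping behind this estimate can be sketched as follows. The function and variable names are hypothetical; the only thing taken from the text is that an agent's estimate of an option is the total reward it has observed from that option divided by the number of observations, where observations include both its own samples and those of observed neighbors.

```python
def update_statistics(counts, totals, own, observed_neighbors):
    """Fold one time step of observations into an agent's running statistics.

    counts[i]           number of rewards observed so far from option i
    totals[i]           sum of those rewards
    own                 (option, reward) pair sampled by the agent itself
    observed_neighbors  (option, reward) pairs of the neighbors observed this step
    """
    for option, reward in [own] + list(observed_neighbors):
        counts[option] += 1
        totals[option] += reward


def estimates(counts, totals):
    """Sample-mean estimate per option (None until the option has been observed)."""
    return [s / n if n > 0 else None for n, s in zip(counts, totals)]


counts, totals = [0, 0], [0.0, 0.0]
# Step 1: the agent samples option 0 and observes two neighbors.
update_statistics(counts, totals, own=(0, 1.0), observed_neighbors=[(0, 3.0), (1, 4.0)])
# Step 2: the agent samples option 1 and observes nobody.
update_statistics(counts, totals, own=(1, 2.0), observed_neighbors=[])
mu_hat = estimates(counts, totals)
```

After two of its own samples the agent already holds four observations, which is how a sociable agent can estimate the options with less exploration of its own.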

Definition 1

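Under the sampling rule, each agent chooses at each time the option that maximizes an objective function, consisting of the agent's current estimate of the option plus an uncertainty term. The uncertainty term depends on a tuning parameter and on a sublogarithmic, nondecreasing, nonnegative function.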
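As an illustration of an upper-confidence-bound sampling rule of this kind, the sketch below picks the option with the largest value of estimate plus exploration bonus. The bonus used here is a generic UCB-style stand-in, and the constant `xi`, the argument names, and the numbers in the usage lines are assumptions made for illustration; it should not be read as the expression in Definition 1.

```python
import math


def choose_option(mu_hat, counts, t, sigma, xi=1.0):
    """Pick the option with the largest index: estimate plus exploration bonus.

    mu_hat[i]  current estimate of option i
    counts[i]  number of rewards observed from option i (own and neighbors')
    t          current time step (t >= 1)
    sigma[i]   known noise scale of option i
    xi         hypothetical tuning constant for the bonus
    """
    # Any option that has never been observed is tried first.
    for i, n in enumerate(counts):
        if n == 0:
            return i

    def index(i):
        bonus = sigma[i] * math.sqrt(2.0 * (xi + 1.0) * math.log(t) / counts[i])
        return mu_hat[i] + bonus

    return max(range(len(counts)), key=index)


# With equal observation counts the bonus is equal, so the higher estimate wins.
exploit = choose_option([2.0, 1.0], [10, 10], t=20, sigma=[1.0, 1.0])
# A rarely observed option gets a large bonus and is selected for exploration.
explore = choose_option([1.0, 1.2], [50, 2000], t=2050, sigma=[1.0, 1.0])
```

Because `counts` includes observations of neighbors, an agent that observes a lot sees its bonuses shrink faster and switches to exploiting sooner.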

III Performance Analysis

In this section we proceed to analyze the performance of the proposed sampling rule by analyzing the expected cumulative regret of the agents. Recall that the expected cumulative regret depends on the expected number of times the agents sample suboptimal options and the goal of each agent is to maximize its individual cumulative reward.

III-A Regret Analysis

Let be a suboptimal option. The total number of times agent samples from option can be upper bounded as follows:

Here is an indicator function that takes value one if the objective function of option is greater than the objective function of the optimal option and zero otherwise, i.e.,

Thus we have

Let be the cumulative regret of agent from option . Define Then we have, from [16],


We proceed to analyze the expected number of samples from suboptimal options as follows.

First we note that we have


Next we proceed to analyze concentration probability bounds on the estimates of options.

Theorem 1

For any and for there exists a such that


Proof of Theorem 1: See Appendix.
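For orientation, the classical fixed-sample counterpart of such a concentration bound is easy to state. The symbols below are assumptions introduced for this illustration: $X_1, \dots, X_n$ are i.i.d. sub-Gaussian rewards with mean $\mu$ and variance proxy $\sigma^2$, and $n$ is a fixed, non-random number of samples.

```latex
\mathbb{E}\left[ e^{\lambda (X_s - \mu)} \right] \le e^{\lambda^2 \sigma^2 / 2}
\quad \text{for all } \lambda \in \mathbb{R}
\qquad \Longrightarrow \qquad
\mathbb{P}\left( \frac{1}{n} \sum_{s=1}^{n} X_s - \mu \ge a \right)
\le \exp\left( - \frac{n a^2}{2 \sigma^2} \right),
\quad a > 0 .
```

The implication follows from applying the Markov inequality to the exponential of the centered sum and optimizing over $\lambda$. In the multi-agent setting the number of observations is itself random and depends on the history, which appears to be why the proof in the Appendix needs additional machinery beyond this fixed-sample argument.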

Using symmetry we conclude that

Lemma 1

For , with , there exists a such that

Proof of Lemma 1

Define where is a monotonically decreasing function and . This implies that since . Choose such that where Then such that This proves that such that . The lemma follows from Theorem 1.

We proceed to upper bound the summation of the probabilities of the events for as follows. Using the equation (3) we have that the inequality implies

This inequality does not hold for , where

Thus we have


Using probability bounds given in Lemma 1 and equation (7) we prove our main result on performance bounds.

Theorem 2

The total expected number of times agent samples suboptimal option until time is upper bounded as

where and

Proof of Theorem 2: See Appendix.

Since the function appearing in the sampling rule is sublogarithmic, the expected number of suboptimal samples is logarithmically bounded.

III-B Performance Measure

In this section we provide a measure to rank the agents according to their relative performance. We motivate with the following two cases for a set of four agents where there is an underlying all-to-all observation structure:

Sociability    Agent 1   Agent 2   Agent 3   Agent 4
Case 1         0.5       0         0         0
Case 2         0.5       1         1         1


In Case 1, neighbors of agent 1 are not at all sociable and in Case 2 they are maximally sociable. Therefore neighbors of agent 1 explore more in Case 1 than in Case 2. As a result, in Case 1, agent 1 tends to obtain observations from neighbors about lesser known options and this allows agent 1 to exploit more. In Case 2, agent 1 tends to obtain observations from neighbors about well sampled options, and this forces agent 1 to explore more. As a result agent 1 performs better in Case 1 as illustrated in Figure 1.

Fig. 1: Expected cumulative regret of agent 1 in Cases 1 and 2.

With this intuition we propose a measure as follows. First we restrict our attention to a class of problems where the underlying observation structure of the agents is a symmetric regular graph. This means that , i.e., every agent has the same number of neighbors. The all-to-all graph is a special case of the class of regular graphs with We also assume

We define performance measure for agent as


Our goal is to show that a lower value of the performance measure implies a lower cumulative regret and therefore higher performance for the agent. The measure is inversely related to the agent’s own sociability and directly related to the sociability of its neighbors. It then makes sense intuitively that a lower value implies higher performance: the higher the agent's sociability, the more it observes, and the lower its neighbors' sociability, the more they explore and the more valuable their information is.

We next design a protocol for each agent that depends on its performance measure and show in Corollary 1 that the bound on the agent’s cumulative regret is directly related to this measure, which suggests that the ordering of agents by performance measure predicts the ordering of agents by performance. We plan to prove in a future publication that the same ordering is predicted even when the protocol does not depend on the performance measure. Simulations in Section IV provide validation of these assertions.

Let Then the objective function becomes

This assumes that each agent knows the sociability of each of its neighbors. From Theorem 2, the expected number of times an agent samples a suboptimal option is bounded as

This suggests that lower values of the performance measure correspond to lower cumulative expected regret and hence better performance. Using this bound and equation (4) we upper bound the cumulative expected regret of each agent as follows.

Corollary 1

Let be the regret of agent up to time . Then we have

Expected cumulative regret is then logarithmically bounded.

The performance measure can be bounded as

Accounting for the sociability of neighbors, as we have done, provides a more accurate and tighter bound than only considering individual sociability values. However, for an all-to-all observation structure the rank order can be predicted using only individual sociability values as shown next.

Lemma 2

Let be an all-to-all graph. Let , be the sociability of agents , such that Then

Proof of Lemma 2

Since and we have

By equation (8), this proves that

IV Numerical Simulations

We ran numerical simulations to evaluate the performance of sampling rule (1)–(III-B) with for agent and with . The results in both cases verify the accuracy of agent ranks predicted by the performance measure : lower corresponds to lower cumulative regret, hence higher performance. We show plots in the case .

We consider 6 agents playing 10-armed bandit problems with two distinct observation structures: A) an all-to-all graph and B) a cyclic graph. In all simulations we let the reward distributions be Gaussian with variance , , mean values given by

Option        1    2    3    4    5    6    7    8    9    10
Mean reward   40   50   50   60   70   70   80   90   92   95


and sociability values given by

Agent         1      2      3      4      5      6
Sociability   0.50   0.85   0.05   0.50   1.00   0.90


We provide results for 500 time steps with 1000 Monte Carlo simulations. We set the sampling rule parameter .
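A simulation of this kind can be sketched in plain Python as follows. The mean rewards and sociability values are the ones listed above; everything else is an assumption made for illustration: the reward standard deviation `SIGMA`, the generic UCB-style bonus with constant `xi`, independent observation of each neighbor, and the use of 20 runs instead of 1000 to keep the sketch fast.

```python
import math
import random

MEANS = [40, 50, 50, 60, 70, 70, 80, 90, 92, 95]
SOCIABILITY = [0.50, 0.85, 0.05, 0.50, 1.00, 0.90]
SIGMA = 10.0  # assumed reward standard deviation, chosen only for this sketch


def simulate(neighbors, horizon, rng, xi=1.0):
    """Return the cumulative regret of each agent after `horizon` steps."""
    n_agents, n_options = len(SOCIABILITY), len(MEANS)
    counts = [[0] * n_options for _ in range(n_agents)]
    totals = [[0.0] * n_options for _ in range(n_agents)]
    regret = [0.0] * n_agents
    best = max(MEANS)
    for t in range(1, horizon + 1):
        plays = []
        for k in range(n_agents):
            unseen = [i for i in range(n_options) if counts[k][i] == 0]
            if unseen:
                choice = unseen[0]  # try options never observed so far
            else:
                choice = max(
                    range(n_options),
                    key=lambda i: totals[k][i] / counts[k][i]
                    + SIGMA * math.sqrt(2 * (xi + 1) * math.log(t) / counts[k][i]),
                )
            plays.append((choice, rng.gauss(MEANS[choice], SIGMA)))
            regret[k] += best - MEANS[choice]
        # All agents choose first; observations are exchanged afterwards.
        for k in range(n_agents):
            seen = [plays[k]] + [
                plays[j] for j in neighbors[k] if rng.random() < SOCIABILITY[k]
            ]
            for option, reward in seen:
                counts[k][option] += 1
                totals[k][option] += reward
    return regret


all_to_all = {k: [j for j in range(6) if j != k] for k in range(6)}
rng = random.Random(0)
runs = 20
mean_regret = [0.0] * 6
for _ in range(runs):
    for k, r in enumerate(simulate(all_to_all, horizon=500, rng=rng)):
        mean_regret[k] += r / runs
```

Swapping `all_to_all` for a dictionary describing a different neighbor structure reuses the same loop, since the graph enters only through `neighbors`.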

IV-A All-to-all observation

The underlying observation graph structure is all-to-all, equivalently a 5-regular graph. We calculate the performance measure for each agent using equation (8):

Agent                 1       2       3       4       5       6
Performance measure   0.542   0.415   0.825   0.542   0.374   0.401


Fig. 2: Expected cumulative regret of the 6 agents using the sampling rule given in (1)–(III-B) and performance measure defined in (8) with distinct observation probabilities and underlying all-to-all observation structure.

The best predicted performer is agent 5, with the lowest performance measure (0.374) and the highest sociability value (1.00). Second and third best predicted performers are agent 6 and agent 2, respectively, with second and third highest sociability. Agents 1 and 4 are ranked next, with equal sociability (0.50) and equal performance measure (0.542). The worst predicted performer is agent 3, with the lowest sociability (0.05) and the highest performance measure (0.825). These predictions on performance ranking are verified in the simulation results of Figure 2. The results also verify that for an underlying all-to-all graph, the performance rank ordering is predicted by the sociability rank ordering.

IV-B Cyclic observation

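The underlying observation graph structure is a cycle, equivalently a 2-regular graph, in which each agent neighbors the agents immediately before and after it in the numbering, with agents 6 and 1 adjacent. We calculate the performance measure for each agent using (8):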

Agent                 1       2       3       4       5       6
Performance measure   0.624   0.284   0.783   0.483   0.418   0.456


The best predicted performer is agent 2 with the lowest performance measure but not the highest sociability. In fact, agents 5 and 6 have higher sociability than agent 2. However, while all three agents 2, 5, and 6 have one neighbor with sociability 0.5, the other neighbor of agents 5 and 6 has sociability 0.9 and 1, respectively, whereas the other neighbor of agent 2 has sociability 0.05. The very low sociability of one of agent 2’s neighbors improves agent 2’s performance significantly enough that it outperforms agents 5 and 6. This result illustrates the important role of sociability of neighbors in an agent’s performance. Further, while agents 1 and 4 are indistinguishable by their own sociability, agent 4 has a neighbor with sociability 0.05 whereas agent 1 has neighbors with relatively high sociability. Therefore as predicted by the performance measure, agent 4 outperforms agent 1. Figure 3 validates the predicted rankings.
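The tabulated performance measures for both graphs can be reproduced to within about 0.001 by a simple closed form: the square root of the mean sociability of an agent's neighbors, divided by one plus the agent's own sociability. This expression is inferred here from the two tables and should be read as an assumption, not as equation (8) itself. The cycle is likewise assumed to connect consecutively numbered agents, with agents 6 and 1 adjacent, which matches the neighbor sociabilities quoted in the discussion above.

```python
import math

SOCIABILITY = [0.50, 0.85, 0.05, 0.50, 1.00, 0.90]  # agents 1..6, from the text


def performance_measure(neighbors, p):
    """Inferred form: sqrt(mean neighbor sociability) / (1 + own sociability)."""
    return [
        math.sqrt(sum(p[j] for j in nbrs) / len(nbrs)) / (1.0 + p[k])
        for k, nbrs in enumerate(neighbors)
    ]


n = len(SOCIABILITY)
all_to_all = [[j for j in range(n) if j != k] for k in range(n)]
cycle = [[(k - 1) % n, (k + 1) % n] for k in range(n)]

eps_all = performance_measure(all_to_all, SOCIABILITY)
eps_cycle = performance_measure(cycle, SOCIABILITY)

# Values reported in the two tables of the simulation section.
reported_all = [0.542, 0.415, 0.825, 0.542, 0.374, 0.401]
reported_cycle = [0.624, 0.284, 0.783, 0.483, 0.418, 0.456]

max_gap = max(
    abs(a - b)
    for a, b in zip(eps_all + eps_cycle, reported_all + reported_cycle)
)
```

Under this inferred form the measure is bounded above by one over one plus the agent's own sociability, and on the all-to-all graph a more sociable agent always has the smaller measure, which is consistent with the qualitative statements in Section III-B.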

Fig. 3: Expected cumulative regret of the 6 agents using the sampling rule given in (1)–(III-B) and performance measure defined in (8) with distinct observation probabilities and underlying cyclic observation structure.

V Conclusions

We studied a MAMAB problem where agents observe instantaneous actions and rewards of their neighbors according to a stochastic network graph in which agents are distinguished by their sociability, defined as the probability of observing their neighbors. We derived an upper bound for expected cumulative regret of agents. We proposed a measure to predict relative performance ranking of the agents as a function of sociability of agents and their neighbors. We verified that having less sociable neighbors improves the performance of agents. Accuracy of the measure has been verified analytically through expected cumulative regret bounds and computationally through numerical simulations.


The first author wishes to thank Peter Landgren for helpful comments during the preparation of this paper.

Proof of Theorem 1

Since is a sub-Gaussian random variable with variance proxy we have

Define a new random variable such that

Note that Let Let For any

is an measurable random variable, and so

Further, using the properties of conditional expectations

Thus we see that

The rest of the proof closely follows the papers [17, 18]. For clarity and completeness we include the main steps. Let Then where For and we have


Recall from the Markov inequality that for any random variable . Thus,

Proof of Theorem 2

From equations (6)–(7) we have

Note that . Using Lemma 1 and we have

Since can be made arbitrarily small, the summation term can be evaluated as follows:

Thus we have


  • [1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning.   MIT Press Cambridge, MA, USA, 1998.
  • [2] H. Robbins, Some Aspects of the Sequential Design of Experiments.   Springer New York, 1985.
  • [3] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 41, pp. 148–177, 1979.
  • [4] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
  • [5] R. Agrawal, “Sample mean based index policies with O(log n) regret for the multi-armed bandit problem,” Advances in Applied Probability, vol. 27, pp. 1054–1078, 1995.
  • [6] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multi-armed bandit problem,” Machine Learning, vol. 47, pp. 235–256, 2002.
  • [7] E. Kaufmann, O. Cappé, and A. Garivier, “On Bayesian upper confidence bounds for bandit problems,” in International Conference on Artificial Intelligence and Statistics, April 2012, pp. 592–600.
  • [8] P. Reverdy, V. Srivastava, and N. E. Leonard, “Modeling human decision-making in generalized Gaussian multi-armed bandits,” in Proceedings of the IEEE, vol. 102, no. 4, 2014, pp. 544–571.
  • [9] V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: I.i.d. rewards,” IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968–976, 1987.
  • [10] P. Landgren, V. Srivastava, and N. E. Leonard, “Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms,” in IEEE Conference on Decision and Control, December 2016, pp. 167–172.
  • [11] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.
  • [12] P. Landgren, V. Srivastava, and N. E. Leonard, “On distributed cooperative decision-making in multiarmed bandits,” in European Control Conference, June 2016, pp. 243–248.
  • [13] R. K. Kolla, K. Jagannathan, and A. Gopalan, “Collaborative learning of stochastic bandits over a social network,” arXiv:1602.08886v2, 2016.
  • [14] P. Landgren, V. Srivastava, and N. E. Leonard, “Social imitation in cooperative multiarmed bandits: partition-based algorithms with strictly local information,” in IEEE Conference on Decision and Control, December 2018, pp. 5239–5244.
  • [15] A. R. Tilman, J. R. Watson, and S. Levin, “Maintaining cooperation in social-ecological systems:,” Theoretical Ecology, vol. 10, pp. 155–165, 2016.
  • [16] T. L. Lai, “Adaptive treatment allocation and the multi-armed bandit problem,” Ann. Statist., vol. 15, no. 3, pp. 1091–1114, 09 1987. [Online]. Available: https://doi.org/10.1214/aos/1176350495
  • [17] A. Garivier and E. Moulines, On Upper-Confidence Bound Policies for Switching Bandit Problems.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 174–188.
  • [18] ——, “On upper-confidence bound policies for non-stationary bandit problems,” arXiv:0805.3415v1, 2008.