A Dynamic Observation Strategy for Multi-agent Multi-armed Bandit Problem

04/08/2020 ∙ by Udari Madhushani, et al. ∙ Princeton University

We define and analyze a multi-agent multi-armed bandit problem in which decision-making agents can observe the choices and rewards of their neighbors under a linear observation cost. Neighbors are defined by a network graph that encodes the inherent observation constraints of the system. We define a cost associated with observations such that at every instance an agent makes an observation it receives a constant observation regret. We design a sampling algorithm and an observation protocol for each agent to maximize its own expected cumulative reward through minimizing expected cumulative sampling regret and expected cumulative observation regret. For our proposed protocol, we prove that total cumulative regret is logarithmically bounded. We verify the accuracy of analytical bounds using numerical simulations.


I Introduction

The effect of communication structure in cooperative and competitive multi-agent systems has been extensively studied in decision theory. The performance of a group of social learners can be improved by information shared among individuals. In most real-world decision-making processes, however, information sharing between agents is costly. As a result, directed communication, where each agent only needs to observe its neighbors, has advantages over undirected communication, where each agent both sends and receives information. Even when observation costs are high, agents can keep costs to a minimum by choosing when and whom to observe as a function of their own performance. Further, in this setting costs associated with cooperation can be avoided.

Consider the problem of a group of fishermen foraging in an uncertain environment that consists of a distribution of a spatial resource (fish). Because of the natural dynamics of fish, environmental conditions, and other external factors, the resource is distributed stochastically. As a result, a fisherman will receive different reward values (number of fish harvested) at different times, even when sampling from the same patch. Thus, in order to maximize cumulative reward, fishermen need to be able to exploit, i.e., forage in well-sampled patches known to provide a better harvest, and to explore, i.e., forage in poorly sampled patches, which is riskier but may provide an even better harvest than well-sampled patches. Benefiting from exploitation requires sufficient exploration and identification of the patches that yield the highest rewards. More generally, optimal foraging performance comes from balancing the trade-off between exploring and exploiting. This is known as the explore-exploit dilemma.

Multi-armed bandit (MAB) problems are a set of mathematical models that have been proposed to capture the salient features of explore-exploit trade-offs [14, 13]. For the standard MAB problem the reward distributions associated with options are static. An agent estimates the expected reward of each option using the rewards it receives through sampling. The agent chooses among options by considering a trade-off between estimated expected reward (exploiting) and the uncertainty associated with the estimate (exploring). In the frequentist setting, the natural way of estimating the expectation of the reward is to consider the sample average [7, 1, 3]. The papers [5, 12] present how to incorporate prior knowledge about reward expectation in the estimation step by leveraging the theory of conditional expectation in the Bayesian setting.
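To make the two estimators concrete, the standard formulas are as follows (ours for illustration, not reproduced from this paper): the running sample average used in the frequentist setting and, for a Gaussian prior $\mathcal{N}(\mu_0, \sigma_0^2)$ with known reward variance $\sigma^2$, the Bayesian posterior mean that blends the prior with the data.

```latex
% Standard estimators referenced above (illustrative, not taken from the paper):
% frequentist sample average after n rewards x_1,...,x_n of an option, and the
% conjugate Gaussian posterior mean with prior N(mu_0, sigma_0^2) and known
% reward variance sigma^2, which blends the prior with the sample average.
\[
  \hat{\mu}_n = \frac{1}{n}\sum_{s=1}^{n} x_s,
  \qquad
  \mathbb{E}[\mu \mid x_1,\dots,x_n]
  = \frac{\sigma^2 \mu_0 + n\,\sigma_0^2\,\hat{\mu}_n}{\sigma^2 + n\,\sigma_0^2}.
\]
```

As the number of samples $n$ grows, the posterior mean approaches the sample average, so prior knowledge matters most when data are scarce.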

Multi-agent multi-armed bandit (MAMAB) problems consider a group of individuals facing the same MAB problem simultaneously. For an individual to maximize its own reward, it will naturally seek to observe its neighbors and use those observations to improve its performance. Individual and group performance of agents will vary according to the observation structure, i.e., who is observing whom, and the type of information they observe. For example, if the agents are cooperative and can broadcast signals, they could share their estimates of rewards. When there are constraints, such as communication costs and privacy concerns, they might instead share only their instantaneous rewards and choices. Even without the ability to broadcast, agents may still be able to use sensors to observe the instantaneous rewards and choices of neighbors. A centralized multi-agent setting is considered in [2] and a decentralized setting is considered in [4]. The papers [9, 8] use a running consensus algorithm in which agents observe the reward estimates of their neighbors. In [6, 10] an MAMAB problem is studied in which agents observe instantaneous rewards and choices in a leader-follower setting.

In all of these previous works, communication between agents is assumed to be cost-free. However, in real-world settings observing neighbors or exchanging information with neighbors is costly. In the present paper, we propose a setting in which agents decide when and whom to observe in order to receive maximum benefit from observations that incur a cost. An underlying undirected network graph defines neighbors and models the inherent observation constraints present in the network. Agents incur a fixed observation cost at every instance at which they observe a neighbor.

To account for the observation cost, we define cumulative regret to be the total regret an agent receives from sampling suboptimal options (sampling regret) and from observing neighbors (observation regret). Deterministic [10] and probabilistic [11] communication strategies proposed in the MAB literature lead to a linear cumulative observation regret. Our main contribution is the design of a new strategy for which we prove a logarithmic total cumulative regret, i.e., order-optimal performance. Our design leverages the intuition that it is most useful to observe neighbors when uncertainty associated with estimates of rewards is high.

In Section II we introduce the MAMAB problem and propose an efficient sampling rule and a communication protocol for an agent to maximize its own total expected cumulative reward. We analyze the performance of the proposed sampling rule in Section III. In Section III-A we analytically upper bound the expected cumulative sampling regret, and in Section III-B we analytically upper bound the expected cumulative observation regret. We present the upper bound for the total expected cumulative regret in Section III-C. In Section IV we provide numerical simulation results and computationally validate the analytical results. We conclude in Section V and provide additional mathematical details in the Appendix.

II Multi-agent Multi-armed Bandit Problem

In this section we present the mathematical formulation of the MAMAB problem studied here. Let $N$ be the number of options (arms) and $K$ the number of agents. Define $X_i$ as the random variable that denotes the reward associated with option $i \in \{1, \ldots, N\}$. In this paper we assume that all the reward distributions are sub-Gaussian. Let $\sigma_i^2$ be the variance proxy of $X_i$ and $\mu_i$ the expected reward of option $i$. Let $i^*$ be the optimal option, i.e., the option with the highest expected reward $\mu_{i^*} = \max_i \mu_i$. Each agent chooses one option at each time step with the goal of minimizing its cumulative regret. In MAB problems, cumulative regret is typically defined as cumulative sampling regret, which is determined by the expected number of times suboptimal options are selected. We let cumulative regret be the sum of cumulative sampling regret and a cumulative observation regret that accumulates a fixed cost for every observation of a neighbor.
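Written out with shorthand for the quantities just described (the symbol names are ours, chosen for this presentation), the objective each agent $k$ seeks to minimize over a horizon $T$ decomposes as

```latex
% Regret decomposition sketch (our symbols): n_i^k(T) is the number of times agent k
% samples option i up to time T, Delta_i the expected-reward gap, and c the unit
% observation cost introduced below.
\[
  \mathbb{E}\big[R^k(T)\big]
  = \underbrace{\sum_{i \neq i^*} \Delta_i\,\mathbb{E}\big[n^k_i(T)\big]}_{\text{sampling regret}}
  \;+\;
  \underbrace{c\;\mathbb{E}\big[\#\text{ observations made by agent } k \text{ up to } T\big]}_{\text{observation regret}},
  \qquad \Delta_i = \mu_{i^*} - \mu_i .
\]
```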

We assume that the expected reward values are unknown and the variance proxy values are known to the agents. To improve its own performance, each agent observes its neighbors according to an observation protocol that we define. We use a network graph to encode hard observation constraints, and this defines the neighbors of agents. Let $\mathcal{G}(\mathcal{V}, \mathcal{E})$ be an undirected graph. $\mathcal{V}$ is a set of nodes, such that node $k$ in $\mathcal{V}$ corresponds to agent $k$ for $k \in \{1, \ldots, K\}$. $\mathcal{E}$ is a set of edges between nodes in $\mathcal{V}$. If there is an edge between node $k$ and node $k'$, then we say that agent $k$ and agent $k'$ are neighbors. Since the graph is undirected, the neighbor relation is symmetric. Let $d_k$ be the number of neighbors of agent $k$.

Let $\varphi^k_t$ and $X^k_t$ be random variables that denote the option chosen by agent $k$ and the reward received by agent $k$ at time $t$, respectively. Let $\mathbb{1}\{\varphi^k_t = i\}$ be a random variable that takes value 1 if option $i$ is chosen by agent $k$ at time $t$ and is 0 otherwise. Let $\mathbb{1}^k_{k'}(t)$ be a random variable that takes value 1 if agent $k$ observes agent $k'$ at time $t$ and is 0 otherwise.

In order to maximize the cumulative reward in the long run, agents need both to identify the best options through exploring and to sample the best options through exploiting. Observing neighbors allows an agent to receive more information about options and hence to obtain better estimates of the expected reward values of options. This leads to less exploring and more exploiting, which reduces the regret an agent receives due to sampling suboptimal options. However, since taking observations is costly, an agent must find a trade-off between the information gain and the cost associated with observations. Let $c$ be the cost incurred by agent $k$ when it observes the instantaneous reward and choice of agent $k'$ at time step $t$. In this paper we consider the case in which this cost is the same constant $c > 0$ for every agent, every neighbor, and every time step.

Let the number of times that agent $k$ samples option $i$ until time $t$ be given by the random variable $n^k_i(t)$. And let the total number of times that agent $k$ observes rewards from option $i$ until time $t$ be given by the random variable $N^k_i(t)$, which counts both the agent's own samples of option $i$ and the samples of option $i$ it observes from its neighbors.

We define a sampling rule based on the well-known UCB (Upper Confidence Bound) rule for a single agent [3]. The UCB rule chooses, at each time, the option that maximizes an objective function that is the sum of an exploit term, equal to the current estimate of the reward mean, and an explore term, equal to a measure of uncertainty in that estimate. Our sampling rule for agent $k$ in the MAMAB problem accounts for the observations of neighbors by using them to improve its estimate and reduce its uncertainty. Let the estimate by agent $k$ of the expected reward from option $i$ at time $t$ be given by the random variable $\hat{\mu}^k_i(t) = S^k_i(t)/N^k_i(t)$, where $S^k_i(t)$ is the total reward observed by agent $k$ from option $i$ until time $t$.

Definition 1

The sampling rule for agent $k$ at time $t$ is defined as

(1)

with

(2)
(3)

where $\xi$ is a tuning parameter that captures the trade-off between exploring and exploiting.
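Equations (1)-(3) define a UCB-type index; the following Python sketch illustrates one plausible instantiation, in which the exploit term is the pooled sample mean $S^k_i(t)/N^k_i(t)$ and the explore term is a sub-Gaussian confidence width. The exact width used in (3) may differ from the one below; its form and the parameter name `xi` are assumptions made for illustration.

```python
import numpy as np

def ucb_index(S, N, sigma, t, xi=1.0):
    """UCB-style index for one agent. S[i] is the total reward the agent has
    observed for option i (own samples plus observed neighbors), N[i] the number
    of such observations, sigma[i] the known sub-Gaussian scale, t the time step.
    The explore-term form and the parameter xi are illustrative assumptions."""
    S, N, sigma = map(np.asarray, (S, N, sigma))
    mean = np.where(N > 0, S / np.maximum(N, 1), 0.0)                      # exploit term
    width = sigma * np.sqrt(2.0 * (xi + 1.0) * np.log(max(t, 2)) / np.maximum(N, 1))
    return np.where(N > 0, mean + width, np.inf)                           # unsampled options first

def choose_option(S, N, sigma, t, xi=1.0):
    """Sampling rule sketch: pick the option that maximizes the UCB index."""
    return int(np.argmax(ucb_index(S, N, sigma, t, xi)))
```

Whatever the exact width in (3), the structure of the rule, a pooled mean plus an uncertainty bonus that shrinks as $N^k_i(t)$ grows, is what the analysis below relies on.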

To find a balance between information gain and observation cost, we define an observation rule for agents so that they choose to incur the cost of observing neighbors only when observations are most needed, i.e., when their own uncertainty is high. Under the following observation rule, an agent observes the instantaneous rewards and choices of all of its neighbors only when it is exploring, since it explores when uncertainty is high. If agent $k$ chooses the option at time $t$ that corresponds to the maximum of its estimates of the reward means, then it is exploiting and it does not observe its neighbors.

Definition 2

The observation rule for agent $k$ at time $t$ is defined as

(4)
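Equation (4) encodes the condition just described: observe all neighbors exactly when the chosen option is not the current greedy option. A minimal Python sketch of that test is given below; the helper names are ours.

```python
import numpy as np

def should_observe(choice, S, N):
    """Observation rule sketch: agent k observes all of its neighbors at time t
    only when it is exploring, i.e., when the option it just chose is not the
    option with the largest current sample-mean estimate."""
    S, N = map(np.asarray, (S, N))
    means = np.where(N > 0, S / np.maximum(N, 1), -np.inf)
    greedy = int(np.argmax(means))
    return choice != greedy   # True -> pay cost c for each of the d_k neighbors
```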

III Performance Analysis

In this section we analyze the cumulative regret of agent $k$ due to sampling suboptimal options and observing neighbors when it employs the sampling rule of Definition 1 and the observation rule of Definition 2.

III-A Sampling Regret Analysis

Let $i \neq i^*$ be a suboptimal option. The total number of times agent $k$ samples from option $i$ until time $T$ can be upper bounded as

Here $\mathbb{1}\{\cdot\}$ is the indicator function, which takes value 1 when its argument holds and 0 otherwise.

Thus we have

Let $R^k_i(T)$ be the cumulative sampling regret of agent $k$ from option $i$ until time $T$. Recall that the cumulative sampling regret is defined as the loss incurred by sampling suboptimal options. Define $\Delta_i = \mu_{i^*} - \mu_i$. Then we have, from [7],

(5)

To analyze the expected number of samples from suboptimal options until time $T$, we first note that we have

and so

(6)

Next we analyze concentration probability bounds on the estimates of options.

Theorem 1

For any and for there exists a such that

where

The proof of Theorem 1 can be found in the paper [11]. Using symmetry we conclude that

Lemma 1

For and there exists a such that

The proof of Lemma 1 can be found in the paper [11].

We proceed to upper bound the summation of the probabilities of the events for as follows. Using equation (3) we have that the inequality implies

This inequality does not hold for , where

Thus we have

(7)

From the probability bounds given in Lemma 1 and (7), the total expected number of times agent $k$ samples suboptimal option $i$ until time $T$ is upper bounded as

(8)

where

From equation (5), the expected cumulative sampling regret of agent $k$ until time $T$ is upper bounded as

(9)

III-B Observation Regret Analysis

Recall that $c$ is the constant unit cost associated with observations. Let $R^k_o(T)$ be the cumulative observation regret of agent $k$ at time step $T$. Then we have

This is equal to $c$ times the number of observations taken by agent $k$ until time $T$. The expected cumulative observation regret can be expressed as

(10)

So the expected cumulative observation regret can be upper bounded by upper bounding the expected number of observations until time $T$:

(11)

To analyze the expected number of observations, we use

We first upper bound the expected number of times agent $k$ observes its neighbors until time $T$ when it decides to explore after sampling a suboptimal option.
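Under Definition 2, an exploring agent observes each of its $d_k$ neighbors once, so the expected number of observations splits according to which option was just sampled. Written in our shorthand (with $\varphi^k_t$ the option agent $k$ samples at time $t$), the first term below is the case handled by Lemma 2 and the second is the case handled in the analysis that follows:

```latex
% Case split for the observation count (our notation): exploring costs one observation
% per neighbor, and each exploration step follows either a suboptimal or an optimal sample.
\[
  \mathbb{E}\big[\#\text{ observations of agent } k \text{ up to } T\big]
  = d_k \sum_{t=1}^{T}\Big(
      \sum_{i \neq i^*} \mathbb{P}\big\{\varphi^k_t = i,\ \text{agent } k \text{ explores at } t\big\}
      + \mathbb{P}\big\{\varphi^k_t = i^*,\ \text{agent } k \text{ explores at } t\big\}
    \Big).
\]
```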

Lemma 2

For all suboptimal we have

The proof of Lemma 2 is given in the Appendix.

Next we analyze the expected number of times agent $k$ observes its neighbors until time $T$ when it decides to explore after sampling the optimal option.

Note that we have

Thus we have

From Lemma 1 we have

(12)
Theorem 2

For all suboptimal options we have

The proof of Theorem 2 is given in the Appendix.

Now we proceed to state the main result of this paper, which is that the expected cumulative observation regret until time $T$ for agent $k$ employing the sampling rule given by Definition 1 and the observation rule given by Definition 2 is upper bounded logarithmically in $T$.

Theorem 3

The expected cumulative observation regret until time $T$ for agent $k$ can be upper bounded as

Theorem 3 follows from equations (10)-(12), Lemma 2 and Theorem 2.

Remark 1

Note that for the deterministic communication strategies of [9, 10] the expected cumulative observation regret until time $T$ for agent $k$ is linear in $T$.

For the probabilistic observation strategy of [11] the expected cumulative observation regret until time $T$ for agent $k$ is also linear in $T$, with a rate that depends on the observation probability of agent $k$. Thus, our proposed sampling rule and observation rule outperform these strategies when there are cumulative observation costs.
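As a rough sketch with the constant cost $c$, $d_k$ neighbors, and an observation probability $p_k$ (symbols assumed here rather than taken from those papers), observing every neighbor at every step, or each neighbor independently with probability $p_k$, gives, respectively,

```latex
% Hedged sketch (our symbols): with d_k neighbors, unit cost c, and observation
% probability p_k, both strategies yield observation regret that grows linearly in T.
\[
  \mathbb{E}\big[R^k_o(T)\big] = c\, d_k\, T
  \qquad \text{and} \qquad
  \mathbb{E}\big[R^k_o(T)\big] = c\, d_k\, p_k\, T .
\]
```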

III-C Total Expected Cumulative Regret

Total expected cumulative regret is defined as the sum of the expected cumulative sampling regret and the expected cumulative observation regret until time $T$:

The total expected cumulative regret until time $T$ of agent $k$ is upper bounded as

(13)

IV Simulation Results

In this section we present numerical simulation results for a network of 6 agents with underlying observation structure defined by the star graph: the center agent observes all other agents and all other agents only observe the center agent. Agents other than the center agent are interchangeable and their average regret and individual regret are the same. We present numerical simulations to evaluate the performance of the sampling rule and observation rule given by Definitions 1 and 2.

The 6 agents play the same MAB problem with 10 options. In all simulations the reward distributions are Gaussian with a fixed common variance and the following mean values:

Option i:   1    2    3    4    5    6    7    8    9    10
Mean:      40   50   50   60   70   70   80   90   92   95

The communication cost $c$ and the sampling rule parameter $\xi$ are held constant across all simulations. We provide results for 1000 time steps with 1000 Monte Carlo simulations.
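For reference, a self-contained Python sketch of this kind of experiment is given below. It is not the authors' code: the reward standard deviation, the cost `c_obs`, the parameter `xi`, and the exact form of the confidence width are assumed values, and the update ordering is one reasonable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem setup from the text; variance, cost, and xi values are assumptions.
means = np.array([40, 50, 50, 60, 70, 70, 80, 90, 92, 95], dtype=float)
sigma = 10.0            # assumed common reward standard deviation
c_obs = 0.1             # assumed per-observation cost
xi = 1.0                # assumed tuning parameter
n_opts, n_agents, T = len(means), 6, 1000

# Star graph: agent 0 is the center and is a neighbor of everyone else.
neighbors = [list(range(1, n_agents))] + [[0]] * (n_agents - 1)

S = np.zeros((n_agents, n_opts))   # total observed reward per agent/option
N = np.zeros((n_agents, n_opts))   # number of observed rewards per agent/option
sampling_regret = np.zeros(n_agents)
observation_regret = np.zeros(n_agents)

for t in range(1, T + 1):
    choices = np.empty(n_agents, dtype=int)
    rewards = np.empty(n_agents)
    exploring = np.empty(n_agents, dtype=bool)
    for k in range(n_agents):
        if np.any(N[k] == 0):                      # sample each option once first
            choices[k] = int(np.argmin(N[k]))
            exploring[k] = True
        else:
            mean_est = S[k] / N[k]
            ucb = mean_est + sigma * np.sqrt(2 * (xi + 1) * np.log(t) / N[k])
            choices[k] = int(np.argmax(ucb))
            exploring[k] = choices[k] != int(np.argmax(mean_est))  # Definition 2
        rewards[k] = rng.normal(means[choices[k]], sigma)
        sampling_regret[k] += means.max() - means[choices[k]]

    # Update with own samples and, when exploring, with observed neighbor samples.
    for k in range(n_agents):
        S[k, choices[k]] += rewards[k]
        N[k, choices[k]] += 1
        if exploring[k]:
            for j in neighbors[k]:
                S[k, choices[j]] += rewards[j]
                N[k, choices[j]] += 1
                observation_regret[k] += c_obs

print("sampling regret:   ", np.round(sampling_regret, 1))
print("observation regret:", np.round(observation_regret, 1))
```

Running this over many random seeds and averaging gives Monte Carlo estimates of the two regret components as functions of time.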

Figure 1 shows simulation results for the expected cumulative sampling regret of a group of 6 agents using the proposed sampling and observation rules. The blue dashed line shows regret of the center agent. The green dash-dot line shows the average regret of the agents not in the center. The red dotted line shows the average expected cumulative sampling regret over all agents. It can be observed that the expected cumulative sampling regret is logarithmic in time. For comparison, we plot the average expected cumulative regret of the agents when they make no observations of neighbors (solid gold line). When agents are not making observations they are interchangeable, and so the average performance and the individual performance are the same. The simulation results illustrate that the performance of every agent improves significantly when it observes neighbors according to the proposed protocol. The simulation results further show that the center agent outperforms the other agents. This is to be expected since the center agent has more neighbors than the other agents.

Figure 2 shows simulation results for the expected observation regret. It can be seen that the expected observation regret is logarithmic in time, as proved in Theorem 3. Since the center agent has more neighbors than the other agents, its observation regret is the highest. However, the results illustrate that when the observation cost is small, a significant performance improvement can be obtained for a small observation regret.

Fig. 1: Dashed and dotted lines show expected cumulative sampling regret of the agents using the sampling rule and observation rule of Definitions 1 and 2 with underlying star observation structure. The solid line shows the average performance of agents when they are not observing their neighbors.
Fig. 2: Dashed and dotted lines show expected cumulative observation regret of the agents using the sampling rule and observation rule of Definitions 1 and 2 with underlying star observation structure. The solid line shows that agents do not suffer from any observation regret when they do not observe their neighbors.

V Conclusions

We studied an MAMAB problem in which agents can observe the instantaneous choices and rewards of their neighbors but incur a cost each time they make an observation of a neighbor. We proposed a sampling rule and an observation rule in which an agent observes its neighbors only when it has decided to explore. We defined total expected cumulative regret to be the regret agents receive due to sampling suboptimal options and to observing neighbors. Deterministic and stochastic observation strategies for MAB protocols in the literature yield an expected cumulative observation regret that is linear in time $T$. We analytically proved that, under the proposed sampling and observation rules, the expected cumulative regret of each agent is bounded logarithmically in $T$. The accuracy of the upper bounds was verified computationally through numerical simulations.

Proof of Lemma 2

Note that we have

Then we have

Lemma 2 follows from equation (8).

Proof of Theorem 2

Let be a suboptimal option with highest estimated expected reward for agents at time Then we have and If the agent chooses option at time step we have Thus we have and

Note that for some we have

Let Then we have

Since we have

Then we have

References

• [1] R. Agrawal (1995) Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078.
• [2] V. Anantharam, P. Varaiya, and J. Walrand (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part I: i.i.d. rewards. IEEE Transactions on Automatic Control 32 (11), pp. 968–976.
• [3] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
• [4] D. Kalathil, N. Nayyar, and R. Jain (2014) Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory 60 (4), pp. 2331–2345.
• [5] E. Kaufmann, O. Cappé, and A. Garivier (2012) On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pp. 592–600.
• [6] R. K. Kolla, K. Jagannathan, and A. Gopalan (2018) Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking 26 (4), pp. 1782–1795.
• [7] T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
• [8] P. Landgren, V. Srivastava, and N. E. Leonard (2016) Distributed cooperative decision-making in multiarmed bandits: frequentist and Bayesian algorithms. In IEEE Conference on Decision and Control (CDC), pp. 167–172.
• [9] P. Landgren, V. Srivastava, and N. E. Leonard (2016) On distributed cooperative decision-making in multiarmed bandits. In European Control Conference (ECC), pp. 243–248.
• [10] P. Landgren, V. Srivastava, and N. E. Leonard (2018) Social imitation in cooperative multiarmed bandits: partition-based algorithms with strictly local information. In IEEE Conference on Decision and Control (CDC), pp. 5239–5244.
• [11] U. Madhushani and N. E. Leonard (2019) Heterogeneous stochastic interactions for multiple agents in a multi-armed bandit problem. In European Control Conference (ECC), pp. 3502–3507.
• [12] P. B. Reverdy, V. Srivastava, and N. E. Leonard (2014) Modeling human decision making in generalized Gaussian multiarmed bandits. Proceedings of the IEEE 102 (4), pp. 544–571.
• [13] H. Robbins (1952) Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58 (5), pp. 527–535.
• [14] R. S. Sutton and A. G. Barto (1998) Introduction to Reinforcement Learning. Vol. 135, MIT Press, Cambridge.