In resource-constrained environments, the difficulty of constructing and maintaining large-scale infrastructure limits the possibility of developing a centralized learning system that has access to global information, resources for effectively processing that information, and the capacity to make all the decisions. Consequently, developing cost-efficient distributed learning systems, i.e., groups of units that collectively process information and make decisions with minimal resources, is an essential step towards making machine learning practical in such constrained environments. In general, most distributed learning strategies allow individuals to make decisions using locally available information (kalathil2014decentralized; landgren2016distributed; madhushani2019heterogeneous), i.e., information that they observe or that is communicated to them by their neighbors. However, the performance of such systems is strongly dependent on the underlying communication structure. Such dependence inherently leads to a trade-off between communication cost and performance. Our goal is to develop high-performance distributed learning systems with minimal communication cost.
We focus on developing cost-effective distributed learning techniques for sequential decision making under stochastic outcomes. Our work is motivated by the growing number of real-world applications such as clinical trials, recommender systems, and user-targeted online advertising. For example, consider a set of organizations networked to recommend educational programs to online users under high demand. Each organization makes a series of sequential decisions about which programs to recommend according to the user feedback (warlop2018fighting; feraud2018decentralized). As another example, consider a set of small pharmaceutical companies conducting experimental drug trials (tossou2016algorithms; Durand2018ContextualBF). Each company makes a series of sequential decisions about the drug administration procedure according to the observed patient feedback. In both examples, the feedback received by the decision maker is stochastic, i.e., it carries some uncertainty: at different time steps, online users (patients) can experience the same program (drug) differently because of internal and external factors, including their own state of mind and environmental conditions.
Performance of distributed learning in these systems can be significantly improved by establishing a communication network that facilitates full communication, whereby each organization shares all feedback immediately with others. However, communication can often be expensive and time-consuming. Under full communication, the amount of communicated data is directly proportional to the time horizon of the decision-making process. In a resource-constrained environment, when the decision-making process continues for a long time, the full communication protocol becomes impractical. We address this problem by proposing a partial communication strategy that obtains the same order of performance as the full communication protocol, while using a significantly smaller amount of data communication.
To derive and analyze our proposed strategy, we make use of the bandit framework, a mathematical model that has been developed to model sequential decision making under stochastic outcomes (laiRobbins; robbins1952some). Consider a group of agents (units) making sequential decisions in an uncertain environment. Each agent is faced with the problem of repeatedly choosing an option from the same fixed set of options (kalathil2014decentralized; landgren2016distributed; landgren2016distributedCDC; landgren2020distributed; martinez2019decentralized). After every choice, each agent receives a numerical reward drawn from a probability distribution associated with its chosen option. The objective of each agent is to maximize its individual cumulative reward, thereby contributing to maximizing the group cumulative reward.
The best strategy for an agent in this situation is to repeatedly choose the optimal option, i.e., the option that provides the maximum average reward. However, agents do not know the expected reward values of the options. Each individual is required to execute a combination of exploiting actions, i.e., choosing the options that are known to provide high rewards, and exploring actions, i.e., choosing the lesser known options in order to identify options that might potentially provide higher rewards.
This process is sped up through distributed learning in which agents exchange their reward values and actions with their neighbors (madhushani2019heterogeneous; madhushani2020dynamic). The protocols in those works determine when an agent observes (samples) the reward values and actions of its neighbors. Our proposed protocol instead determines only when an agent shares (broadcasts). A key result is that this seemingly altruistic action (sharing) provably benefits the individual as well as the group; torney2011signalling show how such sharing can be an evolutionarily stable strategy in animal social groups.
We define exploit-based communication to be information sharing by agents only when they execute exploiting actions. Similarly, we define explore-based communication to be information sharing by agents only when they execute exploring actions. Since every action is either an exploiting or an exploring action, full communication corresponds to using exploit-based and explore-based communication together.
We propose a new partial communication protocol that uses only explore-based communication. We illustrate that explore-based communication obtains the same order of performance as full communication, while incurring a significantly smaller communication cost.
In this work, we study cost-efficient, information-sharing, communication protocols in sequential decision making. Our contributions include the following:
We propose a new cost-effective partial communication protocol for distributed learning in sequential decision making. The communication protocol determines when each agent shares its reward value and choice with its neighbors.
We illustrate that with this protocol the group obtains the same order of performance as it obtains with full communication.
We prove that under the proposed partial communication protocol, the communication cost is $O(\log T)$, where $T$ is the number of decision-making steps; whereas under full communication protocols, the communication cost is $O(T)$.
Previous works (kalathil2014decentralized; landgren2016distributed; landgren2016distributedCDC; landgren2018social; landgren2020distributed; martinez2019decentralized) have considered the distributed bandit problem without a communication cost. landgren2016distributed; landgren2016distributedCDC; landgren2020distributed use a running consensus algorithm to update estimates and provide graph-structure-dependent performance measures that predict the relative performance of agents and networks. landgren2020distributed also address the case of a constrained reward model in which agents that choose the same option at the same time step receive no reward. martinez2019decentralized propose an accelerated consensus procedure in the case that agents know the spectral gap of the communication graph and design a decentralized UCB algorithm based on delayed rewards. szorenyi2013gossip consider a P2P communication where an agent is only allowed to communicate with two other agents at each time step. In chakraborty2017coordinated, at each time step, agents decide either to sample an option or to broadcast the last obtained reward to the entire group. In this setting, each agent suffers from an opportunity cost whenever it decides to broadcast the last obtained reward. A communication strategy where agents observe the rewards and choices of their neighbors according to a leader-follower setting is considered in landgren2018social. Decentralized bandit problems with communication costs are considered in the works of tao2019collaborative; Wang2020Distributed. tao2019collaborative consider the pure exploration bandit problem with a communication cost equivalent to the number of times agents communicate. Wang2020Distributed propose an algorithm that achieves near-optimal performance with a communication cost equivalent to the amount of data transmitted. madhushani2020dynamic propose a communication rule where agents observe their neighbors when they execute an exploring action.
2.1 Problem Formulation
In this section we present the mathematical formulation of the problem. Consider a group of $N$ agents faced with the same $K$-armed bandit problem for $T$ time steps. In this paper we use the terms arms and options interchangeably. Let $X_k$ denote the reward of option $k \in \{1,\dots,K\}$. Define $\mu_k$ as the expected reward of option $k$. We define the option with maximum expected reward as the optimal option $k^* = \arg\max_k \mu_k$. Let $\Delta_k = \mu_{k^*} - \mu_k$ be the expected reward gap between option $k^*$ and option $k$. Let $\mathbb{1}_i^k(t)$ be the indicator random variable that takes value 1 if agent $i$ chooses option $k$ at time $t$ and 0 otherwise.
We define the communication network as follows. Let $\mathcal{G}(\mathcal{V}, \mathcal{E})$ be a fixed nontrivial graph that defines neighbors, where $\mathcal{V}$ denotes the set of agents and $(i,j) \in \mathcal{E}$ denotes the communication link between agents $i$ and $j$. Let $\beta_i(t)$ be the indicator variable that takes value 1 if agent $i$ shares its reward value and choice with its neighbors at time $t$. Since agents can send reward values and choices only to their neighbors, when $\beta_i(t) = 1$, agent $j$ receives the reward value and choice of agent $i$ at time $t$ if and only if $(i,j) \in \mathcal{E}$.
2.2 Our Algorithm
Let $\hat{\mu}_i^k(t)$ be the estimated mean of option $k$ by agent $i$ at time $t$. Let $n_i^k(t)$ and $N_i^k(t)$ denote the number of samples of option $k$ and the number of observations of option $k$, respectively, obtained by agent $i$ until time $t$. $N_i^k(t)$ is equal to $n_i^k(t)$ plus the number of observations of option $k$ that agent $i$ received from its neighbors until time $t$. So, by definition, $N_i^k(t) \ge n_i^k(t)$.
Initially, all the agents are given a reward value for one sample from each option.
The initial given reward values are used as the empirical estimates of the mean values of the options. Let $\eta_i^k$ denote the reward received initially by agent $i$ for option $k$, and let $S_i^k(t)$ denote the total reward observed for option $k$ by agent $i$ until time $t$, i.e., $\eta_i^k$ plus all rewards of option $k$ that agent $i$ sampled or received from its neighbors. The estimated mean value is calculated by taking the average of the total reward observed for option $k$ by agent $i$ until time $t$:
$$\hat{\mu}_i^k(t) = \frac{S_i^k(t)}{N_i^k(t)}.$$
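This running-average estimator can be maintained incrementally. The sketch below is a minimal illustration, not the paper's implementation; the class and method names (`ArmEstimate`, `update`) are hypothetical, and an agent would keep one such object per option, feeding in both its own samples and rewards shared by neighbors.

```python
class ArmEstimate:
    """Running statistics one agent keeps for one option.

    Tracks the total observed reward and the observation count, so the
    empirical mean estimate is simply total_reward / num_observations.
    """

    def __init__(self, initial_reward):
        # Each agent starts with one given sample per option.
        self.total_reward = initial_reward
        self.num_observations = 1

    def update(self, reward):
        """Fold in a reward, whether sampled directly or received from a neighbor."""
        self.total_reward += reward
        self.num_observations += 1

    @property
    def mean(self):
        return self.total_reward / self.num_observations
```

Because shared observations enter the same counter as direct samples, communication directly shrinks an agent's uncertainty about an option without costing it a pull.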
The goal of each agent is to maximize its individual cumulative reward, thereby contributing to maximizing the group cumulative reward. We make the following assumptions.
All agents know the variance proxy $\sigma^2$ of the rewards associated with each option.
When more than one agent chooses the same option at the same time they receive rewards independently drawn from the probability distribution associated with the chosen option.
To realize the goal of maximizing cumulative reward, agents are required to minimize the number of times they sample sub-optimal options. Thus, each agent employs an agent-based strategy that captures the trade-off between exploring and exploiting by constructing an objective function that strikes a balance between the estimation of the expected reward and the uncertainty associated with the estimate (auer2002finite). Each agent samples options according to the following rule.
(Sampling Rule) The sampling rule for agent $i$ at time $t+1$ is
$$\mathbb{1}_i^k(t+1) = 1, \quad k = \arg\max_{a \in \{1,\dots,K\}} \left\{ \hat{\mu}_i^a(t) + C_i^a(t) \right\}, \qquad C_i^a(t) = \sigma\sqrt{\frac{2(\xi+1)\log t}{N_i^a(t)}},$$
where $\xi > 1$ is a tunable exploration parameter.
$C_i^k(t)$ represents agent $i$'s uncertainty about the estimated mean of option $k$. When the number of observations $N_i^k(t)$ of option $k$ is high, the uncertainty associated with the estimated mean of option $k$ will be low; this is reflected in the inverse relation between $C_i^k(t)$ and $N_i^k(t)$.
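The UCB-style objective above can be sketched as a short function. This is an illustrative implementation under the standard auer2002finite form; the parameter names (`sigma`, `xi`) and default values are assumptions, not the paper's exact constants.

```python
import math


def ucb_index(estimated_mean, num_observations, t, sigma=1.0, xi=1.01):
    """UCB objective: empirical mean plus an uncertainty bonus.

    The bonus shrinks as num_observations grows, capturing the inverse
    relation between uncertainty and the number of observations.
    sigma is the known variance proxy; xi > 1 is an exploration parameter.
    """
    bonus = sigma * math.sqrt(2.0 * (xi + 1.0) * math.log(t) / num_observations)
    return estimated_mean + bonus
```

An agent would evaluate this index for every option at each step and sample the maximizer; observations received from neighbors increase `num_observations` and thus tighten the bonus.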
An exploiting action corresponds to choosing the option with maximum estimated mean value. This occurs when the option with maximum objective function value is the same as the option with maximum estimated mean value. An exploring action corresponds to choosing an option with high uncertainty. This occurs when the option with maximum objective function value differs from the option with maximum estimated mean value. Each agent can reduce the number of samples it takes from sub-optimal options by leveraging communication to reduce the uncertainty associated with the estimates of sub-optimal options. Thus, in resource-constrained environments, it is desirable to communicate only reward values obtained from sub-optimal options. Exploring actions often lead to taking samples from sub-optimal options. So, we define a partial communication protocol such that agents share their reward values with their neighbors only when they execute an exploring action.
(Communication Rule) The communication rule for agent $i$ at time $t+1$ is
$$\beta_i(t+1) = \begin{cases} 1, & \arg\max_k \left\{ \hat{\mu}_i^k(t) + C_i^k(t) \right\} \neq \arg\max_k \hat{\mu}_i^k(t), \\ 0, & \text{otherwise}, \end{cases}$$
i.e., agent $i$ shares its reward value and choice with its neighbors if and only if it executes an exploring action.
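The two rules combine into a single per-step decision: pick the option with the largest objective value, then broadcast only if that option is not the greedy (maximum estimated mean) choice. The sketch below assumes the indices have already been computed (e.g., by a UCB-style objective); the function name is illustrative.

```python
def choose_and_decide_broadcast(means, indices):
    """Pick the option with the largest objective index; broadcast the
    resulting reward and choice only when this is an exploring action,
    i.e., the chosen option differs from the one with the largest
    estimated mean.

    means   -- estimated mean per option
    indices -- objective value (mean + uncertainty bonus) per option
    """
    chosen = max(range(len(indices)), key=lambda k: indices[k])
    greedy = max(range(len(means)), key=lambda k: means[k])
    broadcast = chosen != greedy  # explore-based communication rule
    return chosen, broadcast
```

When the uncertainty bonuses are small, the chosen and greedy options coincide and nothing is sent, which is why communication naturally tapers off as estimates converge.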
The goal of maximizing cumulative reward is equivalent to minimizing cumulative regret, which is the loss incurred by the agent when sampling sub-optimal options. We analyze the performance of the proposed algorithm using expected cumulative regret and expected communication cost.
For a group of $N$ agents facing the $K$-armed bandit problem for $T$ time steps, the expected group cumulative regret can be expressed as
$$\mathbb{E}\left[R(T)\right] = \sum_{i=1}^{N} \sum_{k=1}^{K} \Delta_k \, \mathbb{E}\left[n_i^k(T)\right].$$
Thus, minimizing the expected group cumulative regret can be achieved by minimizing the expected number of samples taken from sub-optimal options.
Since communication refers to agents sharing their reward values and actions with their neighbors, each communicated message has the same length. We define communication cost as the total number of times the agents share their reward values and actions during the decision-making process. Let $L(T)$ be the group communication cost up to time $T$. Then, we have that
$$L(T) = \sum_{t=1}^{T} \sum_{i=1}^{N} \beta_i(t).$$
Under full communication, the expected communication cost is $NT$, since every agent shares at every time step. We now proceed to analyze the expected communication cost under the proposed partial communication protocol.
Let $\mathbb{E}\left[L(T)\right]$ be the expected cumulative communication cost of the group under the communication rule given in Definition 2. Then, we have that $\mathbb{E}\left[L(T)\right] = O(\log T)$.
We provide numerical simulation results illustrating the performance of the proposed sampling rule and the communication rule. For all the simulations presented in this section, we consider a group of 100 agents ($N = 100$) and 10 options ($K = 10$) with Gaussian reward distributions. We let the expected reward value of the optimal option be 11, the expected reward of all other options be 10, and the variance of all options be 1. We let the communication network graph be complete. We provide results with 1000 time steps ($T = 1000$) using 1000 Monte Carlo simulations.
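A scaled-down version of this experiment can be sketched in a few lines. The code below is an illustrative simulation, not the paper's code: it uses far smaller defaults than the $N = 100$, $K = 10$, $T = 1000$ setup, a UCB-style index with an assumed exploration parameter `xi`, and a complete graph in which every broadcast reaches all other agents.

```python
import math
import random


def run_explore_based(num_agents=5, num_arms=3, horizon=200,
                      mu=(1.1, 1.0, 1.0), sigma=1.0, xi=1.01, seed=0):
    """Simulate explore-based communication on a complete graph.

    Each agent runs a UCB-style rule and broadcasts its (arm, reward)
    only on exploring actions. Returns the total group regret and the
    total number of broadcasts (the communication cost).
    """
    rng = random.Random(seed)
    # Every agent starts with one given sample per arm.
    totals = [[rng.gauss(mu[k], sigma) for k in range(num_arms)]
              for _ in range(num_agents)]
    counts = [[1] * num_arms for _ in range(num_agents)]
    regret, broadcasts = 0.0, 0
    best = max(mu)
    for t in range(2, horizon + 2):
        messages = []  # (sender, arm, reward) broadcast this step
        for i in range(num_agents):
            ucb = [totals[i][k] / counts[i][k]
                   + sigma * math.sqrt(2 * (xi + 1) * math.log(t) / counts[i][k])
                   for k in range(num_arms)]
            chosen = max(range(num_arms), key=lambda k: ucb[k])
            greedy = max(range(num_arms),
                         key=lambda k: totals[i][k] / counts[i][k])
            reward = rng.gauss(mu[chosen], sigma)
            totals[i][chosen] += reward
            counts[i][chosen] += 1
            regret += best - mu[chosen]
            if chosen != greedy:  # exploring action: broadcast to neighbors
                broadcasts += 1
                messages.append((i, chosen, reward))
        # Complete graph: every other agent receives each broadcast.
        for i in range(num_agents):
            for sender, arm, reward in messages:
                if sender != i:
                    totals[i][arm] += reward
                    counts[i][arm] += 1
    return regret, broadcasts
```

Comparing the returned broadcast count against the full-communication cost (`num_agents * horizon`) reproduces, qualitatively, the gap shown in Figure 1(b).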
Figure 1(a) presents expected cumulative group regret for 1000 time steps. The curves illustrate that both full communication and explore-based communication significantly improve the performance of the group as compared to the case of no communication. Further, group performance with explore-based communication is of the same order as group performance with full communication. Group performance improvement obtained with exploit-based communication is insignificant as compared to the case of no communication. Figure 1(b) presents the results for expected cumulative communication cost per agent for 1000 time steps. The curves illustrate that communication cost incurred by explore-based communication is significantly smaller than the cost incurred by full communication and by exploit-based communication. In fact, the cost incurred by exploit-based communication is quite close to the cost incurred by full communication. Overall, the results illustrate that our proposed explore-based communication protocol, in which agents share their reward values and actions only when they execute an exploring action, incurs only a small communication cost while significantly improving group performance.
4 Discussion and Conclusion
The development of cost-effective communication protocols for distributed learning is desirable in resource-constrained environments. We proposed a new partial communication protocol for sharing (broadcasting) in the distributed multi-armed bandit problem. We showed that the proposed communication protocol has a significantly smaller communication cost as compared to the case of full communication while obtaining the same order of performance. An important future extension of our work is to analyze and improve the performance of the proposed communication protocol under random communication failures.
This research has been supported in part by ONR grants N00014-18-1-2873 and N00014-19-1-2556 and ARO grant W911NF-18-1-0325.
Appendix A Expected Communication Cost
Let $\mathbb{E}\left[L(T)\right]$ be the expected cumulative communication cost of the group under the communication rule given in Definition 2. Then, we have that $\mathbb{E}\left[L(T)\right] = O(\log T)$.
Proof of Lemma 1
The expected communication cost can be given as
$$\mathbb{E}\left[L(T)\right] = \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{E}\left[\beta_i(t)\right].$$
To analyze the expected number of exploring actions, we use the decomposition
$$\mathbb{E}\left[\beta_i(t)\right] = \sum_{k \neq k^*} \mathbb{E}\left[\beta_i(t)\mathbb{1}_i^k(t)\right] + \mathbb{E}\left[\beta_i(t)\mathbb{1}_i^{k^*}(t)\right].$$
We first upper bound the expected number of times agent $i$ broadcasts rewards and actions to its neighbors until time $T$ when it samples a sub-optimal option:
$$\sum_{t=1}^{T} \sum_{k \neq k^*} \mathbb{E}\left[\beta_i(t)\mathbb{1}_i^k(t)\right] \le \sum_{k \neq k^*} \mathbb{E}\left[n_i^k(T)\right] = O(\log T).$$
This follows from the fact that we only sample sub-optimal options logarithmically with time (see Lemma 2 in madhushani2020dynamic).
Next we analyze the expected number of times agent $i$ broadcasts rewards and actions to its neighbors until time $T$ when it samples the optimal option. An exploring action taken while choosing the optimal option implies that some sub-optimal option has an estimated mean at least as large as that of the optimal option. Note that we have
$$\mathbb{E}\left[\beta_i(t)\mathbb{1}_i^{k^*}(t)\right] \le \mathbb{P}\left(\exists\, k \neq k^* : \hat{\mu}_i^k(t) \ge \hat{\mu}_i^{k^*}(t),\ \mathbb{1}_i^{k^*}(t) = 1\right).$$
Thus, we have
$$\sum_{t=1}^{T} \mathbb{E}\left[\beta_i(t)\mathbb{1}_i^{k^*}(t)\right] \le \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_i^{k^*}(t) \le \mu_{k^*} - C_i^{k^*}(t)\right) + \sum_{t=1}^{T} \mathbb{P}\left(\beta_i(t)\mathbb{1}_i^{k^*}(t) = 1,\ \hat{\mu}_i^{k^*}(t) > \mu_{k^*} - C_i^{k^*}(t)\right). \tag{3}$$
From Lemma 1 in madhushani2020dynamic we get that the tail probabilities in the first summation term of (3) decay polynomially in $t$, so
$$\sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_i^{k^*}(t) \le \mu_{k^*} - C_i^{k^*}(t)\right) = O(1).$$
Now we proceed to upper bound the second summation term of (3). Note that on this event, for some $k \neq k^*$ we have $\hat{\mu}_i^k(t) \ge \hat{\mu}_i^{k^*}(t)$.
Let $\tilde{k}$ be the sub-optimal option with highest estimated expected reward for agent $i$ at time $t$. Then we have $\hat{\mu}_i^{\tilde{k}}(t) \ge \hat{\mu}_i^{k^*}(t)$, and on the event $\hat{\mu}_i^{k^*}(t) > \mu_{k^*} - C_i^{k^*}(t)$ also $\hat{\mu}_i^{\tilde{k}}(t) > \mu_{k^*} - C_i^{k^*}(t)$. If agent $i$ chooses option $k^*$ at time $t$ we have $\hat{\mu}_i^{k^*}(t) + C_i^{k^*}(t) \ge \hat{\mu}_i^{\tilde{k}}(t) + C_i^{\tilde{k}}(t)$. Thus we have $\hat{\mu}_i^{\tilde{k}}(t) - \mu_{\tilde{k}} > \Delta_{\tilde{k}} - C_i^{k^*}(t)$, i.e., option $\tilde{k}$ is overestimated whenever $C_i^{k^*}(t)$ is small relative to $\Delta_{\tilde{k}}$. Then for $t$ such that $C_i^{k^*}(t) \le \Delta_{\tilde{k}}/2$ we have
$$\mathbb{P}\left(\beta_i(t)\mathbb{1}_i^{k^*}(t) = 1,\ \hat{\mu}_i^{k^*}(t) > \mu_{k^*} - C_i^{k^*}(t)\right) \le \mathbb{P}\left(\hat{\mu}_i^{\tilde{k}}(t) \ge \mu_{\tilde{k}} + \frac{\Delta_{\tilde{k}}}{2}\right) = O\!\left(t^{-2}\right),$$
so the second summation term of (3) is also bounded by a constant.
The last equality follows from Lemma 1 by madhushani2020dynamic.