The inherent trade-off between exploration and exploitation is at the core of any reactive learning algorithm. Multi-arm bandit is a simple model which highlights this inherent trade-off. Multi-arm bandits can model a variety of scenarios, including pricing (where the actions are prices), recommendation (e.g., where actions are news articles) and many other settings.
To a large extent, multi-arm bandits are viewed as a model for learning and optimization in which the planner can select any available action. However, when the entities performing the actions are human agents, incentives become a major issue. While a planner can recommend actions to the agents (in order to explore different alternatives), the agents ultimately decide whether to follow the recommendation. This raises the issue of incentives in addition to the exploration-exploitation trade-off.
The planner can induce explorations in many ways. The simplest is using monetary transfers, paying the agents in order to explore (for example, Frazier et al. ). We are interested in the case when the social planner is unable or prefers to avoid any monetary transfers. (This can be due to regulatory constraints, business model, social norms, or any other reason.) The main advantage of the planner in our model is the information asymmetry, namely, the fact that the planner has much more information than the agents.
As a motivating example for information asymmetry, consider a GPS driving application. The application is recommending to the drivers (agents) the best route to drive, given the changing road delays, and observes the actual road delays when the route is driven. While the application can recommend driving routes, ultimately, the driver decides which route to actually drive. The application needs periodically to send drivers (agents) on exploratory routes, where it has uncertainty regarding the actual delay, in order to observe their delay. The driver (agent) is aware that the application has updated information regarding the current delays on various roads. For this reason, the driver (agent) would be willing to follow the recommendation even if she knows that there is a small probability that she is asked to explore. On the other extreme, if the driver would assume that with high probability a certain recommended route has a higher delay, she might drive an alternate route. This inherent balancing of exploration and exploitation while satisfying agents’ incentives, is at the core of our work.
The abstract model that we consider is the following. There is a finite set of actions, and for each action there is a prior distribution on its rewards. A social planner is faced with a sequence of myopic selfish agents, and each agent appears only once. The social planner would like to maximize the social welfare, the sum of the agents’ utilities. The social planner recommends to each agent an action, and if the recommendation is Bayesian incentive compatible, the agent will follow it. This model was presented in Kremer et al.  and studied in [11, 12, 13]. The work of Kremer et al.  presented an optimal algorithm for the social planner in the case of two actions with deterministic outcome. (Deterministic outcome implies that each time the action is performed we receive the same reward, and the uncertainty is what that value will be, which is governed by the prior distribution.)
Our main focus is to make progress on this important open problem of providing an optimal policy for this setting. To this end, we consider a somewhat more restricted setting, where each action has a finite support. If we assume that there are only two possible values, say , then the task becomes trivial. We can simply order the actions according to their expectations, and ask the agents to explore them in turn until we reach an action of value , and then recommend it forever. This would work even if we provide the agents with the realizations of the previous actions. In this work we take a small, yet significant, step away from this trivial model. We assume that the best a priori action has a larger support. For the most part we analyze the case that the support of the a priori best action is , while the other actions have support . We do extend our results to handle a more general setting of a continuous distribution with full support on for the a priori best action.
The simple model has a significant complexity and allows us to draw a few interesting insights. To understand the challenges, consider the case that the actions have a negative expected reward. (For simplicity, we assume that the actions are sorted by their expected reward, where action has the highest expectation.) In such a case, if the realization of action is , clearly the planner would recommend it for all the following agents. If the realization of action is , clearly any other action is superior to it. The challenging case, however, is when the realization of action is . In this case, the selfish agents would prefer to perform action with reward (since other actions have negative expected reward). The challenge to the social planner is to incentivize the agents to explore. The main idea is that of information asymmetry. When the planner recommends action , the agent is unsure whether the social planner observed that the outcome of action is , in which case the agent would like to perform it, or whether the social planner observed that the outcome is and asks the agent to explore. The social planner, by a delicate balancing of the exploration probability, can make the recommendation Bayesian incentive compatible.
Our main result is an optimal algorithm for the social planner when faced with actions, both for support and for the best a priori action. First, the algorithm makes sure that the BIC constraints are tight, which is a simple intuitive requirement and is clearly required for optimality. However, we need to exhibit much more refined properties to construct an optimal algorithm. An interesting issue regarding the exploration order is whether when we force a tight BIC constraint we might be forced to explore an action before we know the values of actions . We show that this is not the case in the optimal algorithm, namely, the exploration of action starts only after the social planner knows the realizations of all the better a priori actions, i.e., . While this seems like an intuitive outcome, it relies on the very delicate way in which our algorithm performs its randomization. (Recall that the recommendation algorithm uses randomization to balance between exploring and exploiting.) The implementation of the randomization is the second interesting property of our algorithm. In our randomization, we use a correlation between agents and actions. Specifically, the randomization selects for each action a random agent that might potentially explore it (if needed). Special care needs to be taken to make sure that for different actions we always select different agents.
We show that our algorithm not only maximizes the social welfare but also minimizes the exploration time, the time until the social planner no longer needs to explore. For the most part we assume that the number of agents is large enough that the social planner completes the exploration. We also show how to derive the optimal policy in the case of a limited number of agents.
As mentioned, the work of Kremer et al.  presented the model and derived the optimal policy for two deterministic actions. Mansour et al.  derive a tight asymptotic regret bound in the case of stochastic actions as well as a reduction from an arbitrary non-BIC policy to a BIC one. Bahar et al.  enrich the model by embedding the agents in a social network, and allowing them to observe their neighbors. Mansour et al.  extended the model to allow a multi-agent game in each time step, rather than a single agent. Mansour et al.  consider the case of two competing planners.
Frazier et al.  consider a model with monetary transfers, where the social planner can pay agents to explore. Che et al.  consider a setting with two binary-valued actions and continuous information flow and a continuum of agents. Finally, Slivkins  has an excellent overview of the topic.
A related topic is that of Bayesian Persuasion by Kamenica and Gentzkow  where the planner tries to infer a value of an “unobservable” state using interaction with multiple agents. See [4, 5, 6] for a more algorithmic perspective of Bayesian Persuasion.
Let be the set of possible actions. The prior distribution defines random variables for the rewards of actions . The reward of action , denoted by , is sampled from (it is sampled once, and any application of action yields the same reward ). The prior expected reward of action is , and for notational convenience we assume that .
In this work we focus on the case that the support of distribution of is (the case of support appears in Section 5). The support of distribution , for , is . We denote by , which implies that the distribution , for , has a single parameter, (and ). W.l.o.g. we assume that , otherwise the action has a constant reward of .
The interaction between the planner and the agents proceeds as follows. At time , the -th agent arrives, and the planner recommends to the -th agent action , which is called the recommended action. Given the recommended action , the -th agent selects an action , receives a reward , and leaves. Formally, the -th agent has a utility function and if action has been explored, else . A history at time , , contains all the previous chosen actions by the agents, i.e., , and their corresponding rewards, . A strategy for the planner is a recommendation policy, , where , where is the set of distributions over , i.e., . The value of is the probability that , i.e., .
A recommended action, , is Bayesian incentive-compatible (BIC) if for any action , we have . Such constraints are called BIC constraints. That is, there is no other action that can increase agent ’s expected reward, based on the prior , the policy , the recommended action , and the agent’s place in line , all of which the agent observes (note that the agent does not observe the history ). A recommendation policy for the planner, , is BIC if all its recommendations are BIC. Namely, for any agent and any history , the recommendation is BIC, i.e., for any action , we have .
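The BIC condition can be checked numerically on a small instance. The following is a minimal sketch, not the algorithm of this paper: it assumes, purely for illustration, two actions, reward values of -1, 0, +1 for the first action and -1, +1 for the second, and a hypothetical second-agent policy that explores with probability `q` when the first action's reward is 0. The names `is_bic` and `second_agent_policy` and all probabilities are assumptions. Actions are indexed from 0 in the code.

```python
from itertools import product

def is_bic(worlds, recommend, n_actions, tol=1e-12):
    """Check E[(X_a - X_b) * 1{recommendation = a}] >= 0 for all a != b.

    worlds: list of (probability, rewards) pairs, one per joint realization.
    recommend: maps a rewards tuple to a tuple of recommendation probabilities.
    """
    for a in range(n_actions):
        for b in range(n_actions):
            if a == b:
                continue
            gain = sum(p * recommend(r)[a] * (r[a] - r[b]) for p, r in worlds)
            if gain < -tol:
                return False
    return True

def second_agent_policy(q):
    """Hypothetical policy for the second agent; it depends only on X_1."""
    def recommend(rewards):
        x1 = rewards[0]
        if x1 == 1:        # best possible reward already found: exploit it
            return (1.0, 0.0)
        if x1 == -1:       # first action is the worst possible: try the second
            return (0.0, 1.0)
        return (1.0 - q, q)  # x1 == 0: explore the second action w.p. q
    return recommend

# Illustrative priors (assumed values, not taken from the paper).
x1_dist = {-1: 0.1, 0: 0.6, 1: 0.3}
x2_dist = {-1: 0.8, 1: 0.2}   # prior mean -0.6, i.e. negative
worlds = [(p1 * p2, (v1, v2))
          for (v1, p1), (v2, p2) in product(x1_dist.items(), x2_dist.items())]

print(is_bic(worlds, second_agent_policy(0.1), 2))
print(is_bic(worlds, second_agent_policy(0.2), 2))
```

In this toy instance the constraint for the second action reads 0.1 · 0.4 − 0.6 · q · 0.6 ≥ 0, so exploration probabilities up to 1/9 keep the recommendation BIC while larger ones do not.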
The social welfare is the expected cumulative reward of all the agents. The social welfare of a BIC recommendation policy is: ,
The Bayesian prior on the rewards is common knowledge to the planner as well as all the agents. W.l.o.g., we restrict the planner’s recommendation policy to be BIC, which assures that the agents follow the recommended actions. Our main goal is to design a BIC algorithm that maximizes social welfare (i.e., the cumulative reward of the agents).
3 Optimal BIC Algorithm for
We start with a simpler case that will have most of the ingredients of the more general case. We restrict the first action to have only three possible values , namely, the support of is . The second restriction is that we assume that there are only three actions, i.e., . The terminology is provided for -action settings, but some of the intuition and motivation are provided for three-action settings. The proofs appear in Appendix 0.A. The algorithm for the general case of actions, and some of its proofs, are in Appendix 0.B.
Given this special case, we claim that the challenging case is when . In the case that , we can simply recommend to the first agent action , i.e., . When we observe , then: (1) If , we recommend to all the agents action , i.e., . (2) If or , we recommend to the second agent action , i.e., . This is BIC since in this case. If we recommend to all the agents . Otherwise, , and we recommend to the third agent action , i.e., . Again, this is BIC since . Either way, all the agents after the first three will be performing the optimal action. The above policy maximizes social welfare even if we do not restrict the information flow, and the planner announces to the agents the actions’ realizations. In the case that , we can execute for the first two agents the above strategy, and essentially reduce the number of actions to two, for which the optimal policy was given by Kremer et al. . For this reason, we assume that . (And for actions, we assume .)
To build intuition we start with a simple example, in order to explain how a BIC policy can give a recommendation .
Consider a recommendation to agent . The reason for it is one of the following:
Exploitation driven recommendation: Action is the best action given the history. This can be due to one of the following cases:
A known reward: The planner already observed that , which is the maximum possible reward. From that time, the recommended action is , as it has the maximum possible reward.
An unknown reward: The observed realizations have the minimum possible reward, i.e., and maybe . Given this realization, we know that (and in case that , also ). This makes action the best action to execute, considering the history.
Exploration driven recommendation: The planner has not yet observed an action with the best possible reward (i.e., ), and observed . Since we assume that , such a recommendation would not benefit agent (but the planner is recommending it since it might benefit future agents).
Fortunately, the agents do not know the realizations of the actions’ rewards, hence cannot infer the reason for their recommendations. This is where the information asymmetry translates into an advantage for the planner, and enables her to maximize social welfare.
3.1 Information States
It would be very useful to partition the histories depending on the information that the planner has, regarding the realized values of the actions. Since we have only three actions, we have at most three realized values, and we can encode them in a vector of length three. We use the symbol to indicate that a value is still unknown. For example, implies that we know that , and we never explored the value of . Any history of the first agents which is compatible with is assigned to the information state . The recommendation to the -th agent would depend on the planner’s information state.
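The encoding of histories into information states can be sketched as follows. This is only an illustration: the name `information_state`, the use of `None` as the "unknown" marker, and the 0-based action indices are assumptions of the sketch. Because rewards are deterministic, a later observation of the same action simply repeats the earlier value.

```python
UNKNOWN = None  # stands in for the "value still unknown" symbol

def information_state(history, n_actions):
    """Collapse a history of (action, reward) pairs into an information state.

    Returns a tuple with one entry per action: the realized reward if the
    action has been explored, UNKNOWN otherwise.
    """
    state = [UNKNOWN] * n_actions
    for action, reward in history:
        state[action] = reward
    return tuple(state)

# Two different histories that carry the same information for the planner.
h1 = [(0, 0), (0, 0), (1, -1)]
h2 = [(0, 0), (1, -1), (0, 0), (0, 0)]
print(information_state(h1, 3))
print(information_state(h2, 3) == information_state(h1, 3))
```

Histories of different lengths map to the same state whenever they reveal the same rewards, which is what lets the recommendation depend on the state rather than on the full history.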
Note that the agents do not know the planner’s information state. However, given the recommendation , and the planner policy , they can deduce the probabilities of each state, conditioned on the recommendation they received. Those probabilities allow them to test whether the recommended action is indeed BIC, i.e., maximizes their expected reward given the information they observe.
Going back to example 1, we can now describe it using information states.
Consider a recommendation to agent , . Every possible reason for it can be one of the following:
States that result in exploitation driven recommendation, action either has:
A known reward: The planner has already observed action ’s reward and it is the maximum possible reward. I.e., the planner is in one of the following information states: , or .
An unknown reward: The only action with a better prior expected reward compared to action , action , has been explored and resulted in minimal reward (i.e., ). Action , which now has the best utility, has not yet been explored. We denote this state with . (An additional possible state is where the planner also observed that .)
The set of these exploitation states is denoted by , for the reason that following a recommendation for action in such states produces higher expected utility for agent compared to action .
States that may result in exploration driven recommendation: Action has not been explored yet, whereas . This implies that the planner is either in information state , or in information state .
The set of these exploration states is denoted by , for the reason that following a recommendation for action in such states produces lower expected utility for agent than selecting action .
3.2 The optimal BIC recommendation algorithm
Given the information states, we can describe the planner’s recommendation policy. The recommendation policy will map the information states to recommended actions. In the case of an “Exploration driven Recommendation” the mapping would be stochastic, to make sure that the incentives are maintained. Algorithm 3-actions is described in Table 1, defining what recommendation to give in each information state.
Algorithm 3-actions uses two functions, and , which control the exploration and are based on a mutual parameter , which will be selected uniformly at random in . The states are marked also as terminal states if there is a unique recommendation for all future agents, and exploration if the recommended action might not have the highest expected reward (). States not marked as exploration result in an exploitation driven recommendation, and are therefore exploitation states ().
Looking at Algorithm 3-actions in Table 1 might be intimidating, however, in most of the information states the recommendations are rather straightforward. In the initial information state, i.e., , the only BIC recommendation is action , since the first agent knows that the planner has no additional information beyond the prior. In any information state in which some , the planner recommends that action, the agents get the maximum reward, and the state does not change (i.e., terminal state). In any information state in which all the realized actions are , the planner recommends an unexplored action with the highest expected reward, the agents get the maximum expected reward, and after it the state does change to include the new explored action.
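The straightforward (non-stochastic) rules described above can be sketched as a small function. This is a hedged illustration rather than a transcription of Table 1: the reward values `max_reward=1` and `min_reward=-1`, the function name, the 0-based action indices and the example prior means are assumptions. The function returns `None` in the remaining states, where the recommendation is stochastic.

```python
UNKNOWN = None  # marker for an unexplored action

def straightforward_recommendation(state, prior_means,
                                   max_reward=1, min_reward=-1):
    """Return the recommended action in the 'easy' information states.

    state: tuple of realized rewards, UNKNOWN for unexplored actions.
    prior_means: prior expected reward of each action.
    Returns None when the state calls for stochastic exploration.
    """
    explored = [(a, r) for a, r in enumerate(state) if r is not UNKNOWN]
    # Terminal case: some explored action attains the maximum reward.
    for action, reward in explored:
        if reward == max_reward:
            return action
    # All explored actions (possibly none) realized the minimum reward:
    # recommend the unexplored action with the highest prior mean.
    if all(reward == min_reward for _, reward in explored):
        unexplored = [a for a, r in enumerate(state) if r is UNKNOWN]
        if not unexplored:
            return 0  # everything is explored and equally bad
        return max(unexplored, key=lambda a: prior_means[a])
    return None  # tension between exploring and exploiting

means = [0.2, -0.2, -0.6]  # illustrative, sorted from best to worst
print(straightforward_recommendation((UNKNOWN, UNKNOWN, UNKNOWN), means))
print(straightforward_recommendation((-1, UNKNOWN, UNKNOWN), means))
print(straightforward_recommendation((0, UNKNOWN, UNKNOWN), means))
```

The initial state is handled by the same rule as the "all realized rewards are minimal" case, since no action has been explored yet.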
The main challenge is in the cases that the realized value of action is and . In such information states we have a tension between the agent incentive, to perform action and maximize her expected reward, and the planner incentive to explore new actions to the benefit of future agents. Indeed we have two information states in which we explore stochastically, balancing between the incentives of the agent and making the recommendation BIC. In information state the planner explores with some probability action , and in information state the planner explores with some probability action .
We stress that the stochastic exploration is not done in an “independent” way, but rather in a coordinated way through the parameter , which is selected initially uniformly at random, and never changes. The property that we will have is that while we are in information state we eventually have an agent that explores action , and its index is . Similarly, while we are in information state we eventually have an agent that tries action , and its index is . We need to take special care to make sure that agent , which explores action , is different than agent , which explores action . (Clearly, each agent can explore at most one action.) This is why we use a coordinated sampling (to be defined later).
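One way to picture coordinated sampling is inverse-CDF sampling with a single shared uniform draw. The sketch below is an assumption about the general shape of such a scheme, not the definition used later in the paper: `explorer_index`, the variable `rho` and the rate vectors are hypothetical, and the rates are hand-picked so that the designated explorers of the two actions differ.

```python
from bisect import bisect_left
from itertools import accumulate

def explorer_index(rho, rates):
    """Map one shared draw rho in [0, 1] to the agent designated to explore.

    rates[t - 1] is a hypothetical probability mass that agent t explores
    the action; the masses are normalized and inverted like a CDF.
    Agents are numbered from 1.
    """
    total = sum(rates)
    cumulative = [c / total for c in accumulate(rates)]
    index = bisect_left(cumulative, rho)      # first t with CDF(t) >= rho
    return min(index, len(rates) - 1) + 1     # clamp against rounding

# Hypothetical exploration masses for two actions (not from the paper).
rates_action_2 = [0.0, 0.2, 0.3, 0.5]
rates_action_3 = [0.0, 0.0, 0.1, 0.4, 0.5]

for rho in (0.05, 0.4, 0.9):
    t2 = explorer_index(rho, rates_action_2)
    t3 = explorer_index(rho, rates_action_3)
    print(rho, t2, t3)
```

Because both indices are computed from the same `rho`, they are correlated rather than independent; in this example the explorer of the second action always precedes the explorer of the third. The designated agent actually explores only if the planner is still in the corresponding exploration state when that agent arrives.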
We also show that some information states are never reachable, namely, , and . This will be due to the fact that for any we will show that , which implies that we complete the exploration of action before exploring action . As we extend to actions, we use the same to coordinate between the stochastic exploration of all the actions. Then again, by showing that for any pair of actions , it holds that , we deduce that the order in which the actions are explored is from the a priori highest expected reward to the lowest, i.e., .
3.3 Exploration Rates
In this section we formalize the exploration rate that the planner can have. A BIC exploration rate, denoted by , measures the probability that a BIC recommendation is given when the planner is in some exploration state. Namely, for any BIC recommendation policy , the BIC exploration rate is , where is the probability that the planner is in at time , and recommends to explore action , assuming that all the recommendations until the current agent use .
Let denote a BIC recommendation policy that recommends actions based on Table 1 (or Table 2 for actions) and uses maximum BIC exploration rates for every agent and for every . Maximal BIC exploration rate, denoted by , is the maximum probability of exploration, subject to the BIC constraints, and bounded by the probability that the planner is in exploration state at time with as a recommended action. I.e., is the solution of:
The first constraint makes sure that is a BIC recommendation. Its first summand is a summation taken over each exploitation state probability, multiplied by the “gain” from choosing action instead of action in this state. The second summand is the “loss” of the agent, namely the prior expected reward of action (i.e., ), multiplied by the exploration rate and divided by the probability of the event (which includes also the exploration probability ). The terms “gain” and “loss” are from the agent’s perspective. By looking at Table 1, we can see that when is given in exploitation state, the expected utility difference is positive, therefore the agent has a “gain” of reward in these states. On the other hand, as we assume that , the agent has a “loss” of reward in the exploration states (all of which share ). When this entire expression is non-negative (i.e., the first constraint holds), it is BIC.
Notice that is defined as a BIC policy, and as such every recommendation is BIC, i.e., its BIC constraints must be met for every action . We argue that in , if the BIC constraint of action compared to action is satisfied, all the other BIC constraints for agent are met. Therefore, we only refer to the BIC constraint with respect to action when calculating . The reason is that for any pair of actions , and for every , we show that , i.e., the exploration of action is done before the exploration of action . Along with Table 1 that represents the recommendations of , we deduce that whenever a recommendation is given, the reward of action is either unknown (i.e., the expected reward is ) or has been observed and . Now, for any action such that , yields that has been observed and . As for every action , since yields that has not been sampled yet, and from the assumption that we know that . Either way .
The second constraint in (1) prevents the exploration rate from exceeding the probability that the planner is in exploration state (). This guarantees that we can actually use all of to give an exploration driven recommendation. Namely, .
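The two constraints can be combined into a closed-form bound in a simplified setting. The sketch below assumes that the agent's loss per unit of exploration probability equals the magnitude of the (negative) prior mean of the explored action, as described above, and that dividing by the probability of the recommendation event does not change the sign of the constraint. The function name, argument names and numbers are hypothetical.

```python
def max_bic_exploration_rate(exploit_terms, mu_j, prob_exploration_state):
    """Largest exploration rate satisfying both constraints.

    exploit_terms: list of (state_probability, gain) pairs, one per
        exploitation state in which action j is recommended.
    mu_j: prior expected reward of action j (assumed negative).
    prob_exploration_state: probability of being in the exploration state.

    BIC constraint:  sum(prob * gain) + mu_j * rate >= 0
    Feasibility:     rate <= prob_exploration_state
    """
    if mu_j >= 0:
        raise ValueError("this sketch assumes a negative prior mean")
    total_gain = sum(prob * gain for prob, gain in exploit_terms)
    bic_bound = total_gain / (-mu_j)
    return min(bic_bound, prob_exploration_state)

# Illustrative numbers: one exploitation state with probability 0.1 and
# gain 0.4, prior mean -0.6, exploration state probability 0.6.
print(max_bic_exploration_rate([(0.1, 0.4)], -0.6, 0.6))
# When the exploration state is unlikely, feasibility binds instead.
print(max_bic_exploration_rate([(0.1, 0.4)], -0.6, 0.05))
```

Whichever of the two bounds is smaller determines the rate: the BIC constraint is tight in the first call and the feasibility constraint in the second.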
Let denote the index of last agent that might explore action , i.e., . For convenience, for every agent and action , we denote
3.4 Computing the Maximum BIC Exploration Rates
Given , we have
And for action , given for and , assuming and , we have
In addition we show that .
The next lemma derives the value of (without an assumption on ).
For action , given for and , such that , we have
For every , the exploration rate of agent for action is strictly positive, i.e., .
The following lemmas relate the exploration rates and the parameters and .
For every action and agent and that , it holds that .
For every action , for every , it holds that
If , then it holds that for every , therefore and we stop.
3.5 Properties of the Exploration Rates
In this subsection we show properties of the exploration rates which later enable us to show that is well defined, that eventually reaches a terminal state, and finally, that it maximizes expected social welfare.
For action and for agent we have,
For action and agent we have,
Let . We show that is the total exploration rate of action .
For , the probability for exploration driven recommendation for any action is , i.e.,
3.6 Determining the explorers
We now explain how the algorithm chooses which agent will explore each action. Recall that the planner knows the history , and therefore knows the current state at time , as defined in Table 1. She then sets to be the corresponding recommendation for the current state in Table 1. Together with the policy parameters and the functions that we later define in Definitions 1 and 2, respectively, she returns as the recommended action.
A valid input for our algorithm is a triple such that:
is a real number that is sampled from a uniform distribution in.
indicates the agent number (the agent for which the algorithm is run).
is a set that contains exploration rates vectors for each action excluding action , such that (i.e, the exploration rate for agent with as recommended action).
We now define the functions that determine which agent will explore action .
Let be the function that maps a real number to an agent such that
Let be the function for action and agent that maps a real number to a recommendation for agent , i.e., , and is defined as follows:
The following lemma shows that different actions are explored by different agents, and that better a priori actions are explored always earlier.
For every , and for every action , .
Since , Lemma 7 implies the following corollaries.
For every action , it holds that
Action is explored before any action , making every state such that and (e.g., ) infeasible for every agent . Namely,
We finish this section by showing that is well-defined in a sense that every agent gets exactly one action as a recommendation in Theorem 3.2.
Recommendation policy is a well-defined recommendation policy, since for every , and for any pair of actions , , and there exists such that . This implies that every agent receives a recommendation for exactly one action.
4.1 Finite exploration
Clearly the flow between the information states is acyclic. From Table 1, when the planner is in a non-terminal state, she is exploring, with some probability. This implies that after there is no more exploration. Therefore, will eventually reach a terminal state and thus complete the exploration. From this we derive the following theorem:
In , as long as the planner has not observed an action with , she will keep exploring until all actions’ rewards are revealed. Therefore always reaches a terminal state.
4.2 Minimum exploration time
Two BIC planners may differ only in their recommendations when they are in the exploration states. We would prefer the one that explores the actions “faster”, as it would mean finding the optimal action sooner. For this we define a partial order between policies. We say that a policy is stochastic dominant over another if it discovers the realizations of the rewards faster.
A BIC policy algorithm is stochastic dominant over another BIC policy algorithm if for every prior and for every agent , has at least the same probability to observe action ’s reward as , and for some action a strictly higher probability to know its reward at time . I.e., for any agent and action we have , and there exists some action and agent for which .
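For a fixed prior, the dominance condition is a pointwise comparison of two tables of probabilities. The following sketch assumes those tables are given; how they are computed depends on the policies and is not shown. The name `stochastically_dominates` and the numbers are hypothetical.

```python
def stochastically_dominates(known_a, known_b, tol=1e-12):
    """Check the dominance condition for one fixed prior.

    known_x[t][j] is the probability that policy x has observed the
    reward of action j by agent t. Policy a dominates policy b if it is
    at least as likely to know every reward at every time, and strictly
    more likely for at least one (agent, action) pair.
    """
    pairs = [(pa, pb)
             for row_a, row_b in zip(known_a, known_b)
             for pa, pb in zip(row_a, row_b)]
    at_least = all(pa >= pb - tol for pa, pb in pairs)
    strictly = any(pa > pb + tol for pa, pb in pairs)
    return at_least and strictly

# Rows are agents, columns are actions (illustrative numbers only).
fast = [[1.0, 0.0], [1.0, 0.3], [1.0, 0.7]]
slow = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.7]]
print(stochastically_dominates(fast, slow))
print(stochastically_dominates(slow, fast))
print(stochastically_dominates(fast, fast))
```

The relation is a strict partial order: a policy does not dominate itself, and two policies may be incomparable when each is ahead at some (agent, action) pair. The definition additionally requires the condition to hold for every prior, which this single-prior check does not cover.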
The following lemma states that the suggested policy, is stochastic dominant over all other BIC policies.
Let be a BIC policy algorithm, with the same recommendations for the exploitation states as in Table 1. Then is stochastic dominant over .
From Lemma 8 we easily obtain that maximizes exploration rates of each action and agent . Due to the use of to decide which agent will explore each action, manages to maximize exploration rates of all the actions independently. This gives us an important result regarding :
minimizes the time until terminal state .
4.3 Maximum expected social welfare
In this section we present the main result: The best BIC policy is the one that minimizes exploration time for every action simultaneously.
Let be a BIC policy algorithm that maximizes the expected Social welfare. Then for a large number of agents (specifically, ), it holds that
From the above theorem we deduce the following corollary.
Recommendation policy maximizes social welfare for every
4.4 Limited number of agents
The planner’s goal is to maximize social welfare. If there is a limited number of agents, she cannot rely on the existence of the agent that balances the loss of social welfare (i.e., agent in the proof for Theorem 4.3). Our algorithm must be adjusted for that. A natural solution is to limit the recommendation for exploration, so that the planner gives an exploration driven recommendation for action in round only if the gain for the following agents is high enough to cover the expected loss of the -th agent, i.e., . We add the following requirement that must be fulfilled if the algorithm gives an exploration driven recommendation to agent . Namely
or alternatively .
Theorem 4.3’s proof still applies for any pair that meets the additional requirement. For pairs that do not meet the requirement, action is no longer recommended for exploration in round or afterwards. An exploration driven recommendation for these agents harms the social welfare.
5 Continuous distribution for the a priori best action’s reward
In this section we explore the same model with one significant difference. The prior distribution is now a continuous distribution that has full support of . (Note that we do not allow mass points.) Consider that the number of agents, , is large enough so that a social planner must complete the exploration of all the actions.
The different type of recommendation policy algorithm we introduce for this setting is a generalization of the partition policy, originally defined in Kremer et al. .
5.1 Partition policy as a recommendation algorithm
The following two definitions are used to define a partition policy (in Definition 6).
is a collection of disjoint sets, , where .
A valid input for any partition policy algorithm is a series s.t. for any pair of actions , it holds that for every agent .
Given a valid input, , and a realization , a partition policy is a recommendation policy that makes the following recommendations. For agent we have,
For we have .
If there is an explored action with a reward of (i.e., it is optimal), then .
Else, if then . (In this case agent is the first agent for whom .)
Let us inspect each clause in the above definition with regards to BIC and social welfare.
Since action is the a priori better action, any BIC policy must recommend to agent action (clause (1)).
After finding an explored action with value , to maximize the social welfare we must recommend it (clause (2)).
Clause (5) gives an exploitation recommendation. Note that any explored action , in this case, has .
Notice that a valid input, , ensures that every agent receives a recommendation for exactly one action. We can now derive the following lemma:
The optimal BIC recommendation policy is a partition policy.
5.2 The suggested BIC Partition Policy
Recall that agent finds that recommendation to be BIC if for any action we have
Note that this holds if and only if for any action
We now describe how to extract parameters for the suggested policy iteratively, given a prior for the problem. Then we continue by showing that these collections can be used as a valid input of a partition policy. Finally, we show that using these parameters produces a BIC recommendation policy.
The sets are calculated as follows.
Let be the ordered interval , where
For let (this can be done by setting for ).
For , recall that , and let be the solution to:
For every , let be the solution to:
for every . (We have .)
Notice that in each step, distribution , the parameters and are known, therefore one can compute the value of .
In the next lemma, we show that for every action and agent . This will allow us to deduce in Corollary 4 that is a collection of disjoint sets, which is required from a valid input for partition policy.
For every action and agent , it holds that .
For every action