In the classic literature on multi-armed bandits, an agent repeatedly selects one of a set of actions, each of which has a payoff drawn from an unknown fixed distribution. Over time, she can trade off exploitation, in which she picks an action to maximize her expected reward, with exploration, in which she takes potentially sub-optimal actions to learn more about their rewards. By coordinating her actions across time, she can guarantee an average reward which converges to that of the optimal action in hindsight at a rate proportional to the inverse square-root of the time horizon.
In many decision problems of interest, the actions are not chosen by a single agent, as above, but rather by a sequence of agents. This is particularly common in social learning settings such as online websites, where a population of users try to learn about the content of the site. In such settings, each agent will choose an exploitive action, as the benefits of explorative actions are only accrued by future agents. For example, in online retail, products are purchased by a sequence of customers, each of whom buys what she estimates to be the best available product. This behavior can cause herding, in which all agents eventually take a sub-optimal action of maximum expected payoff given the available information.
This situation can be circumvented by a centralized algorithm that induces agents to take explorative actions, an idea called incentivizing exploration. Such algorithms are often encountered in the form of recommendations and are quite common in practice. Many online websites, such as Amazon, Reddit, Yelp, and Tripadvisor, use recommendation policies of some sort to help users navigate their offerings. One way recommendation policies induce exploration is to introduce payments [17, 22, 15]. For example, the recommendation system of an online retailer might offer coupons to agents for trying certain products. When payments are financially or technologically infeasible, an alternative is to rely on information asymmetry [27, 14, 31, 11]. Here the idea is that the centralized algorithm, often called a disclosure policy, can choose to selectively release information about the past actions and rewards to the agents in the form of a message. For example, the recommendation system of an online retailer might disclose past reviews or product rankings to the agents. Importantly, agents cannot directly observe the past, but only learn about it through this message. The agent then chooses an action, using the content of the message as input.
Our scope. Prior work on incentivizing exploration, with or without monetary incentives, achieves much progress (more on this in “related work”), but relies heavily on the standard assumptions of Bayesian rationality and the “power to commit” (users trust that the principal actually implements the policy that it claims to implement). However, these assumptions appear quite problematic in the context of recommendation systems of actual online websites such as those mentioned above. In particular, much of the prior work suggests disclosure policies that merely recommend an action to each agent, without any other supporting information, and moreover recommend exploratory actions to some randomly selected users. This works out extremely well in theory, but it is very unclear whether users would even know some complicated policy of the principal, let alone trust the principal to implement the stated policy. Even if they do know the policy and trust that it was implemented as claimed, it’s unclear whether users would react to it rationally. Several issues are in play: to wit, whether the principal intentionally uses a different disclosure policy than the claimed one (because its incentives are not quite aligned with the users’), whether the principal correctly implements the policy that it wants to implement, whether the users trust the principal to make correct inferences on their behalf, and whether they find it acceptable that they may be singled out for exploration. Furthermore, regardless of how the users react to such disclosure policies, they may prefer not to be subjected to them, and leave the system.
We strive to design disclosure policies which mitigate these issues and (still) incentivize a good balance between exploration and exploitation. While some assumptions on human behavior are unavoidable, we are looking for a class of disclosure policies for which we can make plausible behavioral assumptions. Then we arrive at a concrete mathematical problem: design policies from this class so as to optimize performance, the induced explore-exploit tradeoff. Our goal in terms of performance is to approach the performance of the social planner.
Our model. For the sake of intuition, let us revisit the full-disclosure policy that reveals the full history of observations from the previous users. We interpret it as the “gold standard”: we posit that users would trust such a policy, even if they cannot verify it. Unfortunately, the full-disclosure policy is not good for our purposes, essentially because we expect users to exploit rather than explore. However, what if a disclosure policy reveals the outcomes for every tenth agent, rather than the outcomes for all agents? We posit that users would trust such a policy, too. Given a large volume of data, we posit that users would not be too unhappy with having access to only a fraction of this data. A crucial aspect of our intuition here is that the “subhistory” revealed to a given user comes from a subset of previous users that is chosen in advance, without looking at what happens during the execution. In particular, the subhistory is not “biased”, in the sense that the disclosure policy cannot subsample the observations in favor of a particular action.
With this intuition in mind, we define the class of unbiased-subhistory policies: disclosure policies that reveal, to each arriving agent $t$, a subhistory consisting of the outcomes for a subset $S_t$ of previous agents, where $S_t$ is chosen ahead of time. Further, we impose a transitivity property: if $s \in S_t$, for some previous agent $s$, then $S_s \subset S_t$. So, agent $t$ has all information that agent $s$ had at the time she chose her action. In particular, agent $t$ does not need to second-guess which message has caused agent $s$ to choose that action.
Following much of the prior work on incentivizing exploration, we do not attempt to model heterogeneous agent preferences and non-stationarity. Formally, we assume that the expected reward of taking a given action $a$, denoted $\mu_a$, is the same for all agents, and does not change over time. Then the crucial parameters of interest, for a given action $a$, are the number of samples $N_a$ and the empirical mean reward $\bar{\mu}_a$ in the observed subhistory. We consider a flexible model of agent response: for each action $a$, an agent forms an estimate of the mean reward $\mu_a$, roughly following $\bar{\mu}_a$ but taking into account the uncertainty due to a small number of samples, and chooses an action with a largest reward estimate. We allow the reward estimates to be arbitrary otherwise, and not known to the principal.
Regret. We measure the performance of a disclosure policy in terms of regret, a standard notion from the literature on multi-armed bandits. Regret is defined as the difference in the total expected reward between the best fixed action and actions induced by the policy. Regret is typically studied as a function of the time horizon $T$, which in our model is the number of agents. For multi-armed bandits, $o(T)$ regret bounds are deemed non-trivial, and $\tilde{O}(\sqrt{T})$ regret bounds are optimal in the worst case. Regret bounds that depend on a particular problem instance are also considered. A crucial parameter then is the gap $\Delta$, the difference between the best and second best expected reward. One can achieve $O(\frac{1}{\Delta} \log T)$ regret rate, without knowing $\Delta$.
Our results and discussion. Our main result is a transitive, unbiased-subhistory policy which attains near-optimal $\tilde{O}(\sqrt{T})$ regret rate for a constant number of actions. This policy also obtains the optimal instance-dependent regret rate $\tilde{O}(1/\Delta)$ for problem instances with gap $\Delta$, without knowing $\Delta$ in advance. In particular, we match the regret rate achieved for incentivizing exploration with unrestricted disclosure policies.
The main challenge is that the agents still follow exploitation-only behavior, just like they do for the full-disclosure policy, albeit based only on a portion of history. A disclosure policy controls the flow of information (who sees what), but not the content of that information.
The first step is to obtain any substantial improvement over the full-disclosure policy. We accomplish this with a relatively simple policy which runs the full-disclosure policy “in parallel” on several disjoint subsets of agents, collects all data from these runs and discloses it to all remaining agents. In practice, these subsets may correspond to multiple “focus groups”. While any single run of the full-disclosure policy may get stuck on a suboptimal arm, having these parallel runs ensures that sufficiently many of them will “get lucky” and provide some exploration. This simple policy achieves $\tilde{O}(T^{2/3})$ regret. Conceptually, it implements a basic bandit algorithm that explores uniformly for a pre-set number of rounds, then picks one arm for exploitation and stays with it for the remaining rounds. We think of this policy as having two “levels”: Level 1 contains the parallel runs, and Level 2 is everything else.
The next step is to implement adaptive exploration, where the exploration schedule is adapted to previous observations. This is needed to improve over the $\tilde{O}(T^{2/3})$ regret. As a proof of concept, we focus on the case of two actions, and upgrade the simple two-level policy with a middle level. The agents in this new level receive the data collected in some (but not all) runs from the first level. What happens is that these agents explore only if the gap $\Delta$ between the best and second-best arm is sufficiently small, and exploit otherwise. When $\Delta$ is small, the runs in the first level do not have sufficient time to distinguish the two arms before herding on one of them. However, for each of these arms, there is some chance that it has an empirical mean reward significantly above its actual mean while the other arm has empirical mean reward significantly below its actual mean in any given first-level run. The middle-level agents observing such runs will be induced to further explore that arm, collecting enough samples for the third-level agents to distinguish the two arms. The main result extends this construction to multiple levels, connected in fairly intricate ways, obtaining optimal regret of $\tilde{O}(\sqrt{T})$.
Related work. The problem of incentivizing exploration via information asymmetry was introduced in , under the Bayesian rationality and the (implicit) power-to-commit assumptions. The original problem – essentially, a version of our model with unrestricted disclosure policies – was largely resolved in  and the subsequent work [31, 33]. Several extensions were considered: to contextual bandits , to repeated games , and to social networks .
Several other papers study related, but technically different models: same model with time-discounted utilities ; a version with monetary incentives  and moreover with heterogeneous agents ; a version with a continuous information flow and a continuum of agents ; coordination of costly exploration decisions when they are separate from the “payoff-generating” decisions [26, 29, 30]. Scenarios with long-lived, exploring agents and no principal to coordinate them have been studied in [12, 24].
The full-disclosure policy, and the closely related “greedy” (exploitation-only) algorithm in multi-armed bandits, have been the subject of a recent line of work [37, 23, 8, 36]. A common theme is that the greedy algorithm performs well in theory, under substantial assumptions on heterogeneity of the agents. Yet, it suffers $\Omega(T)$ regret in the worst case. (This is a well-known folklore result; see  for a concrete example.)
Exploration-exploitation tradeoff received much attention over the past decades, usually under the rubric of “multi-armed bandits”; see [13, 19] for background. Exploration-exploitation problems with incentives issues arise in several other scenarios: dynamic pricing [25, 10, 6], dynamic auctions [1, 9, 21], pay-per-click ad auctions [5, 16, 4], and human computation [20, 18, 38].
2 Model and Preliminaries
We study the multi-armed bandit problem in a social learning context, in which a principal faces a sequence of $T$ myopic agents. There is a set $\mathcal{A}$ of $K$ possible actions, a.k.a. arms. At each round $t \in [T]$, a new agent $t$ arrives, receives a message $M_t$ from the principal, chooses an arm $a_t \in \mathcal{A}$, and collects a reward $r_t \in \{0, 1\}$ that is immediately observed by the principal. The reward from pulling an arm $a \in \mathcal{A}$ is drawn independently from a Bernoulli distribution with an unknown mean $\mu_a$. An agent does not observe anything from the previous rounds, other than the message $M_t$. The problem instance is defined by (known) parameters $K, T$ and the (unknown) tuple of mean rewards, $(\mu_a : a \in \mathcal{A})$. We are interested in regret, defined as
$\mathrm{Reg}(T) = T \max_{a \in \mathcal{A}} \mu_a - \sum_{t \in [T]} \mathbb{E}[\mu_{a_t}].$
(The expectation is over the chosen arms $a_t$, which depend on randomness in rewards, and possibly in the algorithm.) The principal chooses messages according to an online algorithm called a disclosure policy, with the goal of minimizing regret. We assume that mean rewards are bounded away from $0$ and $1$, to ensure sufficient entropy in rewards. For concreteness, we posit $\mu_a \in [\frac{1}{3}, \frac{2}{3}]$.
Unbiased subhistories. The subhistory for a subset of rounds $S \subset [T]$ is defined as
$H_S = \{ (s, a_s, r_s) : s \in S \}.$
Accordingly, $H_{[t-1]}$ is called the full history at time $t$. The outcome for agent $t$ is the tuple $(t, a_t, r_t)$.
We focus on disclosure policies of a particular form, where the message in each round $t$ is $M_t = H_{S_t}$ for some subset $S_t \subset [t-1]$. We assume that the subset $S_t$ is chosen ahead of time, before round $1$ (and therefore does not depend on the observations $H_{[t-1]}$). Such a message is called an unbiased subhistory, and the resulting disclosure policy is called an unbiased-history policy.
Further, we focus on disclosure policies that are transitive, in the following sense:
$s \in S_t \Rightarrow S_s \subset S_t$ for all rounds $s, t \in [T]$.
In words, if agent $t$ observes the outcome for some previous agent $s$, then she observes the entire message revealed to that agent. In particular, agent $t$ does not need to second-guess which message has caused agent $s$ to choose action $a_s$.
A transitive unbiased-history policy can be represented as an undirected graph, where nodes correspond to rounds, and any two rounds $s < t$ are connected if and only if $s \in S_t$ and there is no intermediate round $u$ with $s \in S_u$ and $u \in S_t$. This graph is henceforth called the information flow graph of the policy, or info-graph for short. We assume that this graph is common knowledge.
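The transitivity condition and the info-graph construction can be checked mechanically. The following is a minimal sketch, not taken from the paper: it assumes the policy is given as a dictionary `subsets` mapping each round to the set of earlier rounds disclosed to that round's agent, and the function names `is_transitive` and `info_graph_edges` are hypothetical.

```python
def is_transitive(subsets):
    """Check transitivity: if agent t sees round s, she sees everything s saw.

    `subsets[t]` is the (assumed) set of earlier rounds disclosed to agent t.
    """
    return all(subsets[s] <= subsets[t] for t in subsets for s in subsets[t])


def info_graph_edges(subsets):
    """Return the edges (s, t), s < t, of the info-graph.

    Rounds s and t are connected iff t sees s and no intermediate round u
    (seen by t) already relays s to t.
    """
    edges = set()
    for t, seen in subsets.items():
        for s in seen:
            relayed = any(s in subsets[u] for u in seen if u != s)
            if not relayed:
                edges.add((s, t))
    return edges


# Full disclosure on four rounds: every agent sees all earlier rounds.
full = {1: set(), 2: {1}, 3: {1, 2}, 4: {1, 2, 3}}
print(is_transitive(full), sorted(info_graph_edges(full)))
```

For the full-disclosure example, the resulting edge set is a simple path through the rounds in order, which is why transitive closure of the graph recovers who sees what.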
Agents’ behavior. Let us define agents’ behavior in response to an unbiased-history policy. We posit that each agent uses its observed subhistory to form a reward estimate for each arm , and chooses an arm with a maximal estimator. (Ties are broken according to an arbitrary rule that is the same for all agents.) The basic model is that is the sample average for arm over the subhistory , as long as it includes at least one sample for ; else, .
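As an illustration of the basic model only, the sketch below computes per-arm sample averages over a subhistory and picks an arm with a maximal estimate. The subhistory is assumed to be a list of `(round, arm, reward)` tuples; the fallback value `default` for an arm with no samples and the lowest-index tie-breaking rule are assumptions made for the example, not values taken from the text.

```python
from collections import defaultdict


def sample_averages(subhistory, arms, default=0.5):
    """Sample-average reward estimate for each arm over a subhistory.

    `default` (an assumption of this sketch) is used for arms with no samples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _round, arm, reward in subhistory:
        totals[arm] += reward
        counts[arm] += 1
    return {a: (totals[a] / counts[a] if counts[a] else default) for a in arms}


def greedy_choice(subhistory, arms, default=0.5):
    """Choose an arm with a maximal estimate; ties go to the earliest arm in `arms`."""
    estimates = sample_averages(subhistory, arms, default)
    return max(arms, key=lambda a: estimates[a])


history = [(1, 0, 1), (2, 0, 0), (3, 1, 1)]
print(greedy_choice(history, arms=[0, 1]))
```

The tie-breaking rule is fixed across agents here because `max` returns the first maximal element in the order of `arms`.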
We allow a much more permissive model that allows agents to form arbitrary reward estimates as long as they lie within some “confidence range” of the sample average. Formally, the model is characterized by the following assumptions (which we make without further notice).
Reward estimates are close to empirical averages. Let and denote the number of pulls and the empirical mean reward of arm in subhistory . Then for some absolute constant and , and for all agents and arms it holds that
Also, if . (NB: we make no assumption if .)
In each round , the estimates depend only on the multiset , called anonymized subhistory. Each agent forms its estimates according to an estimate function from anonymized subhistories to , so that the estimate vector equals . This function is drawn from some fixed distribution over estimate functions.
The first assumption ensures that the reward distributions have sufficient entropy to induce natural exploration. We choose the bounded range $[\frac{1}{3}, \frac{2}{3}]$ for the simplicity of our analysis, and it can be further relaxed to $[\epsilon, 1 - \epsilon]$ for any constant $\epsilon > 0$. The second assumption says that the estimates computed by the agents are well-behaved, and are close to the empirical estimates given by the subhistory, provided that the number of observations is sufficiently large.
We model agents with heterogeneity in their arm selections. In particular, there is an unknown distribution over the set of agent estimators satisfying Assumption 2. Each agent independently draws an estimator from this distribution, uses it to calculate the mean reward estimates for every arm , and then chooses the arm with the highest estimate.
We emphasize that the agents we consider in this paper are frequentists. Thus their estimators, which determine their behavior, take samples as inputs and not priors. The estimators satisfying Assumption 2 include that of the natural greedy frequentist, who always pulls the arm with the highest empirical mean.
Connection to multi-armed bandits. The special case when each message $M_t$ is an arm, and the $t$-th agent always chooses this arm, corresponds to a standard multi-armed bandit problem with IID rewards. Thus, regret in our problem can be directly compared to regret in the bandit problem with the same mean rewards. Following the literature on bandits, we define the gap parameter $\Delta$ as the difference between the largest and second largest mean rewards. (Formally, the second-largest mean reward is $\max_{a :\, \mu_a < \mu^*} \mu_a$, where $\mu^* = \max_{a} \mu_a$.) The gap parameter is not known to the principal (in our problem), or to the algorithm (in the bandit problem). Optimal regret rates for bandits with IID rewards are as follows [2, 3, 28]:
$O\left( \min\left( \sqrt{K T \log T},\ \frac{K}{\Delta} \log T \right) \right).$
This regret bound can only be achieved using adaptive exploration, i.e., when the exploration schedule is adapted to the observations. A simple example of non-adaptive exploration is the explore-then-exploit algorithm, which samples arms uniformly at random for the first $N$ rounds, for some pre-set number $N$, then chooses one arm and sticks with it till the end. More generally, exploration-separating algorithms have the property that in each round $t$, either the choice of an arm does not depend on the observations so far, or the reward collected in this round is not used in the subsequent rounds. Any such algorithm suffers from $\Omega(T^{2/3})$ regret in the worst case. (The first explicit reference we know of is [5, 16], but this fact has been known in the community for much longer.)
Preliminaries. We assume that is constant, and focus on the dependence on . However, we explicitly state the dependence on , using the notation.
Throughout the paper, we use the standard concentration and anti-concentration inequalities: respectively, Chernoff Bounds and the Berry-Esseen Theorem. The former states that a sum of independent random variables converges to its expectation quickly. The latter states that the CDF of an appropriately scaled average of IID random variables converges to the CDF of the standard normal distribution pointwise. In particular, the average strays far enough from its expectation with some guaranteed probability. The theorem statements are moved to Appendix A.
We use the notion of reward tape to simplify the application of (anti-)concentration inequalities. This is a random $K \times T$ matrix with rows and columns corresponding to arms and rounds, respectively. For each arm $a$ and round $t$, the value in cell $(a, t)$ is drawn independently from the Bernoulli distribution with mean $\mu_a$. W.l.o.g., rewards in our model are defined by the reward tape: namely, the reward for the $j$-th pull of arm $a$ is taken from the $(a, j)$-th entry of the reward matrix.
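The reward tape is easy to mirror in a simulation. The sketch below is illustrative only: `make_reward_tape` and `TapeReader` are hypothetical names, and the seed and mean rewards are arbitrary. The point it shows is that the reward of the j-th pull of an arm is fixed in advance by the tape, no matter in which round that pull happens.

```python
import random


def make_reward_tape(means, horizon, seed=0):
    """Pre-draw a (number of arms) x horizon matrix of Bernoulli rewards."""
    rng = random.Random(seed)
    return [[1 if rng.random() < mu else 0 for _ in range(horizon)]
            for mu in means]


class TapeReader:
    """Serve rewards from the tape: the j-th pull of arm a reads cell (a, j)."""

    def __init__(self, tape):
        self.tape = tape
        self.pulls = [0] * len(tape)  # number of pulls so far, per arm

    def pull(self, arm):
        reward = self.tape[arm][self.pulls[arm]]
        self.pulls[arm] += 1
        return reward


tape = make_reward_tape([0.4, 0.6], horizon=10, seed=1)
reader = TapeReader(tape)
first_three = [reader.pull(1) for _ in range(3)]
print(first_three == tape[1][:3])
```

Because the tape is drawn before any agent acts, statements about "the first n pulls of arm a" concern a fixed prefix of a row, which is what makes concentration bounds straightforward to apply.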
We use $O_K(\cdot)$ notation to hide the dependence on parameter $K$, and $\tilde{O}(\cdot)$ notation to hide polylogarithmic factors. We denote $[T] = \{1, \ldots, T\}$.
3 Warm-up: full-disclosure paths
We first consider a disclosure policy that reveals the full history in each round $t$, i.e., $M_t = H_{[t-1]}$; we call it the full-disclosure policy. The info-graph for this policy is a simple path. We use this policy as a “gadget” in our constructions. Hence, we formulate it slightly more generally:
A subset of rounds $S \subset [T]$ is called a full-disclosure path in the info-graph if the induced subgraph is a simple path, and it connects to the rest of the graph only through the terminal node $\max(S)$, if at all.
We prove that for a constant number of arms, with constant probability, a full-disclosure path of constant length suffices to sample each arm at least once. We will build on this fact throughout.
There exist numbers and that depend only on , the number of arms, with the following property. Consider an arbitrary disclosure policy, and let be a full-disclosure path in its info-graph, of length . Under Assumption 2, with probability at least , subhistory contains at least one sample of each arm .
Proof idea. The main idea of the proof is to consider the case when all pulled arms have bad realized rewards (i.e., 0). ∎
We provide a simple disclosure policy based on full-disclosure paths. The policy follows the “explore-then-exploit” paradigm. The “exploration phase” comprises the first rounds, and consists of full-disclosure paths of length each, where is a parameter. In the “exploitation phase”, each agent receives the full subhistory from exploration, . The info-graph for this disclosure policy is shown in Figure 1.
The info-graph has two “levels”, corresponding to exploration and exploitation. Accordingly, we call this policy the two-level policy. We show that it incentivizes the agents to perform non-adaptive exploration, and achieves a regret rate of . The key idea is that since one full-disclosure path collects one sample of a given arm with constant probability, using many full-disclosure paths “in parallel” ensures that sufficiently many samples of this arm are collected.
The two-level policy with parameter achieves regret
For a constant , the number of arms, we match the optimal regret rate for non-adaptive multi-armed bandit algorithms. If the gap parameter is known to the principal, then (for an appropriate tuning of parameter ) we can achieve regret .
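The two-level policy can be simulated in a few lines. The sketch below is an illustration under simplifying assumptions, not the construction analyzed in the paper: every agent is the greedy frequentist using plain sample averages, an arm with no samples gets a hypothetical default estimate of 0.5, ties go to the lowest-index arm, and the names `n_paths` and `path_len` stand in for the policy's parameters. It returns the realized sequence of arms and the resulting regret for one run.

```python
import random


def greedy(history, n_arms, default=0.5):
    """Greedy frequentist: pick the arm with the highest sample average."""
    sums, counts = [0.0] * n_arms, [0] * n_arms
    for arm, reward in history:
        sums[arm] += reward
        counts[arm] += 1
    est = [sums[a] / counts[a] if counts[a] else default for a in range(n_arms)]
    return max(range(n_arms), key=lambda a: est[a])


def two_level_policy(means, horizon, n_paths, path_len, seed=0):
    """Simulate one run; return (arms chosen per round, regret of the run)."""
    if n_paths * path_len > horizon:
        raise ValueError("exploration phase is longer than the horizon")
    rng = random.Random(seed)
    exploration = []
    for _ in range(n_paths):
        path = []  # each full-disclosure path starts from an empty history
        for _ in range(path_len):
            arm = greedy(path, len(means))
            path.append((arm, 1 if rng.random() < means[arm] else 0))
        exploration.extend(path)
    arms = [arm for arm, _ in exploration]
    # Exploitation: every remaining agent sees the same exploration subhistory.
    exploit_arm = greedy(exploration, len(means))
    arms += [exploit_arm] * (horizon - len(exploration))
    regret = sum(max(means) - means[a] for a in arms)
    return arms, regret


arms, regret = two_level_policy([0.4, 0.6], horizon=200, n_paths=10, path_len=5)
print(len(arms), round(regret, 2))
```

Since all exploitation-level agents receive an identical subhistory and use the same deterministic rule, they all pull the same arm; the regret of that level is therefore either zero or linear in its length, depending on whether exploration identified the best arm.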
The proofs can be found in Appendix B. One important quantity is the expected number of samples of a given arm collected by a full-disclosure path of length (i.e., present in the corresponding subhistory). Indeed, this number, denoted , is the same for all such paths. Then,
Suppose the info-graph contains full-disclosure paths of rounds each. Let be the number of samples of arm collected by all paths. Then with probability at least , for all ,
4 Adaptive exploration with a three-level disclosure policy
The two-level policy from the previous section implements the explore-then-exploit paradigm using a basic design with parallel full-disclosure paths. The next challenge is to implement adaptive exploration, and go below the $T^{2/3}$ barrier. We accomplish this using a construction that adds a middle level to the info-graph. This construction also provides intuition for the main result, the multi-level construction presented in the next section. For simplicity, we assume $K = 2$ arms.
For the sake of intuition, consider the framework of bandit algorithms with limited adaptivity . Suppose a bandit algorithm outputs a distribution over arms in each round , and the arm is then drawn independently from . This distribution can change only in a small number of rounds, called adaptivity rounds, that need to be chosen by the algorithm in advance. A single round of adaptivity corresponds to explore-then-exploit paradigm. Our goal here is to implement one extra adaptivity round, and this is what the middle level accomplishes.
The three-level policy is defined as follows. The info-graph consists of three levels: the first two correspond to exploration, and the third implements exploitation. Like in the two-level policy, the first level consists of multiple full-disclosure paths of length each, and each agent in the exploitation level sees full history from exploration (see Figure 2).
The middle level consists of disjoint subsets of agents each, called second-level groups. Each second-level group has the following property:
All nodes in the group are connected to the same nodes outside of the group, but not to one another. (4)
The full-disclosure paths in the first level are also split into disjoint subsets, called first-level groups. Each first-level group consists of full-disclosure paths, for the total of rounds in the first layer. There is a 1-1 correspondence between first-level groups and second-level groups , whereby each agent in observes the full history from the corresponding group . More formally, agent in is connected to the last node of each full-disclosure path in . In other words, this agent receives message , where is the set of all rounds in .
The key idea is as follows. Consider the gap parameter . If it is large, then each first-level group produces enough data to determine the best arm with high confidence, and so each agent in the upper levels chooses the best arm. If is small, then due to anti-concentration each arm gets “lucky” within at least one first-level group, in the sense that it appears much better than the other arm based on the data collected in this group (and therefore this arm gets explored by the corresponding second-level group). To summarize, the middle level exploits if the gap parameter is large, and provides some more exploration if it is small.
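To make the structure of the three-level info-graph concrete, the sketch below builds, for every round, the set of earlier rounds disclosed to that round's agent. It is a hedged illustration: the parameter names (`n_groups`, `paths_per_group`, `path_len`, `group2_size`) are hypothetical stand-ins for the policy's parameters, and placing the first level before the second in time is an assumption of the example.

```python
def three_level_subsets(n_groups, paths_per_group, path_len, group2_size, horizon):
    """Map each round t to the set of earlier rounds whose outcomes agent t sees."""
    subsets, t = {}, 1
    first_level = []  # rounds of each first-level group
    for _ in range(n_groups):
        group_rounds = []
        for _ in range(paths_per_group):
            path = []
            for _ in range(path_len):
                subsets[t] = set(path)  # full disclosure within the path
                path.append(t)
                t += 1
            group_rounds.extend(path)
        first_level.append(group_rounds)
    exploration = [r for group in first_level for r in group]
    for group_rounds in first_level:
        # Second-level group: sees its own first-level group, not its peers.
        for _ in range(group2_size):
            subsets[t] = set(group_rounds)
            exploration.append(t)
            t += 1
    while t <= horizon:
        subsets[t] = set(exploration)  # third level sees all of exploration
        t += 1
    return subsets


subsets = three_level_subsets(2, 2, 3, 4, horizon=30)
print(sorted(subsets[13]), len(subsets[21]))
```

In this layout each second-level agent is attached to exactly one first-level group, so a group in which one arm got "lucky" influences only its own second-level group, while the third level pools everything.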
For two arms, the three-level policy achieves regret
This is achieved with parameters , , and .
Let us sketch the proof of this theorem; the full proof can be found in Appendix C.
The “good events”. We establish four “good events” each of which occurs with high probability.
Exploration in Level 1: Every first-level group collects at least samples of each arm.
Concentration in Level 1: Within each first-level group, empirical mean rewards of each arm concentrate around .
Anti-concentration in Level 1: For each arm, some first-level subgroup collects data which makes this arm look much better than its actual mean and other arms look worse than their actual means.
Concentration in prefix: The empirical mean reward of each arm concentrates around in any prefix of its pulls. (This ensures accurate reward estimates in exploitation.)
The analysis of these events applies Chernoff Bounds to a suitable version of “reward tape” (see the definition of “reward tape” in Section 2). For example, considers a reward tape restricted to a given first-level group.
Case analysis. We now proceed to bound the regret conditioned on the four “good events”. W.l.o.g., assume . We break down the regret analysis into four cases, based on the magnitude of the gap parameter . As a shorthand, denote . In words, this is a confidence term, up to constant factors, for independent random samples.
The simplest case is very small gap, which trivially yields an upper bound on regret.
[Negligible gap] If then .
Another simple case is when is sufficiently large, so that the data collected in any first-level group suffices to determine the best arm. The proof follows from and .
[Large gap] If then all agents in the second and the third levels pull arm 1.
In the medium gap case, the data collected in a given first-level group is no longer guaranteed to determine the best arm. However, agents in the third level see the history of not only one but all first-level groups and the data collected by all first-level groups enables agents in the third level to correctly identify the best arm.
[Medium gap] All agents pull arm 1 in the third level, when satisfies
Finally, the small gap case, when is between and is more challenging since even aggregating the data from all first-level groups is not sufficient for identifying the best arm. We need to ensure that both arms continue to be explored in the second level. To achieve this, we leverage , which implies that each arm has a first-level group where it gets “lucky”, in the sense that its empirical mean reward is slightly higher than , while the empirical mean reward of the other arm is slightly lower than its true mean. Since the deviations are in the order of , and Assumption 2 guarantees the agents’ reward estimates are also within of the empirical means, the sub-history from this group ensures that all agents in the respective second-level group prefer arm . Therefore, both arms are pulled at least times in the second level, which in turn gives the following guarantee:
[Small gap] All agents pull arm 1 in the third level, when satisfies
Wrapping up: proof of Section 4.
In negligible gap case, the stated regret bound holds regardless of what the algorithm does. In the large gap case, the regret only comes from the first level, so it is upper-bounded by the total number of agents in this level, which is . In both intermediate cases, it suffices to bound the regret from the first and second levels, so
Therefore, we obtain the stated regret bound in all cases.
Wlog we assume as the recommendation policy is symmetric to both arms. We do a case analysis based on .
Before we start with the case analysis, we first define several clean events and show that the intersection of them happens with high probability.
Concentration of the number of arm pulls in the first level: By Lemma 3, we know . For the -th first-level group, define to be the event that the number of arm pulls in the -th first-level group is between and . By Chernoff bound, For the -th first-level group, define to be the event that the number of arm pulls in the -th first-level group is between and . By Lemma 3
Let be the intersection of all these events (i.e. ). By union bound, we have
Concentration of the empirical mean of arm pulls in the first level: For each first-level group and arm , imagine there is a tape of enough arm pulls sampled before the recommendation policy starts and these samples are revealed one by one whenever agents in this group pull arm . For the -th first-level group and arm , define to be the event that the mean of -th to -th pulls in the tape is at most away from . By Chernoff bound,
Define to be the intersection of all these events (i.e. ). By union bound, we have
Concentration of the empirical mean of arm pulls in the first two levels:
For all the groups in the first two levels and arm , imagine there is a tape of enough arm pulls sampled before the recommendation policy starts and these samples are revealed one by one whenever agents in the first two levels pull arm . Define to be the event that the mean of the first pulls in the tape is at most away from . By Chernoff bound,
Define to be the intersection of all these events (i.e. ). By union bound, we have
Anti-concentration of the empirical mean of arm pulls in the first level:
Consider the tapes defined in the second bullet again. For the -th first-level group and arm , define to be the event that first pulls of arm in the corresponding tape have empirical mean at least and define to be the event that first pulls of arm in the corresponding tape have empirical mean at most . By Berry-Esseen Theorem and , we have
The last inequality follows when is larger than some constant. Similarly we also have
Since is independent of , we have
Now define as . Notice that are independent across different ’s. By union bound, we have
By union bound, the intersection of these clean events (i.e. ) happens with probability . When this intersection does not happen, since the probability is , it contributes to the expected regret.
Now we assume the intersection of clean events happens and we summarize what these clean events imply.
For the -th first-level group and arm , define to be the empirical mean of arm pulls in this group. , for together imply that
The last inequality holds when is larger than some constant.
For each arm , define to be the empirical mean of arm pulls in the first two levels. for and for together imply that
The last inequality holds when is larger than some constant.
If there are at least pulls of arm in the first two levels,
For each , implies that there exists such that and happen. , , for and for together imply that
The second last inequality holds when is larger than some constant. Similarly, we also have
Finally we proceed to the case analysis. We give upper bounds on the expected regret conditioned on the intersection of clean events.
. In this case, we want to show that agents in the second and the third levels all pull arm 1.
First consider the -th second-level group. We know that
For any agent in the -th second-level group, by Assumption 2, we have
Therefore, we know agents in the -th second-level group will all pull arm 1.
Now consider the agents in the third level group. Recall is the empirical mean of arm in the history they see. We have
Similarly as above, by Assumption 2, we know for any agent in the third level. So we know agents in the third-level group will all pull arm 1. Therefore the expected regret is at most .
. In this case, we want to show agents in the third level all pull arm 1. Recall is the empirical mean of arm in the first two levels. We have
For any agent in the third level, by Assumption 2, we have
So we know agents in the third-level group will all pull arm 1. Therefore the expected regret is at most
. In this case, we just need to make sure that agents in the third level all pull arm 1. To do so, we need both arms to be pulled at least rounds in the second level.
Now consider the -th second-level group. We have
For any agent in the -th second-level group, by Assumption 2, we have