Bayesian Exploration with Heterogeneous Agents

02/19/2019 ∙ by Nicole Immorlica et al. ∙ University of Minnesota, Microsoft

It is common in recommendation systems that users both consume and produce information as they make strategic choices under uncertainty. While a social planner would balance "exploration" and "exploitation" using a multi-armed bandit algorithm, users' incentives may tilt this balance in favor of exploitation. We consider Bayesian Exploration: a simple model in which the recommendation system (the "principal") controls the information flow to the users (the "agents") and strives to incentivize exploration via information asymmetry. A single round of this model is a version of the well-known "Bayesian Persuasion game" of (Kamenica and Gentzkow, 2011). We allow heterogeneous users, relaxing a major assumption from prior work that users have the same preferences from one time step to another. The goal is now to learn the best personalized recommendations. One particular challenge is that it may be impossible to incentivize some of the user types to take some of the actions, no matter what the principal does or how much time she has. We consider several versions of the model, depending on whether and when the user types are reported to the principal, and design a near-optimal "recommendation policy" for each version. We also investigate how the model choice and the diversity of user types impact the set of actions that can possibly be "explored" by each type.


1. Introduction

Recommendation systems are ubiquitous in online markets (Netflix for movies, Amazon for products, Yelp for restaurants, etc.), high-quality recommendations being a crucial part of their value proposition. A typical recommendation system encourages its users to submit feedback on their experiences, and aggregates this feedback in order to provide better recommendations in the future. Each user plays a dual role: she consumes information from the previous users (indirectly, via recommendations), and produces new information (a review) that benefits future users. This dual role creates a tension between exploration, exploitation, and users’ incentives.

A social planner – a hypothetical entity that controls users for the sake of the common good – would balance “exploration” of insufficiently known alternatives and “exploitation” of the information acquired so far. Designing algorithms to trade off these two objectives is a well-researched subject in machine learning and operations research. However, a given user who decides to “explore” typically suffers all the downside of this decision, whereas the upside (improved recommendations) is spread over many users in the future. Therefore, users’ incentives are skewed in favor of exploitation. As a result, observations may be collected at a slower rate and suffer from selection bias (ratings of a particular movie may mostly come from people who like this type of movie). Moreover, in some natural but idealized examples (Kremer et al., 2014; Mansour et al., 2015), optimal recommendations are never found because they are never explored.

Thus, we have a problem of incentivizing exploration. Providing monetary incentives can be financially or technologically unfeasible, and relying on voluntary exploration can lead to selection biases. A recent line of work, started by (Kremer et al., 2014), relies on the inherent information asymmetry between the recommendation system and a user. These papers posit a simple model, termed Bayesian Exploration in (Mansour et al., 2016). The recommendation system is a “principal” that interacts with a stream of self-interested “agents” arriving one by one. Each agent needs to make a decision: take an action from a given set of alternatives. The principal issues a recommendation, and observes the outcome, but cannot direct the agent to take a particular action. The problem is to design a “recommendation policy” for the principal that learns over time to make good recommendations and ensures that the agents are incentivized to follow this recommendation. A single round of this model is a version of a well-known “Bayesian Persuasion game” (Kamenica and Gentzkow, 2011).

Our scope. We study Bayesian Exploration with agents that can have heterogeneous preferences. The preferences of an agent are encapsulated in her type, e.g., vegan vs. meat-lover. When an agent takes a particular action, the outcome depends on the action itself (the choice of restaurant), the “state” of the world (the qualities of the restaurants), and the type of the agent. The state is persistent (does not change over time), but initially not known; a Bayesian prior on the state is common knowledge. In each round, the agent type is drawn independently from a fixed and known distribution. The principal strives to learn the best possible recommendation for each agent type.

We consider three models, depending on whether and when the agent type is revealed to the principal: the type is revealed immediately after the agent arrives (public types), the type is revealed only after the principal issues a recommendation (reported types), and the type is never revealed (private types). (Reported types may arise if the principal asks agents to report their type after the recommendation is issued, e.g., in a survey. While the agents are allowed to misreport their respective types, they have no incentive to do so.) We design a near-optimal recommendation policy for each modeling choice. In fact, we consider a stronger benchmark: the optimal Bayesian-expected reward achieved by any recommendation policy in any one round.

Explorability. A distinctive feature of Bayesian Exploration is that it may be impossible to incentivize some agent types to take some actions, no matter what the principal does or how much time she has. For a more precise terminology, a given type-action pair is explorable if this agent type takes this action under some recommendation policy in some round with positive probability; this action is also called explorable for this type. Thus, some type-action pairs might not be explorable. Moreover, one may need to explore to find out which pairs are explorable. The set of explorable pairs is interesting in its own right, as it bounds the welfare of a setting: recommendation policies cannot do better than the “best explorable action” for a particular agent type, i.e., an explorable action with the largest reward in the realized state.

Comparative statics for explorability. We study how the set of all explorable type-action pairs (the explorable set) is affected by the model choice and by the diversity of types. First, we find that for each problem instance the explorable set stays the same if we transition from public types to reported types, and can only become smaller if we transition from reported types to private types. We provide a concrete example in which the latter transition makes a huge difference. Second, we vary the distribution of agent types. For public types (and therefore also for reported types), we find that the explorable set is determined by the support set of the type distribution. Further, if we make the support set larger, then the explorable set can only become larger. In other words, diversity of agent types helps exploration. We provide a concrete example in which the explorable set increases very substantially even if the support set increases by a single type. However, for private types the picture is quite different: we provide an example in which diversity hurts, in the same sense as above. Intuitively, with private types, diversity muddles the information available to the principal, making it harder to learn about the state of the world, whereas for public types diversity helps the principal refine her belief about the state.

Our techniques. As a warm-up, we first develop a recommendation policy for public types. In the long run, our policy matches the benchmark of “best explorable action”. While it is easy to prove that such a policy exists, the challenge is to provide it as an explicit procedure. Our policy focuses on exploring all explorable type-action pairs. Exploration needs to proceed gradually, whereby exploring one action may enable the policy to explore another. In fact, exploring some action for one type may enable the policy to explore some action for another type. Our policy proceeds in phases: in each phase, we explore all actions for each type that can be explored using information available at the start of the phase. Agents of different types learn separately, in per-type “threads”; the threads exchange information after each phase.

An important building block is the analysis of the single-round game. We use information theory to characterize how much state-relevant information the principal has. In particular, we prove a version of information-monotonicity: the set of all explorable type-action pairs can only increase if the principal has more information.

As our main contribution, we develop a policy for private types. In this model, recommending one particular action to the current agent is not very meaningful because the agent’s type is not known to the principal. Instead, one can recommend a menu: a mapping from agent types to actions. Analogous to the case of public types, we focus on explorable menus and gradually explore all such menus, eventually matching the Bayesian-expected reward of the best explorable menu. Without loss of generality, we restrict to Bayesian incentive-compatible (BIC) policies: essentially, policies that output menus such that the agents are incentivized to follow them. The issue of explorability is now about menus: a menu is called explorable if some BIC policy recommends this menu in some round with positive probability. As some menus might not be explorable, we are interested in the “best explorable menu”: an explorable menu with the largest expected reward for the realized state of the world. Our recommendation policy for private types competes with this benchmark, eventually matching its Bayesian-expected reward. Our policy focuses on exploring all explorable menus, and proceeds gradually, whereby exploring one menu may enable the policy to explore another. One difficulty is that exploring a given menu does not immediately reveal the reward of a particular type-action pair (because multiple types could map to the same action). Consequently, even keeping track of what the policy knows is now non-trivial. The analysis of the single-round game becomes more involved, as one needs to argue about “approximate information-monotonicity”. To handle these issues, our recommendation policy satisfies only a relaxed version of incentive-compatibility.

In the reported types model, we face a similar issue, but achieve a much stronger result: we design a policy which matches our public-types benchmark in the long run. This may seem counterintuitive because “reported types” are completely useless to the principal in the single-round game (whereas public types are very useful). Essentially, we reduce the problem to the public types case, at the cost of a much longer exploration.

Discussion. This paper, as well as all prior work on incentivizing exploration, relies on very standard yet idealized assumptions of Bayesian rationality and the “power to commit” (i.e., principal can announce a policy and commit to implementing it). A recent paper (Immorlica et al., 2018) attempts to mitigate these assumptions (in a setting with homogeneous agents). However, some form of the “power to commit” assumption appears necessary to make any progress.

We do not attempt to elicit agents’ types when they are not public, in the sense that our recommendation to a given agent is not contingent on anything that this agent reports. However, our result for reported types is already the best possible, in the sense that the explorable set is the same as for public types, so (in the same sense) elicitation is not needed.

Related work. Bayesian Exploration with homogeneous agents was introduced in (Kremer et al., 2014), and largely resolved: an optimal policy for the case of two actions and deterministic utilities (Kremer et al., 2014), explorability (Mansour et al., 2016), and regret minimization with stochastic utilities (Mansour et al., 2015).

Bayesian Exploration with heterogeneous agents and public types is studied in (Mansour et al., 2015), under a very strong assumption which ensures explorability of all type-action pairs, and in (Mansour et al., 2016), where a fixed tuple of agent types arrives in each round and plays a game. (Mansour et al., 2016) focus on explorability of joint actions. Our approach for the public-type case is similar on a high level, but simpler and more efficient, essentially because we focus on type-action pairs rather than joint actions.

A very recent paper (Chen et al., 2018), independent of our work, studies incentivizing exploration with heterogeneous agents and private types, but allows monetary transfers. Assuming that each action is preferred by some agent type, they design an algorithm with (very) low regret, and conclude that diversity helps in their setting.

Several papers study “incentivizing exploration” in substantially different models: with a social network (Bahar et al., 2016); with time-discounted utilities (Bimpikis et al., 2018); with monetary incentives (Frazier et al., 2014; Chen et al., 2018); with a continuous information flow and a continuum of agents (Che and Hörner, 2018); with long-lived agents and “exploration” separate from payoff generation (Kleinberg et al., 2016; Liang and Mu, 2018; Liang et al., 2018); with fairness (Kannan et al., 2017). Also, seminal papers (Bolton and Harris, 1999; Keller et al., 2005) study scenarios with long-lived, exploring agents and no principal.

Recommendation policies with no explicit exploration, and the closely related “greedy algorithm” in multi-armed bandits, have been studied recently (Bastani et al., 2018; Schmit and Riquelme, 2018; Kannan et al., 2018; Raghavan et al., 2018). A common theme is that the greedy algorithm performs well under substantial assumptions on the diversity of types, yet it suffers regret in the worst case (a well-known folklore result in various settings; see (Mansour et al., 2018; Schmit and Riquelme, 2018)).

(Schmit and Riquelme, 2018) consider a “full-revelation” recommendation system, and show that (under some substantial assumptions) agent heterogeneity leads to exploration.

The exploration-exploitation tradeoff has received much attention over the past decades, usually under the rubric of “multi-armed bandits”; see the books (Cesa-Bianchi and Lugosi, 2006; Bubeck and Cesa-Bianchi, 2012; Gittins et al., 2011). Absent incentives, Bayesian Exploration with public types is the well-studied problem of “contextual bandits” (with deterministic rewards and a Bayesian prior). A single round of Bayesian Exploration is a version of the Bayesian Persuasion game (Kamenica and Gentzkow, 2011), where the signal observed by the principal is distinct from the state. Exploration-exploitation problems with incentives issues arise in several other scenarios: dynamic pricing (Kleinberg and Leighton, 2003; Besbes and Zeevi, 2009; Badanidiyuru et al., 2018), dynamic auctions (Bergemann and Said, 2011), advertising auctions (Babaioff et al., 2014; Devanur and Kakade, 2009; Babaioff et al., 2015), human computation (Ho et al., 2016; Ghosh and Hummel, 2013; Singla and Krause, 2013), and repeated actions (Amin et al., 2013, 2014; Braverman et al., 2018).

2. Model and Preliminaries


Bayesian Exploration is a game between a principal and agents. The game lasts for a given number of rounds (the time horizon). Each round proceeds as follows: a new agent arrives, receives a message from the principal, chooses an action from a fixed action space (the same for all agents), and collects a reward that is immediately observed by the principal. Each agent has a type, drawn independently from a fixed distribution. There is uncertainty, captured by a “state of nature”, henceforth simply the state, drawn from a Bayesian prior at the beginning of time and fixed across rounds. The reward of an agent is determined by her type, the action she chooses, and the state, via a fixed and deterministic reward function. The principal’s messages are generated according to a randomized online algorithm termed the “recommendation policy”. Thus, an instance of Bayesian Exploration consists of the time horizon, the type and action spaces, the type distribution, the prior, and the reward function.
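The following minimal Python sketch is only meant to fix the order of events in one run of the game; the callables policy, agent_best_response, type_dist, state_prior, and utility are hypothetical stand-ins for the primitives just described, not the paper's notation.

import random

def run_game(policy, agent_best_response, type_dist, state_prior, utility,
             n_rounds, rng=random.Random(0)):
    """One run of Bayesian Exploration (sketch)."""
    state = state_prior(rng)                  # state drawn once, fixed across rounds
    history = []                              # what the principal has observed so far
    for _ in range(n_rounds):
        theta = type_dist(rng)                # agent type, drawn i.i.d. each round
        message = policy(history)             # principal's message given its history
        action = agent_best_response(theta, message)   # agent maximizes posterior reward
        reward = utility(theta, action, state)          # deterministic reward function
        # what gets recorded (the type or not, and when) depends on the model variant
        history.append((theta, action, reward))
    return history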

The knowledge structure is as follows. The type distribution, the Bayesian prior, the reward function, and the recommendation policy are common knowledge. Each agent knows her own type and observes nothing else except the principal’s message. We consider three model variants, depending on whether and when the principal learns the agent’s type: the type is revealed immediately after the agent arrives (public types), the type is revealed only after the principal issues a recommendation (reported types), or the type is not revealed (private types).

Let the history denote everything the principal has observed at a given round, immediately before it chooses its message. Hence, for public types it consists of all previous type-action-reward triples together with the current agent’s type; for reported types, of all previous type-action-reward triples; and for private types, of all previous action-reward pairs. (For randomized policies, the history also contains the policy’s random seed in each round.) Formally, this is the input to the recommendation policy in each round. Borrowing terminology from the Bayesian Persuasion literature, we will often refer to the history as the signal, and to the set of all possible histories at a given round as the set of possible signals.

A solution to an instance of the Bayesian Exploration game is a randomized online algorithm, termed the “recommendation policy”, which, at each round, maps the current history to a distribution over messages; the messages, in general, are arbitrary bit strings of length polynomial in the size of the instance.

The recommendation policy, the type distribution, the state prior, and the reward function induce a joint distribution over states and histories, henceforth called the signal structure at the given round. Note that it is known to each agent, since all of its ingredients are common knowledge.

We are ready to state the agents’ decision model. Each agent, given the realized message, chooses an action so as to maximize her Bayesian-expected reward, where the expectation is over the posterior on the state given her type and the message.

Given the instance of Bayesian Exploration, the goal of the principal is to choose a policy that maximizes the Bayesian-expected total reward over all rounds. (While the principal must commit to the policy given only the problem instance, the policy itself observes the history and thus can adapt recommendations to inferences about the state based on the history; see Example 3.1.)
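In symbols (our own generic notation, not necessarily the paper's): writing $u(\theta,a,\omega)$ for the reward function, $\theta_t$ and $a_t$ for the type and the chosen action in round $t$, $\omega$ for the state, and $T$ for the time horizon, the principal's objective would read

\[ \mathrm{REW}(\pi) \;=\; \mathbb{E}\Big[ \sum_{t=1}^{T} u(\theta_t, a_t, \omega) \Big], \]

where the expectation is over the prior on $\omega$, the type distribution, the agents' responses, and the policy's internal randomness.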

We assume that the type space, the action space, and the state space are all finite.

Bayesian incentive-compatibility. For public types, we assume the message in each round is a recommended action. For private and reported types, we assume that the message in each round is a menu: a mapping from types to actions. We further assume the recommendation policy is Bayesian incentive-compatible.

Definition 2.1.

For each round, let E be the event that the agents have followed the principal’s recommendations in all previous rounds. The recommendation policy is Bayesian incentive-compatible (BIC) if, for all rounds and all messages that have positive probability conditioned on this event, every agent type weakly prefers to follow the message: her Bayesian-expected reward from the recommended action (resp. from her entry of the recommended menu) is at least her Bayesian-expected reward from any other action,

(1)

where the expectation is over the posterior on the state given the message and the event E.
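For concreteness, here is one plausible rendering of (1) in our own generic notation (these symbols are ours, not necessarily the paper's): let $u(\theta,a,\omega)$ denote the reward function, $m$ the realized message of the given round (an action for public types, a menu $\mu$ otherwise), and $E$ the event above. Then the BIC condition would require, for every type $\theta$ and every action $a'$,

\[ \mathbb{E}\big[\, u(\theta, \mu(\theta), \omega) \;\big|\; m,\ E \,\big] \;\ge\; \mathbb{E}\big[\, u(\theta, a', \omega) \;\big|\; m,\ E \,\big], \]

where for public types $\mu(\theta)$ is simply the recommended action, and the expectation is over the posterior on the state $\omega$.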

The above assumptions are without loss of generality, by a suitable version of Myerson’s “revelation principle”.

Explorability and benchmarks. For public types, a type-action pair is called eventually-explorable in a given state if there is some BIC recommendation policy that, for a large enough time horizon, recommends this action to this agent type in some round with positive probability. Such an action is also called eventually-explorable for this type and state; the set of all such actions, for a given type and state, is the eventually-explorable set.

Likewise, for private types, a menu is called eventually-explorable in a given state if there is some BIC recommendation policy that eventually recommends this menu with positive probability; the set of all such menus is the eventually-explorable set of menus.

Our benchmark is the best eventually-explorable recommendation for each type. For public and private types, resp., this is

(2)
(3)
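For concreteness, a plausible rendering of these benchmarks in our own notation (again, the symbols below are ours): with $\mathcal{D}$ the type distribution, $u(\theta,a,\omega)$ the reward function, $E_\omega(\theta)$ the eventually-explorable actions of type $\theta$ in state $\omega$, and $M_\omega$ the eventually-explorable menus in state $\omega$, the per-round benchmarks would read

\[ \mathrm{OPT}_{\mathrm{pub}} \;=\; \mathbb{E}_{\omega}\, \mathbb{E}_{\theta\sim\mathcal{D}} \Big[ \max_{a\in E_\omega(\theta)} u(\theta,a,\omega) \Big], \qquad \mathrm{OPT}_{\mathrm{priv}} \;=\; \mathbb{E}_{\omega} \Big[ \max_{\mu\in M_\omega}\ \mathbb{E}_{\theta\sim\mathcal{D}} \big[ u(\theta,\mu(\theta),\omega) \big] \Big], \]

and the corresponding total-reward benchmarks are obtained by multiplying by the time horizon.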

We have that the private-types benchmark is at most the public-types benchmark, essentially because any BIC policy for private types can be simulated as a BIC policy for public types. We provide an example (Example 3.1) in which the inequality is strict.

Note that, for all settings, no BIC recommendation policy can out-perform the corresponding benchmark. Our main technical contributions are (computationally efficient) policies that get arbitrarily close to these benchmarks as the number of agents grows.

3. Comparative Statics

We discuss how the set of all eventually-explorable type-action pairs (the explorable set) is affected by the model choice and by the diversity of types. The explorable set captures all information that can possibly be learned in the public-types model. All else equal, settings with a larger explorable set have greater or equal total expected reward, both in benchmark (2) and in our approximation guarantees. For private types, the explorable set provides an “upper bound” on the information available to the principal, because the principal does not directly observe the agent types. Our first result shows that models with public or reported types can explore (strictly) more actions for each type than models with private types; thus, more information about types (strictly) improves the outcomes. Our second result shows that greater diversity (in the sense of a greater number of possible agent types) improves exploration for public or reported types but, in fact, can harm exploration for private types. The intuition is that with private types, diversity can muddle the information of the principal, hindering her ability to learn about the state, whereas for public or reported types diversity only helps the principal refine her beliefs about the state.

Explorability and the model choice. Fix an instance of Bayesian Exploration, and consider the explorable sets for a given state under public and under private types, respectively. (Equivalently, the private-types explorable set is the set of all type-action pairs that appear in some eventually-explorable menu in the given state.) We will show in Section 4.3 that the explorable set for reported types coincides with that for public types.

For every state, the explorable set for private types is contained in the explorable set for public types.

The idea is that one can simulate any BIC recommendation policy for private types with a BIC recommendation policy for public types; we omit the details.

Interestingly, the private-types explorable set can in fact be a strict subset of the public-types explorable set:

Example 3.1.

There are 2 states, 2 types and 2 actions, and states and types are drawn uniformly at random. Rewards are defined as follows:

Table 1. Rewards as a function of the state and the type.

In Example 3.1, the private-types explorable set is a strict subset of the public-types explorable set.

Proof.

Action 0 is preferred by both types initially. Thus, in the first round, the principal must recommend action 0 in order for the policy to be BIC. Hence the type-action pairs involving action 0 are eventually-explorable in all models.

In the second round, the principal knows the reward of the first-round agent. When types are public or reported, this reward together with the type is sufficient for the principal to learn the state. Moving forward, the principal can now recommend the higher-reward action for each type (either directly or, in the case of reported types, through a menu). Thus, for each type, the pair (type, action 1) is eventually-explorable in the state in which action 1 is the higher-reward action for that type.

For private types, samples from the first-round menu (which, as argued above, must recommend action 0 for both types) do not convey any information about the state, as they have the same distribution in both states. Therefore, action 1 is not eventually-explorable, for either type and either state. ∎

Explorability and diversity of agent types. Fix an instance of Bayesian Exploration with a given type distribution. We consider how the explorable set changes if we modify the type distribution in this instance to some other distribution, and compare the corresponding explorable sets for each state.

For public and reported types, we show that the explorable set is determined by the support set of the type distribution, and can only increase if the support set increases:

Consider Bayesian Exploration with public or reported types, and two type distributions. Then:

if the supports of the two distributions are the same, then the corresponding explorable sets coincide for every state;

if the support of the first distribution is contained in the support of the second, then, for every state, the explorable set under the first distribution is contained in the explorable set under the second.

Proof Sketch.

Consider public types (the case of reported types then follows by the arguments in Section 4.3). Let π be a BIC recommendation policy for the instance with the original type distribution, and suppose π eventually explores a given set of type-action pairs for this instance and state. Consider the instance with the modified type distribution, whose support contains that of the original one. Extend π to a policy π' as follows: restrict attention to the subsequence of rounds in which the arriving type lies in the support of the original distribution. If the current type lies outside this support, recommend the action that maximizes this agent's Bayesian-expected reward. Otherwise, run π on the sub-history restricted to this subsequence and recommend the action that π recommends. Then π' is BIC for the instance with the modified type distribution. Furthermore, π' eventually explores the same set of type-action pairs for the modified instance (and possibly more), as every history that occurs with positive probability in the original instance occurs as a sub-history in the modified instance with positive probability as well. ∎

For private types, the situation is more complicated. More types can help for some problem instances. For example, if different types have disjoint sets of available actions (more formally, say, disjoint sets of actions with positive rewards), then we are essentially back to the case of reported types, and the conclusions of the claim above apply. On the other hand, we can use Example 3.1 to show that more types can hurt explorability when types are private. Recall that in this example, for private types, only action 0 can be recommended. Now consider a less diverse instance in which only type 0 appears. After one agent of that type chooses action 0, the state is revealed to the principal; in particular, in the state in which action 1 is the better action for type 0, action 1 can be recommended to future agents. This shows that, in this example, the explorable set increases when we have fewer types.

4. Public Types

In this section, we develop our recommendation policy for public types.

Theorem 4.1.

Consider an arbitrary instance of Bayesian Exploration with public types. There exists a BIC recommendation policy whose expected total reward is at least the total-reward benchmark for public types, given by (2), minus an additive constant that depends on the problem instance but not on the time horizon. This policy explores all type-action pairs that are eventually-explorable for the realized state.

4.1. A single round of Bayesian Exploration

Signal and explorability. We first analyze which actions can be explored by a BIC policy in a single round of Bayesian Exploration with public types, as a function of the history. Throughout, we suppress from our notation the indices that are clear from context. Consider a random variable equal to the history at the given round (referred to as a signal throughout this section), a realization of this signal, and the signal structure: the joint distribution of the state and the signal. Note that different policies induce different histories and hence different signal structures, so it will be important to be explicit about the signal structure throughout this section.

Definition 4.2.

Consider a single round of Bayesian Exploration in which the principal receives a signal with a given signal structure. An action is called signal-explorable for a given agent type and realized signal if there exists a single-round BIC recommendation policy that recommends this action to this type with positive probability given this signal. The set of all such actions is the signal-explorable set; viewed before the signal is realized, it is a random subset of actions.

Information-monotonicity. We compare the information content of two signals using the notion of conditional mutual information (see Appendix A for background). Essentially, we show that a more informative signal leads to the same or larger explorable set.

Definition 4.3.

We say that one signal is at least as informative as another if the conditional mutual information between the state and the second signal, given the first signal, equals zero.

Intuitively, the condition means that if one is given the first signal, one can learn no further information about the state from the second signal. Note that this condition depends not only on the signal structures of the two signals, but also on their joint distribution.
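In generic notation (ours, not necessarily the paper's): with $\omega$ the state and $\Sigma_1, \Sigma_2$ the two signals, the condition that $\Sigma_1$ is at least as informative as $\Sigma_2$ reads

\[ I(\omega;\, \Sigma_2 \mid \Sigma_1) \;=\; 0, \]

i.e., $\omega$ and $\Sigma_2$ are conditionally independent given $\Sigma_1$.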

Lemma 4.4.

Let two signals with their respective signal structures be given. If the first signal is at least as informative as the second, then, for any realized values of the two signals that co-occur with positive probability, every action that is signal-explorable for the second signal is also signal-explorable for the first.

Proof.

Consider any BIC recommendation policy for the second (less informative) signal structure. We construct a policy for the first signal structure by averaging: upon observing a realization of the first signal, recommend each action with probability equal to the conditional expectation, given that realization, of the probability with which the original policy recommends this action. The informativeness condition implies that the state and the second signal are independent given the first signal. Plugging in the definition of the new policy and using this conditional independence, each BIC constraint for the new policy reduces to the corresponding BIC constraint for the original policy; hence the new policy is also BIC. Finally, whenever the two signal realizations co-occur with positive probability and the original policy recommends some action with positive probability, the new policy recommends that action with positive probability as well. This implies the claimed inclusion of signal-explorable sets. ∎

Max-Support Policy. We can solve the following LP to check whether a particular action is signal-explorable given a particular realized signal. In this LP, we represent a single-round policy as a set of numbers, one for each action and each feasible signal, giving the probability of recommending that action upon observing that signal. The objective is to maximize the probability of recommending the target action upon observing the realized signal,

subject to: the numbers for each signal forming a probability distribution over actions, and the BIC constraints (1) holding for every pair of actions.

Since the constraints in this LP characterize exactly the single-round BIC recommendation policies, the target action is signal-explorable given the realized signal if and only if the LP has a positive optimal value. If it does, the LP solution itself defines a BIC recommendation policy that recommends the target action with positive probability at the realized signal.
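A minimal sketch of this LP check, assuming one plausible formulation (not the authors' code): the hypothetical inputs are prior[w, s], the joint probability of state w and signal s, and u[a, w], the reward of the fixed agent type for action a in state w; the LP is solved with scipy.optimize.linprog.

import numpy as np
from scipy.optimize import linprog

def is_signal_explorable(prior, u, target_signal, target_action):
    """Return True iff some single-round BIC policy recommends
    target_action with positive probability at target_signal (sketch)."""
    n_states, n_signals = prior.shape
    n_actions = u.shape[0]
    n_vars = n_signals * n_actions            # x[s, a] = Pr[recommend a | signal s]
    idx = lambda s, a: s * n_actions + a

    # Objective: maximize x[target_signal, target_action] (linprog minimizes).
    c = np.zeros(n_vars)
    c[idx(target_signal, target_action)] = -1.0

    # BIC constraints: for every ordered pair (a1, a2), following recommendation a1
    # must be weakly better than deviating to a2, in expectation over states/signals:
    #   sum_{w,s} prior[w,s] * x[s,a1] * (u[a2,w] - u[a1,w]) <= 0.
    A_ub, b_ub = [], []
    for a1 in range(n_actions):
        for a2 in range(n_actions):
            if a1 == a2:
                continue
            row = np.zeros(n_vars)
            for s in range(n_signals):
                row[idx(s, a1)] = np.dot(prior[:, s], u[a2] - u[a1])
            A_ub.append(row)
            b_ub.append(0.0)

    # Each signal must map to a probability distribution over actions.
    A_eq, b_eq = [], []
    for s in range(n_signals):
        row = np.zeros(n_vars)
        row[idx(s, 0):idx(s, 0) + n_actions] = 1.0
        A_eq.append(row)
        b_eq.append(1.0)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0.0, 1.0))
    return res.success and -res.fun > 1e-9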

Definition 4.5.

Given a signal structure, a BIC recommendation policy is called max-support if, for every feasible signal and every action that is signal-explorable given that signal, the policy recommends this action with positive probability upon observing that signal.

It is easy to see that we obtain a max-support recommendation policy by averaging the policies defined above, one for each signal-explorable (signal, action) pair. Specifically, the following uniform mixture is BIC and max-support:

(4)
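One plausible concrete form of (4), in our own notation: with $\pi_{\sigma,a}$ denoting the LP-derived BIC policy that recommends action $a$ with positive probability at signal $\sigma$, and $N$ the number of signal-explorable pairs $(\sigma,a)$, take

\[ \pi^{\max}(\cdot \mid \cdot) \;=\; \frac{1}{N} \sum_{\text{signal-explorable } (\sigma,a)} \pi_{\sigma,a}(\cdot \mid \cdot). \]

Since the BIC constraints are linear, any mixture of BIC policies is BIC, and this mixture recommends every signal-explorable action with positive probability at the corresponding signal.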

Maximal Exploration. We design a subroutine which outputs a sequence of actions with two properties: it includes every signal-explorable action at least once, and each action in the sequence is marginally distributed as the max-support policy (4). The length of this sequence should be large enough that every signal-explorable action has expected count at least one:

(5)

This step is essentially from (Mansour et al., 2016); we provide the details below for the sake of completeness. The idea is to put the appropriate expected number of copies of each action into a sequence of the chosen length and randomly permute the sequence. However, this expected number might not be an integer, and in particular may be smaller than 1. The latter issue is resolved by making the sequence sufficiently long. For the former issue, we first put the integer part of the expected number of copies of each action into the sequence, and then sample the remaining actions according to the residual distribution. For details, see Algorithm 1.

1:  Input: the agent type, the realized signal, and the signal structure.
2:  Output: a list of actions.
3:  Compute the max-support policy as per (4).
4:  Initialize the list to be empty.
5:  for each action do
6:     Compute the expected number of copies: the sequence length times the probability of this action under the max-support policy.
7:     Add the integer part of this number of copies of the action to the list.
8:     Record the leftover fractional mass of this action.
9:  end for
10:  Let the number of remaining slots be the sequence length minus the current length of the list; normalize the leftover masses into a residual distribution.
11:  Sample that many actions from the residual distribution independently and add them to the list.
12:  Randomly permute the actions in the list.
13:  return the list.
Algorithm 1 Subroutine MaxExplore

Given a type and a realized signal, MaxExplore outputs a sequence of actions. Each action in the sequence is marginally distributed as the max-support policy (4). Every action that is signal-explorable given this signal shows up in the sequence at least once with probability 1. MaxExplore runs in time polynomial in the number of actions, the sequence length, and the size of the support of the signal.
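A short Python sketch of MaxExplore under the assumptions above (hypothetical names: p maps actions to their probabilities under the max-support policy, and L is a sequence length with L*p[a] >= 1 for every signal-explorable action a):

import math
import random

def max_explore(p, L, rng=random):
    """Return a random sequence of L actions in which every action a with
    L*p[a] >= 1 appears at least once, and each position is marginally
    distributed according to p (a sketch, not the authors' code)."""
    seq, residual = [], {}
    for a, pa in p.items():
        whole = math.floor(L * pa)
        seq.extend([a] * whole)          # guaranteed copies of action a
        residual[a] = L * pa - whole     # leftover fractional mass
    remaining = L - len(seq)             # equals the total leftover mass
    if remaining > 0:
        actions = list(residual)
        weights = [residual[a] for a in actions]
        seq.extend(rng.choices(actions, weights=weights, k=remaining))
    rng.shuffle(seq)
    return seq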

4.2. Main Recommendation Policy

Algorithm 2 is the main procedure of our recommendation policy. It consists of two parts: exploration, which explores all the eventually-explorable actions, and exploitation, which simply recommends the best explored action for a given type. The exploration part proceeds in phases. In each phase, each type gets a sequence of actions from MaxExplore, computed using the data collected before this phase starts. The phase ends when every agent type has finished its sequence of rounds. We pick the sequence-length parameter large enough that condition (5) is satisfied for all phases and all possible signals. (Such a choice exists because there are only finitely many such signals.) After a number of phases equal to the number of type-action pairs, our recommendation policy enters the exploitation part. See Algorithm 2 for details.

1:  Initialization: the signal is the prior information, the phase count is 1, and the per-phase round index of each type is 0.
2:  for each round do
3:     if the exploration part is not over then
4:        {Exploration}
5:        Call the thread of the current agent's type.
6:        if every type has finished its sequence of rounds in the current phase then
7:           Start a new phase: increment the phase count.
8:           Let the signal for the new phase be the set of all observed type-action-reward triples.
9:           Let the signal structure for the new phase be the distribution of this signal given the realized type sequence.
10:     else
11:        {Exploitation}
12:        Recommend the best explored action for the current agent's type.
Algorithm 2 Main procedure for public types

There is a separate thread for each type, which is called whenever an agent of this type shows up; see Algorithm 3. In a given phase, the thread recommends the actions computed by MaxExplore, then switches to the best explored action. The thread only uses the information collected before the current phase starts: the signal and the signal structure of that phase.

1:  if this is the first call of the thread in the current phase then
2:     Compute a list of actions via MaxExplore, using this type and the current phase's signal and signal structure.
3:     Initialize the round index of this type to 0.
4:  Increment the round index of this type.
5:  if the index does not exceed the length of the list then
6:     Recommend the corresponding action from the list.
7:  else
8:     Recommend the best explored action of this type.
Algorithm 3 Thread for a given agent type
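Combining Algorithms 2 and 3, here is a compact Python sketch of the whole policy (hypothetical stand-ins, not the authors' code: next_agent_type() yields the arriving agent's public type, play_round(theta, a) returns the observed reward, max_explore(theta, signal) returns the MaxExplore list for the phase, and best_explored(theta, signal) returns the best explored action for the type):

from collections import defaultdict

def public_types_policy(types, next_agent_type, play_round,
                        max_explore, best_explored, n_phases, n_rounds):
    """Sketch of the phase-based policy for public types."""
    signal, t = [], 0                                  # observed (type, action, reward) triples
    for _ in range(n_phases):                          # exploration part
        lists = {th: max_explore(th, signal) for th in types}
        idx, new_triples = defaultdict(int), []
        while t < n_rounds and any(idx[th] < len(lists[th]) for th in types):
            theta = next_agent_type(); t += 1
            if idx[theta] < len(lists[theta]):
                a = lists[theta][idx[theta]]           # next action in this type's thread
                idx[theta] += 1
            else:                                      # this type already finished its list
                a = best_explored(theta, signal)
            new_triples.append((theta, a, play_round(theta, a)))
        signal = signal + new_triples                  # threads exchange data at the phase boundary
    while t < n_rounds:                                # exploitation part
        theta = next_agent_type(); t += 1
        play_round(theta, best_explored(theta, signal))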

The BIC property follows easily from Claim 4.1. The key is that Algorithm 2 explores all eventually-explorable type-action pairs.

The performance analysis proceeds as follows. First, we upper-bound the expected number of rounds of a phase (Lemma 4.6). Then we show, in Lemma 4.7, that Algorithm 2 explores all eventually-explorable type-action pairs in phases. We use these two lemmas to prove the main theorem.

Lemma 4.6.

The expected number of rounds in each phase is at most the sequence length times the sum, over types, of the reciprocals of their probabilities; in particular, it does not depend on the time horizon.

Proof.

A phase ends as soon as each type has shown up at least as many times as the length of its sequence. For each type, the expected number of rounds until it shows up that many times is the sequence length divided by the type's probability; since the number of rounds until all types have done so is at most the sum of these (nonnegative) quantities, the expected phase length is at most the claimed bound. ∎

Notice that in Algorithm 2 the partition into phases depends only on the realized types: given the type sequence, the partition into phases is fixed.

The following lemma compares the exploration performed by Algorithm 2 after a given number of phases with that of any other BIC recommendation policy after the same number of rounds. Notice that a single phase in Algorithm 2 spans many rounds.

Lemma 4.7.

Fix a phase index and the sequence of agent types, and assume Algorithm 2 runs for at least that many phases. For a given state, if a type-action pair can be explored by some BIC recommendation policy in the round with the same index with positive probability, then this pair is explored by Algorithm 2 by the end of the corresponding phase with probability 1.

Proof.

We prove this by induction on the phase index. The base case is trivial by Claim 4.1. Assuming the lemma holds for a given index, we prove that it holds for the next one.

Let one signal be the signal of Algorithm 2 by the end of the previous phase, and let the other be the history of the competing BIC policy in the corresponding number of initial rounds. More precisely, the latter consists of the internal randomness of the competing policy together with the type-action-reward triple of each of those rounds.

The proof plan is as follows. We first show that the signal of Algorithm 2 is at least as informative as the history of the competing policy. Informally, this means that the information collected in the initial phases of Algorithm 2 contains all the information the competing policy has about the state. After that, we use the information-monotonicity lemma to show that the current phase of Algorithm 2 explores all the type-action pairs the competing policy might explore in the corresponding round.

By the chain rule, the conditional mutual information between the state and the competing policy's history, given Algorithm 2's signal, decomposes into a sum of per-round terms. Consider any one of these rounds. The policy's internal randomness and the realized type are independent of the state; the suggested action is a deterministic function of this randomness, the history of previous rounds, and the current type; and, by the induction hypothesis, the corresponding type-action pair has already been explored by Algorithm 2, so the realized reward is a deterministic function of Algorithm 2's signal together with the type and the action. Hence each per-round term is zero, and the conditional mutual information is zero: Algorithm 2's signal is at least as informative as the competing policy's history.

By Lemma 4.4, every action that is signal-explorable for the competing policy's history is also signal-explorable for Algorithm 2's signal. For the given state, there is a realization of the competing policy's history, consistent with this state, given which the policy recommends the target action to the target type with positive probability; hence this action is signal-explorable for the corresponding realization of Algorithm 2's signal. By Claim 4.1, at least one agent of this type in the current phase of Algorithm 2 will choose this action.

Now consider the case when the round index exceeds the number of phases in the exploration part. Define a variant of Algorithm 2 that only does exploration (removing the if-condition and the exploitation part). The above induction argument still applies to this variant: for a given state, if a type-action pair can be explored by a BIC recommendation policy at a given round, then it is guaranteed to be explored by the variant by the end of the phase with the same index. We now argue that the variant does not explore any new type-action pairs after the exploration phases of Algorithm 2. Call a phase exploring if the variant explores at least one new type-action pair in that phase. As there are only finitely many type-action pairs, the variant has at most that many exploring phases. On the other hand, once the variant has a phase that is not exploring, the signal stays the same after that phase, so all subsequent phases are not exploring either. Hence the variant has no exploring phases beyond the first ones, whose number equals the number of type-action pairs, and up to that point the first phases of Algorithm 2 explore the same set of type-action pairs as the first phases of the variant. ∎

Proof of Theorem 4.1.

Algorithm 2 is BIC by Claim 4.1. By Lemma 4.7, Algorithm 2 explores all the eventually-explorable type-action pairs after its exploration phases. After that, for each agent type, Algorithm 2 recommends the best explored action, i.e., the best eventually-explorable action for this type in the realized state, with probability exactly 1. (This holds provided that our algorithm finishes the exploration phases; if some undesirable low-probability event happens, e.g., if all agents seen so far have had the same type, our algorithm would never finish these phases.)

Therefore Algorithm 2 attains the benchmark reward in every round except those in the exploration phases. It remains to prove that the expected number of rounds in exploration does not depend on the time horizon. Recall that a phase ends as soon as each type has shown up the required number of times; by Lemma 4.6, its expected duration is bounded by a constant that does not depend on the time horizon, and there is only a constant number of exploration phases. The claim follows, taking the additive loss to be this constant number of rounds times the maximal per-round reward. ∎

4.3. Extension to Reported Types

We sketch how to extend our ideas for public types to handle the case of reported types. We would like to simulate the recommendation policy for public types developed above, separately for the exploration part and the exploitation part. The exploitation part is fairly easy: we provide a menu that recommends the best explored action for each agent type.

In the exploration part, in each round we guess the agent type uniformly at random among all types. (We guess the types uniformly, rather than according to their probabilities, because our goal is to explore each type for a certain number of rounds; guessing a type according to its probability would only make rare types appear even rarer.) The idea is to simulate the public-types policy only in the lucky rounds when we guess correctly. Thus, in each round we simulate the next round of the public-types policy, as determined by the number of lucky rounds so far. In each round of exploration, we suggest the following menu. For the guessed type, we recommend the same action as the public-types policy would recommend for this type in its current simulated round. For any other type, we recommend the action with the best expected reward given the “common knowledge” (the information available before the current round) and the action recommended to the guessed type. This is to ensure that in a lucky round, the menu does not convey any information beyond that action. When we receive the reported type, we can check whether our guess was correct. If so, we feed the type-action-reward triple back into the public-types policy. Otherwise, we ignore this round, as if it never happened.
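A brief Python sketch of this reduction (hypothetical interface, not the authors' code): pub_policy.recommend(theta) returns what the simulated public-types policy would recommend to type theta in its next round, pub_policy.update(theta, action, reward) advances that simulation, myopic_best(theta, shown_action) returns the action with the best expected reward given only common knowledge and the guessed type's recommendation, and play_round(menu) runs one real round and returns the reported type and the reward.

import random

def reported_types_exploration(pub_policy, types, myopic_best, play_round, n_rounds):
    """Guess-and-simulate reduction from reported types to public types (sketch)."""
    for _ in range(n_rounds):
        guess = random.choice(types)                 # uniform guess of the agent's type
        a = pub_policy.recommend(guess)              # simulated public-types recommendation
        menu = {th: (a if th == guess else myopic_best(th, a)) for th in types}
        reported, reward = play_round(menu)
        if reported == guess:                        # lucky round: feed the data back
            pub_policy.update(guess, a, reward)
        # unlucky rounds are ignored, as if they never happened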

Thus, our recommendation policy eventually explores the same type-action pairs as the public-types policy. The expected number of exploration rounds increases by a factor equal to the number of types, since each round is lucky with probability one over the number of types. Thus, we have the following theorem.

Theorem 4.8.

Consider Bayesian Exploration with reported types. There exists a BIC recommendation policy whose expected total reward is at least the total-reward benchmark for public types, given by (2), minus an additive constant that depends on the problem instance but not on the time horizon. This policy explores all type-action pairs that are eventually-explorable for public types.

5. Private Types

Our recommendation policy for private types satisfies a relaxed version of the BIC property, called ε-BIC, in which the inequality (1) is relaxed by an additive ε, for some fixed ε > 0. We assume a more permissive behavioral model in which agents obey such a policy.

The main result is as follows.

Theorem 5.1.

Consider Bayesian Exploration with private types, and fix ε > 0. There exists an ε-BIC recommendation policy whose expected total reward is at least the total-reward benchmark for private types, given by (3), minus an additive constant that depends on the problem instance (and on ε) but not on the time horizon.

The recommendation policy proceeds in phases: in each phase, it explores all menus that can be explored given the information collected so far. The crucial step in the proof is to show that:

(*) for each n, the first n phases of our recommendation policy explore all the menus that could possibly be explored in the first n rounds by any BIC recommendation policy.

The new difficulty for private types comes from the fact that we are exploring menus instead of type-action pairs, and we do not learn the reward of a particular type-action pair immediately. This is because a recommended menu may map several different types to the chosen action, so knowing the latter does not immediately reveal the agent’s type. Moreover, the full “outcome” of a particular menu is a distribution over action-reward pairs; it is, in general, impossible to learn this outcome exactly in any finite number of rounds. Because of these issues, we cannot obtain property (*) exactly. Instead, we achieve an approximate version of this property, as long as we explore each menu enough times in each phase.

We then show that this approximate version of (*) suffices to guarantee explorability, if we relax the incentive property of our policy from BIC to ε-BIC, for any fixed ε > 0. In particular, we prove an approximate version of the information-monotonicity lemma (Lemma 4.4) which, given the approximate version of (*), ensures that our recommendation policy explores all the menus that could possibly be explored in the corresponding initial rounds of any BIC recommendation policy.

5.1. A single round of Bayesian Exploration

Recall that, for a random variable called the signal, the signal structure is the joint distribution of the state and the signal.

Definition 5.2.

Consider a single round of Bayesian Exploration in which the principal has a signal with a given signal structure. For any ε ≥ 0, a menu is called ε-signal-explorable, for a given realized signal, if there exists a single-round ε-BIC recommendation policy that recommends this menu with positive probability given this signal. The set of all such menus is the ε-signal-explorable set. We omit the prefix ε when ε = 0.

Approximate Information Monotonicity. The following definition compares two signals approximately.

Definition 5.3.

Let two random variables (signals) be given. We say that the first is ε-approximately as informative as the second about the state if, given the first signal, the second signal carries only a small amount of additional information about the state (as measured by conditional mutual information, with a threshold controlled by ε).

Lemma 5.4.

Let two signals with their respective signal structures be given. If the first signal is ε-approximately as informative as the second about the state, then, for any realized values of the two signals that co-occur with positive probability, every menu that is signal-explorable for the second signal is ε-signal-explorable for the first.

Proof.

For each realization of the first signal, consider the Kullback-Leibler divergence between the joint conditional distribution of the state and the second signal given this realization, and the product of the corresponding conditional marginals. Averaging these divergences over realizations of the first signal recovers the conditional mutual information, which is small by assumption. By Pinsker's inequality (the total-variation distance between two distributions is at most the square root of half their KL divergence), the corresponding total-variation distances are small on average as well.

Consider any BIC recommendation policy for the second signal structure. As in Lemma 4.4, we construct a policy for the first signal structure by averaging: upon observing a realization of the first signal, recommend each menu with the conditional expectation of the probability with which the original policy recommends it. We check that the new policy is ε-BIC: the conditional independence used in Lemma 4.4 now holds only approximately, so each BIC constraint is violated by at most an amount proportional to the total-variation distances above, which is at most ε by the choice of the informativeness threshold.

Finally, as in Lemma 4.4, whenever the two signal realizations co-occur with positive probability and the original policy recommends some menu with positive probability, the new policy recommends this menu with positive probability as well. This implies the claimed inclusion. ∎

Max-Support Policy. We can solve the following LP to check whether a particular menu is signal-explorable given a particular realized signal. In this LP, we represent a single-round policy as a set of numbers, one for each menu and each feasible signal, giving the probability of recommending that menu upon observing that signal. The objective is to maximize the probability of recommending the target menu upon observing the realized signal,

subject to: the numbers for each signal forming a probability distribution over menus, and the BIC constraints holding for every agent type and every pair of actions.