Multi-armed bandits (henceforth MABs) [9, 11] are a well-studied problem domain in online learning. In this setting, several arms (i.e., actions) are available to a planner; each arm is associated with an unknown reward distribution, from which rewards are sampled independently each time the arm is pulled. The planner selects arms sequentially, aiming to maximize her sum of rewards. This often involves a tradeoff between exploiting arms that have been observed to yield good rewards and exploring arms that could yield even higher rewards. Many variations of this model exist, including stochastic [1, 21], Bayesian, contextual [13, 29], adversarial, and non-stationary [8, 23] bandits.
This paper considers a setting motivated by recommender systems. Such systems suggest actions to agents based on a set of current beliefs and assess agents’ experiences to update these beliefs. For instance, in navigation applications (e.g., Waze; Google Maps) the system recommends routes to drivers based on beliefs about current traffic congestion. The planner’s objective is to minimize users’ average travel time. The system cannot be sure of the congestion on a road segment that no agents have recently traversed; thus, navigation systems offer the best known route most of the time and explore occasionally. Of course, users are not eager to perform such exploration; they are self-interested in the sense that they care more about minimizing their own travel times than they do about gathering information on traffic conditions for the system.
A recent line of work [22, 26], inspired by the viewpoint of algorithmic mechanism design [27, 28], deals with that challenge by incentivizing exploration: that is, setting up the system in such a way that no user would ever rationally choose to decline an action that was recommended to her. The key reason that it is possible to achieve this property while still performing a sufficient amount of exploration is that the planner has more information than the agents. At each point in time, each agent holds beliefs about the arms’ reward distributions; the planner has the same information, but also knows about all of the arms previously pulled and the rewards obtained in each case. More specifically, Kremer et al. consider a restricted setting and devise an MAB algorithm that is incentive compatible (IC), meaning that whenever the algorithm recommends an arm to an agent, the best response of the agent is to select that arm.
Although this approach explicitly reasons about agents’ incentives, it does not treat agents fairly: agents who are asked to explore receive lower expected rewards. More precisely, these IC MAB algorithms reach optimality (in the static setting) or minimize regret (in the online setting) by intentionally providing (a priori) sub-optimal recommendations to some of the agents. In particular, such IC MAB algorithms lead agents astray from the default arm, that is, the a priori superior arm, which would be every agent’s rational choice in the absence both of knowledge of other agents’ experiences and of a trusted recommendation. Thus, it would be natural for agents to see the recommendations of such IC MAB algorithms as a betrayal of trust; they might ask “if you believed that what you recommended to me was worse than what I would have picked without your help, why did you recommend it?” Of course, this problem gets worse as the importance of the recommendation to the agent increases. At the extreme end of the spectrum, it would clearly be unethical for a doctor to recommend a treatment to a patient that she considers sub-optimal, even if doing so would lead to better medical outcomes for society as a whole. Similarly, nobody wants a financial advisor who occasionally recommends stocks about which she has low confidence in order to improve her predictive model. Indeed, in many domains such behavior is illegal: e.g., in jurisdictions where a financial advisor has a fiduciary duty to her clients, she is legally required to act at all times for the sole benefit of her clients.
In this paper we introduce the idea of fiduciary bandits. We explore a definition of a recommender’s fiduciary duty that we call ex-ante individual rationality (EAIR). A mechanism is EAIR if any probability distribution over arms that it selects has expected reward that is always at least as great as the reward of the default arm, both calculated based on the recommender’s knowledge (which may be more extensive than that of agents). This means that, while it is possible for the mechanism to sample a recommendation from the distribution that is a priori inferior to the default arm, the agent receiving the recommendation is nevertheless guaranteed to realize expected reward weakly greater than that offered by the default arm, independently of the other agents and their recommendations. Satisfying this requirement makes an MAB algorithm more appealing to agents; we foresee that in some domains, such a requirement might be imposed as a fairness constraint by authorities.
Algorithmically, we focus on constructing an optimal EAIR mechanism that also satisfies the IC requirement. Our model is similar to that of Kremer et al., but with arms and uniform agent arrival. (The assumption of uniform arrival is only required for incentive compatibility; thus, the algorithms we propose in this paper are asymptotically optimal and IR regardless of the agent arrival distribution.) In particular, the planner and the agents have (the same) Bayesian prior over the rewards of the arms, which are fixed but initially unknown. The main technical contribution of this paper is an IC and EAIR mechanism, which obtains the highest possible social welfare by an ex-ante IR mechanism up to a factor of , where is the number of agents. The optimality of our mechanism, which we term Fiduciary Explore & Exploit (FEE) and outline as Algorithm 1, follows from a careful construction of the exploration phase. Our analysis uses an intrinsic property of the setting, which is further elaborated in Theorem 1.
To complement this analysis, we also investigate the more demanding concept of ex-post individual rationality (EPIR). The EPIR condition requires that a recommended arm must never be a priori inferior to the default arm given the planner’s knowledge. The two requirements differ in the guarantees that they provide to agents and correspondingly allow the system different degrees of freedom in performing exploration. We design an asymptotically optimal IC and EPIR mechanism and analyze the social welfare cost of adopting both EAIR and EPIR mechanisms.
Background on MABs can be found in Cesa-Bianchi and Lugosi and a recent survey. Kremer et al. is the first work of which we are aware that investigated the problem of incentivizing exploration. The authors considered two deterministic arms, a prior known both to the agents and the planner, and an arrival order that is common knowledge among all agents, and presented an optimal IC mechanism. Cohen and Mansour extended this optimality result to several arms under further assumptions. This setting has also been extended to regret minimization, social networks [4, 5], and heterogeneous agents [12, 19]. All of this literature disallows paying agents; monetary incentives for exploration are discussed in, e.g., [12, 16]. None of this work considers an individual rationality constraint as we do here.
Our work also contributes to the growing body of work on fairness in Machine Learning [7, 15, 18, 24]. In the context of MABs, some recent work focuses on fairness in the sense of treating arms fairly. In particular, Liu et al. aim at treating similar arms similarly and Joseph et al. demand that a worse arm is never favored over a better one despite a learning algorithm’s uncertainty over the true payoffs. Finally, we note that the EAIR requirement we impose (that agents be guaranteed an expected reward at least as high as that offered by a default arm) is also related to the burgeoning field of safe reinforcement learning.
Let be a set of arms (actions). The rewards are deterministic but initially unknown: the reward of arm is a random variable, and are mutually independent. We denote by the observed value of . Further, we denote by the expected value of , and assume for notational convenience that . We also make the simplifying assumption that the rewards are fully supported on the set , and refer to the continuous case in Section 5.
The agents arrive in a random order, , where is selected uniformly at random from the set of all permutations. Let denote an agent and denote a stage (i.e., a position in the ordering). The proposition asserts that agent arrived at stage . In what follows, we refer to agents according to the (random) stage at which they arrive, and not according to their names. (The effect of this construction is that, unlike in the work of Kremer et al., agents cannot infer their positions in the ordering from their names. As a result, as we show in the appendix, in our setting a mechanism that first explores all arms and then exploits the best arm is IC and asymptotically optimal.) We denote by the action of the agent arriving at stage (not the action of agent ). The reward of the agent arriving at stage is denoted by , and is a function of the arm she chooses. For instance, by selecting arm the agent obtains . Each agent aims at maximizing her payoff. Agents are fully aware of the distribution of .
A mechanism is a recommendation engine that interacts with agents. The input for the mechanism at stage is the sequence of arms pulled and rewards received by the previous agents. The output of the mechanism is a recommended arm for agent . Formally, a mechanism is a function from histories to probability distributions over arms; of course, we can also define a deterministic notion that maps histories simply to arms. The mechanism makes action recommendations, but cannot compel agents to follow these recommendations. We say that a mechanism is incentive compatible (IC) when following its recommendations constitutes an equilibrium: that is, when given a recommendation and given that others follow their own recommendations, an agent’s best response is to follow her own recommendation. (The equilibrium notion is used for ease of presentation. The mechanisms in this paper can be modified to offer dominant strategies to agents.)
Definition 1 (Incentive Compatibility).
A mechanism is incentive compatible (IC) if , for every history and for all actions ,
The mechanism has a global objective, which is to maximize agents’ social welfare . When a mechanism is IC, we can also talk about the mechanism’s (expected) social welfare under the assumption that the agents follow its recommendations, :
A mechanism is said to be optimal w.r.t. the problem parameters if for every other mechanism it holds that . A mechanism is asymptotically optimal if, for every “large enough” number of agents greater than some number , the social welfare of the agents under is at most away from the social welfare attained by the best possible mechanism for agents (that is, if for every other mechanism it holds that ).
As elaborated above, in this paper we focus on mechanisms that are individually rational. Since agents are aware of the distributions , they also know that ; hence, we shall assume that every agent’s default action is . As a result, a fiduciary mechanism should guarantee each agent at least the reward provided by action . We now formally define this notion, using the property of ex-ante individual rationality (EAIR).
Definition 2 (Ex-Ante Individual Rationality).
A mechanism is ex-ante individually rational (EAIR) if for every agent , every value in the support of , and every history ,
Due to the mutual independence assumption, we must have if arm was observed under the history and otherwise. An EAIR mechanism must select a portfolio of arms with expected reward never inferior to the reward obtained for . We also discuss a more strict form of rationality, namely ex-post individual rationality (EPIR) in Section 4.
We now give an example to illustrate our setting and to familiarize the reader with our notation. Consider arms, and for ; thus , , and . As always, is the default arm. To satisfy EAIR, a mechanism should recommend to the first agent, since EAIR requires that the expected value of any recommendation should weakly exceed . Let be the history after the first agent. Now, we have three different cases. First, if , we know that and ; therefore, an EAIR mechanism can never explore any other arm, since any distribution over would violate Inequality (2). Second, if , then and , and hence an EAIR mechanism can explore both and .
The third and most interesting case is where , as when . In this case, arm could only be recommended through a portfolio. An EAIR mechanism could select any distribution over that satisfies Inequality (2): any such that . This means that an EAIR mechanism can potentially explore arm , yielding higher expected social welfare overall than simply recommending a non-inferior arm deterministically.
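The three cases of the example can be checked mechanically. The following is a minimal Python sketch, assuming the EAIR condition compares the portfolio's expected reward (observed values for pulled arms, prior means for unpulled ones) against the planner's current value for the default arm. The function names `conditional_mean` and `is_eair`, and all numbers in the instance, are illustrative assumptions and do not come from the paper.

```python
from typing import Dict


def conditional_mean(arm: int, prior_means: Dict[int, float],
                     observed: Dict[int, float]) -> float:
    """Planner's expected reward for an arm given the history.

    Rewards are fixed once drawn, so a pulled arm is worth exactly its
    observed value; an unpulled arm is worth its prior mean.
    """
    return observed.get(arm, prior_means[arm])


def is_eair(portfolio: Dict[int, float], prior_means: Dict[int, float],
            observed: Dict[int, float], default_arm: int = 1,
            tol: float = 1e-12) -> bool:
    """True if the portfolio's expected reward is at least the default arm's."""
    baseline = conditional_mean(default_arm, prior_means, observed)
    value = sum(prob * conditional_mean(arm, prior_means, observed)
                for arm, prob in portfolio.items())
    return value >= baseline - tol


# Hypothetical instance: arm 1 is the default (highest prior mean).
prior_means = {1: 0.6, 2: 0.5, 3: 0.2}
# Hypothetical history in which the default arm yielded 0.4, so arm 2 may be
# recommended outright while arm 3 needs a portfolio.
history = {1: 0.4}

print(is_eair({2: 1.0}, prior_means, history))            # arm 2 alone
print(is_eair({3: 1.0}, prior_means, history))            # arm 3 alone
print(is_eair({2: 2 / 3, 3: 1 / 3}, prior_means, history))  # mixed portfolio
```

In this hypothetical instance, arm 3 fails the check on its own but passes when mixed with arm 2 at probability at most one third, which is the situation described in the third case.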
3 Asymptotically Optimal IC-EAIR Mechanism
In this section we present the main technical contribution of this paper: a mechanism that asymptotically optimally balances the explore-exploit tradeoff while satisfying both EAIR and incentive compatibility. The mechanism, which we term Fiduciary Explore & Exploit (FEE), is described as Algorithm 1. FEE is an event-based protocol that triggers every time an agent arrives. We now give an overview of FEE, focusing on the case where all agents adopt the recommendation of the mediator; this is reasonable because we will go on to show that FEE is IC. (For completeness, we add that if one or more agents deviate from their recommendations, the algorithm always recommends the first arm from that moment on.) We explain the algorithm’s exploration phase in Subsection 3.1, describe the overall algorithm in Subsection 3.2, and prove the algorithm’s formal guarantees in Subsection 3.3.
FEE is composed of three phases: primary exploration (Lines 1–6), secondary exploration (Lines 7–19), and exploitation (Line 20). During the primary exploration phase, the mechanism compares the default arm to whichever other arms are permitted by the individual rationality constraint. This turns out to be challenging for two reasons. First, the order in which arms are explored matters; tackling them in the wrong order can reduce the set of arms that can be considered overall. Second, it is nontrivial to search in the continuous space of probability distributions over arms. To address this latter issue, we present a key lemma that allows us to use dynamic programming and find the optimal exploration policy in time . Because we expect either to be fixed or to be significantly smaller than , this policy is computationally efficient. Moreover, we note that the optimal exploration policy can be computed offline prior to the agents’ arrival.
The primary exploration phase terminates in one of two scenarios: either the reward of arm is the best that was observed and thus no other arm could be explored (as in our example when , or when and exploring yielded and thus could not be explored), or another arm was found to be superior to : i.e., an arm was observed for which . In the latter case, the mechanism gains the option of conducting a secondary exploration, using arm to investigate all the arms that were not explored in the primary exploration phase. The third and final phase—to which we must proceed directly after the primary exploration phase if that phase does not identify an arm superior to the default arm—is to exploit the most rewarding arm observed.
3.1 Primary Exploration Phase
Performing primary exploration optimally requires solving a planning problem; it is a challenging one, because it involves a continuous action space and a number of states exponential in and . We approach this task as a Goal Markov Decision Process (GMDP) (see, e.g.,) that abstracts everything but pure exploration. In our GMDP encoding, all terminal states fall into one of two categories. The first category is histories that lead to pure exploitation of , which can arise either because EAIR permits no arm to be explored or because all explored arms yield rewards inferior to the observed ; the second is histories in which an arm superior to was found. Non-terminal states thus represent histories in which it is still permissible for some arms to be explored. The set of actions in each non-terminal state is the set of distributions over the non-observed arms (i.e., portfolios) corresponding to the history represented in that state, which satisfy the EAIR condition. The transition probabilities encode the probability of choosing each candidate arm from a portfolio; observe that the rewards of each arm are fixed, so this is not a source of additional randomness in our model. GMDP rewards are given in terminal nodes only: either the observed if no superior arm was found or the expected value of the maximum between the superior reward discovered and the maximal reward of all unobserved arms (since in this case, as we show later on, the mechanism is able to explore all arms w.h.p. during the secondary exploration phase).
Formally, the GMDP is a tuple , where
is a finite set of states. Each state is a pair , where is the set of arm-reward pairs that have been observed so far, with each appearing at most once in (since rewards from the arms are deterministic): for every and every , . is the set of arms not yet explored. The initial state is thus . For every non-empty set of pairs we define to be the reward observed for arm , and to be the maximal reward observed. (Due to the construction, every non-empty must contain for some .)
is an infinite set of actions. For each , is defined as follows:
If , then : i.e., a deterministic selection of .
Else, if , then . This condition implies that we can move to secondary exploration.
Otherwise, is a subset of , such that if and only if
Notice that this resembles the EAIR condition given in Inequality (2). Moreover, the case where none of the remaining arms have strong enough priors to allow exploration falls here as a vacuous case of the above inequality.
We denote by the set of terminal states, namely .
is the transition probability function. Let , and let such that and for some . Then, the transition probability from to given an action is defined by If is some other state that does not meet the conditions above, then let for every .
is the reward function, defined on terminal states only. For each terminal state ,
That is, when was the highest-reward arm observed, the reward of a terminal state is ; otherwise, it is the expectation of the maximum between and the highest reward of all unobserved arms. The reward depends on unobserved arms since the secondary exploration phase allows us to explore all these arms; hence, their values are also taken into account.
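The terminal reward described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each arm's prior is given as a dictionary from reward value to probability; the names `expected_max` and `terminal_reward` and the numbers in the usage lines are hypothetical and do not appear in the paper. The expectation is computed by brute-force enumeration, which is exponential in the number of unobserved arms and is meant only to make the definition concrete.

```python
from itertools import product
from typing import Dict, List


def expected_max(floor: float, dists: List[Dict[float, float]]) -> float:
    """E[max(floor, X_a over the given arms)] for independent discrete arms.

    Each element of `dists` maps a reward value to its probability.
    """
    total = 0.0
    # Enumerate every joint realization of the unobserved arms.
    for combo in product(*(d.items() for d in dists)):
        prob, best = 1.0, floor
        for value, p in combo:
            prob *= p
            best = max(best, value)
        total += prob * best
    return total


def terminal_reward(x1: float, best_observed: float,
                    unobserved: List[Dict[float, float]]) -> float:
    """Reward of a terminal state of the exploration GMDP.

    If no observed arm beats the default arm, the mechanism exploits the
    default arm and earns x1. Otherwise the secondary exploration phase can
    reach every remaining arm, so the reward is the expected maximum of the
    superior reward and the unobserved arms.
    """
    if best_observed <= x1:
        return x1
    return expected_max(best_observed, unobserved)


# Hypothetical usage: one unobserved arm paying 0 or 1 with equal probability.
coin = {0.0: 0.5, 1.0: 0.5}
print(terminal_reward(0.4, 0.4, [coin]))  # default arm was the best observed
print(terminal_reward(0.4, 0.6, [coin]))  # a superior arm worth 0.6 was found
```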
A policy is a function from all GMDP histories (sequences of states and actions) and a current state to an action. A policy is valid if for every history and every non-terminal state , . A policy is stationary if for every two histories and a state , . When discussing a stationary policy, we thus neglect its dependency on , writing .
Given a policy and a state , we denote by the expected reward of when initialized from , which is defined recursively from the terminal states:
We now turn to our technical results. The following lemma shows that we can safely focus on stationary policies that effectively operate on a significantly reduced state space.
Lemma 1. For every policy there exists a stationary policy such that (1) for every pair of states and with and ; and (2) for every state s, .
Lemma 1 tells us that there exists an optimal, stationary policy that selects the same action in every pair of states that share the same unobserved set and values and , but are distinguished in the component. Thus, we do not need a set of states whose size depends on the number of possible arm-reward observation histories: all we need to record is , and , reducing the number of states to .
We still have one more challenge to overcome: the set of actions available in each state is uncountable; hence, employing standard value iteration or policy iteration is infeasible. Instead, we prove that there exists an optimal “simple” policy, which we denote . Given two indices , we denote by (for ) and by (for ) the distributions over such that
When is clear from context, we omit it from the superscript. We are now ready to describe the policy , which we later prove to be optimal. For the initial state , . For every non-terminal state with , such that
The optimality of follows from a property that is formally proven in Theorem 1: any policy that satisfies the conditions of Lemma 1 can be presented as a mixture of policies that solely take actions of the form . As a result, we can improve by taking the best such policy from that mixture. We derive via dynamic programming, where the base cases are the set of terminal states. For any other state, is the best action of the form as defined above, considering all states that are reachable from . While any policy can be encoded as a weighted sum over such “simple” policies, is the best one, and hence is optimal.
Theorem 1. For every valid policy and every state , it holds that .
Since our compressed state representation consists of states, the computation of in each stage requires us to consider candidate actions, each of which involves summation of at most summands; thus, can be computed in time.
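To make the dynamic program concrete, here is a minimal Python sketch over the compressed state (observed value of the default arm, set of unobserved arms). It rests on assumptions that are ours rather than the paper's: a "simple" action is taken to be either a single arm whose prior mean is at least the default arm's observed reward, or a two-arm mixture that meets the EAIR constraint with equality. The name `primary_exploration_value` and the example numbers are hypothetical, and the terminal rewards are computed by brute-force enumeration, so the running time of this sketch does not match the bound stated above.

```python
from functools import lru_cache
from itertools import product
from typing import Dict


def primary_exploration_value(dists: Dict[int, Dict[float, float]]) -> float:
    """Expected terminal reward of the best 'simple' exploration policy.

    `dists[a]` maps reward value -> probability; the smallest key of `dists`
    is treated as the default arm.
    """
    arms = sorted(dists)
    default = arms[0]
    mean = {a: sum(v * p for v, p in dists[a].items()) for a in arms}

    def expected_max(floor, unobserved):
        # E[max(floor, rewards of all unobserved arms)], by enumeration.
        total = 0.0
        for combo in product(*(dists[a].items() for a in unobserved)):
            prob, best = 1.0, floor
            for v, p in combo:
                prob *= p
                best = max(best, v)
            total += prob * best
        return total

    def pull(arm, x1, unobserved):
        # Value of recommending `arm`: stop if it beats x1, else continue.
        rest = unobserved - {arm}
        return sum(p * (expected_max(v, rest) if v > x1 else value(x1, rest))
                   for v, p in dists[arm].items())

    @lru_cache(maxsize=None)
    def value(x1, unobserved):
        good = [a for a in unobserved if mean[a] >= x1]
        bad = [a for a in unobserved if mean[a] < x1]
        best = x1  # terminal value when no EAIR-feasible action exists
        for i in good:
            best = max(best, pull(i, x1, unobserved))
            if mean[i] > x1:
                for j in bad:
                    # Weight on j that makes the mixture's mean equal x1.
                    q = (mean[i] - x1) / (mean[i] - mean[j])
                    mix = (1 - q) * pull(i, x1, unobserved) \
                        + q * pull(j, x1, unobserved)
                    best = max(best, mix)
        return best

    others = frozenset(arms[1:])
    # The first agent always receives the default arm.
    return sum(p * value(x1, others) for x1, p in dists[default].items())


# Hypothetical three-arm instance.
example = {
    1: {1.0: 0.5, 2.0: 0.5},
    2: {0.0: 0.4, 2.0: 0.6},
    3: {0.0: 0.8, 3.0: 0.2},
}
print(primary_exploration_value(example))
```

In this hypothetical instance, when the default arm yields 1.0 the mixture of arms 2 and 3 has a higher value than recommending arm 2 deterministically, which illustrates why portfolios matter during primary exploration.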
3.2 Intuitive Description of the Fiduciary Explore & Exploit Algorithm
We now present the FEE algorithm, stated formally as Algorithm 1. The primary exploration phase (Lines 1–6) is based on the GMDP from the previous subsection. It is composed of computing and then producing recommendations according to its actions, each of which defines a distribution over (at most) two actions. Let denote the terminal state reached by (the primary exploration selects a fresh arm in each stage; hence such a state is reached after at most agents).
We then enter the secondary exploration phase. If then this phase is vacuous: no distribution over the unobserved arms can satisfy the EAIR condition and/or all the observed arms are inferior to arm . On the other hand, if (Line 7), we found an arm with a reward superior to , and can use it to explore all the remaining arms. For every , the mechanism operates as follows. If the probability of yielding a reward greater than is zero, we neglect it (Lines 11–13). Else, if , we recommend . This is manifested in the second condition in Line 15. Otherwise, . In this case, we select a distribution over that satisfies the EAIR condition and explore with the maximal possible probability, which is . As we show formally in the proof of Lemma 2, the probability of exploring in this case is at least , implying that after tries in expectation the algorithm would succeed in exploring .
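The maximal exploration probability in the secondary phase follows from requiring the two-arm portfolio to have expected reward at least that of the default arm. The Python sketch below works this out under our own notation, which is an assumption: `x_star` is the observed reward of the superior arm, `x1` the observed reward of the default arm, and `mu_i` the prior mean of the unobserved arm. The function names are hypothetical, and the step that skips arms with no chance of beating the superior arm is not modeled.

```python
def secondary_explore_prob(x_star: float, x1: float, mu_i: float) -> float:
    """Largest probability of recommending an unobserved arm under EAIR.

    The portfolio recommends the unobserved arm with probability p and the
    superior observed arm otherwise, so EAIR requires
        p * mu_i + (1 - p) * x_star >= x1.
    """
    if not x_star > x1:
        raise ValueError("secondary exploration needs an arm superior to x1")
    if mu_i >= x1:
        return 1.0  # the arm can be recommended deterministically
    return (x_star - x1) / (x_star - mu_i)


def expected_tries(p: float) -> float:
    """Expected number of agents until the arm is explored (geometric law)."""
    return 1.0 / p


# Hypothetical numbers: superior arm worth 2.0, default arm worth 1.0,
# unobserved arm with prior mean 0.6.
p = secondary_explore_prob(2.0, 1.0, 0.6)
print(p, expected_tries(p))
```

Because each agent independently receives the exploratory recommendation with probability p, the number of agents needed is geometric, which is why the expected number of tries is the reciprocal of p in this sketch.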
3.3 Algorithmic Guarantees
We begin by arguing that FEE is indeed EAIR.
Proposition 1. FEE satisfies the EAIR condition.
One more property FEE exhibits is IC. The next theorem shows that when there are enough agents, the best action of each agent is to accept FEE’s recommendation. To present it, we introduce the following quantity . Let . In words, is the probability that arm is superior to all other arms. Clearly, if for some , we can safely ignore that arm and never explore it; therefore, for every arm . Finally, let . Theorem 2 implies that if there are agents, then FEE is IC.
Theorem 2. If , then FEE is IC.
We now move on to consider the social welfare of FEE. First, we show that the expected value of at , denoted by , upper bounds the social welfare of any EAIR mechanism.
Theorem 3. For every mechanism which is EAIR, .
The proof proceeds by contradiction: given an EAIR mechanism , we construct a series of progressively-easier-to-analyze EAIR mechanisms with non-decreasing social welfare; we modify the final mechanism by granting it oracular capabilities, making it violate the EAIR property and yet preserving reducibility to a policy for the GMDP of Subsection 3.1. We then argue via the optimality of that the oracle mechanism cannot obtain a social welfare greater than . Next, we lower bound the social welfare of FEE by .
The proof relies mainly on an argument that the primary and secondary explorations will not be too long on average: after agents the mechanism is likely to begin exploiting. Noting that the lower bound of Lemma 2 asymptotically approaches the upper bound of Theorem 3, we conclude that FEE is asymptotically optimal.
We now analyze the loss in social welfare imposed by the fiduciary constraint. Let denote the best asymptotic social welfare (w.r.t. some instance and infinitely many agents) achievable by an unconstrained mechanism and an EAIR mechanism, respectively.
Proposition 2. For every such that , there exists an instance with .
Proposition 2 shows that when and have the same magnitude, the ratio is on the order of , meaning that EAIR mechanisms perform poorly when a large number of different reward values are possible. However, this result describes the worst case; it turns out that optimal EAIR mechanisms have constant ratio under some reward distributions. For example, as we show in the appendix, this ratio is at most if for every .
4.1 Ex-Post Individually Rational Mechanism
Notice that EAIR mechanisms guarantee each agent the value of the default arm, but only in expectation. In this subsection, we propose a stricter form of individual rationality, ex-post individual rationality (EPIR).
Definition 3 (Ex-Post Individual Rationality).
A mechanism is ex-post individually rational (EPIR) if for every agent , every value in the support of , every history , and every arm such that , it holds that
Satisfying EPIR means that the mechanism never recommends an arm that is a priori inferior to arm . Notably, every EPIR mechanism is also EAIR, yet EPIR mechanisms are quite conservative, since they can only explore arms that yield expected rewards of at least the value obtained for . We now provide the intuitive IC and EPIR mechanism , which is also asymptotically optimal. First, explores , and observes . Then, it explores all arms such that , one at each stage, in arbitrary order. When no additional arm can be explored (due to the EPIR condition), it exploits the best arm observed.
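The EPIR mechanism just described can be sketched as follows. This is a minimal Python illustration under our own reading of the missing condition, namely that an arm may be explored only if its prior mean is at least the observed reward of the default arm. The function name `epir_mechanism`, the `pull` callback, and the numbers in the usage lines are hypothetical.

```python
from typing import Callable, Dict, Tuple


def epir_mechanism(prior_means: Dict[int, float],
                   pull: Callable[[int], float],
                   default_arm: int = 1) -> Tuple[int, Dict[int, float]]:
    """Explore every arm permitted by EPIR, then return the arm to exploit.

    `pull(arm)` returns the (fixed) reward of an arm. An arm is explored only
    if its prior mean is at least the observed reward of the default arm
    (an assumed reading of the EPIR condition).
    """
    x1 = pull(default_arm)               # the first agent gets the default arm
    observed = {default_arm: x1}
    for arm, mu in prior_means.items():  # the order of exploration is arbitrary
        if arm != default_arm and mu >= x1:
            observed[arm] = pull(arm)    # one agent per explored arm
    best = max(observed, key=observed.get)
    return best, observed


# Hypothetical instance: realized rewards are fixed but unknown in advance.
prior_means = {1: 1.5, 2: 1.2, 3: 0.6}
realized = {1: 1.0, 2: 2.0, 3: 3.0}
best, observed = epir_mechanism(prior_means, realized.__getitem__)
print(best, observed)
```

In this hypothetical instance arm 3 has the highest realized reward but is never explored, because its prior mean lies below the observed value of the default arm. This illustrates how conservative the EPIR condition is compared with EAIR.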
is IC, EPIR and asymptotically optimal among all EPIR mechanisms.
Next, we consider the cost of adopting the stricter EPIR condition rather than EAIR. Let be the best asymptotic social welfare achievable by an EPIR mechanism. As we show, by providing a more strict fiduciary guarantee, the social welfare may be harmed by a factor of .
For every such that , there exists an instance with .
5 Conclusions and Discussion
This paper introduces a model in which a recommender system must manage an exploration-exploitation tradeoff under the constraint that it may never knowingly make a recommendation that will yield lower reward than any individual agent would achieve if he/she acted without relying on the system.
We see considerable scope for follow-up work. First, from a technical point of view, our algorithmic results are limited to discrete reward distributions. One possible future direction would be to present an algorithm for the continuous case. More conceptually, we see natural extensions of EPIR and EAIR to stochastic settings, either by assuming a prior and requiring the conditions w.r.t. the posterior distribution or by requiring the conditions to hold with high probability. Moreover, we are intrigued by non-stationary settings (where, e.g., rewards follow a Markov process), since the planner would be able to sample a priori inferior arms with high probability assuming the rewards change fast enough, thereby reducing regret.
The work of G. Bahar, O. Ben-Porat and M. Tennenholtz is funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 740435). The work of K. Leyton-Brown is funded by the NSERC Discovery Grants program, DND/NSERC Discovery Grant Supplement, Facebook Research and Canada CIFAR AI Chair Amii. Part of this work was done while K. Leyton-Brown was a visiting researcher at Technion - Israeli Institute of Science and was partially funded by the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 740435).
- Abbasi-Yadkori et al.  Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Agrawal and Goyal  S. Agrawal and N. Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 1–39, 2012.
- Auer et al.  P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
- Bahar et al.  G. Bahar, R. Smorodinsky, and M. Tennenholtz. Economic recommendation systems: One page abstract. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, pages 757–757, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3936-0. doi: 10.1145/2940716.2940719. URL http://doi.acm.org/10.1145/2940716.2940719.
- Bahar et al.  G. Bahar, R. Smorodinsky, and M. Tennenholtz. Social learning and the innkeeper challenge. In ACM Conf. on Economics and Computation (EC), 2019.
- Barto et al.  A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.
- Ben-Porat and Tennenholtz  O. Ben-Porat and M. Tennenholtz. A game-theoretic approach to recommendation systems with strategic content providers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 1118–1128, 2018.
- Besbes et al.  O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems (NIPS), pages 199–207, 2014.
- Bubeck et al. [2012a] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012a.
- Bubeck et al. [2012b] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012b.
- Cesa-Bianchi and Lugosi  N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge Univ Press, 2006.
- Chen et al.  B. Chen, P. Frazier, and D. Kempe. Incentivizing exploration by heterogeneous users. In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 798–818. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/chen18a.html.
- Chu et al.  W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
- Cohen and Mansour  L. Cohen and Y. Mansour. Optimal algorithm for bayesian incentive-compatible. In ACM Conf. on Economics and Computation (EC), 2019.
- Dwork et al.  C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science conference (ITCS), pages 214–226. ACM, 2012.
- Frazier et al.  P. Frazier, D. Kempe, J. Kleinberg, and R. Kleinberg. Incentivizing exploration. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC ’14, pages 5–22, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2565-3. doi: 10.1145/2600057.2602897. URL http://doi.acm.org/10.1145/2600057.2602897.
- García and Fernández  J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Hardt et al.  M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 3315–3323, 2016.
- Immorlica et al.  N. Immorlica, J. Mao, A. Slivkins, and Z. S. Wu. Bayesian exploration with heterogeneous agents, 2019.
- Joseph et al.  M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 325–333. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6355-fairness-in-learning-classic-and-contextual-bandits.pdf.
- Karnin et al.  Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
- Kremer et al.  I. Kremer, Y. Mansour, and M. Perry. Implementing the wisdom of the crowd. Journal of Political Economy, 122:988–1012, 2014.
- Levine et al.  N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3074–3083. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6900-rotting-bandits.pdf.
- Liu et al.  L. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning, pages 3156–3164, 2018.
- Liu et al.  Y. Liu, G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes. Calibrated fairness in bandits, 2017.
- Mansour et al.  Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. In ACM Conf. on Economics and Computation (EC), 2015.
- Nisan and Ronen  N. Nisan and A. Ronen. Algorithmic mechanism design. In Proceedings of the thirty-first annual ACM Symposium on Theory of Computing (STOC), pages 129–140. ACM, 1999.
- Nisan et al.  N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic game theory, volume 1. Cambridge University Press Cambridge, 2007.
- Slivkins  A. Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.
6 Omitted Proofs from Section 3.1
If , .
Else, set such that
Proposition 5. For every non-stationary policy , there exists a stationary policy such that for every state , .
Moreover, the following Proposition 6 implies that we can substantially reduce the state space by disregarding the observed part and
Proposition 6. For every stationary policy there exists a stationary policy such that:
for every pair of states with and .
for every state s, .
Proof of Proposition 5.
Fix an arbitrary non-stationary policy . We prove the claim by iterating over all states in increasing order of the number of elements in . We use induction to show that the constructed indeed satisfies the assertion.
For every such that , i.e., . If is terminal, then and . Otherwise, the unique element in is the action that assigns probability 1 to , and by setting we get .
Assume that the assertion holds for every ; namely, that for all with . We now prove the assertion for with . If is a terminal state, then we are done. Otherwise, since and the support of each arm are finite, there are finitely many possible histories that lead from to , which we denote by . For every possible history , assigns an action that may depend on the history . Let
breaking ties arbitrarily. We set . Hence we get:
hence, . This concludes the proof. ∎
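The construction in this proof can be illustrated on a toy example. The following is a minimal sketch under stated assumptions, not the paper's model: the deterministic graph `MODEL`, the policy `history_policy`, and all function names are hypothetical, and rewards are deterministic rather than sampled from finite-support distributions. The selection rule (maximizing the one-step reward plus the already-constructed continuation value) is one choice that makes the inductive inequality go through; the exact criterion used in the proof is not reproduced here.

```python
# Hypothetical toy model, NOT the paper's: state -> {action: (reward, next_state)}.
MODEL = {
    "start": {"L": (0.0, "left"), "R": (0.0, "right")},
    "left": {"go": (1.0, "mid")},
    "right": {"go": (0.0, "mid")},
    "mid": {"safe": (1.0, "end"), "risky": (3.0, "end")},
    "end": {},  # terminal state: no actions available
}
# Successors are listed before their predecessors (closest to terminal first).
ORDER = ["end", "mid", "left", "right", "start"]


def history_policy(history, state):
    """A non-stationary policy: the action at 'mid' depends on the path taken."""
    if state == "start":
        return "R"
    if state == "mid":
        return "risky" if "left" in history else "safe"
    return "go"


def histories_to(state, start="start"):
    """All tuples of earlier states on paths from `start` that arrive at `state`."""
    if state == start:
        return [()]
    result = []
    for prev, actions in MODEL.items():
        if any(nxt == state for _, nxt in actions.values()):
            result += [h + (prev,) for h in histories_to(prev, start)]
    return result


def value(policy, history, state):
    """Total reward collected from `state` under a history-dependent policy."""
    total = 0.0
    while MODEL[state]:
        reward, nxt = MODEL[state][policy(history, state)]
        total += reward
        history, state = history + (state,), nxt
    return total


def one_step(state, action, worth):
    """Immediate reward plus the stationary continuation value of the successor."""
    reward, nxt = MODEL[state][action]
    return reward + worth[nxt]


def make_stationary(policy):
    """For each state, keep the action of the best history (ties: first in sorted order)."""
    chosen, worth = {}, {}
    for state in ORDER:
        if not MODEL[state]:
            worth[state] = 0.0
            continue
        candidates = sorted({policy(h, state) for h in histories_to(state)})
        best = max(candidates, key=lambda a: one_step(state, a, worth))
        chosen[state] = best
        worth[state] = one_step(state, best, worth)
    return chosen, worth


chosen, worth = make_stationary(history_policy)
print(chosen, worth)
```

In this toy instance the history-dependent policy plays `safe` at `mid` when it arrives via `right`, whereas the stationary policy always plays the action the original policy would have chosen under its most favorable history, so its value at every state is at least that of the original policy under any history.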
Proof of Proposition 6.
The proof is similar to the proof of Proposition 5, and is given for completeness. Fix an arbitrary stationary policy . We prove the claim by iterating over all states in increasing order of the number of elements in . We use induction to show that the constructed indeed satisfies the assertion.
For every such that , i.e., , if is terminal then . Otherwise, the unique element in is the action that assigns probability 1 to ; hence, by setting we get .
Assume the assertion holds for every ; namely, that for all with . Next, we prove the assertion for with . If is a terminal state, then we are done. Otherwise, since the size of and the support of each arm are finite, there are only finitely many states with the same and , which we denote by . For every state , assigns an action . Let
breaking ties arbitrarily. Next, set . We have that
Proof of Theorem 1.
Fix an arbitrary policy . We prove the claim by iterating over all states in increasing order of the number of elements of . We use induction to show that the constructed indeed satisfies the assertion.
For every such that , the claim holds trivially. To see this, recall that if is terminal, ; otherwise, the unique element in is the action that assigns probability 1 to the sole element in . Either way, .
Assume the assertion holds for every ; namely, that for all with . If is a terminal state, then we are done. Else, we shall make use of the following claim, which shows that every action in can be viewed as a weighted sum over the elements of .
Claim 1. For any and , there exist coefficients such that
The claim is proved after this proof. In particular, Claim 1 implies that , which is valid and thus with probability 1, can be expressed as a weighted sum over all pairs . Finally,
where the last equality follows since and by the definition of given in Equation (5). To summarize, the constructed satisfies for every state . ∎
Proof of Claim 1.
To ease readability, we shall use the notation and in this proof. Let be an arbitrary state and be an arbitrary action. Notice that could be described as
where and for every such that . We now describe a procedure that shifts mass from the set to , while still satisfying the equality in Equation (8). Each time we apply this procedure we decrease the value of one or more elements from and increase one or more elements from by the same quantity. As a result, when it converges (assuming that it does), namely when , we are guaranteed that all the conditions of the claim hold. Importantly, throughout the course of this procedure, the following inequalities hold