Incentivizing Exploration with Unbiased Histories

11/14/2018
by Nicole Immorlica, et al.

In a social learning setting, there is a set of actions, each of which has a payoff that depends on a hidden state of the world. Each agent in a sequence chooses an action with the goal of maximizing payoff given estimates of the state of the world. A disclosure policy tries to coordinate the agents' choices by sending messages about the history of past actions. The goal of the policy is to minimize the regret of the action sequence. In this paper, we study a particular class of disclosure policies whose messages, called unbiased subhistories, consist of the actions and rewards of a subsequence of past agents, where the subsequence is chosen ahead of time. One trivial message of this form contains the full history; a disclosure policy that uses such messages risks inducing herding behavior among the agents and thus has regret linear in the number of rounds. Our main result is a disclosure policy using unbiased subhistories that obtains regret Õ(√T). We also exhibit simpler policies with higher, but still sublinear, regret. These policies can be interpreted as dividing a sublinear number of agents into constant-sized focus groups, whose histories are then fed to future agents.
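To make the notion of an unbiased subhistory concrete, here is a minimal Python sketch. It is an illustration, not the paper's construction: the names `unbiased_subhistory` and `greedy_action`, the two-action setup, the prior means, and the empirical-mean greedy agent are all assumptions introduced here. The one property taken from the abstract is that the set of disclosed rounds is fixed ahead of time, so it cannot depend on the observed rewards.

```python
from typing import Dict, List, Sequence, Tuple

# One past agent's record: (action chosen, reward received).
Record = Tuple[int, float]


def unbiased_subhistory(history: Sequence[Record],
                        rounds: Sequence[int]) -> List[Record]:
    """Return the records for a set of rounds fixed ahead of time.

    Rounds that have not been played yet are skipped.
    """
    return [history[t] for t in rounds if t < len(history)]


def greedy_action(message: Sequence[Record],
                  prior_means: Dict[int, float]) -> int:
    """Hypothetical agent: pick the action with the highest estimated mean.

    Actions that do not appear in the message fall back to their prior mean.
    """
    totals = {a: 0.0 for a in prior_means}
    counts = {a: 0 for a in prior_means}
    for action, reward in message:
        totals[action] += reward
        counts[action] += 1
    estimates = {
        a: (totals[a] / counts[a]) if counts[a] else prior_means[a]
        for a in prior_means
    }
    return max(estimates, key=estimates.get)


# Assumed toy data: four past agents, two actions (0 and 1).
history = [(0, 1.0), (0, 0.0), (1, 1.0), (0, 0.0)]
prior = {0: 0.5, 1: 0.5}

full_message = unbiased_subhistory(history, range(len(history)))
focus_group = unbiased_subhistory(history, [0])  # rounds chosen in advance

choice_full = greedy_action(full_message, prior)
choice_focus = greedy_action(focus_group, prior)
```

In this toy run the agent shown the full history and the agent shown only round 0 choose different actions, which hints at how the choice of disclosed rounds can steer behavior. It does not reproduce the regret guarantees stated above.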


research
06/09/2023

Strategic Apple Tasting

Algorithmic decision-making in high-stakes domains often involves assign...
research
10/28/2020

Provably Efficient Online Agnostic Learning in Markov Games

We study online agnostic learning, a problem that arises in episodic mul...
research
02/22/2021

Communication Efficient Parallel Reinforcement Learning

We consider the problem where M agents interact with M identical and ind...
research
11/21/2022

Best of Both Worlds in Online Control: Competitive Ratio and Policy Regret

We consider the fundamental problem of online control of a linear dynami...
research
02/28/2023

Policy Dispersion in Non-Markovian Environment

Markov Decision Process (MDP) presents a mathematical framework to formu...
research
05/01/2023

The Impact of the Geometric Properties of the Constraint Set in Safe Optimization with Bandit Feedback

We consider a safe optimization problem with bandit feedback in which an...
research
11/17/2018

The Impatient May Use Limited Optimism to Minimize Regret

Discounted-sum games provide a formal model for the study of reinforceme...
