Reproducible Bandits

10/04/2022
by Hossein Esfandiari, et al.

In this paper, we introduce the notion of reproducible policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called reproducible if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do reproducible policies exist, but they also achieve almost the same optimal (non-reproducible) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandit setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the reproducibility parameter is also optimal. Similarly, for stochastic linear bandits (with both finitely and infinitely many arms) we develop reproducible policies that achieve the best-known problem-independent regret bounds with an optimal dependence on the reproducibility parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different executions.
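To make the definition concrete, here is a minimal sketch (not from the paper; the greedy policy, means, seeds, and all names are illustrative assumptions). It runs a simple policy twice under independent reward realizations and checks whether the two pulled-arm sequences coincide, which is exactly the event the reproducibility definition requires to hold with high probability.

```python
import numpy as np

def run_policy(policy, means, horizon, rng):
    """Run a bandit policy for `horizon` steps; return the sequence of pulled arms."""
    pulls = []
    for t in range(horizon):
        arm = policy.select(t)
        reward = rng.normal(means[arm], 1.0)  # independent Gaussian reward realization
        policy.update(arm, reward)
        pulls.append(arm)
    return pulls

class GreedyPolicy:
    """Pull each arm once, then always pull the empirically best arm.
    Its pulls depend on the noisy rewards, so two executions can diverge."""
    def __init__(self, n_arms):
        self.sums = np.zeros(n_arms)
        self.counts = np.zeros(n_arms)
    def select(self, t):
        if t < len(self.counts):  # initial round-robin exploration
            return t
        return int(np.argmax(self.sums / self.counts))
    def update(self, arm, reward):
        self.sums[arm] += reward
        self.counts[arm] += 1

means = [0.5, 0.52, 0.4]
horizon = 200
# Two independent executions: different random seeds stand in for
# independent reward realizations of the same bandit instance.
seq1 = run_policy(GreedyPolicy(3), means, horizon, np.random.default_rng(1))
seq2 = run_policy(GreedyPolicy(3), means, horizon, np.random.default_rng(2))
print("identical arm sequences:", seq1 == seq2)  # typically False for greedy
```

A naive greedy policy typically fails this check because its decisions track the reward noise; the paper's point is that policies can be designed to pass it with high probability (as governed by the reproducibility parameter) while retaining near-optimal regret.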

Related research

11/04/2022 - Online Learning and Bandits with Queried Hints
10/07/2019 - Stochastic Bandits with Delay-Dependent Payoffs
09/29/2021 - Batched Bandits with Crowd Externalities
10/31/2022 - Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits
03/04/2020 - Taking a hint: How to leverage loss predictors in contextual bandits?
07/07/2021 - Episodic Bandits with Stochastic Experts
06/10/2015 - Explore no more: Improved high-probability regret bounds for non-stochastic bandits
