One-bit feedback is sufficient for upper confidence bound policies

12/04/2020
by Daniel Vial, et al.

We consider a variant of the traditional multi-armed bandit problem in which each arm can provide only one bit of feedback on each pull, computed from its past history of rewards. Our main result is the following: given an upper confidence bound policy that uses full-reward feedback, there exists a coding scheme for generating the one-bit feedback, together with a corresponding decoding scheme and arm-selection policy, such that the ratio of the regret achieved by our policy to the regret of the full-reward-feedback policy asymptotically approaches one.
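The setup above can be made concrete with a small sketch. The full-feedback baseline below is the standard UCB1 index policy on Bernoulli arms; the one-bit variant is a hypothetical illustration of the coding/decoding idea (each arm sends a single threshold bit about its private empirical mean, and the learner tracks that mean with a shrinking-step update), not the paper's actual construction:

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Baseline: UCB1 with full-reward feedback on Bernoulli arms."""
    rng = random.Random(seed)
    k = len(means)
    pulls, sums = [0] * k, [0.0] * k
    best, regret = max(means), 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pull each arm once to initialize
        else:
            a = max(range(k), key=lambda i: sums[i] / pulls[i]
                    + math.sqrt(2 * math.log(t) / pulls[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        pulls[a] += 1
        sums[a] += r
        regret += best - means[a]
    return regret

def one_bit_ucb(means, horizon, seed=0):
    """Hypothetical one-bit variant (illustration only, not the paper's
    scheme): each arm privately tracks its own empirical mean; on each
    pull it sends one bit reporting whether that mean is at or above the
    learner's current estimate.  The learner nudges its estimate up or
    down by a shrinking step and runs UCB on the decoded estimates."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    arm_sums = [0.0] * k        # private to each arm
    est = [0.5] * k             # learner's decoded mean estimates
    best, regret = max(means), 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1
        else:
            a = max(range(k), key=lambda i: est[i]
                    + math.sqrt(2 * math.log(t) / pulls[i]))
        r = 1.0 if rng.random() < means[a] else 0.0
        pulls[a] += 1
        arm_sums[a] += r
        bit = 1 if arm_sums[a] / pulls[a] >= est[a] else 0  # one-bit feedback
        step = 0.5 / pulls[a]
        est[a] = min(1.0, max(0.0, est[a] + (step if bit else -step)))
        regret += best - means[a]
    return regret
```

Running both on the same Bernoulli instance (e.g. means 0.2 and 0.8) yields regrets of the same order, which is the qualitative behavior the paper's result guarantees in the limit for its (different, carefully constructed) coding scheme.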


