Online Learning and Bandits with Queried Hints

11/04/2022
by   Aditya Bhaskara, et al.

We consider the classic online learning and stochastic multi-armed bandit (MAB) problems, when at each step the online policy can probe and find out which of a small number (k) of choices has better reward (or loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with k=2 suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound of stochastic MAB with independent arms from O(√(nT)) to O(n^2 log T), where n is the number of arms and T is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, k=3 probes suffice to achieve parameter-independent constant regret, O(n^2). Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited “advice” via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated.
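To make the probing model concrete, below is a minimal Python sketch of the weaker comparison-probe setting for stochastic MAB: each round, a UCB-style index picks k=2 candidate arms, a probe reveals which candidate has the better realized reward this round, and the policy plays that arm. The Bernoulli arms, the UCB candidate rule, and the function name probed_bandit are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import math
import random

def probed_bandit(arm_means, horizon, k=2, seed=0):
    """Sketch of a stochastic MAB policy with comparison probes:
    pick k candidate arms, probe which has the higher realized reward
    this round, then play that arm. Illustrative only."""
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n          # number of plays per arm
    totals = [0.0] * n        # cumulative observed reward per arm
    regret = 0.0
    best_mean = max(arm_means)

    for t in range(1, horizon + 1):
        # UCB-style index, used here only to choose the k probe candidates.
        def index(i):
            if counts[i] == 0:
                return float("inf")
            return totals[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])

        candidates = sorted(range(n), key=index, reverse=True)[:k]

        # This round's (hidden) rewards; Bernoulli arms for concreteness.
        rewards = [1.0 if rng.random() < arm_means[i] else 0.0 for i in range(n)]

        # Probe: only the comparison among candidates is revealed before the play.
        played = max(candidates, key=lambda i: rewards[i])
        counts[played] += 1
        totals[played] += rewards[played]
        regret += best_mean - arm_means[played]

    return regret

# Example: 5 arms, horizon 10,000, probing the top-2 UCB candidates each round.
print(probed_bandit([0.1, 0.3, 0.5, 0.7, 0.9], horizon=10_000))
```

In the stronger model discussed in the abstract, the probe would reveal the actual reward values of the probed arms rather than just the comparison; the sketch above implements only the weaker comparison feedback.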
