Stochastic Bandits with Delay-Dependent Payoffs

10/07/2019
by Leonardo Cella, et al.

Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. After proving that finding an optimal policy is NP-hard even when all model parameters are known, we introduce a class of ranking policies provably approximating, to within a constant factor, the expected reward of the optimal policy. We show an algorithm whose regret with respect to the best ranking policy is bounded by $\widetilde{\mathcal{O}}(\sqrt{kT})$, where $k$ is the number of arms and $T$ is time. Our algorithm uses only $\mathcal{O}(k \ln\ln T)$ switches, which helps when switching between policies is costly. As constructing the class of learning policies requires ordering the arms according to their expectations, we also bound the number of pulls required to do so. Finally, we run experiments to compare our algorithm against UCB on different problem instances.
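
To make the delay-dependent setting concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes a hypothetical recovery curve (`linear_recovery`), a hypothetical rule that a never-pulled arm pays its full baseline mean, and one simple reading of a ranking policy: cycle round-robin through the top-m arms ordered by baseline mean. All names and numbers are illustrative assumptions. The sketch works with expected rewards only, so it involves no learning and no randomness.

```python
from typing import Callable, List, Optional


def linear_recovery(delay: int, full_after: int = 3) -> float:
    """Hypothetical recovery curve: fraction of an arm's baseline mean that is
    available `delay` rounds after its last pull, reaching 1.0 at `full_after`."""
    return min(1.0, delay / full_after)


def cycle_value(base_means: List[float], m: int,
                recovery: Callable[[int], float] = linear_recovery,
                horizon: int = 1200) -> float:
    """Average expected reward per round when cycling over the top-m arms."""
    top = sorted(base_means, reverse=True)[:m]
    last_pull: List[Optional[int]] = [None] * m
    total = 0.0
    for t in range(horizon):
        i = t % m  # round-robin over the m highest-ranked arms
        # Assumption: an arm that was never pulled pays its full baseline mean.
        frac = 1.0 if last_pull[i] is None else recovery(t - last_pull[i])
        total += top[i] * frac
        last_pull[i] = t
    return total / horizon


def best_cycle_length(base_means: List[float],
                      recovery: Callable[[int], float] = linear_recovery) -> int:
    """Pick the cycle length m with the highest average expected reward."""
    values = {m: cycle_value(base_means, m, recovery)
              for m in range(1, len(base_means) + 1)}
    return max(values, key=values.get)


base_means = [0.9, 0.8, 0.5, 0.2]  # illustrative baseline means, one per arm
best_m = best_cycle_length(base_means)
```

Under these assumptions, always pulling the single best arm is poor because it never recovers, while cycling over too many arms dilutes the average with weak arms. The best cycle length sits in between, which is the trade-off a ranking policy has to resolve.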

03/03/2021

Combinatorial Bandits without Total Order for Arms

We consider the combinatorial bandits problem, where at each time step, ...
09/20/2021

Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

We study a finite-horizon restless multi-armed bandit problem with multi...
03/29/2022

Near-optimality for infinite-horizon restless bandits with many arms

Restless bandits are an important class of problems with applications in...
07/25/2021

Restless Bandits with Many Arms: Beating the Central Limit Theorem

We consider finite-horizon restless bandits with multiple pulls per peri...
05/04/2019

Pandora's Problem with Nonobligatory Inspection

Martin Weitzman's "Pandora's problem" furnishes the mathematical basis f...
01/23/2013

My Brain is Full: When More Memory Helps

We consider the problem of finding good finite-horizon policies for POMD...
01/06/2022

Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach

Controlling antenna tilts in cellular networks is imperative to reach an...