Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?

06/16/2021
by Nicolas Gast, et al.

We study learning algorithms for the classical discounted Markovian bandit problem. We explain how to adapt PSRL [24] and UCRL2 [2] to exploit the problem structure; the resulting variants are called MB-PSRL and MB-UCRL2. While the regret bound and runtime of vanilla implementations of PSRL and UCRL2 are exponential in the number of bandits, we show that the episodic regret of MB-PSRL and MB-UCRL2 is Õ(S√(nK)), where K is the number of episodes, n is the number of bandits, and S is the number of states of each bandit (the exact dependence on S, n and K is given in the paper). Up to a factor √(S), this matches the lower bound of Ω(√(SnK)) that we also derive in the paper. MB-PSRL is also computationally efficient: its runtime is linear in the number of bandits. We further show that this linear runtime cannot be achieved by adapting classical non-Bayesian algorithms such as UCRL2 or UCBVI to Markovian bandit problems. Finally, we perform numerical experiments confirming that MB-PSRL outperforms existing algorithms in practice, both in terms of regret and computation time.
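To make the linear-runtime claim concrete, below is a minimal, hypothetical sketch of an episodic posterior-sampling loop in the spirit of MB-PSRL: one transition model is sampled per bandit from its own posterior, and the policy played is a Gittins index policy computed arm by arm, so the per-episode cost grows linearly with the number of bandits. The environment interface (env.reset, env.step), the Dirichlet priors, the function names, and the restart-in-state computation of Gittins indices (Katehakis & Veinott) are illustrative assumptions, not the paper's implementation.

import numpy as np


def gittins_indices(P, r, beta=0.95, n_iter=500):
    """Gittins index of every state of one arm, via the restart-in-state
    formulation (Katehakis & Veinott): the index of state s is (1 - beta)
    times the optimal value at s of an MDP in which, at each step, one may
    either continue from the current state or restart from s."""
    n_states = len(r)
    indices = np.zeros(n_states)
    for s in range(n_states):
        V = np.zeros(n_states)
        for _ in range(n_iter):  # plain value iteration
            V = np.maximum(r + beta * P @ V, r[s] + beta * P[s] @ V)
        indices[s] = (1 - beta) * V[s]
    return indices


def mb_psrl(env, n_arms, n_states, n_episodes, horizon, beta, rng):
    """Sketch of an episodic posterior-sampling loop for a rested Markovian
    bandit. `env` is an assumed interface: env.reset() returns an integer
    array with the state of each arm, env.step(arm) returns
    (next_state, reward) for the activated arm."""
    # One Dirichlet posterior per (arm, state) over next-state transitions;
    # rewards are summarised by empirical means for simplicity.
    trans_counts = np.ones((n_arms, n_states, n_states))
    reward_sum = np.zeros((n_arms, n_states))
    visits = np.ones((n_arms, n_states))

    for _ in range(n_episodes):
        # 1. Sample one transition model per arm from its posterior
        #    (cost linear in the number of arms).
        sampled_P = np.array([
            [rng.dirichlet(trans_counts[a, s]) for s in range(n_states)]
            for a in range(n_arms)
        ])
        sampled_r = reward_sum / visits

        # 2. Compute the Gittins index policy of the sampled model,
        #    arm by arm (again linear in the number of arms).
        indices = np.array([
            gittins_indices(sampled_P[a], sampled_r[a], beta)
            for a in range(n_arms)
        ])  # shape (n_arms, n_states)

        # 3. Play the sampled index policy for one episode and update
        #    the posterior of whichever arm is activated at each step.
        states = env.reset()
        for _ in range(horizon):
            arm = int(np.argmax(indices[np.arange(n_arms), states]))
            s = states[arm]
            next_state, reward = env.step(arm)
            trans_counts[arm, s, next_state] += 1
            reward_sum[arm, s] += reward
            visits[arm, s] += 1
            states[arm] = next_state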


