Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

09/28/2022 ∙ by Daniil Tiapkin, et al.

We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon H with S states and A actions. The performance of an agent is measured by the regret after interacting with the environment for T episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in H, S, A, and T per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most 𝒪(√(H^3SAT)), ignoring polylog(HSAT) terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms, which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order Ω(√(H^3SAT)), thereby answering the open problems raised by Agrawal and Jia [2017b] for the episodic setting.
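To make the idea of optimistic posterior sampling concrete, here is a minimal Python sketch of one planning step. It is an illustration under stated assumptions, not the authors' exact procedure: the abstract does not specify the prior, any posterior inflation, or the constant in front of the logarithm, so the uniform Dirichlet prior (`prior`), the sample count `num_samples = ceil(log(H*S*A*T))`, and the function name `optimistic_q_values` are all hypothetical choices. Rewards are assumed known and bounded in [0, 1].

```python
import math
import numpy as np

def optimistic_q_values(counts, rewards, num_samples, rng, prior=1.0):
    """Backward induction with optimistic Dirichlet posterior samples.

    counts:  (H, S, A, S) array of observed transition counts (stage-dependent).
    rewards: (H, S, A) array of known rewards in [0, 1] (an assumption).
    num_samples: number of posterior samples drawn per (stage, state, action).
    prior: hypothetical symmetric Dirichlet prior pseudo-count.
    """
    H, S, A, _ = counts.shape
    V = np.zeros((H + 1, S))   # V[H] = 0 is the terminal value
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                alpha = counts[h, s, a] + prior
                # Draw several transition vectors from the Dirichlet posterior.
                p = rng.dirichlet(alpha, size=num_samples)  # shape (J, S)
                # Optimism: keep the most favourable sampled next-step value.
                Q[h, s, a] = rewards[h, s, a] + np.max(p @ V[h + 1])
        V[h] = Q[h].max(axis=1)  # act greedily with respect to Q
    return Q, V

# Example usage on a tiny, unexplored MDP.
H, S, A, T = 3, 2, 2, 100
num_samples = math.ceil(math.log(H * S * A * T))  # logarithmic in H, S, A, T
rng = np.random.default_rng(0)
Q, V = optimistic_q_values(np.zeros((H, S, A, S)), np.ones((H, S, A)), num_samples, rng)
```

Taking the maximum over a handful of samples, rather than a single sample as in standard posterior sampling, is what makes the value estimate optimistic in this sketch; the agent would then play the greedy policy with respect to `Q` for one episode and update `counts`.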


