Regret Bounds for Reinforcement Learning via Markov Chain Concentration

08/06/2018
by Ronald Ortner, et al.

We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix S A T)) after T steps in uniformly ergodic MDPs with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
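For context, the bound above can be read against the standard notion of regret in the average-reward, non-episodic setting: the shortfall of the collected rewards relative to T times the optimal average reward. The display below uses standard notation (Δ(T) for the regret, ρ* for the optimal gain, r_t for the reward at step t) that is not quoted from the abstract; only the Õ(√(t_mix S A T)) rate itself is stated above.

\Delta(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t,
\qquad
\Delta(T) \;=\; \tilde{O}\!\left(\sqrt{t_{\mathrm{mix}}\, S\, A\, T}\right),

where ρ* is the optimal average reward (gain) of the MDP and r_t is the reward collected at step t.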


Related research

10/01/2020  Minimax Optimal Reinforcement Learning for Discounted MDPs
07/02/2021  Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning
07/19/2019  Delegative Reinforcement Learning: learning to avoid traps with a little help
02/12/2020  Regret Bounds for Discounted MDPs
12/13/2021  Continual Learning In Environments With Polynomial Mixing Times
05/09/2019  Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs
07/06/2023  Optimal Scalarizations for Sublinear Hypervolume Regret
