Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

04/21/2020
by Zihan Zhang et al.

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with S states, A actions, and episode length H. We propose a model-free algorithm, UCB-Advantage, and prove that it achieves Õ(√(H^2 SAT)) regret, where T = KH and K is the number of episodes played. This bound improves upon the results of [Jin et al., 2018] and matches both the best known model-based algorithms and the information-theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
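
To make the setting concrete, the sketch below implements the optimistic Q-learning baseline of [Jin et al., 2018] (the UCB-Hoeffding variant) that UCB-Advantage refines, run on a toy random MDP. The environment, the bonus constant c, and the fixed initial state are assumptions for illustration; the comment in the update marks where the paper's reference-advantage decomposition would replace the plain optimistic target. This is a minimal sketch, not the paper's full algorithm.

```python
# Sketch of optimistic Q-learning (UCB-Hoeffding, Jin et al. 2018), the
# baseline that UCB-Advantage refines via reference-advantage decomposition.
# The toy MDP, horizon, and the bonus constant c are assumptions.
import numpy as np

S, A, H, K = 5, 3, 4, 2000          # states, actions, horizon, episodes
rng = np.random.default_rng(0)

# A random tabular MDP (assumption: stationary transitions and rewards).
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))                 # mean rewards in [0, 1]

Q = np.full((H, S, A), float(H))             # optimistic initialization at H
N = np.zeros((H, S, A), dtype=int)           # per-step visit counts
c, iota = 1.0, np.log(S * A * H * K)         # bonus constant and log factor (assumptions)

for k in range(K):
    s = 0                                    # fixed initial state (assumption)
    for h in range(H):
        a = int(np.argmax(Q[h, s]))          # act greedily w.r.t. optimistic Q
        s_next = rng.choice(S, p=P[s, a])
        r = R[s, a]

        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)            # the H/(H+t)-style learning rate
        bonus = c * np.sqrt(H**3 * iota / t) # Hoeffding-style exploration bonus
        v_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())

        # UCB-Advantage would split v_next into a slowly updated reference
        # value plus an advantage term, learned at different rates, to reduce
        # the variance of this update; here we use the plain optimistic target.
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
        s = s_next
```

The Hoeffding-style bonus above yields Õ(√(H^4 SAT)) regret in [Jin et al., 2018]; splitting the next-step value into a slowly refined reference term plus an advantage term, each learned at its own rate, is what lets UCB-Advantage shave the extra H factors down to Õ(√(H^2 SAT)).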


Related research

06/08/2020 · A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret
Recently, model-free reinforcement learning has attracted research atten...

04/24/2019 · Stochastic Lipschitz Q-Learning
In an episodic Markov Decision Process (MDP) problem, an online algorith...

03/03/2022 · The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches
In this paper, we study the problem of regret minimization for episodic ...

09/22/2020 · Is Q-Learning Provably Efficient? An Extended Analysis
This work extends the analysis of the theoretical results presented with...

02/16/2020 · Investigating Simple Object Representations in Model-Free Deep Reinforcement Learning
We explore the benefits of augmenting state-of-the-art model-free deep r...

12/02/2021 · Differentially Private Exploration in Reinforcement Learning with Linear Representation
This paper studies privacy-preserving exploration in Markov Decision Pro...

01/31/2023 · Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments
We study variance-dependent regret bounds for Markov decision processes ...
