Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

10/09/2021
by Gen Li, et al.

Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with S states, A actions, and horizon length H, substantial progress has been made toward characterizing the minimax-optimal regret, which scales on the order of √(H^2 SAT) (modulo log factors), where T is the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they either are memory-inefficient or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., S^6 A^4 poly(H) for existing model-free methods). To overcome this sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity O(SAH), that achieves near-optimal regret as soon as the sample size exceeds the order of SA poly(H). In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves upon any prior memory-efficient algorithm that is asymptotically regret-optimal by at least a factor of S^5 A^3. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, aided by two Q-learning sequences that maintain upper and lower confidence bounds, respectively. The design principle of our early-settled variance reduction method may be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
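To make the algorithmic idea concrete, below is a heavily simplified Python sketch of the ingredients the abstract names: two tabular Q-learning sequences with upper and lower confidence bounds, a reference-advantage decomposition of the update target, and an early-settled rule that freezes the reference value once the confidence interval narrows to width 1. This is not the paper's exact algorithm — the bonus terms, the learning rates for the reference part, and the LCB construction are all simplified, and env_reset/env_step are hypothetical environment hooks introduced only for illustration.

```python
import numpy as np

def q_learning_reference_advantage(env_reset, env_step, S, A, H, K, c_b=1.0):
    """Simplified sketch: UCB/LCB Q-learning with an early-settled reference.

    Hypothetical environment interface (assumed, not from the paper):
      env_reset() -> initial state index
      env_step(s, a, h) -> (reward in [0, 1], next state index)
    """
    Q_ucb = np.full((H, S, A), float(H))            # optimistic Q-estimates
    Q_lcb = np.zeros((H, S, A))                     # pessimistic Q-estimates
    V_ucb = np.full((H + 1, S), float(H)); V_ucb[H] = 0.0
    V_lcb = np.zeros((H + 1, S))
    V_ref = np.full((H + 1, S), float(H)); V_ref[H] = 0.0
    settled = np.zeros((H, S), dtype=bool)          # is the reference frozen yet?
    N = np.zeros((H, S, A), dtype=int)              # visit counts
    mu_ref = np.zeros((H, S, A))                    # running mean of V_ref at next state
    iota = np.log(S * A * H * max(K, 2))            # log factor in the bonus

    for _ in range(K):                              # K episodes
        s = env_reset()
        for h in range(H):
            a = int(np.argmax(Q_ucb[h, s]))         # act greedily w.r.t. optimistic Q
            r, s_next = env_step(s, a, h)
            N[h, s, a] += 1
            n = N[h, s, a]
            lr = (H + 1) / (H + n)                  # the usual rescaled step size
            b = c_b * np.sqrt(H**3 * iota / n)      # Hoeffding-style bonus (simplified)

            # Reference-advantage decomposition: the slowly varying reference part
            # is averaged over *all* past visits (low variance), while only the
            # small advantage V - V_ref rides on the stochastic Q-learning update.
            mu_ref[h, s, a] += (V_ref[h + 1, s_next] - mu_ref[h, s, a]) / n
            adv_ucb = V_ucb[h + 1, s_next] - V_ref[h + 1, s_next]
            adv_lcb = V_lcb[h + 1, s_next] - V_ref[h + 1, s_next]

            t_ucb = r + adv_ucb + mu_ref[h, s, a] + b
            t_lcb = r + adv_lcb + mu_ref[h, s, a] - b
            Q_ucb[h, s, a] = min(Q_ucb[h, s, a], (1 - lr) * Q_ucb[h, s, a] + lr * t_ucb)
            Q_lcb[h, s, a] = max(Q_lcb[h, s, a], (1 - lr) * Q_lcb[h, s, a] + lr * t_lcb)

            V_ucb[h, s] = min(V_ucb[h, s], Q_ucb[h, s].max())
            V_lcb[h, s] = max(V_lcb[h, s], Q_lcb[h, s].max())

            # Early-settled reference update: freeze V_ref(s) permanently once
            # the confidence interval has width at most 1.
            if not settled[h, s] and V_ucb[h, s] - V_lcb[h, s] <= 1.0:
                V_ref[h, s] = V_ucb[h, s]
                settled[h, s] = True
            s = s_next
    return Q_ucb, Q_lcb, V_ref
```

Roughly speaking, the point of settling early is that once V_ref stops moving, its running average mu_ref concentrates quickly over subsequent visits, so the variance-reduced update pays off well before the sample size reaches the S^6 A^4 poly(H) threshold that earlier variance-reduced model-free methods required.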


