Settling the Sample Complexity of Online Reinforcement Learning

07/25/2023
by Zihan Zhang, et al.

A central issue at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works have achieved asymptotically minimal regret in online RL, the optimality of these results is guaranteed only in a “large-sample” regime, imposing an enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem in the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed in prior work, achieves a regret on the order of (modulo log factors) min{√(SAH^3 K), HK}, where S is the number of states, A is the number of actions, H is the planning horizon, and K is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample sizes K ≥ 1, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield ε-accuracy) of SAH^3/ε^2 up to logarithmic factors, which is minimax-optimal for the full ε-range. Further, we extend our theory to unveil the influence of problem-dependent quantities such as the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm that decouples complicated statistical dependencies – a long-standing challenge in the analysis of online RL in the sample-hungry regime.
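To make the link between the two guarantees concrete, the following LaTeX sketch, written in the notation of the abstract, records the stated regret bound and the standard online-to-batch argument that converts it into the quoted PAC sample complexity. The policy \hat{\pi}, drawn uniformly from the executed episode policies, is an illustrative assumption for this sketch and is not claimed to be the paper's exact procedure.

% Regret bound from the paper (modulo log factors), valid for all K >= 1:
\[
  \mathrm{Regret}(K) \;\lesssim\; \min\Bigl\{ \sqrt{S A H^{3} K},\; H K \Bigr\}.
\]
% Online-to-batch conversion (sketch): if \hat{\pi} is drawn uniformly from
% the K executed episode policies, its expected suboptimality is bounded by
% the average per-episode regret,
\[
  \mathbb{E}\bigl[ V_{1}^{\star} - V_{1}^{\hat{\pi}} \bigr]
    \;\le\; \frac{\mathrm{Regret}(K)}{K}
    \;\lesssim\; \sqrt{\frac{S A H^{3}}{K}},
\]
% which falls below a target accuracy \varepsilon as soon as
\[
  K \;\gtrsim\; \frac{S A H^{3}}{\varepsilon^{2}} \qquad \text{(up to log factors),}
\]
% matching the PAC sample complexity quoted in the abstract.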
