Fast Rates for the Regret of Offline Reinforcement Learning

01/31/2021
by   Yichun Hu, et al.

We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted Q-iteration (FQI), suggest an O(1/√n) convergence rate for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for regret convergence. First, we show that given any estimate of the optimal quality function Q^*, the regret of the policy it defines converges at a rate given by an exponentiation of the Q^*-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than in the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization that establish the correct pointwise convergence guarantees. As specific cases, our results imply O(1/n) regret rates in linear cases and exp(-Ω(n)) regret rates in tabular cases.
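To make the setup concrete, here is a minimal sketch of fitted Q-iteration in the tabular setting: it estimates Q^* from a fixed batch of offline transitions and then extracts the greedy policy whose regret the paper analyzes. This is an illustrative toy, not the paper's algorithm as stated; all function and variable names are our own, and in the tabular case the regression step of FQI reduces to per-(state, action) averaging of Bellman targets.

```python
import numpy as np

def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.9, n_iters=200):
    """Tabular fitted Q-iteration from a fixed offline dataset.

    transitions: list of (s, a, r, s_next) tuples collected by a behavior policy.
    Returns the estimated optimal Q-function and its greedy policy.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        # Build Bellman backup targets r + gamma * max_a' Q(s', a') from the data.
        for s, a, r, s_next in transitions:
            targets[s, a] += r + gamma * Q[s_next].max()
            counts[s, a] += 1
        # Regression step: with a tabular function class, least squares
        # reduces to averaging the targets for each visited (s, a) pair.
        visited = counts > 0
        Q[visited] = targets[visited] / counts[visited]
    # The policy whose regret is analyzed: act greedily w.r.t. the Q^* estimate.
    policy = Q.argmax(axis=1)
    return Q, policy
```

On a tiny deterministic two-state MDP where action 1 always yields reward 1, the estimate converges to the true Q^* (e.g. Q(1,1) = 1/(1−γ) = 10 for γ = 0.9) and the greedy policy picks action 1 in both states; the paper's point is that the regret of this extracted policy can shrink much faster than the pointwise error of the Q^* estimate itself.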


