
Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

by Ming Yin, et al.

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for finite-horizon MDPs) and provides a unified view of optimal learning for several well-motivated offline tasks. Uniform OPE, sup_{π∈Π} |Q^π - Q̂^π| < ϵ (initiated in prior work), is a stronger measure than point-wise (fixed-policy) OPE and ensures offline policy learning when Π contains all policies (the global policy class). In this paper, we establish an Ω(H^2 S / d_m ϵ^2) lower bound (over the model-based family) for global uniform OPE, where d_m is the minimal state-action probability induced by the behavior policy. Next, our main result establishes an episode complexity of Õ(H^2 / d_m ϵ^2) for local uniform convergence, which applies to all near-empirically-optimal policies in MDPs with stationary transitions. This result implies the optimal sample complexity for offline learning and separates local uniform OPE from the global case by the extra factor of S. Crucially, the model-based method, combined with our new analysis technique (the singleton absorbing MDP), can be adapted to two new settings: offline task-agnostic learning and offline reward-free learning, with optimal complexities Õ(H^2 log(K) / d_m ϵ^2) (where K is the number of tasks) and Õ(H^2 S / d_m ϵ^2) respectively. This provides a unified framework for simultaneously solving different offline RL problems.
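To make the model-based (plug-in) approach and the uniform-OPE criterion concrete, here is a minimal toy sketch, not the paper's method: it fits an empirical transition model from offline episodes gathered by a uniform behavior policy, evaluates Q̂^π by backward induction for every deterministic policy in a small tabular MDP, and reports the uniform error sup_{π∈Π} |Q^π - Q̂^π|. All sizes (S, A, H), the episode count, and the random MDP instance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4  # toy tabular, time-homogeneous MDP (assumed sizes)

# Ground-truth dynamics and mean rewards for the toy instance.
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] -> next-state distribution
R = rng.uniform(size=(S, A))                # deterministic mean reward

def q_values(P_model, pi):
    """Backward induction for Q^pi (at step 0) under a transition model."""
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        V_next = (pi * Q[h + 1]).sum(axis=1)  # V_{h+1}(s') under pi
        Q[h] = R + P_model @ V_next           # Bellman backup with the model
    return Q[0]

# Offline data: episodes from a uniform behavior policy; fit the
# plug-in model P_hat by counting observed transitions.
counts = np.zeros((S, A, S))
for _ in range(5000):
    s = rng.integers(S)
    for _ in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        counts[s, a, s_next] += 1
        s = s_next
P_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

# Uniform OPE error over the finite class of deterministic policies.
errs = []
for idx in range(A ** S):
    pi = np.zeros((S, A))
    for s in range(S):
        pi[s, (idx // A ** s) % A] = 1.0  # decode one deterministic policy
    errs.append(np.abs(q_values(P, pi) - q_values(P_hat, pi)).max())
print(f"sup over Pi of |Q^pi - Qhat^pi|: {max(errs):.3f}")
```

With enough offline coverage (here d_m is bounded away from zero because the behavior policy is uniform), the uniform error shrinks; the paper's bounds characterize exactly how many episodes this takes as a function of H, S, d_m, and ϵ.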



