Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

05/13/2021

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) with model-based methods in finite-horizon MDPs, and provides a unified view of optimal learning for several well-motivated offline tasks. Uniform OPE, sup_{π∈Π} |Q^π - Q̂^π| < ϵ (initiated by <cit.>), is a stronger guarantee than point-wise (fixed-policy) OPE and ensures offline policy learning when Π contains all policies (the global policy class). In this paper, we establish an Ω(H^2 S/(d_m ϵ^2)) lower bound (over the model-based family) for global uniform OPE, where d_m is the minimal state-action probability induced by the behavior policy. Our main result then establishes an episode complexity of Õ(H^2/(d_m ϵ^2)) for local uniform convergence, which applies to all near-empirically optimal policies in MDPs with stationary transitions. This result implies the optimal sample complexity for offline learning and separates local uniform OPE from the global case, which carries an extra factor of S. Most importantly, the model-based method, combined with our new analysis technique (the singleton absorbing MDP), can be adapted to two new settings: offline task-agnostic learning, with optimal complexity Õ(H^2 log(K)/(d_m ϵ^2)) (where K is the number of tasks), and offline reward-free learning, with optimal complexity Õ(H^2 S/(d_m ϵ^2)). This provides a unified framework for simultaneously solving different offline RL problems.
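As a rough illustration of what a model-based estimate Q̂^π looks like in a finite-horizon MDP with stationary transitions, the sketch below builds a count-based empirical transition kernel from offline data and evaluates a fixed policy by backward induction. It is a minimal sketch, not the paper's algorithm or analysis: the function names, the (s, a, s') tuple format, the known deterministic reward table r, and the treatment of unvisited state-action pairs are all assumptions made for this example.

```python
import numpy as np

def empirical_model(transitions, S, A):
    """Count-based estimate P_hat[s, a, s'] of a stationary transition kernel.

    `transitions` is an iterable of (s, a, s_next) index triples pooled over
    all time steps (hypothetical data format for this sketch).
    """
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    n_sa = counts.sum(axis=2, keepdims=True)
    # Assumption of this sketch: unvisited (s, a) pairs get an all-zero row.
    return np.divide(counts, n_sa, out=np.zeros_like(counts), where=n_sa > 0)

def evaluate_policy(P, r, pi, H):
    """Backward induction for Q[h, s, a] under a deterministic policy pi[h, s].

    P has shape (S, A, S); r has shape (S, A) and is assumed known.
    """
    S, A = r.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)  # value after the final step is zero
    for h in reversed(range(H)):
        Q[h] = r + P @ V_next               # (S, A, S) @ (S,) -> (S, A)
        V_next = Q[h][np.arange(S), pi[h]]  # value of following pi at step h
    return Q
```

Running evaluate_policy once with the empirical kernel and once with the true kernel gives Q̂^π and Q^π for a single policy; the uniform OPE error in the abstract is the supremum of their difference over every policy in Π.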
