A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP

07/13/2022
by Fan Chen et al.

As an important framework for safe reinforcement learning, the Constrained Markov Decision Process (CMDP) has been extensively studied in the recent literature. However, despite the rich results under various on-policy learning settings, an essential understanding of offline CMDP problems is still lacking, in terms of both algorithm design and the information-theoretic sample complexity lower bound. In this paper, we focus on solving CMDP problems when only offline data are available. Adopting the concept of the single-policy concentrability coefficient C^*, we establish an Ω(min{|𝒮||𝒜|, |𝒮|+I}·C^*/((1-γ)^3 ϵ^2)) sample complexity lower bound for the offline CMDP problem, where I denotes the number of constraints. By introducing a simple yet novel deviation-control mechanism, we propose a near-optimal primal-dual learning algorithm called DPDL. This algorithm provably guarantees zero constraint violation, and its sample complexity matches the lower bound above up to an 𝒪̃((1-γ)^-1) factor. A comprehensive discussion of how to handle the unknown constant C^* and the potential asynchronous structure of the offline dataset is also included.
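For intuition about the primal-dual approach the abstract describes, the following is a minimal, hypothetical sketch of a generic Lagrangian primal-dual loop on a toy tabular CMDP with one constraint (I = 1). It is not the paper's DPDL algorithm: DPDL operates on offline data and adds a deviation-control mechanism, whereas this sketch assumes the model (P, r, c) is known purely to keep the example short. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy CMDP with |S| states, |A| actions, and one constraint.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
r = rng.uniform(size=(S, A))                 # reward table
c = rng.uniform(size=(S, A))                 # constraint-cost table
b = 0.5 * c.mean() / (1 - gamma)             # cost budget: require V_c(pi) <= b
rho = np.full(S, 1.0 / S)                    # initial state distribution

def policy_value(pi, f):
    """Discounted value rho^T (I - gamma * P_pi)^{-1} f_pi for a table f."""
    P_pi = np.einsum('saj,sa->sj', P, pi)
    f_pi = np.einsum('sa,sa->s', f, pi)
    return rho @ np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)

def solve_mdp(f):
    """Approximately solve the unconstrained MDP with reward table f by
    value iteration, then return the greedy deterministic policy."""
    v = np.zeros(S)
    for _ in range(500):
        v = (f + gamma * P @ v).max(axis=1)
    q = f + gamma * P @ v
    pi = np.zeros((S, A))
    pi[np.arange(S), q.argmax(axis=1)] = 1.0
    return pi

# Primal-dual iterations for max_pi V_r(pi) s.t. V_c(pi) <= b:
# the primal step best-responds to the Lagrangian reward r - lam * c,
# and the dual step does projected ascent on the constraint violation.
lam, eta = 0.0, 0.5
for t in range(200):
    pi = solve_mdp(r - lam * c)              # primal: maximize the Lagrangian
    violation = policy_value(pi, c) - b      # dual gradient: V_c(pi) - b
    lam = max(0.0, lam + eta * violation)    # lam <- [lam + eta * grad]_+

print(f"reward {policy_value(pi, r):.3f}, "
      f"cost {policy_value(pi, c):.3f}, budget {b:.3f}")
```

In theory, primal-dual schemes of this kind return a policy averaged over the iterates rather than the last iterate, which is printed above only for brevity; guaranteeing zero constraint violation from finite offline data is precisely the harder problem the paper addresses.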


Related research:

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning (06/09/2021)
Recent theoretical work studies sample-efficient reinforcement learning ...

AsyncQVI: Asynchronous-Parallel Q-Value Iteration for Reinforcement Learning with Near-Optimal Sample Complexity (12/03/2018)
In this paper, we propose AsyncQVI: Asynchronous-Parallel Q-value Iterat...

Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach (09/13/2021)
Reinforcement learning is widely used in applications where one needs to...

A Lower Bound for the Sample Complexity of Inverse Reinforcement Learning (03/07/2021)
Inverse reinforcement learning (IRL) is the task of finding a reward fun...

Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning (07/07/2020)
The Off-Policy Evaluation aims at estimating the performance of target p...

The Efficacy of Pessimism in Asynchronous Q-Learning (03/14/2022)
This paper is concerned with the asynchronous form of Q-learning, which ...

Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning (02/24/2020)
It has been a trend in the Reinforcement Learning literature to derive s...
