Nearly Minimax Optimal Reward-free Reinforcement Learning

10/12/2020
by Zihan Zhang, et al.

We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and for scenarios where one needs policies for multiple reward functions. The framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent must return a near-optimal policy for an arbitrary reward function. We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment for at most O(S^2 A / ϵ^2 · polylog(SAH/ϵ)) episodes in the exploration phase and is guaranteed to output a near-optimal policy for arbitrary reward functions in the planning phase. Here, S is the size of the state space, A is the size of the action space, H is the planning horizon, and ϵ is the target accuracy relative to the total reward. Notably, our sample complexity scales only logarithmically with H, in contrast to all existing results, which scale polynomially with H. Furthermore, this bound matches the minimax lower bound Ω(S^2 A / ϵ^2) up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition under which the collected dataset supports planning an ϵ-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated cumulative reward efficiently.
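To make the two-phase protocol concrete, below is a minimal sketch in Python of reward-free exploration followed by plug-in planning on the empirical model. It is not the paper's Staged Sampling + Truncated Planning algorithm: the exploration policy is a placeholder (uniform random actions), and the environment interface (env.reset(), env.step()), the sizes S, A, H, and the reward array are all illustrative assumptions.

# Minimal sketch of the reward-free RL protocol (not the paper's algorithm).
# Assumptions: a tabular env with env.reset() -> state and env.step(a) -> next state,
# S states, A actions, horizon H, and a post-hoc reward array reward[h, s, a] in [0, 1].
import numpy as np

def explore(env, S, A, H, num_episodes, rng):
    """Exploration phase: collect transition counts; no reward is observed or stored."""
    counts = np.zeros((S, A, S))              # N(s, a, s')
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(H):
            a = rng.integers(A)               # placeholder exploration policy
            s_next = env.step(a)
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def plan(counts, reward, H):
    """Planning phase: value iteration on the empirical model for an arbitrary reward."""
    S, A, _ = counts.shape
    n = counts.sum(axis=2, keepdims=True)
    p_hat = np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)  # empirical P(s'|s,a)
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward[h] + p_hat @ V             # Q[s, a] = r(h,s,a) + sum_{s'} P(s'|s,a) V(s')
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

# Usage: explore once, then plan for any reward function supplied afterwards, e.g.
#   counts = explore(env, S, A, H, num_episodes=10_000, rng=np.random.default_rng(0))
#   policy, V0 = plan(counts, reward, H)

A naive plug-in planner like this is only reliable when the exploration data covers the reachable state-action space well enough; formalizing such a sufficient condition on the dataset, and planning robustly under it via soft truncation, is what the paper's techniques address.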

Related research

03/17/2023 - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs
02/07/2020 - Reward-Free Exploration for Reinforcement Learning
05/29/2018 - Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning
06/11/2020 - Exploration by Maximizing Rényi Entropy for Zero-Shot Meta RL
06/28/2022 - Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL
09/06/2021 - Hindsight Reward Tweaking via Conditional Deep Reinforcement Learning
12/26/2021 - Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions
