Gap-Dependent Unsupervised Exploration for Reinforcement Learning

08/11/2021
by Jingfeng Wu, et al.

In task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals; it is then shown a reward and asked to compute a corresponding near-optimal policy. Existing approaches mainly address worst-case scenarios, in which no structural information about the reward or transition dynamics is used. The best sample upper bound is therefore ∝ 𝒪(1/ϵ^2), where ϵ > 0 is the target accuracy of the obtained policy, which can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that uses a gap parameter, ρ > 0, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only 𝒪(1/ϵ · (H^3 SA/ρ + H^4 S^2 A)) episodes of exploration and obtains an ϵ-optimal policy for a post-revealed reward with sub-optimality gap at least ρ, where S is the number of states, A is the number of actions, and H is the length of the horizon. This is a nearly quadratic saving in terms of ϵ. We show that, information-theoretically, this bound is nearly tight for ρ < Θ(1/(HS)) and H > 1. We further show that a ∝ 𝒪(1) sample bound is possible for H = 1 (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.
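To give a feel for how the episode bound in the abstract scales, the following is a minimal Python sketch. It evaluates the expression (1/ϵ) · (H^3 SA/ρ + H^4 S^2 A) with all constants and logarithmic factors dropped, which is an assumption made here for illustration and not something the abstract specifies. The function names are hypothetical and are not taken from the paper.

```python
def gap_dependent_episodes(eps, rho, H, S, A):
    """Evaluate (1/eps) * (H^3*S*A/rho + H^4*S^2*A).

    Assumption: constants and log factors hidden by the O(.) notation
    are set to 1, so only the scaling is meaningful, not the value.
    """
    gap_term = H**3 * S * A / rho        # depends on the gap parameter rho
    gap_free_term = H**4 * S**2 * A      # independent of rho
    return (gap_term + gap_free_term) / eps


def dominant_term(rho, H, S):
    """Report which term of the bound is larger.

    The ratio of the two terms is 1 / (rho * H * S), so they are
    equal exactly when rho = 1 / (H * S).
    """
    return "gap term" if rho * H * S < 1 else "gap-free term"


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    H, S, A = 10, 20, 5
    for rho in (0.001, 0.005, 0.05):
        n = gap_dependent_episodes(eps=0.01, rho=rho, H=H, S=S, A=A)
        print(rho, dominant_term(rho, H, S), n)
```

The two terms balance at ρ = 1/(HS), which matches the regime ρ < Θ(1/(HS)) in which the abstract states the bound is nearly tight: below that threshold the gap-dependent term H^3 SA/ρ dominates. Halving ϵ doubles this expression, whereas a bound that scales as 1/ϵ^2 would quadruple.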

