
A Provably Efficient Sample Collection Strategy for Reinforcement Learning

by Jean Tarbouriech, et al.

A common assumption in reinforcement learning (RL) is access to a generative model (i.e., a simulator of the environment), which allows samples to be generated from any desired state-action pair. In many settings, however, a generative model is not available, and an adaptive exploration strategy is needed to efficiently collect samples from an unknown environment by direct interaction. In this paper, we study the scenario where an algorithm based on the generative model assumption defines the (possibly time-varying) number of samples b(s,a) required at each state-action pair (s,a), and an exploration strategy has to learn how to generate b(s,a) samples as fast as possible. Building on recent results for regret minimization in the stochastic shortest path (SSP) setting (Cohen et al., 2020; Tarbouriech et al., 2020), we derive an algorithm that requires Õ(BD + D^{3/2} S^2 A) time steps to collect the B = ∑_{s,a} b(s,a) desired samples, in any unknown and communicating MDP with S states, A actions and diameter D. Leveraging the generality of our strategy, we readily apply it to a variety of existing settings (e.g., model estimation, pure exploration in MDPs), for which we obtain improved sample-complexity guarantees, and to a set of new problems such as best-state identification and sparse reward discovery.
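
To make the problem statement concrete, the following is a minimal sketch of the sample collection task itself, not of the paper's algorithm. It counts how many time steps a uniformly random policy needs before every pair (s,a) has been visited b(s,a) times. The names `collect_samples` and `ring_transition`, the toy four-state ring environment, and the choice of a random policy are all assumptions made for illustration; the SSP-based strategy described in the abstract would replace the random action choice.

```python
import random


def collect_samples(transition, b, start_state, max_steps=100_000, seed=0):
    """Count the steps a uniformly random policy needs to meet requirements b.

    transition(state, action, rng) -> next_state is a hypothetical
    environment step function (an assumption, not from the paper).
    b maps (state, action) pairs to the required number of samples.
    Returns the number of time steps used, or None if max_steps is exceeded.
    """
    rng = random.Random(seed)
    actions = sorted({a for (_, a) in b})
    # Only track pairs that still need at least one sample.
    remaining = {sa: n for sa, n in b.items() if n > 0}
    if not remaining:
        return 0
    state = start_state
    for t in range(1, max_steps + 1):
        action = rng.choice(actions)  # baseline: random exploration
        if (state, action) in remaining:
            remaining[(state, action)] -= 1
            if remaining[(state, action)] == 0:
                del remaining[(state, action)]
        state = transition(state, action, rng)
        if not remaining:
            return t
    return None


def ring_transition(state, action, rng, n_states=4):
    """Toy communicating MDP: action 0 stays put, action 1 moves one step."""
    return (state + action) % n_states


# Require 2 samples from each of the 4 x 2 state-action pairs.
b = {(s, a): 2 for s in range(4) for a in (0, 1)}
B = sum(b.values())  # total number of required samples
steps = collect_samples(ring_transition, b, start_state=0)
print(f"B = {B}, steps used by the random baseline = {steps}")
```

Because each time step yields at most one sample, any strategy needs at least B steps; the bound in the abstract quantifies how much more is needed when the environment is unknown and states must be reached by navigation.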




An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

A fundamental question in the theory of reinforcement learning is: suppo...

Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

We study the sample complexity of learning an ϵ-optimal policy in the St...

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

A fundamental question in reinforcement learning is whether model-free a...

Stochastic Lipschitz Q-Learning

In an episodic Markov Decision Process (MDP) problem, an online algorith...

Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity

In this paper, we settle the sampling complexity of solving discounted t...

Fast active learning for pure exploration in reinforcement learning

Realistic environments often provide agents with very limited feedback. ...

No-Regret Exploration in Goal-Oriented Reinforcement Learning

Many popular reinforcement learning problems (e.g., navigation in a maze...