Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

by Andrea Zanette, et al.

Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near-optimal policy or 2) estimate the value of a target policy. For both tasks we derive exponential information-theoretic lower bounds in discounted infinite-horizon MDPs with a linear function representation for the action-value function, even if 1) realizability holds, 2) the batch algorithm observes the exact reward and transition functions, and 3) the batch algorithm is given the best a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts, then the lower bounds hold even if all policies admit a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an online algorithm, showing that there exist RL problems where batch RL is exponentially harder than online RL even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample-efficient RL with function approximation. A second corollary is the exponential separation between finite- and infinite-horizon batch problems under our assumptions. On a technical level, this work helps formalize the issue known as the deadly triad and shows that the bootstrapping problem is potentially more severe than the extrapolation issue for RL: unlike the latter, bootstrapping cannot be mitigated by adding more samples.
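To make the bootstrapping mechanism concrete, here is a minimal sketch of linear fitted Q-iteration, the kind of bootstrapped regression scheme the abstract refers to. The toy MDP, the one-hot feature map, and all constants are illustrative assumptions, not the paper's hard instance. With a one-hot basis the linear class is fully expressive, so the update reduces to exact value iteration; with a restricted basis, each least-squares projection introduces representation error that the bootstrapped targets then propagate through every subsequent iteration, which more samples cannot fix.

```python
import numpy as np

gamma = 0.9
n_states, n_actions, d = 2, 2, 4

# One-hot features over (s, a) pairs: an assumed, fully expressive basis.
phi = np.eye(d).reshape(n_states, n_actions, d)

P = np.array([[1, 0], [0, 1]])          # deterministic next state for (s, a)
R = np.array([[0.0, 1.0], [1.0, 0.0]])  # reward for (s, a)

theta = np.zeros(d)
for _ in range(200):
    X, y = [], []
    for s in range(n_states):
        for a in range(n_actions):
            s_next = P[s, a]
            # Bootstrapping: the regression target is built from the
            # CURRENT estimate theta, not from ground-truth values.
            q_next = max(phi[s_next, b] @ theta for b in range(n_actions))
            X.append(phi[s, a])
            y.append(R[s, a] + gamma * q_next)
    # Least-squares projection of the Bellman backup onto the linear class.
    theta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)

Q_hat = phi @ theta  # estimated action values, shape (n_states, n_actions)
print(np.round(Q_hat, 2))
```

In this tabular special case the iterates converge to the true action values; the point of the paper's lower bound is that once the feature class is restricted, no amount of batch data can prevent the projected, bootstrapped errors from compounding exponentially.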



