Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

12/14/2020
by Andrea Zanette, et al.

Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near-optimal policy or 2) estimate the value of a target policy. For both tasks we derive exponential information-theoretic lower bounds in discounted infinite-horizon MDPs with a linear function representation for the action-value function, even if 1) realizability holds, 2) the batch algorithm observes the exact reward and transition functions, and 3) the batch algorithm is given the best a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts, then the lower bounds hold even if all policies admit a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an online algorithm, showing that there exist RL problems where batch RL is exponentially harder than online RL even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample-efficient RL with function approximation. A second corollary is the exponential separation between finite- and infinite-horizon batch problems under our assumptions. On a technical level, this work helps formalize the issue known as the deadly triad and explains that the bootstrapping problem is potentially more severe than the extrapolation issue for RL because, unlike the latter, bootstrapping cannot be mitigated by adding more samples.
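For context, the realizability condition referenced above is the standard linear action-value assumption; a sketch of the usual statement (paraphrased, not quoted from the paper) is: given a known feature map \(\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d\), every policy \(\pi\) in the relevant class admits an unknown parameter \(\theta_\pi \in \mathbb{R}^d\) such that

\[
Q^\pi(s, a) = \langle \phi(s, a), \theta_\pi \rangle \quad \text{for all } (s, a) \in \mathcal{S} \times \mathcal{A}.
\]

The lower bounds state that even under this assumption, with exact rewards and transitions observed on a favorable batch data distribution, batch algorithms cannot be sample-efficient.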

