Survival Instinct in Offline Reinforcement Learning

06/05/2023
by Anqi Li, et al.

We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and a certain bias implicit in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. Formally, given a reward class – which may not even contain the true reward – we identify conditions on the training data distribution that enable offline RL to learn a near-optimal and safe policy from any reward within the class. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is "nudged" to learn a desirable behavior with imperfect reward but purposely biased data coverage.
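The interplay between pessimism and data coverage described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' algorithm: it assumes a deterministic three-state tabular MDP, a hypothetical dataset `DATASET`, and a simple form of pessimism in which every state-action pair absent from the data receives the worst possible discounted return. The function name `pessimistic_value_iteration`, the `penalty` parameter, and both reward functions are illustrative assumptions. The sketch also assumes the penalty exceeds the magnitude of any reward in the class.

```python
import numpy as np

# Hypothetical offline dataset of (state, action, next_state) transitions.
# States 0 and 1 form a loop that is fully covered by the data.
# Action 1 in state 0 leads to state 2, where the data contains no actions.
DATASET = [(0, 0, 1), (1, 0, 0), (0, 1, 2)]


def pessimistic_value_iteration(n_states, n_actions, dataset, reward_fn,
                                gamma=0.9, penalty=2.0, iters=200):
    """Tabular value iteration that penalizes pairs outside the data support."""
    # Deterministic empirical model, defined only on observed pairs.
    model = {(s, a): s_next for s, a, s_next in dataset}
    # Pessimistic value for unseen pairs: a penalty incurred at every step.
    q_floor = -penalty / (1.0 - gamma)
    q = np.full((n_states, n_actions), q_floor)
    for _ in range(iters):
        v = q.max(axis=1)
        new_q = np.full_like(q, q_floor)
        for (s, a), s_next in model.items():
            new_q[s, a] = reward_fn(s, a) + gamma * v[s_next]
        q = new_q
    return q


def zero_reward(s, a):
    # "Wrong" label: zero everywhere.
    return 0.0


def negated_reward(s, a):
    # "Wrong" label: negative of an assumed true reward that pays +1 in the loop.
    return -1.0 if (s, a) in {(0, 0), (1, 0)} else 0.0


q_zero = pessimistic_value_iteration(3, 2, DATASET, zero_reward)
q_neg = pessimistic_value_iteration(3, 2, DATASET, negated_reward)
greedy_zero = q_zero.argmax(axis=1)
greedy_neg = q_neg.argmax(axis=1)
```

In this sketch, both wrong rewards lead the greedy policy in state 0 to choose action 0, which keeps the agent inside the covered loop, because the alternative reaches a state where every action carries the out-of-support penalty. If `penalty` were lowered to 1.0, the negated reward would tip the choice toward leaving the support, which is why the assumption about the penalty size matters.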


