Safely Bridging Offline and Online Reinforcement Learning

10/25/2021

A key challenge in deploying reinforcement learning in practice is exploring safely. We propose a natural safety property: uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. We experimentally validate our results on a sepsis treatment task, demonstrating that our algorithm can learn while ensuring good performance compared to the baseline policy for every patient.
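The override step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the names `ValueEstimate` and `choose_policy`, the symmetric confidence intervals, and the specific comparison rule are all assumptions made for illustration. The idea shown is only that the exploratory (UCB) policy is used for an episode when its pessimistic value estimate is within the exploration budget of the conservative policy's value, and is overridden otherwise.

```python
from dataclasses import dataclass


@dataclass
class ValueEstimate:
    """Hypothetical value estimate for one policy: a mean and a confidence half-width."""
    mean: float
    half_width: float

    @property
    def lower(self) -> float:
        # Pessimistic (lower confidence bound) value.
        return self.mean - self.half_width

    @property
    def upper(self) -> float:
        # Optimistic (upper confidence bound) value.
        return self.mean + self.half_width


def choose_policy(explore: ValueEstimate, baseline: ValueEstimate, budget: float) -> str:
    """Pick the policy to run for one episode (illustrative rule, an assumption).

    Run the exploratory policy only if its lower bound is no more than
    `budget` below the conservative policy's upper bound; otherwise
    override it and fall back to the conservative policy.
    """
    if explore.lower >= baseline.upper - budget:
        return "explore"
    return "baseline"


if __name__ == "__main__":
    conservative = ValueEstimate(mean=0.8, half_width=0.05)
    uncertain_ucb = ValueEstimate(mean=1.0, half_width=0.5)
    print(choose_policy(uncertain_ucb, conservative, budget=0.1))
    print(choose_policy(uncertain_ucb, conservative, budget=0.4))
```

With a small budget the uncertain exploratory policy is overridden; a larger per-episode budget lets the same policy run. As the exploratory policy's confidence interval shrinks with more data, the check passes more often, which is one way to read the abstract's claim that the algorithm can learn while staying close to the baseline.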
