Safely Bridging Offline and Online Reinforcement Learning
A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety property: uniformly outperforming a conservative baseline policy (adaptively estimated from all data observed so far), up to a per-episode exploration budget. We then design an algorithm that uses a UCB reinforcement learning policy for exploration but overrides it as needed to ensure safety with high probability. We experimentally validate our approach on a sepsis treatment task, demonstrating that our algorithm can learn while maintaining good performance relative to the baseline policy for every patient.
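To make the override mechanism concrete, below is a minimal Python sketch of one plausible reading of the safety check: execute the UCB action only when a pessimistic (lower confidence bound) value estimate says the worst-case shortfall relative to the baseline stays within the remaining per-episode budget. All names, signatures, and the budget accounting here are illustrative assumptions, not the paper's actual algorithm.

```python
def choose_action(ucb_action, baseline_action, q_lcb, q_baseline, budget_left):
    """Decide between the exploratory UCB action and the safe baseline action.

    q_lcb:       lower confidence bound on the value of ucb_action (assumed given)
    q_baseline:  estimated value of baseline_action under the conservative policy
    budget_left: exploration budget remaining in the current episode

    Returns the action to execute and the updated remaining budget.
    """
    # Worst-case loss from exploring instead of following the baseline.
    worst_case_shortfall = q_baseline - q_lcb
    if worst_case_shortfall <= budget_left:
        # Exploring keeps the episode within budget with high probability.
        return ucb_action, budget_left - max(worst_case_shortfall, 0.0)
    # Override: fall back to the baseline action, which is safe by construction.
    return baseline_action, budget_left


if __name__ == "__main__":
    # Toy episode: (ucb_action, baseline_action, q_lcb, q_baseline) per step.
    budget = 1.0  # per-episode exploration budget, in units of return (assumed)
    for step, (a_ucb, a_base, lcb, base_val) in enumerate(
        [(1, 0, 0.8, 0.9), (2, 0, -0.5, 0.9), (3, 0, 0.85, 0.9)]
    ):
        action, budget = choose_action(a_ucb, a_base, lcb, base_val, budget)
        print(f"step {step}: executed action {action}, budget left {budget:.2f}")
```

In this toy run the second step's pessimistic shortfall (1.4) exceeds the remaining budget (0.9), so the baseline action is executed instead; the other steps explore. The key design point the sketch illustrates is that safety is enforced by an online override, so any exploration policy can be plugged in without modification.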