Safely Bridging Offline and Online Reinforcement Learning

10/25/2021
by Wanqiao Xu et al.

A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety property – uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. We experimentally validate our results on a sepsis treatment task, demonstrating that our algorithm can learn while ensuring good performance compared to the baseline policy for every patient.
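The abstract describes an exploration policy that follows a UCB action unless doing so could violate the safety guarantee, in which case it falls back to the conservative baseline. The sketch below is a hypothetical illustration of that override pattern, not the paper's actual algorithm: the function `choose_action`, the LCB/UCB value arrays, and the budget-tracking scheme are all assumptions made for exposition.

```python
import numpy as np

def choose_action(q_lcb, q_ucb, q_baseline, a_baseline, budget_remaining):
    """Hypothetical safe-override rule: take the optimistic (UCB) action,
    but revert to the baseline action whenever the pessimistic (LCB)
    estimate cannot guarantee the loss relative to the baseline stays
    within the remaining per-episode exploration budget."""
    a_ucb = int(np.argmax(q_ucb))  # exploratory action under optimism
    # Worst-case value lost vs. the baseline if we explore at this step
    worst_case_loss = q_baseline[a_baseline] - q_lcb[a_ucb]
    if worst_case_loss <= budget_remaining:
        # Safe to explore: charge the worst-case loss against the budget
        return a_ucb, budget_remaining - max(worst_case_loss, 0.0)
    # Otherwise override exploration and follow the conservative policy
    return a_baseline, budget_remaining
```

For example, with `q_lcb = [0.5, 0.8]`, `q_ucb = [1.0, 2.0]`, baseline values `[0.9, 0.7]`, and baseline action `0`, a budget of `0.2` permits the exploratory action (worst-case loss 0.1), while a budget of `0.05` triggers the fallback to the baseline.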
