Safe Policy Improvement with an Estimated Baseline Policy

09/11/2019
by Thiago D. Simão, et al.

Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting and proposed the theoretically grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs in order to control the variance of the trained policy's performance. However, in many real-world applications, such as dialogue systems, pharmaceutical tests, or crop management, data is collected under human supervision and the baseline remains unknown. In this paper, we apply SPIBB algorithms with a baseline estimate built from the data. We formally show safe policy improvement guarantees over the true baseline even without direct access to it. Our empirical experiments on finite-state and continuous-state tasks support the theoretical findings. The approach shows little loss of performance compared with SPIBB when the baseline policy is given and, more importantly, drastically and significantly outperforms competing algorithms both in safe policy improvement and in average performance.
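To make the idea of "a baseline estimate built from the data" concrete, here is a minimal sketch for the finite-state case. It is an illustration under assumptions, not the paper's implementation: the abstract does not specify the estimator, so the sketch assumes a simple count-based (maximum-likelihood) estimate, and it assumes that "uncertain" state-action pairs are those observed fewer than a threshold number of times. The names `estimate_baseline`, `uncertain_pairs`, and `min_count` are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_baseline(transitions, n_actions):
    """Count-based estimate of the unknown baseline policy.

    `transitions` is an iterable of logged (state, action) pairs.
    Assumption: the estimate is the empirical action frequency per state.
    Returns (policy, counts), where policy[state] is a list of action
    probabilities and counts[state][action] is the observation count.
    """
    counts = defaultdict(Counter)
    for state, action in transitions:
        counts[state][action] += 1

    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = [action_counts[a] / total for a in range(n_actions)]
    return policy, counts


def uncertain_pairs(counts, n_actions, min_count):
    """State-action pairs seen fewer than `min_count` times.

    Assumption: these are the pairs where the trained policy would
    copy the (estimated) baseline instead of trusting the data.
    """
    return {
        (state, action)
        for state, action_counts in counts.items()
        for action in range(n_actions)
        if action_counts[action] < min_count
    }


# Hypothetical logged data: two states, two actions.
logged = [(0, 0), (0, 0), (0, 1), (1, 1)]
policy_hat, counts = estimate_baseline(logged, n_actions=2)
uncertain = uncertain_pairs(counts, n_actions=2, min_count=2)
```

In this toy log, only the pair (state 0, action 0) is observed at least twice, so every other pair is treated as uncertain and would fall back to the estimated baseline probabilities.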

research
07/11/2019

Safe Policy Improvement with Soft Baseline Bootstrapping

Batch Reinforcement Learning (Batch RL) consists in training a policy us...
research
12/19/2017

Safe Policy Improvement with Baseline Bootstrapping

A common goal in Reinforcement Learning is to derive a good strategy giv...
research
05/13/2023

More for Less: Safe Policy Improvement With Stronger Performance Guarantees

In an offline reinforcement learning setting, the safe policy improvemen...
research
07/13/2016

Safe Policy Improvement by Minimizing Robust Baseline Regret

An important problem in sequential decision-making under uncertainty is ...
research
05/20/2018

Safe Policy Learning from Observations

In this paper, we consider the problem of learning a policy by observing...
research
08/01/2022

Safe Policy Improvement Approaches and their Limitations

Safe Policy Improvement (SPI) is an important technique for offline rein...
research
06/28/2022

Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse

Real-world sequential decision making requires data-driven algorithms th...
