Safe Policy Improvement Approaches on Discrete Markov Decision Processes

01/28/2022
by Philipp Scholl, et al.

Safe Policy Improvement (SPI) aims at providing provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
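As a rough illustration of the taxonomy described above, the following Python sketch contrasts the two classes of SPI algorithms on a tabular MDP: one penalizes the action-value estimate by its uncertainty before acting greedily, the other (SPIBB-style) only deviates from the baseline on state-action pairs that were observed often enough. The names (`q_hat`, `uncertainty`, `counts`, `pi_baseline`, `n_wedge`) and the exact reallocation rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def penalized_greedy_policy(q_hat, uncertainty):
    """Class 1: subtract an uncertainty penalty from the action-value
    estimate, then act greedily on the pessimistic values."""
    q_pessimistic = q_hat - uncertainty            # shape: (n_states, n_actions)
    pi = np.zeros_like(q_hat)
    pi[np.arange(q_hat.shape[0]), q_pessimistic.argmax(axis=1)] = 1.0
    return pi

def restricted_policy_improvement(q_hat, counts, pi_baseline, n_wedge):
    """Class 2 (SPIBB-style sketch): keep the baseline's probabilities on
    rarely observed ("bootstrapped") state-action pairs and only reallocate
    probability mass among pairs seen at least `n_wedge` times."""
    n_states, _ = q_hat.shape
    pi = pi_baseline.copy()
    for s in range(n_states):
        trusted = counts[s] >= n_wedge             # boolean mask over actions
        if trusted.any():
            # Mass the baseline placed on trusted actions is moved to the
            # trusted action with the highest estimated value.
            free_mass = pi_baseline[s, trusted].sum()
            pi[s, trusted] = 0.0
            best = np.flatnonzero(trusted)[q_hat[s, trusted].argmax()]
            pi[s, best] = free_mass
    return pi
```

The sketch mirrors the empirical observation stated in the abstract: the penalty-based class optimizes more aggressively (higher mean performance), whereas the restricted class never strays from the baseline where the data are too sparse, which makes it the more conservative, and hence safer, choice.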
