More for Less: Safe Policy Improvement With Stronger Performance Guarantees

05/13/2023
by   Patrick Wienhöft, et al.

In an offline reinforcement learning setting, the safe policy improvement (SPI) problem aims to improve the performance of the behavior policy from which the sample data was generated. State-of-the-art approaches to SPI require a high number of samples to provide practical probabilistic guarantees on the improved policy's performance. We present a novel approach to the SPI problem that requires less data for such guarantees. Specifically, to prove the correctness of these guarantees, we devise implicit transformations on the data set and the underlying environment model that serve as theoretical foundations for deriving tighter improvement bounds for SPI. Our empirical evaluation on standard benchmarks, using the well-established SPI with baseline bootstrapping (SPIBB) algorithm, shows that our method indeed significantly reduces the sample complexity of SPIBB.
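For context on the evaluated algorithm: SPIBB constrains the learned policy to follow the baseline on state-action pairs observed fewer than N_∧ times in the data set, and only deviates where counts are sufficient. Tighter improvement bounds mean the same probabilistic guarantee holds for a smaller N_∧, i.e., with less data. Below is a minimal tabular sketch of the Π_b-SPIBB improvement step, assuming maximum-likelihood Q-estimates; the function name and array layout are illustrative, not taken from the paper.

```python
import numpy as np

def spibb_improvement_step(q_hat, pi_b, counts, n_wedge):
    """One tabular Pi_b-SPIBB policy-improvement step (illustrative sketch).

    State-action pairs with fewer than `n_wedge` samples are "bootstrapped":
    the improved policy copies the baseline probabilities there. The remaining
    probability mass is moved to the best sufficiently sampled action under
    the estimated Q-values, so the policy only deviates where data is ample.
    """
    n_states, _ = q_hat.shape
    pi = np.zeros_like(pi_b)
    for s in range(n_states):
        bootstrapped = counts[s] < n_wedge            # uncertain (s, a) pairs
        pi[s, bootstrapped] = pi_b[s, bootstrapped]   # keep baseline mass
        free_mass = pi_b[s, ~bootstrapped].sum()      # mass we may reassign
        if free_mass > 0.0:
            q_masked = np.where(bootstrapped, -np.inf, q_hat[s])
            pi[s, int(np.argmax(q_masked))] += free_mass
    return pi

# Tiny demo on random data: the result is a valid policy that matches the
# baseline exactly in any state where every action is under-sampled.
rng = np.random.default_rng(0)
q_hat = rng.normal(size=(3, 4))
pi_b = np.full((3, 4), 0.25)
counts = rng.integers(0, 20, size=(3, 4))
pi = spibb_improvement_step(q_hat, pi_b, counts, n_wedge=10)
assert np.allclose(pi.sum(axis=1), 1.0)
```

The data requirement enters through N_∧: the smaller the threshold that still certifies improvement with high probability, the fewer samples are needed, which is how tighter improvement bounds translate into lower sample complexity.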


Related research

01/12/2023 · Safe Policy Improvement for POMDPs via Finite-State Controllers
We study safe policy improvement (SPI) for partially observable Markov d...

09/11/2019 · Safe Policy Improvement with an Estimated Baseline Policy
Previous work has shown the unreliability of existing algorithms in the ...

08/01/2022 · Safe Policy Improvement Approaches and their Limitations
Safe Policy Improvement (SPI) is an important technique for offline rein...

07/11/2019 · Safe Policy Improvement with Soft Baseline Bootstrapping
Batch Reinforcement Learning (Batch RL) consists in training a policy us...

02/20/2020 · Safe Counterfactual Reinforcement Learning
We develop a method for predicting the performance of reinforcement lear...

07/13/2016 · Safe Policy Improvement by Minimizing Robust Baseline Regret
An important problem in sequential decision-making under uncertainty is ...

05/20/2018 · Safe Policy Learning from Observations
In this paper, we consider the problem of learning a policy by observing...
