Safe Policy Improvement by Minimizing Robust Baseline Regret

07/13/2016
by   Marek Petrik, et al.

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to computing a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the robust regret w.r.t. the baseline policy. In contrast to existing approaches, minimizing the regret allows the method to improve on the baseline policy in states where the model dynamics are accurate, and to seamlessly fall back to the baseline policy elsewhere. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.
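The regret-minimization idea in the abstract can be illustrated with a deliberately tiny sketch (not the paper's algorithm, whose general form is NP-hard): suppose each decision state offers a baseline action with a known expected reward and an alternative action whose expected reward is only known to lie in an interval, standing in for the "inaccurate model with known accuracy guarantees". All state names and reward numbers below are hypothetical.

```python
from itertools import product

# Two independent decision states. The baseline action's expected reward is
# known exactly; the alternative action's reward lies in an uncertainty interval.
baseline_reward = {"s1": 0.5, "s2": 0.5}
alt_interval = {"s1": (0.7, 0.8),   # tight model: alternative provably better
                "s2": (0.1, 0.9)}   # loose model: alternative might be far worse

def robust_regret(policy):
    """Worst-case (over models in the uncertainty set) total regret vs. the baseline."""
    regret = 0.0
    for s, action in policy.items():
        if action == "alt":
            lo, _ = alt_interval[s]
            # an adversarial model assigns the alternative its lowest reward
            regret += baseline_reward[s] - lo
        # choosing the baseline action contributes zero regret by definition
    return regret

# Enumerate all deterministic policies and pick the robust-regret minimizer.
policies = [dict(zip(["s1", "s2"], choice))
            for choice in product(["base", "alt"], repeat=2)]
best = min(policies, key=robust_regret)
print(best)                   # improves in s1, falls back to the baseline in s2
print(robust_regret(best))    # negative robust regret: guaranteed improvement
```

Because the uncertainty here is independent across states, the minimizer decomposes state by state, exactly the "improve where accurate, fall back otherwise" behavior described above; with correlated dynamics uncertainty this decomposition fails, which is the source of the NP-hardness the paper establishes.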


