Safe Policy Learning from Observations

05/20/2018
by Elad Sarafian, et al.

In this paper, we consider the problem of learning a policy by observing numerous non-expert agents. Our goal is to extract a policy that, with high confidence, performs better than the agents' average performance. Such a setting is important for real-world problems where expert data is scarce but non-expert data can be obtained easily, e.g., by crowdsourcing. Our approach is to pose this problem as safe policy improvement in Reinforcement Learning. First, we evaluate an average behavior policy and approximate its value function. Then, we develop a stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), that safely improves the average behavior. The primary advantages of RBI over current safe learning methods are its stability in the presence of value estimation errors and the elimination of a policy search process. We demonstrate these advantages in a Taxi grid-world domain and in four games from the Atari learning environment.
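
The two-stage pipeline described in the abstract (estimate the average behavior and its value, then improve it conservatively) can be illustrated with a small tabular sketch. This is not the authors' implementation of RBI: the abstract does not specify the algorithm, so the improvement step below is a generic bounded reweighting of the behavior policy, and the function names, the `(state, action, reward)` trajectory format, and the bounds `c_min` and `c_max` are all assumptions made for illustration.

```python
from collections import defaultdict


def estimate_behavior_policy(trajectories, n_actions):
    """Empirical average behavior: beta(a|s) = count(s, a) / count(s)."""
    counts = defaultdict(lambda: [0] * n_actions)
    for trajectory in trajectories:
        for state, action, _reward in trajectory:
            counts[state][action] += 1
    return {s: [c / sum(row) for c in row] for s, row in counts.items()}


def estimate_q_values(trajectories, n_actions, gamma=0.99):
    """Every-visit Monte Carlo estimate of the behavior policy's Q(s, a).

    Actions never observed in a state get a placeholder value of 0.0.
    """
    totals = defaultdict(lambda: [0.0] * n_actions)
    visits = defaultdict(lambda: [0] * n_actions)
    for trajectory in trajectories:
        ret = 0.0
        # Walk backwards so the discounted return accumulates in one pass.
        for state, action, reward in reversed(trajectory):
            ret = reward + gamma * ret
            totals[state][action] += ret
            visits[state][action] += 1
    return {
        s: [t / v if v else 0.0 for t, v in zip(totals[s], visits[s])]
        for s in totals
    }


def bounded_improvement(beta_s, q_s, c_min=0.5, c_max=1.5):
    """Illustrative conservative step (assumed, not taken from the abstract).

    Keeps c_min * beta(a|s) <= pi(a|s) <= c_max * beta(a|s) and gives the
    remaining probability mass to the highest-valued actions first.
    Requires c_min <= 1 <= c_max so the result sums to one.
    """
    pi = [c_min * b for b in beta_s]
    budget = 1.0 - sum(pi)
    by_value = sorted(range(len(beta_s)), key=lambda a: q_s[a], reverse=True)
    for a in by_value:
        extra = min(budget, (c_max - c_min) * beta_s[a])
        pi[a] += extra
        budget -= extra
    return pi
```

Because the new policy never leaves a band around the observed behavior, an action the agents never took in a state keeps probability zero, which limits the damage that a poorly estimated value can cause. No search over policies is needed here: the improved distribution is computed directly, state by state.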

