Safe Policy Improvement for POMDPs via Finite-State Controllers

01/12/2023
by Thiago D. Simão, et al.

We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods require access neither to a model nor to the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state, fully observable MDP, the history MDP. We estimate this MDP by combining the historical data with the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy space according to the available data, such that the newly computed policy differs from the behavior policy only where sufficient data is available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.
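To make the pipeline concrete, here is a minimal sketch of the two steps the abstract describes, assuming an SPIBB-style algorithm (Laroche et al., 2019) as the off-the-shelf SPI component. The dataset layout, the FSC fields (memory_update, initial_node), and the threshold n_wedge are illustrative assumptions, not the paper's actual interface.

```python
# Sketch only: relabel POMDP data with FSC memory to form a history MDP,
# then run a Pi_b-SPIBB-style improvement. Interfaces are assumed.
import numpy as np

def estimate_history_mdp(episodes, fsc, n_obs, n_act):
    """Relabel POMDP trajectories with the FSC's memory so each history
    state is a (memory node, observation) pair, then estimate the MDP."""
    n_states = fsc.n_nodes * n_obs
    counts = np.zeros((n_states, n_act, n_states))  # transition counts
    reward_sums = np.zeros((n_states, n_act))       # accumulated rewards
    for episode in episodes:
        node = fsc.initial_node
        for obs, act, rew, next_obs in episode:
            s = node * n_obs + obs
            next_node = fsc.memory_update[node][obs]  # deterministic update
            s_next = next_node * n_obs + next_obs
            counts[s, act, s_next] += 1
            reward_sums[s, act] += rew
            node = next_node
    sa = counts.sum(axis=2)
    with np.errstate(invalid="ignore"):
        P = np.nan_to_num(counts / sa[:, :, None])  # max-likelihood model
        R = np.nan_to_num(reward_sums / sa)
    return P, R, sa

def spibb_improve(P, R, sa_counts, pi_b, gamma=0.95, n_wedge=10, n_iter=100):
    """Improve greedily only on state-action pairs observed at least
    n_wedge times; elsewhere copy the behavior policy's probabilities,
    so the new policy deviates from pi_b only where data supports it."""
    n_s, n_a = R.shape
    pi, q = pi_b.copy(), np.zeros_like(R)
    for _ in range(n_iter):
        v = (pi * q).sum(axis=1)                    # one evaluation sweep
        q = R + gamma * np.einsum("san,n->sa", P, v)
        pi = pi_b.copy()                            # bootstrap on pi_b
        for s in range(n_s):
            safe = sa_counts[s] >= n_wedge          # well-estimated actions
            if safe.any():
                best = np.flatnonzero(safe)[np.argmax(q[s, safe])]
                pi[s, safe] = 0.0
                pi[s, best] = pi_b[s, safe].sum()   # shift safe mass to best
    return pi  # rows indexed by (memory node, observation) pairs
```

Under these assumptions, the returned policy plus the FSC's unchanged memory-update function defines the new FSC for the POMDP: only the action distribution at each (node, observation) pair is replaced by the improved one.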


