Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report)

12/17/2021
by Florent Delgrange et al.

We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications, and they are lost when advanced techniques such as deep RL are applied. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework introduced by Gelada et al. to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.
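The pipeline the abstract describes, mapping a continuous environment into a discrete latent model and then applying formal methods to that model, can be illustrated schematically. The sketch below is hypothetical and not the authors' method: it replaces the learned variational encoder with a fixed grid discretization and estimates latent transition probabilities by counting encoded transitions collected under a policy; the names `encode` and `latent_mdp` are illustrative only.

```python
import numpy as np

def encode(state, bins=4, low=-1.0, high=1.0):
    """Hypothetical stand-in for the learned variational encoder:
    maps a continuous 1-D state to one of `bins` discrete latent states."""
    return int(np.clip((state - low) / (high - low) * bins, 0, bins - 1))

def latent_mdp(transitions, bins=4):
    """Estimate latent transition probabilities P[z' | z] by counting
    encoded transitions (s, s') collected under the RL policy."""
    counts = np.zeros((bins, bins))
    for s, s_next in transitions:
        counts[encode(s, bins), encode(s_next, bins)] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # Normalize rows that were visited; unvisited rows stay all-zero.
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

# Toy rollout: states drift upward, so probability mass flows toward
# higher-indexed latent states.
rng = np.random.default_rng(0)
states = np.clip(np.cumsum(rng.normal(0.05, 0.1, 200)) - 1.0, -1.0, 1.0)
P = latent_mdp(list(zip(states[:-1], states[1:])))
row_sums = P.sum(axis=1)
assert np.allclose(row_sums[row_sums > 0], 1.0)  # visited rows are distributions
```

Once such a finite transition matrix `P` is in hand, reachability and discounted-reward objectives can be checked with standard MDP model-checking techniques; the paper's contribution is the bisimulation bound relating the answers on the latent model back to the original environment.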


Related research:
- Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees (03/22/2023)
- Safe Reinforcement Learning via Shielding for POMDPs (04/02/2022)
- Safe Policy Improvement Approaches on Discrete Markov Decision Processes (01/28/2022)
- Identifiability and Generalizability in Constrained Inverse Reinforcement Learning (06/01/2023)
- Learning the Arrow of Time (07/02/2019)
- AutoCost: Evolving Intrinsic Cost for Zero-violation Reinforcement Learning (01/24/2023)
- Mirror Learning: A Unifying Framework of Policy Optimisation (01/07/2022)
