Explicit Explore, Exploit, or Escape (E^4): near-optimal safety-constrained reinforcement learning in polynomial time

11/14/2021
by   David M. Bossens, et al.

In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real-world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can encode long-term safety constraints; however, the agent may still violate those constraints while exploring its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape (E^4), which extends the Explicit Explore or Exploit (E^3) algorithm to a robust CMDP setting. E^4 explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, and safe return to known states. E^4 robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that E^4 finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We discuss robust constrained offline optimisation algorithms as well as how to incorporate uncertainty in the transition dynamics of unknown states based on empirical inference and prior knowledge.
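The three-way policy split described above can be illustrated with a minimal sketch. The toy below is not the paper's algorithm: there is no CMDP model set, no robust offline optimisation, and no safety costs. It only shows the mode-selection logic on a hypothetical chain of states, where a state counts as "known" after `m_known` visits and the start state is assumed known (a stand-in for the paper's assumption of a safe initial policy). All names (`select_mode`, `run_chain`, etc.) are illustrative, not from the paper.

```python
def select_mode(state, known, n_states):
    """Pick one of E^4's three policy types (illustrative sketch only)."""
    if state not in known:
        return "escape"   # unknown state: steer safely back to the known set
    if len(known) < n_states:
        return "explore"  # known state, but unknown states remain: seek them
    return "exploit"      # model known everywhere: run the near-optimal policy

def step_toward(state, targets):
    """Move one step along the chain toward the nearest target state."""
    target = min(targets, key=lambda s: abs(s - state))
    if target > state:
        return state + 1
    if target < state:
        return state - 1
    return state

def run_chain(n_states=4, m_known=2, horizon=12):
    """Toy deterministic chain MDP: a state becomes 'known' after m_known
    visits.  The start state is treated as known from the outset."""
    known = {0}
    visits = {0: m_known}
    state = 0
    modes = []
    for _ in range(horizon):
        visits[state] = visits.get(state, 0) + 1
        if visits[state] >= m_known:
            known.add(state)
        mode = select_mode(state, known, n_states)
        modes.append(mode)
        if mode == "escape":
            state = step_toward(state, known)
        elif mode == "explore":
            unknown = [s for s in range(n_states) if s not in known]
            state = step_toward(state, unknown)
        # "exploit": stay put in this toy; a real E^4 agent would follow the
        # constraint-satisfying near-optimal policy on the known-state CMDP.
    return known, modes
```

Running `run_chain()` shows the characteristic pattern: the agent alternates between exploring toward the frontier and escaping back once it lands in an unknown state, until every state is known and it switches permanently to exploitation.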

Related research

- 12/01/2021 · Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees
- 04/30/2023 · Joint Learning of Policy with Unknown Temporal Constraints for Safe Reinforcement Learning
- 11/01/2019 · Explicit Explore-Exploit Algorithms in Continuous State Spaces
- 07/29/2021 · Lyapunov-based uncertainty-aware safe reinforcement learning
- 08/26/2020 · Constrained Markov Decision Processes via Backward Value Functions
- 08/01/2020 · Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs
- 03/13/2022 · Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model
