A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

06/03/2021
by Honghao Wei, et al.

This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it has three key components: a Q-function (also called an action-value function) for the cumulative reward, a Q-function for the cumulative utility of the constraint, and a virtual-Queue that (over-)estimates the cumulative constraint violation. At each step, Triple-Q chooses an action based on a pseudo-Q-value that combines the three Q values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts of the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves Õ((1/δ) H^4 S^{1/2} A^{1/2} K^{4/5}) regret, where K is the total number of episodes, H is the number of steps per episode, S is the number of states, A is the number of actions, and δ is Slater's constant. Furthermore, Triple-Q guarantees zero constraint violation when K is sufficiently large. Finally, the computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so the algorithm is computationally efficient.
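To make the abstract's description concrete, the following is a minimal sketch of how the three components could fit together in a tabular episodic CMDP. It is an illustration under stated assumptions, not the paper's pseudocode: the environment interface (`reset()` returning a state, `step(a)` returning next state, reward, utility, and a done flag), the parameter names `rho`, `eta`, `chi`, `frame_len`, and the exploration-bonus constant are placeholders chosen to mirror the ingredients named in the abstract (visit-count-dependent learning rates, periodic resets, and a virtual queue that over-estimates constraint violation).

```python
import numpy as np

# Minimal illustrative sketch of a Triple-Q-style learner for a tabular
# episodic CMDP. The environment interface and the parameter names
# (rho, eta, chi, frame_len, bonus) are assumptions for illustration.

def triple_q(env, S, A, H, K, rho, eta, chi, frame_len, bonus=1.0):
    Qr = np.full((H, S, A), float(H))   # optimistic Q-values for reward
    Qc = np.full((H, S, A), float(H))   # optimistic Q-values for utility
    N = np.zeros((H, S, A), dtype=int)  # visit counts, reset every frame
    Z = 0.0                             # virtual queue: constraint "debt"
    frame_utilities = []

    for k in range(K):
        s = env.reset()
        ep_utility = 0.0
        for h in range(H):
            # Pseudo-Q-value: reward Q plus the queue-weighted utility Q.
            a = int(np.argmax(Qr[h, s] + (Z / eta) * Qc[h, s]))
            s_next, r, c, done = env.step(a)  # c is the utility signal
            ep_utility += c

            # Visit-count-dependent learning rate and exploration bonus.
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)
            b = bonus * np.sqrt(H ** 3 / t)

            Vr = Qr[h + 1, s_next].max() if h + 1 < H else 0.0
            Vc = Qc[h + 1, s_next].max() if h + 1 < H else 0.0
            Qr[h, s, a] = (1 - alpha) * Qr[h, s, a] + alpha * (r + Vr + b)
            Qc[h, s, a] = (1 - alpha) * Qc[h, s, a] + alpha * (c + Vc + b)

            s = s_next
            if done:
                break
        frame_utilities.append(ep_utility)

        # End of frame: push the constraint shortfall into the virtual
        # queue (rho is the required utility level, chi a tightening
        # slack), then reset counts and Q-values, as the abstract
        # describes for the periodic-reset learning rates.
        if (k + 1) % frame_len == 0:
            Z = max(Z + rho + chi - np.mean(frame_utilities), 0.0)
            N[:] = 0
            Qr[:] = float(H)
            Qc[:] = float(H)
            frame_utilities = []

    return Qr, Qc, Z
```

Note that each step touches only one (h, s, a) entry of each Q-table and the virtual queue is updated once per frame, which is consistent with the abstract's claim that the per-step cost is comparable to SARSA on an unconstrained MDP.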


