A Primal Approach to Constrained Policy Optimization: Global Optimality and Finite-Time Analysis

by Tengyu Xu, et al.

Safe reinforcement learning (SRL) problems are typically modeled as a constrained Markov decision process (CMDP), in which an agent explores the environment to maximize the expected total reward while avoiding violations of constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which relies on updating dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which alternates policy updates between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an 𝒪(1/√T) convergence rate to the globally optimal policy in the constrained policy set and an 𝒪(1/√T) error bound on constraint satisfaction. This is the first finite-time analysis of SRL algorithms with a global optimality guarantee. Our empirical results demonstrate that CRPO can significantly outperform the existing primal-dual baseline algorithms.
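The alternation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small tabular CMDP with known transitions, exact policy evaluation, and a softmax policy, for which an NPG step reduces (up to a per-state shift and a step-size scaling) to adding Q-values to the logits. The names `crpo`, `tol`, `lr`, the rule of rectifying the first violated constraint, and the toy problem at the end are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over action logits theta of shape (S, A)."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def q_values(P, r, pi, gamma):
    """Exact Q^pi for a signal r of shape (S, A); P has shape (S, A, S)."""
    S, _ = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step signal under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * (P @ v)

def expected_return(P, r, pi, gamma, rho):
    """Discounted return of pi for signal r from initial distribution rho."""
    q = q_values(P, r, pi, gamma)
    return float(rho @ (pi * q).sum(axis=1))

def crpo(P, reward, costs, limits, rho, gamma=0.9, steps=200, lr=0.1, tol=0.01):
    """Alternate between reward improvement and constraint rectification.

    Assumption: a constraint counts as violated when its cost return
    exceeds its limit by more than the tolerance `tol`.
    """
    theta = np.zeros(reward.shape)
    kept = []  # parameters of iterates at which no constraint was violated
    for _ in range(steps):
        pi = softmax_policy(theta)
        violated = [
            i for i, (c, d) in enumerate(zip(costs, limits))
            if expected_return(P, c, pi, gamma, rho) > d + tol
        ]
        if not violated:
            kept.append(theta.copy())
            # Objective improvement: ascend on the reward Q-values.
            theta = theta + lr * q_values(P, reward, pi, gamma)
        else:
            # Constraint satisfaction: descend on one violated cost's Q-values.
            theta = theta - lr * q_values(P, costs[violated[0]], pi, gamma)
    return theta, kept

# Toy problem (hypothetical): one state, two actions. Action 0 earns reward 1
# and incurs cost 1; action 1 earns nothing and costs nothing.
P = np.ones((1, 2, 1))
reward = np.array([[1.0, 0.0]])
costs = [np.array([[1.0, 0.0]])]
limits = [5.0]
rho = np.array([1.0])

theta, kept = crpo(P, reward, costs, limits, rho)
pi_kept = softmax_policy(kept[-1])
kept_reward = expected_return(P, reward, pi_kept, 0.9, rho)
kept_cost = expected_return(P, costs[0], pi_kept, 0.9, rho)
```

In this sketch no dual variable is maintained: the constraint only decides which signal drives the next policy step. Returning the iterates collected in `kept`, rather than only the last parameter vector, is a design choice of the sketch, since the last iterate may sit just outside the tolerance band after a reward step.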




Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

Constrained Markov Decision Process (CMDP) is a natural framework for re...

Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

The problem of constrained Markov decision process (CMDP) is investigate...

Penalized Proximal Policy Optimization for Safe Reinforcement Learning

Safe reinforcement learning aims to learn the optimal policy while satis...

Projection-Based Constrained Policy Optimization

We consider the problem of learning control policies that optimize a rew...

Towards Painless Policy Optimization for Constrained MDPs

We study policy optimization in an infinite horizon, γ-discounted constr...

Lyapunov-based Safe Policy Optimization for Continuous Control

We study continuous action reinforcement learning problems in which it i...

A Single-Loop Gradient Descent and Perturbed Ascent Algorithm for Nonconvex Functional Constrained Optimization

Nonconvex constrained optimization problems can be used to model a numbe...