Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

06/25/2023
by Jun Song, et al.

Trust-region methods based on Kullback-Leibler divergence are pervasively used to stabilize policy optimization in reinforcement learning. In this paper, we exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of restricting the policy to a parametric distribution class, we directly optimize the policy distribution and derive closed-form policy updates for both methods based on Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic performance improvement, and that SPO provably converges to WPO as the entropic regularizer diminishes. Moreover, we prove that with a decaying Lagrangian multiplier on the trust-region constraint, both methods converge to global optimality. Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches, the greater robustness of WPO to insufficient samples, and the faster convergence of SPO, over state-of-the-art policy gradient methods.
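
The closed-form updates derived in the paper are not reproduced here, but the kind of trust region involved can be illustrated with a small sketch. The Python snippet below is an illustrative assumption, not the authors' algorithm: it computes the Sinkhorn distance (entropy-regularized Wasserstein distance) between an old and a candidate action distribution at a single state, and uses it as a penalty on an expected-advantage surrogate. The names sinkhorn_distance and beta, and the toy ground cost, are all hypothetical.

import numpy as np

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport cost between discrete
    distributions p and q under a ground cost matrix (Sinkhorn iterations)."""
    K = np.exp(-cost / eps)            # Gibbs kernel
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iters):           # alternating scaling updates
        u = p / (K @ v)
        v = q / (K.T @ u)
    plan = np.outer(u, v) * K          # entropic optimal transport plan
    return float(np.sum(plan * cost))

# Toy single-state example: penalize a candidate policy by its Sinkhorn
# distance to the old policy, in the spirit of a Sinkhorn trust region.
n_actions = 4
ground_cost = np.abs(np.subtract.outer(np.arange(n_actions),
                                       np.arange(n_actions))).astype(float)
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
pi_new = np.array([0.10, 0.20, 0.30, 0.40])
advantages = np.array([-1.0, 0.0, 0.5, 1.0])
beta = 1.0                             # multiplier on the trust-region penalty

surrogate = pi_new @ advantages        # expected advantage under the candidate
penalty = sinkhorn_distance(pi_old, pi_new, ground_cost)
print(f"penalized objective: {surrogate - beta * penalty:.4f}")

As the entropic regularizer eps shrinks, the Sinkhorn penalty approaches the exact Wasserstein distance, which mirrors the abstract's statement that SPO converges to WPO as the entropic regularizer diminishes.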


