Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

09/06/2019
by Lior Shani, et al.

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem that restricts consecutive policies to be "close" to one another is solved iteratively. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish an Õ(1/√N) convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of Õ(1/N), much like results in convex optimization. This is the first result in RL showing better rates when the instantaneous cost or reward is regularized.
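
To make the idea of a surrogate problem that keeps consecutive policies "close" concrete, here is a minimal sketch for a single state of a tabular problem. It assumes costs are minimized and that closeness is measured by the KL divergence, in which case the proximal step has a well-known closed form. The function names, the cost values, and the step sizes are hypothetical and chosen only for illustration; this is not the paper's algorithm or its step-size schedule.

```python
import math


def kl_proximal_step(policy, q_values, step_size):
    """One KL-proximal (mirror-descent style) update for a single state.

    Solves  min_p  <q, p> + (1 / step_size) * KL(p || policy)  over the
    probability simplex. The minimizer has the closed form
        p(a) proportional to policy(a) * exp(-step_size * q(a)).
    Costs are minimized here, so lower-cost actions gain probability.
    """
    # Shift by the minimum cost for numerical stability; the shift
    # cancels when the weights are normalized.
    q_min = min(q_values)
    weights = [
        p * math.exp(-step_size * (q - q_min))
        for p, q in zip(policy, q_values)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same finite action set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical data: a uniform policy over three actions and made-up costs.
policy = [1 / 3, 1 / 3, 1 / 3]
q_values = [1.0, 0.5, 2.0]

small = kl_proximal_step(policy, q_values, step_size=0.1)
large = kl_proximal_step(policy, q_values, step_size=1.0)

print("small step:", small, "KL from old policy:", kl_divergence(small, policy))
print("large step:", large, "KL from old policy:", kl_divergence(large, policy))
```

In this sketch the step size plays the role of the scaling on the proximity term: a smaller step keeps the new policy nearer to the old one, while a larger step moves more probability toward the lowest-cost action.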


research
10/26/2021

EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization

Trust Region Policy Optimization (TRPO) is a popular and empirically suc...
research
05/20/2020

Mirror Descent Policy Optimization

We propose deep Reinforcement Learning (RL) algorithms inspired by mirro...
research
02/15/2023

Trust-Region-Free Policy Optimization for Stochastic Policies

Trust Region Policy Optimization (TRPO) is an iterative method that simu...
research
02/02/2023

Average-Constrained Policy Optimization

Reinforcement Learning (RL) with constraints is becoming an increasingly...
research
01/29/2019

Trust Region-Guided Proximal Policy Optimization

Model-free reinforcement learning relies heavily on a safe yet explorato...
research
06/24/2019

Deep Conservative Policy Iteration

Conservative Policy Iteration (CPI) is a founding algorithm of Approxima...
research
10/21/2019

Policy Optimization for H_2 Linear Control with H_∞ Robustness Guarantee: Implicit Regularization and Global Convergence

Policy optimization (PO) is a key ingredient for reinforcement learning ...
