Mirror Descent Policy Optimization

05/20/2020
by   Manan Tomar, et al.
0

We propose deep Reinforcement Learning (RL) algorithms inspired by mirror descent, a well-known first-order trust region optimization method for solving constrained convex problems. Our approach, which we call as Mirror Descent Policy Optimization (MDPO), is based on the idea of iteratively solving a `trust-region' problem that minimizes a sum of two terms: a linearization of the objective function and a proximity term that restricts two consecutive updates to be close to each other. Following this approach we derive on-policy and off-policy variants of the MDPO algorithm and analyze their performance while emphasizing important implementation details, motivated by the existing theoretical framework. We highlight the connections between on-policy MDPO and two popular trust region RL algorithms: TRPO and PPO, and conduct a comprehensive empirical comparison of these algorithms. We then derive off-policy MDPO and compare its performance to existing approaches. Importantly, we show that the theoretical framework of MDPO can be scaled to deep RL while achieving good performance on popular benchmarks.

READ FULL TEXT
research
09/06/2019

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Trust region policy optimization (TRPO) is a popular and empirically suc...
research
03/09/2020

Stable Policy Optimization via Off-Policy Divergence Regularization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization...
research
10/26/2021

EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization

Trust Region Policy Optimization (TRPO) is a popular and empirically suc...
research
01/22/2021

Differentiable Trust Region Layers for Deep Reinforcement Learning

Trust region methods are a popular tool in reinforcement learning as the...
research
10/18/2019

On Connections between Constrained Optimization and Reinforcement Learning

Dynamic Programming (DP) provides standard algorithms to solve Markov De...
research
12/19/2020

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach

In order for reinforcement learning techniques to be useful in real-worl...
research
11/08/2013

An Experimental Comparison of Trust Region and Level Sets

High-order (non-linear) functionals have become very popular in segmenta...

Please sign up or login with your details

Forgot password? Click here to reset