Policy Optimization with Stochastic Mirror Descent

06/25/2019
by   Long Yang, et al.

Stochastic mirror descent (SMD) is simple to implement and has low memory requirements and low computational complexity. However, the non-convexity of the objective function, together with its non-stationary sampling process, is the main bottleneck in applying SMD to reinforcement learning. To address this problem, we propose mirror policy optimization (MPO), which estimates the policy gradient using a dynamic batch size of gradient information. Compared with REINFORCE or VPG, the proposed MPO improves the convergence rate from O(1/√N) to O(ln N / N). We also propose the VRMPO algorithm, a variance-reduced implementation of MPO. We prove the convergence of VRMPO and show its computational complexity. We evaluate VRMPO on the MuJoCo continuous control tasks; the results show that VRMPO outperforms or matches several state-of-the-art algorithms: DDPG, TRPO, PPO, and TD3.
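
For readers unfamiliar with SMD, the following is a minimal sketch of a generic stochastic mirror descent update, not the MPO or VRMPO algorithm from the paper. It assumes the negative-entropy mirror map on the probability simplex (which gives the exponentiated-gradient update) and a toy linear objective with Gaussian gradient noise. The names `smd_step` and `run_smd`, the cost vector, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random


def smd_step(x, grad, step_size):
    """One mirror descent step on the probability simplex.

    Assumes the negative-entropy mirror map, so the update is
    multiplicative: x_i * exp(-step_size * g_i), followed by
    renormalization.
    """
    # Shifting by the smallest gradient entry does not change the
    # normalized result but keeps every exponent non-positive.
    shift = min(grad)
    weights = [xi * math.exp(-step_size * (gi - shift))
               for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run_smd(costs, num_steps=2000, step_size=0.05, noise=0.5, seed=0):
    """Minimize the toy objective <costs, x> over the simplex.

    Only noisy (unbiased) gradient samples are observed, which is
    what makes the procedure stochastic.
    """
    rng = random.Random(seed)
    x = [1.0 / len(costs)] * len(costs)  # start from the uniform distribution
    for _ in range(num_steps):
        noisy_grad = [c + rng.gauss(0.0, noise) for c in costs]
        x = smd_step(x, noisy_grad, step_size)
    return x


if __name__ == "__main__":
    # Hypothetical costs; the iterate should concentrate on the cheapest entry.
    print(run_smd([1.0, 0.2, 0.6]))
```

In this sketch the iterate stays on the simplex by construction, so no explicit projection is needed; that is the usual motivation for choosing a mirror map matched to the constraint set.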
