On- and Off-Policy Monotonic Policy Improvement

10/10/2017
by Ryo Iwaki, et al.

Monotonic policy improvement and off-policy learning are two desirable properties for reinforcement learning algorithms. In this paper, by lower bounding the performance difference between two policies, we show that monotonic policy improvement is guaranteed even when learning from a mixture of on- and off-policy samples. An optimization procedure that applies the proposed bound can be regarded as an off-policy natural policy gradient method. To support the theoretical result, we provide a trust region policy optimization method with experience replay as a naive application of our bound, and evaluate its performance on two classical benchmark problems.
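For context, the classical on-policy guarantee that results of this kind build on (Schulman et al., 2015) lower bounds the performance difference eta(pi_new) - eta(pi_old) by an expected-advantage surrogate minus C * max_s KL(pi_old(.|s) || pi_new(.|s)), with C = 4*eps*gamma/(1-gamma)^2 and eps the maximum absolute advantage. The sketch below is only an illustration of how a bound of this form might be estimated from a replay buffer holding a mixture of on- and off-policy samples and used as an acceptance test for a candidate policy; the importance weighting by the behavior probability, the reuse of the on-policy coefficient C, and all function names are assumptions for illustration, not the bound derived in this paper.

# Illustrative sketch, not the paper's algorithm: accept a candidate
# policy only if a TRPO-style lower bound on the performance difference,
# estimated from replayed (on- and off-policy) samples, is positive.
import numpy as np

def kl_divergence(pi_old, pi_new, state, actions=(0, 1)):
    # KL(pi_old(.|s) || pi_new(.|s)) over a small discrete action set.
    p = np.array([pi_old(state, a) for a in actions], dtype=float)
    q = np.array([pi_new(state, a) for a in actions], dtype=float)
    return float(np.sum(p * np.log(p / q)))

def estimate_lower_bound(samples, pi_new, pi_old, gamma=0.99, eps=1.0):
    # samples: iterable of (state, action, advantage, behavior_prob),
    # drawn from a replay buffer mixing on- and off-policy transitions.
    # Surrogate: importance-weighted advantage under the candidate policy
    # (assumption: weight by the behavior policy's action probability).
    surrogate = np.mean([(pi_new(s, a) / b) * adv for (s, a, adv, b) in samples])
    max_kl = max(kl_divergence(pi_old, pi_new, s) for (s, _, _, _) in samples)
    C = 4.0 * eps * gamma / (1.0 - gamma) ** 2  # on-policy penalty coefficient
    return surrogate - C * max_kl

def safe_update(samples, pi_candidate, pi_old):
    # Keep the old policy unless the estimated bound certifies improvement;
    # this is the sense in which the update is monotonic.
    bound = estimate_lower_bound(samples, pi_candidate, pi_old)
    return pi_candidate if bound > 0.0 else pi_old

Here pi_old, pi_new, and pi_candidate are callables mapping (state, action) to a probability, and the behavior probability stored with each sample is whatever policy actually generated that transition.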

Related research

02/15/2023 · Trust-Region-Free Policy Optimization for Stochastic Policies
Trust Region Policy Optimization (TRPO) is an iterative method that simu...

02/29/2016 · Easy Monotonic Policy Iteration
A key problem in reinforcement learning for control with general functio...

01/18/2019 · On-Policy Trust Region Policy Optimisation with Replay Buffers
Building upon the recent success of deep reinforcement learning methods,...

06/13/2019 · Jacobian Policy Optimizations
Recently, natural policy gradient algorithms gained widespread recogniti...

09/04/2020 · Policy Gradient Reinforcement Learning for Policy Represented by Fuzzy Rules: Application to Simulations of Speed Control of an Automobile
A method of a fusion of fuzzy inference and policy gradient reinforcemen...

12/03/2021 · An Analytical Update Rule for General Policy Optimization
We present an analytical policy update rule that is independent of param...

07/16/2021 · Refined Policy Improvement Bounds for MDPs
The policy improvement bound on the difference of the discounted returns...
