On-Policy Trust Region Policy Optimisation with Replay Buffers

01/18/2019
by Dmitry Kangin, et al.

Building upon the recent success of deep reinforcement learning methods, we investigate whether on-policy reinforcement learning can be improved by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create a method that combines the advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the Q-, value and advantage functions for data from multiple policies. The method uses trust region optimisation while avoiding some of the common problems of algorithms such as TRPO or ACKTR: it replaces the trust region selection heuristics with hyperparameters, and it uses a trainable covariance matrix instead of a fixed one. In many cases, the method improves on the results not only of state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also of their off-policy counterpart DDPG.
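To make the idea of reusing data from several consecutive policies concrete, here is a minimal Python sketch of a buffer that keeps the transitions gathered by the most recent policies and drops the oldest ones. It is an illustration only, not the authors' implementation: the class name `PolicyReplayBuffer`, the `capacity` parameter and the simplified `(state, action, reward)` transition layout are all assumptions made for this example.

```python
from collections import deque
from typing import Deque, List, Tuple

# One transition: (state, action, reward). Simplified, hypothetical layout.
Transition = Tuple[List[float], List[float], float]


class PolicyReplayBuffer:
    """Keeps the transitions gathered by the last `capacity` policies."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops the oldest policy's batch automatically.
        self._batches: Deque[List[Transition]] = deque(maxlen=capacity)

    def add_policy_batch(self, transitions: List[Transition]) -> None:
        """Store all transitions collected under one policy."""
        self._batches.append(list(transitions))

    def all_transitions(self) -> List[Transition]:
        """Return the pooled transitions, oldest policy first."""
        return [t for batch in self._batches for t in batch]

    def __len__(self) -> int:
        # Number of policies whose data is currently retained.
        return len(self._batches)


buffer = PolicyReplayBuffer(capacity=3)
for policy_index in range(5):
    # Hypothetical rollout: two dummy transitions per policy,
    # with the policy index stored as the state for easy inspection.
    batch = [([float(policy_index)], [0.0], 1.0) for _ in range(2)]
    buffer.add_policy_batch(batch)

pooled = buffer.all_transitions()
```

After five policies have been added with a capacity of three, only the data from the last three policies remains in `pooled`. How the pooled data enters the value, Q- and advantage estimates is specific to the paper and is not sketched here.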


research
10/26/2021

EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization

Trust Region Policy Optimization (TRPO) is a popular and empirically suc...
research
10/10/2017

On- and Off-Policy Monotonic Policy Improvement

Monotonic policy improvement and off-policy learning are two main desira...
research
06/24/2022

Reinforcement learning based adaptive metaheuristics

Parameter adaptation, that is the capability to automatically adjust an ...
research
01/22/2021

Differentiable Trust Region Layers for Deep Reinforcement Learning

Trust region methods are a popular tool in reinforcement learning as the...
research
11/04/2021

Towards an Understanding of Default Policies in Multitask Policy Optimization

Much of the recent success of deep reinforcement learning has been drive...
research
06/07/2021

Average-Reward Reinforcement Learning with Trust Region Methods

Most of reinforcement learning algorithms optimize the discounted criter...
research
10/01/2019

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

In this paper, we aim to develop a simple and scalable reinforcement lea...
