Hindsight Trust Region Policy Optimization

07/29/2019
by   Hanbo Zhang, et al.
0

As reinforcement learning continues to drive machine intelligence beyond its conventional boundary, unsubstantial practices in sparse reward environment severely limit further applications in a broader range of advanced fields. Motivated by the demand for an effective deep reinforcement learning algorithm that accommodates sparse reward environment, this paper presents Hindsight Trust Region Policy Optimization (Hindsight TRPO), a method that efficiently utilizes interactions in sparse reward conditions and maintains learning stability by restricting variance during the policy update process. Firstly, the hindsight methodology is expanded to TRPO, an advanced and efficient on-policy policy gradient method. Then, under the condition that the distributions are close, the KL-divergence is appropriately approximated by another f-divergence. Such approximation results in the decrease of variance during KL-divergence estimation and alleviates the instability during policy update. Experimental results on both discrete and continuous benchmark tasks demonstrate that Hindsight TRPO converges steadily and significantly faster than previous policy gradient methods. It achieves effective performances and high data-efficiency for training policies in sparse reward environments.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/17/2017

Stochastic Variance Reduction for Policy Gradient Estimation

Recent advances in policy gradient methods and deep learning have demons...
research
05/25/2018

Learning Self-Imitating Diverse Policies

Deep reinforcement learning algorithms, including policy gradient method...
research
07/06/2018

Memory Augmented Policy Optimization for Program Synthesis with Generalization

This paper presents Memory Augmented Policy Optimization (MAPO): a novel...
research
09/19/2019

Revisit Policy Optimization in Matrix Form

In tabular case, when the reward and environment dynamics are known, pol...
research
04/20/2022

Memory-Constrained Policy Optimization

We introduce a new constrained optimization method for policy gradient r...
research
02/07/2019

Compatible Natural Gradient Policy Search

Trust-region methods have yielded state-of-the-art results in policy sea...
research
12/29/2017

f-Divergence constrained policy improvement

To ensure stability of learning, state-of-the-art generalized policy ite...

Please sign up or login with your details

Forgot password? Click here to reset