Variance Reduction based Experience Replay for Policy Optimization

08/25/2022
by Hua Zheng, et al.

For reinforcement learning on complex stochastic systems, where many factors dynamically impact the output trajectories, it is desirable to effectively leverage the information from historical samples collected in previous iterations to accelerate policy optimization. Classical experience replay allows agents to remember by reusing historical observations. However, the uniform reuse strategy that treats all observations equally overlooks the relative importance of different samples. To overcome this limitation, we propose a general variance-reduction-based experience replay (VRER) framework that selectively reuses the most relevant samples to improve policy gradient estimation. This selective mechanism adaptively puts more weight on past samples that are more likely to have been generated by the current target distribution. Our theoretical and empirical studies show that the proposed VRER can accelerate the learning of the optimal policy and enhance the performance of state-of-the-art policy optimization approaches.
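
To make the selection mechanism concrete, the sketch below shows one way a likelihood-ratio-based replay rule can be wired into a policy gradient estimator. It is a minimal, hypothetical Python example under stated assumptions, not the paper's VRER algorithm: the tabular softmax policy, the buffer layout, and the acceptance threshold c are illustrative choices introduced here.

```python
# Minimal sketch of likelihood-ratio-based selective experience replay for
# policy gradient estimation. NOT the authors' VRER algorithm; the softmax
# policy, buffer layout, and threshold `c` are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 3

def action_probs(theta, s):
    """Softmax policy pi_theta(. | s) for a small tabular problem."""
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta (tabular softmax)."""
    g = np.zeros_like(theta)
    p = action_probs(theta, s)
    g[s] = -p
    g[s, a] += 1.0
    return g

def selective_replay_gradient(theta, buffer, c=2.0):
    """
    Policy gradient estimate that reuses past samples selectively.

    Each buffer entry is (s, a, ret, behavior_prob), where behavior_prob is
    pi_old(a | s) under the policy that generated the sample.  A sample is
    reused only if its likelihood ratio w = pi_theta(a|s) / pi_old(a|s)
    stays within [1/c, c]; reused samples are importance-weighted by w so
    the estimate targets the current policy.
    """
    grad, n_used = np.zeros_like(theta), 0
    for s, a, ret, behavior_prob in buffer:
        w = action_probs(theta, s)[a] / behavior_prob
        if 1.0 / c <= w <= c:          # selection rule: keep "relevant" samples
            grad += w * ret * grad_log_pi(theta, s, a)
            n_used += 1
    return grad / max(n_used, 1), n_used

# Toy usage: a buffer of transitions collected under an older policy.
theta_old = rng.normal(size=(N_STATES, N_ACTIONS))
theta_new = theta_old + 0.1 * rng.normal(size=(N_STATES, N_ACTIONS))

buffer = []
for _ in range(256):
    s = rng.integers(N_STATES)
    p = action_probs(theta_old, s)
    a = rng.choice(N_ACTIONS, p=p)
    ret = rng.normal()                 # stand-in for a return estimate
    buffer.append((s, a, ret, p[a]))

g, n_used = selective_replay_gradient(theta_new, buffer)
print(f"reused {n_used}/{len(buffer)} samples; gradient norm = {np.linalg.norm(g):.3f}")
```

Restricting reuse to samples whose likelihood ratio stays near one is what keeps the variance of the importance-weighted gradient under control in this sketch: transitions generated by policies far from the current one are skipped rather than reweighted with extreme ratios.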

Related research

05/06/2022 · Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization
We extend the idea underlying the success of green simulation assisted p...

10/17/2021 · Green Simulation Assisted Policy Gradient to Accelerate Stochastic Process Control
This study is motivated by the critical challenges in the biopharmaceuti...

12/07/2021 · PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay
On-policy deep reinforcement learning algorithms have low data utilizati...

02/17/2020 · Adaptive Experience Selection for Policy Gradient
Policy gradient reinforcement learning (RL) algorithms have achieved imp...

05/15/2021 · Regret Minimization Experience Replay
Experience replay is widely used in various deep off-policy reinforcemen...

09/15/2022 · On the Reuse Bias in Off-Policy Reinforcement Learning
Importance sampling (IS) is a popular technique in off-policy evaluation...

12/08/2021 · Convergence Results For Q-Learning With Experience Replay
A commonly used heuristic in RL is experience replay (e.g. <cit.>), in w...
