Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization

05/06/2022
by Hua Zheng, et al.

We extend the idea underlying the success of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for infinite-horizon Markov Decision Processes (MDPs). The existing GS-PG method was designed to learn from complete episodes or process trajectories, which limits its applicability to low-data environments and online process control. In this paper, mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state-decision transitions generated under different behavioral policies. We propose a variance reduction experience replay (VRER) approach that can intelligently select and reuse the most relevant transition observations, improve the accuracy of policy gradient estimation, and accelerate the learning of the optimal policy. We then create a process control strategy by incorporating VRER into state-of-the-art step-based policy optimization approaches, such as the actor-critic method and proximal policy optimization (PPO). The empirical study demonstrates that the proposed policy gradient methodology can significantly outperform existing policy optimization approaches.
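The two ingredients named in the abstract, MLR reweighting of historical transitions and variance-based selection of which transitions to reuse, can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the authors' implementation: the small discrete-action policy network, the per-sample score used in the selection test, and the threshold c are hypothetical choices made for demonstration only.

```python
import torch
from torch.distributions import Categorical

# Hypothetical discrete-action policy pi_theta (4-dim state, 2 actions).
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def mixture_log_density(logp_behaviors):
    """log[(1/k) * sum_i pi_theta_i(a|s)] over the k stored behavior policies.

    logp_behaviors: (k, n) tensor of log-probs of the stored actions under
    each behavior policy that generated transitions in the replay buffer.
    """
    k = logp_behaviors.shape[0]
    return torch.logsumexp(logp_behaviors, dim=0) - torch.log(
        torch.tensor(float(k)))

def mlr_surrogate(states, actions, advantages, logp_behaviors):
    """Loss whose gradient is the MLR-weighted policy gradient estimate:
    (1/n) * sum_j w_j * grad log pi_theta(a_j|s_j) * A_j, where the mixture
    likelihood ratio w_j = pi_theta(a_j|s_j) / mixture density is held
    fixed (detached) so only grad log pi_theta flows through."""
    dist = Categorical(logits=policy_net(states))
    logp = dist.log_prob(actions)
    w = torch.exp(logp.detach() - mixture_log_density(logp_behaviors))
    return -(w * logp * advantages).mean()

def keep_batch(w, advantages, on_policy_var, c=1.0):
    """Assumed VRER-style selection rule (a proxy, not the paper's exact
    criterion): reuse a historical batch only if the variance of its
    likelihood-ratio-weighted per-sample scores stays within a factor c
    of the variance measured on the current on-policy batch."""
    return bool(torch.var(w * advantages) <= c * on_policy_var)
```

In a training loop, one would collect a fresh on-policy batch, apply keep_batch to filter the replay buffer, and average mlr_surrogate over the retained batches before taking an optimizer step; detaching the likelihood ratios keeps them as constant importance weights in the gradient estimate.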

Related research:

Green Simulation Assisted Policy Gradient to Accelerate Stochastic Process Control (10/17/2021)
Variance Reduction based Experience Replay for Policy Optimization (08/25/2022)
Green Simulation Assisted Reinforcement Learning with Model Risk for Biomanufacturing Learning and Control (06/17/2020)
Continuous MDP Homomorphisms and Homomorphic Policy Gradient (09/15/2022)
Partial advantage estimator for proximal policy optimization (01/26/2023)
TOPS: Transition-based VOlatility-controlled Policy Search and its Global Convergence (01/24/2022)
Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search (01/21/2022)
