Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning

02/02/2023
by   Md Masudur Rahman, et al.
0

This paper investigates the use of prior computation to estimate the value function to improve sample efficiency in on-policy policy gradient methods in reinforcement learning. Our approach is to estimate the value function from prior computations, such as from the Q-network learned in DQN or the value function trained for different but related environments. In particular, we learn a new value function for the target task while combining it with a value estimate from the prior computation. Finally, the resulting value function is used as a baseline in the policy gradient method. This use of a baseline has the theoretical property of reducing variance in gradient computation and thus improving sample efficiency. The experiments show the successful use of prior value estimates in various settings and improved sample efficiency in several tasks.

READ FULL TEXT
research
09/09/2020

Phasic Policy Gradient

We introduce Phasic Policy Gradient (PPG), a reinforcement learning fram...
research
01/02/2011

The Local Optimality of Reinforcement Learning by Value Gradients, and its Relationship to Policy Gradient Learning

In this theoretical paper we are concerned with the problem of learning ...
research
04/09/2020

Policy Gradient using Weak Derivatives for Reinforcement Learning

This paper considers policy search in continuous state-action reinforcem...
research
02/20/2023

Improving Deep Policy Gradients with Value Function Search

Deep Policy Gradient (PG) algorithms employ value networks to drive the ...
research
12/22/2016

Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning

This paper investigates a type of instability that is linked to the gree...
research
05/29/2023

VA-learning as a more efficient alternative to Q-learning

In reinforcement learning, the advantage function is critical for policy...
research
12/27/2022

Variance Reduction for Score Functions Using Optimal Baselines

Many problems involve the use of models which learn probability distribu...

Please sign up or login with your details

Forgot password? Click here to reset