Parameter-based Value Functions
Learning value functions off-policy is at the core of modern Reinforcement Learning (RL). Traditional off-policy actor-critic algorithms, however, only approximate the true policy gradient, since the gradient ∇_θ Q^π_θ(s,a) of the action-value function with respect to the policy parameters is often ignored. We introduce a class of value functions called Parameter-based Value Functions (PVFs) whose inputs include the policy parameters. PVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. We show how PVFs yield exact policy gradient theorems. We derive off-policy actor-critic algorithms based on PVFs trained using Monte Carlo or Temporal Difference methods. Preliminary experimental results indicate that PVFs can effectively evaluate deterministic linear and nonlinear policies, outperforming state-of-the-art algorithms in the continuous control environment Swimmer-v3. Finally, we show how recurrent neural networks can be trained through PVFs to solve supervised and RL problems involving partial observability and long time lags between relevant events. This provides an alternative to backpropagation through time.
READ FULL TEXT