Combining policy gradient and Q-learning

by Brendan O'Donoghue, et al.

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
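The idea of estimating Q-values from the policy's action preferences can be sketched in a few lines. The abstract does not state the formula, so this is an illustration under assumptions: it supposes that at a fixed point of entropy-regularized policy gradient the policy is a softmax of Q/alpha, which gives the estimate Q(s, a) = alpha * (log pi(s, a) + H(s)) + V(s), where H(s) is the policy entropy at s and V(s) is a separate value estimate. The function names, the regularization weight `alpha`, and the plain-list representation are all hypothetical choices made for this sketch, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw action preferences."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def q_from_policy(logits, value, alpha):
    """Assumed estimate: Q(s, a) = alpha * (log pi(a|s) + H(s)) + V(s)."""
    log_pi = log_softmax(logits)
    entropy = -sum(math.exp(lp) * lp for lp in log_pi)
    return [alpha * (lp + entropy) + value for lp in log_pi]


def td_error(q_s, action, reward, q_next, gamma, terminal=False):
    """One-step Q-learning error: r + gamma * max_b Q(s', b) - Q(s, a)."""
    bootstrap = 0.0 if terminal else gamma * max(q_next)
    return reward + bootstrap - q_s[action]


# Usage on a made-up transition (all numbers are illustrative only).
q_s = q_from_policy([1.0, 2.0, 0.5], value=3.0, alpha=0.1)
q_next = q_from_policy([0.2, 0.1, 0.4], value=2.5, alpha=0.1)
delta = td_error(q_s, action=1, reward=1.0, q_next=q_next, gamma=0.99)
```

Under this assumed form, the policy-weighted average of the estimated Q-values equals V(s), so the bracketed term behaves like an advantage estimate. In a full agent the Q-learning error would presumably be computed on transitions sampled from the replay buffer and used to update the policy and value parameters; the sketch only computes the quantities involved.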



Mean Actor Critic

We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action...

Investigation on the generalization of the Sampled Policy Gradient algorithm

The Sampled Policy Gradient (SPG) algorithm is a new offline actor-criti...

GRAC: Self-Guided and Self-Regularized Actor-Critic

Deep reinforcement learning (DRL) algorithms have successfully been demo...

Sampled Policy Gradient for Learning to Play the Game

In this paper, a new offline actor-critic learning algorithm is introduc...

A review of motion planning algorithms for intelligent robotics

We investigate and analyze principles of typical motion planning algorit...

Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO

In this paper, a novel racing environment for OpenAI Gym is introduced. ...

Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

In Reinforcement Learning, the optimal action at a given state is depend...
