Generalized Off-Policy Actor-Critic

by Shangtong Zhang, et al.
University of Oxford

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared with the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, the counterfactual objective is a better predictor of that performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective, and use an emphatic approach to obtain an unbiased sample of this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
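As a rough illustration of the unification the abstract describes, one way to interpolate between the excursion weighting (the behavior policy's state distribution d_mu) and the target policy's stationary distribution d_pi is to mix restarts from d_mu with transitions under the target policy, controlled by a parameter gamma_hat. The sketch below is an assumption-laden toy, not the paper's algorithm: the 3-state transition matrix P_pi and distribution d_mu are made up, and the resolvent-style formula is only one natural way to realize such an interpolation. At gamma_hat = 0 it recovers d_mu; as gamma_hat approaches 1 it approaches d_pi.

```python
import numpy as np

# Hypothetical 3-state MDP. P_pi[s, s'] is the probability of moving from
# state s to s' under the target policy; d_mu is the (assumed) stationary
# state distribution of the behavior policy.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
d_mu = np.array([0.5, 0.3, 0.2])

def d_hat(gamma_hat):
    """Interpolated state weighting:
    d_hat = (1 - gamma_hat) * (I - gamma_hat * P_pi^T)^{-1} d_mu,
    i.e. a geometrically discounted mixture of d_mu pushed forward
    through the target policy's dynamics. Sums to 1 for gamma_hat < 1."""
    n = len(d_mu)
    return (1 - gamma_hat) * np.linalg.solve(
        np.eye(n) - gamma_hat * P_pi.T, d_mu)

# gamma_hat = 0: pure excursion weighting, i.e. exactly d_mu.
print(d_hat(0.0))
# gamma_hat near 1: close to the stationary distribution of P_pi.
print(d_hat(0.999))
```

Weighting the policy gradient by such an interpolated distribution, rather than by d_mu alone, is what lets an objective of this kind better reflect how the target policy would perform when actually deployed.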



