Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

02/21/2018
by Hamid Reza Maei, et al.

We present the first class of policy-gradient algorithms that work with both state-value and policy function approximation and are guaranteed to converge under off-policy training. Our solution targets reinforcement learning problems in which the action representation adds to the curse of dimensionality; that is, problems with continuous or large action sets, where estimating state-action value functions (Q functions) is infeasible. Using state-value functions lifts this curse and naturally turns our policy-gradient solution into a classical Actor-Critic architecture whose Actor uses the state-value function for its update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived from the exact gradient of an averaged state-value-function objective and are therefore guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters. To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.
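To make the architecture concrete, here is a minimal sketch of a classical actor-critic update that relies only on a state-value critic, with the TD error serving as the advantage estimate for the policy-gradient step. This is an illustrative on-policy variant assembled for this summary, not the paper's Gradient or Emphatic Actor-Critic algorithms; the function names, linear features, and hyper-parameters are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over action preferences.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, phi_s, phi_s_next, a, r,
                      gamma, alpha_w, alpha_theta):
    """One TD(0) actor-critic update with a linear state-value critic
    and a linear-softmax actor.

    theta:      (n_actions, n_features) actor weights
    w:          (n_features,) critic weights (state-value function)
    phi_s:      feature vector of the current state
    phi_s_next: feature vector of the next state
    a:          index of the action taken; r: observed reward
    """
    # Critic: TD error computed from state values only -- no Q function
    # is ever estimated, which is the point of the architecture.
    delta = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    w = w + alpha_w * delta * phi_s

    # Actor: policy-gradient step; the TD error acts as the advantage.
    # grad log pi(a|s) for a linear-softmax policy is
    # phi(s) * (1[a'=a] - pi(a'|s)) for each action row a'.
    pi = softmax(theta @ phi_s)
    grad_log = -np.outer(pi, phi_s)
    grad_log[a] += phi_s
    theta = theta + alpha_theta * delta * grad_log
    return theta, w, delta
```

Because the critic is a state-value function, the update scales with the feature dimension rather than the size of the action set, which is what makes the approach viable for large or continuous actions.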


