Is Standard Deviation the New Standard? Revisiting the Critic in Deep Policy Gradients

10/09/2020
by Yannis Flet-Berliac, et al.

Policy gradient algorithms have proven successful in diverse decision-making and control tasks. However, these methods suffer from high sample complexity and instability. In this paper, we address these challenges by proposing a different approach to training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, motivating the search for a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the relative value of the states (resp. state-action pairs) rather than their absolute value, as in conventional actor-critic methods. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method on tasks with sparse rewards, where we provide experimental evidence and theoretical insights.
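The abstract leaves the exact critic objective to the paper, but the idea of learning relative rather than absolute values can be made concrete with a loss that is invariant to a constant shift of the predictions, for instance the variance of the return residuals instead of their mean squared error. The PyTorch sketch below is a minimal illustration under that assumption; value_net, states, returns, and the loss itself are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

def mse_critic_loss(value_net, states, returns):
    # Conventional critic objective: regress V(s) onto observed returns,
    # which penalizes any absolute offset between predictions and returns.
    values = value_net(states).squeeze(-1)
    return ((returns - values) ** 2).mean()

def relative_critic_loss(value_net, states, returns):
    # Illustrative "relative value" objective: the variance of the residuals
    # returns - V(s). Adding a constant to every prediction leaves this loss
    # unchanged, so the critic only has to get the differences between
    # states right, not their absolute values.
    values = value_net(states).squeeze(-1)
    residuals = returns - values
    return residuals.var()

# Toy usage with hypothetical dimensions and random data:
obs_dim = 4
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
states = torch.randn(32, obs_dim)  # batch of states
returns = torch.randn(32)          # sampled returns for those states
relative_critic_loss(value_net, states, returns).backward()
```

Since the mean squared error decomposes as the residual variance plus the squared residual mean, dropping the mean term is one natural reading of a critic that learns values only up to an additive constant.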

Related research

02/21/2018
Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation
We present the first class of policy-gradient algorithms that work with ...

06/10/2018
Distributional Advantage Actor-Critic
In traditional reinforcement learning, an agent maximizes the reward col...

01/23/2017
Learning to Decode for Future Success
We introduce a simple, general strategy to manipulate the behavior of a ...

04/09/2018
Policy Gradient With Value Function Approximation For Collective Multiagent Planning
Decentralized (PO)MDPs provide an expressive framework for sequential de...

09/07/2019
Multi Pseudo Q-learning Based Deterministic Policy Gradient for Tracking Control of Autonomous Underwater Vehicles
This paper investigates the trajectory tracking problem for a class of under...

06/19/2020
Band-limited Soft Actor Critic Model
Soft Actor Critic (SAC) algorithms show remarkable performance in comple...

06/10/2019
Exploiting the sign of the advantage function to learn deterministic policies in continuous domains
In the context of learning deterministic policies in continuous domains,...
