Addressing Value Estimation Errors in Reinforcement Learning with a State-Action Return Distribution Function

01/09/2020
by Jingliang Duan, et al.

In current reinforcement learning (RL) methods, function approximation errors are known to cause overestimation or underestimation of the state-action values Q, which in turn leads to suboptimal policies. We show that learning a state-action return distribution function can improve the estimation accuracy of the Q-value. We embed the distributional return function within the maximum entropy RL framework to develop what we call the Distributional Soft Actor-Critic algorithm (DSAC), an off-policy method for continuous control settings. Unlike traditional distributional Q algorithms, which typically learn only a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current return distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
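
The truncation idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: the return distribution is modelled as a Gaussian with a learned mean and standard deviation, the loss is the Gaussian negative log-likelihood of a sampled target return, and the target is clamped to a band around the current mean. The function names and the `bound` parameter are hypothetical and chosen only for illustration; the full text should be consulted for the authors' actual formulation.

```python
import math


def clipped_target(target, mean, bound):
    """Clamp the target return to [mean - bound, mean + bound].

    `bound` is a hypothetical clipping half-width, not a value from the paper.
    """
    return max(mean - bound, min(mean + bound, target))


def gaussian_nll(target, mean, std):
    """Negative log-likelihood of `target` under N(mean, std^2)."""
    var = std * std
    return 0.5 * math.log(2.0 * math.pi * var) + (target - mean) ** 2 / (2.0 * var)


def nll_grad_mean(target, mean, std):
    """Derivative of the NLL w.r.t. the mean, treating the target as constant."""
    return -(target - mean) / (std * std)


# Assumed toy numbers: a narrow predicted distribution and an outlying target.
mean, std, bound = 1.0, 0.1, 0.5
raw_target = 50.0

safe_target = clipped_target(raw_target, mean, bound)
grad_raw = nll_grad_mean(raw_target, mean, std)
grad_clipped = nll_grad_mean(safe_target, mean, std)

print("clipped target:", safe_target)
print("gradient magnitude without clipping:", abs(grad_raw))
print("gradient magnitude with clipping:", abs(grad_clipped))
```

In this toy setting the gradient with respect to the mean scales with the gap between target and mean divided by the variance, so a small standard deviation combined with an outlying target produces a very large update. Clamping the target bounds that gap, which is one plausible reading of how truncation prevents gradient explosion.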

research
04/30/2020

Distributional Soft Actor Critic for Risk Sensitive Learning

Most of reinforcement learning (RL) algorithms aim at maximizing the exp...
research
02/06/2022

Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning

Distributional reinforcement learning (RL) aims to learn a value-network...
research
07/13/2020

Implicit Distributional Reinforcement Learning

To improve the sample efficiency of policy-gradient based reinforcement ...
research
08/03/2023

Bag of Policies for Distributional Deep Exploration

Efficient exploration in complex environments remains a major challenge ...
research
12/29/2022

Invariance to Quantile Selection in Distributional Continuous Control

In recent years distributional reinforcement learning has produced many ...
research
06/14/2018

Qualitative Measurements of Policy Discrepancy for Return-based Deep Q-Network

In this paper, we focus on policy discrepancy in return-based deep Q-net...
research
07/24/2020

Distributional Reinforcement Learning with Maximum Mean Discrepancy

Distributional reinforcement learning (RL) has achieved state-of-the-art...
