Improving On-policy Learning with Statistical Reward Accumulation

09/07/2018
by Yubin Deng, et al.

Deep reinforcement learning has achieved significant breakthroughs in recent years. Most deep-RL methods achieve good results by maximizing the reward signal provided by the environment, typically in the form of discounted cumulative returns. Such reward signals represent the immediate feedback for a particular action performed by an agent. However, tasks with sparse reward signals remain challenging for on-policy methods. In this paper, we introduce an effective characterization of past reward statistics (which can be seen as long-term feedback signals) to supplement this immediate reward feedback. In particular, value functions are learned with multi-critic supervision, enabling complex value functions to be approximated more easily in on-policy learning, even when the reward signals are sparse. We also introduce a novel exploration mechanism called "hot-wiring" that can give a boost to seemingly trapped agents. We demonstrate the effectiveness of our advantage actor multi-critic (A2MC) method across the discrete domains of Atari games as well as the continuous domains of the MuJoCo environments. A video demo is provided at https://youtu.be/zBmpf3Yz8tc.
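To make the two kinds of feedback mentioned above concrete, here is a minimal Python sketch. The first function computes standard discounted cumulative returns. The second is only an illustrative assumption: the abstract does not define which past reward statistics A2MC uses, so `trailing_reward_stats`, its `window` parameter, and the choice of mean and standard deviation are hypothetical stand-ins for a long-term feedback signal, not the paper's formulation.

```python
from collections import deque
from statistics import mean, pstdev


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def trailing_reward_stats(rewards, window=4):
    """Hypothetical long-term signal (an assumption, not the paper's definition):
    mean and population standard deviation of the last `window` rewards."""
    recent = deque(maxlen=window)  # old rewards drop out automatically
    stats = []
    for r in rewards:
        recent.append(r)
        stats.append((mean(recent), pstdev(recent)))
    return stats


if __name__ == "__main__":
    episode = [0.0, 0.0, 1.0, 1.0]  # sparse reward: nothing until late in the episode
    print(discounted_returns(episode, gamma=0.5))
    print(trailing_reward_stats(episode, window=2))
```

In a sparse-reward episode the discounted return is small for early steps, while a statistic over past rewards summarizes how the agent has been doing recently; a multi-critic setup could, under this assumption, train one value head per signal.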
