Biased Estimates of Advantages over Path Ensembles

09/15/2019
by   Lanxin Lei, et al.

Advantage estimation is crucial for a number of reinforcement learning algorithms, as it directly influences which paths the agent chooses in the future. In this work, we propose a family of estimates based on order statistics over the path ensemble, which allows one to flexibly steer the learning process toward or away from risk. On top of this formulation, we systematically study the impact of different methods for estimating advantages. Our findings reveal that biased estimates, when chosen appropriately, can yield significant benefits. In particular, in environments with sparse rewards, optimistic estimates lead to more efficient exploration of the policy space, while in environments where individual actions can have critical impacts, conservative estimates are preferable. On various benchmarks, including MuJoCo continuous control, Terrain locomotion, Atari games, and sparse-reward environments, the proposed biased estimation schemes consistently improve over mainstream methods, both accelerating learning and obtaining substantial performance gains.
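
The idea of taking an order statistic over a path ensemble can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the ensemble at a time step consists of the k-step advantage estimates for k = 1..T, and that the optimistic and conservative estimates are the maximum and minimum over that ensemble. The function names, the `mode` parameter, and the default discount factor are hypothetical and chosen only for illustration.

```python
import numpy as np


def k_step_advantages(rewards, values, gamma=0.99):
    """Return the k-step advantage estimates at time 0 for k = 1..T.

    rewards: r_0 .. r_{T-1} along one sampled path.
    values:  V(s_0) .. V(s_T), i.e. one more entry than rewards.
    The k-th estimate is sum_{i<k} gamma^i r_i + gamma^k V(s_k) - V(s_0).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    horizon = len(rewards)
    discounts = gamma ** np.arange(horizon)
    partial_returns = np.cumsum(discounts * rewards)
    bootstraps = gamma ** np.arange(1, horizon + 1) * values[1:]
    return partial_returns + bootstraps - values[0]


def ensemble_advantage(rewards, values, mode="mean", gamma=0.99):
    """Collapse the ensemble of k-step estimates into one advantage value.

    mode="max" is an optimistic (risk-seeking) estimate, mode="min" is a
    conservative (risk-averse) one, and mode="mean" is an unweighted average
    used here as a neutral baseline. These three modes are assumptions made
    for this sketch.
    """
    ensemble = k_step_advantages(rewards, values, gamma)
    if mode == "max":
        return float(ensemble.max())
    if mode == "min":
        return float(ensemble.min())
    if mode == "mean":
        return float(ensemble.mean())
    raise ValueError(f"unknown mode: {mode}")
```

With a sparse reward that only arrives at the last step, for example `rewards=[0, 0, 1]` with an all-zero value function and `gamma=1.0`, the ensemble is `[0, 0, 1]`: the optimistic estimate picks up the reward signal immediately, while the conservative estimate ignores it.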

research
09/14/2017

Shared Learning : Enhancing Reinforcement in Q-Ensembles

Deep Reinforcement Learning has been able to achieve amazing successes i...
research
09/27/2022

DCE: Offline Reinforcement Learning With Double Conservative Estimates

Offline Reinforcement Learning has attracted much interest in solving th...
research
06/24/2021

Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation

Model-agnostic meta-reinforcement learning requires estimating the Hessi...
research
03/26/2023

Balancing policy constraint and ensemble size in uncertainty-based offline reinforcement learning

Offline reinforcement learning agents seek optimal policies from fixed d...
research
02/17/2019

A new Potential-Based Reward Shaping for Reinforcement Learning Agent

Potential-based reward shaping (PBRS) is a particular category of machin...
research
11/26/2019

The problem with DDPG: understanding failures in deterministic environments with sparse rewards

In environments with continuous state and action spaces, state-of-the-ar...
research
01/06/2022

Data-Efficient Learning of High-Quality Controls for Kinodynamic Planning used in Vehicular Navigation

This paper aims to improve the path quality and computational efficiency...
