Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning

01/01/2020
by Simone Parisi, et al.

Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on feedback via extrinsic rewards to train the agent, and in situations where this feedback occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Unlike existing methods, which rely on models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore
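The two ideas in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's exact formulation (see the linked source code for that): the names `W`, `counts`, and `beta` are illustrative, and the `1/sqrt(n)` exploration signal is a common count-based choice assumed here. The point is the decoupling: a standard Q-function learns from extrinsic rewards, while a separate "visitation value" function is learned off-policy from visit counts, so the non-stationary exploration signal never contaminates the Q-target.

```python
import numpy as np

n_states, n_actions = 10, 2
gamma, alpha = 0.99, 0.1

Q = np.zeros((n_states, n_actions))       # extrinsic (reward) value
W = np.zeros((n_states, n_actions))       # long-term visitation value
counts = np.zeros((n_states, n_actions))  # visit counts n(s, a)

def update(s, a, r, s_next):
    """One off-policy TD update for both value functions."""
    counts[s, a] += 1
    # Standard Q-learning on the extrinsic reward only.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # Learn the visitation value from a count-based signal; because W
    # bootstraps over future states, it captures long-term novelty
    # rather than the immediate count alone.
    r_explore = 1.0 / np.sqrt(counts[s, a])
    W[s, a] += alpha * (r_explore + gamma * W[s_next].max() - W[s, a])

def act(s, beta=1.0):
    """Trade off exploitation (Q) against exploration (W)."""
    return int(np.argmax(Q[s] + beta * W[s]))
```

Because both updates are off-policy and model-free, the two value functions can be trained from the same transitions; annealing `beta` toward zero recovers purely greedy behavior with respect to Q.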


Related research

01/20/2021 · Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments
Exploration under sparse reward is a long-standing challenge of model-fr...

10/31/2022 · Agent-Time Attention for Sparse Rewards Multi-Agent Reinforcement Learning
Sparse and delayed rewards pose a challenge to single agent reinforcemen...

01/21/2019 · A Short Survey on Probabilistic Reinforcement Learning
A reinforcement learning agent tries to maximize its cumulative payoff b...

06/18/2021 · MADE: Exploration via Maximizing Deviation from Explored Regions
In online reinforcement learning (RL), efficient exploration remains par...

10/30/2019 · RBED: Reward Based Epsilon Decay
ε-greedy is a policy used to balance exploration and exploitation in man...

12/26/2020 · Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards
A major challenge in reinforcement learning is the design of exploration...

07/12/2022 · Reactive Exploration to Cope with Non-Stationarity in Lifelong Reinforcement Learning
In lifelong learning, an agent learns throughout its entire life without...
