Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks

07/13/2021
by Sungryull Sohn, et al.

We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the agent's trajectory that improves sample efficiency in sparse-reward MDPs. We show that any optimal policy necessarily satisfies the k-SP constraint. Notably, the k-SP constraint prevents the policy from exploring state-action pairs along non-k-SP trajectories (e.g., going back and forth). In practice, however, completely excluding state-action pairs may hinder the convergence of RL algorithms. To overcome this, we propose a novel cost function that penalizes the policy for violating the SP constraint, instead of excluding those pairs outright. Our numerical experiments in a tabular RL setting demonstrate that the SP constraint can significantly reduce the policy's trajectory space, enabling more sample-efficient learning by suppressing redundant exploration and exploitation. Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods, including count-based exploration, even in continuous control tasks, indicating that it improves sample efficiency by preventing the agent from taking redundant actions.
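The core mechanism described in the abstract, penalizing trajectory segments that are not shortest paths rather than excluding them, can be illustrated in a tabular setting. The sketch below is a minimal illustration, not the authors' implementation: it assumes a known, unweighted transition graph so that shortest-path distances can be computed exactly by BFS, and the names (sp_distance, ksp_penalties) and the penalty value are hypothetical.

```python
# Illustrative sketch (not the paper's code) of a soft k-SP cost:
# penalize any length-k trajectory segment that is not a shortest path
# between its endpoints, instead of excluding those state-action pairs.
from collections import deque

def sp_distance(graph, source):
    """BFS shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def ksp_penalties(trajectory, graph, k, penalty=-0.1):
    """Per-step costs: a length-k segment is penalized whenever a strictly
    shorter path exists between its endpoints (e.g., going back and forth)."""
    costs = [0.0] * len(trajectory)
    for t in range(len(trajectory) - k):
        s, s_next = trajectory[t], trajectory[t + k]
        # The segment takes k steps; compare with the true shortest distance.
        if sp_distance(graph, s).get(s_next, float("inf")) < k:
            costs[t] += penalty  # soft cost, not a hard exclusion
    return costs

# Toy chain 0-1-2-3: the agent oscillates before moving on, violating 2-SP.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
trajectory = [0, 1, 0, 1, 2, 3]
print(ksp_penalties(trajectory, graph, k=2))  # first two steps are penalized
```

In the non-tabular settings reported in the paper (MiniGrid, DeepMind Lab, Atari, Fetch), exact graph distances are unavailable, so the shortest-path distance would have to be estimated from experience; the soft penalty is what allows a standard policy-gradient method such as PPO to keep learning from constraint-violating trajectories rather than discarding them.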


Related research

04/21/2019 · Generative Exploration and Exploitation
Sparse reward is one of the biggest challenges in reinforcement learning...

06/09/2021 · Online Learning for Stochastic Shortest Path Model via Posterior Sampling
We consider the problem of online reinforcement learning for the Stochas...

11/11/2019 · Multi-Path Policy Optimization
Recent years have witnessed a tremendous improvement of deep reinforceme...

07/08/2021 · Computational Benefits of Intermediate Rewards for Hierarchical Planning
Many hierarchical reinforcement learning (RL) applications have empirica...

06/19/2019 · QXplore: Q-learning Exploration by Maximizing Temporal Difference Error
A major challenge in reinforcement learning for continuous state-action ...

09/15/2020 · Soft policy optimization using dual-track advantage estimator
In reinforcement learning (RL), we always expect the agent to explore as...

07/12/2018 · A Constrained Randomized Shortest-Paths Framework for Optimal Exploration
The present work extends the randomized shortest-paths framework (RSP), ...
