Topological Experience Replay

03/29/2022
by Zhang-Wei Hong, et al.

State-of-the-art deep Q-learning methods update Q-values using state transition tuples sampled from the experience replay buffer. These methods typically sample transitions uniformly at random or prioritize them using measures such as the temporal difference (TD) error. Such sampling strategies can be inefficient at learning the Q-function because a state's Q-value depends on the Q-values of its successor states. If the data sampling strategy ignores the precision of the Q-value estimate of the next state, it can lead to useless and often incorrect updates to the Q-values. To mitigate this issue, we organize the agent's experience into a graph that explicitly tracks the dependency between Q-values of states. Each edge in the graph represents a transition between two states by executing a single action. We perform value backups via a breadth-first search that expands vertices in the graph starting from the set of terminal states and successively moving backward. We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of goal-reaching tasks. Notably, the proposed method also outperforms baselines that consume more batches of training experience, and it operates from high-dimensional observational data such as images.
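To make the backward backup order concrete, here is a minimal, illustrative Python sketch of a graph-structured replay buffer with breadth-first value backups from terminal states. It assumes a tabular Q-function over hashable states; the names (ReplayGraph, bfs_backup) and hyperparameters are hypothetical and not the authors' implementation.

```python
from collections import defaultdict, deque

class ReplayGraph:
    """Replay buffer organized as a directed graph over visited states.

    Each stored edge (s, a, r, done) into vertex s' records one observed
    transition s --a--> s' with reward r.
    """

    def __init__(self):
        self.in_edges = defaultdict(list)  # s' -> list of (s, a, r, done)
        self.terminals = set()

    def add(self, s, a, r, s_next, done):
        self.in_edges[s_next].append((s, a, r, done))
        if done:
            self.terminals.add(s_next)

    def bfs_backup(self, Q, gamma=0.99, alpha=0.5):
        """One sweep of Q backups, expanding backward from terminal states.

        Because predecessors are updated only after their successors, each
        backup uses an already-refreshed estimate of the next state's value.
        """
        frontier = deque(self.terminals)
        visited = set(self.terminals)
        while frontier:
            s_next = frontier.popleft()
            for s, a, r, done in self.in_edges[s_next]:
                target = r if done else r + gamma * max(
                    Q[s_next].values(), default=0.0)
                Q[s][a] += alpha * (target - Q[s][a])
                if s not in visited:
                    visited.add(s)
                    frontier.append(s)


# Usage on a toy 3-state chain: s0 --a--> s1 --a--> goal (reward 1 at goal).
g = ReplayGraph()
g.add("s0", "a", 0.0, "s1", False)
g.add("s1", "a", 1.0, "goal", True)

Q = defaultdict(lambda: defaultdict(float))
g.bfs_backup(Q)
print(Q["s1"]["a"], Q["s0"]["a"])  # reward information reaches s0 in one sweep
```

In this toy example, a uniform sampler would in expectation need several passes over the buffer before the goal reward propagates back to s0, whereas the backward sweep propagates it in a single pass.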


