Qgraph-bounded Q-learning: Stabilizing Model-Free Off-Policy Deep Reinforcement Learning

07/15/2020
by   Sabrina Hoppe, et al.

In state-of-the-art model-free off-policy deep reinforcement learning, a replay memory is used to store past experience and derive all network updates. Even if both state and action spaces are continuous, the replay memory only holds a finite number of transitions. We represent these transitions in a data graph and link its structure to soft divergence. By selecting a subgraph with a favorable structure, we construct a simplified Markov Decision Process (MDP) for which exact Q-values can be computed efficiently as more data comes in. The subgraph and its associated Q-values can be represented as a QGraph. We show that the Q-value for each transition in the simplified MDP is a lower bound on the Q-value for the same transition in the original continuous Q-learning problem. By using these lower bounds in temporal difference learning, our method QG-DDPG is less prone to soft divergence and exhibits increased sample efficiency while being more robust to hyperparameters. QGraphs also retain information from transitions that have already been overwritten in the replay memory, which can decrease the algorithm's sensitivity to the replay memory capacity.
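
To make the idea of exact Q-values on a finite data graph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes deterministic transitions with hashable state identifiers, treats each stored transition as an edge, and assigns a bootstrap value of zero to states without recorded outgoing transitions. The function name `graph_q_values`, the tuple layout, and the parameters `gamma`, `tol`, and `max_iters` are all hypothetical choices made for illustration; the subgraph selection and the lower-bound argument from the abstract are not reproduced here.

```python
from collections import defaultdict


def graph_q_values(transitions, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Compute one Q-value per stored transition by value iteration on the graph.

    transitions: iterable of (state, action, reward, next_state, done) tuples.
    Assumption (illustrative only): states are hashable, transitions are
    deterministic, and a state with no recorded outgoing edge bootstraps to 0.
    """
    transitions = list(transitions)

    # Index the outgoing edges (transition indices) of every state.
    out_edges = defaultdict(list)
    for idx, (state, _action, _reward, _next_state, _done) in enumerate(transitions):
        out_edges[state].append(idx)

    q = [0.0] * len(transitions)
    for _ in range(max_iters):
        delta = 0.0
        for idx, (_state, _action, reward, next_state, done) in enumerate(transitions):
            # Terminal transitions do not bootstrap from a successor.
            successors = [] if done else out_edges.get(next_state, [])
            bootstrap = max((q[j] for j in successors), default=0.0)
            new_q = reward + gamma * bootstrap
            delta = max(delta, abs(new_q - q[idx]))
            q[idx] = new_q
        if delta < tol:  # values have stopped changing
            break
    return q


if __name__ == "__main__":
    # Tiny hypothetical graph: A -> B -> C (terminal) plus a shortcut A -> C.
    example = [
        ("A", "go_b", 0.0, "B", False),
        ("B", "go_c", 1.0, "C", True),
        ("A", "go_c", 0.5, "C", True),
    ]
    print(graph_q_values(example, gamma=0.9))
```

Because the graph is finite and `gamma` is below one, the sweep is a contraction and settles on the fixed point of the Bellman equation restricted to the stored edges. In a method along the lines described above, such values would then serve as bounds on the targets used in temporal difference learning.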


