DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

03/16/2020
by Aviral Kumar, et al.

Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and that training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse, or delayed rewards. We demonstrate the existence of this problem both theoretically and empirically. We then show that a specific correction to the data distribution can mitigate this issue. Based on these observations, we propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training, resulting in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals. A blog post presenting a summary of this work is available at: https://bair.berkeley.edu/blog/2020/03/16/discor/.
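The re-weighting idea described in the abstract can be made concrete with a short sketch. The code below is a minimal illustration rather than the authors' reference implementation: it assumes a discrete-action PyTorch setup, and all names (`q_net`, `delta_net`, `discor_weights`, the `batch` fields) and hyperparameters (`gamma`, the temperature `tau`) are placeholders. The idea is to train an auxiliary error model that estimates the accumulated Bellman error at each state-action pair, and to down-weight transitions whose bootstrap targets that model flags as unreliable.

```python
# Illustrative sketch of DisCor-style re-weighted Q-learning.
# Hypothetical names; discrete actions; batch.done is a float mask in {0, 1}.
import torch
import torch.nn.functional as F

def discor_weights(delta_net, target_q_net, batch, gamma=0.99, tau=10.0):
    """Per-transition weights ~ exp(-gamma * estimated error at the bootstrap target / tau)."""
    with torch.no_grad():
        next_q = target_q_net(batch.next_obs)                      # [B, |A|]
        a_next = next_q.argmax(dim=-1, keepdim=True)               # greedy bootstrap action
        next_delta = delta_net(batch.next_obs).gather(1, a_next).squeeze(1)
        w = torch.exp(-gamma * next_delta / tau)                   # down-weight unreliable targets
        return w / (w.mean() + 1e-8)                               # normalize for stable scale

def train_step(q_net, target_q_net, delta_net, q_opt, delta_opt, batch, gamma=0.99):
    # Standard bootstrapped Q-learning target.
    with torch.no_grad():
        next_v = target_q_net(batch.next_obs).max(dim=-1).values
        target = batch.reward + gamma * (1.0 - batch.done) * next_v

    q = q_net(batch.obs).gather(1, batch.action.unsqueeze(1)).squeeze(1)

    # Re-weighted Bellman error minimization (the core DisCor idea).
    w = discor_weights(delta_net, target_q_net, batch, gamma)
    q_loss = (w * (q - target) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Error model: regress toward |Q - target| + gamma * Delta(s', a'),
    # an estimate of the error accumulated through bootstrapping.
    with torch.no_grad():
        next_q = target_q_net(batch.next_obs)
        a_next = next_q.argmax(dim=-1, keepdim=True)
        next_delta = delta_net(batch.next_obs).gather(1, a_next).squeeze(1)
        delta_target = (q - target).abs() + gamma * (1.0 - batch.done) * next_delta

    delta = delta_net(batch.obs).gather(1, batch.action.unsqueeze(1)).squeeze(1)
    delta_loss = F.mse_loss(delta, delta_target)
    delta_opt.zero_grad(); delta_loss.backward(); delta_opt.step()
```

The sketch keeps only the exponential down-weighting that gives the method its character; practical details such as how the temperature is set are described in the paper and omitted here.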
