Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

06/03/2019
by Aviral Kumar, et al.

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods, based on Q-learning and actor-critic architectures, are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
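The constrained backup described above can be illustrated with a small sketch. The Python snippet below is a minimal tabular illustration, not the paper's implementation: BEAR itself works in an actor-critic setting and constrains a learned policy to the support of the behavior data (via a sampled MMD penalty), whereas this sketch simply masks out actions that never appear in the fixed dataset before taking the max in the Bellman backup. All names here (counts, SUPPORT_THRESHOLD, etc.) are illustrative assumptions.

import numpy as np

n_states, n_actions, gamma = 5, 4, 0.99
rng = np.random.default_rng(0)

Q = rng.normal(size=(n_states, n_actions))                       # current Q estimates
rewards = rng.normal(size=(n_states, n_actions))                 # r(s, a)
next_state = rng.integers(n_states, size=(n_states, n_actions))  # deterministic s' for each (s, a)

# How often each (s, a) appears in the fixed off-policy dataset; zero-count
# actions are out-of-distribution and their Q-values are unreliable.
counts = rng.integers(0, 3, size=(n_states, n_actions))
SUPPORT_THRESHOLD = 1

def standard_backup(Q):
    # Max over *all* actions at s', including out-of-distribution ones,
    # so errors on unseen actions propagate through the Bellman backup.
    return rewards + gamma * Q[next_state].max(axis=-1)

def support_constrained_backup(Q):
    # Restrict the max at s' to actions supported by the data distribution,
    # in the spirit of BEAR's constrained action selection.
    supported = counts >= SUPPORT_THRESHOLD
    masked_Q = np.where(supported, Q, -np.inf)[next_state]
    has_support = np.isfinite(masked_Q).any(axis=-1)
    # Fall back to the unconstrained max when no action at s' is supported.
    return rewards + gamma * np.where(has_support,
                                      masked_Q.max(axis=-1),
                                      Q[next_state].max(axis=-1))

print("standard target[0, 0]:   ", standard_backup(Q)[0, 0])
print("constrained target[0, 0]:", support_constrained_backup(Q)[0, 0])

In the full algorithm the constraint is enforced softly on a learned policy rather than by hard masking, but the intended effect on the backup is the same: targets are bootstrapped only from actions that the training data distribution supports.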

