Continuous Doubly Constrained Batch Reinforcement Learning

by Rasool Fakoor et al.

Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too expensive to allow. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation error when our candidate policies diverge from the one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy-constraint that reduces this divergence and a value-constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
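The two penalties described above can be sketched as terms added to an ordinary TD objective. The sketch below is an illustrative assumption, not the paper's exact formulation: the Gaussian parameterization, the KL-based policy penalty, the hinge-style optimism penalty, and the weights `alpha` and `lam` are all hypothetical stand-ins for the constraints the abstract names.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians (illustrative choice of divergence)."""
    return np.sum(np.log(std_q / std_p)
                  + (std_p**2 + (mu_p - mu_q)**2) / (2 * std_q**2) - 0.5)

def doubly_constrained_loss(q_values, td_targets,
                            mu_pi, std_pi, mu_beta, std_beta,
                            alpha=1.0, lam=0.5):
    # Standard TD regression on the fixed offline batch.
    td_loss = np.mean((q_values - td_targets) ** 2)
    # Policy constraint: keep the learned policy (pi) close to the
    # behavior policy (beta) that generated the dataset.
    policy_penalty = gaussian_kl(mu_pi, std_pi, mu_beta, std_beta)
    # Value constraint: penalize estimates that exceed their targets,
    # discouraging overly optimistic extrapolation.
    optimism_penalty = np.mean(np.maximum(q_values - td_targets, 0.0))
    return td_loss + alpha * policy_penalty + lam * optimism_penalty
```

With both penalty weights set to zero this reduces to plain fitted Q-evaluation; the two extra terms only activate when the candidate policy drifts from the data-generating one or when value estimates climb above their bootstrapped targets.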
