Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

03/11/2019
by   Denis Steckelmacher, et al.
0

Value-based reinforcement-learning algorithms are currently state-of-the-art in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are currently limited by their need for an on-policy critic, which severely constraints how the critic is learned. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we show approximates Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving for poorly-configured hyper-parameters. BDPI is significantly more sample-efficient compared to Bootstrapped DQN, PPO, A3C and ACKTR, on a variety of tasks. Source code: https://github.com/vub-ai-lab/bdpi.

READ FULL TEXT
research
10/16/2019

Soft Actor-Critic for Discrete Action Settings

Soft Actor-Critic is a state-of-the-art reinforcement learning algorithm...
research
10/04/2020

FORK: A Forward-Looking Actor For Model-Free Reinforcement Learning

In this paper, we propose a new type of Actor, named forward-looking Act...
research
10/28/2019

Better Exploration with Optimistic Actor-Critic

Actor-critic methods, a type of model-free Reinforcement Learning, have ...
research
06/23/2022

CGAR: Critic Guided Action Redistribution in Reinforcement Leaning

Training a game-playing reinforcement learning agent requires multiple i...
research
06/16/2021

Solving Continuous Control with Episodic Memory

Episodic memory lets reinforcement learning algorithms remember and expl...
research
03/11/2021

A Quadratic Actor Network for Model-Free Reinforcement Learning

In this work we discuss the incorporation of quadratic neurons into poli...
research
06/05/2023

Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic

Learning high-quality Q-value functions plays a key role in the success ...

Please sign up or login with your details

Forgot password? Click here to reset