Safe and Efficient Off-Policy Reinforcement Learning

06/08/2016
by   Remi Munos, et al.
0

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q^* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/16/2020

Accelerating Online Reinforcement Learning with Offline Datasets

Reinforcement learning provides an appealing formalism for learning cont...
research
01/23/2021

Rethinking Exploration for Sample-Efficient Policy Learning

Off-policy reinforcement learning for control has made great strides in ...
research
04/22/2020

Per-Step Reward: A New Perspective for Risk-Averse Reinforcement Learning

We present a new per-step reward perspective for risk-averse control in ...
research
05/20/2022

Sigmoidally Preconditioned Off-policy Learning:a new exploration method for reinforcement learning

One of the major difficulties of reinforcement learning is learning from...
research
05/17/2019

TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

Off-policy reinforcement learning with eligibility traces is challenging...
research
05/21/2018

Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

Multiple-step lookahead policies have demonstrated high empirical compet...
research
01/17/2022

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforce...

Please sign up or login with your details

Forgot password? Click here to reset