Off-policy Multi-step Q-learning

09/30/2019
by   Gabriel Kalweit, et al.
0

In the past few years, off-policy reinforcement learning methods have shown promising results in their application for robot control. Deep Q-learning, however, still suffers from poor data-efficiency which is limiting with regard to real-world applications. We follow the idea of multi-step TD-learning to enhance data-efficiency while remaining off-policy by proposing two novel Temporal-Difference formulations: (1) Truncated Q-functions which represent the return for the first n steps of a policy rollout and (2) Shifted Q-functions, acting as the farsighted return after this truncated rollout. We prove that the combination of these short- and long-term predictions is a representation of the full return, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and compare our approach in the function-approximation setting with TD3, Model-based Value Expansion and TD3(Delta), which we introduce as an off-policy variant of TD(Delta). We show on three simulated robot tasks that Composite TD3 outperforms TD3 as well as state-of-the-art off-policy multi-step approaches in terms of data-efficiency.

READ FULL TEXT

page 6

page 7

research
09/12/2022

Model-based Reinforcement Learning with Multi-step Plan Value Estimation

A promising way to improve the sample efficiency of reinforcement learni...
research
05/28/2023

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

We study the problem of temporal-difference-based policy evaluation in r...
research
07/05/2018

Per-decision Multi-step Temporal Difference Learning with Control Variates

Multi-step temporal difference (TD) learning is an important approach in...
research
02/25/2021

Bias-reduced multi-step hindsight experience replay

Multi-goal reinforcement learning is widely used in planning and robot m...
research
07/05/2019

Incrementally Learning Functions of the Return

Temporal difference methods enable efficient estimation of value functio...
research
09/16/2022

Value Summation: A Novel Scoring Function for MPC-based Model-based Reinforcement Learning

This paper proposes a novel scoring function for the planning module of ...
research
09/22/2020

Structured Hierarchical Dialogue Policy with Graph Neural Networks

Dialogue policy training for composite tasks, such as restaurant reserva...

Please sign up or login with your details

Forgot password? Click here to reset