DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

05/29/2023
by   Yunhao Tang, et al.
0

Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings. However, in the optimal control case, the impact of multi-step learning has been relatively limited despite a number of prior efforts. Fundamentally, this might be because multi-step policy improvements require operations that cannot be approximated by stochastic samples, hence hindering the widespread adoption of such methods in practice. To address such limitations, we introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations. DoMo-VI enjoys guaranteed convergence speed-up to the optimal policy and is applicable in general off-policy learning settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a bias-variance trade-off that ensures improved policy gradient estimates. When combined with the IMPALA architecture, DoMo-AC has showed improvements over the baseline algorithm on Atari-57 game benchmarks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/18/2016

A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward

We develop an off-policy actor-critic algorithm for learning an optimal ...
research
03/21/2019

Distributed off-Policy Actor-Critic Reinforcement Learning with Policy Consensus

In this paper, we propose a distributed off-policy actor critic method t...
research
11/11/2019

Context-aware Active Multi-Step Reinforcement Learning

Reinforcement learning has attracted great attention recently, especiall...
research
11/16/2021

Off-Policy Actor-Critic with Emphatic Weightings

A variety of theoretically-sound policy gradient algorithms exist for th...
research
08/16/2021

Optimal Actor-Critic Policy with Optimized Training Datasets

Actor-critic (AC) algorithms are known for their efficacy and high perfo...
research
10/06/2021

Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Off-policy Actor-Critic algorithms have demonstrated phenomenal experime...
research
06/20/2022

DNA: Proximal Policy Optimization with a Dual Network Architecture

This paper explores the problem of simultaneously learning a value funct...

Please sign up or login with your details

Forgot password? Click here to reset