Off-Policy Correction for Actor-Critic Algorithms in Deep Reinforcement Learning

08/01/2022
by Baturay Saglam, et al.

Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) approaches that reuse previously gathered data can improve sample efficiency. However, off-policy learning becomes challenging as the discrepancy between the distribution of the policy of interest and the distributions of the policies that collected the data grows. The well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, but they usually require long trajectories, which increases computational complexity and induces additional problems such as vanishing or exploding gradients. Moreover, their generalization to continuous action domains is strictly limited, as they require action probabilities and are therefore unsuitable for deterministic policies. To overcome these limitations, we introduce an alternative off-policy correction algorithm for continuous action spaces, Actor-Critic Off-Policy Correction (AC-Off-POC), that mitigates the potential drawbacks of learning from previously collected data. Through a novel discrepancy measure computed from the agent's most recent action decisions on the states of a randomly sampled batch of transitions, the approach requires neither actual nor estimated action probabilities for any policy and provides an adequate one-step importance-sampling correction. Theoretical results show that the introduced approach achieves a contraction mapping with a unique fixed point, which enables "safe" off-policy learning. Our empirical results suggest that AC-Off-POC consistently improves on the state of the art, attaining higher returns in fewer steps than competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.
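The core mechanism lends itself to a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration of the idea: it scores each sampled transition by the distance between the action the current deterministic policy would take and the action actually stored in the batch, then maps those discrepancies to per-transition weights, with no action probabilities involved. The function name, the squared-distance measure, the softmax weighting, and the `temperature` parameter are illustrative assumptions, not the exact formulation from the paper.

```python
# Minimal sketch of the AC-Off-POC idea: weight each transition in a sampled
# batch by how far the current deterministic policy's action deviates from
# the stored (behavior) action. The softmax-based weighting is an assumed,
# illustrative choice, not the paper's exact discrepancy measure.
import torch
import torch.nn as nn


def off_policy_correction_weights(policy: nn.Module,
                                  states: torch.Tensor,
                                  behavior_actions: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """Per-transition weights from a probability-free discrepancy measure."""
    with torch.no_grad():
        # Actions the current policy would take on the sampled states.
        current_actions = policy(states)
        # Discrepancy: squared distance to the actions stored in the batch.
        disc = ((current_actions - behavior_actions) ** 2).sum(dim=-1)
        # Down-weight stale transitions; scale so the mean weight is 1.
        weights = torch.softmax(-disc / temperature, dim=0) * disc.numel()
    return weights
```

In an actor-critic loop, such weights would scale the per-transition losses, e.g. `critic_loss = (weights * (q - target).pow(2)).mean()`, which amounts to adapting the effective learning rate of each update to how off-policy the sampled data is.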
