Zeroth-order Deterministic Policy Gradient

06/12/2020
by Harshat Kumar, et al.

Deterministic Policy Gradient (DPG) removes a level of randomness from standard randomized-action Policy Gradient (PG), and demonstrates substantial empirical success for tackling complex dynamic problems involving Markov decision processes. At the same time, though, DPG loses its ability to learn in a model-free (i.e., actor-only) fashion, frequently necessitating the use of critics in order to obtain consistent estimates of the associated policy-reward gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients via two-point stochastic evaluations of the Q-function, constructed by properly designed low-dimensional action-space perturbations. Exploiting the idea of random horizon rollouts for obtaining unbiased estimates of the Q-function, ZDPG lifts the dependence on critics and restores true model-free policy learning, while enjoying built-in and provable algorithmic stability. Additionally, we present new finite sample complexity bounds for ZDPG, which improve upon existing results by up to two orders of magnitude. Our findings are supported by several numerical experiments, which showcase the effectiveness of ZDPG in a practical setting, and its advantages over both PG and Baseline PG.
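The two ingredients named in the abstract, two-point evaluations under action-space perturbations and random-horizon rollouts, can be illustrated with a minimal sketch. This is not the authors' algorithm: the Q-function is replaced by a known toy function of the action, the perturbation is assumed to be standard Gaussian, the horizon is assumed to be geometric with parameter 1 - gamma, and the names `two_point_gradient`, `random_horizon_return`, and `toy_q` are hypothetical. A full method would also propagate the action-space estimate through the policy parameters, which is omitted here.

```python
import random


def two_point_gradient(f, a, mu, rng):
    """One two-point zeroth-order estimate of the gradient of f at action a.

    Assumption: the perturbation u is standard Gaussian and mu > 0 is a
    smoothing radius. Only two evaluations of f are used, no derivatives.
    """
    u = [rng.gauss(0.0, 1.0) for _ in a]
    a_plus = [ai + mu * ui for ai, ui in zip(a, u)]
    a_minus = [ai - mu * ui for ai, ui in zip(a, u)]
    scale = (f(a_plus) - f(a_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def random_horizon_return(reward_at, gamma, rng):
    """Sum undiscounted rewards up to a random horizon T.

    Assumption: P(T = t) = (1 - gamma) * gamma**t, so P(T >= t) = gamma**t
    and the expected sum equals the discounted return
    sum_t gamma**t * reward_at(t).
    """
    total, t = 0.0, 0
    while True:
        total += reward_at(t)
        if rng.random() >= gamma:  # stop with probability 1 - gamma
            return total
        t += 1


def toy_q(a, center=(1.0, 2.0)):
    """Hypothetical stand-in for Q(s, a): a concave quadratic in the action."""
    return -sum((ai - ci) ** 2 for ai, ci in zip(a, center))


def mean_gradient(f, a, mu, n, rng):
    """Average n independent two-point estimates to reduce variance."""
    acc = [0.0] * len(a)
    for _ in range(n):
        g = two_point_gradient(f, a, mu, rng)
        acc = [x + gi for x, gi in zip(acc, g)]
    return [x / n for x in acc]


if __name__ == "__main__":
    rng = random.Random(0)
    # The true gradient of toy_q at (0.5, -1.0) is (1.0, 6.0).
    print(mean_gradient(toy_q, [0.5, -1.0], 0.1, 20000, rng))
    # With a constant reward of 1 and gamma = 0.9 the discounted return is 10.
    n = 20000
    print(sum(random_horizon_return(lambda t: 1.0, 0.9, rng) for _ in range(n)) / n)
```

For a quadratic such as `toy_q` the two-point difference is exactly linear in the perturbation, so the averaged estimate converges to the true gradient as the number of samples grows. For a general function it approximates the gradient of a smoothed version of that function, with the approximation depending on `mu`.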

research
05/11/2022

Stochastic first-order methods for average-reward Markov decision processes

We study the problem of average-reward Markov decision processes (AMDPs)...
research
09/09/2019

Deterministic Value-Policy Gradients

Reinforcement learning algorithms such as the deep deterministic policy ...
research
03/09/2021

Model-free Policy Learning with Reward Gradients

Policy gradient methods estimate the gradient of a policy objective sole...
research
01/24/2019

Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems

Reinforcement learning is a promising approach to learning robot control...
research
08/23/2021

Model-Free Learning of Optimal Deterministic Resource Allocations in Wireless Systems via Action-Space Exploration

Wireless systems resource allocation refers to perpetual and challenging...
research
01/30/2023

SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Search

Despite the popularity of policy gradient methods, they are known to suf...
research
06/03/2011

Experiments with Infinite-Horizon, Policy-Gradient Estimation

In this paper, we present algorithms that perform gradient ascent of the...
