Generalized deterministic policy gradient algorithms

by Qingpeng Cai, et al.

We study a reinforcement learning setting in which the state transition is a convex combination of a stochastic continuous function and a deterministic discontinuous function. This setting includes as a special case the stochastic state transition setting, namely the setting of deterministic policy gradient (DPG). We introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee its existence for all discount factors. We then derive a closed form of the policy gradient whenever it exists. Interestingly, the form of the policy gradient in this setting is equivalent to that in DPG. Furthermore, to overcome the high sample complexity of DPG in this setting, we propose the Generalized Deterministic Policy Gradient (GDPG) algorithm. The main innovation of the algorithm is to optimize a weighted objective of the original Markov decision process (MDP) and an augmented MDP that simplifies the original one and serves as its lower bound. To solve the augmented MDP, we use model-based methods, which enable fast convergence. Finally, we conduct extensive experiments comparing GDPG with state-of-the-art methods on several standard benchmarks. The results demonstrate that GDPG substantially outperforms the baselines in terms of both convergence and long-term rewards.
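
The setting described in the abstract can be written down roughly as follows. This is only a sketch under assumed notation: the symbols \lambda, p, T, \delta, \mu_\theta, \rho^{\mu}, Q^{\mu}, \alpha, J_{M} and J_{M^*} are illustrative choices, not taken from the paper, and the mixing weight is treated as a constant for simplicity even though it may depend on the state and action. The second line is the standard deterministic policy gradient form, which the abstract states carries over to the generalized setting. The third line is one plausible reading of the "weighted objective" over the original and augmented MDPs.

```latex
% Assumed notation throughout; not the paper's own definitions.

% Transition as a convex combination of a stochastic continuous kernel p
% and a deterministic map T (\lambda = 0 recovers the purely stochastic case):
P(s' \mid s, a) = (1 - \lambda)\, p(s' \mid s, a)
                + \lambda\, \delta\big(s' - T(s, a)\big),
\qquad \lambda \in [0, 1]

% Standard DPG form of the gradient for a deterministic policy \mu_\theta:
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}
    \Big[ \nabla_\theta \mu_\theta(s)\,
          \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \Big]

% Hypothetical weighted objective over the original MDP M and the
% augmented MDP M^*, with an assumed weight \alpha:
J_{\alpha}(\theta) = (1 - \alpha)\, J_{M}(\mu_\theta) + \alpha\, J_{M^*}(\mu_\theta),
\qquad \alpha \in [0, 1]
```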


Deterministic Value-Policy Gradients

Reinforcement learning algorithms such as the deep deterministic policy ...

Regularized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity

This paper focuses on reinforcement learning for the regularized robust ...

Ranking Policy Gradient

Sample inefficiency is a long-lasting problem in reinforcement learning ...

Rebalancing Dockless Bike Sharing Systems

Bike sharing provides an environment-friendly way for traveling and is b...

A Policy Gradient Method with Variance Reduction for Uplift Modeling

Uplift modeling aims to directly model the incremental impact of a treat...

Recurrent Network-based Deterministic Policy Gradient for Solving Bipedal Walking Challenge on Rugged Terrains

This paper presents the learning algorithm based on the Recurrent Networ...

Regularly Updated Deterministic Policy Gradient Algorithm

Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most w...
