V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

by H. Francis Song et al.

Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported.
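The core of V-MPO's policy-improvement step is a nonparametric "E-step": advantages are computed from the learned state-value function, only the top half of samples by advantage is kept, and the selected samples are reweighted by a softmax over their advantages at a temperature that is itself learned via a dual loss. The sketch below illustrates that weighting step in NumPy; it is a simplified illustration under our own assumptions (fixed temperature `eta`, function names `vmpo_policy_weights` and `vmpo_temperature_loss` are ours), not the authors' implementation.

```python
import numpy as np

def vmpo_policy_weights(advantages, eta, top_fraction=0.5):
    """Nonparametric target-policy weights (sketch, fixed temperature eta).

    Keeps only the top `top_fraction` of samples by advantage and assigns
    them softmax(A / eta) weights; all other samples get zero weight.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    k = max(1, int(len(adv) * top_fraction))
    top_idx = np.argsort(adv)[-k:]           # indices of the highest advantages
    weights = np.zeros_like(adv)
    scaled = adv[top_idx] / eta
    scaled -= scaled.max()                   # subtract max for numerical stability
    exp = np.exp(scaled)
    weights[top_idx] = exp / exp.sum()       # softmax over the selected half
    return weights

def vmpo_temperature_loss(advantages, eta, eps_eta, top_fraction=0.5):
    """Dual loss used to adapt eta: eta * eps_eta + eta * log mean exp(A / eta),
    evaluated on the selected top-advantage samples (sketch)."""
    adv = np.asarray(advantages, dtype=np.float64)
    k = max(1, int(len(adv) * top_fraction))
    top = np.sort(adv)[-k:]
    return eta * eps_eta + eta * np.log(np.mean(np.exp(top / eta)))
```

In a full training loop, the policy loss would then be the negative weighted log-likelihood, `-(weights * log_pi).sum()`, with an additional KL penalty keeping the new policy close to the old one; those pieces are omitted here for brevity.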


