Maximum a Posteriori Policy Optimisation

06/14/2018
by Abbas Abdolmaleki, et al.

We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative-entropy objective. We show that several existing methods can be directly related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence, and robustness to hyperparameter settings, while achieving similar or better final performance.
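The abstract only names the objective, so a toy sketch may help make it concrete: MPO alternates an E-step, which re-weights the current policy toward high Q-values subject to a KL bound of size eps_kl, with an M-step that fits a parametric policy to those weights by weighted maximum likelihood. Below is a minimal NumPy sketch of the closed-form E-step for a discrete action space, where the temperature eta is found by minimising the convex dual of the KL constraint. All names here (e_step, eps_kl, the toy Q-values) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def e_step(prior, q_values, eps_kl):
    """MPO-style E-step: q(a) proportional to prior(a) * exp(Q(a) / eta).

    The temperature eta is chosen by minimising the dual
        g(eta) = eta * eps_kl + eta * log(sum_a prior(a) * exp(Q(a) / eta)),
    which enforces KL(q || prior) <= eps_kl.
    """
    def dual(eta):
        z = prior * np.exp(q_values / eta)
        return eta * eps_kl + eta * np.log(z.sum())

    # Simple grid search over eta > 0; a scalar optimiser would also work.
    etas = np.geomspace(1e-3, 1e2, 2000)
    eta = etas[np.argmin([dual(e) for e in etas])]

    weights = prior * np.exp(q_values / eta)
    return weights / weights.sum(), eta

# Toy example: 4 actions, a uniform prior policy, arbitrary Q-values.
prior = np.full(4, 0.25)
q_values = np.array([1.0, 2.0, 0.5, 1.5])
q_pi, eta = e_step(prior, q_values, eps_kl=0.1)
print(q_pi, eta)  # policy mass shifts toward high-Q actions, within the KL bound
```

In the full algorithm the resulting weights serve as targets for the M-step, i.e. the parametric policy is updated by weighted maximum likelihood under a further trust-region constraint.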
