Local Search for Policy Iteration in Continuous Control

by   Jost Tobias Springenberg, et al.

We present an algorithm for local, regularized, policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework. Our algorithm can be interpreted as a natural extension of work on KL-regularized RL and introduces a form of tree search for continuous action spaces. We demonstrate that additional computation spent on model-based policy improvement during learning can improve data efficiency, and confirm that model-based policy improvement during action selection can also be beneficial. Quantitatively, our algorithm improves data efficiency on several continuous control benchmarks (when a model is learned in parallel), and it provides significant improvements in wall-clock time in high-dimensional domains (when a ground truth model is available). The unified framework also helps us to better understand the space of model-based and model-free algorithms. In particular, we demonstrate that some benefits attributed to model-based RL can be obtained without a model, simply by utilizing more computation.


Variational Model-based Policy Optimization

Model-based reinforcement learning (RL) algorithms allow us to combine m...

Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space

This paper proposes a novel deep reinforcement learning architecture tha...

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

Using a high Update-To-Data (UTD) ratio, model-based methods have recent...

Muesli: Combining Improvements in Policy Optimization

We propose a novel policy update that combines regularized policy optimi...

Information asymmetry in KL-regularized RL

Many real world tasks exhibit rich structure that is repeated across dif...

Reinforcement Learning as Iterative and Amortised Inference

There are several ways to categorise reinforcement learning (RL) algorit...

Accelerated Policy Learning with Parallel Differentiable Simulation

Deep reinforcement learning can generate complex control policies, but r...