Relative Entropy Regularized Policy Iteration

by Abbas Abdolmaleki, et al.

We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to the existing literature on black-box optimization and 'RL as inference', and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997] to a policy iteration scheme. Our comparison on 31 continuous control tasks from the parkour suite [Heess et al., 2017], the DeepMind control suite [Tassa et al., 2018] and OpenAI Gym [Brockman et al., 2016], which have diverse properties, uses a limited amount of compute and a single set of hyperparameters; it demonstrates the effectiveness of our method and state-of-the-art results. Videos summarizing the results can be found at .
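The three steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single state, a one-dimensional Gaussian policy, a fixed temperature, and a hypothetical closed-form `toy_q` standing in for the learned action-value function of step i. The exponentiated-Q weighting used for step ii is one common choice in the 'RL as inference' literature and is an assumption here, since the abstract notes that each step can be implemented in different ways.

```python
import math
import random


def toy_q(action):
    # Hypothetical stand-in for a learned critic (step i); peaks at action = 2.0.
    return -(action - 2.0) ** 2


def nonparametric_weights(q_values, temperature):
    # Step ii (sketch): weights over sampled actions proportional to
    # exp(Q / temperature). Subtracting the max keeps exp() numerically stable.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def fit_gaussian(actions, weights):
    # Step iii (sketch): weighted maximum-likelihood fit of a 1-D Gaussian.
    mean = sum(w * a for w, a in zip(weights, actions))
    var = sum(w * (a - mean) ** 2 for w, a in zip(weights, actions))
    return mean, math.sqrt(var)


def run(iterations=30, samples=64, temperature=0.5, seed=0):
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # initial parametric policy (assumed values)
    for _ in range(iterations):
        actions = [rng.gauss(mean, std) for _ in range(samples)]
        q_values = [toy_q(a) for a in actions]
        weights = nonparametric_weights(q_values, temperature)
        mean, std = fit_gaussian(actions, weights)
        std = max(std, 1e-3)  # floor to avoid a degenerate policy
    return mean, std


if __name__ == "__main__":
    print(run())
```

In this toy setting the fitted mean drifts toward the action that maximizes `toy_q`, while the standard deviation shrinks. A fixed temperature is the simplest choice for a sketch; it controls how greedily the non-parametric weights concentrate on high-value samples.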


On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation

A common technique in reinforcement learning is to evaluate the value fu...

Compatible features for Monotonic Policy Improvement

Recent policy optimization approaches have achieved substantial empirica...

An Elementary Proof that Q-learning Converges Almost Surely

Watkins' and Dayan's Q-learning is a model-free reinforcement learning a...

Logistic Q-Learning

We propose a new reinforcement learning algorithm derived from a regular...

Planning with Expectation Models for Control

In model-based reinforcement learning (MBRL), Wan et al. (2019) showed c...

Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

In this paper we revisit the method of off-policy corrections for reinfo...

Learning Latent Permutations with Gumbel-Sinkhorn Networks

Permutations and matchings are core building blocks in a variety of late...
