Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality

03/22/2023
by François Ged, et al.

A novel Policy Gradient (PG) algorithm, called Matryoshka Policy Gradient (MPG), is introduced and studied in the context of max-entropy reinforcement learning, where an agent aims to maximise entropy bonuses in addition to its cumulative rewards. MPG differs from standard PG in that it trains a sequence of policies to learn finite horizon tasks simultaneously, instead of a single policy for the single standard objective. For softmax policies, we prove convergence of MPG and global optimality of the limit by showing that the only critical point of the MPG objective is the optimal policy; these results hold even in the case of a continuous compact state space. MPG is intuitive and theoretically sound, and we further show that the optimal policy of the standard max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework. Finally, we justify that MPG is well suited to policies parametrized with neural networks, and we provide a simple criterion to verify the global optimality of the policy at convergence. As a proof of concept, we numerically evaluate MPG on standard test benchmarks.
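
To make the nested finite-horizon setting concrete, the sketch below computes entropy-regularized optimal policies for horizons 1, 2, ..., n by standard soft backward induction on a tiny tabular MDP. It is not the MPG algorithm from the paper, which is gradient-based; it only illustrates the kind of target a sequence of horizon-indexed softmax policies could converge to, under our reading that the policy with i steps left is followed by the policy with i-1 steps left. The MDP (`P`, `R`), the temperature `tau`, the discount `gamma`, and the function name `soft_backward_induction` are all hypothetical and do not come from the abstract.

```python
import math


def soft_backward_induction(P, R, horizon, tau=1.0, gamma=0.9):
    """Entropy-regularized optimal policies for horizons 1..horizon.

    P[s][a][t] is the probability of moving from state s to state t under
    action a, and R[s][a] is the immediate reward (both hypothetical inputs).
    Returns a list whose entry i-1 is the policy used with i steps left,
    stored as pi[s][a].
    """
    n_states = len(R)
    n_actions = len(R[0])
    V = [0.0] * n_states  # soft value with zero steps left
    policies = []
    for _ in range(horizon):
        new_V, pi = [], []
        for s in range(n_states):
            # Soft Q-values: reward plus discounted value of the shorter task.
            q = [
                R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            m = max(q)  # shift for a numerically stable log-sum-exp
            z = sum(math.exp((x - m) / tau) for x in q)
            new_V.append(m + tau * math.log(z))  # soft value tau * logsumexp(q / tau)
            pi.append([math.exp((x - m) / tau) / z for x in q])  # softmax policy
        V = new_V
        policies.append(pi)
    return policies


# Hypothetical MDP with 2 states and 2 actions, for illustration only.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [[1.0, 0.0], [0.0, 2.0]]

policies = soft_backward_induction(P, R, horizon=100)

# Largest change in action probabilities between the two longest horizons.
gap = max(
    abs(policies[-1][s][a] - policies[-2][s][a])
    for s in range(2)
    for a in range(2)
)
```

With a discount factor below 1, the policies for long horizons change very little from one horizon to the next in this toy example, which is consistent with the abstract's claim that the standard max-entropy optimal policy can be approximated arbitrarily well within the finite-horizon framework.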


