q-Munchausen Reinforcement Learning

05/16/2022
by Lingwei Zhu, et al.

The recently successful Munchausen Reinforcement Learning (M-RL) features implicit Kullback-Leibler (KL) regularization by augmenting the reward function with the logarithm of the current stochastic policy. Though significant improvement has been shown with the Boltzmann softmax policy, when the Tsallis sparsemax policy is considered, the augmentation leads to a flat learning curve on almost every problem considered. We show that this is due to the mismatch between the conventional logarithm and the non-logarithmic (generalized) nature of Tsallis entropy. Drawing inspiration from the Tsallis statistics literature, we propose to correct the mismatch in M-RL with the help of q-logarithm and q-exponential functions. The proposed formulation leads to implicit Tsallis KL regularization under the maximum Tsallis entropy framework. We show that this formulation of M-RL again achieves superior performance on benchmark problems and sheds light on more general M-RL with various entropic indices q.
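For readers unfamiliar with the q-logarithm and q-exponential functions mentioned in the abstract, the following is a minimal sketch of one common convention from the Tsallis statistics literature. The function names `q_log` and `q_exp` are hypothetical, and the paper itself may use a different parameterization of the entropic index q; this is only an illustration of how the functions generalize the ordinary logarithm and exponential, not the authors' implementation.

```python
import math


def q_log(x: float, q: float) -> float:
    """q-logarithm under an assumed convention: (x**(1-q) - 1) / (1-q).

    Reduces to the ordinary natural logarithm as q -> 1.
    """
    if x <= 0.0:
        raise ValueError("this sketch defines q_log for x > 0 only")
    if math.isclose(q, 1.0):
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_exp(x: float, q: float) -> float:
    """q-exponential under the same assumed convention: [1 + (1-q) x]_+ ** (1/(1-q)).

    Inverse of q_log on its range; reduces to exp(x) as q -> 1.
    """
    if math.isclose(q, 1.0):
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        # Truncation [.]_+: the result is 0 for q < 1 and diverges for q > 1.
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    for q in (0.5, 1.0, 2.0):
        y = q_log(0.5, q)
        print(f"q={q}: q_log(0.5)={y:.6f}, q_exp(q_log(0.5))={q_exp(y, q):.6f}")
```

Under this convention, `q_exp(q_log(x, q), q)` recovers `x` for positive `x`, and both functions approach their ordinary counterparts as q approaches 1, which is the sense in which they are called generalized.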


research
07/13/2021

Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning

In this paper, we propose cautious policy programming (CPP), a novel val...
research
05/16/2022

Enforcing KL Regularization in General Tsallis Entropy Reinforcement Learning via Advantage Learning

Maximum Tsallis entropy (MTE) framework in reinforcement learning has ga...
research
03/31/2020

Leverage the Average: an Analysis of Regularization in RL

Building upon the formalism of regularized Markov decision processes, we...
research
05/23/2022

RL with KL penalties is better viewed as Bayesian inference

Reinforcement learning (RL) is frequently employed in fine-tuning large ...
research
07/16/2021

Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning

The recent booming of entropy-regularized literature reveals that Kullba...
research
10/20/2020

Iterative Amortized Policy Optimization

Policy networks are a central feature of deep reinforcement learning (RL...
research
05/27/2021

Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization

This paper addresses a new interpretation of reinforcement learning (RL)...
