Optimization Issues in KL-Constrained Approximate Policy Iteration

02/11/2021
by Nevena Lazic, et al.

Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API). While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL divergence to the previous policy. Popular practical algorithms such as TRPO, MPO, and V-MPO replace the regularizer with a constraint on the KL divergence between consecutive policies, arguing that a constraint is easier to implement and tune. In this work, we study this implementation choice in more detail. We compare the use of KL divergence as a constraint vs. as a regularizer, and point out several optimization issues with the widely used constrained approach. We show that the constrained algorithm is not guaranteed to converge even on simple problem instances where the constrained problem can be solved exactly, and in fact incurs linear expected regret. For approximate implementations using softmax policies, we show that regularization can improve the optimization landscape of the original objective. We demonstrate these issues empirically on several bandit and RL environments.

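To make the contrast concrete, here is a minimal sketch (ours, not from the paper) of the two softmax-policy update rules on a K-armed bandit. The regularized update, max_pi <pi, Q> - lam * KL(pi || pi_prev), has the closed-form solution pi_new(a) proportional to pi_prev(a) * exp(Q(a)/lam). The constrained update, max_pi <pi, Q> subject to KL(pi || pi_prev) <= eps, has the same exponentiated form for some temperature, which TRPO/MPO-style implementations recover by searching over the dual variable; here we approximate that search by bisection. Function names and the bisection bounds are illustrative assumptions.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions with full support.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_update(pi_prev, q_values, lam):
    # Closed-form maximizer of <pi, q> - lam * KL(pi || pi_prev):
    # reweight the previous policy by exponentiated action values.
    w = pi_prev * np.exp(q_values / lam)
    return w / w.sum()

def constrained_update(pi_prev, q_values, eps, lam_lo=1e-3, lam_hi=1e3, iters=60):
    # Approximate maximizer of <pi, q> subject to KL(pi || pi_prev) <= eps.
    # The constrained solution has the same exponentiated form for some
    # temperature lam >= 0; we bisect geometrically on lam until the KL
    # budget is met (larger lam means a smaller policy change).
    for _ in range(iters):
        lam = np.sqrt(lam_lo * lam_hi)
        if kl(regularized_update(pi_prev, q_values, lam), pi_prev) > eps:
            lam_lo = lam  # budget violated: move toward larger lam
        else:
            lam_hi = lam  # budget satisfied: probe a smaller lam
    return regularized_update(pi_prev, q_values, lam_hi)

# Example on a 2-armed bandit with a uniform initial policy.
pi0 = np.array([0.5, 0.5])
q = np.array([1.0, 0.0])
pi_reg = regularized_update(pi0, q, lam=1.0)  # fixed regularization strength
pi_con = constrained_update(pi0, q, eps=0.1)  # greediest policy with KL <= 0.1
```

Both updates share the same exponentiated-weights form; the difference is whether lam is a fixed regularization strength or is set implicitly at every step by the KL budget eps, which is exactly the implementation choice the paper analyzes.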
Related research

- Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence (01/27/2023)
- Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences (07/17/2021)
- CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric (10/20/2021)
- A Note on KL-UCB+ Policy for the Stochastic Bandit (03/19/2019)
- f-Divergence constrained policy improvement (12/29/2017)
- Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning (07/16/2021)
- A Comparison of Algorithms for Learning Hidden Variables in Normal Graphs (08/26/2013)
