# Actor-critic is implicitly biased towards high entropy optimal policies

We show that the simplest actor-critic method – a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration – does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ϵ-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.

## 1 Overview

Reinforcement learning methods navigate an environment and seek to maximize their reward (Sutton and Barto, 2018). A key tension is the tradeoff between exploration and exploitation: does a learner (also called an agent or policy) explore for new high-reward states, or does it exploit the best states it has already found? This is a sensitive part of RL algorithm design, as it is easy for methods to become blind to parts of the state space; to combat this, many methods have an explicit exploration component, for instance the ϵ-greedy method, which forces exploration in all states with probability ϵ (Sutton and Barto, 2018; Tokic, 2010). Similarly, many methods must use projections and regularization to smooth their estimates (Williams and Peng, 1991; Mnih et al., 2016; Cen et al., 2020).

This work considers actor-critic methods, where a policy (or actor) is updated via the suggestions of a critic. In this setting, prior work invokes a combination of explicit regularization and exploration to avoid getting stuck, and makes various fast mixing assumptions to help accurate exploration. For example, recent work with a single trajectory in the tabular case used both an explicit ϵ-greedy component and uniform mixing assumptions (Khodadadian et al., 2021), neural actor-critic methods use a combination of projections and regularization together with various assumptions on mixing and on the path followed through policy space (Cai et al., 2019; Wang et al., 2019), and even direct analyses of the TD subroutine in our linear MDP setting make use of both projection steps and an assumption of starting from the stationary distribution (Bhandari et al., 2018).

Contribution. This work shows that a simple linear actor-critic (cf. Algorithm 1) in a linear MDP (cf. Section 1.1) with a finite but non-tabular state space (cf. Section 1.1) finds an ϵ-optimal policy in polynomially many samples, without any explicit exploration or projections in the algorithm and without any uniform mixing assumptions on the policy space (cf. Section 1.1). The algorithm and analysis avoid both via an implicit bias towards high entropy policies: the actor-critic policy path never leaves a Kullback-Leibler (KL) divergence ball around the maximum entropy optimal policy; this firstly ensures implicit exploration, and secondly ensures fast mixing. In more detail:

1. Actor analysis via mirror descent. We write the actor update as an explicit mirror descent. While on the surface this does not change the method (e.g., in the tabular case, the method is identical to natural policy gradient (Agarwal et al., 2021)), it gives a clean optimization guarantee which carries a KL-based implicit bias consequence for free, and decouples concerns between the actor and critic.

2. Critic analysis via projection-free sampling tools within KL balls. The preceding mirror descent component guarantees that we stay within a small KL ball, if the statistical error of the critic is controlled. Concordantly, our sampling tools guarantee this statistical error is small, if we stay within a small KL ball. Concretely, we provide useful lemmas that every policy in a KL ball around the high entropy policy has uniformly upper bounded mixing times, and separately give a projection-free (implicitly regularized!) analysis of the standard temporal-difference (TD) update from any starting state (Sutton, 1988), whereas the closest TD analysis in the literature uses projections and requires the sampling process to be started from the stationary distribution (Bhandari et al., 2018). The mixing assumptions here contrast in general with prior work, which either makes explicit use of stationary distributions (Cai et al., 2019; Wang et al., 2019; Bhandari et al., 2018), or makes uniform mixing assumptions on all policies (Xu et al., 2020b; Khodadadian et al., 2021).

Our final proof has the above actor and critic components feeding off of each other: we use an induction to simultaneously establish that since the actor stayed in a small KL ball in previous iterations, then the new critic update is accurate, and similarly the accuracy of the previous critic updates ensures the actor continues to stay in a small KL ball. We feel these tools will be useful in other work.

### 1.1 Setting and main results

We will now give the setting, main result, and algorithm in full. Further details on MDPs can be found in Section 1.3, but the actor-critic method appears in Algorithm 1. To start, the environment and policies are as follows.

The Markov Decision Process (MDP) has finitely many states and actions, and bounded rewards. States are observed in some feature encoding, but the state space itself is assumed finite.

Policies are linear softmax policies: a policy is given by a weight matrix $W$, and given a state $s$, uses a per-state softmax to sample a new action $a$:

$$ a \sim \phi(s^{\mathsf{T}} W \cdot), \qquad \text{where } \phi(s^{\mathsf{T}} W a) = \frac{\exp(s^{\mathsf{T}} W a)}{\sum_{b\in\mathcal{A}} \exp(s^{\mathsf{T}} W b)}. \tag{1} $$

Let $\mathcal{A}(s)$ denote the set of optimal actions at state $s$. It is assumed that $\mathcal{A}(s)$ is nonempty for every state $s$; equivalently, for every state, there exists an optimal policy whose stationary distribution exists and places positive mass on that state.

The choice of linear policies simplifies presentation and analysis, but the tools here should be applicable to other settings. This choice also allows direct comparison to the widely-studied implicit bias of gradient descent in linear classification settings (Soudry et al., 2017; Ji and Telgarsky, 2018), as will be discussed further in Section 1.2. The choice of finite state space is to remove measure-theoretic concerns and to allow a simple characterization of the maximum entropy optimal policy.
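As a concrete reference point, the per-state softmax of eq. (1) can be sketched in a few lines of code; the dimensions and features below are hypothetical, purely for illustration.

```python
import numpy as np

def softmax_policy(W, s):
    """Per-state linear softmax of eq. (1): action a gets logit s^T W e_a."""
    logits = s @ W                   # one logit per action (actions are basis vectors)
    logits = logits - logits.max()   # shift for numerical stability; policy unchanged
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
W = np.zeros((3, 4))                 # hypothetical: d = 3 features, k = 4 actions
s = rng.normal(size=3)
pi_s = softmax_policy(W, s)          # with W = 0, this is uniform over the k actions
```

With zero weights the policy is uniform, the highest entropy policy of all; the analysis below is about how far the iterates drift from such high entropy behavior.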

[Simplification of Appendix A] Under the assumptions of Section 1.1, there exists a unique maximum entropy optimal policy $\bar\pi$, which satisfies $\bar\pi(s,\cdot) = \mathrm{Uniform}(\mathcal{A}(s))$ for every state $s$.

To round out this introductory presentation of the actor, the last component is the update: $p_{i+1} := p_i + \theta\hat{\mathcal{Q}}_i$, with $\pi_{i+1}$ obtained by applying the per-state softmax to $p_{i+1}$. This is explicitly a mirror descent or dual averaging representation of the policy, where we use a mirror mapping to obtain the policy from the pre-softmax values $p_{i+1}$. As mentioned before, this update appears in prior work in the tabular setting with natural policy gradient and actor-critic (Agarwal et al., 2021; Khodadadian et al., 2021). We will motivate this choice in our more general non-tabular setting in Section 2.
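In code, this dual averaging update is a couple of lines; the sketch below uses a fixed, made-up critic estimate standing in for the critic's output, and two hypothetical states, simply to show how accumulating critic values in pre-softmax space concentrates the policy on high-value actions.

```python
import numpy as np

def softmax_rows(p):
    z = p - p.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

theta = 0.5
p = np.zeros((2, 3))                      # pre-softmax values: 2 states x 3 actions
Q_hat = np.array([[1.0, 0.0, 0.0],        # hypothetical critic estimates
                  [0.0, 1.0, 0.0]])
for _ in range(50):
    p = p + theta * Q_hat                 # dual averaging / NPG-style actor update
policy = softmax_rows(p)                  # mass concentrates on argmax of Q_hat per state
```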

The final assumption and description of the critic are as follows. As will be discussed in Section 2, the policy becomes optimal if $\hat{\mathcal{Q}}$ is an accurate estimate of the true $\mathcal{Q}$ function. We employ a standard TD update with no projections or constraints. To guarantee that this linear model of $\mathcal{Q}$ is accurate, we make a standard linear MDP assumption (Bradtke and Barto, 1996; Melo and Ribeiro, 2007; Jin et al., 2020).

In words, the linear MDP assumption is that the MDP rewards and transitions are modeled by linear functions. In more detail, for convenience first fix a canonical vector form for state/action pairs: let $x_{sa}$ denote the vector obtained by unrolling the matrix $sa^{\mathsf{T}}$ row-wise (whereby vector inner products with $x_{sa}$ match matrix inner products with $sa^{\mathsf{T}}$). The linear MDP assumption is then that there exists a fixed vector $y$ and a fixed matrix $M$ so that for any state/action pair $(s,a)$ and any subsequent state $s'$,

$$ \mathbb{E}[r \mid (s,a)] = x_{sa}^{\mathsf{T}} y, \qquad \mathbb{E}[s' \mid (s,a)] = M x_{sa}. $$

Lastly, suppose $\|x_{sa}\| \le 1$ for all state/action pairs $(s,a)$.

Though a strong assumption, it is not only common, but note also that since TD must continually interact with the MDP, it would have little hope of accuracy if it could not model short-term MDP dynamics. Indeed, as is shown in Section C.2 (but appears in various forms throughout the literature), the assumption implies that the fixed point of the TD update is the true $\mathcal{Q}$ function.
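To make the vectorization concrete, here is a small sketch of the canonical form of a state/action pair and the linear reward/transition means; the parameter names ($y$, $M$) and the dimensions are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 2                          # hypothetical feature and action dimensions
y = rng.normal(size=d * k)           # linear reward parameter
M = rng.normal(size=(d, d * k))      # linear transition parameter

def x_vec(s, a_onehot):
    """Row-wise unrolling of the matrix s a^T into a vector of length d*k."""
    return np.outer(s, a_onehot).reshape(-1)

s = rng.normal(size=d)
a = np.eye(k)[0]                     # actions are standard basis vectors
x = x_vec(s, a)
mean_reward = x @ y                  # models E[r | (s, a)] = x_sa^T y
mean_next_state = M @ x              # models E[s' | (s, a)] = M x_sa
```

Because the unrolling is bilinear in $(s,a)$, linear functions of $x_{sa}$ are exactly bilinear functions of the state and action, which is what lets a linear TD model represent the $\mathcal{Q}$ function.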

We now state our main result, which bounds not just the value function (cf. Section 1.3) but also the KL divergence $K_{v_s^{\bar\pi}}(\bar\pi, \pi_i)$, where $v_s^{\bar\pi}$ is the visitation distribution of the maximum entropy optimal policy $\bar\pi$ when run from state $s$ (cf. Section 1.3).

Suppose the assumptions of Section 1.1 (which imply the (unique) maximum entropy optimal policy $\bar\pi$ is well-defined, and also irreducible and aperiodic). Fix an iteration count $t$ and confidence $\delta\in(0,1)$, and choose parameters

$$ \theta = \frac{1}{\sqrt{t}}, \qquad N = \mathcal{O}\Bigl(\frac{t^4}{\delta^2}\Bigr), \qquad \frac{1}{\eta} = \mathcal{O}\bigl(\sqrt{N}\ln N\bigr), $$

where the constants hidden in $\mathcal{O}$ depend only on $\bar\pi$ and the MDP, and not on $t$. With these parameters in place, invoke Algorithm 1, and let $(\pi_i)_{i\le t}$ be the resulting sequence of policies. Then with probability at least $1-\delta$, simultaneously for every state $s$ and every $i \le t$,

$$ K_{v_s^{\bar\pi}}(\bar\pi, \pi_i) + \theta(1-\gamma)\sum_{j<i}\bigl(\mathcal{V}^{\bar\pi}(s) - \mathcal{V}^{\pi_j}(s)\bigr) \le \ln k + \frac{2}{(1-\gamma)^2}. $$

[Discussion of Section 1.1]

1. Implicit bias. Since $\bar\pi$ is optimal, the second term is nonnegative and can be deleted, and the bound implies

$$ \max_{i\le t}\,\max_{s\in\mathcal{S}} K_{v_s^{\bar\pi}}(\bar\pi, \pi_i) \le \ln k + \frac{2}{(1-\gamma)^2}; $$

since this holds for all $i \le t$, it controls the entire optimization path. This term is a direct consequence of our mirror descent setup, and is used to control the TD errors at every iteration.

2. Mixing time constants. The critic iteration count $N$ and step size $\eta$ hide mixing time constants; these mixing time constants depend only on the KL bound $\ln k + 2/(1-\gamma)^2$, and in particular there is no hidden growth in these terms with $t$. That is to say, mixing times are uniformly controlled over a fixed KL ball that does not depend on $t$; prior work by contrast makes strong mixing assumptions (Wang et al., 2019; Xu et al., 2020b; Khodadadian et al., 2021).

3. Single trajectory. A single trajectory through the MDP is used to remove the option of the algorithm escaping from poor choices with resets; only the implicit bias can save it.

4. Rate. Since the actor's step size is $\theta = 1/\sqrt{t}$, obtaining error $\epsilon$ requires $\Theta(1/\epsilon^2)$ actor iterations, and the total number of samples $tN$ is polynomial in $1/\epsilon$; this is slower than the rate given in the only other single-trajectory analysis in the literature (Khodadadian et al., 2021), but by contrast that work makes uniform mixing assumptions (cf. Khodadadian et al. (2021, Lemma C.1)), requires the tabular setting, and uses ϵ-greedy for explicit exploration in each iteration.

The organization of the remainder of this work is as follows. The rest of this introduction gives further related work, and some notation and MDP background. Section 2 presents and discusses the mirror descent framework which provides optimization and implicit bias guarantees on the actor for free. Section 3 presents the sampling lemmas we use to control the TD error. Section 4 concludes with some discussion and open problems, and the appendices contain the full proofs.

### 1.2 Further related work

For the standard background in reinforcement learning, see Sutton and Barto (2018).

##### Natural and regular policy gradient (PG & NPG).

As mentioned before, the actor update here agrees with the natural policy gradient update in the tabular setting (Kakade, 2001); see also (Agarwal et al., 2021) for a well-known analysis of natural and regular policy gradient methods. These methods are widespread in theory and practice (Williams, 1992; Sutton et al., 2000; Bagnell and Schneider, 2003; Liu et al., 2020; Fazel et al., 2018).

##### Natural and regular actor-critic (AC & NAC).

The study of regular and natural actor-critic methods started with Konda and Tsitsiklis (2000) and Peters and Schaal (2008) respectively. These methods are very common both in theory and practice, and there are many variants and improvements to both the actor component and the critic component (Xu et al., 2020a, b; Wu et al., 2020; Bhatnagar et al., 2009).

##### Regularization and constraints.

It is standard with neural policies to explicitly maintain a constraint on the network weights (Wang et al., 2019; Cai et al., 2019). Relatedly, many works both in theory and practice use explicit entropy regularization to prevent vanishing probabilities (Williams and Peng, 1991; Mnih et al., 2016; Abdolmaleki et al., 2018), which can even yield convergence rate improvements (Cen et al., 2020).

##### NPG and mirror descent.

The original and recent analyses of NPG had a mirror descent flavor, though mirror descent and its analysis were not explicitly invoked as a black box (Kakade, 2001; Agarwal et al., 2021). Further connections to mirror descent have appeared many times (Geist et al., 2019; Shani et al., 2020), though with a focus on the design of new algorithms, and not for any implicit regularization effect or proof. Mirror descent is used heavily throughout the online learning literature (Shalev-Shwartz, 2011), and in work handling adversarial MDP settings (Zimin and Neu, 2013).

##### Temporal-difference update (TD).

As discussed before, the TD update, originally presented by Sutton (1988), is standard in the actor-critic literature (Cai et al., 2019; Wang et al., 2019), and also appears in many other works cited in this section. As was mentioned, prior work requires various projections and mixing assumptions (Bhandari et al., 2018).

##### Implicit regularization in supervised learning.

A pervasive topic in supervised learning is the implicit regularization effect of common descent methods; concretely, standard descent methods prefer low or even minimum norm solutions, which can be converted into generalization bounds. The present work makes use of a weak implicit bias, which only prefers smaller norms and does not necessarily lead to minimal norms; arguably this idea was used in the classical perceptron method (Novikoff, 1962), and was then shown in linear and shallow network cases of SGD applied to logistic regression (Ji and Telgarsky, 2018, 2019), generalized to other losses (Shamir, 2020), and also applied to other settings (Chen et al., 2019). The more well-known strong implicit bias, namely the convergence to minimum norm solutions, has been observed with exponentially-tailed losses together with coordinate descent with linear predictors (Zhang and Yu, 2005; Telgarsky, 2013), gradient descent with linear predictors (Soudry et al., 2017; Ji and Telgarsky, 2018), and deep learning in various settings (Lyu and Li, 2019; Chizat and Bach, 2020), just to name a few.

### 1.3 Notation

This brief notation section summarizes various concepts and notation used throughout; modulo a few inventions, the presentation mostly matches standard ones in RL (Sutton and Barto, 2018) and policy gradient (Agarwal et al., 2021). A policy $\pi$ maps state-action pairs to reals, and $\pi(s,\cdot)$ will always be a probability distribution. Given a state $s$, the agent samples an action $a$ from $\pi(s,\cdot)$, the environment returns some random reward $r$ (which has a fixed distribution conditioned on the observed pair $(s,a)$), and then uses a transition kernel to choose a new state $s'$ given $(s,a)$.

Taking $\tau = (s_0, a_0, r_0, s_1, \ldots)$ to denote a random trajectory followed by a policy $\pi$ interacting with the MDP from an arbitrary initial state distribution $\mu$, the value and $\mathcal{Q}$ functions are respectively

$$ \mathcal{V}^\pi(\mu) := \mathbb{E}_{s_0\sim\mu}\sum_{t\ge0}\gamma^t r_t, \qquad \mathcal{Q}^\pi(s,a) := \mathbb{E}_{s_0=s,\,a_0=a}\sum_{t\ge0}\gamma^t r_t = \mathbb{E}_{r_0\sim(s,a)}\bigl[r_0\bigr] + \gamma\,\mathbb{E}_{s_1\sim(s,a)}\bigl[\mathcal{V}^\pi(s_1)\bigr], $$

where the simplified notation $\mathcal{V}^\pi(s)$ for the Dirac distribution on state $s$ will often be used, as well as the shorthands $\mathcal{V}_i := \mathcal{V}^{\pi_i}$ and $\mathcal{Q}_i := \mathcal{Q}^{\pi_i}$. Additionally, let $\mathcal{A}^\pi := \mathcal{Q}^\pi - \mathcal{V}^\pi$ denote the advantage function; note that the natural policy gradient update could interchangeably use $\mathcal{Q}^\pi$ or $\mathcal{A}^\pi$ since they only differ by an action-independent constant, namely $\mathcal{V}^\pi(s)$, which the softmax normalizes out. As in Section 1.1, the state space is finite but non-tabular, and the action space is identified with the $k$ standard basis vectors. The other MDP assumption, namely of a linear MDP (cf. Section 1.1), will be used whenever TD guarantees are needed. Lastly, the discount factor $\gamma\in(0,1)$ has not been highlighted, but is standard in the RL literature, and will be treated as given and fixed throughout the present work.

A common tool in RL is the performance difference lemma (Kakade and Langford, 2002): letting $v_\mu^\pi$ denote the visitation distribution corresponding to policy $\pi$ starting from $\mu$, meaning

$$ v_\mu^\pi(s) := (1-\gamma)\,\mathbb{E}_{s'\sim\mu}\sum_{t\ge0}\gamma^t \Pr[s_t = s \mid s_0 = s'], $$

the performance difference lemma can be written as

$$ \mathcal{V}^\pi(\mu) - \mathcal{V}^{\pi'}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim v_\mu^{\pi'}}\sum_a \mathcal{Q}^\pi(s,a)\bigl(\pi(s,a) - \pi'(s,a)\bigr) = \frac{1}{1-\gamma}\bigl\langle \mathcal{Q}^\pi,\, \pi - \pi'\bigr\rangle_{v_\mu^{\pi'}}, \tag{2} $$

where the final inner product notation will often be employed for convenience.
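The lemma is easy to sanity-check numerically. The sketch below builds a random small MDP (all sizes and parameters hypothetical) and verifies the identity with exact linear-algebra solves, using a normalized visitation distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)   # transition kernel
R = rng.random((nS, nA))                                      # mean rewards
mu = np.ones(nS) / nS                                         # start distribution

def value(pi):
    Ppi = np.einsum('sa,sat->st', pi, P)       # state-to-state kernel under pi
    Rpi = (pi * R).sum(-1)
    V = np.linalg.solve(np.eye(nS) - gamma * Ppi, Rpi)
    Q = R + gamma * P @ V
    return V, Q, Ppi

def visitation(Ppi):
    # normalized visitation: v(s) = (1-gamma) * sum_t gamma^t Pr[s_t = s]
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * Ppi.T, mu)

pi  = rng.random((nS, nA)); pi  /= pi.sum(-1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(-1, keepdims=True)

V1, Q1, _ = value(pi)
V2, _, Ppi2 = value(pi2)
v = visitation(Ppi2)                           # visitation of the *second* policy
lhs = mu @ V1 - mu @ V2
rhs = (v[:, None] * Q1 * (pi - pi2)).sum() / (1 - gamma)
# lhs and rhs agree, matching the performance difference lemma
```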

As mentioned above, $p_\pi$ will denote the stationary distribution of a policy $\pi$ whenever it exists. The only relevant assumption we make here is that the maximum entropy optimal policy $\bar\pi$ is aperiodic and irreducible, which implies it has a stationary distribution with positive mass on every state (Levin et al., 2006, Chapter 1). Via Section 3, it follows that all policies in a KL ball around $\bar\pi$ also have stationary distributions with positive mass on every state.

The max entropy optimal policy $\bar\pi$ is complemented by a (unique) optimal $\mathcal{Q}$ function $\bar{\mathcal{Q}}$ and optimal advantage function $\bar{\mathcal{A}}$. The optimal $\mathcal{Q}$ function dominates all other $\mathcal{Q}$ functions, meaning $\bar{\mathcal{Q}} \ge \mathcal{Q}^\pi$ for any policy $\pi$; cf. Appendix A.

In a few places, we need the Markov chain on states, $P_\pi$, which is induced by a policy $\pi$: that is, the chain where given a state $s$, we sample $a \sim \pi(s,\cdot)$, and then transition to $s'$, where the latter sampling is via the MDP's transition kernel.

We use $\|\cdot\|_{\mathrm{tv}}$ to denote the total variation distance. This distance is pervasive in mixing time analyses (Levin et al., 2006).

## 2 Mirror descent tools

To see how nicely mirror descent and its guarantees fit with the NPG/NAC setup, first recall our updates: $p_{i+1} := p_i + \theta\hat{\mathcal{Q}}_i$, with $\pi_{i+1}$ the per-state softmax of $p_{i+1}$ (e.g., matching NPG in the tabular case (Kakade, 2001; Agarwal et al., 2021)). In the online learning literature (Shalev-Shwartz, 2011; Lattimore and Szepesvári, 2020), the basic mirror ascent (or dual averaging) guarantee is of the form

$$ \sum_{i<t}\bigl\langle g_i,\, \pi - \pi_i \bigr\rangle \le \frac{K(\pi, \pi_0)}{\theta} + \frac{\theta}{2}\sum_{i<t}\|g_i\|^2, $$

where notably $g_i$ does not need to mean anything; it can just be an element of a vector space. The most common results are stated when $g_i$ is the gradient of some convex function, but here instead we can use the performance difference lemma: recalling the inner product and visitation distribution notation from Section 1.3,

$$ \bigl\langle \hat{\mathcal{Q}}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi} = \bigl\langle \mathcal{Q}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi} + \bigl\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi} = (1-\gamma)\bigl(\mathcal{V}_i(\mu) - \mathcal{V}^\pi(\mu)\bigr) + \bigl\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi}. $$

The term $\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i, \pi_i - \pi\rangle_{v_\mu^\pi}$ is exactly what we will control with the TD analysis, and thus the mirror descent approach has neatly decoupled concerns into an actor term and a critic term.

In order to apply the mirror descent framework, we need to choose a mirror mapping. Rather than a plain softmax mapping, we use a choice which bakes the measure $v_\mu^\pi$ from the above inner product into the dual object! This may seem strange, but it does not change the induced policy, and thus is a degree of freedom, and allows us to state guarantees for an arbitrary comparator policy $\pi$.

Our full mirror descent setup is detailed in Appendix B, but culminates in the following guarantee.

Consider step size $\theta > 0$, any reference policy $\pi$, and two treatments of the critic error.

1. (Simplified bound.) For any starting measure $\mu$,

$$ K_{v_\mu^\pi}(\pi, \pi_t) + \theta(1-\gamma)\sum_{i<t}\bigl(\mathcal{V}^\pi(\mu) - \mathcal{V}_i(\mu)\bigr) \le K_{v_\mu^\pi}(\pi, \pi_0) + \frac{\theta^2 t}{(1-\gamma)^2} + 2\theta\sum_{i<t}\epsilon_i, $$

where $\epsilon_i$ denotes the critic error at iteration $i$.

2. (Refined bound.) Define the refined critic errors $\hat\epsilon_i$ (cf. Appendix B). For any starting measure $\mu$, an analogous bound holds with a sharper dependence on $\theta$ (detailed in Appendix B), and additionally $\mathcal{V}_i$ and $\mathcal{Q}_i$ are approximately monotone: for any state $s$ and action $a$,

$$ \mathcal{V}_{i+1}(s) \ge \mathcal{V}_i(s) - \frac{2\hat\epsilon_i}{1-\gamma} \qquad\text{and}\qquad \mathcal{Q}_{i+1}(s,a) \ge \mathcal{Q}_i(s,a) - \frac{2\gamma\hat\epsilon_i}{1-\gamma} - \hat\epsilon_i - \hat\epsilon_{i+1}. $$

[Regarding the mirror descent setup, Section 2]

1. Two rates. For the refined bound, it is most natural to set $\theta$ to a constant, which requires only $\Theta(1/\epsilon)$ iterations to reach accuracy $\epsilon$; by contrast, the simplified guarantee requires $\Theta(1/\epsilon^2)$ iterations for the same $\epsilon$. We used the simplified form to prove the main theorem of Section 1.1, since its TD error term is less stringent; indeed, the TD analysis we provide in Section 3 will not be able to give the uniform control needed for the refined bound. Still, we feel the refined bound is promising, and include it for the sake of completeness, future work, and comparison to prior work.

2. Comparison to standard rates. Comparing the refined bound (with all error terms set to zero) to the standard NPG rate in the literature (Agarwal et al., 2021), the $1/t$ rate is exactly recovered; as such, this mirror descent setup at the very least has not paid a price in rates.

3. Implicit regularization term. A conspicuous difference between these bounds and both the standard NPG bounds (cf. (Agarwal et al., 2021, Theorem 5.3)) and many mirror descent treatments is the term $K_{v_\mu^\pi}(\pi, \pi_t)$; one could argue that this term is nonnegative and moreover we care more about the value function, so why not drop it, as is usual? It is precisely this term that gives our implicit regularization effect: instead, we can drop the value function term and uniformly upper bound the right hand side to get a KL bound holding along the entire optimization path, which is how we control the entropy of the policy path and prove the main theorem of Section 1.1.
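As a toy illustration of this KL control (with entirely made-up critic values in a single-state problem), the dual averaging path below stays within a bounded KL distance of the uniform-over-argmax comparator for the whole run:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

k, t = 4, 400
theta = 1.0 / np.sqrt(t)
q_hat = np.array([1.0, 1.0, 0.2, 0.0])    # made-up critic values: two optimal actions
pi_bar = np.array([0.5, 0.5, 0.0, 0.0])   # high entropy comparator: uniform on argmax

p = np.zeros(k)
kls = []
for _ in range(t):
    p = p + theta * q_hat                 # dual averaging actor update
    pi = softmax(p)
    mask = pi_bar > 0
    kls.append(float((pi_bar[mask] * np.log(pi_bar[mask] / pi[mask])).sum()))
# KL(pi_bar, pi_i) stays below ln k along the entire path, and shrinks over time
```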

## 3 Sampling tools

Via Section 2 above, our mirror descent black box analysis gives us a KL bound and a value function bound: what remains, and is the job of this section, is to control the $\mathcal{Q}$ function estimation error, namely terms of the form $\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i, \pi_i - \pi\rangle_{v_\mu^\pi}$.

Our analysis here has two parts. The first part, given immediately below, is that any bounded KL ball in policy space has uniformly controlled mixing times; the second part, which follows thereafter, gives our TD guarantees.

Let policy $\tilde\pi$ be given, and suppose the induced transition kernel $P_{\tilde\pi}$ on states is irreducible and aperiodic (Levin et al., 2006, Section 1.3). Then $\tilde\pi$ has a stationary distribution $p_{\tilde\pi}$, and moreover for any $c > 0$, any measure $\nu$ which is positive on all states, and a corresponding set of policies

$$ \mathcal{P}_c := \bigl\{\pi : K_\nu(\tilde\pi, \pi) \le c\bigr\}, $$

there exist constants $m_1, m_2, C$ so that mixing is uniform over $\mathcal{P}_c$, meaning for any $\pi\in\mathcal{P}_c$ and any $t$, with induced transition probabilities $P_\pi^t$,

$$ \sup_s \bigl\|P_\pi^t(s,\cdot) - p_\pi\bigr\|_{\mathrm{tv}} \le m_1 e^{-m_2 t}, $$

and for any state $s$, any $\pi\in\mathcal{P}_c$, and any action $a$ with $\tilde\pi(s,a) > 0$,

$$ \frac{1}{C} \le \frac{\tilde\pi(s,a)}{\pi(s,a)} \le C \qquad\text{and}\qquad \frac{1}{C} \le \frac{p_{\tilde\pi}(s)}{p_\pi(s)} \le C. $$
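Geometric mixing of the form above is easy to visualize numerically; the chain below is a hypothetical lazy walk on three states (irreducible and aperiodic), and its worst-case TV distance to stationarity decays geometrically, i.e. like $m_1 e^{-m_2 t}$.

```python
import numpy as np

def tv_mixing_curve(Ppi, T):
    """sup_s || Ppi^t(s, .) - stationary ||_tv for t = 1, ..., T."""
    evals, evecs = np.linalg.eig(Ppi.T)
    stat = np.real(evecs[:, np.argmax(np.real(evals))])
    stat = stat / stat.sum()               # stationary distribution
    Pt = np.eye(Ppi.shape[0])
    curve = []
    for _ in range(T):
        Pt = Pt @ Ppi
        curve.append(0.5 * np.abs(Pt - stat).sum(axis=1).max())
    return np.array(curve)

# hypothetical lazy walk: every row keeps probability 1/2 on the current state
Ppi = np.array([[0.50, 0.50, 0.00],
                [0.25, 0.50, 0.25],
                [0.00, 0.50, 0.50]])
curve = tv_mixing_curve(Ppi, 30)           # geometric decay to stationarity
```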

[Implicit vs explicit exploration] On the surface, the preceding lemma might seem quite nice. Worrying about it a little more, and especially after inspecting the proof, it is clear that the constants $m_1$, $m_2$, and $C$ can be quite bad. On the one hand, one may argue that this is inherent to implicit exploration, and something like ϵ-greedy is preferable, as it arguably gives explicit control on all these quantities.

Some aspects of this situation are unavoidable, however. Consider a combination lock MDP, where a precise, hard to find sequence of actions must be followed to arrive at some good reward. Suppose this sequence has length $H$ and we have a reference policy which takes each of these good actions with probability $1/2$, whereby the probability of the sequence is $2^{-H}$; a policy within a KL ball of radius $c$ around the reference can drop the probability of this good sequence of actions exponentially lower still!
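The arithmetic can be sketched concretely; the numbers below are illustrative only, not the paper's constants.

```python
import math

H, p_ref, p_new = 20, 0.5, 0.25
# per-state KL of the reference action choice vs the perturbed one
# (Bernoulli distributions on taking the good action)
kl_state = p_ref * math.log(p_ref / p_new) \
    + (1 - p_ref) * math.log((1 - p_ref) / (1 - p_new))
kl_total = H * kl_state            # under 3 nats along the whole lock
succ_ref = p_ref ** H              # about 1e-6
succ_new = p_new ** H              # about 9e-13
ratio = succ_ref / succ_new        # success probability collapses by a factor 2^H
```

A KL budget of a few nats thus already permits the success probability of the lock to shrink by six orders of magnitude, which is why the constants in the lemma cannot be dimension-free in general.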

Next we present our TD analysis. As discussed in Section 1, by contrast with prior work, this analysis handles starting from an arbitrary state, and does not make use of any projections. The following guarantee is specialized to Algorithm 1; it is a corollary of a more general TD guarantee, given in Appendix C, which is stated without reference to Algorithm 1, and can be applied in other settings.

[See also Section C.2] Suppose the MDP and linear MDP assumptions (cf. Section 1.1). Consider a policy $\pi_i$ in some iteration $i$ of Algorithm 1, and suppose there exist mixing constants $m$ and $c$ so that the induced transition kernel on states satisfies

$$ \sup_s \bigl\|P_{\pi_i}^t(s,\cdot) - p_{\pi_i}\bigr\|_{\mathrm{tv}} \le m e^{-ct}. $$

Suppose the TD iteration count $N$ and step size $\eta$ satisfy

$$ N \ge k, \qquad \eta \le \frac{1}{400\sqrt{kN}}, \qquad\text{where } k = \Bigl\lceil \frac{\ln N + \ln m}{c} \Bigr\rceil. $$

Then the averaged TD iterate $\hat U_i$ satisfies

$$ \mathbb{E}_{\vec x, \vec r}\bigl\|\hat U_i - \bar U_i\bigr\|^2 + \eta N\, \mathbb{E}_{\vec x, \vec r}\, \mathbb{E}_{(s,a)\sim(p_{\pi_i},\pi_i)} \bigl\langle x_{sa},\, \hat U_i - \bar U_i \bigr\rangle^2 \le \frac{54}{(1-\gamma)^2}, $$

where $\bar U_i$ is the minimum norm fixed point of the expected TD iteration at stationarity (cf. Section C.2), and thus $\langle x_{sa}, \bar U_i\rangle = \mathcal{Q}_i(s,a)$ for any pair $(s,a)$.

The proof is intricate owing mainly to issues of statistical dependency. It is not merely an issue that the chain is not started from the stationary distribution; notice that the successive state/action pairs along the trajectory are all statistically dependent. Indeed, even if $s_j$ is sampled from the stationary distribution (which also means $s_{j+1}$ is distributed according to the stationary distribution as well), the conditional distribution of $s_{j+1}$ given $s_j$ is not in general stationary! To deal with such issues, the proof chooses a very small step size which ensures the TD estimate evolves much more slowly than the mixing time of the chain, and within the proof gaps are introduced in the chain so that inner products are only taken between iterates separated by roughly a mixing time, rather than between adjacent iterates. That said, many details need to be checked for this to go through.
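A bare-bones TD sketch in this spirit (tabular one-hot features, so the linear MDP condition holds trivially; all sizes and constants are illustrative): a single unbroken trajectory, a small constant step size, no projections, and an averaged iterate compared against the exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.8
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)   # transition kernel
R = rng.random((nS, nA))                                      # deterministic rewards
pi = np.full((nS, nA), 0.5)                                   # policy being evaluated

# exact Q^pi for reference: Q = R + gamma * P Pi Q, solved as a linear system
Ppi = np.einsum('sat,tb->satb', P, pi).reshape(nS * nA, nS * nA)
Q_true = np.linalg.solve(np.eye(nS * nA) - gamma * Ppi, R.reshape(-1)).reshape(nS, nA)

def feat(s, a):
    x = np.zeros(nS * nA)          # one-hot x_{sa}: tabular, so linearity is exact
    x[s * nA + a] = 1.0
    return x

eta, N = 0.02, 60_000
U = np.zeros(nS * nA)
U_sum = np.zeros_like(U)
s = 0
a = rng.choice(nA, p=pi[s])
for _ in range(N):                 # one unbroken trajectory: no resets, no projections
    s2 = rng.choice(nS, p=P[s, a])
    a2 = rng.choice(nA, p=pi[s2])
    x, x2 = feat(s, a), feat(s2, a2)
    U += eta * (R[s, a] + gamma * (x2 @ U) - x @ U) * x       # TD(0) update
    U_sum += U
    s, a = s2, a2
Q_td = (U_sum / N).reshape(nS, nA) # averaged iterate approaches Q_true
```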

A second component of the proof, which removes projection steps from prior work (Bhandari et al., 2018), is an implicit bias of TD, detailed as follows. Mirroring the mirror descent statement in Section 2, the left hand side above has not only the promised error term, but also a norm control on the iterates; in fact, this norm control holds for all intermediate TD iterations, and is used throughout the proof to control many error terms. Just like in the MD analysis, this term is an implicit regularization, and is how this work avoids the projection step needed in prior work (Bhandari et al., 2018).

All the pieces are now in place to sketch the proof of the main theorem in Section 1.1, which is presented in full in Appendix D. To start, instantiate the mixing lemma of Section 3 with the KL divergence upper bound $\ln k + 2/(1-\gamma)^2$, which gives the various mixing constants used throughout the proof (which we need to instantiate now, before seeing the sequence of policies, to avoid any dependence). With that out of the way, consider some iteration $i$, and suppose that for all iterations $j < i$, we have a handle both on the TD error, and also a guarantee that we are in a small KL ball around $\bar\pi$ (specifically, of the radius appearing in Section 1.1). The right hand side of the simplified mirror descent bound in Section 2 only needs a control on all previous TD errors, therefore it implies both a bound on the KL divergence and on the value functions at iteration $i$. But this KL control on $\pi_i$ means that the mixing and other constants we assumed at the start will hold for $\pi_i$, and thus we can invoke the TD guarantee of Section 3 to bound the error of $\hat{\mathcal{Q}}_i$, which we will use in the next loop of the induction. In this way, the actor and critic analyses complement each other and work together in each step of the induction.

There was one issue overlooked in the preceding paragraph. Notice that Section C.2 only grants an error control on average over pairs $(s,a)$ sampled from the stationary distribution of $\pi_i$ (which we mix towards thanks to Section 3). To control the error in Section 2, superficially we need something closer to a uniform error over pairs; within the proof, however, the only actions we need to consider end up being sampled from $\pi_i$ or from $\bar\pi$, and in the latter case we know explicitly that either the probability of an action is large (since $\bar\pi$ is the maximum entropy optimal policy), or it is $0$ and the error term vanishes. This reasoning is only sufficient for the simplified mirror descent bound in Section 2, and more sophisticated error controls would be needed to apply the refined bound.

## 4 Discussion and open problems

This work, in contrast to prior work in natural actor-critic and natural policy gradient methods, dropped many assumptions from the analysis, and components of the algorithms. The analysis was meant to be fairly general purpose and unoptimized. As such, there are many open problems.

##### Faster rates.

How much can this analysis be squeezed? Moreover, does it suggest any algorithmic improvements?

##### Implicit vs explicit regularization/exploration.

What are some situations where one is better than the other, and vice versa? The analysis here only says you can get away with doing everything implicitly, but not necessarily that this is the best option.

##### More general settings.

The paper here is for linear MDPs, linear softmax policies, finite state and action spaces. How much does the implicit bias phenomenon (and this analysis) help in more general settings?

##### Tightening the TD and MD coupling.

The proof of Section 1.1 here relied on a very tight coupling of the actor (mirror descent) and the critic (temporal difference). But perhaps the coupling can be made even tighter, both in the algorithm and the analysis?

#### Acknowledgments

MT thanks Nan Jiang and Tor Lattimore for helpful discussions in earlier phases of this work, and is grateful to the NSF for support under grant IIS-1750051.

## Appendix A Background proof: existence of $\bar\pi$

The only thing in this section is the expanded version of the claim from Section 1.1, namely giving the unique maximum entropy optimal policy $\bar\pi$, and some key properties.

There exists a unique maximum entropy optimal policy $\bar\pi$ and corresponding $\bar{\mathcal{Q}}$ and $\bar{\mathcal{A}}$ which satisfy the following properties.

1. For any state $s$, let $\mathcal{A}(s)$ denote the set of actions taken by optimal policies; define $\bar\pi(s,\cdot) := \mathrm{Uniform}(\mathcal{A}(s))$, which is unique; then $\bar\pi$ is also an optimal policy, and let $\bar{\mathcal{A}}$ and $\bar{\mathcal{Q}}$ denote its advantage and $\mathcal{Q}$ functions.

2. For every state $s$ and every action $a$, $\bar{\mathcal{Q}}(s,a) = \max_\pi \mathcal{Q}^\pi(s,a)$, where the maximum is taken over all policies. Moreover, $\bar{\mathcal{V}} = \max_\pi \mathcal{V}^\pi$.

3. $\lim_{r\to\infty} \phi\bigl(r\,\bar{\mathcal{A}}(s,\cdot)\bigr) = \bar\pi(s,\cdot)$ for every state $s$.

###### Proof.
1. We provide an iterative construction of $\bar\pi$. Start with $\pi_0$ equal to any optimal deterministic policy (which must exist as usual for MDPs), and consider any enumeration $(s_1, s_2, \ldots)$ of the set of states. The construction produces $\pi_j$ from $\pi_{j-1}$, and will assume by induction that $\pi_{j-1}$ is an optimal policy which for every state $s_l$ with $l < j$ satisfies $\pi_{j-1}(s_l,\cdot) = \mathrm{Uniform}(\mathcal{A}(s_l))$. The base case $\pi_0$ was handled directly, thus consider constructing $\pi_j$. Since Markov chains have no memory, the behavior in state $s_j$ is independent of the behavior in all prior and subsequent states; therefore we can safely define $\pi_j(s_j,\cdot) := \mathrm{Uniform}(\mathcal{A}(s_j))$ and $\pi_j(s_l,\cdot) := \pi_{j-1}(s_l,\cdot)$ for $l \neq j$, and $\pi_j$ is still an optimal policy. To complete the construction, set $\bar\pi$ to the resulting policy, and let $\bar{\mathcal{Q}}$ and $\bar{\mathcal{A}}$ correspond to it.

2. For any $\pi$ with corresponding $\mathcal{Q}$ function $\mathcal{Q}^\pi$ and value function $\mathcal{V}^\pi$, and any pair $(s,a)$, then

$$ \bar{\mathcal{Q}}(s,a) - \mathcal{Q}^\pi(s,a) = \mathbb{E}_{r,s'\sim(s,a)}\bigl[r + \gamma\bar{\mathcal{V}}(s') - r - \gamma\mathcal{V}^\pi(s')\bigr] = \gamma\,\mathbb{E}_{s'\sim(s,a)}\bigl[\bar{\mathcal{V}}(s') - \mathcal{V}^\pi(s')\bigr] \ge 0. $$

It follows that $\bar{\mathcal{Q}}(s,a) \ge \sup_\pi \mathcal{Q}^\pi(s,a)$, and since $\bar\pi$ is itself a policy, the supremum is a maximum and the inequality is an equality.

3. By the previous point, for any state $s$ and any $a\in\mathcal{A}(s)$, then $\bar{\mathcal{A}}(s,a) = 0$, whereas for any $a\notin\mathcal{A}(s)$, then $\bar{\mathcal{A}}(s,a) < 0$. It follows that

$$ \lim_{r\to\infty}\phi\bigl(r\,\bar{\mathcal{A}}(s,\cdot)\bigr) = \mathrm{Uniform}\bigl(\mathcal{A}(s)\bigr) = \bar\pi(s,\cdot). $$
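This limit is easy to check numerically; the advantage values below are hypothetical, with two optimal actions (advantage zero) and two suboptimal ones.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

A_bar = np.array([0.0, 0.0, -0.3, -1.2])   # optimal actions have zero advantage
p_small = softmax(1.0 * A_bar)             # far from uniform-over-argmax
p_large = softmax(1000.0 * A_bar)          # approaches Uniform({0, 1})
```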

## Appendix B Full mirror descent setup and proofs

This section first gives a basic mirror descent / dual averaging setup. This characterization is mostly standard (Bubeck, 2015), though the main guarantees are given with some flexibility to allow for various natural policy setups.

First, here is the basic notation (which, unlike the paper body, will allow the step size to differ between iterations):

$$ p_{i+1} := p_i - \theta_i g_i \text{ with } \theta_i > 0, \qquad q_i := \nabla\psi(p_i) \text{ for a closed proper convex } \psi, $$

together with a bilinear pairing $\langle\cdot,\cdot\rangle$, a primal Bregman divergence $D(p,p_i)$, and a dual Bregman divergence $D^*(q,q_i)$.

One nonstandard choice here is that the Bregman divergence bakes in a conjugate element, rather than using only primal points and gradients of $\psi$; this gives an easy way to handle certain settings (like the boundary of the simplex) which run into non-uniqueness issues. Secondly, $\langle\cdot,\cdot\rangle$ is just some bilinear form and need not be interpreted as a standard inner product.

The standard Bregman identities used in mirror descent proofs are as follows:

$$ \theta_i\bigl\langle g_i,\, q_{i+1} - q \bigr\rangle = D^*(q, q_i) - D^*(q, q_{i+1}) - D^*(q_{i+1}, q_i), \tag{3} $$

$$ D^*(q_{i+1}, q_i) = D(p_i, p_{i+1}), \tag{4} $$

$$ D^*(q_{i+1}, q_i) + D^*(q_i, q_{i+1}) = \theta_i\bigl\langle g_i,\, q_i - q_{i+1} \bigr\rangle, \tag{5} $$

$$ D^*(q, q_i) \ge 0. \tag{6} $$

With these in hand, the core mirror descent guarantee is as follows. The bound is written with equalities to allow for careful handling of error terms. Note that this version of mirror descent does not interpret the "gradient" $g_i$ in any way, and treats it as a vector and no more.

Suppose $\theta_i > 0$ for every $i$. For any $t$ and any $q$ in the dual space,

$$ \sum_{i<t}\theta_i\bigl\langle g_i,\, q_i - q\bigr\rangle = D^*(q,q_0) - D^*(q,q_t) + \sum_{i<t}\Bigl(\theta_i\bigl\langle g_i,\, q_i - q_{i+1}\bigr\rangle - D(p_i, p_{i+1})\Bigr). $$

Moreover, for any $i$, the summand $\theta_i\langle g_i, q_i - q_{i+1}\rangle - D(p_i, p_{i+1})$ is nonnegative.

###### Proof.

For any fixed iterate $i$, combining eqs. (3), (4) and (5),

$$ \theta_i\bigl\langle g_i,\, q_i - q\bigr\rangle = D^*(q,q_i) - D^*(q,q_{i+1}) + \theta_i\bigl\langle g_i,\, q_i - q_{i+1}\bigr\rangle - D(p_i, p_{i+1}). $$

The first equalities now follow by applying $\sum_{i<t}$ to both sides and telescoping.

For the second part, for any $i$, convexity of $\psi$ yields an inequality which rearranges to give the claim since $\theta_i > 0$. ∎

All that remains is to instantiate the various mirror descent objects to match Algorithm 1, and control the various resulting terms. This culminates in Section 2; its proof is as follows.

###### Proof of Section 2.

The core of both parts of the proof is to apply the mirror descent guarantees from Appendix B, using the following choices:

$$ g_i := -\hat{\mathcal{Q}}_i, \qquad p_{i+1} := p_i - \theta g_i = p_i + \theta\hat{\mathcal{Q}}_i, \qquad \langle p, q\rangle := \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} p(s,a)\, q(s,a), $$

$$ \psi(p) := \mathbb{E}_{s\sim v_\mu^\pi}\ln\sum_{a\in\mathcal{A}}\exp(p(s,a)) = \sum_{s\in\mathcal{S}} v_\mu^\pi(s)\ln\sum_{a\in\mathcal{A}}\exp(p(s,a)), $$

$$ \nabla\psi(p)(s,a) = v_\mu^\pi(s)\,\frac{\exp(p(s,a))}{\sum_{b\in\mathcal{A}}\exp(p(s,b))}, \qquad q_i := \nabla\psi(p_i), \qquad q_\pi(s,a) := v_\mu^\pi(s)\,\pi(s,a), $$

with $D(p,p_i)$, $\psi^*(q)$, and $D^*(q,q_i)$ the induced primal Bregman divergence, convex conjugate, and dual Bregman divergence.

A key consequence of these constructions is that $q_i$, treated for any fixed $s$ as an unnormalized policy, agrees with $\pi_i$ after normalization; that is to say, it gives the same policy, and the choice of $v_\mu^\pi$ baked into the definition is not needed by the algorithm and is only used in the analysis. The "gradient" $g_i$ makes no use of it.

Plugging this notation into Appendix B while making use of two of its equalities, and the performance difference lemma eq. (2), then for any comparator $\pi$,

 θ(1−γ)∑i

The proof now splits into the two different settings.

1. (Simplified bound.) By the above definitions and the first equality in Appendix B,

 =D∗(q,qt)−D∗(q,q0)−∑i

where the last term may be bounded in a way common in the online learning literature (Shalev-Shwartz, 2011): since when , setting for convenience (whereby as needed by the preceding inequality),

 D(pi+1,pi) =\bbEs∼vμπ\delln∑aexp(pi+1(s,a))−ln∑aexp(pi(s,a))−∑aπi(s,a)(pi+1(s,a)−pi(s,a)) =\bbEs∼vμπ\delln\del∑aπi(s,a)exp\delθˆ\cQi(s,