CASA-B: A Unified Framework of Model-Free Reinforcement Learning

05/09/2021

Building on recent breakthroughs in reinforcement learning, this paper introduces CASA-B (Critic AS an Actor with Bandits Vote Algorithm), a unified framework for model-free reinforcement learning. CASA-B is an actor-critic framework that estimates the state value, the state-action value, and the policy. An expectation-correct Doubly Robust Trace is introduced to learn the state value and the state-action value, with guaranteed convergence properties. We prove that CASA-B integrates a consistent path for policy evaluation and policy improvement. The policy evaluation is equivalent to a compensational policy improvement, which alleviates the function approximation error, and is also equivalent to an entropy-regularized policy improvement, which prevents the policy from collapsing to a suboptimal solution. Building on this design, we find that the entropies of the behavior policies and of the target policy are disentangled. Based on this observation, we propose a progressive closed-form entropy control mechanism, which explicitly controls the entropy of the behavior policies within an arbitrary range. Our experiments show that CASA-B is highly sample-efficient and achieves state-of-the-art performance on the Arcade Learning Environment. Our mean Human Normalized Score is 6456.63 at the 200M training scale.
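For readers unfamiliar with the reported metric, the sketch below shows how a mean Human Normalized Score is conventionally computed on Atari benchmarks: each game's raw score is rescaled so that a random agent maps to 0 and the human baseline maps to 100, and the results are averaged over games. This is a minimal illustration, not the paper's evaluation code. The function name, the game names, and all score values are hypothetical placeholders, and expressing the result as a percentage is an assumption about the convention used.

```python
from statistics import mean


def human_normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """Conventional per-game normalization: 0 = random agent, 100 = human baseline."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


# Hypothetical placeholder scores; these are NOT results from the paper.
scores = {
    "game_a": {"agent_score": 500.0, "random_score": 100.0, "human_score": 300.0},
    "game_b": {"agent_score": 40.0, "random_score": 0.0, "human_score": 80.0},
}

# Normalize each game separately, then average across games.
per_game = {game: human_normalized_score(**s) for game, s in scores.items()}
mean_hns = mean(per_game.values())

print(per_game)
print(mean_hns)
```

Because the mean is taken over per-game ratios, a few games with very large ratios can dominate it, which is why the median is often reported alongside the mean.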

Related Research

Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning

An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning

Bridging the Gap Between Value and Policy Based Reinforcement Learning

Quinoa: a Q-function You Infer Normalized Over Actions

A Quadratic Actor Network for Model-Free Reinforcement Learning

Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

Maximum Entropy Reinforcement Learning with Mixture Policies