
CASA-B: A Unified Framework of Model-Free Reinforcement Learning

by Changnan Xiao, et al.

Building on recent breakthroughs in reinforcement learning, this paper introduces CASA-B (Critic AS an Actor with Bandits Vote Algorithm), a unified framework for model-free reinforcement learning. CASA-B is an actor-critic framework that estimates the state-value function, the state-action-value function, and the policy. We introduce an expectation-correct Doubly Robust Trace to learn the state-value and state-action-value functions, with guaranteed convergence properties. We prove that CASA-B provides a consistent path for policy evaluation and policy improvement. The policy evaluation is equivalent to a compensational policy improvement, which alleviates the function approximation error, and is also equivalent to an entropy-regularized policy improvement, which prevents the policy from collapsing to a suboptimal solution. Building on this design, we find that the entropy of the behavior policies and the entropy of the target policy are disentangled. Based on this observation, we propose a progressive closed-form entropy control mechanism, which explicitly controls the behavior policies' entropy within an arbitrary range. Our experiments show that CASA-B is highly sample-efficient and achieves state-of-the-art performance on the Arcade Learning Environment. Our mean Human Normalized Score is 6456.63 at the 200M training scale.
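
For readers unfamiliar with the metric, the sketch below shows how a mean Human Normalized Score is commonly computed in the Atari literature: each game's raw score is rescaled so that a random agent maps to 0 and the human baseline maps to 100, and the results are averaged over games. The function names, the game labels, and every number in the example are hypothetical placeholders, not values from the paper, and expressing the score as a percentage is an assumption about the reporting convention.

```python
from statistics import mean


def human_normalized_score(agent_score: float,
                           random_score: float,
                           human_score: float) -> float:
    """Rescale a raw game score so random play = 0 and human play = 100."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


def mean_hns(per_game: dict[str, tuple[float, float, float]]) -> float:
    """Average the human normalized score over all games.

    Each value is (agent_score, random_score, human_score).
    """
    return mean(human_normalized_score(*scores) for scores in per_game.values())


# Hypothetical placeholder scores, NOT results from the paper.
example_scores = {
    "game_a": (300.0, 100.0, 200.0),  # twice the human margin over random
    "game_b": (50.0, 0.0, 100.0),     # half the human margin over random
}

print(mean_hns(example_scores))
```

Because the mean is taken over unbounded per-game ratios, a few games with very large normalized scores can dominate it, which is why the median is often reported alongside the mean.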




An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning

Policy-based reinforcement learning methods suffer from the policy colla...

Bridging the Gap Between Value and Policy Based Reinforcement Learning

We establish a new connection between value and policy based reinforceme...

Quinoa: a Q-function You Infer Normalized Over Actions

We present an algorithm for learning an approximate action-value soft Q-...

A Quadratic Actor Network for Model-Free Reinforcement Learning

In this work we discuss the incorporation of quadratic neurons into poli...

Offline Reinforcement Learning with Pseudometric Learning

Offline Reinforcement Learning methods seek to learn a policy from logge...

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Off-policy reinforcement learning aims to leverage experience collected ...