CASA-B: A Unified Framework of Model-Free Reinforcement Learning

05/09/2021
by Changnan Xiao, et al.

Building on recent breakthroughs in reinforcement learning, this paper introduces a unified framework of model-free reinforcement learning, CASA-B, Critic AS an Actor with Bandits Vote Algorithm. CASA-B is an actor-critic framework that estimates the state-value, the state-action-value and the policy. An expectation-correct Doubly Robust Trace is introduced to learn the state-value and the state-action-value, and its convergence properties are guaranteed. We prove that CASA-B integrates a consistent path for policy evaluation and policy improvement. The policy evaluation is equivalent to a compensational policy improvement, which alleviates the function approximation error, and is also equivalent to an entropy-regularized policy improvement, which prevents the policy from collapsing to a suboptimal solution. Building on this design, we find that the entropy of the behavior policies and the entropy of the target policy are disentangled. Based on this observation, we propose a progressive closed-form entropy control mechanism, which explicitly controls the behavior policies' entropy within an arbitrary range. Our experiments show that CASA-B is highly sample-efficient and achieves state-of-the-art performance on the Arcade Learning Environment. Our mean Human Normalized Score is 6456.63 under the 200M training scale.
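The abstract does not state the closed form of the entropy control mechanism, so the following is only a rough sketch of the underlying idea that a behavior policy's entropy can be steered independently of the target policy. It assumes, purely for illustration, that both policies are softmax distributions over the same advantage-like scores and differ only in a temperature; the function names, the example scores, and the bisection search are all hypothetical stand-ins and are not the paper's mechanism.

```python
import math

def softmax(values, temperature):
    """Softmax over values / temperature, computed in a numerically stable way."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def temperature_for_entropy(values, target, lo=1e-3, hi=1e3, iters=60):
    """Bisect in log-space for a temperature whose softmax entropy matches target.

    Relies on softmax entropy being non-decreasing in the temperature.
    Bisection is an illustrative stand-in, not a closed-form solution.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if entropy(softmax(values, mid)) < target:
            lo = mid  # too peaked: raise the temperature
        else:
            hi = mid  # too flat: lower the temperature
    return math.sqrt(lo * hi)

# Hypothetical advantage-like scores for four actions (not from the paper).
advantages = [1.0, 0.5, -0.2, -1.3]

# Target policy at a fixed temperature of 1 (an assumption).
target_policy = softmax(advantages, 1.0)

# Behavior policy tuned to a chosen entropy without touching the target policy.
tau = temperature_for_entropy(advantages, target=1.2)
behavior_policy = softmax(advantages, tau)
```

In this toy setup the target policy is left unchanged while the behavior policy's entropy is set to the requested value, which is the sense in which the two entropies are decoupled here. The achievable range is bounded above by the log of the number of actions.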


research
02/14/2019

Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning

We propose a new policy iteration theory as an important extension of so...
research
06/01/2021

An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning

Policy-based reinforcement learning methods suffer from the policy colla...
research
02/28/2017

Bridging the Gap Between Value and Policy Based Reinforcement Learning

We establish a new connection between value and policy based reinforceme...
research
11/05/2019

Quinoa: a Q-function You Infer Normalized Over Actions

We present an algorithm for learning an approximate action-value soft Q-...
research
03/11/2021

A Quadratic Actor Network for Model-Free Reinforcement Learning

In this work we discuss the incorporation of quadratic neurons into poli...
research
01/31/2019

Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

In this paper, we present a new class of Markov decision processes (MDPs...
research
03/18/2021

Maximum Entropy Reinforcement Learning with Mixture Policies

Mixture models are an expressive hypothesis class that can approximate a...
