
Offline Reinforcement Learning with Fisher Divergence Critic Regularization
Many modern approaches to offline Reinforcement Learning (RL) utilize behavior regularization, typically augmenting a model-free actor-critic algorithm with a penalty measuring divergence of the policy from the offline data. In this work, we propose an alternative approach to encouraging the learned policy to stay close to the data, namely parameterizing the critic as the log-behavior-policy, which generated the offline data, plus a state-action value offset term, which can be learned using a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature. We thus term our resulting algorithm Fisher-BRC (Behavior Regularized Critic). On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.
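The critic parameterization described in the abstract can be sketched concretely. Below is a minimal, hedged illustration (not the paper's implementation): a fixed 1-D Gaussian stands in for the learned behavior policy, and a linear function stands in for the neural-network offset term, so the gradient-penalty regularizer on the offset can be computed by a simple finite difference. All names here (`offset`, `gradient_penalty`, the weight vector `w`) are illustrative assumptions.

```python
import numpy as np

def log_behavior_policy(action, mean=0.0, std=1.0):
    # Log-density of a 1-D Gaussian behavior policy pi_b(a|s);
    # in Fisher-BRC this would be a learned density model of the offline data.
    return -0.5 * ((action - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

def offset(state, action, w):
    # Toy linear state-action offset term O_w(s, a) = w0*s + w1*a,
    # standing in for the neural-network offset in the paper.
    return w[0] * state + w[1] * action

def critic(state, action, w):
    # Critic parameterization: Q(s, a) = log pi_b(a|s) + O_w(s, a).
    return log_behavior_policy(action) + offset(state, action, w)

def gradient_penalty(state, action, w, eps=1e-5):
    # ||grad_a O_w(s, a)||^2, estimated by a central finite difference.
    # Penalizing this action-gradient of the offset is the regularizer
    # the abstract relates to Fisher divergence regularization.
    g = (offset(state, action + eps, w) - offset(state, action - eps, w)) / (2 * eps)
    return g ** 2
```

For this linear offset the penalty is simply `w[1]**2`, independent of the state and action; with a neural-network offset the same quantity would be obtained via automatic differentiation and averaged over the offline dataset.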