EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

Off-policy reinforcement learning (RL) holds the promise of sample-efficient learning of decision-making policies by leveraging past experience. However, in the offline RL setting, where a fixed collection of interactions is provided and no further interactions are allowed, standard off-policy RL methods have been shown to significantly underperform. Recently proposed methods aim to address this shortcoming by regularizing learned policies to remain close to the given dataset of interactions. However, these methods involve several configurable components, such as learning a separate policy network on top of a behavior cloning actor, and explicitly constraining the action space through clipping or reward penalties. Striving for both simplicity and performance, in this work we present a novel backup operator, Expected-Max Q-Learning (EMaQ), which naturally restricts learned policies to the support of the offline dataset without any explicit regularization, while retaining desirable theoretical properties such as contraction. We demonstrate that EMaQ is competitive with Soft Actor-Critic (SAC) in online RL, and surpasses SAC in the deployment-efficient setting. In the offline RL setting, the main focus of this work, EMaQ enables important observations about key components of offline RL and the nature of standard benchmark tasks. Finally, we observe that EMaQ achieves state-of-the-art performance with fewer moving parts, such as one fewer function approximator, making it a strong yet easy-to-implement baseline for future work.
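The abstract does not spell the operator out, but the idea behind an expected-max backup can be sketched as follows: instead of maximizing Q over the entire action space, the target maximizes over N candidate actions sampled from a generative behavior model fit to the offline dataset, so the implied policy only ever selects in-support actions. The snippet below is a minimal illustrative sketch under that reading, not the authors' implementation; the function name and scalar single-transition form are choices made here for clarity.

```python
import numpy as np

def emaq_backup(reward, gamma, q_next_candidates, done=0.0):
    """Expected-max backup target for a single transition (illustrative sketch).

    q_next_candidates: Q(s', a'_i) for N candidate actions a'_i drawn from a
    behavior model mu(.|s') fit to the offline data. Maximizing over these
    sampled, in-support actions (rather than over the whole action space) is
    what keeps the learned policy within the dataset's support without any
    explicit regularization term.
    """
    return reward + gamma * (1.0 - done) * np.max(q_next_candidates)

# Toy example: N = 3 behavior samples with next-state Q-values
# [1.0, 2.5, 2.0], reward 1.0, gamma 0.9 -> target = 1.0 + 0.9 * 2.5 = 3.25
target = emaq_backup(1.0, 0.9, np.array([1.0, 2.5, 2.0]))
```

Under this reading, N acts as a single interpolation knob: N = 1 recovers evaluation of the behavior policy itself, while larger N approaches the standard max backup, which is plausibly why the operator needs no separate policy network or action-space clipping.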

