You Only Evaluate Once: a Simple Baseline Algorithm for Offline RL

10/05/2021
by Wonjoon Goo, et al.

The goal of offline reinforcement learning (RL) is to find an optimal policy given prerecorded trajectories. Many current approaches customize existing off-policy RL algorithms, especially actor-critic algorithms that iterate between policy evaluation and policy improvement. However, the convergence of such approaches is not guaranteed, because of complex non-linear function approximation and an intertwined optimization process. By contrast, we propose a simple baseline algorithm for offline RL that performs the policy evaluation step only once, so that it does not require complex stabilization schemes. Since the proposed algorithm is unlikely to converge to an optimal policy, it serves as an appropriate baseline: actor-critic algorithms ought to outperform it if iterative optimization indeed adds value in the offline setting. Surprisingly, we empirically find that the proposed algorithm exhibits competitive, and sometimes even state-of-the-art, performance on a subset of the D4RL offline RL benchmark. This result suggests that future work is needed to fully exploit the potential advantages of iterative optimization in order to justify the reduced stability of such methods.
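The abstract describes the idea only at a high level: estimate the value of the behavior policy with a single round of policy evaluation, then take one improvement step instead of iterating. The sketch below illustrates that recipe in a tabular toy setting; the SARSA-style evaluation, the greedy extraction restricted to observed actions, and the dataset format are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of a "evaluate once, improve once" offline RL baseline,
# assuming a small tabular MDP and a log of (s, a, r, s_next, a_next, done)
# transitions. Illustrative only; not the authors' exact algorithm.
import numpy as np

def evaluate_behavior_policy(dataset, n_states, n_actions, gamma=0.99, sweeps=200, lr=0.1):
    """Estimate Q^beta of the behavior policy from the logged transitions."""
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, a_next, done in dataset:
            target = r if done else r + gamma * q[s_next, a_next]
            q[s, a] += lr * (target - q[s, a])  # TD(0) update toward the SARSA target
    return q

def extract_policy(q, dataset):
    """A single greedy improvement step, restricted to actions seen in the data."""
    seen = np.zeros(q.shape, dtype=bool)
    for s, a, *_ in dataset:
        seen[s, a] = True
    masked_q = np.where(seen, q, -np.inf)  # never pick an action the dataset never tried
    return masked_q.argmax(axis=1)

# Hypothetical toy dataset on a 2-state, 2-action MDP.
dataset = [
    (0, 0, 0.0, 1, 1, False),
    (1, 1, 1.0, 1, 1, False),
    (1, 0, 0.0, 0, 0, True),
    (0, 1, 0.5, 0, 0, True),
]
q_beta = evaluate_behavior_policy(dataset, n_states=2, n_actions=2)
print("greedy policy w.r.t. Q^beta:", extract_policy(q_beta, dataset))
```

Restricting the improvement step to in-dataset actions mirrors the usual offline RL concern about out-of-distribution actions; because evaluation is done only once, no further stabilization of the critic is needed.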

research
06/16/2021

Offline RL Without Off-Policy Evaluation

Most prior approaches to offline reinforcement learning (RL) have taken ...
research
10/04/2021

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble

Offline reinforcement learning (offline RL), which aims to find an optim...
research
03/22/2021

Provably Correct Optimization and Exploration with Non-linear Policies

Policy optimization methods remain a powerful workhorse in empirical Rei...
research
08/16/2021

Optimal Actor-Critic Policy with Optimized Training Datasets

Actor-critic (AC) algorithms are known for their efficacy and high perfo...
research
05/13/2022

Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets

Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL ...
research
07/21/2020

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

Off-policy reinforcement learning (RL) holds the promise of sample-effic...
research
02/12/2021

Q-Value Weighted Regression: Reinforcement Learning with Limited Data

Sample efficiency and performance in the offline setting have emerged as...
