Efficient Offline Policy Optimization with a Learned Model

10/12/2022
by Zichen Liu, et al.

MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, and thus incurs a huge compute cost. This paper investigates a few hypotheses about where MuZero Unplugged may not work well in offline RL settings: 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) learning with an improperly parameterized model given the offline data; 4) learning under a low compute budget. We propose a regularized one-step look-ahead approach to tackle these issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimate based on a one-step rollout. Policy improvement proceeds in the direction that maximizes the estimated advantage, regularized towards the dataset behavior. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43% and requires only a fraction of the wall-clock time (i.e., 1 hour, versus 17.8 hours for MuZero Unplugged) to achieve a 150% normalized score with the same hardware and software stacks.
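The regularized one-step look-ahead idea can be sketched as follows. This is a minimal illustrative assumption, not the paper's exact formulation: it estimates each action's advantage from a single model rollout, A(s, a) ≈ r(s, a) + γV(s') − V(s), and weights a prior (dataset-regularized) policy by the exponentiated advantage to form an improved policy target. The function name, the temperature `beta`, and the exponential weighting are all hypothetical choices for this sketch.

```python
import numpy as np

def one_step_policy_target(prior, rewards, next_values, value_s,
                           gamma=0.99, beta=1.0):
    """Regularized one-step look-ahead policy improvement (a sketch).

    prior:       dataset-regularized prior policy probabilities, shape (A,)
    rewards:     learned-model reward predictions r(s, a), shape (A,)
    next_values: learned values V(s') of the one-step model rollouts, shape (A,)
    value_s:     learned value V(s) of the current state (scalar)

    Returns an improved policy target proportional to
    prior(a) * exp(advantage(a) / beta), so actions whose one-step rollout
    looks better than the current value estimate gain probability mass,
    while the prior keeps the target close to the dataset behavior.
    """
    # One-step advantage estimate from a single model rollout (no MCTS).
    advantages = rewards + gamma * next_values - value_s
    # Work in log-space and shift by the max for numerical stability.
    logits = np.log(prior + 1e-12) + advantages / beta
    logits -= logits.max()
    target = np.exp(logits)
    return target / target.sum()

# Toy usage: four actions, only action 1 yields reward under the model.
prior = np.array([0.25, 0.25, 0.25, 0.25])
rewards = np.array([0.0, 1.0, 0.0, 0.0])
target = one_step_policy_target(prior, rewards, np.zeros(4), value_s=0.0)
# target concentrates on action 1, the highest one-step advantage
```

The policy network would then be trained towards this target (e.g., by cross-entropy), which replaces the deep MCTS search with a single model step per action.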


