Model-based Offline Reinforcement Learning with Local Misspecification

01/26/2023
by Kefan Dong, et al.

We present a policy performance lower bound for model-based offline reinforcement learning that explicitly captures dynamics model misspecification and distribution mismatch, and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimistic approximations to the value function. Our key insight is to jointly select over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, the value of that particular policy can be approximated. We analyze our lower bound in the LQR setting and also show that it is competitive with previous lower bounds for policy selection across a set of D4RL tasks.
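To make the joint model-policy selection idea concrete, here is a minimal, self-contained sketch in a toy tabular MDP. It illustrates the general principle rather than the authors' algorithm: every name and constant below (the synthetic candidate models and policies, the penalty coefficient lam, and the use of the true dynamics as a stand-in for a data-driven model-error estimate) is a hypothetical placeholder. Each policy is scored by a pessimistic lower bound, its value under a candidate model minus a penalty on that model's error weighted by the policy's own visitation, and the selected policy is the one whose best bound over candidate models is largest.

```python
import numpy as np

# Toy tabular MDP: S states, A actions. Everything here is synthetic.
S, A, gamma, horizon = 5, 2, 0.95, 50
rng = np.random.default_rng(0)

true_P = rng.dirichlet(np.ones(S), size=(S, A))   # ground-truth dynamics P[s, a, s']
reward = rng.uniform(size=(S, A))                 # reward r(s, a)

# Candidate dynamics models (in practice: fit on logged data) and candidate policies.
noisy = [np.clip(true_P + rng.normal(0, 0.05, true_P.shape), 1e-8, None) for _ in range(3)]
models = [P / P.sum(-1, keepdims=True) for P in noisy]
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(4)]

def value_and_visitation(P, pi):
    """Model-based value estimate and discounted state-action visitation of pi under P."""
    s_dist = np.full(S, 1.0 / S)                  # uniform initial state distribution
    value, visitation = 0.0, np.zeros((S, A))
    for t in range(horizon):
        sa = s_dist[:, None] * pi                 # state-action occupancy at step t
        value += (gamma ** t) * (sa * reward).sum()
        visitation += (gamma ** t) * sa
        s_dist = np.einsum("sa,san->n", sa, P)    # push state distribution through the model
    return value, visitation

def lower_bound(P, pi, lam=2.0):
    """Pessimistic bound: model value minus a penalty on *local* model error,
    i.e. total-variation error weighted by the policy's own visitation."""
    value, visitation = value_and_visitation(P, pi)
    tv_error = 0.5 * np.abs(P - true_P).sum(-1)   # stand-in for a data-driven error estimate
    return value - lam * (visitation * tv_error).sum()

# Joint selection: each policy is credited with its best bound over candidate models.
scores = [max(lower_bound(P, pi) for P in models) for pi in policies]
best = int(np.argmax(scores))
print("selected policy index:", best, "lower bound:", scores[best])
```

Taking the maximum over models separately for each policy is what makes the selection joint: a model only needs to be accurate where that particular policy goes, so a policy is not penalized for a model's errors outside its own visitation.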


Related research

02/16/2021 · COMBO: Conservative Offline Model-Based Policy Optimization
Model-based algorithms, which learn a dynamics model from logged experie...

06/04/2022 · Hybrid Value Estimation for Off-policy Evaluation and Offline Reinforcement Learning
Value function estimation is an indispensable subroutine in reinforcemen...

09/15/2021 · DROMO: Distributionally Robust Offline Model-based Policy Optimization
We consider the problem of offline reinforcement learning with model-bas...

10/12/2022 · A Unified Framework for Alternating Offline Model Training and Policy Learning
In offline model-based reinforcement learning (offline MBRL), we learn a...

08/25/2020 · Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning
This paper aims to establish an entropy-regularized value-based reinforc...

07/04/2021 · Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
Tree Search (TS) is crucial to some of the most influential successes in...

07/10/2019 · An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies
What is a good exploration strategy for an agent that interacts with an ...
