Mind the Gap: Offline Policy Optimization for Imperfect Rewards

02/03/2023
by Jianxiong Li, et al.

The reward function is essential in reinforcement learning (RL), serving as the guiding signal that incentivizes agents to solve given tasks, yet it is notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss on RL agents. In this study, we propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards.
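
The bi-level structure described above can be written schematically as follows. This is only an illustrative sketch, not the formulation from the full text: every symbol is an assumption introduced here, since the abstract gives no notation. We write \tilde r for the imperfect reward, \Delta r for the reward correction term, d^{\pi} for the state-action visitation distribution of policy \pi, d^{E} for that of the expert data, d^{\mathcal{D}} for that of the offline dataset, D for some divergence between distributions, and \alpha > 0 for a regularization weight. Modeling pessimism as a divergence penalty toward the offline data is likewise an assumed, commonly used choice.

```latex
% Schematic only: all notation is assumed, not taken from the abstract.
\begin{aligned}
\text{(upper level)}\quad
  & \min_{\Delta r}\; D\big(d^{\pi^*_{\Delta r}} \,\|\, d^{E}\big) \\
\text{(lower level)}\quad
  & \pi^*_{\Delta r} \in \arg\max_{\pi}\;
    \mathbb{E}_{(s,a)\sim d^{\pi}}\big[\tilde r(s,a) + \Delta r(s,a)\big]
    \;-\; \alpha\, D\big(d^{\pi} \,\|\, d^{\mathcal{D}}\big)
\end{aligned}
```

Read this way, the upper level adjusts \Delta r so that the policy induced by the corrected reward matches the expert's visitation distribution, while the lower level keeps that policy close to the data it was trained on.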


