Beyond Reward: Offline Preference-guided Policy Optimization

05/25/2023
by Yachen Kang, et al.

This study focuses on offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with both online interaction and the specification of reward functions. Instead, the agent is given pre-existing offline trajectories and human preferences between pairs of trajectories, from which it extracts dynamics and task information, respectively. Since dynamics and task information are orthogonal, a naive approach would apply preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires separately learning a scalar reward function, which is assumed to act as an information bottleneck. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a single step, eliminating the need to learn a reward function separately. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By optimizing the two objectives iteratively, OPPO further arrives at a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either the true reward function or a learned pseudo-reward. Our code is available at https://github.com/bkkgbkjb/OPPO .

Related research

- Offline Preference-Based Apprenticeship Learning (07/20/2021): We study how an offline dataset of prior (possibly random) experience ca...
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems (07/24/2023): A crucial task in decision-making problems is reward engineering. It is ...
- Inverse Preference Learning: Preference-based RL without a Reward Function (05/24/2023): Reward functions are difficult to design and often hard to align with hu...
- Learning Control Policies for Variable Objectives from Offline Data (08/11/2023): Offline reinforcement learning provides a viable approach to obtain adva...
- Benchmarks and Algorithms for Offline Preference-Based Reward Learning (01/03/2023): Learning a reward function from human preferences is challenging as it t...
- Preferences Implicit in the State of the World (02/12/2019): Reinforcement learning (RL) agents optimize only the features specified ...
- Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models (01/11/2023): Preference-based reinforcement learning (PbRL) can enable robots to lear...
