Provable Offline Reinforcement Learning with Human Feedback

05/24/2023
by Wenhao Zhan, et al.

In this paper, we investigate the problem of offline reinforcement learning with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
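
The two steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a finite set of trajectories, a finite candidate reward class standing in for general function approximation, a logistic (Bradley-Terry style) preference model, a confidence set defined by a log-likelihood threshold `zeta`, and a fixed reference distribution used to remove the shift ambiguity of preference-based rewards. All names (`log_likelihood`, `pessimistic_policy`, `zeta`, `reference`) are hypothetical.

```python
import numpy as np


def log_likelihood(reward, pairs, prefs):
    """Log-likelihood of preference labels under an assumed logistic model.

    reward: array (n_traj,), candidate reward of each whole trajectory
    pairs:  int array (n, 2), indices (tau0, tau1) of the compared trajectories
    prefs:  array (n,), 1 if tau1 was preferred over tau0, else 0
    """
    diff = reward[pairs[:, 1]] - reward[pairs[:, 0]]
    # Flip the sign so that `signed` is the margin in favour of the observed label.
    signed = np.where(prefs == 1, diff, -diff)
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably.
    return -np.sum(np.logaddexp(0.0, -signed))


def pessimistic_policy(reward_class, policies, reference, pairs, prefs, zeta):
    """Step 1: MLE over the reward class. Step 2: robust (max-min) planning.

    reward_class: list of arrays (n_traj,), the candidate rewards
    policies:     list of arrays (n_traj,), trajectory distribution of each policy
    reference:    array (n_traj,), trajectory distribution of a reference policy
    zeta:         log-likelihood slack defining the confidence set (assumed form)
    """
    lls = np.array([log_likelihood(r, pairs, prefs) for r in reward_class])
    # Confidence set: every candidate whose likelihood is within zeta of the MLE.
    confident = [r for r, ll in zip(reward_class, lls) if ll >= lls.max() - zeta]
    # Worst-case value of each policy relative to the reference over that set.
    worst_case = [
        min(float((pi - reference) @ r) for r in confident) for pi in policies
    ]
    return int(np.argmax(worst_case)), worst_case
```

With a small `zeta` the confidence set shrinks to the MLE alone and the procedure reduces to greedy planning on the estimated reward; a larger `zeta` makes the choice more conservative toward policies whose trajectories are poorly covered by the comparison data.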


research
06/13/2022

Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward

The remarkable success of reinforcement learning (RL) heavily relies on ...
research
05/29/2023

Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

In this paper, we study offline Reinforcement Learning with Human Feedba...
research
02/09/2022

Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

A major challenge in real-world reinforcement learning (RL) is the spars...
research
06/25/2023

Is RLHF More Difficult than Standard RL?

Reinforcement learning from Human Feedback (RLHF) learns from preference...
research
05/24/2023

Matrix Estimation for Offline Reinforcement Learning with Low-Rank Structure

We consider offline Reinforcement Learning (RL), where the agent does no...
research
08/13/2020

Reinforcement Learning with Trajectory Feedback

The computational model of reinforcement learning is based upon the abil...
research
03/07/2023

Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

In this paper, we focus on a novel optimization problem in which the obj...
