Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

05/29/2023
by   Zihao Li, et al.

In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: a large state space with limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model a human decision-making process that is forward-looking and boundedly rational. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method, which proceeds in three steps: the first step estimates the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second step recovers the human reward function by minimizing the Bellman mean squared error using the learned value functions; the third step plugs in the learned reward and invokes pessimistic value iteration to find a near-optimal policy. Assuming only single-policy coverage of the dataset (i.e., coverage of the optimal policy), we prove that the suboptimality of DCPPO nearly matches that of the classical pessimistic offline RL algorithm in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with a dynamic discrete choice model.
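To make the link between choice probabilities, value functions, and the recovered reward more concrete, the following is a minimal tabular sketch, not the DCPPO algorithm itself. It rests on assumptions that go beyond the abstract: a discounted infinite-horizon MDP, the standard DDC setting with Gumbel-distributed choice noise (so that choice probabilities are a softmax of Q and the state value is a log-sum-exp of Q), and Q-values and transition probabilities that are known exactly rather than estimated from data. Under these assumptions the Bellman relation can be inverted directly, whereas the paper's second step minimizes a Bellman mean squared error on learned value functions. All function names are hypothetical, and the pessimistic value iteration step is not shown.

```python
import numpy as np


def logsumexp(q, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = q.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(q - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def choice_probabilities(q):
    """DDC choice probabilities under Gumbel noise: softmax of Q over actions."""
    return np.exp(q - logsumexp(q)[:, None])


def recover_reward(q, transitions, gamma):
    """Invert the soft Bellman relation: r(s, a) = Q(s, a) - gamma * E[V(s')].

    Assumes V(s) = log sum_a exp Q(s, a) and transitions of shape (S, A, S).
    """
    v = logsumexp(q)                      # shape (S,)
    return q - gamma * transitions @ v    # shape (S, A)


def soft_value_iteration(reward, transitions, gamma, iters=500):
    """Forward check: iterate Q <- r + gamma * P V with V = logsumexp(Q)."""
    q = np.zeros_like(reward)
    for _ in range(iters):
        q = reward + gamma * transitions @ logsumexp(q)
    return q


# Synthetic example (all values are made up for illustration).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
transitions = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
true_q = rng.normal(size=(S, A))

policy = choice_probabilities(true_q)
reward = recover_reward(true_q, transitions, gamma)
q_check = soft_value_iteration(reward, transitions, gamma)
```

In this sketch, running soft value iteration on the recovered reward returns the original Q-values, which is a consistency check on the inversion rather than evidence about the paper's estimator. With real data, `true_q` would be replaced by an MLE estimate fitted to observed choices, and the transition expectation would be approximated from sampled next states.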

research
06/19/2021

Boosting Offline Reinforcement Learning with Residual Generative Modeling

Offline reinforcement learning (RL) tries to learn the near-optimal poli...
research
11/16/2022

Minimum information divergence of Q-functions for dynamic treatment resumes

This paper aims at presenting a new application of information geometry ...
research
05/24/2023

Provable Offline Reinforcement Learning with Human Feedback

In this paper, we investigate the problem of offline reinforcement learn...
research
10/04/2022

Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

We consider the task of estimating a structural model of dynamic decisio...
research
07/11/2019

Reward Advancement: Transforming Policy under Maximum Causal Entropy Principle

Many real-world human behaviors can be characterized as a sequential dec...
research
04/11/2023

A Data-Driven State Aggregation Approach for Dynamic Discrete Choice Models

We study dynamic discrete choice models, where a commonly studied proble...
research
08/23/2022

Strategic Decision-Making in the Presence of Information Asymmetry: Provably Efficient RL with Algorithmic Instruments

We study offline reinforcement learning under a novel model called strat...