Improving Language Models with Advantage-based Offline Policy Gradients

05/24/2023
by Ashutosh Baheti, et al.

Improving language model generations according to some user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering “low-quality” data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signals, whereas data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is: “Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?” To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients for learning language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments with five different language generation tasks, using models of varying sizes and multiple rewards, show that models trained with LoL-RL can consistently outperform the best supervised learning models. We also release our experimental code: https://github.com/abaheti95/LoL-RL
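To make the idea concrete, below is a minimal sketch of an advantage-based offline policy-gradient loss for sequence generation. It is not the authors' exact implementation: the function and variable names, the PyTorch framing, and the choice of baseline (e.g., a value estimate or the supervised model's reward) are illustrative assumptions, shown only to convey how advantage weighting reuses existing sequence-to-sequence data as a 1-step RL game.

```python
# Hedged sketch (assumed API and names, not the official LoL-RL code):
# advantage-weighted log-likelihood over offline (input, output, reward) triples.
import torch
import torch.nn.functional as F

def advantage_weighted_loss(logits, target_ids, reward, baseline, pad_id=0):
    """logits:     (batch, seq_len, vocab) policy LM scores for the offline outputs
    target_ids: (batch, seq_len) output tokens from the existing dataset
    reward:     (batch,) utility of each output (classifier score or human rating)
    baseline:   (batch,) baseline value estimate used to form the advantage
    """
    # Token-level log-probabilities of the offline outputs under the current policy.
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    # Mask padding and sum to a sequence log-probability: in the 1-step RL view,
    # the whole output sequence is a single action.
    mask = (target_ids != pad_id).float()
    seq_logp = (token_logp * mask).sum(dim=-1)

    # Advantage = reward minus baseline; high-utility outputs are up-weighted,
    # low-utility outputs are down-weighted (or pushed away if negative).
    advantage = (reward - baseline).detach()

    # Offline policy gradient: maximize advantage-weighted log-likelihood.
    return -(advantage * seq_logp).mean()
```

In this framing, supervised finetuning is recovered when the advantage is a constant 1; the weighting is what lets the model learn from "left-over" data of mixed quality instead of discarding low-scoring examples.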


Related research

Offline Meta-Reinforcement Learning with Online Self-Supervision (07/08/2021)
Meta-reinforcement learning (RL) can meta-train policies that adapt to n...

RLTF: Reinforcement Learning from Unit Test Feedback (07/10/2023)
The goal of program synthesis, or code generation, is to generate execut...

Human-centric Dialog Training via Offline Reinforcement Learning (10/12/2020)
How can we train a dialog model to produce better conversations by learn...

ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation (08/04/2023)
Applying Reinforcement Learning (RL) to sequence generation models enabl...

On the Effectiveness of Offline RL for Dialogue Response Generation (07/23/2023)
A common training technique for language models is teacher forcing (TF)...

Next-Best View Policy for 3D Reconstruction (08/28/2020)
Manually selecting viewpoints or using commonly available flight planner...

Improving Offline RL by Blending Heuristics (06/01/2023)
We propose Heuristic Blending (HUBL), a simple performance-improving tec...
