Reinforced Self-Training (ReST) for Language Modeling

08/17/2023
by Caglar Gulcehre, et al.

Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks, in a compute- and sample-efficient manner.
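To make the sample-then-improve cycle described in the abstract concrete, the following is a minimal Python sketch. Everything in it is an assumption made for illustration: the callables sample_fn, finetune_fn, and reward_fn, the threshold schedule, and the loop counts are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

def rest_style_training(
    sample_fn: Callable[[str, int], List[str]],             # hypothetical: draw k outputs for a prompt from the current policy
    finetune_fn: Callable[[List[Tuple[str, str]]], None],   # hypothetical: offline fine-tuning on (prompt, output) pairs
    reward_fn: Callable[[str, str], float],                 # scores an output, e.g. a learned quality/preference model
    prompts: List[str],
    num_grow_steps: int = 3,                                # illustrative value
    samples_per_prompt: int = 8,                            # illustrative value
    thresholds: Sequence[float] = (0.5, 0.7, 0.9),          # rising reward cutoffs, chosen for illustration
) -> None:
    """Illustrative grow/improve loop in the spirit of ReST (not the authors' code)."""
    for _ in range(num_grow_steps):
        # Grow: build an offline dataset by sampling from the current policy and score it once.
        scored = []
        for prompt in prompts:
            for output in sample_fn(prompt, samples_per_prompt):
                scored.append((prompt, output, reward_fn(prompt, output)))

        # Improve: reuse the same offline dataset for several fine-tuning passes,
        # each keeping only samples whose reward clears an increasing threshold.
        for tau in thresholds:
            kept = [(p, o) for (p, o, r) in scored if r >= tau]
            if kept:
                finetune_fn(kept)
```

Because the sampled dataset stays fixed during the improvement passes, it can be reused across several fine-tuning rounds, which is the source of the data-reuse efficiency the abstract mentions.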

Related research

08/23/2023 - Aligning Language Models with Offline Reinforcement Learning from Human Feedback
Learning from human preferences is crucial for language models (LMs) to ...

08/04/2023 - ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation
Applying Reinforcement Learning (RL) to sequence generation models enabl...

10/12/2020 - Human-centric Dialog Training via Offline Reinforcement Learning
How can we train a dialog model to produce better conversations by learn...

11/30/2022 - General policy mapping: online continual reinforcement learning inspired on the insect brain
We have developed a model for online continual or lifelong reinforcement...

10/16/2022 - Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data
Offline reinforcement learning (RL) can be used to improve future perfor...

09/01/2023 - Efficient RLHF: Reducing the Memory Usage of PPO
Reinforcement Learning with Human Feedback (RLHF) has revolutionized lan...

02/15/2018 - Prioritized Sweeping Neural DynaQ with Multiple Predecessors, and Hippocampal Replays
During sleep and awake rest, the hippocampus replays sequences of place ...