The Wisdom of Hindsight Makes Language Models Better Instruction Followers

02/10/2023
by Tianjun Zhang et al.

Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. One such algorithm, Reinforcement Learning from Human Feedback (RLHF), demonstrates impressive performance on the GPT series of models. However, the underlying Reinforcement Learning (RL) algorithm is complex and requires an additional training pipeline for reward and value networks. In this paper, we consider an alternative approach: converting feedback into instructions by relabeling the original instruction and training the model for better alignment in a supervised manner. Such an algorithm requires no additional parameters beyond the original language model and maximally reuses the pretraining pipeline. To achieve this, we formulate the instruction-alignment problem for language models as a goal-reaching problem in decision making. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions. The resulting two-stage algorithm sheds light on a family of reward-free approaches that utilize hindsight-relabeled instructions derived from feedback. We evaluate HIR extensively on 12 challenging BigBench reasoning tasks and show that it outperforms the baseline algorithms and is comparable to, or even surpasses, supervised finetuning.
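
To make the relabeling idea concrete, here is a minimal, hypothetical Python sketch of one HIR round, assuming a two-stage loop of online sampling followed by hindsight relabeling and supervised finetuning. The helper names (generate, relabel, finetune) are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of a two-stage Hindsight Instruction Relabeling (HIR) round.
# All helpers here are hypothetical stand-ins: the point is that sampled
# outputs are never discarded; each one is paired with a relabeled
# instruction it does satisfy, so alignment becomes ordinary supervised
# finetuning with no reward or value networks.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    instruction: str  # the prompt/goal given to the model
    output: str       # the model's sampled completion


def hir_round(
    instructions: List[str],
    generate: Callable[[str], str],             # stage 1: online sampling
    relabel: Callable[[str, str], str],         # stage 2: hindsight relabeling from feedback
    finetune: Callable[[List[Example]], None],  # standard supervised update
) -> List[Example]:
    """One HIR iteration: sample, relabel in hindsight, then finetune."""
    # Stage 1: query the current model on each instruction.
    sampled = [Example(ins, generate(ins)) for ins in instructions]

    # Stage 2: rewrite each instruction so that the sampled output counts as
    # a correct response to it, then finetune on the relabeled pairs with the
    # usual supervised objective.
    relabeled = [Example(relabel(ex.instruction, ex.output), ex.output) for ex in sampled]
    finetune(relabeled)
    return relabeled


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real setup would use a
    # pretrained LM for `generate` and its training loop for `finetune`.
    demo = hir_round(
        instructions=["Add 2 and 3.", "Add 4 and 4."],
        generate=lambda ins: "5" if "2 and 3" in ins else "9",
        relabel=lambda ins, out: ins if out in ("5", "8") else ins + " (give an incorrect answer)",
        finetune=lambda data: print(f"finetuning on {len(data)} relabeled pairs"),
    )
    for ex in demo:
        print(ex.instruction, "->", ex.output)
```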

Related research

- 08/23/2023: Aligning Language Models with Offline Reinforcement Learning from Human Feedback
- 05/23/2023: Aligning Large Language Models through Synthetic Feedback
- 04/06/2023: When do you need Chain-of-Thought Prompting for ChatGPT?
- 02/05/2019: Interactively shaping robot behaviour with unlabeled human instructions
- 05/18/2023: LIMA: Less Is More for Alignment
- 04/13/2023: Language Instructed Reinforcement Learning for Human-AI Coordination
- 05/23/2023: Probing in Context: Toward Building Robust Classifiers via Probing Large Language Models
