Fine-tuning Language Models with Generative Adversarial Feedback

05/09/2023
by Zhang Ze Yu, et al.

Reinforcement Learning with Human Feedback (RLHF) has been demonstrated to significantly enhance the performance of large language models (LLMs) by aligning their outputs with desired human values. However, RLHF is constrained by the expertise and productivity limitations of human evaluators. In this study, we investigate an alternative approach to RLHF: Reinforcement Learning with Generative Adversarial Feedback (RLGAF). Our preliminary findings indicate that RLGAF can help align LLM outputs without suffering from the inherent restrictions of RLHF, suggesting promising avenues for further research on automating AI alignment.
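The abstract does not spell out the training procedure, but one plausible reading of "generative adversarial feedback" is a GAN-style loop in which a discriminator, rather than a human-preference reward model, scores the policy's generations. The sketch below illustrates that idea under explicit assumptions: toy `Policy` and `Discriminator` modules, a REINFORCE policy update, and random tensors standing in for reference text. None of these names or choices come from the paper; they are illustrative only.

```python
# Hedged sketch of an RLGAF-style loop (assumption: discriminator score replaces
# the human-feedback reward; the paper's abstract does not specify the algorithm).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MAX_LEN = 100, 64, 16  # toy sizes, not from the paper

class Policy(nn.Module):
    """Toy autoregressive generator standing in for the LLM being fine-tuned."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def sample(self, batch_size):
        tokens, log_probs = [], []
        inp = torch.zeros(batch_size, 1, dtype=torch.long)  # BOS token id 0
        h = None
        for _ in range(MAX_LEN):
            out, h = self.rnn(self.embed(inp), h)
            dist = torch.distributions.Categorical(logits=self.head(out[:, -1]))
            tok = dist.sample()
            tokens.append(tok)
            log_probs.append(dist.log_prob(tok))
            inp = tok.unsqueeze(1)
        return torch.stack(tokens, dim=1), torch.stack(log_probs, dim=1)

class Discriminator(nn.Module):
    """Scores a sequence; a high logit means it resembles reference text."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, seqs):
        _, h = self.rnn(self.embed(seqs))
        return self.head(h[-1]).squeeze(-1)

policy, disc = Policy(), Discriminator()
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

for step in range(200):
    # 1) Discriminator step: reference text vs. policy samples.
    real = torch.randint(0, VOCAB, (8, MAX_LEN))  # placeholder for human-written text
    with torch.no_grad():
        fake, _ = policy.sample(8)
    d_loss = (F.binary_cross_entropy_with_logits(disc(real), torch.ones(8))
              + F.binary_cross_entropy_with_logits(disc(fake), torch.zeros(8)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Policy step: discriminator score acts as the reward (REINFORCE).
    samples, log_probs = policy.sample(8)
    with torch.no_grad():
        reward = torch.sigmoid(disc(samples))  # stands in for a human preference score
    p_loss = -(log_probs.sum(dim=1) * reward).mean()
    opt_p.zero_grad(); p_loss.backward(); opt_p.step()
```

In practice the generator would be a pretrained LLM and the update rule would more likely be a PPO-style objective with a KL penalty, as is standard in RLHF pipelines; the REINFORCE update here is only the simplest way to show the discriminator taking over the role of the human evaluator.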


