Self-critiquing models for assisting human evaluators

06/12/2022
by William Saunders, et al.

We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would otherwise have missed. Our models help find naturally occurring flaws in both model-written and human-written summaries, as well as intentional flaws in summaries that humans wrote to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques and, on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.


