Task Ambiguity in Humans and Language Models

12/20/2022
by Alex Tamkin et al.

Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, unlike benchmarks, real-world tasks are often poorly specified, and agents must deduce the user's intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously specified classification tasks. We evaluate humans and models on AmbiBench by measuring how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that neither alone is sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without large-scale human feedback by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize well in the face of ambiguity.
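To make the evaluation setup concrete, the sketch below shows how an ambiguously specified classification prompt might be assembled along the two axes the paper varies: instruction ambiguity and the number of labeled examples. Everything here is a hypothetical illustration, not AmbiBench's actual data or format: the sentences, candidate rules, instruction templates, and the `build_prompt` helper are all made up for exposition.

```python
# A minimal, hypothetical sketch of an AmbiBench-style prompt. The task data,
# instruction templates, and helper names are illustrative assumptions and are
# not taken from the benchmark itself.

import random

# Each sentence carries a label under two candidate rules ("mentions an animal"
# vs. "is about the outdoors"). When the rules agree, an example is ambiguous;
# when they disagree, it disambiguates the intended task.
EXAMPLES = [
    ("The dog ran across the park.",  {"animal": "X", "outdoors": "X"}),
    ("The cat slept on the sofa.",    {"animal": "X", "outdoors": "Y"}),
    ("The hikers reached the peak.",  {"animal": "Y", "outdoors": "X"}),
    ("She read quietly in her room.", {"animal": "Y", "outdoors": "Y"}),
]

INSTRUCTIONS = {
    "ambiguous":   "Label each sentence X or Y.",
    "unambiguous": "Label a sentence X if it mentions an animal, otherwise Y.",
}

def build_prompt(intended_task: str, instruction_level: str, n_shots: int) -> str:
    """Assemble an in-context prompt with n_shots labeled examples, labeled
    according to the intended task, followed by an unlabeled query."""
    shots = random.sample(EXAMPLES, n_shots)
    lines = [INSTRUCTIONS[instruction_level]]
    for sentence, labels in shots:
        lines.append(f"Sentence: {sentence}\nLabel: {labels[intended_task]}")
    lines.append("Sentence: The horse grazed in the field.\nLabel:")
    return "\n\n".join(lines)

# Example: an ambiguous instruction with two labeled examples. A model (or
# human) that infers the intended "animal" rule should answer X for the query.
print(build_prompt("animal", "ambiguous", n_shots=2))
```

Under this setup, accuracy on held-out queries reflects whether the agent recovered the intended rule rather than a competing rule consistent with the same examples, which is the quantity the paper compares across humans and models.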


