Evaluating Machines by their Real-World Language Use

04/07/2020
by Rowan Zellers, et al.

There is a fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations. Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
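To illustrate the task setup described above, here is a minimal sketch of how a finetuned T5 model could be queried to generate advice for a situation, using the Hugging Face transformers library. This is not the authors' released code: the checkpoint path and the example situation are invented placeholders, and the decoding settings are illustrative assumptions.

# Minimal sketch: generating advice from a situation with a seq2seq T5 model.
# The checkpoint name is hypothetical; the situation text is an invented example.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "path/to/finetuned-t5-redditadvice"  # placeholder, not a real model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

situation = (
    "My roommate keeps borrowing my car without asking and returns it with an "
    "empty tank. How do I bring this up without ruining our friendship?"
)

# Encode the situation and sample one piece of advice.
inputs = tokenizer(situation, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,   # advice is typically a short paragraph
    do_sample=True,       # sampling suits open-ended generation better than greedy decoding
    top_p=0.95,
    temperature=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

In the benchmark itself, such generated advice would then be judged by human readers against the advice real Redditors wrote for the same recently posted situations, which is what makes the evaluation dynamic.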


09/10/2021

Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

Large-scale, pre-trained language models (LMs) have achieved human-level...
03/11/2022

CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment

Pretrained language models (PLMs) have achieved superhuman performance o...
05/02/2019

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

In the last year, new models and methods for pretraining and transfer le...
12/12/2019

Extending Machine Language Models toward Human-Level Language Understanding

Language is central to human intelligence. We review recent breakthrough...
09/27/2021

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding

The few-shot natural language understanding (NLU) task has attracted muc...
08/21/2019

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

Crowdsourcing has been the prevalent paradigm for creating natural langu...
10/29/2014

Towards a Visual Turing Challenge

As language and visual understanding by machines progresses rapidly, we ...