Evaluating Machines by their Real-World Language Use

by Rowan Zellers, et al.

There is a fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations. Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.




Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

Large-scale, pre-trained language models (LMs) have achieved human-level...

CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment

Pretrained language models (PLMs) have achieved superhuman performance o...

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

In the last year, new models and methods for pretraining and transfer le...

Extending Machine Language Models toward Human-Level Language Understanding

Language is central to human intelligence. We review recent breakthrough...

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding

The few-shot natural language understanding (NLU) task has attracted muc...

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

Crowdsourcing has been the prevalent paradigm for creating natural langu...

Towards a Visual Turing Challenge

As language and visual understanding by machines progresses rapidly, we ...