What's the best place for an AI conference, Vancouver or ______: Why completing comparative questions is difficult

04/05/2021
by   Avishai Zagoury, et al.

Although large neural language models (LMs) like BERT can be finetuned to yield state-of-the-art results on many NLP tasks, it is often unclear what these models actually learn. Here we study using such LMs to fill in entities in human-authored comparative questions, like “Which country is older, India or ______?” – i.e., we study the ability of neural LMs to ask (not answer) reasonable questions. We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable, and that these models can be trained to achieve nearly human-level performance in completing comparative questions in three different subdomains. However, analysis shows that what they learn fails to model any sort of broad notion of which entities are semantically comparable or similar – instead the trained models are very domain-specific, and performance is highly correlated with co-occurrences between specific entities observed in the training set. This is true both for models that are pretrained on general text corpora, as well as models trained on a large corpus of comparison questions. Our study thus reinforces recent results on the difficulty of making claims about a deep model's world knowledge or linguistic competence based on performance on specific benchmark problems. We make our evaluation datasets publicly available to foster future research on complex understanding and reasoning in such models at standards of human interaction.
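A central finding above is that fill-in-the-blank accuracy tracks co-occurrences between specific entity pairs in the training questions, rather than any broad notion of semantic comparability. The toy sketch below (hypothetical data and function names, not the paper's method or dataset) illustrates how far a bare co-occurrence ranker can get on this task: it simply prefers candidate fillers that were paired with the given entity most often in training.

```python
from collections import Counter

# Toy corpus of comparative questions (hypothetical examples).
# Each question compares exactly two entities.
questions = [
    ("India", "China"),
    ("India", "Egypt"),
    ("India", "China"),
    ("Vancouver", "Toronto"),
    ("Vancouver", "Seattle"),
    ("Vancouver", "Toronto"),
]

# Count how often each unordered entity pair co-occurs in a question.
cooc = Counter(frozenset(pair) for pair in questions)

def rank_completions(given, candidates):
    """Rank candidate fillers for '<given> or ______?' by how often
    the pair co-occurred in the training questions."""
    return sorted(candidates,
                  key=lambda c: cooc[frozenset((given, c))],
                  reverse=True)

print(rank_completions("India", ["China", "Egypt", "Seattle"]))
# -> ['China', 'Egypt', 'Seattle']: 'China' was seen twice with
# 'India', 'Egypt' once, 'Seattle' never.
```

A ranker like this captures nothing about which entities are semantically comparable, yet it mirrors the domain-specific, memorization-like behavior the analysis attributes to the trained LMs.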


