CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense

04/08/2019
by   Michael Chen, et al.

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question-answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms a challenging extension to the recently-proposed SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video. To produce a more difficult dataset, we introduce a novel procedure for question acquisition in which workers author questions designed to target weaknesses of state-of-the-art neural question answering systems. Workers are rewarded for submissions that models fail to answer correctly both before and after fine-tuning (in cross-validation). We create 2.8k questions via this procedure and evaluate the performance of multiple state-of-the-art question answering systems on our dataset. We observe a significant gap between human performance, which is 95.3%, and the performance of the best baseline, the OpenAI GPT model.
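The acquisition procedure above can be sketched in code: a worker-authored question is accepted only if the model answers it incorrectly both before fine-tuning and after fine-tuning in cross-validation. This is a minimal, hypothetical sketch; the predictor functions and data shapes are stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of the adversarial acceptance criterion:
# keep a question only if the model fails both zero-shot and
# after fine-tuning (simulated here by a list of fold models).
from typing import Callable, List, Tuple

# (prompt, candidate completions, index of the gold completion)
Question = Tuple[str, List[str], int]
Predictor = Callable[[str, List[str]], int]

def accept_adversarial(
    questions: List[Question],
    pretrained_predict: Predictor,
    finetuned_predicts: List[Predictor],
) -> List[Question]:
    """Return the questions that every model variant answers incorrectly."""
    accepted = []
    for prompt, choices, gold in questions:
        wrong_before = pretrained_predict(prompt, choices) != gold
        # Cross-validation: each fine-tuned fold must also fail.
        wrong_after = all(p(prompt, choices) != gold for p in finetuned_predicts)
        if wrong_before and wrong_after:
            accepted.append((prompt, choices, gold))
    return accepted

# Toy usage with a stub model that always picks the second choice.
qs = [("He poured the water", ["into a cup", "onto the sun"], 0)]
always_second: Predictor = lambda prompt, choices: 1
kept = accept_adversarial(qs, always_second, [always_second])
```

In this toy run the stub model picks the wrong completion both times, so the question is kept; a question the model answered correctly at either stage would be filtered out.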

research
09/18/2019

Conversational AI : Open Domain Question Answering and Commonsense Reasoning

Our research is focused on making a human-like question answering system...
research
11/02/2018

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

When answering a question, people often draw upon their rich world knowl...
research
05/02/2020

ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning

Given questions regarding some prototypical situation – such as Name som...
research
03/19/2023

FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual Question Answering

The widely used Fact-based Visual Question Answering (FVQA) dataset cont...
research
01/19/2022

Evaluating Machine Common Sense via Cloze Testing

Language models (LMs) show state of the art performance for common sense...
research
09/19/2023

An Evaluation of GPT-4 on the ETHICS Dataset

This report summarizes a short study of the performance of GPT-4 on the ...
