Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

09/10/2021
by Shane Storks, et al.

Large-scale, pre-trained language models (LMs) have achieved human-level performance on a breadth of language understanding tasks. However, evaluations based only on end-task performance shed little light on machines' true ability in language understanding and reasoning. In this paper, we highlight the importance of evaluating the underlying reasoning process in addition to end performance. Toward this goal, we introduce Tiered Reasoning for Intuitive Physics (TRIP), a novel commonsense reasoning dataset with dense annotations that enable multi-tiered evaluation of machines' reasoning process. Our empirical results show that while large LMs can achieve high end performance, they struggle to back their predictions with valid supporting evidence. The TRIP dataset and our baseline results will motivate verifiable evaluation of commonsense reasoning and facilitate future research toward developing better language understanding and reasoning models.

