Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs

04/22/2023
by Anthony G. Cohn, et al.

Language models have become very popular recently and many claims have been made about their abilities, including for commonsense reasoning. Given the increasingly strong results of current language models on existing static benchmarks for commonsense reasoning, we explore an alternative dialectical evaluation. The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system. Dialoguing with the system gives us the opportunity to check for consistency and provides more assurance about these boundaries than anecdotal evidence alone. In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning, a fundamental aspect of commonsense reasoning. We conclude with some suggestions for future work, both to improve the capabilities of language models and to systematise this kind of dialectical evaluation.
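As a loose illustration of the dialectical, consistency-checking style of evaluation the abstract describes, here is a minimal sketch of a probing loop for a spatial containment question. The `ask(history, prompt)` wrapper, the specific questions, and the keyword-based consistency check are all illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch of a dialectical evaluation probe, assuming a hypothetical
# ask(history, prompt) wrapper around whichever chat LLM is being evaluated.
# The spatial questions and the keyword check are illustrative only.

from typing import Callable, List, Tuple


def dialectical_probe(ask: Callable[[List[Tuple[str, str]], str], str]) -> dict:
    """Pose a spatial containment question, then follow up to test consistency."""
    history: List[Tuple[str, str]] = []

    def turn(question: str) -> str:
        answer = ask(history, question)
        history.append((question, answer))
        return answer

    # Opening commonsense question about transitivity of containment.
    a1 = turn("A coin is inside a purse, and the purse is inside a drawer. "
              "Is the coin inside the drawer? Answer yes or no, then explain.")

    # Dialectical follow-ups: a consequence of the first answer, and its negation.
    a2 = turn("If I carry the drawer into another room, does the coin move with it? "
              "Answer yes or no, then explain.")
    a3 = turn("Could the coin be outside the drawer while still inside the purse? "
              "Answer yes or no, then explain.")

    # Crude keyword check: a consistent model answers yes, yes, no.
    verdicts = [a.strip().lower().startswith("yes") for a in (a1, a2, a3)]
    return {
        "transcript": list(history),
        "consistent": verdicts[0] and verdicts[1] and not verdicts[2],
    }
```

In practice `ask` would wrap an actual chat API and pass the accumulated conversation as context, and the keyword check stands in for the qualitative judgement of answers that the paper applies.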

Related research

- Prompt Engineering and Calibration for Zero-Shot Commonsense Reasoning (04/14/2023): Prompt engineering and calibration make large language models excel at r...
- Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate (05/19/2023): Large Language Models (LLMs) have demonstrated human-like intelligence a...
- Can Language Models perform Abductive Commonsense Reasoning? (07/07/2022): Abductive Reasoning is a task of inferring the most plausible hypothesis...
- Evaluating Language Models for Mathematics through Interactions (06/02/2023): The standard methodology of evaluating large language models (LLMs) base...
- Examining the Causal Effect of First Names on Language Models: The Case of Social Commonsense Reasoning (06/01/2023): As language models continue to be integrated into applications of person...
- Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema (04/16/2021): The Winograd Schema (WS) has been proposed as a test for measuring commo...
- MoT: Pre-thinking and Recalling Enable ChatGPT to Self-Improve with Memory-of-Thoughts (05/09/2023): Large Language Models have shown impressive abilities on various tasks. ...
