Do Fine-tuned Commonsense Language Models Really Generalize?

11/18/2020
by   Mayank Kejriwal, et al.

Recently, transformer-based methods such as RoBERTa and GPT-3 have led to significant experimental advances in natural language processing tasks such as question answering and commonsense reasoning. The latter is typically evaluated through multiple benchmarks framed as multiple-choice instances of the former. According to influential leaderboards hosted by the Allen Institute (evaluating state-of-the-art performance on commonsense reasoning benchmarks), models based on such transformer methods are approaching human-like performance, with average accuracy well over 80%. Since these are commonsense benchmarks, a model that generalizes on commonsense reasoning should not experience much performance loss across multiple commonsense benchmarks. In this paper, we study the generalization issue in detail by designing and conducting a rigorous scientific study. Using five common benchmarks, multiple controls, and statistical analysis, we find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup, and may, in fact, be susceptible to dataset bias. We also perform selective studies, including qualitative and consistency analyses, to gain deeper insight into the problem.
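The evaluation protocol the abstract refers to (benchmarks framed as multiple-choice question answering, with accuracy compared across benchmarks) can be sketched as follows. This is a minimal illustration only: `score_choice` is a hypothetical stand-in for a language model's plausibility score for a candidate answer (e.g., the summed token log-probabilities a fine-tuned RoBERTa-style model assigns to the choice), and the toy benchmark items are invented, not drawn from the paper's datasets.

```python
# Sketch of multiple-choice commonsense evaluation: for each instance,
# score every candidate answer, predict the argmax, and report accuracy.
# Cross-benchmark generalization is then the accuracy drop observed when
# a model fine-tuned on one benchmark is evaluated on the others.

def score_choice(question, choice):
    """Hypothetical stand-in for an LM plausibility score.
    Here: a toy word-overlap heuristic so the sketch runs end to end."""
    question_words = set(question.split())
    return sum(1 for w in choice.split() if w in question_words)

def evaluate(benchmark):
    """Accuracy of argmax-over-choices prediction on one benchmark."""
    correct = 0
    for item in benchmark:
        scores = [score_choice(item["question"], c) for c in item["choices"]]
        prediction = scores.index(max(scores))
        correct += int(prediction == item["label"])
    return correct / len(benchmark)

# Tiny invented instances in the style of multiple-choice commonsense QA.
toy_benchmark = [
    {"question": "the glass fell and it",
     "choices": ["it broke", "sang loud"], "label": 0},
    {"question": "he was hungry so he",
     "choices": ["slept fast", "he ate food"], "label": 1},
]

print(evaluate(toy_benchmark))  # fraction of items answered correctly
```

In the study's setting, the same `evaluate` loop would be run with one fine-tuned model over several held-out benchmarks; a model that truly generalizes would show little variance in the resulting accuracies.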

Related research

04/14/2023 · Prompt Engineering and Calibration for Zero-Shot Commonsense Reasoning
Prompt engineering and calibration make large language models excel at r...

10/03/2022 · Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes
Recent work on transformer-based neural networks has led to impressive a...

03/24/2021 · UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
Commonsense AI has long been seen as a near impossible goal – until rece...

03/23/2022 · A Theoretically Grounded Benchmark for Evaluating Machine Commonsense
Programming machines with commonsense reasoning (CSR) abilities is a lon...

11/28/2022 · GPT-Neo for Commonsense Reasoning: A Theoretical and Practical Lens
Recent work has demonstrated substantial gains in pre-training large-sca...

05/12/2022 · Predicting Human Psychometric Properties Using Computational Language Models
Transformer-based language models (LMs) continue to achieve state-of-the...

08/08/2021 · Leveraging Commonsense Knowledge on Classifying False News and Determining Checkworthiness of Claims
Widespread and rapid dissemination of false news has made fact-checking ...
