Impact of Pretraining Term Frequencies on Few-Shot Reasoning

02/15/2022
by Yasaman Razeghi et al.

Pretrained Language Models (LMs) have demonstrated the ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases by more than 70% (absolute) on instances with the top 10% most frequent terms compared to the bottom 10%. Overall, although LMs exhibit strong few-shot performance on numerical reasoning tasks, our results raise the question of how much these models actually generalize beyond their pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.
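
The kind of measurement described above can be sketched as follows. In a hypothetical setup, suppose each evaluated test instance has already been annotated with the pretraining-data frequency of its operand term and with a flag for whether the model answered it correctly; the helper below (frequency_performance_gap, a name introduced here purely for illustration) then computes a rank correlation between frequency and correctness and the accuracy gap between the most and least frequent instances. This is only a minimal sketch of the analysis, not the authors' released code; the counting of term frequencies over the Pile and the GPT evaluation itself are assumed to happen elsewhere.

# A minimal sketch, assuming `instances` is a hypothetical list of dicts with
# a precomputed pretraining-data frequency for each instance's operand term
# and a 0/1 flag for whether the model answered that instance correctly.
import numpy as np
from scipy.stats import spearmanr

def frequency_performance_gap(instances, quantile=0.10):
    """Compare few-shot accuracy on the most vs. least frequent terms."""
    freqs = np.array([ex["term_frequency"] for ex in instances], dtype=float)
    correct = np.array([ex["correct"] for ex in instances], dtype=float)

    # Rank correlation between term frequency and per-instance correctness.
    rho, p_value = spearmanr(freqs, correct)

    # Accuracy on the bottom and top `quantile` of instances by term frequency.
    lo_cut, hi_cut = np.quantile(freqs, [quantile, 1.0 - quantile])
    acc_bottom = correct[freqs <= lo_cut].mean()
    acc_top = correct[freqs >= hi_cut].mean()

    return {
        "spearman_rho": rho,
        "p_value": p_value,
        "acc_top": acc_top,
        "acc_bottom": acc_bottom,
        "performance_gap": acc_top - acc_bottom,
    }

# Example usage with toy data (frequencies and correctness are made up):
# toy = [{"term_frequency": f, "correct": c}
#        for f, c in [(10, 0), (1_000, 1), (50_000, 1), (5, 0)]]
# print(frequency_performance_gap(toy, quantile=0.25))

The gap reported by such a helper corresponds to the "top 10% vs. bottom 10%" comparison quoted in the abstract, while the rank correlation summarizes the overall frequency-performance relationship across all instances.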


