Large Language Models Struggle to Learn Long-Tail Knowledge

11/15/2022
by Nikhil Kandpal, et al.

The internet contains a wealth of knowledge – from the birthdays of historical figures to tutorials on how to code – all of which may be learned by language models. However, there is a huge variability in the number of times a given piece of information appears on the web. In this paper, we study the relationship between the knowledge memorized by large language models and the information in their pre-training datasets. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, we find that while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant document count, presenting a promising approach for capturing the long-tail.
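The paper's notion of a "relevant document" is one whose linked entities co-occur with the entities of a question-answer pair. As a rough illustration of that counting step, here is a minimal Python sketch; it assumes documents and QA pairs have already been run through an entity linker, and the function name, entity-ID strings, and the exact criterion (at least one question entity and one answer entity per document) are illustrative assumptions rather than the paper's implementation.

from typing import Iterable, Set

def count_relevant_docs(
    doc_entities: Iterable[Set[str]],
    question_entities: Set[str],
    answer_entities: Set[str],
) -> int:
    # Count pre-training documents whose linked entities overlap with both
    # the question's entities and the answer's entities (assumed criterion).
    count = 0
    for entities in doc_entities:
        if entities & question_entities and entities & answer_entities:
            count += 1
    return count

# Toy usage with made-up entity IDs: only the first document mentions
# both the question entity and the answer entity, so the count is 1.
docs = [
    {"Q35332", "Q184843"},
    {"Q35332"},
    {"Q184843", "Q1"},
]
print(count_relevant_docs(docs, {"Q35332"}, {"Q184843"}))  # -> 1

A model's QA accuracy can then be binned by this per-question document count to study how performance varies across the long tail.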


