Emergent and Predictable Memorization in Large Language Models

04/21/2023
by Stella Biderman, et al.

Memorization, the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for safely deploying language models. In particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personally identifiable information (PII). The prevalence of such undesirable memorization can pose issues for model trainers, and may even require discarding an otherwise functional model. We therefore seek to predict which sequences will be memorized before a large model is fully trained by extrapolating the memorization behavior of lower-compute trial runs. We measure memorization in the Pythia model suite and find that intermediate checkpoints are better predictors of a model's memorization behavior than smaller fully-trained models. We additionally provide further novel discoveries on the distribution of memorization scores across models and data.
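To make the prediction task concrete, here is a minimal sketch, not the authors' implementation. It assumes a memorization score defined as the fraction of continuation tokens a model reproduces exactly when prompted with the start of a training sequence, and it evaluates how well a cheaper run predicts the final model's memorized set using precision and recall. The function names, the prompt and continuation lengths, the threshold, and the lookup-table "model" standing in for an LLM are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical interface: generate(prompt_tokens, n) returns n continuation tokens.
Generator = Callable[[Sequence[int], int], List[int]]


def memorization_score(generate: Generator, tokens: Sequence[int],
                       prompt_len: int = 4, cont_len: int = 4) -> float:
    """Fraction of true continuation tokens the model reproduces verbatim."""
    prompt = list(tokens[:prompt_len])
    target = list(tokens[prompt_len:prompt_len + cont_len])
    if not target:
        raise ValueError("sequence too short for the chosen prompt length")
    output = generate(prompt, len(target))
    matches = sum(1 for got, want in zip(output, target) if got == want)
    return matches / len(target)


def memorized_ids(generate: Generator, corpus: Sequence[Sequence[int]],
                  threshold: float = 1.0) -> Set[int]:
    """Indices of sequences whose score reaches the (assumed) threshold."""
    return {i for i, seq in enumerate(corpus)
            if memorization_score(generate, seq) >= threshold}


def precision_recall(predicted: Set[int], actual: Set[int]) -> Tuple[float, float]:
    """Score the cheap run's memorized set against the final model's set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def make_toy_model(memorized: Sequence[Sequence[int]], prompt_len: int = 4) -> Generator:
    """Toy stand-in for an LLM: replays memorized continuations, else emits filler."""
    table = {tuple(seq[:prompt_len]): list(seq[prompt_len:]) for seq in memorized}

    def generate(prompt: Sequence[int], n: int) -> List[int]:
        return table.get(tuple(prompt), [0] * n)[:n]

    return generate


corpus = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [9, 10, 11, 12, 13, 14, 15, 16],
    [2, 3, 4, 5, 6, 7, 8, 9],
]
final_model = make_toy_model(corpus[:2])  # "fully trained" run memorized two sequences
cheap_model = make_toy_model(corpus[:1])  # lower-compute run memorized only one

actual = memorized_ids(final_model, corpus)
predicted = memorized_ids(cheap_model, corpus)
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this framing, precision measures how often a sequence flagged by the cheaper run is also memorized by the final model, while recall measures how many of the final model's memorized sequences the cheaper run catches. Which of the two matters more depends on how the prediction is used.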


research
02/14/2022

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Past work has shown that large language models are susceptible to privac...
research
06/06/2023

Turning large language models into cognitive models

Large language models are powerful systems that excel at many tasks, ran...
research
08/03/2023

The Capability of Large Language Models to Measure Psychiatric Functioning

The current work investigates the capability of Large language models (L...
research
09/05/2023

Language Models for Novelty Detection in System Call Traces

Due to the complexity of modern computer systems, novel and unexpected b...
research
12/14/2021

Deciphering antibody affinity maturation with language models and weakly supervised learning

In response to pathogens, the adaptive immune system generates specific ...
research
04/03/2023

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

How do large language models (LLMs) develop and evolve over the course o...
research
04/10/2023

Learnings from Data Integration for Augmented Language Models

One of the limitations of large language models is that they do not have...
